Pandas and Xarray#
Here we will focus on pandas and xarray. numpy functions are widely used even within pandas and xarray.
Differences Between xarray and pandas#
Feature |
pandas |
xarray |
|---|---|---|
Data Structure |
1D (Series) & 2D (DataFrame) |
N-dimensional arrays (DataArray, Dataset) |
Indexing |
Labels for rows/columns |
Labels for multiple dimensions (multi-dimensional indexing) |
Handling of Dimensions |
Works best with tabular data |
Designed for multi-dimensional data (e.g., time, latitude, longitude) |
Performance |
Optimized for 1D & 2D operations |
Optimized for large multi-dimensional datasets |
Multi-Indexing |
Supported but can get complex |
Native multi-dimensional indexing |
NetCDF & Scientific Data Support |
No built-in support |
Built-in support for NetCDF, GRIB, and HDF5 formats (limited Zarr support) |
Aggregation & Reduction |
Groupby and pivot for summarization |
Can reduce over multiple dimensions easily |
Memory Usage |
Efficient for tabular data |
Can be optimized with chunking via Dask |
Integration with Dask |
Limited support |
Natively supports lazy loading & parallel computation |
Use Case |
Best for structured tabular data (e.g., CSV, SQL) |
Best for multi-dimensional scientific data (e.g., climate models, geospatial, satellite data) |
When to Use Pandas vs. Xarray?#
🔹 Use Pandas when:
You have structured, tabular data (like CSV, Excel, SQL).
Your data fits well in rows and columns.
You need fast operations on 1D or 2D datasets.
🔹 Use Xarray when:
Your data is multi-dimensional (e.g., climate, geospatial, satellite data).
You work with scientific datasets (NetCDF, GRIB, HDF5).
You need efficient handling of large datasets with parallel processing (via Dask).
You require native multi-dimensional indexing and operations.
Pandas#
Find the pandas reference here
A good way to learn pandas is through 10 minutes to pandas
import pandas as pd
from pandas import read_csv
import uwa
read_csv('Nearfield.csv')
| frequency | speed | bw | wl | k | ar | rnf | |
|---|---|---|---|---|---|---|---|
| 0 | 10000 | 1470 | 3 | 0.1470 | 42.742757 | 2.860016 | 43.702814 |
| 1 | 10000 | 1470 | 4 | 0.1470 | 42.742757 | 2.145202 | 24.587202 |
| 2 | 10000 | 1470 | 5 | 0.1470 | 42.742757 | 1.716358 | 15.739405 |
| 3 | 10000 | 1470 | 6 | 0.1470 | 42.742757 | 1.430498 | 10.933195 |
| 4 | 10000 | 1470 | 7 | 0.1470 | 42.742757 | 1.226344 | 8.035203 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1542 | 100000 | 1470 | 15 | 0.0147 | 427.427572 | 0.057357 | 0.175773 |
| 1543 | 100000 | 1470 | 16 | 0.0147 | 427.427572 | 0.053794 | 0.154610 |
| 1544 | 100000 | 1470 | 17 | 0.0147 | 427.427572 | 0.050651 | 0.137070 |
| 1545 | 100000 | 1470 | 18 | 0.0147 | 427.427572 | 0.047858 | 0.122372 |
| 1546 | 100000 | 1470 | 19 | 0.0147 | 427.427572 | 0.045361 | 0.109933 |
1547 rows × 7 columns
Read csv#
nf_df = pd.read_csv('Nearfield.csv')
Creating a DataFrame in Pandas#
DataFrames are pretty much the same as dataframes in R basically the tabular form of Python dictionaries and can be create quickly:
pd.DataFrame({'First':[1,2,3],
"Second":["One", "Two", "Three"]})
| First | Second | |
|---|---|---|
| 0 | 1 | One |
| 1 | 2 | Two |
| 2 | 3 | Three |
We can use the uwa package to generate results in a loop and combine the resuls dict into a dataframe:
nfs = pd.DataFrame([ #convert to dataframe
uwa.AcousticWave(speed =1470, frequency=f * 1000, bw=b).__dict__ #extract all attributes as dict
for f in range(10,101) #loop over a range of frequencies
for b in range(3,20) #loop over a range of beam widths
])
#save the dataframe as csv
nfs.to_csv("Nearfield.csv", index=False)
#let's see the Dataframe
nfs
| frequency | speed | bw | wl | k | ar | rnf | |
|---|---|---|---|---|---|---|---|
| 0 | 10000 | 1470 | 3 | 0.1470 | 42.742757 | 2.860016 | 43.702814 |
| 1 | 10000 | 1470 | 4 | 0.1470 | 42.742757 | 2.145202 | 24.587202 |
| 2 | 10000 | 1470 | 5 | 0.1470 | 42.742757 | 1.716358 | 15.739405 |
| 3 | 10000 | 1470 | 6 | 0.1470 | 42.742757 | 1.430498 | 10.933195 |
| 4 | 10000 | 1470 | 7 | 0.1470 | 42.742757 | 1.226344 | 8.035203 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1542 | 100000 | 1470 | 15 | 0.0147 | 427.427572 | 0.057357 | 0.175773 |
| 1543 | 100000 | 1470 | 16 | 0.0147 | 427.427572 | 0.053794 | 0.154610 |
| 1544 | 100000 | 1470 | 17 | 0.0147 | 427.427572 | 0.050651 | 0.137070 |
| 1545 | 100000 | 1470 | 18 | 0.0147 | 427.427572 | 0.047858 | 0.122372 |
| 1546 | 100000 | 1470 | 19 | 0.0147 | 427.427572 | 0.045361 | 0.109933 |
1547 rows × 7 columns
nfs.describe()
| frequency | speed | bw | wl | k | ar | rnf | |
|---|---|---|---|---|---|---|---|
| count | 1547.000000 | 1547.0 | 1547.000000 | 1547.000000 | 1547.000000 | 1547.000000 | 1547.000000 |
| mean | 55000.000000 | 1470.0 | 11.000000 | 0.038097 | 235.085165 | 0.268131 | 2.062818 |
| std | 26276.345124 | 0.0 | 4.900564 | 0.028463 | 112.312344 | 0.289890 | 3.820341 |
| min | 10000.000000 | 1470.0 | 3.000000 | 0.014700 | 42.742757 | 0.045361 | 0.109933 |
| 25% | 32000.000000 | 1470.0 | 7.000000 | 0.018846 | 136.776823 | 0.099356 | 0.335185 |
| 50% | 55000.000000 | 1470.0 | 11.000000 | 0.026727 | 235.085165 | 0.170922 | 0.746784 |
| 75% | 78000.000000 | 1470.0 | 15.000000 | 0.045937 | 333.393506 | 0.319617 | 2.000565 |
| max | 100000.000000 | 1470.0 | 19.000000 | 0.147000 | 427.427572 | 2.860016 | 43.702814 |
Pandas is useful to read tabular files like csv files as pd.read_csv(filename) and supports a long list of other file formats.
Pandas Dataframes use indexing a lot and MultiIndexes are common:
nfs = nfs.set_index(['frequency', 'bw'])
nfs
| speed | wl | k | ar | rnf | ||
|---|---|---|---|---|---|---|
| frequency | bw | |||||
| 10000 | 3 | 1470 | 0.1470 | 42.742757 | 2.860016 | 43.702814 |
| 4 | 1470 | 0.1470 | 42.742757 | 2.145202 | 24.587202 | |
| 5 | 1470 | 0.1470 | 42.742757 | 1.716358 | 15.739405 | |
| 6 | 1470 | 0.1470 | 42.742757 | 1.430498 | 10.933195 | |
| 7 | 1470 | 0.1470 | 42.742757 | 1.226344 | 8.035203 | |
| ... | ... | ... | ... | ... | ... | ... |
| 100000 | 15 | 1470 | 0.0147 | 427.427572 | 0.057357 | 0.175773 |
| 16 | 1470 | 0.0147 | 427.427572 | 0.053794 | 0.154610 | |
| 17 | 1470 | 0.0147 | 427.427572 | 0.050651 | 0.137070 | |
| 18 | 1470 | 0.0147 | 427.427572 | 0.047858 | 0.122372 | |
| 19 | 1470 | 0.0147 | 427.427572 | 0.045361 | 0.109933 |
1547 rows × 5 columns
Data Wrangling#
There a loads of built-in functions in pandas to manipulate data frames, including summarising, handling missing data, combining data, reshaping data or grouping data, and more. Cheat sheets can be very useful to get an overview
Subsetting data#
´loc´ and ´iloc´#
´loc´ df.loc[ Subset dataframe by condition and select columns
nfs = nfs.reset_index()
nfs.loc[nfs.frequency == 38000, ['wl','k','rnf']]
| wl | k | rnf | |
|---|---|---|---|
| 476 | 0.038684 | 162.422477 | 11.500741 |
| 477 | 0.038684 | 162.422477 | 6.470316 |
| 478 | 0.038684 | 162.422477 | 4.141949 |
| 479 | 0.038684 | 162.422477 | 2.877157 |
| 480 | 0.038684 | 162.422477 | 2.114527 |
| 481 | 0.038684 | 162.422477 | 1.619552 |
| 482 | 0.038684 | 162.422477 | 1.280198 |
| 483 | 0.038684 | 162.422477 | 1.037461 |
| 484 | 0.038684 | 162.422477 | 0.857863 |
| 485 | 0.038684 | 162.422477 | 0.721265 |
| 486 | 0.038684 | 162.422477 | 0.614959 |
| 487 | 0.038684 | 162.422477 | 0.530609 |
| 488 | 0.038684 | 162.422477 | 0.462560 |
| 489 | 0.038684 | 162.422477 | 0.406868 |
| 490 | 0.038684 | 162.422477 | 0.360711 |
| 491 | 0.038684 | 162.422477 | 0.322032 |
| 492 | 0.038684 | 162.422477 | 0.289298 |
Subset dataframe by multiple conditions:
nf_df.loc[(nf_df.frequency == 10000) & (nf_df.bw > 10)]
| frequency | speed | bw | wl | k | ar | rnf | |
|---|---|---|---|---|---|---|---|
| 8 | 10000 | 1470 | 11 | 0.147 | 42.742757 | 0.781114 | 3.259881 |
| 9 | 10000 | 1470 | 12 | 0.147 | 42.742757 | 0.716231 | 2.740806 |
| 10 | 10000 | 1470 | 13 | 0.147 | 42.742757 | 0.661346 | 2.336845 |
| 11 | 10000 | 1470 | 14 | 0.147 | 42.742757 | 0.614318 | 2.016316 |
| 12 | 10000 | 1470 | 15 | 0.147 | 42.742757 | 0.573574 | 1.757730 |
| 13 | 10000 | 1470 | 16 | 0.147 | 42.742757 | 0.537938 | 1.546097 |
| 14 | 10000 | 1470 | 17 | 0.147 | 42.742757 | 0.506507 | 1.370703 |
| 15 | 10000 | 1470 | 18 | 0.147 | 42.742757 | 0.478581 | 1.223721 |
| 16 | 10000 | 1470 | 19 | 0.147 | 42.742757 | 0.453605 | 1.099332 |
Subset by index
nf_df.iloc[0,:]
frequency 10000.000000
speed 1470.000000
bw 3.000000
wl 0.147000
k 42.742757
ar 2.860016
rnf 43.702814
Name: 0, dtype: float64
nf_df.iloc[0:10,:]
| frequency | speed | bw | wl | k | ar | rnf | |
|---|---|---|---|---|---|---|---|
| 0 | 10000 | 1470 | 3 | 0.147 | 42.742757 | 2.860016 | 43.702814 |
| 1 | 10000 | 1470 | 4 | 0.147 | 42.742757 | 2.145202 | 24.587202 |
| 2 | 10000 | 1470 | 5 | 0.147 | 42.742757 | 1.716358 | 15.739405 |
| 3 | 10000 | 1470 | 6 | 0.147 | 42.742757 | 1.430498 | 10.933195 |
| 4 | 10000 | 1470 | 7 | 0.147 | 42.742757 | 1.226344 | 8.035203 |
| 5 | 10000 | 1470 | 8 | 0.147 | 42.742757 | 1.073255 | 6.154296 |
| 6 | 10000 | 1470 | 9 | 0.147 | 42.742757 | 0.954210 | 4.864754 |
| 7 | 10000 | 1470 | 10 | 0.147 | 42.742757 | 0.858997 | 3.942352 |
| 8 | 10000 | 1470 | 11 | 0.147 | 42.742757 | 0.781114 | 3.259881 |
| 9 | 10000 | 1470 | 12 | 0.147 | 42.742757 | 0.716231 | 2.740806 |
Summarise all columns by mean for each frequency:
nf_df.groupby('frequency').mean()
| speed | bw | wl | k | ar | rnf | |
|---|---|---|---|---|---|---|
| frequency | ||||||
| 10000 | 1470.0 | 11.0 | 0.147000 | 42.742757 | 1.034594 | 7.959450 |
| 11000 | 1470.0 | 11.0 | 0.133636 | 47.017033 | 0.940540 | 7.235864 |
| 12000 | 1470.0 | 11.0 | 0.122500 | 51.291309 | 0.862161 | 6.632875 |
| 13000 | 1470.0 | 11.0 | 0.113077 | 55.565584 | 0.795841 | 6.122654 |
| 14000 | 1470.0 | 11.0 | 0.105000 | 59.839860 | 0.738996 | 5.685322 |
| ... | ... | ... | ... | ... | ... | ... |
| 96000 | 1470.0 | 11.0 | 0.015312 | 410.330469 | 0.107770 | 0.829109 |
| 97000 | 1470.0 | 11.0 | 0.015155 | 414.604745 | 0.106659 | 0.820562 |
| 98000 | 1470.0 | 11.0 | 0.015000 | 418.879020 | 0.105571 | 0.812189 |
| 99000 | 1470.0 | 11.0 | 0.014848 | 423.153296 | 0.104504 | 0.803985 |
| 100000 | 1470.0 | 11.0 | 0.014700 | 427.427572 | 0.103459 | 0.795945 |
91 rows × 6 columns
Pandas Plotting#
Pandas has some built-in basic plotting capabilities. Here some examples:
nfs.reset_index().plot.scatter(x="frequency", y="bw", #x and y axes
c="rnf", #color vairable
clim=[0,10], #colorscale limits
s=2, #size of scatter
xlabel="Frequency (Hz)", #x axis title
ylabel="Beam width (°)") #y axis title
<Axes: xlabel='Frequency (Hz)', ylabel='Beam width (°)'>
nfs.boxplot(column="rnf", by="bw", xlabel="Beam width (°)", ylabel="Nearfield (m)")
<Axes: title={'center': 'rnf'}, xlabel='Beam width (°)', ylabel='Nearfield (m)'>
nfs
| frequency | bw | speed | wl | k | ar | rnf | |
|---|---|---|---|---|---|---|---|
| 0 | 10000 | 3 | 1470 | 0.1470 | 42.742757 | 2.860016 | 43.702814 |
| 1 | 10000 | 4 | 1470 | 0.1470 | 42.742757 | 2.145202 | 24.587202 |
| 2 | 10000 | 5 | 1470 | 0.1470 | 42.742757 | 1.716358 | 15.739405 |
| 3 | 10000 | 6 | 1470 | 0.1470 | 42.742757 | 1.430498 | 10.933195 |
| 4 | 10000 | 7 | 1470 | 0.1470 | 42.742757 | 1.226344 | 8.035203 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1542 | 100000 | 15 | 1470 | 0.0147 | 427.427572 | 0.057357 | 0.175773 |
| 1543 | 100000 | 16 | 1470 | 0.0147 | 427.427572 | 0.053794 | 0.154610 |
| 1544 | 100000 | 17 | 1470 | 0.0147 | 427.427572 | 0.050651 | 0.137070 |
| 1545 | 100000 | 18 | 1470 | 0.0147 | 427.427572 | 0.047858 | 0.122372 |
| 1546 | 100000 | 19 | 1470 | 0.0147 | 427.427572 | 0.045361 | 0.109933 |
1547 rows × 7 columns
nfs.query('frequency in [18000, 38000, 70000, 120000] & bw in [4,7,18]' ).rnf.unstack().plot(title="Nearfield (m)",
ylabel="Nearfield (m)",
xlabel="Frequency (Hz)",
xticks=[18000,38000,70000])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[50], line 1
----> 1 nfs.query('frequency in [18000, 38000, 70000, 120000] & bw in [4,7,18]' ).rnf.unstack().plot(title="Nearfield (m)",
2 ylabel="Nearfield (m)",
3 xlabel="Frequency (Hz)",
4 xticks=[18000,38000,70000])
File ~\AppData\Local\miniforge3\envs\wgfast25\Lib\site-packages\pandas\core\series.py:4615, in Series.unstack(self, level, fill_value, sort)
4570 """
4571 Unstack, also known as pivot, Series with MultiIndex to produce DataFrame.
4572
(...)
4611 b 2 4
4612 """
4613 from pandas.core.reshape.reshape import unstack
-> 4615 return unstack(self, level, fill_value, sort)
File ~\AppData\Local\miniforge3\envs\wgfast25\Lib\site-packages\pandas\core\reshape\reshape.py:511, in unstack(obj, level, fill_value, sort)
506 return obj.T.stack(future_stack=True)
507 elif not isinstance(obj.index, MultiIndex):
508 # GH 36113
509 # Give nicer error messages when unstack a Series whose
510 # Index is not a MultiIndex.
--> 511 raise ValueError(
512 f"index must be a MultiIndex to unstack, {type(obj.index)} was passed"
513 )
514 else:
515 if is_1d_only_ea_dtype(obj.dtype):
ValueError: index must be a MultiIndex to unstack, <class 'pandas.core.indexes.base.Index'> was passed
nfs.query('frequency in [18000, 38000, 70000, 120000]' ).rnf.plot(kind='hist',
title="Nearfield(m) for frequencies 18, 38, 70 kHz",
edgecolor='black',
color="orange")
<Axes: title={'center': 'Nearfield(m) for frequencies 18, 38, 70 kHz'}, ylabel='Frequency'>
n
0 43.702814
1 24.587202
2 15.739405
3 10.933195
4 8.035203
...
1542 0.175773
1543 0.154610
1544 0.137070
1545 0.122372
1546 0.109933
Name: rnf, Length: 1547, dtype: float64
Heatmaps#
Seaborn is the easiest way to create heatmaps.
import seaborn as sns
#we use the rnf column, then we reset the index, to use frequency and bw as our new rows (index) and columns.
nfs = nfs.set_index(['frequency', 'bw'])
nfs_piv = nfs.rnf.reset_index().pivot(index="frequency", columns="bw")
sns.heatmap(nfs_piv, #data to use
annot=False,#add values to cells
vmin=0, #minimum color value
vmax=5, #maximum color value
cmap="magma") #colorscale
<Axes: xlabel='None-bw', ylabel='frequency'>
Comparison of Pandas and R dplyr#
Task |
R dplyr |
Python pandas |
|---|---|---|
Loading Library |
|
|
Reading Data |
|
|
Selecting Columns |
|
|
Filtering Rows |
|
|
Mutating (Adding Columns) |
|
|
Renaming Columns |
|
|
Grouping Data |
|
|
Sorting Data |
|
|
Dropping Columns |
|
|
Counting Rows |
|
|
Pivoting (Wider Format) |
|
|
Pivoting (Longer Format) |
|
|
Merging (Join) |
|
|
Distinct Rows |
|
|
Replacing NA Values |
|
|
Feature |
Pandas |
dplyr |
|---|---|---|
Multi-indexing (Hierarchical Indexing) |
✅ |
❌ |
Time Series Handling |
✅ |
❌ |
Merging on Multiple Columns |
✅ |
✅ (but less flexible) |
Missing Value Handling |
✅ |
✅ (but more manual) |
Rolling/Expanding Windows |
✅ |
❌ |
Performance Optimizations (Cython, NumPy) |
✅ |
❌ |
Pipe Operator ( |
❌ |
✅ |
Grouped Mutate |
❌ (requires lambda) |
✅ |
Tidy Data Transformations |
❌ |
✅ |
Verb-Based Function Naming |
❌ |
✅ |
Multiple Summaries Easily |
❌ (requires dict/lambda) |
✅ |
Factor (Categorical) Handling |
❌ (manual) |
✅ (native) |
Xarray#
Xarray provides an excellent introduction tutorial
Xarray mainly works with Datasets and DataArrays. Datasets are containers of DataArrays.
Let’s illustrate this by an MVBS dataset, created within echopype:
import xarray as xr
mvbs = xr.open_dataset("./python_plotting_files/mvbs.nc")
mvbs
DataArray#
We can pull out the dataarray Sv using . notation
sv = mvbs.Sv
sv
Dimensions#
.dims - Dimensions, are the axes of the data, here the channel, ping_time and depth
sv.dims
Coordinates#
.coords contains the coordinate variables
sv.coords
sv.ping_time
Arbitrary attributes#
.attrs accesses arbitrary Python objects
sv.attrs
Underlying data#
.data provides a numpy array of the data.
sv.data
Plotting#
Plotting can be done easily:
we use the
.notation to select the DatArraymvbs.Svwe select one channel of the data through indexing
.isel(channel=1)we want to plot
.plotwe want
ping_timeto represent the x axis anddepththe y axis and we want depth to have 0 on top:x='ping_time', y='depth', yincrease = Falsewe want some additional decorations, so we set the colormap
cmap='RdYlBu_r'and we set the colorbar limits to be withinvmin=-75andvmax=-40
mvbs.Sv.isel(channel=1).plot(x="ping_time", y="depth",
yincrease=False,
cmap="RdYlBu_r",
vmin=-75, vmax=-40)
import plotly.graph_objects as go
dat = mvbs.swap_dims({"ping_time": "distance"}).isel(channel=1).Sv
tmp=dat.where(dat > -75)
fig = go.Figure(
data=go.Heatmap(z=tmp.transpose(),
zmin=-75,zmax=-30,
colorscale = "RdYlBu_r",
colorbar=dict(title='')))
fig.update_yaxes(autorange="reversed")
#px.imshow(tmp.transpose(),colorbar=dict(title=""))
fig