Pandas and Xarray

Pandas and Xarray#

Here we will focus on pandas and xarray. numpy functions are widely used even within pandas and xarray.

Differences Between xarray and pandas#

Feature	pandas	xarray
Data Structure	1D (Series) & 2D (DataFrame)	N-dimensional arrays (DataArray, Dataset)
Indexing	Labels for rows/columns	Labels for multiple dimensions (multi-dimensional indexing)
Handling of Dimensions	Works best with tabular data	Designed for multi-dimensional data (e.g., time, latitude, longitude)
Performance	Optimized for 1D & 2D operations	Optimized for large multi-dimensional datasets
Multi-Indexing	Supported but can get complex	Native multi-dimensional indexing
NetCDF & Scientific Data Support	No built-in support	Built-in support for NetCDF, GRIB, and HDF5 formats (limited Zarr support)
Aggregation & Reduction	Groupby and pivot for summarization	Can reduce over multiple dimensions easily
Memory Usage	Efficient for tabular data	Can be optimized with chunking via Dask
Integration with Dask	Limited support	Natively supports lazy loading & parallel computation
Use Case	Best for structured tabular data (e.g., CSV, SQL)	Best for multi-dimensional scientific data (e.g., climate models, geospatial, satellite data)

When to Use Pandas vs. Xarray?#

🔹 Use Pandas when:

You have structured, tabular data (like CSV, Excel, SQL).
Your data fits well in rows and columns.
You need fast operations on 1D or 2D datasets.

🔹 Use Xarray when:

Your data is multi-dimensional (e.g., climate, geospatial, satellite data).
You work with scientific datasets (NetCDF, GRIB, HDF5).
You need efficient handling of large datasets with parallel processing (via Dask).
You require native multi-dimensional indexing and operations.

Pandas#

Find the pandas reference here

A good way to learn pandas is through 10 minutes to pandas

import pandas as pd
from pandas import read_csv
import uwa

read_csv('Nearfield.csv')

	frequency	speed	bw	wl	k	ar	rnf
0	10000	1470	3	0.1470	42.742757	2.860016	43.702814
1	10000	1470	4	0.1470	42.742757	2.145202	24.587202
2	10000	1470	5	0.1470	42.742757	1.716358	15.739405
3	10000	1470	6	0.1470	42.742757	1.430498	10.933195
4	10000	1470	7	0.1470	42.742757	1.226344	8.035203
...	...	...	...	...	...	...	...
1542	100000	1470	15	0.0147	427.427572	0.057357	0.175773
1543	100000	1470	16	0.0147	427.427572	0.053794	0.154610
1544	100000	1470	17	0.0147	427.427572	0.050651	0.137070
1545	100000	1470	18	0.0147	427.427572	0.047858	0.122372
1546	100000	1470	19	0.0147	427.427572	0.045361	0.109933

1547 rows × 7 columns

Read csv#

nf_df = pd.read_csv('Nearfield.csv')

Creating a DataFrame in Pandas#

DataFrames are pretty much the same as dataframes in R basically the tabular form of Python dictionaries and can be create quickly:

pd.DataFrame({'First':[1,2,3],
              "Second":["One", "Two", "Three"]})

	First	Second
0	1	One
1	2	Two
2	3	Three

We can use the uwa package to generate results in a loop and combine the resuls dict into a dataframe:

nfs = pd.DataFrame([ #convert to dataframe
    uwa.AcousticWave(speed =1470, frequency=f * 1000, bw=b).__dict__  #extract all attributes as dict
    for f in range(10,101) #loop over a range of frequencies
    for b in range(3,20) #loop over a range of beam widths
])

#save the dataframe as csv

nfs.to_csv("Nearfield.csv", index=False)
#let's see the Dataframe
nfs

	frequency	speed	bw	wl	k	ar	rnf
0	10000	1470	3	0.1470	42.742757	2.860016	43.702814
1	10000	1470	4	0.1470	42.742757	2.145202	24.587202
2	10000	1470	5	0.1470	42.742757	1.716358	15.739405
3	10000	1470	6	0.1470	42.742757	1.430498	10.933195
4	10000	1470	7	0.1470	42.742757	1.226344	8.035203
...	...	...	...	...	...	...	...
1542	100000	1470	15	0.0147	427.427572	0.057357	0.175773
1543	100000	1470	16	0.0147	427.427572	0.053794	0.154610
1544	100000	1470	17	0.0147	427.427572	0.050651	0.137070
1545	100000	1470	18	0.0147	427.427572	0.047858	0.122372
1546	100000	1470	19	0.0147	427.427572	0.045361	0.109933

1547 rows × 7 columns

nfs.describe()

	frequency	speed	bw	wl	k	ar	rnf
count	1547.000000	1547.0	1547.000000	1547.000000	1547.000000	1547.000000	1547.000000
mean	55000.000000	1470.0	11.000000	0.038097	235.085165	0.268131	2.062818
std	26276.345124	0.0	4.900564	0.028463	112.312344	0.289890	3.820341
min	10000.000000	1470.0	3.000000	0.014700	42.742757	0.045361	0.109933
25%	32000.000000	1470.0	7.000000	0.018846	136.776823	0.099356	0.335185
50%	55000.000000	1470.0	11.000000	0.026727	235.085165	0.170922	0.746784
75%	78000.000000	1470.0	15.000000	0.045937	333.393506	0.319617	2.000565
max	100000.000000	1470.0	19.000000	0.147000	427.427572	2.860016	43.702814

Pandas is useful to read tabular files like csv files as pd.read_csv(filename) and supports a long list of other file formats.
Pandas Dataframes use indexing a lot and MultiIndexes are common:

nfs = nfs.set_index(['frequency', 'bw'])
nfs

		speed	wl	k	ar	rnf
frequency	bw
10000	3	1470	0.1470	42.742757	2.860016	43.702814
	4	1470	0.1470	42.742757	2.145202	24.587202
	5	1470	0.1470	42.742757	1.716358	15.739405
	6	1470	0.1470	42.742757	1.430498	10.933195
	7	1470	0.1470	42.742757	1.226344	8.035203
...	...	...	...	...	...	...
100000	15	1470	0.0147	427.427572	0.057357	0.175773
	16	1470	0.0147	427.427572	0.053794	0.154610
	17	1470	0.0147	427.427572	0.050651	0.137070
	18	1470	0.0147	427.427572	0.047858	0.122372
	19	1470	0.0147	427.427572	0.045361	0.109933

1547 rows × 5 columns

Data Wrangling#

There a loads of built-in functions in pandas to manipulate data frames, including summarising, handling missing data, combining data, reshaping data or grouping data, and more. Cheat sheets can be very useful to get an overview

Subsetting data#

´loc´ and ´iloc´#

´loc´ df.loc[ Subset dataframe by condition and select columns

nfs = nfs.reset_index()
nfs.loc[nfs.frequency == 38000, ['wl','k','rnf']]

	wl	k	rnf
476	0.038684	162.422477	11.500741
477	0.038684	162.422477	6.470316
478	0.038684	162.422477	4.141949
479	0.038684	162.422477	2.877157
480	0.038684	162.422477	2.114527
481	0.038684	162.422477	1.619552
482	0.038684	162.422477	1.280198
483	0.038684	162.422477	1.037461
484	0.038684	162.422477	0.857863
485	0.038684	162.422477	0.721265
486	0.038684	162.422477	0.614959
487	0.038684	162.422477	0.530609
488	0.038684	162.422477	0.462560
489	0.038684	162.422477	0.406868
490	0.038684	162.422477	0.360711
491	0.038684	162.422477	0.322032
492	0.038684	162.422477	0.289298

Subset dataframe by multiple conditions:

nf_df.loc[(nf_df.frequency == 10000) & (nf_df.bw > 10)]

	frequency	speed	bw	wl	k	ar	rnf
8	10000	1470	11	0.147	42.742757	0.781114	3.259881
9	10000	1470	12	0.147	42.742757	0.716231	2.740806
10	10000	1470	13	0.147	42.742757	0.661346	2.336845
11	10000	1470	14	0.147	42.742757	0.614318	2.016316
12	10000	1470	15	0.147	42.742757	0.573574	1.757730
13	10000	1470	16	0.147	42.742757	0.537938	1.546097
14	10000	1470	17	0.147	42.742757	0.506507	1.370703
15	10000	1470	18	0.147	42.742757	0.478581	1.223721
16	10000	1470	19	0.147	42.742757	0.453605	1.099332

Subset by index

nf_df.iloc[0,:]

frequency    10000.000000
speed         1470.000000
bw               3.000000
wl               0.147000
k               42.742757
ar               2.860016
rnf             43.702814
Name: 0, dtype: float64

nf_df.iloc[0:10,:]

	frequency	speed	bw	wl	k	ar	rnf
0	10000	1470	3	0.147	42.742757	2.860016	43.702814
1	10000	1470	4	0.147	42.742757	2.145202	24.587202
2	10000	1470	5	0.147	42.742757	1.716358	15.739405
3	10000	1470	6	0.147	42.742757	1.430498	10.933195
4	10000	1470	7	0.147	42.742757	1.226344	8.035203
5	10000	1470	8	0.147	42.742757	1.073255	6.154296
6	10000	1470	9	0.147	42.742757	0.954210	4.864754
7	10000	1470	10	0.147	42.742757	0.858997	3.942352
8	10000	1470	11	0.147	42.742757	0.781114	3.259881
9	10000	1470	12	0.147	42.742757	0.716231	2.740806

Summarise all columns by mean for each frequency:

nf_df.groupby('frequency').mean()

	speed	bw	wl	k	ar	rnf
frequency
10000	1470.0	11.0	0.147000	42.742757	1.034594	7.959450
11000	1470.0	11.0	0.133636	47.017033	0.940540	7.235864
12000	1470.0	11.0	0.122500	51.291309	0.862161	6.632875
13000	1470.0	11.0	0.113077	55.565584	0.795841	6.122654
14000	1470.0	11.0	0.105000	59.839860	0.738996	5.685322
...	...	...	...	...	...	...
96000	1470.0	11.0	0.015312	410.330469	0.107770	0.829109
97000	1470.0	11.0	0.015155	414.604745	0.106659	0.820562
98000	1470.0	11.0	0.015000	418.879020	0.105571	0.812189
99000	1470.0	11.0	0.014848	423.153296	0.104504	0.803985
100000	1470.0	11.0	0.014700	427.427572	0.103459	0.795945

91 rows × 6 columns

Pandas Plotting#

Pandas has some built-in basic plotting capabilities. Here some examples:

nfs.reset_index().plot.scatter(x="frequency", y="bw", #x and y axes
                               c="rnf", #color vairable
                               clim=[0,10], #colorscale limits
                               s=2, #size of scatter
                               xlabel="Frequency (Hz)", #x axis title
                               ylabel="Beam width (°)") #y axis title

<Axes: xlabel='Frequency (Hz)', ylabel='Beam width (°)'>

_images/412f42fef0617f60208bc88e23c8c72947723605361edb48fb182a63553e1138.png

nfs.boxplot(column="rnf", by="bw", xlabel="Beam width (°)", ylabel="Nearfield (m)")

<Axes: title={'center': 'rnf'}, xlabel='Beam width (°)', ylabel='Nearfield (m)'>

_images/056655808d688a2158bbcfa37dafcfc8ceb836da29a50d79f1dd0d96f413813b.png

nfs

	frequency	bw	speed	wl	k	ar	rnf
0	10000	3	1470	0.1470	42.742757	2.860016	43.702814
1	10000	4	1470	0.1470	42.742757	2.145202	24.587202
2	10000	5	1470	0.1470	42.742757	1.716358	15.739405
3	10000	6	1470	0.1470	42.742757	1.430498	10.933195
4	10000	7	1470	0.1470	42.742757	1.226344	8.035203
...	...	...	...	...	...	...	...
1542	100000	15	1470	0.0147	427.427572	0.057357	0.175773
1543	100000	16	1470	0.0147	427.427572	0.053794	0.154610
1544	100000	17	1470	0.0147	427.427572	0.050651	0.137070
1545	100000	18	1470	0.0147	427.427572	0.047858	0.122372
1546	100000	19	1470	0.0147	427.427572	0.045361	0.109933

1547 rows × 7 columns

nfs.query('frequency in [18000, 38000, 70000, 120000] & bw in [4,7,18]' ).rnf.unstack().plot(title="Nearfield (m)", 
                                                                                             ylabel="Nearfield (m)",
                                                                                             xlabel="Frequency (Hz)",
                                                                                             xticks=[18000,38000,70000])

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[50], line 1
----> 1 nfs.query('frequency in [18000, 38000, 70000, 120000] & bw in [4,7,18]' ).rnf.unstack().plot(title="Nearfield (m)", 
      2                                                                                              ylabel="Nearfield (m)",
      3                                                                                              xlabel="Frequency (Hz)",
      4                                                                                              xticks=[18000,38000,70000])

File ~\AppData\Local\miniforge3\envs\wgfast25\Lib\site-packages\pandas\core\series.py:4615, in Series.unstack(self, level, fill_value, sort)
   4570 """
   4571 Unstack, also known as pivot, Series with MultiIndex to produce DataFrame.
   4572 
   (...)
   4611 b    2    4
   4612 """
   4613 from pandas.core.reshape.reshape import unstack
-> 4615 return unstack(self, level, fill_value, sort)

File ~\AppData\Local\miniforge3\envs\wgfast25\Lib\site-packages\pandas\core\reshape\reshape.py:511, in unstack(obj, level, fill_value, sort)
    506         return obj.T.stack(future_stack=True)
    507 elif not isinstance(obj.index, MultiIndex):
    508     # GH 36113
    509     # Give nicer error messages when unstack a Series whose
    510     # Index is not a MultiIndex.
--> 511     raise ValueError(
    512         f"index must be a MultiIndex to unstack, {type(obj.index)} was passed"
    513     )
    514 else:
    515     if is_1d_only_ea_dtype(obj.dtype):

ValueError: index must be a MultiIndex to unstack, <class 'pandas.core.indexes.base.Index'> was passed

nfs.query('frequency in [18000, 38000, 70000, 120000]' ).rnf.plot(kind='hist', 
                                                                  title="Nearfield(m) for frequencies 18, 38, 70 kHz", 
                                                                  edgecolor='black',
                                                                  color="orange")

<Axes: title={'center': 'Nearfield(m) for frequencies 18, 38, 70 kHz'}, ylabel='Frequency'>

_images/f3a01c070ae81b198f99067c3b60b3b3eb30515cc2e68f45bfa959873946b5a3.png

     43.702814
     24.587202
     15.739405
     10.933195
      8.035203
          ...    
   0.175773
   0.154610
   0.137070
   0.122372
   0.109933
Name: rnf, Length: 1547, dtype: float64

Heatmaps#

Seaborn is the easiest way to create heatmaps.

import seaborn as sns  

#we use the rnf column, then we reset the index, to use frequency and bw as our new rows (index) and columns.
nfs = nfs.set_index(['frequency', 'bw'])

nfs_piv = nfs.rnf.reset_index().pivot(index="frequency", columns="bw")

sns.heatmap(nfs_piv, #data to use
            annot=False,#add values to cells
            vmin=0, #minimum color value
            vmax=5, #maximum color value
            cmap="magma") #colorscale

<Axes: xlabel='None-bw', ylabel='frequency'>

_images/737b2327b28a9bbb62874588e03911e8d684028cd7425917cb8f5d0f8d25810f.png

Comparison of Pandas and R dplyr#

Task	R dplyr	Python pandas
Loading Library	`library(dplyr)`	`import pandas as pd`
Reading Data	`df <- read.csv("file.csv")`	`df = pd.read_csv("file.csv")`
Selecting Columns	`df %>% select(col1, col2)`	`df[['col1', 'col2']]`
Filtering Rows	`df %>% filter(col1 > 10)`	`df[df['col1'] > 10]`
Mutating (Adding Columns)	`df %>% mutate(new_col = col1 * 2)`	`df['new_col'] = df['col1'] * 2`
Renaming Columns	`df %>% rename(new_col1 = old_col1)`	`df = df.rename(columns={'old_col1': 'new_col1'})`
Grouping Data	`df %>% group_by(col1) %>% summarize(mean_val = mean(col2))`	`df.groupby('col1')['col2'].mean().reset_index()`
Sorting Data	`df %>% arrange(col1)`	`df.sort_values('col1')`
Dropping Columns	`df %>% select(-col1)`	`df.drop(columns=['col1'])`
Counting Rows	`df %>% count(col1)`	`df['col1'].value_counts()`
Pivoting (Wider Format)	`df %>% pivot_wider(key, value)`	`df.pivot(index, columns, values)`
Pivoting (Longer Format)	`df %>% pivot_longer(key, value, col1:col3)`	`df.melt(id_vars, var_name, value_name)`
Merging (Join)	`df1 %>% left_join(df2, by = "key")`	`df1.merge(df2, on='key', how='left')`
Distinct Rows	`df %>% distinct()`	`df.drop_duplicates()`
Replacing NA Values	`df %>% replace_na(list(col1 = 0))`	`df.fillna({'col1': 0})`

Feature	Pandas	dplyr
Multi-indexing (Hierarchical Indexing)	✅	❌
Time Series Handling	✅	❌
Merging on Multiple Columns	✅	✅ (but less flexible)
Missing Value Handling	✅	✅ (but more manual)
Rolling/Expanding Windows	✅	❌
Performance Optimizations (Cython, NumPy)	✅	❌
Pipe Operator (`%>%`) for readibility	❌	✅
Grouped Mutate	❌ (requires lambda)	✅
Tidy Data Transformations	❌	✅
Verb-Based Function Naming	❌	✅
Multiple Summaries Easily	❌ (requires dict/lambda)	✅
Factor (Categorical) Handling	❌ (manual)	✅ (native)

Xarray#

Xarray provides an excellent introduction tutorial

Xarray mainly works with Datasets and DataArrays. Datasets are containers of DataArrays.

Let’s illustrate this by an MVBS dataset, created within echopype:

import xarray as xr 

mvbs = xr.open_dataset("./python_plotting_files/mvbs.nc")
mvbs

DataArray#

We can pull out the dataarray Sv using . notation

sv = mvbs.Sv
sv

Dimensions#

.dims - Dimensions, are the axes of the data, here the channel, ping_time and depth

sv.dims

Coordinates#

.coords contains the coordinate variables

sv.coords

sv.ping_time

Arbitrary attributes#

.attrs accesses arbitrary Python objects

sv.attrs

Underlying data#

.data provides a numpy array of the data.

sv.data

Plotting#

Plotting can be done easily:

we use the .notation to select the DatArray mvbs.Sv
we select one channel of the data through indexing .isel(channel=1)
we want to plot .plot
we want ping_time to represent the x axis and depth the y axis and we want depth to have 0 on top: x='ping_time', y='depth', yincrease = False
we want some additional decorations, so we set the colormap cmap='RdYlBu_r' and we set the colorbar limits to be within vmin=-75 and vmax=-40

mvbs.Sv.isel(channel=1).plot(x="ping_time", y="depth", 
                             yincrease=False, 
                             cmap="RdYlBu_r", 
                             vmin=-75, vmax=-40)

import plotly.graph_objects as go

dat = mvbs.swap_dims({"ping_time": "distance"}).isel(channel=1).Sv
tmp=dat.where(dat > -75)
fig = go.Figure(
    data=go.Heatmap(z=tmp.transpose(),
                    zmin=-75,zmax=-30,
                    colorscale = "RdYlBu_r",
                    colorbar=dict(title='')))
fig.update_yaxes(autorange="reversed") 
    
#px.imshow(tmp.transpose(),colorbar=dict(title=""))
fig

Pandas and Xarray

Contents

Pandas and Xarray#

Differences Between xarray and pandas#

When to Use Pandas vs. Xarray?#

Pandas#

Read csv#

Creating a DataFrame in Pandas#

Data Wrangling#

Subsetting data#

´loc´ and ´iloc´#

Pandas Plotting#

Heatmaps#

Comparison of Pandas and R dplyr#

Xarray#

DataArray#

Dimensions#

Coordinates#

Arbitrary attributes#

Underlying data#

Plotting#