Pandas and Xarray#

Here we will focus on pandas and xarray. numpy functions are widely used even within pandas and xarray.

Differences Between xarray and pandas#

Feature

pandas

xarray

Data Structure

1D (Series) & 2D (DataFrame)

N-dimensional arrays (DataArray, Dataset)

Indexing

Labels for rows/columns

Labels for multiple dimensions (multi-dimensional indexing)

Handling of Dimensions

Works best with tabular data

Designed for multi-dimensional data (e.g., time, latitude, longitude)

Performance

Optimized for 1D & 2D operations

Optimized for large multi-dimensional datasets

Multi-Indexing

Supported but can get complex

Native multi-dimensional indexing

NetCDF & Scientific Data Support

No built-in support

Built-in support for NetCDF, GRIB, and HDF5 formats (limited Zarr support)

Aggregation & Reduction

Groupby and pivot for summarization

Can reduce over multiple dimensions easily

Memory Usage

Efficient for tabular data

Can be optimized with chunking via Dask

Integration with Dask

Limited support

Natively supports lazy loading & parallel computation

Use Case

Best for structured tabular data (e.g., CSV, SQL)

Best for multi-dimensional scientific data (e.g., climate models, geospatial, satellite data)

When to Use Pandas vs. Xarray?#

🔹 Use Pandas when:

  • You have structured, tabular data (like CSV, Excel, SQL).

  • Your data fits well in rows and columns.

  • You need fast operations on 1D or 2D datasets.

🔹 Use Xarray when:

  • Your data is multi-dimensional (e.g., climate, geospatial, satellite data).

  • You work with scientific datasets (NetCDF, GRIB, HDF5).

  • You need efficient handling of large datasets with parallel processing (via Dask).

  • You require native multi-dimensional indexing and operations.

Pandas#

Find the pandas reference here

A good way to learn pandas is through 10 minutes to pandas

import pandas as pd
from pandas import read_csv
import uwa
read_csv('Nearfield.csv')
frequency speed bw wl k ar rnf
0 10000 1470 3 0.1470 42.742757 2.860016 43.702814
1 10000 1470 4 0.1470 42.742757 2.145202 24.587202
2 10000 1470 5 0.1470 42.742757 1.716358 15.739405
3 10000 1470 6 0.1470 42.742757 1.430498 10.933195
4 10000 1470 7 0.1470 42.742757 1.226344 8.035203
... ... ... ... ... ... ... ...
1542 100000 1470 15 0.0147 427.427572 0.057357 0.175773
1543 100000 1470 16 0.0147 427.427572 0.053794 0.154610
1544 100000 1470 17 0.0147 427.427572 0.050651 0.137070
1545 100000 1470 18 0.0147 427.427572 0.047858 0.122372
1546 100000 1470 19 0.0147 427.427572 0.045361 0.109933

1547 rows × 7 columns

Read csv#

nf_df = pd.read_csv('Nearfield.csv')

Creating a DataFrame in Pandas#

DataFrames are pretty much the same as dataframes in R basically the tabular form of Python dictionaries and can be create quickly:

pd.DataFrame({'First':[1,2,3],
              "Second":["One", "Two", "Three"]})
First Second
0 1 One
1 2 Two
2 3 Three

We can use the uwa package to generate results in a loop and combine the resuls dict into a dataframe:

nfs = pd.DataFrame([ #convert to dataframe
    uwa.AcousticWave(speed =1470, frequency=f * 1000, bw=b).__dict__  #extract all attributes as dict
    for f in range(10,101) #loop over a range of frequencies
    for b in range(3,20) #loop over a range of beam widths
])

#save the dataframe as csv

nfs.to_csv("Nearfield.csv", index=False)
#let's see the Dataframe
nfs
frequency speed bw wl k ar rnf
0 10000 1470 3 0.1470 42.742757 2.860016 43.702814
1 10000 1470 4 0.1470 42.742757 2.145202 24.587202
2 10000 1470 5 0.1470 42.742757 1.716358 15.739405
3 10000 1470 6 0.1470 42.742757 1.430498 10.933195
4 10000 1470 7 0.1470 42.742757 1.226344 8.035203
... ... ... ... ... ... ... ...
1542 100000 1470 15 0.0147 427.427572 0.057357 0.175773
1543 100000 1470 16 0.0147 427.427572 0.053794 0.154610
1544 100000 1470 17 0.0147 427.427572 0.050651 0.137070
1545 100000 1470 18 0.0147 427.427572 0.047858 0.122372
1546 100000 1470 19 0.0147 427.427572 0.045361 0.109933

1547 rows × 7 columns

nfs.describe()
frequency speed bw wl k ar rnf
count 1547.000000 1547.0 1547.000000 1547.000000 1547.000000 1547.000000 1547.000000
mean 55000.000000 1470.0 11.000000 0.038097 235.085165 0.268131 2.062818
std 26276.345124 0.0 4.900564 0.028463 112.312344 0.289890 3.820341
min 10000.000000 1470.0 3.000000 0.014700 42.742757 0.045361 0.109933
25% 32000.000000 1470.0 7.000000 0.018846 136.776823 0.099356 0.335185
50% 55000.000000 1470.0 11.000000 0.026727 235.085165 0.170922 0.746784
75% 78000.000000 1470.0 15.000000 0.045937 333.393506 0.319617 2.000565
max 100000.000000 1470.0 19.000000 0.147000 427.427572 2.860016 43.702814

Pandas is useful to read tabular files like csv files as pd.read_csv(filename) and supports a long list of other file formats.
Pandas Dataframes use indexing a lot and MultiIndexes are common:

nfs = nfs.set_index(['frequency', 'bw'])
nfs
speed wl k ar rnf
frequency bw
10000 3 1470 0.1470 42.742757 2.860016 43.702814
4 1470 0.1470 42.742757 2.145202 24.587202
5 1470 0.1470 42.742757 1.716358 15.739405
6 1470 0.1470 42.742757 1.430498 10.933195
7 1470 0.1470 42.742757 1.226344 8.035203
... ... ... ... ... ... ...
100000 15 1470 0.0147 427.427572 0.057357 0.175773
16 1470 0.0147 427.427572 0.053794 0.154610
17 1470 0.0147 427.427572 0.050651 0.137070
18 1470 0.0147 427.427572 0.047858 0.122372
19 1470 0.0147 427.427572 0.045361 0.109933

1547 rows × 5 columns

Data Wrangling#

There a loads of built-in functions in pandas to manipulate data frames, including summarising, handling missing data, combining data, reshaping data or grouping data, and more. Cheat sheets can be very useful to get an overview

Subsetting data#

´loc´ and ´iloc´#

´loc´ df.loc[ Subset dataframe by condition and select columns

nfs = nfs.reset_index()
nfs.loc[nfs.frequency == 38000, ['wl','k','rnf']]
wl k rnf
476 0.038684 162.422477 11.500741
477 0.038684 162.422477 6.470316
478 0.038684 162.422477 4.141949
479 0.038684 162.422477 2.877157
480 0.038684 162.422477 2.114527
481 0.038684 162.422477 1.619552
482 0.038684 162.422477 1.280198
483 0.038684 162.422477 1.037461
484 0.038684 162.422477 0.857863
485 0.038684 162.422477 0.721265
486 0.038684 162.422477 0.614959
487 0.038684 162.422477 0.530609
488 0.038684 162.422477 0.462560
489 0.038684 162.422477 0.406868
490 0.038684 162.422477 0.360711
491 0.038684 162.422477 0.322032
492 0.038684 162.422477 0.289298

Subset dataframe by multiple conditions:

nf_df.loc[(nf_df.frequency == 10000) & (nf_df.bw > 10)]
frequency speed bw wl k ar rnf
8 10000 1470 11 0.147 42.742757 0.781114 3.259881
9 10000 1470 12 0.147 42.742757 0.716231 2.740806
10 10000 1470 13 0.147 42.742757 0.661346 2.336845
11 10000 1470 14 0.147 42.742757 0.614318 2.016316
12 10000 1470 15 0.147 42.742757 0.573574 1.757730
13 10000 1470 16 0.147 42.742757 0.537938 1.546097
14 10000 1470 17 0.147 42.742757 0.506507 1.370703
15 10000 1470 18 0.147 42.742757 0.478581 1.223721
16 10000 1470 19 0.147 42.742757 0.453605 1.099332

Subset by index

nf_df.iloc[0,:]
frequency    10000.000000
speed         1470.000000
bw               3.000000
wl               0.147000
k               42.742757
ar               2.860016
rnf             43.702814
Name: 0, dtype: float64
nf_df.iloc[0:10,:]
frequency speed bw wl k ar rnf
0 10000 1470 3 0.147 42.742757 2.860016 43.702814
1 10000 1470 4 0.147 42.742757 2.145202 24.587202
2 10000 1470 5 0.147 42.742757 1.716358 15.739405
3 10000 1470 6 0.147 42.742757 1.430498 10.933195
4 10000 1470 7 0.147 42.742757 1.226344 8.035203
5 10000 1470 8 0.147 42.742757 1.073255 6.154296
6 10000 1470 9 0.147 42.742757 0.954210 4.864754
7 10000 1470 10 0.147 42.742757 0.858997 3.942352
8 10000 1470 11 0.147 42.742757 0.781114 3.259881
9 10000 1470 12 0.147 42.742757 0.716231 2.740806

Summarise all columns by mean for each frequency:

nf_df.groupby('frequency').mean()
speed bw wl k ar rnf
frequency
10000 1470.0 11.0 0.147000 42.742757 1.034594 7.959450
11000 1470.0 11.0 0.133636 47.017033 0.940540 7.235864
12000 1470.0 11.0 0.122500 51.291309 0.862161 6.632875
13000 1470.0 11.0 0.113077 55.565584 0.795841 6.122654
14000 1470.0 11.0 0.105000 59.839860 0.738996 5.685322
... ... ... ... ... ... ...
96000 1470.0 11.0 0.015312 410.330469 0.107770 0.829109
97000 1470.0 11.0 0.015155 414.604745 0.106659 0.820562
98000 1470.0 11.0 0.015000 418.879020 0.105571 0.812189
99000 1470.0 11.0 0.014848 423.153296 0.104504 0.803985
100000 1470.0 11.0 0.014700 427.427572 0.103459 0.795945

91 rows × 6 columns

Pandas Plotting#

Pandas has some built-in basic plotting capabilities. Here some examples:

nfs.reset_index().plot.scatter(x="frequency", y="bw", #x and y axes
                               c="rnf", #color vairable
                               clim=[0,10], #colorscale limits
                               s=2, #size of scatter
                               xlabel="Frequency (Hz)", #x axis title
                               ylabel="Beam width (°)") #y axis title
<Axes: xlabel='Frequency (Hz)', ylabel='Beam width (°)'>
_images/412f42fef0617f60208bc88e23c8c72947723605361edb48fb182a63553e1138.png
nfs.boxplot(column="rnf", by="bw", xlabel="Beam width (°)", ylabel="Nearfield (m)")
<Axes: title={'center': 'rnf'}, xlabel='Beam width (°)', ylabel='Nearfield (m)'>
_images/056655808d688a2158bbcfa37dafcfc8ceb836da29a50d79f1dd0d96f413813b.png
nfs
frequency bw speed wl k ar rnf
0 10000 3 1470 0.1470 42.742757 2.860016 43.702814
1 10000 4 1470 0.1470 42.742757 2.145202 24.587202
2 10000 5 1470 0.1470 42.742757 1.716358 15.739405
3 10000 6 1470 0.1470 42.742757 1.430498 10.933195
4 10000 7 1470 0.1470 42.742757 1.226344 8.035203
... ... ... ... ... ... ... ...
1542 100000 15 1470 0.0147 427.427572 0.057357 0.175773
1543 100000 16 1470 0.0147 427.427572 0.053794 0.154610
1544 100000 17 1470 0.0147 427.427572 0.050651 0.137070
1545 100000 18 1470 0.0147 427.427572 0.047858 0.122372
1546 100000 19 1470 0.0147 427.427572 0.045361 0.109933

1547 rows × 7 columns

nfs.query('frequency in [18000, 38000, 70000, 120000] & bw in [4,7,18]' ).rnf.unstack().plot(title="Nearfield (m)", 
                                                                                             ylabel="Nearfield (m)",
                                                                                             xlabel="Frequency (Hz)",
                                                                                             xticks=[18000,38000,70000])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[50], line 1
----> 1 nfs.query('frequency in [18000, 38000, 70000, 120000] & bw in [4,7,18]' ).rnf.unstack().plot(title="Nearfield (m)", 
      2                                                                                              ylabel="Nearfield (m)",
      3                                                                                              xlabel="Frequency (Hz)",
      4                                                                                              xticks=[18000,38000,70000])

File ~\AppData\Local\miniforge3\envs\wgfast25\Lib\site-packages\pandas\core\series.py:4615, in Series.unstack(self, level, fill_value, sort)
   4570 """
   4571 Unstack, also known as pivot, Series with MultiIndex to produce DataFrame.
   4572 
   (...)
   4611 b    2    4
   4612 """
   4613 from pandas.core.reshape.reshape import unstack
-> 4615 return unstack(self, level, fill_value, sort)

File ~\AppData\Local\miniforge3\envs\wgfast25\Lib\site-packages\pandas\core\reshape\reshape.py:511, in unstack(obj, level, fill_value, sort)
    506         return obj.T.stack(future_stack=True)
    507 elif not isinstance(obj.index, MultiIndex):
    508     # GH 36113
    509     # Give nicer error messages when unstack a Series whose
    510     # Index is not a MultiIndex.
--> 511     raise ValueError(
    512         f"index must be a MultiIndex to unstack, {type(obj.index)} was passed"
    513     )
    514 else:
    515     if is_1d_only_ea_dtype(obj.dtype):

ValueError: index must be a MultiIndex to unstack, <class 'pandas.core.indexes.base.Index'> was passed
nfs.query('frequency in [18000, 38000, 70000, 120000]' ).rnf.plot(kind='hist', 
                                                                  title="Nearfield(m) for frequencies 18, 38, 70 kHz", 
                                                                  edgecolor='black',
                                                                  color="orange")
<Axes: title={'center': 'Nearfield(m) for frequencies 18, 38, 70 kHz'}, ylabel='Frequency'>
_images/f3a01c070ae81b198f99067c3b60b3b3eb30515cc2e68f45bfa959873946b5a3.png
n
0       43.702814
1       24.587202
2       15.739405
3       10.933195
4        8.035203
          ...    
1542     0.175773
1543     0.154610
1544     0.137070
1545     0.122372
1546     0.109933
Name: rnf, Length: 1547, dtype: float64

Heatmaps#

Seaborn is the easiest way to create heatmaps.

import seaborn as sns  

#we use the rnf column, then we reset the index, to use frequency and bw as our new rows (index) and columns.
nfs = nfs.set_index(['frequency', 'bw'])

nfs_piv = nfs.rnf.reset_index().pivot(index="frequency", columns="bw")

sns.heatmap(nfs_piv, #data to use
            annot=False,#add values to cells
            vmin=0, #minimum color value
            vmax=5, #maximum color value
            cmap="magma") #colorscale
<Axes: xlabel='None-bw', ylabel='frequency'>
_images/737b2327b28a9bbb62874588e03911e8d684028cd7425917cb8f5d0f8d25810f.png

Comparison of Pandas and R dplyr#

Task

R dplyr

Python pandas

Loading Library

library(dplyr)

import pandas as pd

Reading Data

df <- read.csv("file.csv")

df = pd.read_csv("file.csv")

Selecting Columns

df %>% select(col1, col2)

df[['col1', 'col2']]

Filtering Rows

df %>% filter(col1 > 10)

df[df['col1'] > 10]

Mutating (Adding Columns)

df %>% mutate(new_col = col1 * 2)

df['new_col'] = df['col1'] * 2

Renaming Columns

df %>% rename(new_col1 = old_col1)

df = df.rename(columns={'old_col1': 'new_col1'})

Grouping Data

df %>% group_by(col1) %>% summarize(mean_val = mean(col2))

df.groupby('col1')['col2'].mean().reset_index()

Sorting Data

df %>% arrange(col1)

df.sort_values('col1')

Dropping Columns

df %>% select(-col1)

df.drop(columns=['col1'])

Counting Rows

df %>% count(col1)

df['col1'].value_counts()

Pivoting (Wider Format)

df %>% pivot_wider(key, value)

df.pivot(index, columns, values)

Pivoting (Longer Format)

df %>% pivot_longer(key, value, col1:col3)

df.melt(id_vars, var_name, value_name)

Merging (Join)

df1 %>% left_join(df2, by = "key")

df1.merge(df2, on='key', how='left')

Distinct Rows

df %>% distinct()

df.drop_duplicates()

Replacing NA Values

df %>% replace_na(list(col1 = 0))

df.fillna({'col1': 0})

Feature

Pandas

dplyr

Multi-indexing (Hierarchical Indexing)

Time Series Handling

Merging on Multiple Columns

✅ (but less flexible)

Missing Value Handling

✅ (but more manual)

Rolling/Expanding Windows

Performance Optimizations (Cython, NumPy)

Pipe Operator (%>%) for readibility

Grouped Mutate

❌ (requires lambda)

Tidy Data Transformations

Verb-Based Function Naming

Multiple Summaries Easily

❌ (requires dict/lambda)

Factor (Categorical) Handling

❌ (manual)

✅ (native)

Xarray#

Xarray provides an excellent introduction tutorial

Xarray mainly works with Datasets and DataArrays. Datasets are containers of DataArrays.

Let’s illustrate this by an MVBS dataset, created within echopype:

import xarray as xr 

mvbs = xr.open_dataset("./python_plotting_files/mvbs.nc")
mvbs

DataArray#

We can pull out the dataarray Sv using . notation

sv = mvbs.Sv
sv

Dimensions#

.dims - Dimensions, are the axes of the data, here the channel, ping_time and depth

sv.dims

Coordinates#

.coords contains the coordinate variables

sv.coords
sv.ping_time

Arbitrary attributes#

.attrs accesses arbitrary Python objects

sv.attrs

Underlying data#

.data provides a numpy array of the data.

sv.data

Plotting#

Plotting can be done easily:

  • we use the .notation to select the DatArray mvbs.Sv

  • we select one channel of the data through indexing .isel(channel=1)

  • we want to plot .plot

  • we want ping_time to represent the x axis and depth the y axis and we want depth to have 0 on top: x='ping_time', y='depth', yincrease = False

  • we want some additional decorations, so we set the colormap cmap='RdYlBu_r' and we set the colorbar limits to be within vmin=-75 and vmax=-40

mvbs.Sv.isel(channel=1).plot(x="ping_time", y="depth", 
                             yincrease=False, 
                             cmap="RdYlBu_r", 
                             vmin=-75, vmax=-40)
import plotly.graph_objects as go

dat = mvbs.swap_dims({"ping_time": "distance"}).isel(channel=1).Sv
tmp=dat.where(dat > -75)
fig = go.Figure(
    data=go.Heatmap(z=tmp.transpose(),
                    zmin=-75,zmax=-30,
                    colorscale = "RdYlBu_r",
                    colorbar=dict(title='')))
fig.update_yaxes(autorange="reversed") 
    
#px.imshow(tmp.transpose(),colorbar=dict(title=""))
fig