Pandas Resample Time Series

Intro

Resampling is a common task when working with time series dta. Resampling goes in two directions, upsampling and downsampling. Upsampling allows us to go from a lower time frame to a higher, i.e. minutes to hours. Downsampling is the reverse. In this article, we will learn how to do resampling in python with pandas.

Loading the Data

The data we will use is air quality data for city and countries around the world. The main columns we will use are the date.utc and value.

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_no2_long.csv')

df.head()

	city	country	date.utc	location	parameter	value	unit
0	Paris	FR	2019-06-21 00:00:00+00:00	FR04014	no2	20.0	µg/m³
1	Paris	FR	2019-06-20 23:00:00+00:00	FR04014	no2	21.8	µg/m³
2	Paris	FR	2019-06-20 22:00:00+00:00	FR04014	no2	26.5	µg/m³
3	Paris	FR	2019-06-20 21:00:00+00:00	FR04014	no2	24.9	µg/m³
4	Paris	FR	2019-06-20 20:00:00+00:00	FR04014	no2	21.4	µg/m³

Indexing the Date

When working with time series, we often want to index our data frame by a date field. We first use to_datetime to convert date.utc to a date then we use set_index to set a date index.

df['date.utc'] = pd.to_datetime(df['date.utc'])
df = df.set_index('date.utc')

df = df.sort_index()
df.head()

	city	country	location	parameter	value	unit
date.utc
2019-05-07 01:00:00+00:00	London	GB	London Westminster	no2	23.0	µg/m³
2019-05-07 01:00:00+00:00	Antwerpen	BE	BETR801	no2	50.5	µg/m³
2019-05-07 01:00:00+00:00	Paris	FR	FR04014	no2	25.0	µg/m³
2019-05-07 02:00:00+00:00	Paris	FR	FR04014	no2	27.7	µg/m³
2019-05-07 02:00:00+00:00	London	GB	London Westminster	no2	19.0	µg/m³

Downsampling in Pandas

The first type of resample we will do is downsampling. We can do this by using the resample method and passing in a number plus an offset alias. Here we use 1D for one day. This will sum our data per a day.

You can check out more offsets here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.

df.resample('1D').sum().head()

	value
date.utc
2019-05-07 00:00:00+00:00	1343.2
2019-05-08 00:00:00+00:00	1253.8
2019-05-09 00:00:00+00:00	1734.6
2019-05-10 00:00:00+00:00	1613.5
2019-05-11 00:00:00+00:00	1202.4

We can also run multiple aggregate functions by using the agg function.

df['value'].resample('1D').agg(['min', 'max', 'sum']).head()

	min	max	sum
date.utc
2019-05-07 00:00:00+00:00	10.6	77.7	1343.2
2019-05-08 00:00:00+00:00	13.5	48.9	1253.8
2019-05-09 00:00:00+00:00	10.0	97.0	1734.6
2019-05-10 00:00:00+00:00	10.5	60.7	1613.5
2019-05-11 00:00:00+00:00	14.9	37.0	1202.4

We can also run more customize indexes by passing the offset parameter. Here aggregate by 5days with a 2 day offset.

df.resample('5D', offset="2D").sum()

	value
date.utc
2019-05-04 00:00:00+00:00	2597.0
2019-05-09 00:00:00+00:00	6932.5
2019-05-14 00:00:00+00:00	7044.0
2019-05-19 00:00:00+00:00	9317.5
2019-05-24 00:00:00+00:00	6065.3
2019-05-29 00:00:00+00:00	5698.8
2019-06-03 00:00:00+00:00	5022.5
2019-06-08 00:00:00+00:00	5401.1
2019-06-13 00:00:00+00:00	5391.9
2019-06-18 00:00:00+00:00	838.9

Upsampling in Pandas

Upsampling is the opposite of downsampling, as the name implies. We can run upsampling in a similar way, however, we will need to fill in missing data. Let's start with an example of upsampling our data to quarters.

We will need a new data frame for this. We will be moving from years to qurters, so there will be missing data. First let's create a fake data frame with yearly sales.

df = pd.DataFrame(
    { 
        'value': [1000, 2000, 3000]
    }, 
    index = pd.period_range(
        '2016-01-01',
         freq = 'A',
         periods = 3
    )
)
df.head()

	value
2016	1000
2017	2000
2018	3000

Just like before, we can use the resample method. This time, we will resample by Q for quarter.

df.resample('Q').sum()

	value
2016Q1	1000.0
2016Q2	NaN
2016Q3	NaN
2016Q4	NaN
2017Q1	2000.0
2017Q2	NaN
2017Q3	NaN
2017Q4	NaN
2018Q1	3000.0
2018Q2	NaN
2018Q3	NaN
2018Q4	NaN

As you may notice from above, we are missing data for the quarters. We will need to use a fill strategy to resolve this.

The first way we can resolve this is with the ffill method which will conduct forward fill. This takes the values from the year and moves them until it finds a non empty value.

df.resample('Q').ffill()

	value
2016Q1	1000
2016Q2	1000
2016Q3	1000
2016Q4	1000
2017Q1	2000
2017Q2	2000
2017Q3	2000
2017Q4	2000
2018Q1	3000
2018Q2	3000
2018Q3	3000
2018Q4	3000

Similarly, we can use the bfill to fill backwards.

df.resample('Q').bfill()

	value
2016Q1	1000.0
2016Q2	2000.0
2016Q3	2000.0
2016Q4	2000.0
2017Q1	2000.0
2017Q2	3000.0
2017Q3	3000.0
2017Q4	3000.0
2018Q1	3000.0
2018Q2	NaN
2018Q3	NaN
2018Q4	NaN

Pandas Resample Time Series

07.13.2021

Intro

Loading the Data

Indexing the Date

Downsampling in Pandas

Upsampling in Pandas

How to Plot a Timeseries in Python

R Resample Time Series