Pandas Resample Time Series

07.13.2021

Intro

Resampling is a common task when working with time series dta. Resampling goes in two directions, upsampling and downsampling. Upsampling allows us to go from a lower time frame to a higher, i.e. minutes to hours. Downsampling is the reverse. In this article, we will learn how to do resampling in python with pandas.

Loading the Data

The data we will use is air quality data for city and countries around the world. The main columns we will use are the date.utc and value.

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_no2_long.csv')
df.head()
city country date.utc location parameter value unit
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m³
1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m³
2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m³
3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m³
4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m³

Indexing the Date

When working with time series, we often want to index our data frame by a date field. We first use to_datetime to convert date.utc to a date then we use set_index to set a date index.

df['date.utc'] = pd.to_datetime(df['date.utc'])
df = df.set_index('date.utc')
df = df.sort_index()
df.head()
city country location parameter value unit
date.utc
2019-05-07 01:00:00+00:00 London GB London Westminster no2 23.0 µg/m³
2019-05-07 01:00:00+00:00 Antwerpen BE BETR801 no2 50.5 µg/m³
2019-05-07 01:00:00+00:00 Paris FR FR04014 no2 25.0 µg/m³
2019-05-07 02:00:00+00:00 Paris FR FR04014 no2 27.7 µg/m³
2019-05-07 02:00:00+00:00 London GB London Westminster no2 19.0 µg/m³

Downsampling in Pandas

The first type of resample we will do is downsampling. We can do this by using the resample method and passing in a number plus an offset alias. Here we use 1D for one day. This will sum our data per a day.

You can check out more offsets here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.

df.resample('1D').sum().head()
value
date.utc
2019-05-07 00:00:00+00:00 1343.2
2019-05-08 00:00:00+00:00 1253.8
2019-05-09 00:00:00+00:00 1734.6
2019-05-10 00:00:00+00:00 1613.5
2019-05-11 00:00:00+00:00 1202.4

We can also run multiple aggregate functions by using the agg function.

df['value'].resample('1D').agg(['min', 'max', 'sum']).head()
min max sum
date.utc
2019-05-07 00:00:00+00:00 10.6 77.7 1343.2
2019-05-08 00:00:00+00:00 13.5 48.9 1253.8
2019-05-09 00:00:00+00:00 10.0 97.0 1734.6
2019-05-10 00:00:00+00:00 10.5 60.7 1613.5
2019-05-11 00:00:00+00:00 14.9 37.0 1202.4

We can also run more customize indexes by passing the offset parameter. Here aggregate by 5days with a 2 day offset.

df.resample('5D', offset="2D").sum()
value
date.utc
2019-05-04 00:00:00+00:00 2597.0
2019-05-09 00:00:00+00:00 6932.5
2019-05-14 00:00:00+00:00 7044.0
2019-05-19 00:00:00+00:00 9317.5
2019-05-24 00:00:00+00:00 6065.3
2019-05-29 00:00:00+00:00 5698.8
2019-06-03 00:00:00+00:00 5022.5
2019-06-08 00:00:00+00:00 5401.1
2019-06-13 00:00:00+00:00 5391.9
2019-06-18 00:00:00+00:00 838.9

Upsampling in Pandas

Upsampling is the opposite of downsampling, as the name implies. We can run upsampling in a similar way, however, we will need to fill in missing data. Let's start with an example of upsampling our data to quarters.

We will need a new data frame for this. We will be moving from years to qurters, so there will be missing data. First let's create a fake data frame with yearly sales.

df = pd.DataFrame(
    { 
        'value': [1000, 2000, 3000]
    }, 
    index = pd.period_range(
        '2016-01-01',
         freq = 'A',
         periods = 3
    )
)
df.head()
value
2016 1000
2017 2000
2018 3000

Just like before, we can use the resample method. This time, we will resample by Q for quarter.

df.resample('Q').sum()
value
2016Q1 1000.0
2016Q2 NaN
2016Q3 NaN
2016Q4 NaN
2017Q1 2000.0
2017Q2 NaN
2017Q3 NaN
2017Q4 NaN
2018Q1 3000.0
2018Q2 NaN
2018Q3 NaN
2018Q4 NaN

As you may notice from above, we are missing data for the quarters. We will need to use a fill strategy to resolve this.

The first way we can resolve this is with the ffill method which will conduct forward fill. This takes the values from the year and moves them until it finds a non empty value.

df.resample('Q').ffill()
value
2016Q1 1000
2016Q2 1000
2016Q3 1000
2016Q4 1000
2017Q1 2000
2017Q2 2000
2017Q3 2000
2017Q4 2000
2018Q1 3000
2018Q2 3000
2018Q3 3000
2018Q4 3000

Similarly, we can use the bfill to fill backwards.

df.resample('Q').bfill()
value
2016Q1 1000.0
2016Q2 2000.0
2016Q3 2000.0
2016Q4 2000.0
2017Q1 2000.0
2017Q2 3000.0
2017Q3 3000.0
2017Q4 3000.0
2018Q1 3000.0
2018Q2 NaN
2018Q3 NaN
2018Q4 NaN