Resampling is a common task when working with time series dta. Resampling goes in two directions, upsampling and downsampling. Upsampling allows us to go from a lower time frame to a higher, i.e. minutes to hours. Downsampling is the reverse. In this article, we will learn how to do resampling in python with pandas.
The data we will use is air quality data for city and countries around the world. The main columns we will use are the date.utc
and value
.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_no2_long.csv')
df.head()
city | country | date.utc | location | parameter | value | unit | |
---|---|---|---|---|---|---|---|
0 | Paris | FR | 2019-06-21 00:00:00+00:00 | FR04014 | no2 | 20.0 | µg/m³ |
1 | Paris | FR | 2019-06-20 23:00:00+00:00 | FR04014 | no2 | 21.8 | µg/m³ |
2 | Paris | FR | 2019-06-20 22:00:00+00:00 | FR04014 | no2 | 26.5 | µg/m³ |
3 | Paris | FR | 2019-06-20 21:00:00+00:00 | FR04014 | no2 | 24.9 | µg/m³ |
4 | Paris | FR | 2019-06-20 20:00:00+00:00 | FR04014 | no2 | 21.4 | µg/m³ |
When working with time series, we often want to index our data frame by a date field. We first use to_datetime
to convert date.utc
to a date then we use set_index
to set a date index.
df['date.utc'] = pd.to_datetime(df['date.utc'])
df = df.set_index('date.utc')
df = df.sort_index()
df.head()
city | country | location | parameter | value | unit | |
---|---|---|---|---|---|---|
date.utc | ||||||
2019-05-07 01:00:00+00:00 | London | GB | London Westminster | no2 | 23.0 | µg/m³ |
2019-05-07 01:00:00+00:00 | Antwerpen | BE | BETR801 | no2 | 50.5 | µg/m³ |
2019-05-07 01:00:00+00:00 | Paris | FR | FR04014 | no2 | 25.0 | µg/m³ |
2019-05-07 02:00:00+00:00 | Paris | FR | FR04014 | no2 | 27.7 | µg/m³ |
2019-05-07 02:00:00+00:00 | London | GB | London Westminster | no2 | 19.0 | µg/m³ |
The first type of resample we will do is downsampling. We can do this by using the resample
method and passing in a number plus an offset alias. Here we use 1D
for one day. This will sum our data per a day.
You can check out more offsets here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.
df.resample('1D').sum().head()
value | |
---|---|
date.utc | |
2019-05-07 00:00:00+00:00 | 1343.2 |
2019-05-08 00:00:00+00:00 | 1253.8 |
2019-05-09 00:00:00+00:00 | 1734.6 |
2019-05-10 00:00:00+00:00 | 1613.5 |
2019-05-11 00:00:00+00:00 | 1202.4 |
We can also run multiple aggregate functions by using the agg
function.
df['value'].resample('1D').agg(['min', 'max', 'sum']).head()
min | max | sum | |
---|---|---|---|
date.utc | |||
2019-05-07 00:00:00+00:00 | 10.6 | 77.7 | 1343.2 |
2019-05-08 00:00:00+00:00 | 13.5 | 48.9 | 1253.8 |
2019-05-09 00:00:00+00:00 | 10.0 | 97.0 | 1734.6 |
2019-05-10 00:00:00+00:00 | 10.5 | 60.7 | 1613.5 |
2019-05-11 00:00:00+00:00 | 14.9 | 37.0 | 1202.4 |
We can also run more customize indexes by passing the offset
parameter. Here aggregate by 5days with a 2 day offset.
df.resample('5D', offset="2D").sum()
value | |
---|---|
date.utc | |
2019-05-04 00:00:00+00:00 | 2597.0 |
2019-05-09 00:00:00+00:00 | 6932.5 |
2019-05-14 00:00:00+00:00 | 7044.0 |
2019-05-19 00:00:00+00:00 | 9317.5 |
2019-05-24 00:00:00+00:00 | 6065.3 |
2019-05-29 00:00:00+00:00 | 5698.8 |
2019-06-03 00:00:00+00:00 | 5022.5 |
2019-06-08 00:00:00+00:00 | 5401.1 |
2019-06-13 00:00:00+00:00 | 5391.9 |
2019-06-18 00:00:00+00:00 | 838.9 |
Upsampling is the opposite of downsampling, as the name implies. We can run upsampling in a similar way, however, we will need to fill in missing data. Let's start with an example of upsampling our data to quarters.
We will need a new data frame for this. We will be moving from years to qurters, so there will be missing data. First let's create a fake data frame with yearly sales.
df = pd.DataFrame(
{
'value': [1000, 2000, 3000]
},
index = pd.period_range(
'2016-01-01',
freq = 'A',
periods = 3
)
)
df.head()
value | |
---|---|
2016 | 1000 |
2017 | 2000 |
2018 | 3000 |
Just like before, we can use the resample method. This time, we will resample by Q
for quarter.
df.resample('Q').sum()
value | |
---|---|
2016Q1 | 1000.0 |
2016Q2 | NaN |
2016Q3 | NaN |
2016Q4 | NaN |
2017Q1 | 2000.0 |
2017Q2 | NaN |
2017Q3 | NaN |
2017Q4 | NaN |
2018Q1 | 3000.0 |
2018Q2 | NaN |
2018Q3 | NaN |
2018Q4 | NaN |
As you may notice from above, we are missing data for the quarters. We will need to use a fill strategy to resolve this.
The first way we can resolve this is with the ffill
method which will conduct forward fill. This takes the values from the year and moves them until it finds a non empty value.
df.resample('Q').ffill()
value | |
---|---|
2016Q1 | 1000 |
2016Q2 | 1000 |
2016Q3 | 1000 |
2016Q4 | 1000 |
2017Q1 | 2000 |
2017Q2 | 2000 |
2017Q3 | 2000 |
2017Q4 | 2000 |
2018Q1 | 3000 |
2018Q2 | 3000 |
2018Q3 | 3000 |
2018Q4 | 3000 |
Similarly, we can use the bfill
to fill backwards.
df.resample('Q').bfill()
value | |
---|---|
2016Q1 | 1000.0 |
2016Q2 | 2000.0 |
2016Q3 | 2000.0 |
2016Q4 | 2000.0 |
2017Q1 | 2000.0 |
2017Q2 | 3000.0 |
2017Q3 | 3000.0 |
2017Q4 | 3000.0 |
2018Q1 | 3000.0 |
2018Q2 | NaN |
2018Q3 | NaN |
2018Q4 | NaN |