When working with time series, we often want to view the average over a certain number of days. For example, we can view a 7-day rolling average to give us an idea of change from week to week. In this article, we will learn how to conduct a moving average in python.
Let's load a data set of monthly milk production. We will load it from the url below. The data consists of monthly intervals and kilograms of milk produced.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_no2_long.csv')
df.head()
city | country | date.utc | location | parameter | value | unit | |
---|---|---|---|---|---|---|---|
0 | Paris | FR | 2019-06-21 00:00:00+00:00 | FR04014 | no2 | 20.0 | µg/m³ |
1 | Paris | FR | 2019-06-20 23:00:00+00:00 | FR04014 | no2 | 21.8 | µg/m³ |
2 | Paris | FR | 2019-06-20 22:00:00+00:00 | FR04014 | no2 | 26.5 | µg/m³ |
3 | Paris | FR | 2019-06-20 21:00:00+00:00 | FR04014 | no2 | 24.9 | µg/m³ |
4 | Paris | FR | 2019-06-20 20:00:00+00:00 | FR04014 | no2 | 21.4 | µg/m³ |
date_col = 'date.utc'
df = df[[date_col, 'value']]
df[date_col] = pd.to_datetime(df[date_col])
df = df.set_index(date_col)
df = df.sort_index()
C:\Users\krh12\AppData\Local\Temp/ipykernel_13784/1677584757.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df[date_col] = pd.to_datetime(df[date_col])
df.head()
value | |
---|---|
date.utc | |
2019-05-07 01:00:00+00:00 | 23.0 |
2019-05-07 01:00:00+00:00 | 50.5 |
2019-05-07 01:00:00+00:00 | 25.0 |
2019-05-07 02:00:00+00:00 | 27.7 |
2019-05-07 02:00:00+00:00 | 19.0 |
df.plot()
<AxesSubplot:xlabel='date.utc'>
To conduct a moving average, we can use the rolling
function from the pandas
package that is a method of the DataFrame. This function takes three variables: the time series, the number of days to apply, and the function to apply. In the example below, we run a 2-day mean (or 2 day avg).
twoday_mean = df.rolling('2D').mean()
We can also plot the data over our orignal time series to see how the avg smoothed out the data.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1,figsize=(10,5))
ax.plot(df.value)
ax.plot(twoday_mean)
plt.show()
Let's do another example with a 7-day avg which is a common task in disease outbreaks and stocks.
sevenday_mean = df.rolling('7D').mean()
sevenday_mean
value | |
---|---|
date.utc | |
2019-05-07 01:00:00+00:00 | 23.000000 |
2019-05-07 01:00:00+00:00 | 36.750000 |
2019-05-07 01:00:00+00:00 | 32.833333 |
2019-05-07 02:00:00+00:00 | 31.550000 |
2019-05-07 02:00:00+00:00 | 29.040000 |
... | ... |
2019-06-20 20:00:00+00:00 | 24.443396 |
2019-06-20 21:00:00+00:00 | 24.327488 |
2019-06-20 22:00:00+00:00 | 24.127143 |
2019-06-20 23:00:00+00:00 | 23.900478 |
2019-06-21 00:00:00+00:00 | 23.692308 |
2068 rows × 1 columns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1,figsize=(10,5))
ax.plot(df.value)
ax.plot(sevenday_mean)
plt.show()
You may have noticed from the above that we can do more than a rolling average with the rolling
function. We can actually apply any math function. Let's run a couple of more examples, sum and median.
df.rolling('7D').median()
value | |
---|---|
date.utc | |
2019-05-07 01:00:00+00:00 | 23.00 |
2019-05-07 01:00:00+00:00 | 36.75 |
2019-05-07 01:00:00+00:00 | 25.00 |
2019-05-07 02:00:00+00:00 | 26.35 |
2019-05-07 02:00:00+00:00 | 25.00 |
... | ... |
2019-06-20 20:00:00+00:00 | 18.00 |
2019-06-20 21:00:00+00:00 | 18.00 |
2019-06-20 22:00:00+00:00 | 18.45 |
2019-06-20 23:00:00+00:00 | 18.90 |
2019-06-21 00:00:00+00:00 | 18.95 |
2068 rows × 1 columns
df.rolling('7D').sum()
value | |
---|---|
date.utc | |
2019-05-07 01:00:00+00:00 | 23.0 |
2019-05-07 01:00:00+00:00 | 73.5 |
2019-05-07 01:00:00+00:00 | 98.5 |
2019-05-07 02:00:00+00:00 | 126.2 |
2019-05-07 02:00:00+00:00 | 145.2 |
... | ... |
2019-06-20 20:00:00+00:00 | 5182.0 |
2019-06-20 21:00:00+00:00 | 5133.1 |
2019-06-20 22:00:00+00:00 | 5066.7 |
2019-06-20 23:00:00+00:00 | 4995.2 |
2019-06-21 00:00:00+00:00 | 4928.0 |
2068 rows × 1 columns