Using time series is a common task in data science with python. We often want to select specific information based on dates or a date range. In this article, we will learn how to index and subset time series data in Python.
Let's start by creating a basic time series. We will have two columns, a list of dates and the respective sales.
import pandas as pd
sales = [100, 300, 400, 200]
dates = pd.date_range("20180101", periods=4, freq="Y")
df = pd.DataFrame.from_dict({
'dates': dates,
'sales': sales
})
df.head()
dates | sales | |
---|---|---|
0 | 2018-12-31 | 100 |
1 | 2019-12-31 | 300 |
2 | 2020-12-31 | 400 |
3 | 2021-12-31 | 200 |
Now, we will set the index of our data frame to be dates
so we can use the selectors for dates.
df['dates'] = pd.to_datetime(df['dates'])
df = df.set_index('dates')
df.head()
sales | |
---|---|
dates | |
2018-12-31 | 100 |
2019-12-31 | 300 |
2020-12-31 | 400 |
2021-12-31 | 200 |
The first way we can select data based on date is using the loc
selector. We can pass in a specific date as follows.
df.loc['2018-12-31']
sales 100
Name: 2018-12-31 00:00:00, dtype: int64
We can also select groups of data using the loc
selector. For examplem we can select year this way.
df.loc['2019']
sales | |
---|---|
dates | |
2019-12-31 | 300 |
In a similar way we can select all days under a year and month using the following.
df.loc['2020-12']
sales | |
---|---|
dates | |
2020-12-31 | 400 |
We can also use the normal bracket notation to select between a range of dates. For example, we can use the following to select between January 2019 and February 2021 (not including the final month/year.)
df["2019-1":"2021-2"]
sales | |
---|---|
dates | |
2019-12-31 | 300 |
2020-12-31 | 400 |
In a similar way, we can select between two years.
df["2019":"2021"]
sales | |
---|---|
dates | |
2019-12-31 | 300 |
2020-12-31 | 400 |
2021-12-31 | 200 |