Using time series is a common task in data science with python. We often want to select specific information based on dates or a date range. In this article, we will learn how to index and subset time series data in Python.
Let's start by creating a basic time series. We will have two columns, a list of dates and the respective sales.
import pandas as pdsales = [100, 300, 400, 200]
dates = pd.date_range("20180101", periods=4, freq="Y")df = pd.DataFrame.from_dict({
'dates': dates,
'sales': sales
})
df.head()| dates | sales | |
|---|---|---|
| 0 | 2018-12-31 | 100 |
| 1 | 2019-12-31 | 300 |
| 2 | 2020-12-31 | 400 |
| 3 | 2021-12-31 | 200 |
Now, we will set the index of our data frame to be dates so we can use the selectors for dates.
df['dates'] = pd.to_datetime(df['dates'])
df = df.set_index('dates')df.head()| sales | |
|---|---|
| dates | |
| 2018-12-31 | 100 |
| 2019-12-31 | 300 |
| 2020-12-31 | 400 |
| 2021-12-31 | 200 |
The first way we can select data based on date is using the loc selector. We can pass in a specific date as follows.
df.loc['2018-12-31']sales 100
Name: 2018-12-31 00:00:00, dtype: int64We can also select groups of data using the loc selector. For examplem we can select year this way.
df.loc['2019']| sales | |
|---|---|
| dates | |
| 2019-12-31 | 300 |
In a similar way we can select all days under a year and month using the following.
df.loc['2020-12']| sales | |
|---|---|
| dates | |
| 2020-12-31 | 400 |
We can also use the normal bracket notation to select between a range of dates. For example, we can use the following to select between January 2019 and February 2021 (not including the final month/year.)
df["2019-1":"2021-2"]| sales | |
|---|---|
| dates | |
| 2019-12-31 | 300 |
| 2020-12-31 | 400 |
In a similar way, we can select between two years.
df["2019":"2021"]| sales | |
|---|---|
| dates | |
| 2019-12-31 | 300 |
| 2020-12-31 | 400 |
| 2021-12-31 | 200 |
