PySpark provides the datediff and months_between functions, which allow us to compute the time difference between two dates. This is helpful when we want to calculate the age of observations or the time since an event occurred. In this article, we will learn how to compute the difference between dates in PySpark.
The quickest way to get started working with PySpark is to use the following Docker Compose file. Simply create a docker-compose.yml, paste the following code, then run docker-compose up. You will then see a link in the console to open and access a Jupyter notebook.
version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/
Let's start by creating a Spark Session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Now, let's create a DataFrame to work with.
rdd = spark.sparkContext.parallelize([
    ("jan", "2019", 86000, 30, "2019-02-28 10:30:00", "2019-12-28 10:30:00"),
    ("jan", "2020", 71000, 30, "2020-02-28 10:30:00", "2020-12-28 10:30:00"),
    ("feb", "2019", 99000, 42, "2019-02-28 10:30:00", "2019-12-28 10:30:00"),
    ("feb", "2020", 99500, 36, "2020-02-28 10:30:00", "2020-12-28 10:30:00"),
    ("mar", "2019", 92000, 25, "2019-02-28 10:30:00", "2019-12-28 10:30:00"),
    ("mar", "2020", 91000, 50, "2020-02-28 10:30:00", "2020-12-28 10:30:00")
])
df = spark.createDataFrame(rdd, schema=["month", "year", "total_revenue", "unique_products_sold", "created_at", "updated_at"])
df.show()
+-----+----+-------------+--------------------+-------------------+-------------------+
|month|year|total_revenue|unique_products_sold| created_at| updated_at|
+-----+----+-------------+--------------------+-------------------+-------------------+
| jan|2019| 86000| 30|2019-02-28 10:30:00|2019-12-28 10:30:00|
| jan|2020| 71000| 30|2020-02-28 10:30:00|2020-12-28 10:30:00|
| feb|2019| 99000| 42|2019-02-28 10:30:00|2019-12-28 10:30:00|
| feb|2020| 99500| 36|2020-02-28 10:30:00|2020-12-28 10:30:00|
| mar|2019| 92000| 25|2019-02-28 10:30:00|2019-12-28 10:30:00|
| mar|2020| 91000| 50|2020-02-28 10:30:00|2020-12-28 10:30:00|
+-----+----+-------------+--------------------+-------------------+-------------------+
To use datediff, we pass in the two column names we would like to calculate the difference between. As with most SQL functions, we can use select or withColumn. Here is an example using select.
from pyspark.sql.functions import datediff

df.select(
    datediff("updated_at", "created_at").alias('updated_age')
).show()
+-----------+
|updated_age|
+-----------+
| 303|
| 304|
| 303|
| 304|
| 303|
| 304|
+-----------+
In a similar manner, we can use the months_between function.
from pyspark.sql.functions import months_between

df.select(
    months_between("updated_at", "created_at").alias('updated_age')
).show()
+-----------+
|updated_age|
+-----------+
| 10.0|
| 10.0|
| 10.0|
| 10.0|
| 10.0|
| 10.0|
+-----------+
We can also use withColumn to return a new DataFrame with the computed column appended.
df.withColumn(
    "updated_age",
    datediff("updated_at", "created_at")
).show()
+-----+----+-------------+--------------------+-------------------+-------------------+-----------+
|month|year|total_revenue|unique_products_sold| created_at| updated_at|updated_age|
+-----+----+-------------+--------------------+-------------------+-------------------+-----------+
| jan|2019| 86000| 30|2019-02-28 10:30:00|2019-12-28 10:30:00| 303|
| jan|2020| 71000| 30|2020-02-28 10:30:00|2020-12-28 10:30:00| 304|
| feb|2019| 99000| 42|2019-02-28 10:30:00|2019-12-28 10:30:00| 303|
| feb|2020| 99500| 36|2020-02-28 10:30:00|2020-12-28 10:30:00| 304|
| mar|2019| 92000| 25|2019-02-28 10:30:00|2019-12-28 10:30:00| 303|
| mar|2020| 91000| 50|2020-02-28 10:30:00|2020-12-28 10:30:00| 304|
+-----+----+-------------+--------------------+-------------------+-------------------+-----------+
df.withColumn(
    "updated_age",
    months_between("updated_at", "created_at")
).show()
+-----+----+-------------+--------------------+-------------------+-------------------+-----------+
|month|year|total_revenue|unique_products_sold| created_at| updated_at|updated_age|
+-----+----+-------------+--------------------+-------------------+-------------------+-----------+
| jan|2019| 86000| 30|2019-02-28 10:30:00|2019-12-28 10:30:00| 10.0|
| jan|2020| 71000| 30|2020-02-28 10:30:00|2020-12-28 10:30:00| 10.0|
| feb|2019| 99000| 42|2019-02-28 10:30:00|2019-12-28 10:30:00| 10.0|
| feb|2020| 99500| 36|2020-02-28 10:30:00|2020-12-28 10:30:00| 10.0|
| mar|2019| 92000| 25|2019-02-28 10:30:00|2019-12-28 10:30:00| 10.0|
| mar|2020| 91000| 50|2020-02-28 10:30:00|2020-12-28 10:30:00| 10.0|
+-----+----+-------------+--------------------+-------------------+-------------------+-----------+