PySpark Difference Between Two Dates

10.31.2021

Intro

PySpark provides the datediff and months_between functions, which let us calculate the difference between two dates. This is useful when we want to compute the age of records or the time elapsed since an event occurred. In this article, we will learn how to compute the difference between dates in PySpark.

Setting Up

The quickest way to get started is with the following Docker Compose file. Simply create a docker-compose.yml, paste in the code below, then run docker-compose up. A link will appear in the console that you can open to access a Jupyter notebook.

version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/
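Once the notebook is open, you can verify that PySpark is available with a quick check:

import pyspark

# Print the installed Spark version to confirm the environment works
print(pyspark.__version__)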

Creating the Dataset

Let's start by creating a Spark Session.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Now, let's create a dataframe to work with.

rdd = spark.sparkContext.parallelize([    
    ("jan", "2019", 86000, 30, "2019-02-28 10:30:00", "2019-12-28 10:30:00"),
    ("jan", "2020", 71000, 30, "2020-02-28 10:30:00", "2020-12-28 10:30:00"),
    
    ("feb", "2019", 99000, 42, "2019-02-28 10:30:00", "2019-12-28 10:30:00"),
    ("feb", "2020", 99500, 36, "2020-02-28 10:30:00", "2020-12-28 10:30:00"),
    
    ("mar", "2019", 92000, 25, "2019-02-28 10:30:00", "2019-12-28 10:30:00"),
    ("mar", "2020", 91000, 50, "2020-02-28 10:30:00", "2020-12-28 10:30:00")
])
df = spark.createDataFrame(rdd, schema=["month", "year", "total_revenue", "unique_products_sold", "created_at", "updated_at"])
df.show()
+-----+----+-------------+--------------------+-------------------+-------------------+
|month|year|total_revenue|unique_products_sold|         created_at|         updated_at|
+-----+----+-------------+--------------------+-------------------+-------------------+
|  jan|2019|        86000|                  30|2019-02-28 10:30:00|2019-12-28 10:30:00|
|  jan|2020|        71000|                  30|2020-02-28 10:30:00|2020-12-28 10:30:00|
|  feb|2019|        99000|                  42|2019-02-28 10:30:00|2019-12-28 10:30:00|
|  feb|2020|        99500|                  36|2020-02-28 10:30:00|2020-12-28 10:30:00|
|  mar|2019|        92000|                  25|2019-02-28 10:30:00|2019-12-28 10:30:00|
|  mar|2020|        91000|                  50|2020-02-28 10:30:00|2020-12-28 10:30:00|
+-----+----+-------------+--------------------+-------------------+-------------------+
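Note that created_at and updated_at are plain strings here. Spark will cast strings in the yyyy-MM-dd HH:mm:ss format automatically when we call the date functions below, but if you prefer an explicit schema you can convert them first with to_timestamp. A minimal sketch:

from pyspark.sql.functions import to_timestamp

# Explicitly cast the string columns to TimestampType
df = df.withColumn("created_at", to_timestamp("created_at")) \
       .withColumn("updated_at", to_timestamp("updated_at"))
df.printSchema()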

Using datediff

To use datediff, we pass in the two columns we want the difference between; the result is the number of days from the second date to the first. As with most SQL functions, we can use it with either select or withColumn. Here is an example using select.

from pyspark.sql.functions import datediff, col

df.select(
    datediff("updated_at", "created_at").alias('updated_age')
).show()
+-----------+
|updated_age|
+-----------+
|        303|
|        304|
|        303|
|        304|
|        303|
|        304|
+-----------+
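Note that datediff returns whole days and that argument order matters: the second argument is subtracted from the first, so swapping them yields negative values. A common variation, hinted at in the intro, is measuring the days elapsed since an event by comparing against today's date. A sketch using current_date (output omitted since it depends on when you run it):

from pyspark.sql.functions import current_date, datediff

# Days between today and each row's created_at timestamp
df.select(
    datediff(current_date(), "created_at").alias("days_since_created")
).show()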

In a similar manner, we can use the months_between function to get the difference in months.

from pyspark.sql.functions import months_between, col

df.select(
    months_between("updated_at", "created_at").alias('updated_age')
).show()
+-----------+
|updated_age|
+-----------+
|       10.0|
|       10.0|
|       10.0|
|       10.0|
|       10.0|
|       10.0|
+-----------+
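The result is exactly 10.0 here because both timestamps share the same day of the month and time of day; otherwise months_between returns a fraction, rounded to 8 decimal places by default (the optional roundOff argument disables this). If you only want whole months, you can wrap the result in floor. A minimal sketch:

from pyspark.sql.functions import floor, months_between

# Truncate the fractional part to keep only complete months
df.select(
    floor(months_between("updated_at", "created_at")).alias("full_months")
).show()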

We can also use withColumn to return a new DataFrame with the computed difference appended as a new column.

df.withColumn(
    "updated_age",
    datediff("updated_at", "created_at")
).show()
+-----+----+-------------+--------------------+-------------------+-------------------+-----------+
|month|year|total_revenue|unique_products_sold|         created_at|         updated_at|updated_age|
+-----+----+-------------+--------------------+-------------------+-------------------+-----------+
|  jan|2019|        86000|                  30|2019-02-28 10:30:00|2019-12-28 10:30:00|        303|
|  jan|2020|        71000|                  30|2020-02-28 10:30:00|2020-12-28 10:30:00|        304|
|  feb|2019|        99000|                  42|2019-02-28 10:30:00|2019-12-28 10:30:00|        303|
|  feb|2020|        99500|                  36|2020-02-28 10:30:00|2020-12-28 10:30:00|        304|
|  mar|2019|        92000|                  25|2019-02-28 10:30:00|2019-12-28 10:30:00|        303|
|  mar|2020|        91000|                  50|2020-02-28 10:30:00|2020-12-28 10:30:00|        304|
+-----+----+-------------+--------------------+-------------------+-------------------+-----------+

Likewise, here is the same example using months_between.

df.withColumn(
    "updated_age",
    months_between("updated_at", "created_at")
).show()
+-----+----+-------------+--------------------+-------------------+-------------------+-----------+
|month|year|total_revenue|unique_products_sold|         created_at|         updated_at|updated_age|
+-----+----+-------------+--------------------+-------------------+-------------------+-----------+
|  jan|2019|        86000|                  30|2019-02-28 10:30:00|2019-12-28 10:30:00|       10.0|
|  jan|2020|        71000|                  30|2020-02-28 10:30:00|2020-12-28 10:30:00|       10.0|
|  feb|2019|        99000|                  42|2019-02-28 10:30:00|2019-12-28 10:30:00|       10.0|
|  feb|2020|        99500|                  36|2020-02-28 10:30:00|2020-12-28 10:30:00|       10.0|
|  mar|2019|        92000|                  25|2019-02-28 10:30:00|2019-12-28 10:30:00|       10.0|
|  mar|2020|        91000|                  50|2020-02-28 10:30:00|2020-12-28 10:30:00|       10.0|
+-----+----+-------------+--------------------+-------------------+-------------------+-----------+
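Putting it together, these differences are easy to convert into coarser units. As a final sketch, here is an approximate age in years, derived by dividing the month difference by 12 (the updated_age_years column name is just for illustration):

from pyspark.sql.functions import months_between
from pyspark.sql.functions import round as spark_round

# Months divided by 12 gives an approximate age in years
df.withColumn(
    "updated_age_years",
    spark_round(months_between("updated_at", "created_at") / 12, 2)
).select("month", "year", "updated_age_years").show()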