PySpark Date_Format

10.30.2021

Intro

The PySpark date_format function allows us to convert date columns into string columns using a specified output format. A common task is converting a timestamp to the ISO 8601 standard for export to other systems. In this article, we will learn how to use the date_format function.

Setting Up

The quickest way to get started with PySpark is to use the following Docker Compose file. Simply create a docker-compose.yml, paste in the following code, then run docker-compose up. You will then see a link in the console to open and access a Jupyter notebook.

version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/

Creating the Dataset

Let's start by creating a Spark Session.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Now, let's create a dataframe to work with.

rdd = spark.sparkContext.parallelize([
    ("jan", "2019", 86000, 30, "1997-02-28 10:30:00"),
    ("jan", "2020", 71000, 30, "1997-02-28 10:30:00"),
    ("feb", "2019", 99000, 42, "1997-02-28 10:30:00"),
    ("feb", "2020", 99500, 36, "1997-02-28 10:30:00"),
    ("mar", "2019", 92000, 25, "1997-02-28 10:30:00"),
    ("mar", "2020", 91000, 50, "1997-02-28 10:30:00")
])
df = spark.createDataFrame(rdd, schema=["month", "year", "total_revenue", "unique_products_sold", "Date"])
df.show()
+-----+----+-------------+--------------------+-------------------+
|month|year|total_revenue|unique_products_sold|               Date|
+-----+----+-------------+--------------------+-------------------+
|  jan|2019|        86000|                  30|1997-02-28 10:30:00|
|  jan|2020|        71000|                  30|1997-02-28 10:30:00|
|  feb|2019|        99000|                  42|1997-02-28 10:30:00|
|  feb|2020|        99500|                  36|1997-02-28 10:30:00|
|  mar|2019|        92000|                  25|1997-02-28 10:30:00|
|  mar|2020|        91000|                  50|1997-02-28 10:30:00|
+-----+----+-------------+--------------------+-------------------+
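One thing worth noting before we format anything: the Date column in this DataFrame is just a plain string, and date_format handles the cast to a timestamp for us. A quick sanity check of the schema (the output below is a sketch of what you should see):

df.printSchema()
root
 |-- month: string (nullable = true)
 |-- year: string (nullable = true)
 |-- total_revenue: long (nullable = true)
 |-- unique_products_sold: long (nullable = true)
 |-- Date: string (nullable = true)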

Using Date_Format

To use date_format, we pass in the name of the column we want to convert along with the desired output format. As with most SQL functions, we can use it with either select or withColumn. Here is an example using select to convert the column.

from pyspark.sql.functions import date_format, col

df.select(
    date_format("date", "yyyy MM dd").alias('Date')
).show()
+----------+
|      Date|
+----------+
|1997 02 28|
|1997 02 28|
|1997 02 28|
|1997 02 28|
|1997 02 28|
|1997 02 28|
+----------+
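The second argument accepts the standard Spark datetime patterns (backed by Java's DateTimeFormatter in Spark 3.x), so we are not limited to numeric output. Here is a quick sketch of a few common patterns applied to the same column; the aliases are just illustrative names:

df.select(
    date_format("date", "MM/dd/yyyy").alias("us_style"),
    date_format("date", "MMMM").alias("month_name"),
    date_format("date", "EEEE").alias("day_of_week"),
    date_format("date", "HH:mm:ss").alias("time_only")
).show(1)
+----------+----------+-----------+---------+
|  us_style|month_name|day_of_week|time_only|
+----------+----------+-----------+---------+
|02/28/1997|  February|     Friday| 10:30:00|
+----------+----------+-----------+---------+
only showing top 1 row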

We can also use withColumn to return a new DataFrame with the formatted column.

df.withColumn(
    "Date",
     date_format("date", "yyyy MM dd")
).show()
+-----+----+-------------+--------------------+----------+
|month|year|total_revenue|unique_products_sold|      Date|
+-----+----+-------------+--------------------+----------+
|  jan|2019|        86000|                  30|1997 02 28|
|  jan|2020|        71000|                  30|1997 02 28|
|  feb|2019|        99000|                  42|1997 02 28|
|  feb|2020|        99500|                  36|1997 02 28|
|  mar|2019|        92000|                  25|1997 02 28|
|  mar|2020|        91000|                  50|1997 02 28|
+-----+----+-------------+--------------------+----------+
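Note that because we passed the existing column name, withColumn replaced the original Date column. If you want to keep the raw timestamp alongside the formatted string, a minimal tweak is to write to a new column name instead (date_formatted here is just an illustrative name):

df.withColumn(
    "date_formatted",
    date_format("date", "yyyy MM dd")
).show()

This returns all of the original columns plus the new date_formatted column.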

We can also use SQL to convert our column. When running a SQL query, we have access to the same date_format function.

df.createOrReplaceTempView("sales")

spark.sql("select date_format(date, 'yyyy MM dd') as Date from sales").show()
+----------+
|      Date|
+----------+
|1997 02 28|
|1997 02 28|
|1997 02 28|
|1997 02 28|
|1997 02 28|
|1997 02 28|
+----------+
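Finally, coming back to the ISO 8601 use case from the intro: literal characters like the T separator just need to be single-quoted inside the pattern. A sketch:

df.select(
    date_format("date", "yyyy-MM-dd'T'HH:mm:ss").alias("iso_8601")
).show(truncate=False)
+-------------------+
|iso_8601           |
+-------------------+
|1997-02-28T10:30:00|
|1997-02-28T10:30:00|
|1997-02-28T10:30:00|
|1997-02-28T10:30:00|
|1997-02-28T10:30:00|
|1997-02-28T10:30:00|
+-------------------+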