PySpark To_Date

10.29.2021

Intro

The PySpark to_date function allows us to convert timestamps and date strings to a date type. This is useful if you import CSV data that contains date strings but want to perform date operations on it. In this article, we will learn how to use PySpark to_date.

Setting Up

The quickest way to get started working with PySpark is to use the following Docker Compose file. Simply create a docker-compose.yml file, paste in the following code, and then run docker-compose up. You will then see a link in the console that you can open to access a Jupyter notebook.

version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/

Creating the DataSet

Let's start by creating a Spark Session.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Now, let's create a dataframe to work with.

rdd = spark.sparkContext.parallelize([    
    ("jan", "2019", 86000, 30, "1997-02-28 10:30:00"),
    ("jan", "2020", 71000, 30, "1997-02-28 10:30:00"),
    
    ("feb", "2019", 99000, 42, "1997-02-28 10:30:00"),
    ("feb", "2020", 99500, 36, "1997-02-28 10:30:00"),
    
    ("mar", "2019", 92000, 25, "1997-02-28 10:30:00"),
    ("mar", "2020", 91000, 50, "1997-02-28 10:30:00")
])
df = spark.createDataFrame(rdd, schema = ["month", "year", "total_revenue", "unique_products_sold", "Date"])
df.show()
+-----+----+-------------+--------------------+-------------------+
|month|year|total_revenue|unique_products_sold|               Date|
+-----+----+-------------+--------------------+-------------------+
|  jan|2019|        86000|                  30|1997-02-28 10:30:00|
|  jan|2020|        71000|                  30|1997-02-28 10:30:00|
|  feb|2019|        99000|                  42|1997-02-28 10:30:00|
|  feb|2020|        99500|                  36|1997-02-28 10:30:00|
|  mar|2019|        92000|                  25|1997-02-28 10:30:00|
|  mar|2020|        91000|                  50|1997-02-28 10:30:00|
+-----+----+-------------+--------------------+-------------------+
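
Note that the Date column currently holds plain strings, which is exactly the situation to_date is meant for. A quick way to confirm this (a minimal sketch using the DataFrame we just built) is to print the schema:

# Print the inferred schema; Date is reported as a string at this point
df.printSchema()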

Using To_Date

To use to_date, we pass in the name of the column we want to convert. As with most SQL functions, we can use it with either select or withColumn. Here is an example of converting with select. Below, we also call collect on the original DataFrame to show that its Date column is still a string, since select returns a new DataFrame rather than modifying the original.

from pyspark.sql.functions import to_date, col

df.select(
    to_date("date").alias('Month')
).show()

df.collect()
+----------+
|     Month|
+----------+
|1997-02-28|
|1997-02-28|
|1997-02-28|
|1997-02-28|
|1997-02-28|
|1997-02-28|
+----------+

[Row(month='jan', year='2019', total_revenue=86000, unique_products_sold=30, Date='1997-02-28 10:30:00'),
 Row(month='jan', year='2020', total_revenue=71000, unique_products_sold=30, Date='1997-02-28 10:30:00'),
 Row(month='feb', year='2019', total_revenue=99000, unique_products_sold=42, Date='1997-02-28 10:30:00'),
 Row(month='feb', year='2020', total_revenue=99500, unique_products_sold=36, Date='1997-02-28 10:30:00'),
 Row(month='mar', year='2019', total_revenue=92000, unique_products_sold=25, Date='1997-02-28 10:30:00'),
 Row(month='mar', year='2020', total_revenue=91000, unique_products_sold=50, Date='1997-02-28 10:30:00')]
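
to_date also accepts an optional format string, which is useful when the input does not follow the default yyyy-MM-dd layout. Here is a small sketch against our DataFrame; the pattern below simply spells out the timestamp format we used when building the data:

from pyspark.sql.functions import to_date

# Parse the Date strings using an explicit pattern instead of the default
df.select(
    to_date("date", "yyyy-MM-dd HH:mm:ss").alias("date_type")
).show()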

We can also use withColumn to return a new DataFrame with the converted column.

df.withColumn(
    "Month",
     to_date("date")
).show()
+----------+----+-------------+--------------------+-------------------+
|     Month|year|total_revenue|unique_products_sold|               Date|
+----------+----+-------------+--------------------+-------------------+
|1997-02-28|2019|        86000|                  30|1997-02-28 10:30:00|
|1997-02-28|2020|        71000|                  30|1997-02-28 10:30:00|
|1997-02-28|2019|        99000|                  42|1997-02-28 10:30:00|
|1997-02-28|2020|        99500|                  36|1997-02-28 10:30:00|
|1997-02-28|2019|        92000|                  25|1997-02-28 10:30:00|
|1997-02-28|2020|        91000|                  50|1997-02-28 10:30:00|
+----------+----+-------------+--------------------+-------------------+
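
To confirm that the new column is a real date type rather than a string, we can inspect the schema of the DataFrame returned by withColumn. A minimal sketch:

# The Month column should now be reported as a date, while Date stays a string
df.withColumn("Month", to_date("date")).printSchema()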

We can also use SQL to convert our column. After registering the DataFrame as a temporary view, we have access to the to_date function in our SQL queries.

df.createOrReplaceTempView("sales")

spark.sql("select to_date(date) as date_type from sales").show()
+----------+
| date_type|
+----------+
|1997-02-28|
|1997-02-28|
|1997-02-28|
|1997-02-28|
|1997-02-28|
|1997-02-28|
+----------+

We can use the collect method to see the new data type.

spark.sql("select to_date(date) as date_type from sales").collect()
[Row(date_type=datetime.date(1997, 2, 28)),
 Row(date_type=datetime.date(1997, 2, 28)),
 Row(date_type=datetime.date(1997, 2, 28)),
 Row(date_type=datetime.date(1997, 2, 28)),
 Row(date_type=datetime.date(1997, 2, 28)),
 Row(date_type=datetime.date(1997, 2, 28))]
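
Since the collected values are datetime.date objects, we can work with them directly in Python. As a small sketch, assuming the sales view registered above:

# Grab the first converted row and pull pieces out of the date object
rows = spark.sql("select to_date(date) as date_type from sales").collect()
first = rows[0].date_type
print(first.year, first.month, first.day)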