The PySpark to_date function converts timestamp or date strings to a date type. This is useful if you import CSV data that contains date strings but want to perform date operations on it. In this article, we will learn how to use PySpark to_date.
The quickest way to get started working with PySpark is to use the following Docker Compose file. Simply create a docker-compose.yml, paste the following code, then run docker-compose up. You will then see a link in the console to open up and access a Jupyter notebook.
version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/
Let's start by creating a Spark Session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Now, let's create a dataframe to work with.
rdd = spark.sparkContext.parallelize([
("jan", "2019", 86000, 30, "1997-02-28 10:30:00"),
("jan", "2020", 71000, 30, "1997-02-28 10:30:00"),
("feb", "2019", 99000, 42, "1997-02-28 10:30:00"),
("feb", "2020", 99500, 36, "1997-02-28 10:30:00"),
("mar", "2019", 92000, 25, "1997-02-28 10:30:00"),
("mar", "2020", 91000, 50, "1997-02-28 10:30:00")
])
df = spark.createDataFrame(rdd, schema = ["month", "year", "total_revenue", "unique_products_sold", "Date"])
df.show()
+-----+----+-------------+--------------------+-------------------+
|month|year|total_revenue|unique_products_sold| Date|
+-----+----+-------------+--------------------+-------------------+
| jan|2019| 86000| 30|1997-02-28 10:30:00|
| jan|2020| 71000| 30|1997-02-28 10:30:00|
| feb|2019| 99000| 42|1997-02-28 10:30:00|
| feb|2020| 99500| 36|1997-02-28 10:30:00|
| mar|2019| 92000| 25|1997-02-28 10:30:00|
| mar|2020| 91000| 50|1997-02-28 10:30:00|
+-----+----+-------------+--------------------+-------------------+
To use to_date, we pass in the name of the column we want to convert. As with most SQL functions, we can use either select or withColumn. Here is an example using select. Below we also call the collect method; note in its output that the Date column of the original df is still a string, because select returns a new DataFrame rather than modifying the original.
from pyspark.sql.functions import to_date, col
df.select(
to_date("date").alias('Month')
).show()
df.collect()
+----------+
| Month|
+----------+
|1997-02-28|
|1997-02-28|
|1997-02-28|
|1997-02-28|
|1997-02-28|
|1997-02-28|
+----------+
[Row(month='jan', year='2019', total_revenue=86000, unique_products_sold=30, Date='1997-02-28 10:30:00'),
Row(month='jan', year='2020', total_revenue=71000, unique_products_sold=30, Date='1997-02-28 10:30:00'),
Row(month='feb', year='2019', total_revenue=99000, unique_products_sold=42, Date='1997-02-28 10:30:00'),
Row(month='feb', year='2020', total_revenue=99500, unique_products_sold=36, Date='1997-02-28 10:30:00'),
Row(month='mar', year='2019', total_revenue=92000, unique_products_sold=25, Date='1997-02-28 10:30:00'),
Row(month='mar', year='2020', total_revenue=91000, unique_products_sold=50, Date='1997-02-28 10:30:00')]
We can also use withColumn to return a new DataFrame with the converted column. Note that because Spark resolves column names case-insensitively by default, passing "Month" here replaces the existing month column rather than adding a new one, as the output below shows.
df.withColumn(
"Month",
to_date("date")
).show()
+----------+----+-------------+--------------------+-------------------+
| Month|year|total_revenue|unique_products_sold| Date|
+----------+----+-------------+--------------------+-------------------+
|1997-02-28|2019| 86000| 30|1997-02-28 10:30:00|
|1997-02-28|2020| 71000| 30|1997-02-28 10:30:00|
|1997-02-28|2019| 99000| 42|1997-02-28 10:30:00|
|1997-02-28|2020| 99500| 36|1997-02-28 10:30:00|
|1997-02-28|2019| 92000| 25|1997-02-28 10:30:00|
|1997-02-28|2020| 91000| 50|1997-02-28 10:30:00|
+----------+----+-------------+--------------------+-------------------+
We can also use SQL to convert our column. When running a SQL query, we have access to the to_date function.
df.createOrReplaceTempView("sales")
spark.sql("select to_date(date) as date_type from sales").show()
+----------+
| date_type|
+----------+
|1997-02-28|
|1997-02-28|
|1997-02-28|
|1997-02-28|
|1997-02-28|
|1997-02-28|
+----------+
We can use the collect method to see the new data type.
spark.sql("select to_date(date) as date_type from sales").collect()
[Row(date_type=datetime.date(1997, 2, 28)),
Row(date_type=datetime.date(1997, 2, 28)),
Row(date_type=datetime.date(1997, 2, 28)),
Row(date_type=datetime.date(1997, 2, 28)),
Row(date_type=datetime.date(1997, 2, 28)),
Row(date_type=datetime.date(1997, 2, 28))]