The PySpark date_format function allows us to convert date columns into string columns using a specified output format. A common task is converting timestamps to the ISO 8601 standard for export to other systems. In this article, we will learn how to use the date_format function.
The quickest way to get started working with PySpark is to use the following Docker Compose file. Simply create a docker-compose.yml, paste in the following code, then run docker-compose up. You will then see a link in the console to open and access a Jupyter notebook.
version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/
Let's start by creating a Spark Session.
from pyspark.sql import SparkSession

# Create a new Spark session, or reuse one if it already exists.
spark = SparkSession.builder.getOrCreate()
Now, let's create a DataFrame to work with.
# Each row carries a timestamp stored as a string in the "Date" column.
rdd = spark.sparkContext.parallelize([
    ("jan", "2019", 86000, 30, "1997-02-28 10:30:00"),
    ("jan", "2020", 71000, 30, "1997-02-28 10:30:00"),
    ("feb", "2019", 99000, 42, "1997-02-28 10:30:00"),
    ("feb", "2020", 99500, 36, "1997-02-28 10:30:00"),
    ("mar", "2019", 92000, 25, "1997-02-28 10:30:00"),
    ("mar", "2020", 91000, 50, "1997-02-28 10:30:00")
])
df = spark.createDataFrame(rdd, schema=["month", "year", "total_revenue", "unique_products_sold", "Date"])
df.show()
+-----+----+-------------+--------------------+-------------------+
|month|year|total_revenue|unique_products_sold| Date|
+-----+----+-------------+--------------------+-------------------+
| jan|2019| 86000| 30|1997-02-28 10:30:00|
| jan|2020| 71000| 30|1997-02-28 10:30:00|
| feb|2019| 99000| 42|1997-02-28 10:30:00|
| feb|2020| 99500| 36|1997-02-28 10:30:00|
| mar|2019| 92000| 25|1997-02-28 10:30:00|
| mar|2020| 91000| 50|1997-02-28 10:30:00|
+-----+----+-------------+--------------------+-------------------+
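One thing worth noting: the Date column in this DataFrame is a plain string. date_format will happily parse strings in the default yyyy-MM-dd HH:mm:ss layout, but if you want an explicit TimestampType you can cast first. A minimal sketch using to_timestamp (not required for the examples below):
from pyspark.sql.functions import to_timestamp

# Parse the string column into a proper TimestampType column.
df.select(to_timestamp("Date").alias("Date")).printSchema()
This prints a schema with Date typed as timestamp rather than string.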
To use date_format, we pass in the column we want to convert along with the desired output pattern. As with most SQL functions, we can use it with either select or withColumn. Here is an example using select to perform the conversion.
from pyspark.sql.functions import date_format

df.select(
    date_format("Date", "yyyy MM dd").alias("Date")
).show()
+----------+
| Date|
+----------+
|1997 02 28|
|1997 02 28|
|1997 02 28|
|1997 02 28|
|1997 02 28|
|1997 02 28|
+----------+
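The second argument accepts any of Spark's Java-style datetime patterns, so the same call can pull out month names, weekdays, or times. A quick sketch with a few other patterns (the aliases are just illustrative):
df.select(
    date_format("Date", "MMM yyyy").alias("month_year"),  # abbreviated month and year
    date_format("Date", "E").alias("day_of_week"),        # abbreviated day of week
    date_format("Date", "HH:mm").alias("time")            # 24-hour time
).show()
For the timestamps above, every row comes back as Feb 1997, Fri, and 10:30.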
We can also use withColumn to return a new DataFrame with the formatted column.
df.withColumn(
    "Date",
    date_format("Date", "yyyy MM dd")
).show()
+-----+----+-------------+--------------------+----------+
|month|year|total_revenue|unique_products_sold| Date|
+-----+----+-------------+--------------------+----------+
| jan|2019| 86000| 30|1997 02 28|
| jan|2020| 71000| 30|1997 02 28|
| feb|2019| 99000| 42|1997 02 28|
| feb|2020| 99500| 36|1997 02 28|
| mar|2019| 92000| 25|1997 02 28|
| mar|2020| 91000| 50|1997 02 28|
+-----+----+-------------+--------------------+----------+
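Because we passed the existing name Date, withColumn replaced the column in place. To keep the original timestamp alongside the formatted value, use a new name instead (date_str here is just an example):
df.withColumn(
    "date_str",
    date_format("Date", "yyyy MM dd")
).show()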
We can also use SQL to convert the column. After registering the DataFrame as a temporary view, the date_format function is available in our queries.
df.createOrReplaceTempView("sales")
spark.sql("select date_format(date, 'yyyy MM dd') as Date from sales").show()
+----------+
| Date|
+----------+
|1997 02 28|
|1997 02 28|
|1997 02 28|
|1997 02 28|
|1997 02 28|
|1997 02 28|
+----------+
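Finally, here is the ISO 8601 conversion mentioned at the start. The literal T separator must be wrapped in single quotes so Spark does not treat it as a pattern letter:
df.select(
    date_format("Date", "yyyy-MM-dd'T'HH:mm:ss").alias("iso_date")
).show(truncate=False)
Each row comes back as 1997-02-28T10:30:00, ready for systems that expect ISO 8601 timestamps.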