The PySpark split function allows us to split a column containing a string on a delimiter. For example, if we have a column that holds a date string, we can split that string into an Array Column. In this article, we will learn how to use PySpark's split.
The quickest way to get started working with PySpark is to use the following Docker Compose file. Simply create a docker-compose.yml, paste the following code, then run docker-compose up. You will then see a link in the console to open up and access a Jupyter notebook.
version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/
Let's start by creating a Spark Session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Now, let's create a dataframe to work with.
rdd = spark.sparkContext.parallelize([
    ("jan 2019", 86000, 30),
    ("jan 2020", 71000, 30),
    ("feb 2019", 99000, 42),
    ("feb 2020", 99500, 36),
    ("mar 2019", 92000, 25),
    ("mar 2020", 91000, 50)
])
df = spark.createDataFrame(rdd, schema=["month year", "total_revenue", "unique_products_sold"])
df.show()
+----------+-------------+--------------------+
|month year|total_revenue|unique_products_sold|
+----------+-------------+--------------------+
| jan 2019| 86000| 30|
| jan 2020| 71000| 30|
| feb 2019| 99000| 42|
| feb 2020| 99500| 36|
| mar 2019| 92000| 25|
| mar 2020| 91000| 50|
+----------+-------------+--------------------+
To use split, we pass the column and a separator. We can combine this with the select method to break up our month year column:
from pyspark.sql.functions import split, col
df.select(
split(col("month year"), " ").alias('MonthYear')
).show()
+-----------+
| MonthYear|
+-----------+
|[jan, 2019]|
|[jan, 2020]|
|[feb, 2019]|
|[feb, 2020]|
|[mar, 2019]|
|[mar, 2020]|
+-----------+
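The result of split is an ArrayType column, which we can confirm by printing the schema. A quick sanity check (the exact schema output can vary slightly between Spark versions):

df.select(
    split(col("month year"), " ").alias("MonthYear")
).printSchema()
# MonthYear: array (element: string)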
We can also use withColumn to return a new DataFrame with the split column.
df.withColumn(
"MonthYear",
split(col("month year"), " ")
).drop('month year').show()
+-------------+--------------------+-----------+
|total_revenue|unique_products_sold| MonthYear|
+-------------+--------------------+-----------+
| 86000| 30|[jan, 2019]|
| 71000| 30|[jan, 2020]|
| 99000| 42|[feb, 2019]|
| 99500| 36|[feb, 2020]|
| 92000| 25|[mar, 2019]|
| 91000| 50|[mar, 2020]|
+-------------+--------------------+-----------+
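Once the column is split, the individual elements can be pulled into columns of their own with getItem. Here is a minimal sketch; the month and year column names are my own choice, not part of the original dataset:

# split returns a Column expression we can index into with getItem
parts = split(col("month year"), " ")
df.withColumn("month", parts.getItem(0)) \
    .withColumn("year", parts.getItem(1)) \
    .drop("month year") \
    .show()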
Another option is to use PySpark's SQL API. We first create a temporary view, then use the SPLIT function in our SQL select. Note that since the column name contains a space, it has to be quoted with backticks.
df.createOrReplaceTempView("Sales")
spark.sql("select SPLIT(`month year`, ' ') as MonthYear from Sales") \
    .show()
+-----------+
| MonthYear|
+-----------+
|[jan, 2019]|
|[jan, 2020]|
|[feb, 2019]|
|[feb, 2020]|
|[mar, 2019]|
|[mar, 2020]|
+-----------+
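One more thing worth knowing: split treats the separator as a regular expression, and newer Spark versions (3.0+) accept an optional limit argument that caps how many elements the resulting array can hold. A small sketch under those assumptions:

# The separator is a regex, so we can split on one-or-more whitespace characters
df.select(split(col("month year"), r"\s+").alias("MonthYear")).show()

# Spark 3.0+: a limit of 1 applies no splits, leaving [jan 2019] intact
df.select(split(col("month year"), " ", 1).alias("MonthYear")).show()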