The PySpark substring
method allows us to extract a substring from a column in a DataFrame. In this article, we will learn how to use substring in PySpark.
The quickest way to get started working with python is to use the following docker compose file. Simple create a docker-compose.yml
, paste the following code, then run docker-compose up
. You will then see a link in the console to open up and access a jupyter notebook.
version: '3'
services:
spark:
image: jupyter/pyspark-notebook
ports:
- "8888:8888"
- "4040-4080:4040-4080"
volumes:
- ./notebooks:/home/jovyan/work/notebooks/
Let's start by creating a Spark Session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Now, let's create a dataframe to work with.
rdd = spark.sparkContext.parallelize([
("jan", "2019", 86000, 30),
("jan", "2020", 71000, 30),
("feb", "2019", 99000, 42),
("feb", "2020", 99500, 36),
("mar", "2019", 92000, 25),
("mar", "2020", 91000, 50)
])
df = spark.createDataFrame(rdd, schema = ["month", "year", "total_revenue", "unique_products_sold"])
df.show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 30|
| jan|2020| 71000| 30|
| feb|2019| 99000| 42|
| feb|2020| 99500| 36|
| mar|2019| 92000| 25|
| mar|2020| 91000| 50|
+-----+----+-------------+--------------------+
To use substring
we can pass in a string, a position to start, and the length of the string to abstract. Similar to other sql methods, we can combine this use with select
and withColumn
.
from pyspark.sql.functions import substring, col
df.select(
substring("month", 0, 2).alias('Month')
).show()
+-----+
|Month|
+-----+
| ja|
| ja|
| fe|
| fe|
| ma|
| ma|
+-----+
We can also use the withColumn
to return a new DataFrame with the split column.
df.withColumn(
"Month",
substring("month", 0, 2)
).show()
+-----+----+-------------+--------------------+
|Month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| ja|2019| 86000| 30|
| ja|2020| 71000| 30|
| fe|2019| 99000| 42|
| fe|2020| 99500| 36|
| ma|2019| 92000| 25|
| ma|2020| 91000| 50|
+-----+----+-------------+--------------------+
Another option we have is to use the selectExpr
method on our data frame. Here, we can pass the column to substring and some substring commands.
df.selectExpr('month', 'substring(month, 1,2) as Month').show()
+-----+-----+
|month|Month|
+-----+-----+
| jan| ja|
| jan| ja|
| feb| fe|
| feb| fe|
| mar| ma|
| mar| ma|
+-----+-----+