PySpark Substring

10.28.2021

Intro

The PySpark substring method allows us to extract a substring from a column in a DataFrame. In this article, we will learn how to use substring in PySpark.

Setting Up

The quickest way to get started working with python is to use the following docker compose file. Simple create a docker-compose.yml, paste the following code, then run docker-compose up. You will then see a link in the console to open up and access a jupyter notebook.

version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/

Creating the DataSet

Let's start by creating a Spark Session.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Now, let's create a dataframe to work with.

rdd = spark.sparkContext.parallelize([    
    ("jan", "2019", 86000, 30),
    ("jan", "2020", 71000, 30),
    
    ("feb", "2019", 99000, 42),
    ("feb", "2020", 99500, 36),
    
    ("mar", "2019", 92000, 25),
    ("mar", "2020", 91000, 50)
])
df = spark.createDataFrame(rdd, schema = ["month", "year", "total_revenue", "unique_products_sold"])
df.show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
|  jan|2019|        86000|                  30|
|  jan|2020|        71000|                  30|
|  feb|2019|        99000|                  42|
|  feb|2020|        99500|                  36|
|  mar|2019|        92000|                  25|
|  mar|2020|        91000|                  50|
+-----+----+-------------+--------------------+

Using Substring

To use substring we can pass in a string, a position to start, and the length of the string to abstract. Similar to other sql methods, we can combine this use with select and withColumn.

from pyspark.sql.functions import substring, col

df.select(
    substring("month", 0, 2).alias('Month')
).show()
+-----+
|Month|
+-----+
|   ja|
|   ja|
|   fe|
|   fe|
|   ma|
|   ma|
+-----+

We can also use the withColumn to return a new DataFrame with the split column.

df.withColumn(
    "Month",
     substring("month", 0, 2)
).show()
+-----+----+-------------+--------------------+
|Month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
|   ja|2019|        86000|                  30|
|   ja|2020|        71000|                  30|
|   fe|2019|        99000|                  42|
|   fe|2020|        99500|                  36|
|   ma|2019|        92000|                  25|
|   ma|2020|        91000|                  50|
+-----+----+-------------+--------------------+

Another option we have is to use the selectExpr method on our data frame. Here, we can pass the column to substring and some substring commands.

df.selectExpr('month', 'substring(month, 1,2) as Month').show()
+-----+-----+
|month|Month|
+-----+-----+
|  jan|   ja|
|  jan|   ja|
|  feb|   fe|
|  feb|   fe|
|  mar|   ma|
|  mar|   ma|
+-----+-----+