PySpark WithColumnRenamed

10.05.2021

Intro

The withColumnRenamed method lets us easily change column names in our PySpark dataframes. In this article, we will learn how to rename columns with PySpark withColumnRenamed.

Setting Up

The quickest way to get started working with PySpark is to use the following Docker Compose file. Simply create a docker-compose.yml, paste the following code, then run docker-compose up. You will then see a link in the console that opens a Jupyter notebook.

version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/

Using the PySpark withColumnRenamed

Let's start by creating a Spark Session.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Now, let's create a dataframe to work with.

rdd = spark.sparkContext.parallelize([
    ("jan", 2019, 86000, 56),
    ("jan", 2020, 71000, 30),
    ("jan", 2021, 90000, 24),

    ("feb", 2019, 99000, 40),
    ("feb", 2020, 83000, 36),
    ("feb", 2021, 69000, 53),

    ("mar", 2019, 80000, 25),
    ("mar", 2020, 91000, 50)
])
df = spark.createDataFrame(rdd, schema=["month", "year", "total_revenue", "unique_products_sold"])
df.show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
|  jan|2019|        86000|                  56|
|  jan|2020|        71000|                  30|
|  jan|2021|        90000|                  24|
|  feb|2019|        99000|                  40|
|  feb|2020|        83000|                  36|
|  feb|2021|        69000|                  53|
|  mar|2019|        80000|                  25|
|  mar|2020|        91000|                  50|
+-----+----+-------------+--------------------+

Renaming a Column

To rename a column, we call withColumnRenamed on our dataframe. We then pass the old column name and the new column name.

# First argument is the existing column name, second is the new name
df.withColumnRenamed("total_revenue", "TotalRevenue").show()
+-----+----+------------+--------------------+
|month|year|TotalRevenue|unique_products_sold|
+-----+----+------------+--------------------+
|  jan|2019|       86000|                  56|
|  jan|2020|       71000|                  30|
|  jan|2021|       90000|                  24|
|  feb|2019|       99000|                  40|
|  feb|2020|       83000|                  36|
|  feb|2021|       69000|                  53|
|  mar|2019|       80000|                  25|
|  mar|2020|       91000|                  50|
+-----+----+------------+--------------------+

We can also chain together multiple calls of withColumnRenamed to rename multiple columns.

df.withColumnRenamed("total_revenue", "TotalRevenue")\
    .withColumnRenamed("unique_products_sold", "UniqueProductsSold")\
    .show()
+-----+----+------------+------------------+
|month|year|TotalRevenue|UniqueProductsSold|
+-----+----+------------+------------------+
|  jan|2019|       86000|                56|
|  jan|2020|       71000|                30|
|  jan|2021|       90000|                24|
|  feb|2019|       99000|                40|
|  feb|2020|       83000|                36|
|  feb|2021|       69000|                53|
|  mar|2019|       80000|                25|
|  mar|2020|       91000|                50|
+-----+----+------------+------------------+