PySpark Fillna

10.20.2021

Intro

The PySpark fillna and fill methods allow you to replace empty or null values in your dataframes. This helps when you need to run your data through algorithms or plotting that does not allow for empty values. An alternative to this is using imputing, but sometimes filling in the data with constants is enough. In this article, we will learn how to use PySpark fill and fillna.

Setting Up

The quickest way to get started working with python is to use the following docker compose file. Simple create a docker-compose.yml, paste the following code, then run docker-compose up. You will then see a link in the console to open up and access a jupyter notebook.

version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/

Creating the Data

Let's start by creating a Spark Session.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Now, let's create a dataframe to work with.

rdd = spark.sparkContext.parallelize([    
    ("jan", 2019, 86000, None),
    ("jan", 2020, 71000, 30),
    ("jan", 2021, None, 24),
    
    ("feb", 2019, 99000, None),
    ("feb", 2020, None, 36),
    ("feb", 2021, 69000, 53),
    
    ("mar", 2019, None, 25),
    ("mar", 2020, 91000, 50)
])
df = spark.createDataFrame(rdd, schema = ["month", "year", "total_revenue", "unique_products_sold"])
df.show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
|  jan|2019|        86000|                null|
|  jan|2020|        71000|                  30|
|  jan|2021|         null|                  24|
|  feb|2019|        99000|                null|
|  feb|2020|         null|                  36|
|  feb|2021|        69000|                  53|
|  mar|2019|         null|                  25|
|  mar|2020|        91000|                  50|
+-----+----+-------------+--------------------+

Using Fill

To use fill, we call the method on the na property of our dataframe. Here, we fill in all values with 0.

df.na.fill(value=0).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
|  jan|2019|        86000|                   0|
|  jan|2020|        71000|                  30|
|  jan|2021|            0|                  24|
|  feb|2019|        99000|                   0|
|  feb|2020|            0|                  36|
|  feb|2021|        69000|                  53|
|  mar|2019|            0|                  25|
|  mar|2020|        91000|                  50|
+-----+----+-------------+--------------------+

If we want, we can fill only a subset of columns.

df.na.fill(value=0, subset=["total_revenue"]).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
|  jan|2019|        86000|                null|
|  jan|2020|        71000|                  30|
|  jan|2021|            0|                  24|
|  feb|2019|        99000|                null|
|  feb|2020|            0|                  36|
|  feb|2021|        69000|                  53|
|  mar|2019|            0|                  25|
|  mar|2020|        91000|                  50|
+-----+----+-------------+--------------------+

Note, that you can use other values for the fill.

df.na.fill(value=10000, subset=["total_revenue"]).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
|  jan|2019|        86000|                null|
|  jan|2020|        71000|                  30|
|  jan|2021|        10000|                  24|
|  feb|2019|        99000|                null|
|  feb|2020|        10000|                  36|
|  feb|2021|        69000|                  53|
|  mar|2019|        10000|                  25|
|  mar|2020|        91000|                  50|
+-----+----+-------------+--------------------+

Using FillNA

Using fillna works in a similar way, but you don't need to use the na property.

df.fillna(value=10000, subset=["total_revenue"]).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
|  jan|2019|        86000|                null|
|  jan|2020|        71000|                  30|
|  jan|2021|        10000|                  24|
|  feb|2019|        99000|                null|
|  feb|2020|        10000|                  36|
|  feb|2021|        69000|                  53|
|  mar|2019|        10000|                  25|
|  mar|2020|        91000|                  50|
+-----+----+-------------+--------------------+