PySpark Handle Null Data

09.22.2021

Intro

When working with data, you will often encounter null values, and knowing how to handle them is a routine part of the job. The decision to drop or to impute matters for both model building and reporting. In this article, we will learn how to work with null values in Spark with Python.

Setting Up

The quickest way to get a working PySpark environment is the following Docker Compose file. Simply create a docker-compose.yml, paste in the code below, and run docker-compose up. A link will then appear in the console that opens a Jupyter notebook.

version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/

Creating a PySpark Data Frame

We begin by importing a few libraries, creating a Spark session, and building a small data frame that contains some null values.

from datetime import datetime

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.getOrCreate()

# Build an RDD of rows, with None standing in for missing values.
rdd = spark.sparkContext.parallelize([
    (None, 'jan', datetime(2000, 1, 1, 12, 0)),
    (40000, 'feb', datetime(2000, 2, 1, 12, 0)),
    (50000, None, datetime(2000, 3, 1, 12, 0))
])
df = spark.createDataFrame(rdd, schema=["amount", "month", "date"])
df.show()
+------+-----+-------------------+
|amount|month|               date|
+------+-----+-------------------+
|  null|  jan|2000-01-01 12:00:00|
| 40000|  feb|2000-02-01 12:00:00|
| 50000| null|2000-03-01 12:00:00|
+------+-----+-------------------+
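
Before deciding whether to drop or impute, it helps to know how many nulls each column actually contains. Here is a minimal sketch (not part of the original walkthrough) using the built-in functions module:

from pyspark.sql import functions as F

# Count the nulls in each column to guide the drop-vs-impute decision.
df.select([
    F.count(F.when(F.col(c).isNull(), c)).alias(c)
    for c in df.columns
]).show()
+------+-----+----+
|amount|month|date|
+------+-----+----+
|     1|    1|   0|
+------+-----+----+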

Replacing Null Values

The first method for replacing null values is fillna. We pass it a value and, based on that value's type, every null in columns of the matching type is replaced.

For example, here we replace all null strings with '0'. Note that the numeric amount column is left untouched.

df.fillna('0').show()
+------+-----+-------------------+
|amount|month|               date|
+------+-----+-------------------+
|  null|  jan|2000-01-01 12:00:00|
| 40000|  feb|2000-02-01 12:00:00|
| 50000|    0|2000-03-01 12:00:00|
+------+-----+-------------------+
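
The same operation is also available through the na accessor as na.fill; the two calls are interchangeable:

# na.fill is an alias for fillna, so this matches the output above.
df.na.fill('0').show()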

Similarly, passing a number replaces all null numbers, and this time the string column is untouched.

df.fillna(0).show()
+------+-----+-------------------+
|amount|month|               date|
+------+-----+-------------------+
|     0|  jan|2000-01-01 12:00:00|
| 40000|  feb|2000-02-01 12:00:00|
| 50000| null|2000-03-01 12:00:00|
+------+-----+-------------------+

We can also tell Spark how to fill each column individually by passing fillna a dictionary. For example, below we replace all null months with 'mar' and all null amounts with 0.

df.fillna({'month': 'mar', 'amount': 0}).show()
+------+-----+-------------------+
|amount|month|               date|
+------+-----+-------------------+
|     0|  jan|2000-01-01 12:00:00|
| 40000|  feb|2000-02-01 12:00:00|
| 50000|  mar|2000-03-01 12:00:00|
+------+-----+-------------------+
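
fillna also accepts a subset parameter, which restricts a single replacement value to the columns you name. A small sketch:

# Fill nulls with 0, considering only the amount column.
df.fillna(0, subset=['amount']).show()

Here the result happens to match the plain fillna(0) call, since 0 can only apply to the numeric column anyway, but subset matters when several columns share a type.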

We can also drop every row containing a null value using na.drop.

df.na.drop().show()
+------+-----+-------------------+
|amount|month|               date|
+------+-----+-------------------+
| 40000|  feb|2000-02-01 12:00:00|
+------+-----+-------------------+
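
na.drop also takes a how argument. The default, 'any', removes a row if any column is null, while 'all' removes a row only when every column is null. A quick sketch:

# Only drop rows in which every column is null.
df.na.drop(how='all').show()

With our sample data no row is entirely null, so all three rows survive.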

If we only want certain columns considered when dropping, we can use the subset parameter. In the example below, only rows with a null amount are dropped.

df.na.drop(subset='amount').show()
+------+-----+-------------------+
|amount|month|               date|
+------+-----+-------------------+
| 40000|  feb|2000-02-01 12:00:00|
| 50000| null|2000-03-01 12:00:00|
+------+-----+-------------------+
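
A related option is thresh, which keeps a row only if it has at least that many non-null values. For example:

# Keep rows with at least two non-null values.
df.na.drop(thresh=2).show()

Every row in our sample has at least two non-null values, so nothing is dropped here.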

Another option is to replace specific values in our data set. In the example below we rename the month 'feb' to 'February'.

df.replace("feb", "February").show()
+------+--------+-------------------+
|amount|   month|               date|
+------+--------+-------------------+
|  null|     jan|2000-01-01 12:00:00|
| 40000|February|2000-02-01 12:00:00|
| 50000|    null|2000-03-01 12:00:00|
+------+--------+-------------------+
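
replace also accepts a dictionary, letting us make several substitutions in one pass. A quick sketch:

# Expand both month abbreviations at once.
df.replace({'jan': 'January', 'feb': 'February'}).show()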

Our last example uses drop with a column name. Be careful: this removes the entire column, not any rows. Below, the month column disappears altogether.

df.drop('month').show()
+------+-------------------+
|amount|               date|
+------+-------------------+
|  null|2000-01-01 12:00:00|
| 40000|2000-02-01 12:00:00|
| 50000|2000-03-01 12:00:00|
+------+-------------------+
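
If the goal were instead to remove the rows whose month is null, we could filter with isNotNull (or use the subset drop shown earlier):

# Keep only rows where month is present.
df.filter(df.month.isNotNull()).show()

This keeps the jan and feb rows and removes the row with the missing month.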