The PySpark fillna and fill methods allow you to replace empty or null values in your dataframes. This helps when you need to run your data through algorithms or plotting routines that cannot handle empty values. An alternative is imputation, but sometimes filling in the data with constants is enough (a short imputation sketch appears at the end of this article). In this article, we will learn how to use the PySpark fill and fillna methods.
The quickest way to get started working with Python is to use the following Docker Compose file. Simply create a docker-compose.yml, paste the following code, then run docker-compose up. You will then see a link in the console to open and access a Jupyter notebook.
version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/
Let's start by creating a Spark Session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Now, let's create a dataframe to work with.
rdd = spark.sparkContext.parallelize([
("jan", 2019, 86000, None),
("jan", 2020, 71000, 30),
("jan", 2021, None, 24),
("feb", 2019, 99000, None),
("feb", 2020, None, 36),
("feb", 2021, 69000, 53),
("mar", 2019, None, 25),
("mar", 2020, 91000, 50)
])
df = spark.createDataFrame(rdd, schema=["month", "year", "total_revenue", "unique_products_sold"])
df.show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| null|
| jan|2020| 71000| 30|
| jan|2021| null| 24|
| feb|2019| 99000| null|
| feb|2020| null| 36|
| feb|2021| 69000| 53|
| mar|2019| null| 25|
| mar|2020| 91000| 50|
+-----+----+-------------+--------------------+
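As an aside, the detour through an RDD is not required; spark.createDataFrame also accepts a plain Python list of tuples. A minimal equivalent sketch, using the same data and column names as above:
# Build the same dataframe directly from a list of tuples,
# letting Spark infer the column types.
data = [
    ("jan", 2019, 86000, None),
    ("jan", 2020, 71000, 30),
    ("jan", 2021, None, 24),
    ("feb", 2019, 99000, None),
    ("feb", 2020, None, 36),
    ("feb", 2021, 69000, 53),
    ("mar", 2019, None, 25),
    ("mar", 2020, 91000, 50),
]
df = spark.createDataFrame(data, schema=["month", "year", "total_revenue", "unique_products_sold"])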
To use fill, we call the method on the na property of our dataframe. Here, we fill every null value with 0.
df.na.fill(value=0).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 0|
| jan|2020| 71000| 30|
| jan|2021| 0| 24|
| feb|2019| 99000| 0|
| feb|2020| 0| 36|
| feb|2021| 69000| 53|
| mar|2019| 0| 25|
| mar|2020| 91000| 50|
+-----+----+-------------+--------------------+
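One thing to keep in mind is that fill only touches columns whose type matches the value you pass: an integer value like the 0 above only fills numeric columns, while a string value only fills string columns. For example, a call like the following is effectively a no-op on our dataframe, since the only string column, month, has no nulls:
# A string fill value is applied only to string columns;
# the numeric nulls are left as-is.
df.na.fill(value="n/a").show()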
If we want, we can fill only a subset of columns.
df.na.fill(value=0, subset=["total_revenue"]).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| null|
| jan|2020| 71000| 30|
| jan|2021| 0| 24|
| feb|2019| 99000| null|
| feb|2020| 0| 36|
| feb|2021| 69000| 53|
| mar|2019| 0| 25|
| mar|2020| 91000| 50|
+-----+----+-------------+--------------------+
Note that you can use other values for the fill.
df.na.fill(value=10000, subset=["total_revenue"]).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| null|
| jan|2020| 71000| 30|
| jan|2021| 10000| 24|
| feb|2019| 99000| null|
| feb|2020| 10000| 36|
| feb|2021| 69000| 53|
| mar|2019| 10000| 25|
| mar|2020| 91000| 50|
+-----+----+-------------+--------------------+
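If different columns need different fill values, both fill and fillna also accept a dictionary mapping column names to values, in which case subset is not needed. A minimal sketch:
# Fill each column with its own constant in a single call.
df.na.fill({"total_revenue": 10000, "unique_products_sold": 0}).show()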
Using fillna works in the same way, except that you call it directly on the dataframe instead of going through the na property.
df.fillna(value=10000, subset=["total_revenue"]).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| null|
| jan|2020| 71000| 30|
| jan|2021| 10000| 24|
| feb|2019| 99000| null|
| feb|2020| 10000| 36|
| feb|2021| 69000| 53|
| mar|2019| 10000| 25|
| mar|2020| 91000| 50|
+-----+----+-------------+--------------------+
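Finally, the introduction mentioned imputation as an alternative to filling with constants. Below is a minimal sketch using pyspark.ml.feature.Imputer, which replaces nulls with a per-column statistic such as the mean; the casts are included because some Spark versions require floating-point input columns.
from pyspark.ml.feature import Imputer

# Some Spark versions require floating-point input columns,
# so cast the numeric columns first.
numeric_cols = ["total_revenue", "unique_products_sold"]
df_cast = df
for c in numeric_cols:
    df_cast = df_cast.withColumn(c, df_cast[c].cast("double"))

# Replace nulls in each column with that column's mean
# (strategy can also be "median").
imputer = Imputer(
    inputCols=numeric_cols,
    outputCols=[c + "_imputed" for c in numeric_cols],
    strategy="mean",
)
imputer.fit(df_cast).transform(df_cast).show()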