During the data cleaning process, we often need to remove duplicate rows. PySpark provides the dropDuplicates and distinct methods, which let us remove duplicates from large amounts of data. In this article, we will learn how to drop duplicates with PySpark.
The quickest way to get started working with PySpark is to use the following Docker Compose file. Simply create a docker-compose.yml, paste the following code, then run docker-compose up. You will then see a link in the console that you can open to access a Jupyter notebook.
version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/
Let's start by creating a Spark Session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Now, let's create a DataFrame to work with. We will make sure it contains some duplicate records.
rdd = spark.sparkContext.parallelize([
    ("jan", 2019, 86000, 56),
    ("jan", 2020, 71000, 30),
    ("jan", 2020, 71000, 30),
    ("jan", 2021, 90000, 24),
    ("jan", 2021, 90000, 24),
    ("feb", 2019, 99000, 40),
    ("feb", 2020, 83000, 36),
    ("feb", 2021, 69000, 53),
    ("mar", 2019, 80000, 25),
    ("mar", 2019, 80000, 25),
    ("mar", 2020, 91000, 50)
])
df = spark.createDataFrame(rdd, schema=["month", "year", "total_revenue", "unique_products_sold"])
df.show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
| jan|2020| 71000| 30|
| jan|2020| 71000| 30|
| jan|2021| 90000| 24|
| jan|2021| 90000| 24|
| feb|2019| 99000| 40|
| feb|2020| 83000| 36|
| feb|2021| 69000| 53|
| mar|2019| 80000| 25|
| mar|2019| 80000| 25|
| mar|2020| 91000| 50|
+-----+----+-------------+--------------------+
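As a quick check, we can count the rows before removing any duplicates. For the data above this returns 11, since three of the rows are repeated.
df.count()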
The first way we can remove duplicates is with the distinct method, which uses all columns to determine which rows are distinct.
df.distinct().show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| feb|2020| 83000| 36|
| mar|2020| 91000| 50|
| jan|2020| 71000| 30|
| mar|2019| 80000| 25|
| jan|2021| 90000| 24|
| feb|2021| 69000| 53|
| jan|2019| 86000| 56|
| feb|2019| 99000| 40|
+-----+----+-------------+--------------------+
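If we only need the number of unique rows rather than the rows themselves, we can chain count onto distinct. For the data above, this returns 8.
df.distinct().count()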
We can also use the dropDuplicates method to do the same thing.
df.dropDuplicates().show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| feb|2020| 83000| 36|
| mar|2020| 91000| 50|
| jan|2020| 71000| 30|
| mar|2019| 80000| 25|
| jan|2021| 90000| 24|
| feb|2021| 69000| 53|
| jan|2019| 86000| 56|
| feb|2019| 99000| 40|
+-----+----+-------------+--------------------+
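As expected, calling dropDuplicates with no arguments produces the same result as distinct. PySpark also exposes drop_duplicates as an alias, so the following line is equivalent.
df.drop_duplicates().show()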
The dropDuplicates method also lets us provide a subset of columns to find distinct values on. This can improve performance, and it is useful when we only want to determine uniqueness based on specific columns.
df.dropDuplicates(["month", "year"]).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| feb|2020| 83000| 36|
| feb|2021| 69000| 53|
| jan|2021| 90000| 24|
| mar|2019| 80000| 25|
| feb|2019| 99000| 40|
| jan|2019| 86000| 56|
| mar|2020| 91000| 50|
| jan|2020| 71000| 30|
+-----+----+-------------+--------------------+
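Note that when we deduplicate on a subset of columns, PySpark does not guarantee which of the duplicate rows is kept. If we need control over that, one option is a window function. The following is a minimal sketch (not part of the original example) that keeps the row with the highest total_revenue for each month and year.
from pyspark.sql import Window
from pyspark.sql.functions import row_number, desc

# Rank rows within each (month, year) group by total_revenue, highest first
w = Window.partitionBy("month", "year").orderBy(desc("total_revenue"))

# Keep only the top-ranked row per group, then drop the helper column
df.withColumn("rn", row_number().over(w)).filter("rn = 1").drop("rn").show()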