During the data cleaning process, we often need to remove duplicate rows. PySpark provides the dropDuplicates and distinct methods, which let us remove duplicates from large amounts of data. In this article, we will learn how to drop duplicates with PySpark.
The quickest way to get started working with PySpark is to use the following Docker Compose file. Simply create a docker-compose.yml, paste the following code, then run docker-compose up. You will then see a link in the console that you can open to access a Jupyter notebook.
version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/
Let's start by creating a Spark Session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Now, let's create a DataFrame to work with. We will make sure it contains some duplicate records.
rdd = spark.sparkContext.parallelize([
    ("jan", 2019, 86000, 56),
    ("jan", 2020, 71000, 30),
    ("jan", 2020, 71000, 30),
    ("jan", 2021, 90000, 24),
    ("jan", 2021, 90000, 24),
    ("feb", 2019, 99000, 40),
    ("feb", 2020, 83000, 36),
    ("feb", 2021, 69000, 53),
    ("mar", 2019, 80000, 25),
    ("mar", 2019, 80000, 25),
    ("mar", 2020, 91000, 50)
])
df = spark.createDataFrame(rdd, schema=["month", "year", "total_revenue", "unique_products_sold"])
df.show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
| jan|2020| 71000| 30|
| jan|2020| 71000| 30|
| jan|2021| 90000| 24|
| jan|2021| 90000| 24|
| feb|2019| 99000| 40|
| feb|2020| 83000| 36|
| feb|2021| 69000| 53|
| mar|2019| 80000| 25|
| mar|2019| 80000| 25|
| mar|2020| 91000| 50|
+-----+----+-------------+--------------------+
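As a quick check, we can count the rows before removing any duplicates. For the data above this returns 11, since three of the rows are repeated.
df.count()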
The first way we can remove duplicates is with the distinct method, which uses all columns to determine which rows are distinct.
df.distinct().show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| feb|2020| 83000| 36|
| mar|2020| 91000| 50|
| jan|2020| 71000| 30|
| mar|2019| 80000| 25|
| jan|2021| 90000| 24|
| feb|2021| 69000| 53|
| jan|2019| 86000| 56|
| feb|2019| 99000| 40|
+-----+----+-------------+--------------------+
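If we only need the number of unique rows rather than the rows themselves, we can chain count onto distinct. For the data above, this returns 8.
df.distinct().count()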
We can also use the dropDuplicates method to do the same thing.
df.dropDuplicates().show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| feb|2020| 83000| 36|
| mar|2020| 91000| 50|
| jan|2020| 71000| 30|
| mar|2019| 80000| 25|
| jan|2021| 90000| 24|
| feb|2021| 69000| 53|
| jan|2019| 86000| 56|
| feb|2019| 99000| 40|
+-----+----+-------------+--------------------+
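As expected, calling dropDuplicates with no arguments produces the same result as distinct. PySpark also exposes drop_duplicates as an alias, so the following line is equivalent.
df.drop_duplicates().show()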
The dropDuplicates method also lets us provide a subset of columns to find distinct values on. This can improve performance, and it is useful when we only want to determine uniqueness based on specific columns.
df.dropDuplicates(["month", "year"]).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| feb|2020| 83000| 36|
| feb|2021| 69000| 53|
| jan|2021| 90000| 24|
| mar|2019| 80000| 25|
| feb|2019| 99000| 40|
| jan|2019| 86000| 56|
| mar|2020| 91000| 50|
| jan|2020| 71000| 30|
+-----+----+-------------+--------------------+
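Note that when we deduplicate on a subset of columns, PySpark does not guarantee which of the duplicate rows is kept. If we need control over that, one option is a window function. The following is a minimal sketch (not part of the original example) that keeps the row with the highest total_revenue for each month and year.
from pyspark.sql import Window
from pyspark.sql.functions import row_number, desc

# Rank rows within each (month, year) group by total_revenue, highest first
w = Window.partitionBy("month", "year").orderBy(desc("total_revenue"))

# Keep only the top-ranked row per group, then drop the helper column
df.withColumn("rn", row_number().over(w)).filter("rn = 1").drop("rn").show()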