PySpark GroupBy

10.09.2021

Intro

When working with data, we often want to group it to view distributions or compute aggregations. PySpark provides the groupBy method to group the rows of a dataframe. In this article, we will learn how to work with PySpark GroupBy.

Setting Up

The quickest way to get started working with PySpark is to use the following Docker Compose file. Simply create a docker-compose.yml, paste the following code, then run docker-compose up. You will then see a link in the console to open and access a Jupyter notebook.

version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/

Creating a Spark Session and DataFrame

Let's start by creating a Spark Session.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Now, let's create a dataframe to work with.

rdd = spark.sparkContext.parallelize([
    ("jan", 2019, 86000, 56),
    ("jan", 2020, 71000, 30),
    ("jan", 2021, 90000, 24),

    ("feb", 2019, 99000, 40),
    ("feb", 2020, 83000, 36),
    ("feb", 2021, 69000, 53),

    ("mar", 2019, 80000, 25),
    ("mar", 2020, 91000, 50)
])
df = spark.createDataFrame(rdd, schema=["month", "year", "total_revenue", "unique_products_sold"])
df.show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
|  jan|2019|        86000|                  56|
|  jan|2020|        71000|                  30|
|  jan|2021|        90000|                  24|
|  feb|2019|        99000|                  40|
|  feb|2020|        83000|                  36|
|  feb|2021|        69000|                  53|
|  mar|2019|        80000|                  25|
|  mar|2020|        91000|                  50|
+-----+----+-------------+--------------------+
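
If you want to confirm the column types Spark inferred from the Python values, printSchema is a quick check (the exact type names can vary by Spark version, but strings typically map to string and Python ints to bigint):

# Inspect the schema Spark inferred for our dataframe
df.printSchema()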

Using GroupBy

The basic usage is to call groupBy on a dataframe and pass the name of the column to group by. For example, below we group by month.

df.groupBy('month')
<pyspark.sql.group.GroupedData at 0x7f227a08a340>

As you can see, groupBy returns a GroupedData object. Once we have this object, we can run different aggregation methods on it. Below are several examples.

df.groupBy('month').count().show()
+-----+-----+
|month|count|
+-----+-----+
|  feb|    3|
|  mar|    2|
|  jan|    3|
+-----+-----+
df.groupBy('month').sum("total_revenue").show()
+-----+------------------+
|month|sum(total_revenue)|
+-----+------------------+
|  feb|            251000|
|  mar|            171000|
|  jan|            247000|
+-----+------------------+
df.groupBy('month').min("total_revenue").show()
+-----+------------------+
|month|min(total_revenue)|
+-----+------------------+
|  feb|             69000|
|  mar|             80000|
|  jan|             71000|
+-----+------------------+
df.groupBy('month').max("total_revenue").show()
+-----+------------------+
|month|max(total_revenue)|
+-----+------------------+
|  feb|             99000|
|  mar|             91000|
|  jan|             90000|
+-----+------------------+
df.groupBy('month').avg("total_revenue").show()
+-----+------------------+
|month|avg(total_revenue)|
+-----+------------------+
|  feb| 83666.66666666667|
|  mar|           85500.0|
|  jan| 82333.33333333333|
+-----+------------------+
df.groupBy('month').mean("total_revenue").show()
+-----+------------------+
|month|avg(total_revenue)|
+-----+------------------+
|  feb| 83666.66666666667|
|  mar|           85500.0|
|  jan| 82333.33333333333|
+-----+------------------+
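
If you need more than one aggregation per group, or you want to control the result column names, GroupedData also exposes an agg method that accepts functions from pyspark.sql.functions. Below is a minimal sketch of that approach on the same dataframe; the alias names revenue and avg_products_sold are just illustrative.

from pyspark.sql import functions as F

# Several aggregations in one pass, with readable column names
df.groupBy("month").agg(
    F.sum("total_revenue").alias("revenue"),
    F.avg("unique_products_sold").alias("avg_products_sold")
).show()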

Using GroupBy On Multiple Columns

All of the above can also be done on multiple columns. To do this, we pass multiple column names to the groupBy method.

df.groupBy('month', 'year').avg("total_revenue").show()
+-----+----+------------------+
|month|year|avg(total_revenue)|
+-----+----+------------------+
|  feb|2020|           83000.0|
|  feb|2021|           69000.0|
|  jan|2021|           90000.0|
|  mar|2019|           80000.0|
|  feb|2019|           99000.0|
|  jan|2019|           86000.0|
|  mar|2020|           91000.0|
|  jan|2020|           71000.0|
+-----+----+------------------+
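
Note that Spark does not guarantee the order of the grouped output; the rows above come back in whatever order the partitions produce them. If you want a predictable ordering, you can chain orderBy onto the result, as in the sketch below.

# Sort the grouped result by month and year for stable output
df.groupBy("month", "year").avg("total_revenue").orderBy("month", "year").show()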