When working with data, we often want to group it to view distributions or aggregations. PySpark provides the groupBy method for grouping dataframes. In this article, we will learn how to work with PySpark's groupBy.
The quickest way to get started is to use the following Docker Compose file. Simply create a docker-compose.yml, paste the code below, then run docker-compose up. You will then see a link in the console that you can open to access a Jupyter notebook.
version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/
Let's start by creating a Spark Session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Now, let's create a dataframe to work with.
rdd = spark.sparkContext.parallelize([
    ("jan", 2019, 86000, 56),
    ("jan", 2020, 71000, 30),
    ("jan", 2021, 90000, 24),
    ("feb", 2019, 99000, 40),
    ("feb", 2020, 83000, 36),
    ("feb", 2021, 69000, 53),
    ("mar", 2019, 80000, 25),
    ("mar", 2020, 91000, 50)
])
df = spark.createDataFrame(rdd, schema=["month", "year", "total_revenue", "unique_products_sold"])
df.show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
| jan|2020| 71000| 30|
| jan|2021| 90000| 24|
| feb|2019| 99000| 40|
| feb|2020| 83000| 36|
| feb|2021| 69000| 53|
| mar|2019| 80000| 25|
| mar|2020| 91000| 50|
+-----+----+-------------+--------------------+
The basic usage of groupBy is to call it on a dataframe and pass the name of the column to group on. For example, below we group by month.
df.groupBy('month')
<pyspark.sql.group.GroupedData at 0x7f227a08a340>
As you can see, PySpark returns a GroupedData object. Once we have this object, we can call various aggregation methods on it. Below are several examples.
df.groupBy('month').count().show()
+-----+-----+
|month|count|
+-----+-----+
| feb| 3|
| mar| 2|
| jan| 3|
+-----+-----+
df.groupBy('month').sum("total_revenue").show()
+-----+------------------+
|month|sum(total_revenue)|
+-----+------------------+
| feb| 251000|
| mar| 171000|
| jan| 247000|
+-----+------------------+
df.groupBy('month').min("total_revenue").show()
+-----+------------------+
|month|min(total_revenue)|
+-----+------------------+
| feb| 69000|
| mar| 80000|
| jan| 71000|
+-----+------------------+
df.groupBy('month').max("total_revenue").show()
+-----+------------------+
|month|max(total_revenue)|
+-----+------------------+
| feb| 99000|
| mar| 91000|
| jan| 90000|
+-----+------------------+
df.groupBy('month').avg("total_revenue").show()
+-----+------------------+
|month|avg(total_revenue)|
+-----+------------------+
| feb| 83666.66666666667|
| mar| 85500.0|
| jan| 82333.33333333333|
+-----+------------------+
df.groupBy('month').mean("total_revenue").show()
+-----+------------------+
|month|avg(total_revenue)|
+-----+------------------+
| feb| 83666.66666666667|
| mar| 85500.0|
| jan| 82333.33333333333|
+-----+------------------+
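Rather than calling each aggregation method separately, a GroupedData object also has an agg method that computes several aggregations in one pass and lets us rename the resulting columns. Here is a minimal sketch; the column aliases (revenue_sum, revenue_avg, products_max) are just illustrative names.
from pyspark.sql import functions as F

# Compute several aggregations per month in a single pass
df.groupBy('month').agg(
    F.sum("total_revenue").alias("revenue_sum"),
    F.avg("total_revenue").alias("revenue_avg"),
    F.max("unique_products_sold").alias("products_max")
).show()
The sums and averages per month should match the results shown above.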
All of the above can also be done with multiple grouping columns. To do this, we pass multiple column names to the groupBy method.
df.groupBy('month', 'year').avg("total_revenue").show()
+-----+----+------------------+
|month|year|avg(total_revenue)|
+-----+----+------------------+
| feb|2020| 83000.0|
| feb|2021| 69000.0|
| jan|2021| 90000.0|
| mar|2019| 80000.0|
| feb|2019| 99000.0|
| jan|2019| 86000.0|
| mar|2020| 91000.0|
| jan|2020| 71000.0|
+-----+----+------------------+
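The same agg approach works with multiple grouping columns, and we can order the result to make the output easier to read. As before, this is just a sketch; the alias avg_revenue is an illustrative name.
from pyspark.sql import functions as F

# Group by month and year, average the revenue, and sort for readability
(df.groupBy('month', 'year')
   .agg(F.avg("total_revenue").alias("avg_revenue"))
   .orderBy("year", "month")
   .show())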