Often when working with DataFrames we want to filter our data down to a subset. PySpark provides the filter function (and its alias where) for this. In this article, we will learn how to use PySpark's filter.
The quickest way to get started working with PySpark is to use the following Docker Compose file. Simply create a docker-compose.yml, paste the following code, then run docker-compose up. You will then see a link in the console to open up and access a Jupyter notebook.
version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/
Let's start by creating a Spark Session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Now, let's create a dataframe to work with.
rdd = spark.sparkContext.parallelize([
    ("jan", 2019, 86000, 56),
    ("jan", 2020, 71000, 30),
    ("jan", 2021, 90000, 24),
    ("feb", 2019, 99000, 40),
    ("feb", 2020, 83000, 36),
    ("feb", 2021, 69000, 53),
    ("mar", 2019, 80000, 25),
    ("mar", 2020, 91000, 50)
])
df = spark.createDataFrame(rdd, schema=["month", "year", "total_revenue", "unique_products_sold"])
df.show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
| jan|2020| 71000| 30|
| jan|2021| 90000| 24|
| feb|2019| 99000| 40|
| feb|2020| 83000| 36|
| feb|2021| 69000| 53|
| mar|2019| 80000| 25|
| mar|2020| 91000| 50|
+-----+----+-------------+--------------------+
To start, we can use the filter function and pass it a DataFrame column with a condition. If you use pandas, this should feel familiar. Here are two examples.
df.filter(df.total_revenue > 80000).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
| jan|2021| 90000| 24|
| feb|2019| 99000| 40|
| feb|2020| 83000| 36|
| mar|2020| 91000| 50|
+-----+----+-------------+--------------------+
df.filter(df.month != 'jan').show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| feb|2019| 99000| 40|
| feb|2020| 83000| 36|
| feb|2021| 69000| 53|
| mar|2019| 80000| 25|
| mar|2020| 91000| 50|
+-----+----+-------------+--------------------+
If we prefer to use SQL expression strings, we can pass those to the filter function instead of using the column notation above.
df.filter("total_revenue > 8000").show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
| jan|2020| 71000| 30|
| jan|2021| 90000| 24|
| feb|2019| 99000| 40|
| feb|2020| 83000| 36|
| feb|2021| 69000| 53|
| mar|2019| 80000| 25|
| mar|2020| 91000| 50|
+-----+----+-------------+--------------------+
df.filter("month != 'jan'").show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| feb|2019| 99000| 40|
| feb|2020| 83000| 36|
| feb|2021| 69000| 53|
| mar|2019| 80000| 25|
| mar|2020| 91000| 50|
+-----+----+-------------+--------------------+
df.filter("month <> 'jan'").show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| feb|2019| 99000| 40|
| feb|2020| 83000| 36|
| feb|2021| 69000| 53|
| mar|2019| 80000| 25|
| mar|2020| 91000| 50|
+-----+----+-------------+--------------------+
We can combine multiple conditions by using & for AND queries and | for OR queries. Note that each condition must be wrapped in its own parentheses.
df.filter((df.month == "jan") & (df.year == 2019)).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
+-----+----+-------------+--------------------+
df.filter((df.month == "jan") | (df.year == 2019)).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
| jan|2020| 71000| 30|
| jan|2021| 90000| 24|
| feb|2019| 99000| 40|
| mar|2019| 80000| 25|
+-----+----+-------------+--------------------+
We also have access to column methods such as isin to filter against a list of values.
years = [2019, 2020]
df.filter(df.year.isin(years)).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
| jan|2020| 71000| 30|
| feb|2019| 99000| 40|
| feb|2020| 83000| 36|
| mar|2019| 80000| 25|
| mar|2020| 91000| 50|
+-----+----+-------------+--------------------+
A few string helpers we can use are startswith, endswith, and contains.
df.filter(df.month.startswith("j")).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
| jan|2020| 71000| 30|
| jan|2021| 90000| 24|
+-----+----+-------------+--------------------+
df.filter(df.month.endswith("n")).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
| jan|2020| 71000| 30|
| jan|2021| 90000| 24|
+-----+----+-------------+--------------------+
df.filter(df.month.contains("ja")).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
| jan|2020| 71000| 30|
| jan|2021| 90000| 24|
+-----+----+-------------+--------------------+
We can also use the like function to do pattern matching. This works just like the SQL LIKE operator.
df.filter(df.month.like("%an%")).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
| jan|2020| 71000| 30|
| jan|2021| 90000| 24|
+-----+----+-------------+--------------------+