PySpark Sample

10.17.2021

Intro

The PySpark sample method lets us take a random sample from a large DataFrame. This is useful for analyzing datasets that are too large to review in full.

Setting Up

The quickest way to get started with PySpark is to use the following Docker Compose file. Simply create a docker-compose.yml, paste in the following code, then run docker-compose up. The console output will include a link you can open to access a Jupyter notebook.

version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/

Creating the Data

Let's start by creating a Spark Session.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Now, let's create a dataframe to work with.

rdd = spark.sparkContext.parallelize([    
    ("jan", 2019, 86000, 56),
    ("jan", 2020, 71000, 30),
    ("jan", 2021, 90000, 24),
    
    ("feb", 2019, 99000, 40),
    ("feb", 2020, 83000, 36),
    ("feb", 2021, 69000, 53),
    
    ("mar", 2019, 80000, 25),
    ("mar", 2020, 91000, 50)
])
df = spark.createDataFrame(rdd, schema = ["month", "year", "total_revenue", "unique_products_sold"])
df.show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
|  jan|2019|        86000|                  56|
|  jan|2020|        71000|                  30|
|  jan|2021|        90000|                  24|
|  feb|2019|        99000|                  40|
|  feb|2020|        83000|                  36|
|  feb|2021|        69000|                  53|
|  mar|2019|        80000|                  25|
|  mar|2020|        91000|                  50|
+-----+----+-------------+--------------------+

Using Sample

The basic usage of sample is to call the method with the fraction of the data to keep. For example, we can sample roughly 40% of our data as follows. Note that the fraction is a per-row probability rather than an exact proportion, so the returned sample will not always contain exactly 40% of the rows.

df.sample(0.4).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
|  jan|2019|        86000|                  56|
|  jan|2020|        71000|                  30|
|  feb|2021|        69000|                  53|
|  mar|2020|        91000|                  50|
+-----+----+-------------+--------------------+
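Under the hood, sample defaults to Bernoulli sampling: each row is kept independently with probability equal to the fraction, which is why the sample size only hovers around the requested percentage. A minimal pure-Python sketch of that idea (using the standard random module, not Spark's internal sampler):

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(1000))
sampled = bernoulli_sample(rows, 0.4, seed=10)
# The sample size hovers around 400 but is rarely exactly 400.
print(len(sampled))
```

This is why two calls to sample without a seed return different rows, and why even the size of the result varies from run to run.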

We can also pass a seed parameter so that the sample can be reproduced.

Notice that if we rerun the first example we get a different result, but once we fix a seed we get the same rows every time.

df.sample(0.4).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
|  feb|2019|        99000|                  40|
|  feb|2021|        69000|                  53|
+-----+----+-------------+--------------------+
df.sample(fraction = 0.2, seed = 10).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
|  jan|2019|        86000|                  56|
|  jan|2021|        90000|                  24|
|  feb|2019|        99000|                  40|
+-----+----+-------------+--------------------+
df.sample(fraction = 0.2, seed = 10).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
|  jan|2019|        86000|                  56|
|  jan|2021|        90000|                  24|
|  feb|2019|        99000|                  40|
+-----+----+-------------+--------------------+

If we want, we can also use the withReplacement parameter to sample with replacement, meaning the same row can be selected more than once.

df.sample(withReplacement = True, fraction = 0.5, seed = 10).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
|  jan|2019|        86000|                  56|
+-----+----+-------------+--------------------+
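When sampling with replacement, Spark draws a count for each row from a Poisson distribution whose mean is the fraction, and emits the row that many times. A rough pure-Python sketch of that idea (Knuth's Poisson sampler; an illustration of the concept, not Spark's actual implementation):

```python
import math
import random

def poisson(rng, lam):
    # Knuth's method: count how many uniform draws it takes
    # for their running product to fall below exp(-lam).
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_with_replacement(rows, fraction, seed=None):
    """Each row appears count ~ Poisson(fraction) times in the sample."""
    rng = random.Random(seed)
    return [row for row in rows for _ in range(poisson(rng, fraction))]

data = list(range(10))
# Some rows may appear several times, others not at all.
print(sample_with_replacement(data, 1.5, seed=10))
```

Because counts are drawn per row, a fraction greater than 1.0 is valid with replacement: on average each row then appears more than once.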

Stratified with PySpark SampleBy

One final example uses a slightly different method. We can use sampleBy to take a stratified sample: we pass a column name along with a sampling fraction for each group in that column.

df.sampleBy("month", {
    "jan": 0.2,
    "feb": 0.3,
}).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
|  jan|2019|        86000|                  56|
|  feb|2021|        69000|                  53|
+-----+----+-------------+--------------------+
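Note that any group not listed in the fractions dictionary is sampled at a fraction of zero, which is why no mar rows appear above. The per-group logic can be sketched in plain Python (again an illustration of the idea, not Spark's implementation):

```python
import random

def sample_by(rows, key_fn, fractions, seed=None):
    """Keep each row with the probability assigned to its group;
    groups missing from `fractions` default to 0 (always dropped)."""
    rng = random.Random(seed)
    return [row for row in rows
            if rng.random() < fractions.get(key_fn(row), 0.0)]

data = [("jan", 2019), ("feb", 2019), ("mar", 2019)] * 100
sampled = sample_by(data, lambda r: r[0], {"jan": 0.2, "feb": 0.3}, seed=10)
# "mar" never survives, since it has no entry in the fractions dict.
print(sorted({month for month, _ in sampled}))
```

To include every group, list each distinct value of the column in the fractions dictionary with the rate you want for it.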