The PySpark sample
method lets us draw small random samples from large datasets. This makes it possible to analyze datasets that are too large to review in full.
The quickest way to get started with PySpark is to use the following Docker Compose file. Simply create a docker-compose.yml
, paste in the following code, then run docker-compose up
. The console will then print a link you can open to access a Jupyter notebook.
version: '3'
services:
spark:
image: jupyter/pyspark-notebook
ports:
- "8888:8888"
- "4040-4080:4040-4080"
volumes:
- ./notebooks:/home/jovyan/work/notebooks/
Let's start by creating a SparkSession.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Now, let's create a DataFrame to work with.
rdd = spark.sparkContext.parallelize([
("jan", 2019, 86000, 56),
("jan", 2020, 71000, 30),
("jan", 2021, 90000, 24),
("feb", 2019, 99000, 40),
("feb", 2020, 83000, 36),
("feb", 2021, 69000, 53),
("mar", 2019, 80000, 25),
("mar", 2020, 91000, 50)
])
df = spark.createDataFrame(rdd, schema=["month", "year", "total_revenue", "unique_products_sold"])
df.show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
| jan|2020| 71000| 30|
| jan|2021| 90000| 24|
| feb|2019| 99000| 40|
| feb|2020| 83000| 36|
| feb|2021| 69000| 53|
| mar|2019| 80000| 25|
| mar|2020| 91000| 50|
+-----+----+-------------+--------------------+
The basic usage of sample is to call the sample
method and pass in the fraction of rows to keep. For example, we can sample roughly 40% of our data as follows.
df.sample(0.4).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
| jan|2020| 71000| 30|
| feb|2021| 69000| 53|
| mar|2020| 91000| 50|
+-----+----+-------------+--------------------+
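Note that sample does not return exactly 40% of the rows: each row is kept independently with probability equal to the fraction (a Bernoulli trial), so the returned row count varies from run to run. A minimal pure-Python sketch of the idea (no Spark required; this mirrors the behaviour, not Spark's actual implementation):

```python
import random

# The same rows as our DataFrame above.
rows = [
    ("jan", 2019, 86000, 56), ("jan", 2020, 71000, 30),
    ("jan", 2021, 90000, 24), ("feb", 2019, 99000, 40),
    ("feb", 2020, 83000, 36), ("feb", 2021, 69000, 53),
    ("mar", 2019, 80000, 25), ("mar", 2020, 91000, 50),
]

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

# The sample size is only approximately fraction * len(rows):
# repeated runs produce several different sizes.
sizes = {len(bernoulli_sample(rows, 0.4)) for _ in range(200)}
print(sorted(sizes))
```

This is why the example above returned 4 rows while a later run of the same call may return 2.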
We can also use the seed parameter to make our sample reproducible.
Notice that if we rerun the first example we get a different result, but once we set a seed we get the same output every time.
df.sample(0.4).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| feb|2019| 99000| 40|
| feb|2021| 69000| 53|
+-----+----+-------------+--------------------+
df.sample(fraction=0.2, seed=10).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
| jan|2021| 90000| 24|
| feb|2019| 99000| 40|
+-----+----+-------------+--------------------+
df.sample(fraction=0.2, seed=10).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
| jan|2021| 90000| 24|
| feb|2019| 99000| 40|
+-----+----+-------------+--------------------+
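The reproducibility guarantee can be illustrated in plain Python: seeding the random generator makes the per-row draws deterministic, so the same seed and fraction always select the same rows (a sketch of the idea, not Spark's actual sampler):

```python
import random

rows = ["jan-2019", "jan-2020", "jan-2021", "feb-2019",
        "feb-2020", "feb-2021", "mar-2019", "mar-2020"]

def seeded_sample(rows, fraction, seed):
    rng = random.Random(seed)  # fixed seed => deterministic draws
    return [row for row in rows if rng.random() < fraction]

first = seeded_sample(rows, 0.2, seed=10)
second = seeded_sample(rows, 0.2, seed=10)
print(first == second)  # True: same seed, same sample
```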
If we want, we can also use the withReplacement
parameter to sample with replacement, which allows the same row to be selected more than once.
df.sample(withReplacement=True, fraction=0.5, seed=10).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
+-----+----+-------------+--------------------+
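Sampling with replacement means each row can be drawn multiple times. One way to sketch the concept in plain Python is with random.choices, which draws with replacement (this is an illustration only; Spark's internal algorithm differs):

```python
import random

rows = [("jan", 2019), ("jan", 2020), ("jan", 2021), ("feb", 2019),
        ("feb", 2020), ("feb", 2021), ("mar", 2019), ("mar", 2020)]

def sample_with_replacement(rows, fraction, seed=None):
    """Draw round(fraction * len(rows)) rows, allowing repeats."""
    rng = random.Random(seed)
    k = round(fraction * len(rows))
    return rng.choices(rows, k=k)  # choices samples WITH replacement

picked = sample_with_replacement(rows, 0.5, seed=10)
print(len(picked))  # 4 rows, possibly with duplicates
```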
One final example uses a slightly different method. We can use sampleBy
to stratify our sample by a column. In this example, we set a separate fraction for each month; any group not listed in the dictionary defaults to a fraction of 0.
df.sampleBy("month", {
"jan": 0.2,
"feb": 0.3,
}).show()
+-----+----+-------------+--------------------+
|month|year|total_revenue|unique_products_sold|
+-----+----+-------------+--------------------+
| jan|2019| 86000| 56|
| feb|2021| 69000| 53|
+-----+----+-------------+--------------------+
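The stratified behaviour can be sketched in plain Python: each row is kept with the probability assigned to its group, and groups missing from the dictionary get probability 0 (a hedged illustration of the idea, not Spark's exact sampler):

```python
import random

rows = [("jan", 2019), ("jan", 2020), ("jan", 2021), ("feb", 2019),
        ("feb", 2020), ("feb", 2021), ("mar", 2019), ("mar", 2020)]

def sample_by(rows, key_index, fractions, seed=None):
    rng = random.Random(seed)
    # Groups absent from `fractions` get probability 0, as with sampleBy.
    return [row for row in rows
            if rng.random() < fractions.get(row[key_index], 0.0)]

picked = sample_by(rows, 0, {"jan": 0.2, "feb": 0.3}, seed=10)
print(picked)  # only "jan" and "feb" rows can appear
```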