The PySpark Binarizer converts a continuous variable into a discrete 0 or 1. This is helpful when you want to turn a numeric column into a flag that records whether each value exceeds a threshold. In this article, we will learn how to use the PySpark Binarizer.
The quickest way to get started working with PySpark is to use the following Docker Compose file. Simply create a docker-compose.yml file, paste in the following configuration, then run docker-compose up. You will then see a link in the console that opens a Jupyter notebook.
version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/
Let's start by creating a Spark Session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Next, let's import some data to use.
import pandas as pd
df = pd.read_csv('../data/abalone.csv')
df.head()
| | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.150 | 15 |
| 1 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 | 7 |
| 2 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 | 9 |
| 3 | M | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.155 | 10 |
| 4 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 | 7 |
Now, let's convert our pandas DataFrame to a Spark DataFrame.
sdf = spark.createDataFrame(df)
sdf.show()
+---+------+--------+------+------------+--------------+--------------+------------+-----+
|Sex|Length|Diameter|Height|Whole weight|Shucked weight|Viscera weight|Shell weight|Rings|
+---+------+--------+------+------------+--------------+--------------+------------+-----+
| M| 0.455| 0.365| 0.095| 0.514| 0.2245| 0.101| 0.15| 15|
| M| 0.35| 0.265| 0.09| 0.2255| 0.0995| 0.0485| 0.07| 7|
| F| 0.53| 0.42| 0.135| 0.677| 0.2565| 0.1415| 0.21| 9|
| M| 0.44| 0.365| 0.125| 0.516| 0.2155| 0.114| 0.155| 10|
| I| 0.33| 0.255| 0.08| 0.205| 0.0895| 0.0395| 0.055| 7|
| I| 0.425| 0.3| 0.095| 0.3515| 0.141| 0.0775| 0.12| 8|
| F| 0.53| 0.415| 0.15| 0.7775| 0.237| 0.1415| 0.33| 20|
| F| 0.545| 0.425| 0.125| 0.768| 0.294| 0.1495| 0.26| 16|
| M| 0.475| 0.37| 0.125| 0.5095| 0.2165| 0.1125| 0.165| 9|
| F| 0.55| 0.44| 0.15| 0.8945| 0.3145| 0.151| 0.32| 19|
| F| 0.525| 0.38| 0.14| 0.6065| 0.194| 0.1475| 0.21| 14|
| M| 0.43| 0.35| 0.11| 0.406| 0.1675| 0.081| 0.135| 10|
| M| 0.49| 0.38| 0.135| 0.5415| 0.2175| 0.095| 0.19| 11|
| F| 0.535| 0.405| 0.145| 0.6845| 0.2725| 0.171| 0.205| 10|
| F| 0.47| 0.355| 0.1| 0.4755| 0.1675| 0.0805| 0.185| 10|
| M| 0.5| 0.4| 0.13| 0.6645| 0.258| 0.133| 0.24| 12|
| I| 0.355| 0.28| 0.085| 0.2905| 0.095| 0.0395| 0.115| 7|
| F| 0.44| 0.34| 0.1| 0.451| 0.188| 0.087| 0.13| 10|
| M| 0.365| 0.295| 0.08| 0.2555| 0.097| 0.043| 0.1| 7|
| M| 0.45| 0.32| 0.1| 0.381| 0.1705| 0.075| 0.115| 9|
+---+------+--------+------+------------+--------------+--------------+------------+-----+
only showing top 20 rows
Now, to use the Binarizer, we construct it with three parameters. First we specify a threshold: in this example, any value strictly greater than 0.4 becomes 1.0, and any value equal to or below 0.4 becomes 0.0. Then we specify an input column and an output column for the new feature.
from pyspark.ml.feature import Binarizer
binarizer = Binarizer(threshold = .4, inputCol = "Length", outputCol = "LengthBinarized")
binarizer
Binarizer_b46f6ef9df36
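To make the threshold rule concrete, here is a minimal plain-Python sketch of the comparison applied to each value. The `binarize` helper is a hypothetical name used only for illustration; it is not part of PySpark and is not how Spark implements the transformer internally.

```python
def binarize(value: float, threshold: float = 0.0) -> float:
    """Return 1.0 when value is strictly greater than threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0


# Illustrative lengths: one above, one below, and one exactly at the threshold.
lengths = [0.455, 0.35, 0.4]
flags = [binarize(v, threshold=0.4) for v in lengths]
print(flags)  # [1.0, 0.0, 0.0]
```

Note that a value exactly equal to the threshold maps to 0.0, because the comparison is strict.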
Finally, we can use the transform method to add the new feature. It returns a new DataFrame, since PySpark does not alter the existing one.
newdf = binarizer.transform(sdf)
newdf.show()
+---+------+--------+------+------------+--------------+--------------+------------+-----+---------------+
|Sex|Length|Diameter|Height|Whole weight|Shucked weight|Viscera weight|Shell weight|Rings|LengthBinarized|
+---+------+--------+------+------------+--------------+--------------+------------+-----+---------------+
| M| 0.455| 0.365| 0.095| 0.514| 0.2245| 0.101| 0.15| 15| 1.0|
| M| 0.35| 0.265| 0.09| 0.2255| 0.0995| 0.0485| 0.07| 7| 0.0|
| F| 0.53| 0.42| 0.135| 0.677| 0.2565| 0.1415| 0.21| 9| 1.0|
| M| 0.44| 0.365| 0.125| 0.516| 0.2155| 0.114| 0.155| 10| 1.0|
| I| 0.33| 0.255| 0.08| 0.205| 0.0895| 0.0395| 0.055| 7| 0.0|
| I| 0.425| 0.3| 0.095| 0.3515| 0.141| 0.0775| 0.12| 8| 1.0|
| F| 0.53| 0.415| 0.15| 0.7775| 0.237| 0.1415| 0.33| 20| 1.0|
| F| 0.545| 0.425| 0.125| 0.768| 0.294| 0.1495| 0.26| 16| 1.0|
| M| 0.475| 0.37| 0.125| 0.5095| 0.2165| 0.1125| 0.165| 9| 1.0|
| F| 0.55| 0.44| 0.15| 0.8945| 0.3145| 0.151| 0.32| 19| 1.0|
| F| 0.525| 0.38| 0.14| 0.6065| 0.194| 0.1475| 0.21| 14| 1.0|
| M| 0.43| 0.35| 0.11| 0.406| 0.1675| 0.081| 0.135| 10| 1.0|
| M| 0.49| 0.38| 0.135| 0.5415| 0.2175| 0.095| 0.19| 11| 1.0|
| F| 0.535| 0.405| 0.145| 0.6845| 0.2725| 0.171| 0.205| 10| 1.0|
| F| 0.47| 0.355| 0.1| 0.4755| 0.1675| 0.0805| 0.185| 10| 1.0|
| M| 0.5| 0.4| 0.13| 0.6645| 0.258| 0.133| 0.24| 12| 1.0|
| I| 0.355| 0.28| 0.085| 0.2905| 0.095| 0.0395| 0.115| 7| 0.0|
| F| 0.44| 0.34| 0.1| 0.451| 0.188| 0.087| 0.13| 10| 1.0|
| M| 0.365| 0.295| 0.08| 0.2555| 0.097| 0.043| 0.1| 7| 0.0|
| M| 0.45| 0.32| 0.1| 0.381| 0.1705| 0.075| 0.115| 9| 1.0|
+---+------+--------+------+------------+--------------+--------------+------------+-----+---------------+
only showing top 20 rows
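As a final example, we can also binarize multiple columns at once by passing inputCols and outputCols (available in Spark 3.0 and later) in place of inputCol and outputCol. The same threshold is applied to every input column. Let's update our input and output columns.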
inputCols = ["Length", "Diameter"]
outputCols = ["LengthBinarized", "DiameterBinarized"]
binarizer = Binarizer(threshold = .4, inputCols = inputCols, outputCols = outputCols)
newdf = binarizer.transform(sdf)
newdf.show()
+---+------+--------+------+------------+--------------+--------------+------------+-----+---------------+-----------------+
|Sex|Length|Diameter|Height|Whole weight|Shucked weight|Viscera weight|Shell weight|Rings|LengthBinarized|DiameterBinarized|
+---+------+--------+------+------------+--------------+--------------+------------+-----+---------------+-----------------+
| M| 0.455| 0.365| 0.095| 0.514| 0.2245| 0.101| 0.15| 15| 1.0| 0.0|
| M| 0.35| 0.265| 0.09| 0.2255| 0.0995| 0.0485| 0.07| 7| 0.0| 0.0|
| F| 0.53| 0.42| 0.135| 0.677| 0.2565| 0.1415| 0.21| 9| 1.0| 1.0|
| M| 0.44| 0.365| 0.125| 0.516| 0.2155| 0.114| 0.155| 10| 1.0| 0.0|
| I| 0.33| 0.255| 0.08| 0.205| 0.0895| 0.0395| 0.055| 7| 0.0| 0.0|
| I| 0.425| 0.3| 0.095| 0.3515| 0.141| 0.0775| 0.12| 8| 1.0| 0.0|
| F| 0.53| 0.415| 0.15| 0.7775| 0.237| 0.1415| 0.33| 20| 1.0| 1.0|
| F| 0.545| 0.425| 0.125| 0.768| 0.294| 0.1495| 0.26| 16| 1.0| 1.0|
| M| 0.475| 0.37| 0.125| 0.5095| 0.2165| 0.1125| 0.165| 9| 1.0| 0.0|
| F| 0.55| 0.44| 0.15| 0.8945| 0.3145| 0.151| 0.32| 19| 1.0| 1.0|
| F| 0.525| 0.38| 0.14| 0.6065| 0.194| 0.1475| 0.21| 14| 1.0| 0.0|
| M| 0.43| 0.35| 0.11| 0.406| 0.1675| 0.081| 0.135| 10| 1.0| 0.0|
| M| 0.49| 0.38| 0.135| 0.5415| 0.2175| 0.095| 0.19| 11| 1.0| 0.0|
| F| 0.535| 0.405| 0.145| 0.6845| 0.2725| 0.171| 0.205| 10| 1.0| 1.0|
| F| 0.47| 0.355| 0.1| 0.4755| 0.1675| 0.0805| 0.185| 10| 1.0| 0.0|
| M| 0.5| 0.4| 0.13| 0.6645| 0.258| 0.133| 0.24| 12| 1.0| 0.0|
| I| 0.355| 0.28| 0.085| 0.2905| 0.095| 0.0395| 0.115| 7| 0.0| 0.0|
| F| 0.44| 0.34| 0.1| 0.451| 0.188| 0.087| 0.13| 10| 1.0| 0.0|
| M| 0.365| 0.295| 0.08| 0.2555| 0.097| 0.043| 0.1| 7| 0.0| 0.0|
| M| 0.45| 0.32| 0.1| 0.381| 0.1705| 0.075| 0.115| 9| 1.0| 0.0|
+---+------+--------+------+------------+--------------+--------------+------------+-----+---------------+-----------------+
only showing top 20 rows