PySpark Binarizer

11.03.2021

Intro

The PySpark Binarizer converts a continuous numeric column into a binary 0 or 1 feature based on a threshold. This is helpful when you only care whether a value exceeds a cutoff, not how large it is. In this article, we will learn how to use the PySpark Binarizer.

Setting Up

The quickest way to get started working with PySpark is to use the following Docker Compose file. Simply create a docker-compose.yml, paste the following code, then run docker-compose up. You will then see a link in the console that opens a Jupyter notebook.

version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/

Using the Binarizer

Let's start by creating a Spark Session.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Next, let's load some data to use. Here we read the abalone dataset into a pandas DataFrame.

import pandas as pd

df = pd.read_csv('../data/abalone.csv')
df.head()
   Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight  Rings
0    M   0.455     0.365   0.095        0.5140          0.2245          0.1010         0.150     15
1    M   0.350     0.265   0.090        0.2255          0.0995          0.0485         0.070      7
2    F   0.530     0.420   0.135        0.6770          0.2565          0.1415         0.210      9
3    M   0.440     0.365   0.125        0.5160          0.2155          0.1140         0.155     10
4    I   0.330     0.255   0.080        0.2050          0.0895          0.0395         0.055      7

Now, let's convert our pandas DataFrame to a Spark DataFrame.

sdf = spark.createDataFrame(df)
sdf.show()
+---+------+--------+------+------------+--------------+--------------+------------+-----+
|Sex|Length|Diameter|Height|Whole weight|Shucked weight|Viscera weight|Shell weight|Rings|
+---+------+--------+------+------------+--------------+--------------+------------+-----+
|  M| 0.455|   0.365| 0.095|       0.514|        0.2245|         0.101|        0.15|   15|
|  M|  0.35|   0.265|  0.09|      0.2255|        0.0995|        0.0485|        0.07|    7|
|  F|  0.53|    0.42| 0.135|       0.677|        0.2565|        0.1415|        0.21|    9|
|  M|  0.44|   0.365| 0.125|       0.516|        0.2155|         0.114|       0.155|   10|
|  I|  0.33|   0.255|  0.08|       0.205|        0.0895|        0.0395|       0.055|    7|
|  I| 0.425|     0.3| 0.095|      0.3515|         0.141|        0.0775|        0.12|    8|
|  F|  0.53|   0.415|  0.15|      0.7775|         0.237|        0.1415|        0.33|   20|
|  F| 0.545|   0.425| 0.125|       0.768|         0.294|        0.1495|        0.26|   16|
|  M| 0.475|    0.37| 0.125|      0.5095|        0.2165|        0.1125|       0.165|    9|
|  F|  0.55|    0.44|  0.15|      0.8945|        0.3145|         0.151|        0.32|   19|
|  F| 0.525|    0.38|  0.14|      0.6065|         0.194|        0.1475|        0.21|   14|
|  M|  0.43|    0.35|  0.11|       0.406|        0.1675|         0.081|       0.135|   10|
|  M|  0.49|    0.38| 0.135|      0.5415|        0.2175|         0.095|        0.19|   11|
|  F| 0.535|   0.405| 0.145|      0.6845|        0.2725|         0.171|       0.205|   10|
|  F|  0.47|   0.355|   0.1|      0.4755|        0.1675|        0.0805|       0.185|   10|
|  M|   0.5|     0.4|  0.13|      0.6645|         0.258|         0.133|        0.24|   12|
|  I| 0.355|    0.28| 0.085|      0.2905|         0.095|        0.0395|       0.115|    7|
|  F|  0.44|    0.34|   0.1|       0.451|         0.188|         0.087|        0.13|   10|
|  M| 0.365|   0.295|  0.08|      0.2555|         0.097|         0.043|         0.1|    7|
|  M|  0.45|    0.32|   0.1|       0.381|        0.1705|         0.075|       0.115|    9|
+---+------+--------+------+------------+--------------+--------------+------------+-----+
only showing top 20 rows

Now, to use the Binarizer, we construct it with three parameters. First we specify a threshold: in this example, any value greater than 0.4 becomes 1.0, and any value less than or equal to 0.4 becomes 0.0. Then we specify an input column and an output column for the new feature.

from pyspark.ml.feature import Binarizer

binarizer = Binarizer(threshold=0.4, inputCol="Length", outputCol="LengthBinarized")
binarizer
Binarizer_b46f6ef9df36
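
To make the threshold rule concrete, here is a minimal plain-Python sketch of what the Binarizer computes for each value. The function name binarize is hypothetical and is not part of PySpark; it only mirrors the rule that values strictly greater than the threshold become 1.0 and all others become 0.0.

```python
def binarize(value: float, threshold: float = 0.0) -> float:
    """Hypothetical helper mirroring the Binarizer rule for a single value."""
    # Strictly greater than the threshold maps to 1.0; equal or below maps to 0.0.
    return 1.0 if value > threshold else 0.0


# A few Length values from the abalone sample, with threshold 0.4.
lengths = [0.455, 0.35, 0.53, 0.44, 0.33]
binarized = [binarize(v, threshold=0.4) for v in lengths]
print(binarized)
```

Note that a value exactly equal to the threshold maps to 0.0, so binarize(0.4, threshold=0.4) returns 0.0.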

Finally, we call transform to add the new feature. This returns a new DataFrame; PySpark does not alter the existing one.

newdf = binarizer.transform(sdf)
newdf.show()
+---+------+--------+------+------------+--------------+--------------+------------+-----+---------------+
|Sex|Length|Diameter|Height|Whole weight|Shucked weight|Viscera weight|Shell weight|Rings|LengthBinarized|
+---+------+--------+------+------------+--------------+--------------+------------+-----+---------------+
|  M| 0.455|   0.365| 0.095|       0.514|        0.2245|         0.101|        0.15|   15|            1.0|
|  M|  0.35|   0.265|  0.09|      0.2255|        0.0995|        0.0485|        0.07|    7|            0.0|
|  F|  0.53|    0.42| 0.135|       0.677|        0.2565|        0.1415|        0.21|    9|            1.0|
|  M|  0.44|   0.365| 0.125|       0.516|        0.2155|         0.114|       0.155|   10|            1.0|
|  I|  0.33|   0.255|  0.08|       0.205|        0.0895|        0.0395|       0.055|    7|            0.0|
|  I| 0.425|     0.3| 0.095|      0.3515|         0.141|        0.0775|        0.12|    8|            1.0|
|  F|  0.53|   0.415|  0.15|      0.7775|         0.237|        0.1415|        0.33|   20|            1.0|
|  F| 0.545|   0.425| 0.125|       0.768|         0.294|        0.1495|        0.26|   16|            1.0|
|  M| 0.475|    0.37| 0.125|      0.5095|        0.2165|        0.1125|       0.165|    9|            1.0|
|  F|  0.55|    0.44|  0.15|      0.8945|        0.3145|         0.151|        0.32|   19|            1.0|
|  F| 0.525|    0.38|  0.14|      0.6065|         0.194|        0.1475|        0.21|   14|            1.0|
|  M|  0.43|    0.35|  0.11|       0.406|        0.1675|         0.081|       0.135|   10|            1.0|
|  M|  0.49|    0.38| 0.135|      0.5415|        0.2175|         0.095|        0.19|   11|            1.0|
|  F| 0.535|   0.405| 0.145|      0.6845|        0.2725|         0.171|       0.205|   10|            1.0|
|  F|  0.47|   0.355|   0.1|      0.4755|        0.1675|        0.0805|       0.185|   10|            1.0|
|  M|   0.5|     0.4|  0.13|      0.6645|         0.258|         0.133|        0.24|   12|            1.0|
|  I| 0.355|    0.28| 0.085|      0.2905|         0.095|        0.0395|       0.115|    7|            0.0|
|  F|  0.44|    0.34|   0.1|       0.451|         0.188|         0.087|        0.13|   10|            1.0|
|  M| 0.365|   0.295|  0.08|      0.2555|         0.097|         0.043|         0.1|    7|            0.0|
|  M|  0.45|    0.32|   0.1|       0.381|        0.1705|         0.075|       0.115|    9|            1.0|
+---+------+--------+------+------------+--------------+--------------+------------+-----+---------------+
only showing top 20 rows

As a final example, we can also binarize multiple columns at once by passing inputCols and outputCols (supported in Spark 3.0 and later). Let's update our input and output columns.

inputCols = ["Length", "Diameter"]
outputCols = ["LengthBinarized", "DiameterBinarized"]

binarizer = Binarizer(threshold=0.4, inputCols=inputCols, outputCols=outputCols)
newdf = binarizer.transform(sdf)
newdf.show()
+---+------+--------+------+------------+--------------+--------------+------------+-----+---------------+-----------------+
|Sex|Length|Diameter|Height|Whole weight|Shucked weight|Viscera weight|Shell weight|Rings|LengthBinarized|DiameterBinarized|
+---+------+--------+------+------------+--------------+--------------+------------+-----+---------------+-----------------+
|  M| 0.455|   0.365| 0.095|       0.514|        0.2245|         0.101|        0.15|   15|            1.0|              0.0|
|  M|  0.35|   0.265|  0.09|      0.2255|        0.0995|        0.0485|        0.07|    7|            0.0|              0.0|
|  F|  0.53|    0.42| 0.135|       0.677|        0.2565|        0.1415|        0.21|    9|            1.0|              1.0|
|  M|  0.44|   0.365| 0.125|       0.516|        0.2155|         0.114|       0.155|   10|            1.0|              0.0|
|  I|  0.33|   0.255|  0.08|       0.205|        0.0895|        0.0395|       0.055|    7|            0.0|              0.0|
|  I| 0.425|     0.3| 0.095|      0.3515|         0.141|        0.0775|        0.12|    8|            1.0|              0.0|
|  F|  0.53|   0.415|  0.15|      0.7775|         0.237|        0.1415|        0.33|   20|            1.0|              1.0|
|  F| 0.545|   0.425| 0.125|       0.768|         0.294|        0.1495|        0.26|   16|            1.0|              1.0|
|  M| 0.475|    0.37| 0.125|      0.5095|        0.2165|        0.1125|       0.165|    9|            1.0|              0.0|
|  F|  0.55|    0.44|  0.15|      0.8945|        0.3145|         0.151|        0.32|   19|            1.0|              1.0|
|  F| 0.525|    0.38|  0.14|      0.6065|         0.194|        0.1475|        0.21|   14|            1.0|              0.0|
|  M|  0.43|    0.35|  0.11|       0.406|        0.1675|         0.081|       0.135|   10|            1.0|              0.0|
|  M|  0.49|    0.38| 0.135|      0.5415|        0.2175|         0.095|        0.19|   11|            1.0|              0.0|
|  F| 0.535|   0.405| 0.145|      0.6845|        0.2725|         0.171|       0.205|   10|            1.0|              1.0|
|  F|  0.47|   0.355|   0.1|      0.4755|        0.1675|        0.0805|       0.185|   10|            1.0|              0.0|
|  M|   0.5|     0.4|  0.13|      0.6645|         0.258|         0.133|        0.24|   12|            1.0|              0.0|
|  I| 0.355|    0.28| 0.085|      0.2905|         0.095|        0.0395|       0.115|    7|            0.0|              0.0|
|  F|  0.44|    0.34|   0.1|       0.451|         0.188|         0.087|        0.13|   10|            1.0|              0.0|
|  M| 0.365|   0.295|  0.08|      0.2555|         0.097|         0.043|         0.1|    7|            0.0|              0.0|
|  M|  0.45|    0.32|   0.1|       0.381|        0.1705|         0.075|       0.115|    9|            1.0|              0.0|
+---+------+--------+------+------------+--------------+--------------+------------+-----+---------------+-----------------+
only showing top 20 rows
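
Both columns above share a single threshold. As a rough illustration of how a different cutoff per column would behave, the plain-Python sketch below applies one threshold per column to a few rows taken from the sample. The function binarize_columns and its arguments are hypothetical and not part of PySpark, and the 0.3 cutoff for Diameter is an assumption chosen only for illustration. Spark 3.0 and later expose a thresholds parameter on Binarizer for this purpose, but check the documentation for your version before relying on it.

```python
from typing import Dict, List


def binarize_columns(
    rows: List[Dict[str, float]],
    thresholds: Dict[str, float],
) -> List[Dict[str, float]]:
    """Hypothetical helper: add a '<column>Binarized' key for each thresholded column."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay unchanged
        for column, threshold in thresholds.items():
            # Strictly greater than the threshold gives 1.0, otherwise 0.0.
            new_row[column + "Binarized"] = 1.0 if row[column] > threshold else 0.0
        result.append(new_row)
    return result


# First three rows of the sample (Length and Diameter only).
rows = [
    {"Length": 0.455, "Diameter": 0.365},
    {"Length": 0.35, "Diameter": 0.265},
    {"Length": 0.53, "Diameter": 0.42},
]

# Assumed cutoffs for illustration: 0.4 for Length, 0.3 for Diameter.
for row in binarize_columns(rows, {"Length": 0.4, "Diameter": 0.3}):
    print(row)
```

With the lower Diameter cutoff, the first row's Diameter of 0.365 now maps to 1.0, whereas it maps to 0.0 under the shared 0.4 threshold.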