PySpark Vectors

11.01.2021

Intro

PySpark's machine learning library provides several tools for working with linear algebra at scale. Specifically, it gives us a few ways to build and work with vectors. In this article, we will learn how to use PySpark Vectors.

Setting Up

The quickest way to get started working with Python is to use the following Docker Compose file. Simply create a docker-compose.yml file, paste in the following code, then run docker-compose up. You will then see a link in the console to open and access a Jupyter notebook.

version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/

Working With Vectors

Let's start by creating a Spark Session.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

To create vectors, we will use the Vectors class from pyspark.ml.linalg, which provides factory methods for building them. Let's begin by importing this class and creating our first vector type, the dense vector.

We can create a dense vector using the dense method, passing in either individual numbers as arguments or a single list of numbers.

from pyspark.ml.linalg import Vectors

v1 = Vectors.dense(1, 2)
v2 = Vectors.dense([2, 3])
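
Once created, dense vectors behave much like NumPy arrays. As a quick sketch using the v1 defined above, we can index into them, take their length, and convert them back to NumPy arrays with the toArray method.

v1[0]
1.0
len(v1)
2
v1.toArray()
array([1., 2.])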

With the vectors created, we can use the normal Python math operators on them, and PySpark will perform the element-wise vector operations we would expect. For example, multiplying two dense vectors multiplies them element by element.

v1 * v2
DenseVector([2.0, 6.0])
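
Multiplication is not the only operator available. As a short illustration with the same v1 and v2, addition and subtraction are also element-wise, and the dot method computes the dot product.

v1 + v2
DenseVector([3.0, 5.0])
v2 - v1
DenseVector([1.0, 1.0])
v1.dot(v2)
8.0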

The other option we have is creating sparse vectors: vectors where most entries are zero, stored efficiently by keeping only the non-zero indices and their values. Here are a few examples, each building the same vector with a different calling convention.

Vectors.sparse(4, {1: 1.0, 3: 5.5}).toArray()
array([0. , 1. , 0. , 5.5])
Vectors.sparse(4, [(1, 1.0), (3, 5.5)]).toArray()
array([0. , 1. , 0. , 5.5])
Vectors.sparse(4, [1, 3], [1.0, 5.5]).toArray()
array([0. , 1. , 0. , 5.5])

And finally, we have the zeros function to create zero vectors.

Vectors.zeros(10).toArray()
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

Vector Operations

As mentioned above, we can conduct normal Python math operations on our vectors. We also have access to the norm and squared_distance methods on the Vectors factory class.

Vectors.norm(v1, p=10)
2.0001952267223593
Vectors.squared_distance(v1, v2)
2.0
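
To make the norm call concrete: norm computes the p-norm of a vector, so p=1 gives the sum of absolute values (the Manhattan norm) and p=2 gives the familiar Euclidean length. A quick sketch with v1 from above:

Vectors.norm(v1, p=1)
3.0
Vectors.norm(v1, p=2)
2.23606797749979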

And finally, we can use these vectors in other PySpark modules. For example, we can use the Correlation class from the stat module in the PySpark ML library to compute a correlation matrix over a column of vectors. We will cover these functions in more depth in a separate article.

from pyspark.ml.stat import Correlation

data = [
    (Vectors.dense(1, 2),),
    (Vectors.dense(3, 4),)
]
df = spark.createDataFrame(data, ["features"])

r1 = Correlation.corr(df, "features").head()
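
The corr call returns a DataFrame with a single matrix column, so taking head() and indexing into the resulting row gives us the correlation matrix itself. A brief sketch of reading the result (with only two perfectly correlated points, every entry is 1.0); note that corr also accepts a method argument such as "spearman" to change the correlation type.

r1[0].toArray()
array([[1., 1.],
       [1., 1.]])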