PySpark's machine learning library provides several tools for working with linear algebra. Specifically, we have a few ways to build and work with vectors at scale. In this article, we will learn how to use PySpark Vectors.
The quickest way to get started with PySpark is to use the following Docker Compose file. Simply create a docker-compose.yml, paste in the following code, then run docker-compose up. You will then see a link in the console to open and access a Jupyter notebook.
version: '3'
services:
  spark:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/
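After saving the file, the startup looks roughly like this (the access token in the URL will differ on your machine):
docker-compose up
# ...startup logs, then a line similar to:
#     http://127.0.0.1:8888/lab?token=<your-token>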
Let's start by creating a Spark Session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
To work with vectors, we will use the Vectors class, which provides factory methods for creating them. Let's begin by importing this class and creating our first vector: the dense vector.
As shown below, we can create a dense vector using the dense method, passing either individual numbers as arguments or a single list of numbers.
from pyspark.ml.linalg import Vectors
v1 = Vectors.dense(1, 2)
v2 = Vectors.dense([2, 3])
With the vectors created, we can use normal Python math operators on them, and PySpark will apply the operations element-wise.
v1 * v2
DenseVector([2.0, 6.0])
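The other arithmetic operators behave the same way; here is a quick sketch using v1 and v2 from above, with the expected results shown as comments.
v1 + v2     # DenseVector([3.0, 5.0]) -- element-wise addition
v2 - v1     # DenseVector([1.0, 1.0]) -- element-wise subtraction
v2 / v1     # DenseVector([2.0, 1.5]) -- element-wise division
v1.dot(v2)  # 8.0 -- a true dot product, via the dot method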
The other option we have is creating sparse vectors: vectors that are mostly zeros, where only the non-zero entries are stored. Here are a few examples, each producing the same vector.
Vectors.sparse(4, {1: 1.0, 3: 5.5}).toArray()
array([0. , 1. , 0. , 5.5])
Vectors.sparse(4, [(1, 1.0), (3, 5.5)]).toArray()
array([0. , 1. , 0. , 5.5])
Vectors.sparse(4, [1, 3], [1.0, 5.5]).toArray()
array([0. , 1. , 0. , 5.5])
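Under the hood, a sparse vector stores only its size plus the indices and values of the non-zero entries. We can inspect these attributes directly (a small sketch; outputs shown as comments):
sv = Vectors.sparse(4, [1, 3], [1.0, 5.5])
sv.size     # 4 -- the full length of the vector
sv.indices  # array([1, 3], dtype=int32) -- positions of the non-zero entries
sv.values   # array([1. , 5.5]) -- the non-zero entries themselves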
We also have the zeros function to create zero vectors.
Vectors.zeros(10).toArray()
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
As mentioned above, we can use normal Python math operations on our vectors. We also have access to the norm and squared_distance methods from the Vectors factory class.
Vectors.norm(v1, p = 10)
2.0001952267223593
Vectors.squared_distance(v1, v2)
2.0
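The p parameter selects which norm to compute. The common cases look like this (a sketch using v1 from above, with expected results as comments):
Vectors.norm(v1, p=1)             # 3.0 -- L1 (Manhattan) norm: |1| + |2|
Vectors.norm(v1, p=2)             # 2.23606797749979 -- L2 (Euclidean) norm: sqrt(1 + 4)
Vectors.norm(v1, p=float("inf"))  # 2.0 -- max norm: the largest absolute entry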
Finally, we can also use these vectors in other PySpark modules. For example, we can use the Correlation class from the stat module in the PySpark ML library. We will cover these functions in separate articles.
from pyspark.ml.stat import Correlation
data = [
    (Vectors.dense(1, 2),),
    (Vectors.dense(3, 4),),
]
df = spark.createDataFrame(data, ["features"])
r1 = Correlation.corr(df, "features").head()
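The first field of the head row holds the Pearson correlation matrix. With only two rows of data, the columns are perfectly correlated, so every entry is 1 (the output formatting sketched below may vary slightly by version):
print(r1[0])
# DenseMatrix([[1., 1.],
#              [1., 1.]])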