PySpark Row

10.01.2021

Intro

The PySpark Row class lives in the pyspark.sql module and provides a simple way to represent a row, or observation, in a DataFrame or an RDD. In this article, we will learn how to use PySpark Row.

Let's start by creating a SparkSession.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

The Row class extends a Python tuple, so we can use it in much the same way. There are several ways to create a row: we can pass positional arguments, as with a tuple; we can pass named arguments; or we can define a Row class and create new row objects from it.

from pyspark.sql import Row

## Positional Parameters
Row(1000, 'jan')
<Row(1000, 'jan')>
## Named Parameters
Row(amount=1000, month='jan')
Row(amount=1000, month='jan')
## Class Based

SalesRecord = Row("amount", "month")

sales1 = SalesRecord(1000, "jan")
sales2 = SalesRecord(2000, "feb")

print(sales1)
print(sales2)
Row(amount=1000, month='jan')
Row(amount=2000, month='feb')
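
Because the Row class extends tuple, its fields can be read by position, and rows built with field names can also be read by attribute or converted to a dictionary with asDict(). A quick sketch using sales1 from above:

print(sales1[0])        # positional access, like a tuple
print(sales1.amount)    # access by field name
print(sales1.asDict())  # convert the row to a plain Python dict
1000
1000
{'amount': 1000, 'month': 'jan'}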

Once we have some rows, we can pass them to spark.sparkContext.parallelize to create an RDD.

row1 = Row(1000, 'jan')
row2 = Row(2000, 'feb')

rdd = spark.sparkContext.parallelize([row1, row2])
rdd.collect()
[<Row(1000, 'jan')>, <Row(2000, 'feb')>]
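
As a side note, once a SparkSession is active, PySpark also attaches a toDF method to RDDs, which accepts a list of column names. A minimal sketch (the amount and month names here are our own choice, not part of the example above):

df = rdd.toDF(['amount', 'month'])
df.show()
+------+-----+
|amount|month|
+------+-----+
|  1000|  jan|
|  2000|  feb|
+------+-----+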

In a similar way, we can create a DataFrame using spark.createDataFrame.

row1 = Row(1000, 'jan')
row2 = Row(2000, 'feb')

df = spark.createDataFrame([row1, row2])

df.show()
+----+---+
|  _1| _2|
+----+---+
|1000|jan|
|2000|feb|
+----+---+
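
The _1 and _2 column names appear because these rows were built from positional arguments, so Spark has no field names to work with. If we build the same DataFrame from named rows instead, the field names should carry over into the schema; a minimal sketch:

row1 = Row(amount=1000, month='jan')
row2 = Row(amount=2000, month='feb')

df = spark.createDataFrame([row1, row2])

df.show()
+------+-----+
|amount|month|
+------+-----+
|  1000|  jan|
|  2000|  feb|
+------+-----+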