The PySpark Row class is located in the pyspark.sql
module and provides a simple way to create rows, or observations, in a DataFrame or an RDD. In this article, we will learn how to use PySpark Row.
Let's start by creating a Spark Session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
The Row class extends a Python tuple, so we can use it in much the same way. There are many ways to create a row: we can pass positional parameters as with a tuple, we can pass named parameters, or we can define a Row class and create new objects from it.
from pyspark.sql import Row
## Positional Parameters
Row(1000, 'jan')
<Row(1000, 'jan')>
## Named Parameters
Row(amount = 1000, month = 'jan')
Row(amount=1000, month='jan')
## Class Based
SalesRecord = Row("amount", "month")
sales1 = SalesRecord(1000, "jan")
sales2 = SalesRecord(2000, "feb")
print(sales1)
print(sales2)
Row(amount=1000, month='jan')
Row(amount=2000, month='feb')
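Since Row extends tuple, we can read values back by position, unpack a row like a tuple, or access named fields as attributes. For example, reusing the sales1 row defined above (a minimal sketch of the usual access patterns):
sales1[0]          # positional access, just like a tuple
1000
sales1.amount      # attribute access via the field name
1000
amount, month = sales1   # tuple unpacking also works
print(amount, month)
1000 jan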
Once we have some rows, we can pass them to spark.sparkContext.parallelize
to create an RDD.
row1 = Row(1000, 'jan')
row2 = Row(2000, 'feb')
rdd = spark.sparkContext.parallelize([row1, row2])
rdd.collect()
[<Row(1000, 'jan')>, <Row(2000, 'feb')>]
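Because each element of the RDD is a Row, the usual RDD transformations apply directly. As a small sketch (using map and sum, which are standard RDD methods rather than anything specific to Row), we could total the amounts by position:
rdd.map(lambda row: row[0]).sum()   # row[0] is the amount field
3000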
In a similar way, we can create a DataFrame using spark.createDataFrame.
row1 = Row(1000, 'jan')
row2 = Row(2000, 'feb')
df = spark.createDataFrame([row1, row2])
df.show()
+----+---+
| _1| _2|
+----+---+
|1000|jan|
|2000|feb|
+----+---+
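Because we passed positional rows, Spark falls back to the default column names _1 and _2. If we instead build the DataFrame from named rows (or from the SalesRecord class above), the column names are taken from the Row fields. A sketch of what this should look like:
df = spark.createDataFrame([Row(amount=1000, month='jan'), Row(amount=2000, month='feb')])
df.show()
+------+-----+
|amount|month|
+------+-----+
|  1000|  jan|
|  2000|  feb|
+------+-----+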