PySpark Show

Intro

The show method prints a preview of a DataFrame to the console and offers a few parameters for controlling the output. In this article, we will learn how to use the PySpark show method.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

from datetime import datetime
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(amount = 20000, month = 'jan', date = datetime(2000, 1, 1, 12, 0), desc = "a very very very long description"),
    Row(amount = 40000, month = 'feb', date = datetime(2000, 2, 1, 12, 0), desc = "a very very very long description"),
    Row(amount = 50000, month = 'mar', date = datetime(2000, 3, 1, 12, 0), desc = "a very very very long description")
])

df.show()

+------+-----+-------------------+--------------------+
|amount|month|               date|                desc|
+------+-----+-------------------+--------------------+
| 20000|  jan|2000-01-01 12:00:00|a very very very ...|
| 40000|  feb|2000-02-01 12:00:00|a very very very ...|
| 50000|  mar|2000-03-01 12:00:00|a very very very ...|
+------+-----+-------------------+--------------------+

We can specify the number of rows to display using the n named parameter. By default, show displays up to 20 rows.

df.show(n=2)

+------+-----+-------------------+--------------------+
|amount|month|               date|                desc|
+------+-----+-------------------+--------------------+
| 20000|  jan|2000-01-01 12:00:00|a very very very ...|
| 40000|  feb|2000-02-01 12:00:00|a very very very ...|
+------+-----+-------------------+--------------------+
only showing top 2 rows

Notice above that our description column has been cut off. By default, PySpark truncates strings longer than 20 characters. We can change that by passing False to the truncate named parameter.

df.show(truncate = False)

+------+-----+-------------------+---------------------------------+
|amount|month|date               |desc                             |
+------+-----+-------------------+---------------------------------+
|20000 |jan  |2000-01-01 12:00:00|a very very very long description|
|40000 |feb  |2000-02-01 12:00:00|a very very very long description|
|50000 |mar  |2000-03-01 12:00:00|a very very very long description|
+------+-----+-------------------+---------------------------------+

We can also specify the maximum length by passing a number to the truncate named parameter.

df.show(truncate = 20)

+------+-----+-------------------+--------------------+
|amount|month|               date|                desc|
+------+-----+-------------------+--------------------+
| 20000|  jan|2000-01-01 12:00:00|a very very very ...|
| 40000|  feb|2000-02-01 12:00:00|a very very very ...|
| 50000|  mar|2000-03-01 12:00:00|a very very very ...|
+------+-----+-------------------+--------------------+
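To make the truncation rule concrete, here is a minimal pure-Python sketch of how a single cell appears to be shortened. The helper truncate_cell is hypothetical and not part of PySpark; it is an approximation based on the output shown in this article, where the last three characters of a shortened value are replaced by dots.

```python
def truncate_cell(value: str, truncate: int = 20) -> str:
    """Approximate how show shortens one cell (hypothetical helper, not a PySpark API)."""
    # A limit of zero or less, or a value that already fits, is left untouched.
    if truncate <= 0 or len(value) <= truncate:
        return value
    # Assumption: very small limits leave no room for the "..." suffix.
    if truncate < 4:
        return value[:truncate]
    # Keep truncate - 3 characters and append "..." so the result has length truncate.
    return value[:truncate - 3] + "..."


desc = "a very very very long description"
print(truncate_cell(desc, 20))  # a very very very ...
print(truncate_cell(desc, 30))  # a very very very long descr...
```

Note that the limit counts the dots as well, so truncate = 20 leaves 17 characters of the original value visible.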

The final option is to display the DataFrame vertically, with one line per column value, using the vertical parameter.

df.show(vertical = True)

-RECORD 0----------------------
 amount | 20000                
 month  | jan                  
 date   | 2000-01-01 12:00:00  
 desc   | a very very very ... 
-RECORD 1----------------------
 amount | 40000                
 month  | feb                  
 date   | 2000-02-01 12:00:00  
 desc   | a very very very ... 
-RECORD 2----------------------
 amount | 50000                
 month  | mar                  
 date   | 2000-03-01 12:00:00  
 desc   | a very very very ...

You can also combine all of these parameters.

df.show(n = 2, truncate = 30, vertical = True)

-RECORD 0--------------------------------
 amount | 20000                          
 month  | jan                            
 date   | 2000-01-01 12:00:00            
 desc   | a very very very long descr... 
-RECORD 1--------------------------------
 amount | 40000                          
 month  | feb                            
 date   | 2000-02-01 12:00:00            
 desc   | a very very very long descr... 
only showing top 2 rows

09.28.2021
