Describing and Summarizing Data in a Pandas DataFrame

2021-01-13

Intro

In this article, we will look at common methods for summarizing and describing the data in a DataFrame. These methods are especially useful when you first load data, because they give you a quick idea of what is inside.

Describing a DataFrame

Let's start by creating a DataFrame.

import pandas as pd

df = pd.DataFrame([
	{
		"person": "James",
		"sales": 1000,
		"lastName": "Taylor",
	},
	{
		"person": "Clara",
		"sales": 3000,
		"lastName": "Brown"
	}
])

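The first few things we will look at are the shape, size, and ndim properties, along with the built-in len function. Each of these gives you an idea of how much data you have. shape returns a (rows, columns) tuple, size returns the total number of cells (rows times columns), ndim returns the number of dimensions (always 2 for a DataFrame), and len(df) returns the number of rows.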

print(df.shape)
print(df.size)
print(df.ndim)
print(len(df))
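
To make the relationship between these four values concrete, here is a small sketch that rebuilds the same two-row DataFrame and prints each one. The variable name example is only a placeholder for this sketch.

```python
import pandas as pd

# Same data as above, rebuilt so this snippet runs on its own.
example = pd.DataFrame([
    {"person": "James", "sales": 1000, "lastName": "Taylor"},
    {"person": "Clara", "sales": 3000, "lastName": "Brown"},
])

rows, columns = example.shape  # shape is a (rows, columns) tuple
print(rows, columns)           # 2 3
print(example.size)            # rows * columns = 6
print(example.ndim)            # 2, a DataFrame is two-dimensional
print(len(example))            # 2, the number of rows
```

Note that len counts rows only, so it always matches the first element of shape.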

Next, let's look at the count method, which returns the number of non-missing values in each column.

print(df.count())
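
Our DataFrame has no missing values, so count returns 2 for every column. To see count skip missing values, here is a sketch with a hypothetical third row ("Maria") and two gaps added purely for illustration. The names with_gaps, counts, and missing are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing sales figure and one missing last name.
with_gaps = pd.DataFrame({
    "person": ["James", "Clara", "Maria"],
    "sales": [1000, np.nan, 2000],
    "lastName": ["Taylor", "Brown", None],
})

# count() ignores NaN and None, so columns with gaps report fewer values.
counts = with_gaps.count()
print(counts)

# isna().sum() gives the complementary view: missing values per column.
missing = with_gaps.isna().sum()
print(missing)
```

Comparing count against len(with_gaps) is a quick way to spot columns that need cleaning.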

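Now let's move on to some statistics. The following summary methods are frequently used when interpreting continuous data. For example, the min method gives us the minimum value of each numeric column. Because our DataFrame also contains text columns, we pass numeric_only=True so each calculation is restricted to the numeric ones. Without it, mean, median, and std raise a TypeError on text columns in pandas 2.0 and later.

print(df.min(numeric_only=True))
print(df.max(numeric_only=True))
print(df.mean(numeric_only=True))
print(df.median(numeric_only=True))
print(df.std(numeric_only=True))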
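
When a DataFrame mixes text and numeric columns, as ours does, another option is to compute the statistics on a single numeric column. The sketch below does this for sales using the agg method. The names sales_df and stats are placeholders for this sketch.

```python
import pandas as pd

# Same data as above, rebuilt so this snippet runs on its own.
sales_df = pd.DataFrame([
    {"person": "James", "sales": 1000, "lastName": "Taylor"},
    {"person": "Clara", "sales": 3000, "lastName": "Brown"},
])

# Selecting one column gives a Series, so the text columns never get involved.
sales = sales_df["sales"]

# agg with a list of names runs several statistics in one call
# and returns a Series indexed by those names.
stats = sales.agg(["min", "max", "mean", "median", "std"])
print(stats)
```

Keep in mind that std is the sample standard deviation (it divides by n - 1), so for our two values it comes out to roughly 1414.21 rather than 1000.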

We can also see many summary statistics at once by using the describe method. By default, it reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column.

print(df.describe())
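
By default, describe skips the text columns in a mixed DataFrame like ours. As a sketch of how to bring them in, the example below passes include="all", which adds rows such as unique, top, and freq for the text columns. The names people, summary, and full are placeholders for this sketch.

```python
import pandas as pd

# Same data as above, rebuilt so this snippet runs on its own.
people = pd.DataFrame([
    {"person": "James", "sales": 1000, "lastName": "Taylor"},
    {"person": "Clara", "sales": 3000, "lastName": "Brown"},
])

# Default behaviour: only the numeric column (sales) is summarized.
summary = people.describe()
print(summary)

# include="all" summarizes every column; statistics that do not apply
# to a column (for example the mean of person) are shown as NaN.
full = people.describe(include="all")
print(full)
```

The result is itself a DataFrame, so you can pull out a single value with something like summary.loc["mean", "sales"].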