 KoalaTea

# How to Summarize Data in R

### 05.03.2021

When working with statistics, there are a handful of common measures you will want to calculate for your data sets, and R supplies the summary function to help. The summary function handles many different kinds of objects, but in this article we will focus on summarizing a few basic data types.

To summarize data, we can pass a vector, matrix, factor, or data frame to the summary function. It accepts many other objects as well, but those are for a later article. Let's see what summary returns for different data types.

sales = c(1000, 2000, 15000)
summary(sales)

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 1000    1500    2000    6000    8500   15000 

Above, we can see that R reports the minimum, maximum, mean, and quartiles (including the median) of our data. This gives us an idea of the center and spread, similar to what we would see in a box plot.
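To make the connection between these numbers concrete, here is a small sketch that recomputes them with the base R functions quantile and mean, and then pulls single values out of the summary result by name. The variable name s is only for illustration.

```r
sales = c(1000, 2000, 15000)

# quantile() returns the minimum, the three quartiles, and the maximum
quantile(sales)

# The mean is not part of quantile(), so it is computed separately
mean(sales)

# summary() returns a named object, so single values can be extracted by name
s = summary(sales)
s[["Median"]]
s[["Mean"]]
```

Extracting values by name is handy when you want to reuse a statistic in later calculations instead of only printing it.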

The summary function can also summarize factors, which gives us a count for each level.

fac = factor(c("yes", "yes", "no", "no", "maybe"))
summary(fac)

# maybe    no   yes
#     1     2     2 
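The counts only appear because the values are stored as a factor. As a minimal sketch (the variable name answers is made up for this example), compare what summary does with a plain character vector, and note that table gives the same counts either way.

```r
answers = c("yes", "yes", "no", "no", "maybe")

# A plain character vector only reports its length, class, and mode
summary(answers)

# Converting to a factor gives one count per level
counts = summary(factor(answers))
counts

# table() produces the same counts without converting first
table(answers)
```

If your text column shows up as Length, Class, and Mode instead of counts, wrapping it in factor is usually the fix.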

For a matrix, summary summarizes each column separately. If you think of our columns as representing conversions, clicks, and impressions, then summary gives us stats for each of these features. Because the matrix has no column names, R labels the columns V1, V2, and V3.

sales1 = c(1000, 1000, 1000)
sales2 = c(2000, 2000, 2000)
sales3 = c(15000, 15000, 15000)
mat = matrix(c(sales1, sales2, sales3), 3, 3)
mat

#      [,1] [,2]  [,3]
# [1,] 1000 2000 15000
# [2,] 1000 2000 15000
# [3,] 1000 2000 15000

summary(mat)

#        V1             V2             V3
# Min.   :1000   Min.   :2000   Min.   :15000
# 1st Qu.:1000   1st Qu.:2000   1st Qu.:15000
# Median :1000   Median :2000   Median :15000
# Mean   :1000   Mean   :2000   Mean   :15000
# 3rd Qu.:1000   3rd Qu.:2000   3rd Qu.:15000
# Max.   :1000   Max.   :2000   Max.   :15000 
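The printed summary of a matrix is formatted text, which is awkward to compute with. One possible alternative, sketched below, is to name the columns and use apply to run summary over each column, which returns an ordinary numeric matrix. The column names and the variable col_stats are chosen here for illustration.

```r
mat = matrix(c(1000, 1000, 1000,
               2000, 2000, 2000,
               15000, 15000, 15000), nrow = 3, ncol = 3)

# With column names set, summary() uses them instead of V1, V2, V3
colnames(mat) = c("conversions", "clicks", "impressions")
summary(mat)

# apply() over columns (MARGIN = 2) gives one column of statistics per feature
col_stats = apply(mat, 2, summary)
col_stats

# Individual statistics can be looked up by row and column name
col_stats["Mean", "impressions"]
```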

Probably the most common use of summary is summarizing a data frame. Usually, you will import an Excel sheet or CSV file and want to summarize the data.

conversions = c(1000, 1000, 1000)
clicks = c(2000, 2000, 2000)
impressions = c(15000, 15000, 15000)

df = data.frame(conversions, clicks, impressions)
df

#   conversions clicks impressions
# 1        1000   2000       15000
# 2        1000   2000       15000
# 3        1000   2000       15000

summary(df)

# conversions       clicks      impressions
# Min.   :1000   Min.   :2000   Min.   :15000
# 1st Qu.:1000   1st Qu.:2000   1st Qu.:15000
# Median :1000   Median :2000   Median :15000
# Mean   :1000   Mean   :2000   Mean   :15000
# 3rd Qu.:1000   3rd Qu.:2000   3rd Qu.:15000
# Max.   :1000   Max.   :2000   Max.   :15000 

The result is essentially the same as for the matrix, but a data frame can hold columns of different types, and summary labels each column with its name.
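To see how summary treats mixed column types after an import, here is a hedged sketch that writes a small data frame to a temporary CSV file and reads it back. The campaign data, column names, and variable names are invented for this example, and the temporary file stands in for whatever file you would actually import.

```r
# Hypothetical campaign data, invented for illustration
campaigns = data.frame(
  channel = c("email", "search", "search", "social"),
  clicks = c(2000, 1500, NA, 900),
  converted = c(TRUE, FALSE, TRUE, TRUE)
)

# Round-trip through a temporary CSV file to mimic importing real data
path = tempfile(fileext = ".csv")
write.csv(campaigns, path, row.names = FALSE)

# stringsAsFactors = TRUE turns text columns into factors so they get counted
imported = read.csv(path, stringsAsFactors = TRUE)

summary(imported)
```

Each column is summarized according to its type: the factor column gets counts per level, the numeric column gets the usual statistics plus a count of missing values, and the logical column gets counts of FALSE and TRUE.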