When working with statistics, there are several common summary statistics you will want to calculate for your data sets. R supplies the summary function to help. We will see later that summary handles many different kinds of objects, but in this article we will focus on summarizing some basic data types.
To summarize data, we can pass a vector, matrix, factor, or data frame to the summary function. You can also pass much more, but that is for a later article. Let's see what summary gives us for different data types.
sales = c(1000, 2000, 15000)
summary(sales)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    1000    1500    2000    6000    8500   15000
Above, we can see that R reports the minimum, maximum, mean, and quartiles of our data (the median is the second quartile). This gives us an idea of the average and the spread, similar to what we would see in a box plot.
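As a minimal sketch of where these numbers come from, the same values can be reproduced with mean, quantile, and IQR from base R. The variable names s, avg, q, and spread are only illustrative.

```r
sales <- c(1000, 2000, 15000)

# summary() returns a named object, so a single statistic can be pulled out by name
s <- summary(sales)
avg <- s[["Mean"]]       # 6000

# The quartiles match what quantile() computes with its default settings
q <- quantile(sales, probs = c(0.25, 0.5, 0.75))

# The interquartile range (3rd quartile minus 1st quartile) measures the spread
spread <- IQR(sales)     # 7000
```

Pulling values out by name is handy when you need one statistic in a later calculation rather than the printed table.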
The summary function can also summarize factors, which gives us a count of each level.
fac = factor(c("yes", "yes", "no", "no", "maybe"))
summary(fac)
# maybe    no   yes
#     1     2     2
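The counts above can also be used programmatically. The sketch below assumes the same factor and shows that table gives the same counts, and that the order of the output follows the factor's levels. The names counts, n_no, tab, and ordered_fac are illustrative only.

```r
fac <- factor(c("yes", "yes", "no", "no", "maybe"))

# summary() on a factor returns a named integer vector, one entry per level
counts <- summary(fac)
n_no <- counts[["no"]]   # 2

# table() produces the same counts as a table object
tab <- table(fac)

# Levels are sorted by default; pass levels = to control the order of the output
ordered_fac <- factor(c("yes", "yes", "no", "no", "maybe"),
                      levels = c("yes", "no", "maybe"))
ordered_counts <- summary(ordered_fac)
```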
For a matrix, the summary function summarizes each column. If you think of the columns as representing conversions, clicks, and impressions, then summary gives us statistics for each of these features.
sales1 = c(1000, 1000, 1000)
sales2 = c(2000, 2000, 2000)
sales3 = c(15000, 15000, 15000)
mat = matrix(c(sales1, sales2, sales3), 3, 3)
mat
#      [,1] [,2]  [,3]
# [1,] 1000 2000 15000
# [2,] 1000 2000 15000
# [3,] 1000 2000 15000
summary(mat)
#        V1             V2             V3
#  Min.   :1000   Min.   :2000   Min.   :15000
#  1st Qu.:1000   1st Qu.:2000   1st Qu.:15000
#  Median :1000   Median :2000   Median :15000
#  Mean   :1000   Mean   :2000   Mean   :15000
#  3rd Qu.:1000   3rd Qu.:2000   3rd Qu.:15000
#  Max.   :1000   Max.   :2000   Max.   :15000
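The printed summary of a matrix is a table of formatted text, which is awkward to compute with. One possible way to get the same statistics as numbers is to apply summary over the columns yourself. This is a sketch: the names mat2, stats, and row_stats are illustrative, and the column names are added here only to make the result easier to index.

```r
mat2 <- matrix(c(rep(1000, 3), rep(2000, 3), rep(15000, 3)), nrow = 3, ncol = 3)
colnames(mat2) <- c("conversions", "clicks", "impressions")

# MARGIN = 2 applies summary() to each column; the result is a numeric matrix
# with one row per statistic and one column per matrix column
stats <- apply(mat2, 2, summary)
clicks_mean <- stats["Mean", "clicks"]   # 2000

# MARGIN = 1 summarizes rows instead of columns
row_stats <- apply(mat2, 1, summary)
first_row_mean <- row_stats["Mean", 1]   # 6000
```

Giving the matrix column names also replaces the V1, V2, V3 headers in the printed summary.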
Probably the most common use of summary is summarizing a data frame. Usually, you will import an Excel sheet or CSV file and want to summarize the data.
conversions = c(1000, 1000, 1000)
clicks = c(2000, 2000, 2000)
impressions = c(15000, 15000, 15000)
df = data.frame(conversions, clicks, impressions)
df
#   conversions clicks impressions
# 1        1000   2000       15000
# 2        1000   2000       15000
# 3        1000   2000       15000
summary(df)
#   conversions       clicks      impressions
#  Min.   :1000   Min.   :2000   Min.   :15000
#  1st Qu.:1000   1st Qu.:2000   1st Qu.:15000
#  Median :1000   Median :2000   Median :15000
#  Mean   :1000   Mean   :2000   Mean   :15000
#  3rd Qu.:1000   3rd Qu.:2000   3rd Qu.:15000
#  Max.   :1000   Max.   :2000   Max.   :15000
This is essentially the same output as for the matrix, except that the columns keep their names. Unlike a matrix, a data frame can also hold columns of different types.
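To illustrate the point about mixed types, here is a small sketch with a made-up data frame. The campaign and converted columns and all of the values are hypothetical, chosen only to show how summary treats each column according to its type.

```r
# Hypothetical data: one factor, one numeric, and one logical column
mixed <- data.frame(
  campaign  = factor(c("a", "b", "a")),
  clicks    = c(2000, 2500, 1800),
  converted = c(TRUE, FALSE, TRUE)
)

# Prints level counts for the factor, quartiles and mean for the numeric
# column, and TRUE/FALSE counts for the logical column
s <- summary(mixed)

# lapply() gives the per-column summaries as separate objects
by_col <- lapply(mixed, summary)
a_count <- by_col$campaign[["a"]]         # 2
clicks_mean <- by_col$clicks[["Mean"]]    # 2100
```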