The summarize() function lets you compute summary statistics on a dataset with very little code. With this dplyr function, statistics such as means and counts are a single call away. In this article, we will learn how to use dplyr's summarize() in R.
If you don’t have time to read, here is a quick code snippet for you.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
mtcars %>%
  group_by(cyl) %>%
  summarize(mean = mean(disp), n = n())
## # A tibble: 3 x 3
## cyl mean n
## <dbl> <dbl> <int>
## 1 4 105. 11
## 2 6 183. 7
## 3 8 353. 14
We can load the dplyr package directly, but I recommend loading the tidyverse package, as we will use some of its other features.
library(tidyverse)
For this tutorial, we will use the mtcars data set, which comes with base R in the datasets package. We take a look at this data set below.
data(mtcars)
glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,~
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,~
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16~
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180~
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,~
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.~
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18~
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,~
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,~
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,~
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,~
We can use summarize() in its basic form by passing the data as the first argument, followed by one or more named arguments, each set to a summary expression. For example, below we name the new column mean and assign it the result of calling mean() on the column we would like to summarize. The result is a one-row data frame holding the mean of disp.
summarize(mtcars, mean = mean(disp))
## mean
## 1 230.7219
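Several named summaries can be computed in the same call. The sketch below is a minimal illustration; the column names mean_disp, sd_disp, and n_rows are chosen here for the example and do not come from the code above.

```r
library(dplyr)

# Each named argument becomes one column of the one-row result.
# The column names are illustrative choices.
overall <- summarize(
  mtcars,
  mean_disp = mean(disp),
  sd_disp = sd(disp),
  n_rows = n()
)
overall
```

Because no grouping is applied, the result has a single row that describes the whole data set.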
When working with dplyr and the tidyverse, we often use the pipe operator, %>%. It passes the data set on its left as the first argument to the function on its right. Here is a rewrite of the code above.
mtcars %>%
  summarize(mean = mean(disp))
## mean
## 1 230.7219
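To check that the piped form and the direct call do the same work, we can store both results and compare them. This is a small sketch; the variable names direct and piped are only used for this illustration.

```r
library(dplyr)

# Direct call: the data frame is the first argument
direct <- summarize(mtcars, mean = mean(disp))

# Piped call: %>% supplies mtcars as the first argument
piped <- mtcars %>% summarize(mean = mean(disp))

# Compare the two results; all.equal() reports whether they match
all.equal(direct, piped)
```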
We can also combine summarize() with group_by() to summarize groups of data. Below we group by cyl and compute the mean of disp and the number of rows in each cyl group.
mtcars %>%
  group_by(cyl) %>%
  summarize(mean = mean(disp), n = n())
## # A tibble: 3 x 3
## cyl mean n
## <dbl> <dbl> <int>
## 1 4 105. 11
## 2 6 183. 7
## 3 8 353. 14
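Grouping is not limited to one column. As a sketch, the example below groups by both cyl and am (the second column is chosen here only for illustration) and passes .groups = "drop" to summarize() so that the result comes back ungrouped.

```r
library(dplyr)

by_cyl_am <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(
    mean_disp = mean(disp),
    n = n(),
    .groups = "drop"  # return an ungrouped tibble
  )
by_cyl_am
```

The result has one row for each combination of cyl and am that occurs in the data, and the n column adds up to the 32 rows of mtcars.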
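Many summary functions can be used inside summarize(). Here is an example with many more summary statistics. Note that quantile(disp) returns five values per group, so the result has five rows per group (15 in total), and that any() and all() expect logical input, so passing the numeric disp column produces the coercion warnings shown in the output.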
mtcars %>%
  group_by(cyl) %>%
  summarize(
    mean = mean(disp),
    median = median(disp),
    sd = sd(disp),
    iqr = IQR(disp),
    mad = mad(disp),
    min = min(disp),
    max = max(disp),
    quantile(disp),
    first = first(disp),
    last = last(disp),
    nth = nth(disp, 4),
    n = n(),
    n_dist = n_distinct(disp),
    any = any(disp),
    all = all(disp),
  )
## Warning in any(disp): coercing argument of type 'double' to logical
## Warning in any(disp): coercing argument of type 'double' to logical
## Warning in any(disp): coercing argument of type 'double' to logical
## Warning in all(disp): coercing argument of type 'double' to logical
## Warning in all(disp): coercing argument of type 'double' to logical
## Warning in all(disp): coercing argument of type 'double' to logical
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
## # A tibble: 15 x 16
## # Groups: cyl [3]
## cyl mean median sd iqr mad min max `quantile(disp)` first last
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 105. 108 26.9 41.8 43.0 71.1 147. 71.1 108 121
## 2 4 105. 108 26.9 41.8 43.0 71.1 147. 78.8 108 121
## 3 4 105. 108 26.9 41.8 43.0 71.1 147. 108 108 121
## 4 4 105. 108 26.9 41.8 43.0 71.1 147. 121. 108 121
## 5 4 105. 108 26.9 41.8 43.0 71.1 147. 147. 108 121
## 6 6 183. 168. 41.6 36.3 11.3 145 258 145 160 145
## 7 6 183. 168. 41.6 36.3 11.3 145 258 160 160 145
## 8 6 183. 168. 41.6 36.3 11.3 145 258 168. 160 145
## 9 6 183. 168. 41.6 36.3 11.3 145 258 196. 160 145
## 10 6 183. 168. 41.6 36.3 11.3 145 258 258 160 145
## 11 8 353. 350. 67.8 88.2 73.4 276. 472 276. 360 301
## 12 8 353. 350. 67.8 88.2 73.4 276. 472 302. 360 301
## 13 8 353. 350. 67.8 88.2 73.4 276. 472 350. 360 301
## 14 8 353. 350. 67.8 88.2 73.4 276. 472 390 360 301
## 15 8 353. 350. 67.8 88.2 73.4 276. 472 472 360 301
## # ... with 5 more variables: nth <dbl>, n <int>, n_dist <int>, any <lgl>,
## # all <lgl>
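The any() and all() functions are meant for logical vectors, so they are most useful in summarize() when given a condition. The sketch below is one way to do that; the thresholds 300 and 100 are arbitrary values picked for the example.

```r
library(dplyr)

flags <- mtcars %>%
  group_by(cyl) %>%
  summarize(
    any_over_300 = any(disp > 300),  # does any car in the group exceed 300?
    all_over_100 = all(disp > 100)   # do all cars in the group exceed 100?
  )
flags
```

Since the comparisons already produce logical vectors, no coercion warnings are raised.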
Some summary functions take additional arguments. For example, we can tell quantile() which probabilities to compute, and add a prob column that labels each row with its probability.
mtcars %>%
  group_by(cyl) %>%
  summarize(qs = quantile(disp, c(0.25, 0.75)), prob = c(0.25, 0.75))
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
## # A tibble: 6 x 3
## # Groups: cyl [3]
## cyl qs prob
## <dbl> <dbl> <dbl>
## 1 4 78.8 0.25
## 2 4 121. 0.75
## 3 6 160 0.25
## 4 6 196. 0.75
## 5 8 302. 0.25
## 6 8 390 0.75
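Newer dplyr releases (1.1.0 and later) deprecate multi-row results from summarize() in favor of reframe(), so it can be convenient to keep one row per group by giving each quantile its own column. Here is a minimal sketch of that layout; the column names q25 and q75 are chosen for this example.

```r
library(dplyr)

quartiles <- mtcars %>%
  group_by(cyl) %>%
  summarize(
    q25 = quantile(disp, 0.25),  # first quartile of disp per group
    q75 = quantile(disp, 0.75)   # third quartile of disp per group
  )
quartiles
```

This returns three rows, one per cyl group, with the same quantile values as the longer layout.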
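We can also create our own functions and use them in summarize(). Here is a rewrite of the above using a custom function. The function returns a tibble, and because its result is passed without a name, summarize() unpacks the tibble's columns into the output.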
my_quantile <- function(x, probs) {
  tibble(x = quantile(x, probs), probs = probs)
}
mtcars %>%
  group_by(cyl) %>%
  summarize(my_quantile(disp, c(0.25, 0.75)))
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
## # A tibble: 6 x 3
## # Groups: cyl [3]
## cyl x probs
## <dbl> <dbl> <dbl>
## 1 4 78.8 0.25
## 2 4 121. 0.75
## 3 6 160 0.25
## 4 6 196. 0.75
## 5 8 302. 0.25
## 6 8 390 0.75
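A custom function that returns a single value fits into summarize() the same way as the built-in helpers. As a sketch, the hypothetical helper cv() below computes the coefficient of variation (standard deviation divided by the mean); it is not part of dplyr and is defined here only for illustration.

```r
library(dplyr)

# Hypothetical helper: coefficient of variation of a numeric vector
cv <- function(x) {
  sd(x) / mean(x)
}

cv_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(cv_disp = cv(disp))
cv_by_cyl
```

Because cv() returns one number per group, the result has exactly one row for each value of cyl.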