How to use dplyr summarize in R

Intro

The summarize function lets you compute summary statistics on your dataset with very little code. Means and counts, for example, take a single call to this tidyverse function. In this article, we will learn how to use dplyr summarize in R.

If you are in a hurry

If you don’t have time to read, here is a quick code snippet for you.

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

mtcars %>%
  group_by(cyl) %>%
  summarize(mean = mean(disp), n = n())

## # A tibble: 3 x 3
##     cyl  mean     n
##   <dbl> <dbl> <int>
## 1     4  105.    11
## 2     6  183.     7
## 3     8  353.    14

Loading the Library

We can load the dplyr package directly, but I recommend loading the tidyverse package, as we will use some of its other features along the way.

library(tidyverse)

Loading the Dataset

For this tutorial, we will use the mtcars data set, which ships with base R in the datasets package. We take a look at this data set below.

data(mtcars)

glimpse(mtcars)

## Rows: 32
## Columns: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,~
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,~
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16~
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180~
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,~
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.~
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18~
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,~
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,~
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,~
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,~

Basic dplyr Summarize

We can call summarize by passing the data as the first argument, followed by one or more named arguments that each hold a summary expression. For example, below we name the new column mean and set it to the result of calling mean() on the column we would like to summarize. This returns a single row containing the mean of disp.

summarize(mtcars, mean = mean(disp))

##       mean
## 1 230.7219

When working with dplyr and the tidyverse, we often use the pipe operator, %>%. The pipe passes the data set on its left as the first argument of the function on its right. Here is a rewrite of the code above.

mtcars %>%
  summarize(mean = mean(disp))

##       mean
## 1 230.7219
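
Since summarize accepts any number of named arguments, several statistics can be computed in one call. The following is a minimal sketch of that idea; the column names mean_disp and median_disp and the variable overall are chosen here only for illustration.

```r
library(dplyr)

# Each name = expression pair becomes one column of the one-row result.
overall <- mtcars %>%
  summarize(
    mean_disp = mean(disp),
    median_disp = median(disp),
    n = n()
  )

overall
```

Without group_by, the whole data set is treated as a single group, so the result has exactly one row.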

Summarize Groups using dplyr Summarize

We can also combine summarize with group_by to summarize groups of data. Below we group by cyl and compute the mean of disp and the number of rows in each cyl group.

mtcars %>%
  group_by(cyl) %>%
  summarize(mean = mean(disp), n = n())

## # A tibble: 3 x 3
##     cyl  mean     n
##   <dbl> <dbl> <int>
## 1     4  105.    11
## 2     6  183.     7
## 3     8  353.    14
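
To see what the grouping does, it can help to compute one group by hand. The sketch below filters the 4-cylinder cars, summarizes them without group_by, and compares the result with the matching row of the grouped output. The variable names four_cyl and grouped are illustrative only.

```r
library(dplyr)

# Summarize only the 4-cylinder cars by hand.
four_cyl <- mtcars %>%
  filter(cyl == 4) %>%
  summarize(mean = mean(disp), n = n())

# Summarize every cyl group in one pass.
grouped <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean = mean(disp), n = n())

# The hand-computed values match the cyl == 4 row of the grouped result.
all.equal(four_cyl$mean, grouped$mean[grouped$cyl == 4])
```

In other words, a grouped summarize behaves like running the same summary once per group and stacking the rows.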

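Many summary functions can be used inside summarize. Here is an example with many more summary statistics. Note that quantile(disp) returns five values by default, so each group expands to five rows in the output and the single-value statistics are repeated on each of them.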

mtcars %>%
  group_by(cyl) %>%
  summarize(
    mean = mean(disp),
    median = median(disp),
    sd = sd(disp),
    iqr = IQR(disp),
    mad = mad(disp),
    min = min(disp),
    max = max(disp),
    quantile(disp),
    first = first(disp),
    last = last(disp),
    nth = nth(disp, 4),
    n = n(),
    n_dist = n_distinct(disp),
    any = any(disp > 200), # TRUE if any value in the group exceeds 200
    all = all(disp > 200)  # TRUE if every value in the group exceeds 200
  )

## `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.

## # A tibble: 15 x 16
## # Groups:   cyl [3]
##      cyl  mean median    sd   iqr   mad   min   max `quantile(disp)` first  last
##    <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>            <dbl> <dbl> <dbl>
##  1     4  105.   108   26.9  41.8  43.0  71.1  147.             71.1   108   121
##  2     4  105.   108   26.9  41.8  43.0  71.1  147.             78.8   108   121
##  3     4  105.   108   26.9  41.8  43.0  71.1  147.            108     108   121
##  4     4  105.   108   26.9  41.8  43.0  71.1  147.            121.    108   121
##  5     4  105.   108   26.9  41.8  43.0  71.1  147.            147.    108   121
##  6     6  183.   168.  41.6  36.3  11.3 145    258             145     160   145
##  7     6  183.   168.  41.6  36.3  11.3 145    258             160     160   145
##  8     6  183.   168.  41.6  36.3  11.3 145    258             168.    160   145
##  9     6  183.   168.  41.6  36.3  11.3 145    258             196.    160   145
## 10     6  183.   168.  41.6  36.3  11.3 145    258             258     160   145
## 11     8  353.   350.  67.8  88.2  73.4 276.   472             276.    360   301
## 12     8  353.   350.  67.8  88.2  73.4 276.   472             302.    360   301
## 13     8  353.   350.  67.8  88.2  73.4 276.   472             350.    360   301
## 14     8  353.   350.  67.8  88.2  73.4 276.   472             390     360   301
## 15     8  353.   350.  67.8  88.2  73.4 276.   472             472     360   301
## # ... with 5 more variables: nth <dbl>, n <int>, n_dist <int>, any <lgl>,
## #   all <lgl>
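
The message about grouped output above points at the .groups argument. As a minimal sketch, assuming we group by two columns (cyl and am, picked here only for illustration), passing .groups = "drop" returns an ungrouped tibble and avoids the message.

```r
library(dplyr)

by_cyl_am <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(mean_disp = mean(disp), n = n(), .groups = "drop")

# The result is a plain, ungrouped tibble with one row per cyl/am pair.
is_grouped_df(by_cyl_am)
```

Dropping the groups explicitly is a reasonable default when the summary is the final step, since later verbs will not silently operate per group.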

Summary Function Options

Some summary functions take options of their own, which we can pass inside summarize. For example, we can tell quantile which probabilities to compute, and add a prob column that labels each resulting row with its probability.

mtcars %>%
  group_by(cyl) %>%
  summarize(qs = quantile(disp, c(0.25, 0.75)), prob = c(0.25, 0.75))

## `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.

## # A tibble: 6 x 3
## # Groups:   cyl [3]
##     cyl    qs  prob
##   <dbl> <dbl> <dbl>
## 1     4  78.8  0.25
## 2     4 121.   0.75
## 3     6 160    0.25
## 4     6 196.   0.75
## 5     8 302.   0.25
## 6     8 390    0.75
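
If one row per group is preferred, each quantile can be given its own column instead. This is a sketch of that alternative, not the approach used above; the column names q25 and q75 are illustrative, and names = FALSE is passed to quantile so that it returns plain numbers without the "25%" style labels.

```r
library(dplyr)

quartiles <- mtcars %>%
  group_by(cyl) %>%
  summarize(
    q25 = quantile(disp, 0.25, names = FALSE),
    q75 = quantile(disp, 0.75, names = FALSE)
  )

quartiles
```

This wide layout keeps the result at three rows, at the cost of one extra argument per quantile.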

Using Custom Functions

We can also create our own functions and use them in summarize. Here is a rewrite of the above using a custom function that returns a tibble; summarize unpacks the tibble's columns into the result.

my_quantile <- function(x, probs) {
  tibble(x = quantile(x, probs), probs = probs)
}

mtcars %>%
  group_by(cyl) %>%
  summarize(my_quantile(disp, c(0.25, 0.75)))

## `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.

## # A tibble: 6 x 3
## # Groups:   cyl [3]
##     cyl     x probs
##   <dbl> <dbl> <dbl>
## 1     4  78.8  0.25
## 2     4 121.   0.75
## 3     6 160    0.25
## 4     6 196.   0.75
## 5     8 302.   0.25
## 6     8 390    0.75
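
The output above was produced with dplyr 1.0.5. As far as I know, dplyr 1.1.0 and later deprecate returning more than one row per group from summarize and offer reframe() for that case. The sketch below assumes that behavior and checks the installed version before choosing which function to call; the variable name multi_row is illustrative.

```r
library(dplyr)

my_quantile <- function(x, probs) {
  tibble(x = quantile(x, probs), probs = probs)
}

# reframe() is assumed to be available from dplyr 1.1.0 onward;
# older versions fall back to summarize().
multi_row <- if (packageVersion("dplyr") >= "1.1.0") {
  mtcars %>%
    group_by(cyl) %>%
    reframe(my_quantile(disp, c(0.25, 0.75)))
} else {
  mtcars %>%
    group_by(cyl) %>%
    summarize(my_quantile(disp, c(0.25, 0.75)), .groups = "drop")
}

multi_row
```

Either branch should return an ungrouped tibble with two rows per cyl group.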