How to use dplyr group by in R

Intro

The group_by function splits a data frame into groups so that later operations, such as summaries and plots, are applied per group. dplyr, part of the tidyverse, makes grouping by one or several columns easy. In this article, we will learn how to use dplyr group_by.

If you are in a hurry

If you don’t have time to read, here is a quick code snippet for you.

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

mtcars %>% group_by(cyl, hp)

## # A tibble: 32 x 11
## # Groups:   cyl, hp [23]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ... with 22 more rows

Loading the Library

We can load the dplyr package directly, but I recommend loading the tidyverse package, as we will use some of its other features.

library(tidyverse)

Loading the Dataset

For this tutorial, we will use the mtcars data set, which ships with base R (in the datasets package), so no extra download is needed. We take a look at this data set below.

data(mtcars)

glimpse(mtcars)

## Rows: 32
## Columns: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,~
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,~
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16~
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180~
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,~
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.~
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18~
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,~
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,~
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,~
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,~

Basic dplyr group by

To use group_by, we pass our data and the column we want to group by as arguments to the function.

group_by(mtcars, cyl)

## # A tibble: 32 x 11
## # Groups:   cyl [3]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ... with 22 more rows

Notice that the rows and columns are unchanged, but the result is now printed as a tibble with a new line at the top, Groups: cyl [3], telling us the data is grouped by cyl into 3 groups.

When working with dplyr and the tidyverse, we often use the pipe operator, %>%. The pipe passes the data set on its left as the first argument of the function on its right. Here is the group_by(mtcars, cyl) call rewritten with the pipe.

mtcars %>% group_by(cyl)

## # A tibble: 32 x 11
## # Groups:   cyl [3]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ... with 22 more rows
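The Groups line is the only visible sign of grouping, because group_by attaches grouping metadata and leaves the rows themselves alone. As a minimal sketch (the variable name by_cyl is just for illustration), dplyr's is_grouped_df and n_groups helpers can be used to check this:

```r
library(dplyr)

# by_cyl is an illustrative name, not part of the examples above
by_cyl <- mtcars %>% group_by(cyl)

is_grouped_df(mtcars)  # FALSE: the original data frame has no groups
is_grouped_df(by_cyl)  # TRUE: grouping metadata has been attached
n_groups(by_cyl)       # 3, matching the Groups line in the printout
dim(by_cyl)            # still 32 rows and 11 columns
```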

Usually, group_by is used in conjunction with summarize. Here is an example of getting the average horsepower for each number of cylinders (cyl).

mtcars %>% 
  group_by(cyl) %>%
  summarize(avg_hp = mean(hp))

## # A tibble: 3 x 2
##     cyl avg_hp
##   <dbl>  <dbl>
## 1     4   82.6
## 2     6  122. 
## 3     8  209.
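summarize is not limited to one value per group. The sketch below computes several summaries at once; the column names n_cars and avg_mpg and the variable cyl_summary are my own choices for illustration, and n() is dplyr's helper for counting the rows in the current group.

```r
library(dplyr)

# One output row per cyl group, with three summary columns
cyl_summary <- mtcars %>%
  group_by(cyl) %>%
  summarize(
    n_cars  = n(),       # number of cars in the group
    avg_hp  = mean(hp),  # average horsepower
    avg_mpg = mean(mpg)  # average miles per gallon
  )

cyl_summary
```

The result has one row per group and one column per summary, plus the grouping column itself.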

More dplyr group by

Let’s go over some other features of the group_by function. First of all, if you use group_by on a data frame that is already grouped, the new grouping replaces the old one.

group1 <- mtcars %>% 
  group_by(cyl)

group1 %>%
  group_by(vs, am)

## # A tibble: 32 x 11
## # Groups:   vs, am [4]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ... with 22 more rows

If we would like to add to the existing groups instead, we can set the .add argument to TRUE.

group1 %>%
  group_by(vs, am, .add = TRUE)

## # A tibble: 32 x 11
## # Groups:   cyl, vs, am [7]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ... with 22 more rows
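Because the printed tibbles look almost identical, it can be easier to compare the two behaviors by looking only at the grouping variables. This is a small sketch using group_vars; the names replaced and added are hypothetical, and the .add argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

group1 <- mtcars %>% group_by(cyl)

# Default behavior: the new grouping replaces cyl
replaced <- group1 %>% group_by(vs, am)
group_vars(replaced)
## [1] "vs" "am"

# With .add = TRUE: vs and am are appended to the existing cyl grouping
added <- group1 %>% group_by(vs, am, .add = TRUE)
group_vars(added)
## [1] "cyl" "vs"  "am"
```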

Ungroup

We can remove the grouping by using the ungroup function.

mtcars %>% 
  group_by(cyl) %>%
  ungroup()

## # A tibble: 32 x 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ... with 22 more rows
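Ungrouping matters because any verb applied to a grouped tibble keeps working per group until the grouping is removed. The following sketch, with illustrative column names group_mean_hp and overall_mean_hp, computes a mean before and after ungroup to show the difference.

```r
library(dplyr)

hp_means <- mtcars %>%
  group_by(cyl) %>%
  mutate(group_mean_hp = mean(hp)) %>%  # mean within each cyl group
  ungroup() %>%
  mutate(overall_mean_hp = mean(hp))    # mean across all 32 rows

hp_means %>%
  select(cyl, hp, group_mean_hp, overall_mean_hp) %>%
  head(3)
```

The group_mean_hp column takes one value per cyl group, while overall_mean_hp is the same on every row.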

Group by Computed

We can also group by a computed column by passing a name for the new column and the expression that computes it. This is shorthand for calling mutate before group_by.

mtcars %>% group_by(vsam = vs + am)

## # A tibble: 32 x 12
## # Groups:   vsam [3]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb  vsam
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4     1
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4     1
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1     2
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2     0
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4     0
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2     1
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2     1
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4     1
## # ... with 22 more rows

The result above is the same as creating the column with mutate first and then grouping by it.

mtcars %>% 
  mutate(vsam = vs + am) %>%
  group_by(vsam)

## # A tibble: 32 x 12
## # Groups:   vsam [3]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb  vsam
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4     1
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4     1
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1     2
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2     0
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4     0
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2     1
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2     1
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4     1
## # ... with 22 more rows
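A computed grouping column can be fed straight into summarize, just like a regular column. As a hedged sketch, here is the vsam grouping used to count cars and average mpg; the names vsam_summary, n_cars and avg_mpg are illustrative.

```r
library(dplyr)

# Group by the computed vs + am value, then summarize each group
vsam_summary <- mtcars %>%
  group_by(vsam = vs + am) %>%
  summarize(
    n_cars  = n(),       # cars in each vsam group
    avg_mpg = mean(mpg)  # average miles per gallon
  )

vsam_summary
```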

Group Vars

We can use the group_vars function to get the names of the grouping variables.

mtcars %>% 
  group_by(vs, am) %>%
  group_vars()

## [1] "vs" "am"
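group_vars returns only the column names. If the actual groups are of interest, dplyr also provides group_keys and group_size; the sketch below uses them on the same grouping, with grouped as an illustrative variable name.

```r
library(dplyr)

grouped <- mtcars %>% group_by(vs, am)

# One row per combination of vs and am that occurs in the data
group_keys(grouped)

# Number of rows in each group, in the same order as group_keys
group_size(grouped)
```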