How to use dplyr distinct in R

Intro

Finding duplicates is simple using the distinct verb in dplyr. There are many options that allow you to specify column combinations to find distinct rows which is essential when looking for duplicates. In this article, we will learn how to use dplyr distinct.

If you are in a Hurry

If you are here for the code snippet, here is a quick example to use.

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

row1 = data.frame(
  x = 1,
  y = 2,
  x = 3
)
row2 = data.frame(
  x = 1,
  y = 2,
  x = 3
)
row3 = data.frame(
  x = 1,
  y = 4,
  x = 5
)

df = rbind(row1, row2, row3)
df

##   x y x.1
## 1 1 2   3
## 2 1 2   3
## 3 1 4   5

df %>% distinct(x, .keep_all = TRUE)

##   x y x.1
## 1 1 2   3

Loading the library

To use the dplyr library we can either load the dplyr library library(dplyr) or we can load the full tidyverse library. We will use the full suit of tidyverse as we will want some other features.

library(tidyverse)

Loading the data

For our data, we will create a quick example with duplicate rows. Notice the first two are the same, but the third row only shares the same x column.

row1 = data.frame(
  x = 1,
  y = 2,
  x = 3
)
row2 = data.frame(
  x = 1,
  y = 2,
  x = 3
)
row3 = data.frame(
  x = 1,
  y = 4,
  x = 5
)

df = rbind(row1, row2, row3)
df

##   x y x.1
## 1 1 2   3
## 2 1 2   3
## 3 1 4   5

Basic dplyr Distinct Usage

We can find distinct rows by calling the distinct function on our data frame. By default, distinct will use all columns for uniqueness. In our example below, we are left with only two rows which makes sense as the first two are duplicates.

nrow(df)

## [1] 3

nrow(distinct(df))

## [1] 2

We can also tell distinct to use specific columns. To do this, we simply pass the column names as parameters.

nrow(distinct(df, x))

## [1] 1

distinct(df, x)

##   x
## 1 1

This example requires x and y to be distinct to remove.

distinct(df, x, y)

##   x y
## 1 1 2
## 2 1 4

Using Pipe

In the example above, we passed the data to our method. When using tidyverse, it is more common to use the %>% pipe operator. We can change our example above to pipe the data frame in to distinct.

df %>% distinct(x, y)

##   x y
## 1 1 2
## 2 1 4

Keeping Other Columns

In the example above, if we did not tell distiinct which columns to use, the method removes them by default. We can change this behavior using the keep_all parameter.

df %>% distinct(x, .keep_all = TRUE)

##   x y x.1
## 1 1 2   3

Computed Distinct

We can also use computed columns to find uniqueness. In our example, we find distinct rows based on the difference between x and y. Another good example would be joining first and last name to find distinct based on full names.

df %>% distinct(diff = abs(x - y))

##   diff
## 1    1
## 2    3

Distinct Using Select Helpers

If we would like to select multiple columns based on conditions, we can use dplyr select helpers. In this example, we look for a columns that contain the word color to use when finding distinct rows.

starwars %>% distinct(across(contains("color")))

## # A tibble: 67 x 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>    
##  1 blond         fair        blue     
##  2 <NA>          gold        yellow   
##  3 <NA>          white, blue red      
##  4 none          white       yellow   
##  5 brown         light       brown    
##  6 brown, grey   light       blue     
##  7 brown         light       blue     
##  8 <NA>          white, red  red      
##  9 black         light       brown    
## 10 auburn, white fair        blue-gray
## # ... with 57 more rows

Distinct Groups

Our final example will be to find distinct rows in grouped data. First, we group all star wars character by their eye color.

df = starwars %>% group_by(eye_color)
df

## # A tibble: 87 x 14
## # Groups:   eye_color [15]
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Now, we find distinct values based on species. This will give us a list of unique combinations of species and their eye colors.

df %>% distinct(species)

## # A tibble: 51 x 2
## # Groups:   eye_color [15]
##    eye_color species
##    <chr>     <chr>  
##  1 blue      Human  
##  2 yellow    Droid  
##  3 red       Droid  
##  4 yellow    Human  
##  5 brown     Human  
##  6 blue-gray Human  
##  7 blue      Wookiee
##  8 black     Rodian 
##  9 orange    Hutt   
## 10 hazel     Human  
## # ... with 41 more rows

If we use arrange to sort by species, the data frame is a bit easier to read.

df %>% distinct(species) %>% arrange(species)

## # A tibble: 51 x 2
## # Groups:   eye_color [15]
##    eye_color species 
##    <chr>     <chr>   
##  1 unknown   Aleena  
##  2 yellow    Besalisk
##  3 yellow    Cerean  
##  4 blue      Chagrian
##  5 yellow    Clawdite
##  6 yellow    Droid   
##  7 red       Droid   
##  8 red, blue Droid   
##  9 black     Droid   
## 10 orange    Dug     
## # ... with 41 more rows