Finding duplicates is simple using the distinct verb in dplyr. There
are many options that allow you to specify column combinations to find
distinct rows which is essential when looking for duplicates. In this
article, we will learn how to use dplyr distinct.
If you are here for the code snippet, here is a quick example to use.
library(tidyverse)## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()row1 = data.frame(
x = 1,
y = 2,
x = 3
)
row2 = data.frame(
x = 1,
y = 2,
x = 3
)
row3 = data.frame(
x = 1,
y = 4,
x = 5
)
df = rbind(row1, row2, row3)
df## x y x.1
## 1 1 2 3
## 2 1 2 3
## 3 1 4 5df %>% distinct(x, .keep_all = TRUE)## x y x.1
## 1 1 2 3To use the dplyr library we can either load the dplyr library
library(dplyr) or we can load the full tidyverse library. We will use
the full suit of tidyverse as we will want some other features.
library(tidyverse)For our data, we will create a quick example with duplicate rows. Notice the first two are the same, but the third row only shares the same x column.
row1 = data.frame(
x = 1,
y = 2,
x = 3
)
row2 = data.frame(
x = 1,
y = 2,
x = 3
)
row3 = data.frame(
x = 1,
y = 4,
x = 5
)
df = rbind(row1, row2, row3)
df## x y x.1
## 1 1 2 3
## 2 1 2 3
## 3 1 4 5We can find distinct rows by calling the distinct function on our data
frame. By default, distinct will use all columns for uniqueness. In
our example below, we are left with only two rows which makes sense as
the first two are duplicates.
nrow(df)## [1] 3nrow(distinct(df))## [1] 2We can also tell distinct to use specific columns. To do this, we
simply pass the column names as parameters.
nrow(distinct(df, x))## [1] 1distinct(df, x)## x
## 1 1This example requires x and y to be distinct to remove.
distinct(df, x, y)## x y
## 1 1 2
## 2 1 4In the example above, we passed the data to our method. When using
tidyverse, it is more common to use the %>% pipe operator. We can
change our example above to pipe the data frame in to distinct.
df %>% distinct(x, y)## x y
## 1 1 2
## 2 1 4In the example above, if we did not tell distiinct which columns to
use, the method removes them by default. We can change this behavior
using the keep_all parameter.
df %>% distinct(x, .keep_all = TRUE)## x y x.1
## 1 1 2 3We can also use computed columns to find uniqueness. In our example, we find distinct rows based on the difference between x and y. Another good example would be joining first and last name to find distinct based on full names.
df %>% distinct(diff = abs(x - y))## diff
## 1 1
## 2 3If we would like to select multiple columns based on conditions, we can
use dplyr select helpers. In this example, we look for a columns that
contain the word color to use when finding distinct rows.
starwars %>% distinct(across(contains("color")))## # A tibble: 67 x 3
## hair_color skin_color eye_color
## <chr> <chr> <chr>
## 1 blond fair blue
## 2 <NA> gold yellow
## 3 <NA> white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
## 7 brown light blue
## 8 <NA> white, red red
## 9 black light brown
## 10 auburn, white fair blue-gray
## # ... with 57 more rowsOur final example will be to find distinct rows in grouped data. First, we group all star wars character by their eye color.
df = starwars %>% group_by(eye_color)
df## # A tibble: 87 x 14
## # Groups: eye_color [15]
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke S~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth ~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia O~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen L~ 178 120 brown, grey light blue 52 male mascu~
## 7 Beru W~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 <NA> white, red red NA none mascu~
## 9 Biggs ~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wa~ 182 77 auburn, wh~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>Now, we find distinct values based on species. This will give us a list of unique combinations of species and their eye colors.
df %>% distinct(species)## # A tibble: 51 x 2
## # Groups: eye_color [15]
## eye_color species
## <chr> <chr>
## 1 blue Human
## 2 yellow Droid
## 3 red Droid
## 4 yellow Human
## 5 brown Human
## 6 blue-gray Human
## 7 blue Wookiee
## 8 black Rodian
## 9 orange Hutt
## 10 hazel Human
## # ... with 41 more rowsIf we use arrange to sort by species, the data frame is a bit easier to read.
df %>% distinct(species) %>% arrange(species)## # A tibble: 51 x 2
## # Groups: eye_color [15]
## eye_color species
## <chr> <chr>
## 1 unknown Aleena
## 2 yellow Besalisk
## 3 yellow Cerean
## 4 blue Chagrian
## 5 yellow Clawdite
## 6 yellow Droid
## 7 red Droid
## 8 red, blue Droid
## 9 black Droid
## 10 orange Dug
## # ... with 41 more rows