Finding duplicates is simple using the distinct
verb in dplyr
. There
are many options that allow you to specify column combinations to find
distinct rows which is essential when looking for duplicates. In this
article, we will learn how to use dplyr distinct.
If you are here for the code snippet, here is a quick example to use.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
row1 = data.frame(
x = 1,
y = 2,
x = 3
)
row2 = data.frame(
x = 1,
y = 2,
x = 3
)
row3 = data.frame(
x = 1,
y = 4,
x = 5
)
df = rbind(row1, row2, row3)
df
## x y x.1
## 1 1 2 3
## 2 1 2 3
## 3 1 4 5
df %>% distinct(x, .keep_all = TRUE)
## x y x.1
## 1 1 2 3
To use the dplyr
library we can either load the dplyr library
library(dplyr)
or we can load the full tidyverse library. We will use
the full suit of tidyverse as we will want some other features.
library(tidyverse)
For our data, we will create a quick example with duplicate rows. Notice the first two are the same, but the third row only shares the same x column.
row1 = data.frame(
x = 1,
y = 2,
x = 3
)
row2 = data.frame(
x = 1,
y = 2,
x = 3
)
row3 = data.frame(
x = 1,
y = 4,
x = 5
)
df = rbind(row1, row2, row3)
df
## x y x.1
## 1 1 2 3
## 2 1 2 3
## 3 1 4 5
We can find distinct rows by calling the distinct
function on our data
frame. By default, distinct
will use all columns for uniqueness. In
our example below, we are left with only two rows which makes sense as
the first two are duplicates.
nrow(df)
## [1] 3
nrow(distinct(df))
## [1] 2
We can also tell distinct
to use specific columns. To do this, we
simply pass the column names as parameters.
nrow(distinct(df, x))
## [1] 1
distinct(df, x)
## x
## 1 1
This example requires x and y to be distinct to remove.
distinct(df, x, y)
## x y
## 1 1 2
## 2 1 4
In the example above, we passed the data to our method. When using
tidyverse, it is more common to use the %>%
pipe operator. We can
change our example above to pipe the data frame in to distinct.
df %>% distinct(x, y)
## x y
## 1 1 2
## 2 1 4
In the example above, if we did not tell distiinct
which columns to
use, the method removes them by default. We can change this behavior
using the keep_all
parameter.
df %>% distinct(x, .keep_all = TRUE)
## x y x.1
## 1 1 2 3
We can also use computed columns to find uniqueness. In our example, we find distinct rows based on the difference between x and y. Another good example would be joining first and last name to find distinct based on full names.
df %>% distinct(diff = abs(x - y))
## diff
## 1 1
## 2 3
If we would like to select multiple columns based on conditions, we can
use dplyr
select helpers. In this example, we look for a columns that
contain the word color to use when finding distinct rows.
starwars %>% distinct(across(contains("color")))
## # A tibble: 67 x 3
## hair_color skin_color eye_color
## <chr> <chr> <chr>
## 1 blond fair blue
## 2 <NA> gold yellow
## 3 <NA> white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
## 7 brown light blue
## 8 <NA> white, red red
## 9 black light brown
## 10 auburn, white fair blue-gray
## # ... with 57 more rows
Our final example will be to find distinct rows in grouped data. First, we group all star wars character by their eye color.
df = starwars %>% group_by(eye_color)
df
## # A tibble: 87 x 14
## # Groups: eye_color [15]
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke S~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth ~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia O~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen L~ 178 120 brown, grey light blue 52 male mascu~
## 7 Beru W~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 <NA> white, red red NA none mascu~
## 9 Biggs ~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wa~ 182 77 auburn, wh~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
Now, we find distinct values based on species. This will give us a list of unique combinations of species and their eye colors.
df %>% distinct(species)
## # A tibble: 51 x 2
## # Groups: eye_color [15]
## eye_color species
## <chr> <chr>
## 1 blue Human
## 2 yellow Droid
## 3 red Droid
## 4 yellow Human
## 5 brown Human
## 6 blue-gray Human
## 7 blue Wookiee
## 8 black Rodian
## 9 orange Hutt
## 10 hazel Human
## # ... with 41 more rows
If we use arrange to sort by species, the data frame is a bit easier to read.
df %>% distinct(species) %>% arrange(species)
## # A tibble: 51 x 2
## # Groups: eye_color [15]
## eye_color species
## <chr> <chr>
## 1 unknown Aleena
## 2 yellow Besalisk
## 3 yellow Cerean
## 4 blue Chagrian
## 5 yellow Clawdite
## 6 yellow Droid
## 7 red Droid
## 8 red, blue Droid
## 9 black Droid
## 10 orange Dug
## # ... with 41 more rows