How to Merge Two Data Frames By Column in R

04.18.2021

Let's say you have a list of users in one data frame and a list of their purchases in a second data frame. You'd like to combine these data frames into one based on the user id. In this article, we will learn how to use joins in R to combine data frames by column.

The basic way to merge two data frames is to use the merge function. We supply the two data frames and the column that we want to merge on.

users <- data.frame(
  userId = c(1, 2, 3, 4),
  name = c("Moe", "Larry", "Curly", "Harry")
)

purchases <- data.frame(
  userId = c(1, 2, 3, 3, 4),
  products = c("Prod1", "Prod2", "Prod3", "Prod1", "Prod2")
)

merge(users, purchases, by="userId")

#     userId  name products
# 1      1   Moe    Prod1
# 2      2 Larry    Prod2
# 3      3 Curly    Prod3
# 4      3 Curly    Prod1
# 5      4 Harry    Prod2

This type of join is known as an "inner join" and will include only items that match. For example, if one of the users didn't have a purchase, it would not be shown. Let's take a look at the same merge, but remove the purchase from Larry.



users <- data.frame(
  userId = c(1, 2, 3, 4),
  name = c("Moe", "Larry", "Curly", "Harry")
)

purchases <- data.frame(
  userId = c(1, 3, 3, 4),
  products = c("Prod1", "Prod3", "Prod1", "Prod2")
)


merge(users, purchases, by="userId")

#    userId  name products
# 1      1   Moe    Prod1
# 2      3 Curly    Prod3
# 3      3 Curly    Prod1
# 4      4 Harry    Prod2

If we want to show Larry, even though they don't have a purchase, we can use a left join which will join and keep everything in our left table (which is users in this case). To do this, we use the all.x= TRUE property on the merge function.

users <- data.frame(
  userId = c(1, 2, 3, 4),
  name = c("Moe", "Larry", "Curly", "Harry")
)

purchases <- data.frame(
  userId = c(1, 3, 3, 4),
  products = c("Prod1", "Prod3", "Prod1", "Prod2")
)


merge(users, purchases, by="userId", all.x= TRUE)

# userId  name products
# 1      1   Moe    Prod1
# 2      2 Larry     <NA>
# 3      3 Curly    Prod3
# 4      3 Curly    Prod1
# 5      4 Harry    Prod2

In a similar manner, if we have a product, but we don't have the user data, maybe the user was deleted, we can use a right join to display the product.

users <- data.frame(
  userId = c(1, 2, 3, 4),
  name = c("Moe", "Larry", "Curly", "Harry")
)

purchases <- data.frame(
  userId = c(1, 3, 3, 4, 5),
  products = c("Prod1", "Prod3", "Prod1", "Prod2", "OrphanProduct")
)


merge(users, purchases, by="userId", all.y=TRUE)

#    userId  name      products
# 1      1   Moe         Prod1
# 2      3 Curly         Prod3
# 3      3 Curly         Prod1
# 4      4 Harry         Prod2
# 5      5  <NA> OrphanProduct

If we would like to do both the left and right, we can use the outer join. This will show both tables even if there are missing values.

users <- data.frame(
  userId = c(1, 2, 3, 4),
  name = c("Moe", "Larry", "Curly", "Harry")
)

purchases <- data.frame(
  userId = c(1, 3, 3, 4, 5),
  products = c("Prod1", "Prod3", "Prod1", "Prod2", "OrphanProduct")
)


merge(users, purchases, by="userId", all=TRUE)


#      userId  name      products
# 1      1   Moe         Prod1
# 2      2 Larry          <NA>
# 3      3 Curly         Prod3
# 4      3 Curly         Prod1
# 5      4 Harry         Prod2
# 6      5  <NA> OrphanProduct

These joins will help you with most of your day to day tasks. There are a few other joins to look into. Also, many libraries have added functions that are a bit eaier to use. We will learn these later on.