Let's say you have a list of users in one data frame and a list of their purchases in a second data frame. You'd like to combine these data frames into one based on the user id. In this article, we will learn how to use joins in R to combine data frames by column.
The basic way to merge two data frames is to use the merge
function. We supply the two data frames and the column that we want to merge on.
users <- data.frame(
userId = c(1, 2, 3, 4),
name = c("Moe", "Larry", "Curly", "Harry")
)
purchases <- data.frame(
userId = c(1, 2, 3, 3, 4),
products = c("Prod1", "Prod2", "Prod3", "Prod1", "Prod2")
)
merge(users, purchases, by="userId")
# userId name products
# 1 1 Moe Prod1
# 2 2 Larry Prod2
# 3 3 Curly Prod3
# 4 3 Curly Prod1
# 5 4 Harry Prod2
This type of join is known as an "inner join" and will include only items that match. For example, if one of the users didn't have a purchase, it would not be shown. Let's take a look at the same merge, but remove the purchase from Larry.
users <- data.frame(
userId = c(1, 2, 3, 4),
name = c("Moe", "Larry", "Curly", "Harry")
)
purchases <- data.frame(
userId = c(1, 3, 3, 4),
products = c("Prod1", "Prod3", "Prod1", "Prod2")
)
merge(users, purchases, by="userId")
# userId name products
# 1 1 Moe Prod1
# 2 3 Curly Prod3
# 3 3 Curly Prod1
# 4 4 Harry Prod2
If we want to show Larry, even though they don't have a purchase, we can use a left join
which will join and keep everything in our left table (which is users in this case). To do this, we use the all.x= TRUE
property on the merge function.
users <- data.frame(
userId = c(1, 2, 3, 4),
name = c("Moe", "Larry", "Curly", "Harry")
)
purchases <- data.frame(
userId = c(1, 3, 3, 4),
products = c("Prod1", "Prod3", "Prod1", "Prod2")
)
merge(users, purchases, by="userId", all.x= TRUE)
# userId name products
# 1 1 Moe Prod1
# 2 2 Larry <NA>
# 3 3 Curly Prod3
# 4 3 Curly Prod1
# 5 4 Harry Prod2
In a similar manner, if we have a product, but we don't have the user data, maybe the user was deleted, we can use a right join to display the product.
users <- data.frame(
userId = c(1, 2, 3, 4),
name = c("Moe", "Larry", "Curly", "Harry")
)
purchases <- data.frame(
userId = c(1, 3, 3, 4, 5),
products = c("Prod1", "Prod3", "Prod1", "Prod2", "OrphanProduct")
)
merge(users, purchases, by="userId", all.y=TRUE)
# userId name products
# 1 1 Moe Prod1
# 2 3 Curly Prod3
# 3 3 Curly Prod1
# 4 4 Harry Prod2
# 5 5 <NA> OrphanProduct
If we would like to do both the left and right, we can use the outer join
. This will show both tables even if there are missing values.
users <- data.frame(
userId = c(1, 2, 3, 4),
name = c("Moe", "Larry", "Curly", "Harry")
)
purchases <- data.frame(
userId = c(1, 3, 3, 4, 5),
products = c("Prod1", "Prod3", "Prod1", "Prod2", "OrphanProduct")
)
merge(users, purchases, by="userId", all=TRUE)
# userId name products
# 1 1 Moe Prod1
# 2 2 Larry <NA>
# 3 3 Curly Prod3
# 4 3 Curly Prod1
# 5 4 Harry Prod2
# 6 5 <NA> OrphanProduct
These joins will help you with most of your day to day tasks. There are a few other joins to look into. Also, many libraries have added functions that are a bit eaier to use. We will learn these later on.