Merging Datasets

The only package we need for merging is dplyr:

library(dplyr)

Clear the workspace so that no leftover objects remain in our environment. This removes objects only; loaded packages such as dplyr stay attached.

rm(list = ls())

You don’t need to pay much attention to the code below. We’re just creating two simple data frames with tribble() that we can use for different merge operations. (Older code uses frame_data() for this; it is deprecated in favor of tribble().)

teams <- tribble(
  ~city, ~team,
  "Atlanta", "Hawks",
  "Boston", "Celtics",
  "Chicago", "Bulls"
)

cities <- tribble(
  ~city, ~state,
  "Atlanta", "Georgia",
  "Boston", "Massachusetts",
  "Detroit", "Michigan"
)

Let’s see how we can merge the teams dataset with the cities dataset.

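Left Join

A left join keeps every row from teams and adds the matching state from cities. Chicago has no match in cities, so its state is NA.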

teams %>%
  left_join(cities, by = "city")

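Right Join

A right join keeps every row from cities instead. Detroit has no match in teams, so its team is NA, and Chicago is dropped.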

teams %>%
  right_join(cities, by = "city")

Inner Join

An inner join keeps only the rows whose city appears in both data frames, here Atlanta and Boston.

teams %>%
  inner_join(cities, by = "city")

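Full Join

A full join keeps every row from both data frames and fills the missing values with NA, so Chicago and Detroit both appear.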

teams %>%
  full_join(cities, by = "city")
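To compare the four joins side by side, here is a minimal sketch that rebuilds the same two data frames (using tibble() rather than tribble(), purely to keep the block self-contained) and collects the cities each join returns. The variable names such as left_cities are only illustrative.

```r
library(dplyr)

# Same data as the teams and cities data frames above.
teams <- tibble(
  city = c("Atlanta", "Boston", "Chicago"),
  team = c("Hawks", "Celtics", "Bulls")
)
cities <- tibble(
  city  = c("Atlanta", "Boston", "Detroit"),
  state = c("Georgia", "Massachusetts", "Michigan")
)

# Cities returned by each type of join.
left_cities  <- left_join(teams, cities, by = "city")$city
right_cities <- right_join(teams, cities, by = "city")$city
inner_cities <- inner_join(teams, cities, by = "city")$city
full_cities  <- full_join(teams, cities, by = "city")$city

# Rows that exist on only one side carry NA in the other side's column.
unmatched_team <- left_join(teams, cities, by = "city") %>%
  filter(is.na(state))

print(list(left = left_cities, right = right_cities,
           inner = inner_cities, full = full_cities))
```

The inner join returns the intersection of the two city columns and the full join returns their union, while the left and right joins each follow one of the two inputs.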

Different Column Names

In the previous examples, both of our datasets included a column named city. But what if the key columns in the two datasets had different names? For example, consider a states dataset that looks like this:

states <- tribble(
  ~code, ~name,
  "GA", "Georgia",
  "MI", "Michigan",
  "MA", "Massachusetts",
  "IL", "Illinois"
)

What if we were to merge the cities dataset with states?

One option would be to rename the columns so their names would match, but you don’t really need to do that. You can simply tell the join functions the mapping between the different names.

cities %>%
  left_join(states, by = c("state" = "name"))

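In the above example, we’re telling left_join() to match the state column from the cities data frame against the name column from the states data frame. The result keeps the key under the name used in the first data frame, state. Illinois has no matching city, so it does not appear in the result.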
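To see that the name mapping and the renaming option give the same result, here is a small sketch that runs both. The object names (by_mapping, by_rename, by_join_by) are illustrative only. The last variant uses join_by(), which is available from dplyr 1.1.0 onward, so it is guarded by a version check in case an older dplyr is installed.

```r
library(dplyr)

# Same data as the cities and states data frames above.
cities <- tibble(
  city  = c("Atlanta", "Boston", "Detroit"),
  state = c("Georgia", "Massachusetts", "Michigan")
)
states <- tibble(
  code = c("GA", "MI", "MA", "IL"),
  name = c("Georgia", "Michigan", "Massachusetts", "Illinois")
)

# Option 1: map the differently named key columns in the join itself.
by_mapping <- cities %>%
  left_join(states, by = c("state" = "name"))

# Option 2: rename the key column in states first, then join on the shared name.
by_rename <- cities %>%
  left_join(rename(states, state = name), by = "state")

# Option 3: join_by() syntax, only run when dplyr is at least 1.1.0.
if (packageVersion("dplyr") >= "1.1.0") {
  by_join_by <- cities %>%
    left_join(states, by = join_by(state == name))
}

print(by_mapping)
```

All variants return one row per city with the columns city, state, and code, so the choice between them is mostly a matter of readability.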