Joining Data Sets

Joining Theory

Often we have separate data frames that can have common and differing variables for similar observations. These types of data sets are referred as relational data sets.

The dplyr package offers three sets of joining functions to provide alternative ways to join data frames. These are:

Mutating joins: This group of functions add new variables to one data frame from matching observations in another.
Filtering joins: This group of functions filter observations from one data frame based on whether or not they match an observation in the other table.
Set operations: This group of functions treat observations as if they were set elements.

Mutating Joins

The first set of functions to combine data sets is called “mutating joins”. left_join(), right_join(), inner_join(), and full_join() functions are in this group. The mutating join functions allow you to combine variables from two tables and add variables to the right (like mutate).

Note that mutating join functions add variables to the right. Therefore if you have a lot of variables already in the data, the new variables won’t get printed out.

Controlling How the Data Sets are Matched

Each mutating join takes an argument “by” that controls which variables are used to match observations in the two data sets. There are a few ways to specify it:

NULL: The default value. dplyr will will use all variables that appear in both tables, a natural join.

This code joins the planes column to flights2, joining on a common column called “tailnum” in both tables x and y.

# flights2 %>% left_join(planes, by = "tailnum")

This code joins the airport column to flights 2, joining on a common column “dest” in table x and column “faa” in column y. They are the same column but named differently.

# flights2 %>% left_join(airports, c("dest" = "faa"))

Filtering Joins

Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types of filtering joins: semi_join() and anti_join().

BOTH OF THESE KEEP OBS IN X and GET RID OF OBS IN Y. THEY FILTER WHICH OBS IN X TO KEEP.

semi_join(x, y): keeps all observations in x that have a match in y.
anti_join(x, y): drops all observations in x that have a match in y.

Anti-joins are useful for diagnosing joint mismatches.

Mutating or Filtering Joins?

Note that mutating join functions add variables to the right. Therefore if you have a lot of variables already in the data, the new variables won’t get printed out.

Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types of filtering joins: semi_join() and anti_join().

Set Operations

Set operations expect the x and y inputs to have the same variables and treat the observations like sets. There are three types of set operations:

intersect(x, y): return only observations in both x and y.
union(x, y): return unique observations in x and y.
setdiff(x, y): return observations in x, but not in y. Make sure to try (x,y) and (y,x) it only works one way (for some reason)

Merging Datasets

Often you may just need to merge data frames by row and column. The bind_rows() and bind_cols() bind the multiple data frames by row and column, respectively.

bind_rows(x, y): Append y to x as new rows.
bind_cols(x, y): Append y to x as new columns.