It’s rare to have a data analysis that only involves a single table of data. In practice, you have many tables that contribute to an analysis, and you need flexible tools to combine them. In dplyr, there are three families of verbs that work with two tables:

Joins

Joins are used to add additional variables by matching current values to values in another a table. For example, imagine you have a dataset of patients with disease codes, and the description of the disease codes. Then you can match the codes to their descriptions with a join:

patients %>% 
  left_join(diesase_codes)

dplyr provides four joins. They all have the same inputs:

They differ primarily in what happens when rows don’t match:

The left, right and full joins are collectively know as outer joins. When a row doesn’t match in an outer join, it fills the new variables in with missing values:

df1 <- data_frame(x = 1:2, y = 2:1)
df2 <- data_frame(x = 1, a = 10, b = "a")

left_join(df1, df2)
#> Joining by: "x"
#> Source: local data frame [2 x 4]
#> 
#>   x y  a  b
#> 1 1 2 10  a
#> 2 2 1 NA NA

Observations

While joins are primarily used to add new variables, they can also generate new observations. If there is a duplicate match, a join will add all possible combinations (the Cartesian products) of the observations:

df1 <- data_frame(x = c(1, 1, 2), y = 1:3)
df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a"))

left_join(df1, df2)
#> Joining by: "x"
#> Source: local data frame [5 x 3]
#> 
#>   x y z
#> 1 1 1 a
#> 2 1 1 b
#> 3 1 2 a
#> 4 1 2 b
#> 5 2 3 a

Filtering joins

Filtering joins are similar to regular joins, but they only affect the observations, not the columns. There are two types:

These are most useful for diagnosing join mismatches. For example, there are many flights in the nycflights13 dataset that don’t have a matching tail number in the planes table:

library("nycflights13")
flights %>% 
  anti_join(planes, by = "tailnum") %>% 
  count(tailnum, sort = TRUE)
#> Source: local data frame [722 x 2]
#> 
#>    tailnum    n
#> 1          2512
#> 2   N725MQ  575
#> 3   N722MQ  513
#> 4   N723MQ  507
#> 5   N713MQ  483
#> 6   N735MQ  396
#> 7   N0EGMQ  371
#> 8   N534MQ  364
#> 9   N542MQ  363
#> 10  N531MQ  349
#> ..     ...  ...

If you’re ever worried about what observations your joins will match, start with a semi_join() or anti_join(). It’s safer to do this, rather than comparing nrow() before and after the join because semi_join() and anti_join() never duplicate observations like the other joins, they only ever remove observations.

df1 <- data_frame(x = c(1, 1, 2), y = 1:3)
df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a"))

df1 %>% semi_join(df2) %>% nrow()
#> Joining by: "x"
#> [1] 3
df1 %>% left_join(df2) %>% nrow()
#> Joining by: "x"
#> [1] 5

You can easily imagine situations where the duplications and match failures could cancel out.

Set operations

The final type of two table verbs are the set operation. These expect the x and y inputs to have the same variables, and treat the observations like sets.

Here they are in action with some simple inputs:

df1 <- data_frame(x = 1:2, y = c(1L, 1L))
df1
#> Source: local data frame [2 x 2]
#> 
#>   x y
#> 1 1 1
#> 2 2 1
df2 <- data_frame(x = 1:2, y = 1:2)
df2
#> Source: local data frame [2 x 2]
#> 
#>   x y
#> 1 1 1
#> 2 2 2

intersect(df1, df2)
#> Source: local data frame [1 x 2]
#> 
#>   x y
#> 1 1 1
# Note that we get 3 rows, not 4
union(df1, df2)
#> Source: local data frame [3 x 2]
#> 
#>   x y
#> 1 2 2
#> 2 2 1
#> 3 1 1
setdiff(df1, df2)
#> Source: local data frame [1 x 2]
#> 
#>   x y
#> 1 2 1

Multiple-table verbs

dplyr does not provide any functions for working with three or more tables. Instead use Reduce(), as described in Advanced R, to iteratively combine the two-table verbs to handle as many tables as you need.

Coercion rules

When joining tables, dplyr is a little more conservative than base R about what types of variable it considers equivalent. Generally, apart from the special cases described below, dplyr joins will warn or error if the variables are not the same type.