Two-table verbs

It’s rare to have a data analysis that only involves a single table of data. In practice, you have many tables that contribute to an analysis, and you need flexible tools to combine them. In dplyr, there are three families of verbs that work with two-tables:

All functions behave similarly. The first two arguments are x and y and give the tables to combine, and they alway return a new table. If the types of x and y are different, the returned object will be the same type as x.

This discussion assumes that you have tidy data, where the rows are observations and the columns are variables.)

Matching joins

Matching joins allow you combine variables from two-tables, matching observations based on a subset of the variables. For example, imagine you have one table with flight information that uses two letter carrier codes. To add the full carrier names, you can use a join:

library("nycflights13")
flightnum <- flights %>% select(carrier, flight)
flightnum
#> Source: local data frame [336,776 x 2]
#> 
#>    carrier flight
#> 1       UA   1545
#> 2       UA   1714
#> 3       AA   1141
#> 4       B6    725
#> 5       DL    461
#> ..     ...    ...

flightnum %>% 
  left_join(airlines)
#> Joining by: "carrier"
#> Source: local data frame [336,776 x 3]
#> 
#>    carrier flight                   name
#> 1       UA   1545  United Air Lines Inc.
#> 2       UA   1714  United Air Lines Inc.
#> 3       AA   1141 American Airlines Inc.
#> 4       B6    725        JetBlue Airways
#> 5       DL    461   Delta Air Lines Inc.
#> ..     ...    ...                    ...

As well as x and y, each matching join takes an argument by that control which variables are used to match observations in the two-tables. There are a few ways to specify it:

Omit it: dplyr will default (with a message) to will use variables common to both tables (this is known as a natural join).
Give a character vector: by = "x". This will match the specified variables in both tables.
Give a named character vector: by = c("x" = "a"). This will match column x in table x to column a in table b

Each of these are illustrated below with different tables from nycflight13:

# Flights and weather weather join on date and origin
flights %>% left_join(weather)
#> Joining by: c("year", "month", "day", "origin", "hour")
#> Source: local data frame [336,776 x 25]
#> 
#>    year month day origin hour dep_time dep_delay arr_time arr_delay
#> 1  2013     1   1    EWR    5      517         2      830        11
#> 2  2013     1   1    LGA    5      533         4      850        20
#> 3  2013     1   1    JFK    5      542         2      923        33
#> 4  2013     1   1    JFK    5      544        -1     1004       -18
#> 5  2013     1   1    LGA    5      554        -6      812       -25
#> ..  ...   ... ...    ...  ...      ...       ...      ...       ...
#>    carrier
#> 1       UA
#> 2       UA
#> 3       AA
#> 4       B6
#> 5       DL
#> ..     ...
#> Variables not shown: tailnum (chr), flight (int), dest (chr), air_time
#>   (dbl), distance (dbl), minute (dbl), temp (dbl), dewp (dbl), humid
#>   (dbl), wind_dir (dbl), wind_speed (dbl), wind_gust (dbl), precip (dbl),
#>   pressure (dbl), visib (dbl)

# Both flights and planes have year columns, but they mean different things
# so we only want to join by the tailnum variable
flights %>% left_join(planes, by = "tailnum")
#> Source: local data frame [336,776 x 24]
#> 
#>    tailnum year.x month day dep_time dep_delay arr_time arr_delay carrier
#> 1   N14228   2013     1   1      517         2      830        11      UA
#> 2   N24211   2013     1   1      533         4      850        20      UA
#> 3   N619AA   2013     1   1      542         2      923        33      AA
#> 4   N804JB   2013     1   1      544        -1     1004       -18      B6
#> 5   N668DN   2013     1   1      554        -6      812       -25      DL
#> ..     ...    ...   ... ...      ...       ...      ...       ...     ...
#> Variables not shown: flight (int), origin (chr), dest (chr), air_time
#>   (dbl), distance (dbl), hour (dbl), minute (dbl), year.y (int), type
#>   (chr), manufacturer (chr), model (chr), engines (int), seats (int),
#>   speed (int), engine (chr)

# We could join airport to either destination or origin:
flights %>% select(origin, dest) %>% left_join(airports, c("dest" = "faa"))
#> Source: local data frame [336,776 x 8]
#> 
#>    dest origin                            name   lat    lon  alt tz dst
#> 1   IAH    EWR    George Bush Intercontinental 29.98 -95.34   97 -6   A
#> 2   IAH    LGA    George Bush Intercontinental 29.98 -95.34   97 -6   A
#> 3   MIA    JFK                      Miami Intl 25.79 -80.29    8 -5   A
#> 4   BQN    JFK                              NA    NA     NA   NA NA  NA
#> 5   ATL    LGA Hartsfield Jackson Atlanta Intl 33.64 -84.43 1026 -5   A
#> ..  ...    ...                             ...   ...    ...  ... .. ...
flights %>% select(origin, dest) %>% left_join(airports, c("origin" = "faa"))
#> Source: local data frame [336,776 x 8]
#> 
#>    origin dest                name   lat    lon alt tz dst
#> 1     EWR  IAH Newark Liberty Intl 40.69 -74.17  18 -5   A
#> 2     LGA  IAH          La Guardia 40.78 -73.87  22 -5   A
#> 3     JFK  MIA John F Kennedy Intl 40.64 -73.78  13 -5   A
#> 4     JFK  BQN John F Kennedy Intl 40.64 -73.78  13 -5   A
#> 5     LGA  ATL          La Guardia 40.78 -73.87  22 -5   A
#> ..    ...  ...                 ...   ...    ... ... .. ...

dplyr provides four matching joins. They differ in their behaviour when a match is not found:

inner_join(x, y) only includes observations that match in both x and y.
left_join(x, y) includes all observations in x, regardless of whether they match or not. This is the most commonly used join because it ensures that you don’t lose observations from your primary table.
right_join(x, y) includes all obserations in y. It’s equivalent to left_join(y, x), but the columns will be ordered differently.
full_join() includes all observations from x and y.

The left, right and full joins are collectively know as outer joins. When a row doesn’t match in an outer join, it fills in the new variables with missing values:

df1 <- data_frame(x = 1:2, y = 2:1)
df2 <- data_frame(x = 1, a = 10, b = "a")

left_join(df1, df2)
#> Joining by: "x"
#> Source: local data frame [2 x 4]
#> 
#>   x y  a  b
#> 1 1 2 10  a
#> 2 2 1 NA NA

(Note that currently dplyr only supports equi-joins. Support for non-equi joins is planned for a future release.)

Observations

While joins are primarily used to add new variables, they can also generate new observations. If a matche is not unique, a join will add all possible combinations (the Cartesian products) of the matching observations:

df1 <- data_frame(x = c(1, 1, 2), y = 1:3)
df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a"))

left_join(df1, df2)
#> Joining by: "x"
#> Source: local data frame [5 x 3]
#> 
#>   x y z
#> 1 1 1 a
#> 2 1 1 b
#> 3 1 2 a
#> 4 1 2 b
#> 5 2 3 a

Filtering joins

Filtering joins are similar to regular joins, but they only affect the observations, not the variables. There are two types:

semi_join(x, y) return all observations in x that have a match in y.
anti_join(x, y) return all observaitons in x that don’t have a match in y.

These are most useful for diagnosing join mismatches. For example, there are many flights in the nycflights13 dataset that don’t have a matching tail number in the planes table:

library("nycflights13")
flights %>% 
  anti_join(planes, by = "tailnum") %>% 
  count(tailnum, sort = TRUE)
#> Source: local data frame [722 x 2]
#> 
#>    tailnum    n
#> 1          2512
#> 2   N725MQ  575
#> 3   N722MQ  513
#> 4   N723MQ  507
#> 5   N713MQ  483
#> ..     ...  ...

If you’re ever worried about what observations your joins will match, start with a semi_join() or anti_join(). semi_join() and anti_join() never duplicate; they only ever remove observations.

df1 <- data_frame(x = c(1, 1, 3, 4), y = 1:4)
df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a"))

# Four rows to start with:
df1 %>% nrow()
#> [1] 4
# And we get four rows after the join
df1 %>% inner_join(df2, by = "x") %>% nrow()
#> [1] 4
# But only two rows actually match
df1 %>% semi_join(df2, by = "x") %>% nrow()
#> [1] 2

Set operations

The final type of two-table verbs are the set operation. These expect the x and y inputs to have the same variables, and treat the observations like sets.

intersect(x, y): return only rows in both x and y
union(x, y): return unique rows in x and y
setdiff(x, y): return rows in x, but not in y.

Here they are in action with some simple inputs:

df1 <- data_frame(x = 1:2, y = c(1L, 1L))
df1
#> Source: local data frame [2 x 2]
#> 
#>   x y
#> 1 1 1
#> 2 2 1
df2 <- data_frame(x = 1:2, y = 1:2)
df2
#> Source: local data frame [2 x 2]
#> 
#>   x y
#> 1 1 1
#> 2 2 2

intersect(df1, df2)
#> Source: local data frame [1 x 2]
#> 
#>   x y
#> 1 1 1
# Note that we get 3 rows, not 4
union(df1, df2)
#> Source: local data frame [3 x 2]
#> 
#>   x y
#> 1 2 2
#> 2 2 1
#> 3 1 1
setdiff(df1, df2)
#> Source: local data frame [1 x 2]
#> 
#>   x y
#> 1 2 1

Databases

Each function has a straightforward conversion to the equivalent SQL:

R	SQL
`inner_join()`	`SELECT * FROM x JOIN y ON x.a = y.a`
`left_join()`	`SELECT * FROM x LEFT JOIN y ON x.a = y.a`
`right_join()`	`SELECT * FROM x RIGHT JOIN y ON x.a = y.a`
`full_join()`	`SELECT * FROM x FULL JOIN y ON x.a = y.a`
`semi_join()`	`SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a)`
`anti_join()`	`SELECT * FROM x WHERE NOT EXISTS (SELECT 1 FROM y WHERE x.a = y.a)`
`intersect(x, y)`	`SELECT * FROM x INTERSECT SELECT * FROM y`
`union(x, y)`	`SELECT * FROM x UNION SELECT * FROM y`
`setdiff(x, y)`	`SELECT * FROM x EXCEPT SELECT * FROM y`

Note that x and y don’t have to be both database tables. If you specify copy = TRUE, dplyr will copy the y table into the same location as the x variable. This is useful if you’ve downloaded a summarised dataset and determined a subset of interest that you now want the full data for. You can use semi_join(x, y, copy = TRUE) to upload y to a temporary table in the same database as x, and then perform a efficient semi join in the database. If you’re working with large data, it maybe also be helpful to set auto_index = TRUE. That will automatically add an index on the join variables to the temporary table.

Coercion rules

When joining tables, dplyr is a little more conservative than base R about what types of variable it considers equivalent. This is mostly likely to cause you problems if you’re using factors:

Factors with different levels are coerced to character with a warning:

df1 <- data_frame(x = 1, y = factor("a"))
df2 <- data_frame(x = 2, y = factor("b"))
full_join(df1, df2) %>% str()
#> Joining by: c("x", "y")
#> Warning: joining factors with different levels, coercing to character vector
#> Warning: joining factors with different levels, coercing to character vector
#> Warning: joining factors with different levels, coercing to character vector
#> Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  2 variables:
#>  $ x: num  1 2
#>  $ y: chr  "a" "b"

Factors with the same levels in a different order are coerced to character with a warning:

df1 <- data_frame(x = 1, y = factor("a", levels = c("a", "b")))
df2 <- data_frame(x = 2, y = factor("b", levels = c("b", "a")))
full_join(df1, df2) %>% str()
#> Joining by: c("x", "y")
#> Warning: joining factors with different levels, coercing to character vector
#> Warning: joining factors with different levels, coercing to character vector
#> Warning: joining factors with different levels, coercing to character vector
#> Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  2 variables:
#>  $ x: num  1 2
#>  $ y: chr  "a" "b"

Factors are preserved only if the levels match exactly:

df1 <- data_frame(x = 1, y = factor("a", levels = c("a", "b")))
df2 <- data_frame(x = 2, y = factor("b", levels = c("a", "b")))
full_join(df1, df2) %>% str()
#> Joining by: c("x", "y")
#> Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  2 variables:
#>  $ x: num  1 2
#>  $ y: Factor w/ 2 levels "a","b": 1 2

A factor and a character are coerced to character with a warning:

df1 <- data_frame(x = 1, y = "a")
df2 <- data_frame(x = 2, y = factor("a"))
full_join(df1, df2) %>% str()
#> Joining by: c("x", "y")
#> Warning: joining factor and character vector, coercing into character vector
#> Warning: joining character vector and factor, coercing into character vector
#> Warning: joining character vector and factor, coercing into character vector
#> Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  2 variables:
#>  $ x: num  1 2
#>  $ y: chr  "a" "a"

Otherwise logicals will be silently upcast to integer, and integer to numeric, but coercing to character will raise an error:

df1 <- data_frame(x = 1, y = 1L)
df2 <- data_frame(x = 2, y = 1.5)
full_join(df1, df2) %>% str()
#> Joining by: c("x", "y")
#> Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  2 variables:
#>  $ x: num  1 2
#>  $ y: num  1 1.5

df1 <- data_frame(x = 1, y = 1L)
df2 <- data_frame(x = 2, y = "a")
full_join(df1, df2) %>% str()
#> Joining by: c("x", "y")
#> Error: cannot join on columns 'y' x 'y': Can't join on 'y' x 'y' because
#> of incompatible types (character / integer)