It’s rare to have a data analysis that only involves a single table of data. In practice, you have many tables that contribute to an analysis, and you need flexible tools to combine them. In dplyr, there are three families of verbs that work with two tables:
Joins are used to add additional variables by matching current values to values in another a table. For example, imagine you have a dataset of patients with disease codes, and the description of the disease codes. Then you can match the codes to their descriptions with a join:
patients %>%
left_join(diesase_codes)
dplyr provides four joins. They all have the same inputs:
x
and y
by
.They differ primarily in what happens when rows don’t match:
inner_join()
, only includes observations that matchleft_join()
, includes all observations in x
, and matching from y
right_join()
, includes all obserations in y
, and matching from x
full_join()
, includes all observations from x
and y
.The left, right and full joins are collectively know as outer joins. When a row doesn’t match in an outer join, it fills the new variables in with missing values:
df1 <- data_frame(x = 1:2, y = 2:1)
df2 <- data_frame(x = 1, a = 10, b = "a")
left_join(df1, df2)
#> Joining by: "x"
#> Source: local data frame [2 x 4]
#>
#> x y a b
#> 1 1 2 10 a
#> 2 2 1 NA NA
While joins are primarily used to add new variables, they can also generate new observations. If there is a duplicate match, a join will add all possible combinations (the Cartesian products) of the observations:
df1 <- data_frame(x = c(1, 1, 2), y = 1:3)
df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a"))
left_join(df1, df2)
#> Joining by: "x"
#> Source: local data frame [5 x 3]
#>
#> x y z
#> 1 1 1 a
#> 2 1 1 b
#> 3 1 2 a
#> 4 1 2 b
#> 5 2 3 a
Filtering joins are similar to regular joins, but they only affect the observations, not the columns. There are two types:
semi_join(x, y)
return all observations in x
that have a match in y
.anti_join(x, y)
return all observaitons in x
that don’t have a match in y
.These are most useful for diagnosing join mismatches. For example, there are many flights in the nycflights13 dataset that don’t have a matching tail number in the planes table:
library("nycflights13")
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)
#> Source: local data frame [722 x 2]
#>
#> tailnum n
#> 1 2512
#> 2 N725MQ 575
#> 3 N722MQ 513
#> 4 N723MQ 507
#> 5 N713MQ 483
#> 6 N735MQ 396
#> 7 N0EGMQ 371
#> 8 N534MQ 364
#> 9 N542MQ 363
#> 10 N531MQ 349
#> .. ... ...
If you’re ever worried about what observations your joins will match, start with a semi_join()
or anti_join()
. It’s safer to do this, rather than comparing nrow()
before and after the join because semi_join()
and anti_join()
never duplicate observations like the other joins, they only ever remove observations.
df1 <- data_frame(x = c(1, 1, 2), y = 1:3)
df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a"))
df1 %>% semi_join(df2) %>% nrow()
#> Joining by: "x"
#> [1] 3
df1 %>% left_join(df2) %>% nrow()
#> Joining by: "x"
#> [1] 5
You can easily imagine situations where the duplications and match failures could cancel out.
The final type of two table verbs are the set operation. These expect the x
and y
inputs to have the same variables, and treat the observations like sets.
intersect(x, y)
: return only rows in both x
and y
union(x, y)
: return unique rows in x
and y
setdiff(x, y)
: return rows in x
, but not in y
.Here they are in action with some simple inputs:
df1 <- data_frame(x = 1:2, y = c(1L, 1L))
df1
#> Source: local data frame [2 x 2]
#>
#> x y
#> 1 1 1
#> 2 2 1
df2 <- data_frame(x = 1:2, y = 1:2)
df2
#> Source: local data frame [2 x 2]
#>
#> x y
#> 1 1 1
#> 2 2 2
intersect(df1, df2)
#> Source: local data frame [1 x 2]
#>
#> x y
#> 1 1 1
# Note that we get 3 rows, not 4
union(df1, df2)
#> Source: local data frame [3 x 2]
#>
#> x y
#> 1 2 2
#> 2 2 1
#> 3 1 1
setdiff(df1, df2)
#> Source: local data frame [1 x 2]
#>
#> x y
#> 1 2 1
dplyr does not provide any functions for working with three or more tables. Instead use Reduce()
, as described in Advanced R, to iteratively combine the two-table verbs to handle as many tables as you need.
When joining tables, dplyr is a little more conservative than base R about what types of variable it considers equivalent. Generally, apart from the special cases described below, dplyr joins will warn or error if the variables are not the same type.
Factors and characters are coerced to character with a warning:
df1 <- data_frame(x = 1, y = "a")
df2 <- data_frame(x = 2, y = factor("a"))
outer_join(df1, df2) %>% str()
#> Joining by: c("x", "y")
#> Warning: joining factor and character vector, coercing into character vector
#> Warning: joining character vector and factor, coercing into character vector
#> Warning: joining character vector and factor, coercing into character vector
#> Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 2 variables:
#> $ x: num 1 2
#> $ y: chr "a" "a"
Factors with different levels are coerced to character with a warning:
df1 <- data_frame(x = 1, y = factor("a"))
df2 <- data_frame(x = 2, y = factor("b"))
outer_join(df1, df2) %>% str()
#> Joining by: c("x", "y")
#> Warning: joining factors with different levels, coercing to character vector
#> Warning: joining factors with different levels, coercing to character vector
#> Warning: joining factors with different levels, coercing to character vector
#> Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 2 variables:
#> $ x: num 1 2
#> $ y: chr "a" "b"
Factors with the same levels in a different order are coerced to character with a warning:
df1 <- data_frame(x = 1, y = factor("a", levels = c("a", "b")))
df2 <- data_frame(x = 2, y = factor("b", levels = c("b", "a")))
outer_join(df1, df2) %>% str()
#> Joining by: c("x", "y")
#> Warning: joining factors with different levels, coercing to character vector
#> Warning: joining factors with different levels, coercing to character vector
#> Warning: joining factors with different levels, coercing to character vector
#> Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 2 variables:
#> $ x: num 1 2
#> $ y: chr "a" "b"
Only if the levels match, will the factor will be preserved:
df1 <- data_frame(x = 1, y = factor("a", levels = c("a", "b")))
df2 <- data_frame(x = 2, y = factor("b", levels = c("a", "b")))
outer_join(df1, df2) %>% str()
#> Joining by: c("x", "y")
#> Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 2 variables:
#> $ x: num 1 2
#> $ y: Factor w/ 2 levels "a","b": 1 2
Integers and logicals will be silently coerced to numeric:
df1 <- data_frame(x = 1, y = 1L)
df2 <- data_frame(x = 2, y = 1.5)
outer_join(df1, df2) %>% str()
#> Joining by: c("x", "y")
#> Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 2 variables:
#> $ x: num 1 2
#> $ y: num 1 1.5
Combining logicals, integers or numerics with a character vector fails with an error:
df1 <- data_frame(x = 1, y = 1L)
df2 <- data_frame(x = 2, y = "a")
outer_join(df1, df2) %>% str()
#> Joining by: c("x", "y")
#> Error: cannot join on columns 'y' x 'y' : Can't join on 'y' x 'y' because
#> of incompatible types (character/integer)