Defining the key columns
So far, the pairs of tables have always been joined by a single
variable, and that variable has the same name in both tables. That
constraint was encoded by by
= “key”. You can use other
values for by
to connect the tables in other ways:
- The default,
by = NULL
, uses all variables that appear
in both tables, the so called natural join. For
example, the flights
and weather
tables match
on their common variables: year
, month
,
day
, hour
and origin
.
flights2 %>%
left_join(weather)
## # A tibble: 336,776 Ă— 18
## year month day hour origin dest tailnum carrier temp dewp humid
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA 39.0 28.0 64.4
## 2 2013 1 1 5 LGA IAH N24211 UA 39.9 25.0 54.8
## 3 2013 1 1 5 JFK MIA N619AA AA 39.0 27.0 61.6
## 4 2013 1 1 5 JFK BQN N804JB B6 39.0 27.0 61.6
## 5 2013 1 1 6 LGA ATL N668DN DL 39.9 25.0 54.8
## 6 2013 1 1 5 EWR ORD N39463 UA 39.0 28.0 64.4
## 7 2013 1 1 6 EWR FLL N516JB B6 37.9 28.0 67.2
## 8 2013 1 1 6 LGA IAD N829AS EV 39.9 25.0 54.8
## 9 2013 1 1 6 JFK MCO N593JB B6 37.9 27.0 64.3
## 10 2013 1 1 6 LGA ORD N3ALAA AA 39.9 25.0 54.8
## # ℹ 336,766 more rows
## # ℹ 7 more variables: wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
- A character vector,
by
= “x”. This is like a natural
join, but uses only some of the common variables. For example,
flights
and planes
have year
variables, but they mean different things so we only want to join by
tailnum
.
flights2 %>%
left_join(planes, by = "tailnum")
## # A tibble: 336,776 Ă— 16
## year.x month day hour origin dest tailnum carrier year.y type
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixed wing mult…
## 2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixed wing mult…
## 3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixed wing mult…
## 4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixed wing mult…
## 5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixed wing mult…
## 6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixed wing mult…
## 7 2013 1 1 6 EWR FLL N516JB B6 2000 Fixed wing mult…
## 8 2013 1 1 6 LGA IAD N829AS EV 1998 Fixed wing mult…
## 9 2013 1 1 6 JFK MCO N593JB B6 2004 Fixed wing mult…
## 10 2013 1 1 6 LGA ORD N3ALAA AA NA <NA>
## # ℹ 336,766 more rows
## # ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
## # seats <int>, speed <int>, engine <chr>
Note that the year
variables (which appear in both input
data frames, but are not constrained to be equal) are disambiguated in
the output with a suffix. We can fix this by use
rename()
.
flights2 %>%
left_join(planes, by = "tailnum") %>%
rename("flights_year" = 'year.x', "plane_made_year" = 'year.y') %>%
glimpse()
## Rows: 336,776
## Columns: 16
## $ flights_year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, …
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA"…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD"…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", …
## $ plane_made_year <int> 1999, 1998, 1990, 2012, 1991, 2012, 2000, 1998, 2004, …
## $ type <chr> "Fixed wing multi engine", "Fixed wing multi engine", …
## $ manufacturer <chr> "BOEING", "BOEING", "BOEING", "AIRBUS", "BOEING", "BOE…
## $ model <chr> "737-824", "737-824", "757-223", "A320-232", "757-232"…
## $ engines <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, NA, 2, 2, 2…
## $ seats <int> 149, 149, 178, 200, 178, 191, 200, 55, 200, NA, 200, 2…
## $ speed <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ engine <chr> "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan", "T…
- A named character vector:
by
= c(“a” = “b”). This will
match variable a
in table x to variable b
in
table y. The variables from x will be used in the output.
For example, if we want to draw a map we need to combine the
flights
data with the airports
data which
contains the location (lat and lon) of each airport. Each flight has an
origin and destination airport, so we need to specify which one we want
to join to:
flights2 %>%
left_join(airports, c("dest" = "faa"))
## # A tibble: 336,776 Ă— 15
## year month day hour origin dest tailnum carrier name lat lon alt
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA Georg… 30.0 -95.3 97
## 2 2013 1 1 5 LGA IAH N24211 UA Georg… 30.0 -95.3 97
## 3 2013 1 1 5 JFK MIA N619AA AA Miami… 25.8 -80.3 8
## 4 2013 1 1 5 JFK BQN N804JB B6 <NA> NA NA NA
## 5 2013 1 1 6 LGA ATL N668DN DL Harts… 33.6 -84.4 1026
## 6 2013 1 1 5 EWR ORD N39463 UA Chica… 42.0 -87.9 668
## 7 2013 1 1 6 EWR FLL N516JB B6 Fort … 26.1 -80.2 9
## 8 2013 1 1 6 LGA IAD N829AS EV Washi… 38.9 -77.5 313
## 9 2013 1 1 6 JFK MCO N593JB B6 Orlan… 28.4 -81.3 96
## 10 2013 1 1 6 LGA ORD N3ALAA AA Chica… 42.0 -87.9 668
## # ℹ 336,766 more rows
## # ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>
flights2 %>%
left_join(airports, c("origin" = "faa"))
## # A tibble: 336,776 Ă— 15
## year month day hour origin dest tailnum carrier name lat lon alt
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA Newar… 40.7 -74.2 18
## 2 2013 1 1 5 LGA IAH N24211 UA La Gu… 40.8 -73.9 22
## 3 2013 1 1 5 JFK MIA N619AA AA John … 40.6 -73.8 13
## 4 2013 1 1 5 JFK BQN N804JB B6 John … 40.6 -73.8 13
## 5 2013 1 1 6 LGA ATL N668DN DL La Gu… 40.8 -73.9 22
## 6 2013 1 1 5 EWR ORD N39463 UA Newar… 40.7 -74.2 18
## 7 2013 1 1 6 EWR FLL N516JB B6 Newar… 40.7 -74.2 18
## 8 2013 1 1 6 LGA IAD N829AS EV La Gu… 40.8 -73.9 22
## 9 2013 1 1 6 JFK MCO N593JB B6 John … 40.6 -73.8 13
## 10 2013 1 1 6 LGA ORD N3ALAA AA La Gu… 40.8 -73.9 22
## # ℹ 336,766 more rows
## # ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>