library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(nycflights13)
So far you have been working with data in one table. Usually stored as a tibble or a dataframe. But IRL (and even the the Metaverse :)) it is a rare thing to work with only one table. Most frequently this takes the forms of working with a relational base. And while there exist specialized languages, most notably SQL, to work with relational data bases, it is useful to be able to perform some of the same functions in your development language, whether that be R, Python, Julia, Matlap, Tableau, even Excel.
Last week you worked with the nycflights13 data base, but you worked only with the flights table. I believe I mentioned the fact that nycflights13 had other tables as well.
nycflights13 and tidyverse. Use install.packages but do the installation in the console pane.airports
## # A tibble: 1,458 x 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_Yo…
## 2 06A Moton Field Municipal A… 32.5 -85.7 264 -6 A America/Chicago
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chicago
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_Yo…
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_Yo…
## 6 0A9 Elizabethton Municipal … 36.4 -82.2 1593 -5 A America/New_Yo…
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/New_Yo…
## 8 0G7 Finger Lakes Regional A… 42.9 -76.8 492 -5 A America/New_Yo…
## 9 0P2 Shoestring Aviation Air… 39.8 -76.6 1000 -5 U America/New_Yo…
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/Los_An…
## # … with 1,448 more rows
variable.names(planes)
## [1] "tailnum" "year" "type" "manufacturer" "model"
## [6] "engines" "seats" "speed" "engine"
You will find Chapter 13 in R for Data Science useful background reading.
The following passage is from R4DS:
"To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data:
Mutating joins, which add new variables to one data frame from matching observations in another.
Filtering joins, which filter observations from one data frame base
Set operations, which treat observations as if they were set elements.
The most common place to find relational data is in a relational database management system (or RDBMS), a term that encompasses almost all modern databases. If you’ve used a database before, you’ve almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different. Generally, dplyr is a little easier to use than SQL because dplyr is specialized to do data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren’t commonly needed for data analysis.d on whether or not they match an observation in the other table.
Examine the following table and verify that the associations depicted between the data tables is correct. (also from R4DS)
variable.names(weather)
## [1] "origin" "year" "month" "day" "hour"
## [6] "temp" "dewp" "humid" "wind_dir" "wind_speed"
## [11] "wind_gust" "precip" "pressure" "visib" "time_hour"
variable.names(flights)
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour"
variable.names(airports)
## [1] "faa" "name" "lat" "lon" "alt" "tz" "dst" "tzone"
flights_route <- flights %>%
inner_join(select(airports,
origin = faa,
origin_lat = lat,
origin_lon = lon),
by = "origin") %>%
inner_join(select(airports,
dest = faa,
dest_lat = lat,
dest_lon = lon),
by = "dest") %>%
select("origin", "dest", "origin_lat", "origin_lon", "dest_lat", "dest_lon")
flights_route
## # A tibble: 329,174 x 6
## origin dest origin_lat origin_lon dest_lat dest_lon
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 EWR IAH 40.7 -74.2 30.0 -95.3
## 2 LGA IAH 40.8 -73.9 30.0 -95.3
## 3 JFK MIA 40.6 -73.8 25.8 -80.3
## 4 LGA ATL 40.8 -73.9 33.6 -84.4
## 5 EWR ORD 40.7 -74.2 42.0 -87.9
## 6 EWR FLL 40.7 -74.2 26.1 -80.2
## 7 LGA IAD 40.8 -73.9 38.9 -77.5
## 8 JFK MCO 40.6 -73.8 28.4 -81.3
## 9 LGA ORD 40.8 -73.9 42.0 -87.9
## 10 JFK PBI 40.6 -73.8 26.7 -80.1
## # … with 329,164 more rows
weather and airports, not shown in the figure above. What is the relationship and how would it appear?variable.names(airports)
## [1] "faa" "name" "lat" "lon" "alt" "tz" "dst" "tzone"
variable.names(weather)
## [1] "origin" "year" "month" "day" "hour"
## [6] "temp" "dewp" "humid" "wind_dir" "wind_speed"
## [11] "wind_gust" "precip" "pressure" "visib" "time_hour"
weather only contains information for the origin (NYC) airports. If it contained weather records for all airports in the USA, what additional relation would it define with flights?Connecting variables are called keys. By definition a key is a variable or set of variables that uniquely identifies an observation.
primary key uniquely identifies an observation in its own table.
foreign key uniquely identifies an observation in another table.
nycflights13?flights %>%
count(year, month, day, hour, flight) %>%
filter(n > 1) %>%
nrow()
## [1] 2945
flights?If a table does not have a primary key then if a primary key is needed, then one can be created using mutate() and row_number(), such a created key is called a surrogate key.
flights?flights %>%
arrange(year, month, day, sched_dep_time, carrier, flight) %>%
mutate(flight_sur = row_number()) %>%
glimpse()
## Rows: 336,776
## Columns: 20
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 559, 558, 559, 558, 558, 557, …
## $ sched_dep_time <int> 515, 529, 540, 545, 558, 559, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -4, 0, -2, -1, -2, -2, -3, NA, 1, 0, -5, -…
## $ arr_time <int> 830, 850, 923, 1004, 740, 702, 753, 941, 849, 853, 838,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 728, 706, 745, 910, 851, 856, 846,…
## $ arr_delay <dbl> 11, 20, 33, -18, 12, -4, 8, 31, -2, -3, -8, NA, -6, -7,…
## $ carrier <chr> "UA", "UA", "AA", "B6", "UA", "B6", "AA", "AA", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 1696, 1806, 301, 707, 49, 71, 79…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N39463", "N708…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "EWR", "JFK", "LGA", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ORD", "BOS", "ORD", "DFW",…
## $ air_time <dbl> 227, 227, 160, 183, 150, 44, 138, 257, 149, 158, 140, N…
## $ distance <dbl> 1400, 1416, 1089, 1576, 719, 187, 733, 1389, 1028, 1005…
## $ hour <dbl> 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 45, 58, 59, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
## $ flight_sur <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
Lahman::Batting#Lahman::Batting
Lahman::Batting %>%
count(playerID) %>%
filter(n > 1) %>%
nrow()
## [1] 14748
babynames::babynames#babynames::babynames
babynames::babynames %>%
count(year, sex, name) %>%
filter(n > 1) %>%
nrow()
## [1] 0
nasaweather::atmos#nasaweather::atmos
nasaweather::atmos %>%
count(lat, long, year, month) %>%
filter(n > 1) %>%
nrow()
## [1] 0
fueleconomy::vehicles#fueleconomy::vehicles
fueleconomy::vehicles %>%
count(id) %>%
filter(n > 1) %>%
nrow()
## [1] 0
ggplot2::diamonds#ggplot2::diamonds
ggplot2::diamonds %>%
count(carat, cut, color, clarity) %>%
filter(n > 1) %>%
nrow()
## [1] 8080
A mutating join allows you to combine variables from two tables.
Example:
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 x 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # … with 336,766 more rows
Suppose you wand to add the full airline name to the flights2 data.
You can combine airlines and flights2 data frames with left_join()
flights2 %>%
select(-origin,-dest) %>%
left_join(airlines,by = "carrier")
## # A tibble: 336,776 x 7
## year month day hour tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr>
## 1 2013 1 1 5 N14228 UA United Air Lines Inc.
## 2 2013 1 1 5 N24211 UA United Air Lines Inc.
## 3 2013 1 1 5 N619AA AA American Airlines Inc.
## 4 2013 1 1 5 N804JB B6 JetBlue Airways
## 5 2013 1 1 6 N668DN DL Delta Air Lines Inc.
## 6 2013 1 1 5 N39463 UA United Air Lines Inc.
## 7 2013 1 1 6 N516JB B6 JetBlue Airways
## 8 2013 1 1 6 N829AS EV ExpressJet Airlines Inc.
## 9 2013 1 1 6 N593JB B6 JetBlue Airways
## 10 2013 1 1 6 N3ALAA AA American Airlines Inc.
## # … with 336,766 more rows
Consider the data frames:
We can make this data as follows:
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
3, "x3"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
4, "y3"
)
x
## # A tibble: 3 x 2
## key val_x
## <dbl> <chr>
## 1 1 x1
## 2 2 x2
## 3 3 x3
y
## # A tibble: 3 x 2
## key val_y
## <dbl> <chr>
## 1 1 y1
## 2 2 y2
## 3 4 y3
This type of join is called an inner join which means pairs of observations are matched when their keys are equal.
x %>%
inner_join(y, by = "key")
## # A tibble: 2 x 3
## key val_x val_y
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
#> # A tibble: 2 x 3
#> key val_x val_y
#> <dbl> <chr> <chr>
#> 1 1 x1 y1
#> 2 2 x2 y2
notice: unmatched rows are not included in the result.
While an inner join keeps only observations that appear in both tables and outer join keeps all observations that appear in one of the tables.
All the diagrams above assume that there are no duplicate keys in a data set. Is this even true for flights? Can you think of an example?
Here is a diagram from Wickham and Grolemund illustrating the situation
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
1, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2"
)
left_join(x, y, by = "key")
## # A tibble: 4 x 3
## key val_x val_y
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
## 3 2 x3 y2
## 4 1 x4 y1
#> # A tibble: 4 x 3
#> key val_x val_y
#> <dbl> <chr> <chr>
#> 1 1 x1 y1
#> 2 2 x2 y2
#> 3 2 x3 y2
#> 4 1 x4 y1
Note: in this case, key is a foreign key in x and a primary key in y.
If both tables have duplicate keys the keys do not uniquely identify an observation. In this case a join results in all possible combinations (cartesian product)
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
3, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
2, "y3",
3, "y4"
)
left_join(x, y, by = "key")
## # A tibble: 6 x 3
## key val_x val_y
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
## 3 2 x2 y3
## 4 2 x3 y2
## 5 2 x3 y3
## 6 3 x4 y4
In the above examples the key values were single variables and in left_jointhe ke was encoded by by = "key". But one can use other values for by= the default is by = NULL. What does this mean?
#let's get flights2 back to eliminate extraneous columns
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2 %>%
left_join(weather) #note I'm taking all the defaults
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 18
## year month day hour origin dest tailnum carrier temp dewp humid
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA 39.0 28.0 64.4
## 2 2013 1 1 5 LGA IAH N24211 UA 39.9 25.0 54.8
## 3 2013 1 1 5 JFK MIA N619AA AA 39.0 27.0 61.6
## 4 2013 1 1 5 JFK BQN N804JB B6 39.0 27.0 61.6
## 5 2013 1 1 6 LGA ATL N668DN DL 39.9 25.0 54.8
## 6 2013 1 1 5 EWR ORD N39463 UA 39.0 28.0 64.4
## 7 2013 1 1 6 EWR FLL N516JB B6 37.9 28.0 67.2
## 8 2013 1 1 6 LGA IAD N829AS EV 39.9 25.0 54.8
## 9 2013 1 1 6 JFK MCO N593JB B6 37.9 27.0 64.3
## 10 2013 1 1 6 LGA ORD N3ALAA AA 39.9 25.0 54.8
## # … with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## # wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>
## Joining, by = c("year", "month", "day", "hour", "origin")
It means that we join by mathcing on all common variables, see console message above.
On the other hand if match on a key, for example tailnum is a foreign key in flights for planes
flights2 %>%
left_join(planes, by="tailnum")
## # A tibble: 336,776 x 16
## year.x month day hour origin dest tailnum carrier year.y type
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixed wing mult…
## 2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixed wing mult…
## 3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixed wing mult…
## 4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixed wing mult…
## 5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixed wing mult…
## 6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixed wing mult…
## 7 2013 1 1 6 EWR FLL N516JB B6 2000 Fixed wing mult…
## 8 2013 1 1 6 LGA IAD N829AS EV 1998 Fixed wing mult…
## 9 2013 1 1 6 JFK MCO N593JB B6 2004 Fixed wing mult…
## 10 2013 1 1 6 LGA ORD N3ALAA AA NA <NA>
## # … with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## # model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
suppose we wanted to draw a map of all destination air ports for flights from NewYork in 2013
flights2%>%
left_join(airports, c("dest"="faa"))
## # A tibble: 336,776 x 15
## year month day hour origin dest tailnum carrier name lat lon alt
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA Georg… 30.0 -95.3 97
## 2 2013 1 1 5 LGA IAH N24211 UA Georg… 30.0 -95.3 97
## 3 2013 1 1 5 JFK MIA N619AA AA Miami… 25.8 -80.3 8
## 4 2013 1 1 5 JFK BQN N804JB B6 <NA> NA NA NA
## 5 2013 1 1 6 LGA ATL N668DN DL Harts… 33.6 -84.4 1026
## 6 2013 1 1 5 EWR ORD N39463 UA Chica… 42.0 -87.9 668
## 7 2013 1 1 6 EWR FLL N516JB B6 Fort … 26.1 -80.2 9
## 8 2013 1 1 6 LGA IAD N829AS EV Washi… 38.9 -77.5 313
## 9 2013 1 1 6 JFK MCO N593JB B6 Orlan… 28.4 -81.3 96
## 10 2013 1 1 6 LGA ORD N3ALAA AA Chica… 42.0 -87.9 668
## # … with 336,766 more rows, and 3 more variables: tz <dbl>, dst <chr>,
## # tzone <chr>
This work is based on “R for Data Science” by Hadley and Grolemund. Tt is licensed under
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/us/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.