library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.6
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(nycflights13)

Working with Relational Data

So far you have been working with data in one table. Usually stored as a tibble or a dataframe. But IRL (and even the the Metaverse :)) it is a rare thing to work with only one table. Most frequently this takes the forms of working with a relational base. And while there exist specialized languages, most notably SQL, to work with relational data bases, it is useful to be able to perform some of the same functions in your development language, whether that be R, Python, Julia, Matlap, Tableau, even Excel.

Last week you worked with the nycflights13 data base, but you worked only with the flights table. I believe I mentioned the fact that nycflights13 had other tables as well.

Task 1

1. Since this is a new project, you will have to install both nycflights13 and tidyverse. Use install.packages but do the installation in the console pane.

2. Load these packages in the first code chunk of this project.

3. Use the help system to answer the following questions:

Answer) 5 (airlines, airports, flights, planes, weather)

b. How many airports are represented in the table?

Answer) 1,458 airports
airports
## # A tibble: 1,458 x 8
##    faa   name                       lat    lon   alt    tz dst   tzone          
##    <chr> <chr>                    <dbl>  <dbl> <dbl> <dbl> <chr> <chr>          
##  1 04G   Lansdowne Airport         41.1  -80.6  1044    -5 A     America/New_Yo…
##  2 06A   Moton Field Municipal A…  32.5  -85.7   264    -6 A     America/Chicago
##  3 06C   Schaumburg Regional       42.0  -88.1   801    -6 A     America/Chicago
##  4 06N   Randall Airport           41.4  -74.4   523    -5 A     America/New_Yo…
##  5 09J   Jekyll Island Airport     31.1  -81.4    11    -5 A     America/New_Yo…
##  6 0A9   Elizabethton Municipal …  36.4  -82.2  1593    -5 A     America/New_Yo…
##  7 0G6   Williams County Airport   41.5  -84.5   730    -5 A     America/New_Yo…
##  8 0G7   Finger Lakes Regional A…  42.9  -76.8   492    -5 A     America/New_Yo…
##  9 0P2   Shoestring Aviation Air…  39.8  -76.6  1000    -5 U     America/New_Yo…
## 10 0S9   Jefferson County Intl     48.1 -123.    108    -8 A     America/Los_An…
## # … with 1,448 more rows

c. What are the variables in planes table?

Answer) tailnum, year, type, manufacturer, model, engines, seats, speed, engine
variable.names(planes)
## [1] "tailnum"      "year"         "type"         "manufacturer" "model"       
## [6] "engines"      "seats"        "speed"        "engine"

Introduction to relational data

You will find Chapter 13 in R for Data Science useful background reading.

The following passage is from R4DS:

"To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data:

  • Mutating joins, which add new variables to one data frame from matching observations in another.

  • Filtering joins, which filter observations from one data frame base

  • Set operations, which treat observations as if they were set elements.

The most common place to find relational data is in a relational database management system (or RDBMS), a term that encompasses almost all modern databases. If you’ve used a database before, you’ve almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different. Generally, dplyr is a little easier to use than SQL because dplyr is specialized to do data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren’t commonly needed for data analysis.d on whether or not they match an observation in the other table.

Task 2

Examine the following table and verify that the associations depicted between the data tables is correct. (also from R4DS)

1. What information can be gained that would not be in the flights data table? (be specific)

Answer) Since the keys of (year, month, day, hour, origin) are linked to the both weather table and the flights table, we can get weather information that was never known from the flights data table.
variable.names(weather)
##  [1] "origin"     "year"       "month"      "day"        "hour"      
##  [6] "temp"       "dewp"       "humid"      "wind_dir"   "wind_speed"
## [11] "wind_gust"  "precip"     "pressure"   "visib"      "time_hour"
variable.names(flights)
##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"

2. Suppose you wanted to approximate the route each plane flies from its origin to its destination. What variables would you need? What tables would you need to combine?

Answer) I need the flights and the airports tables. And I need the keys which are (origin, dest) from the flights table, and (faa, lat, lon) from the airports table and combine with inner_join().
variable.names(airports)
## [1] "faa"   "name"  "lat"   "lon"   "alt"   "tz"    "dst"   "tzone"
flights_route <- flights %>%
  inner_join(select(airports,
                    origin = faa, 
                    origin_lat = lat, 
                    origin_lon = lon),
             by = "origin") %>%
  inner_join(select(airports,
                    dest = faa,
                    dest_lat = lat,
                    dest_lon = lon),
             by = "dest") %>%
  select("origin", "dest", "origin_lat", "origin_lon", "dest_lat", "dest_lon")

flights_route
## # A tibble: 329,174 x 6
##    origin dest  origin_lat origin_lon dest_lat dest_lon
##    <chr>  <chr>      <dbl>      <dbl>    <dbl>    <dbl>
##  1 EWR    IAH         40.7      -74.2     30.0    -95.3
##  2 LGA    IAH         40.8      -73.9     30.0    -95.3
##  3 JFK    MIA         40.6      -73.8     25.8    -80.3
##  4 LGA    ATL         40.8      -73.9     33.6    -84.4
##  5 EWR    ORD         40.7      -74.2     42.0    -87.9
##  6 EWR    FLL         40.7      -74.2     26.1    -80.2
##  7 LGA    IAD         40.8      -73.9     38.9    -77.5
##  8 JFK    MCO         40.6      -73.8     28.4    -81.3
##  9 LGA    ORD         40.8      -73.9     42.0    -87.9
## 10 JFK    PBI         40.6      -73.8     26.7    -80.1
## # … with 329,164 more rows

3. There is connection between weather and airports, not shown in the figure above. What is the relationship and how would it appear?

Answer) There is connection between origin in the weather table and origin that can be linked to faa(foreign key) in the airports table.
variable.names(airports)
## [1] "faa"   "name"  "lat"   "lon"   "alt"   "tz"    "dst"   "tzone"
variable.names(weather)
##  [1] "origin"     "year"       "month"      "day"        "hour"      
##  [6] "temp"       "dewp"       "humid"      "wind_dir"   "wind_speed"
## [11] "wind_gust"  "precip"     "pressure"   "visib"      "time_hour"

4. weather only contains information for the origin (NYC) airports. If it contained weather records for all airports in the USA, what additional relation would it define with flights?

Answer) The keys which are year, month, day, hour, origin from the weather table are the foreign keys, and the keys which are year, month, day, hour, dest from the flights table are the primary keys. That would also match the key connecting weather with destinations.

Connecting variables are called keys. By definition a key is a variable or set of variables that uniquely identifies an observation.

Types of keys

  1. primary key uniquely identifies an observation in its own table.

  2. foreign key uniquely identifies an observation in another table.

Definition: A primary key and a matching foreign key in another table define a relation.

Taks 3

1. A variable can be both a primary and foreign key? Can you give an example from nycflights13?

Answer) Yes, a variable can be both a primary and foreign key. The keys which are (year, month, day, hour) from the weather table are the foreign keys, but the same keys which are (year, month, day, hour) from the flights table are the primary keys.

2. How can you verify that a key is a primary key? (Hint, you will probably need to write some code!)

Answer) The primary key for flights is (year, month, day, hour, flight). The columns (year, month, day, hour) are not a primary key since flights can fly on different days. The foreign key connects the keys of two relational tables. Uniquely identifies data in relational tables.
flights %>%
  count(year, month, day, hour, flight) %>%
  filter(n > 1) %>%
  nrow()
## [1] 2945

3. What is the primary key for flights?

Answer) The primary key for flights is (year, month, day, hour, flight).

If a table does not have a primary key then if a primary key is needed, then one can be created using mutate() and row_number(), such a created key is called a surrogate key.


Task 4

1. Are relations one-to-one or one-to-many?

Answer) The relations are one-to-many in common. For example, each plane has many travels. (one plane : many travels)

2. Can there be a many-to-many relations?

Answer) Of course, there can be a many-to-many relations. If you see the nycflight13 data, there is a many-to-many relation between airlines and planes.

3. What variables would you use to create a surrogate key for flights?

flights %>%
  arrange(year, month, day, sched_dep_time, carrier, flight) %>%
  mutate(flight_sur = row_number()) %>%
  glimpse()
## Rows: 336,776
## Columns: 20
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 559, 558, 559, 558, 558, 557, …
## $ sched_dep_time <int> 515, 529, 540, 545, 558, 559, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -4, 0, -2, -1, -2, -2, -3, NA, 1, 0, -5, -…
## $ arr_time       <int> 830, 850, 923, 1004, 740, 702, 753, 941, 849, 853, 838,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 728, 706, 745, 910, 851, 856, 846,…
## $ arr_delay      <dbl> 11, 20, 33, -18, 12, -4, 8, 31, -2, -3, -8, NA, -6, -7,…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "UA", "B6", "AA", "AA", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 1696, 1806, 301, 707, 49, 71, 79…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N39463", "N708…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "EWR", "JFK", "LGA", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ORD", "BOS", "ORD", "DFW",…
## $ air_time       <dbl> 227, 227, 160, 183, 150, 44, 138, 257, 149, 158, 140, N…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 719, 187, 733, 1389, 1028, 1005…
## $ hour           <dbl> 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 58, 59, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
## $ flight_sur     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …

4. Identify the keys in the following datasets:

* Lahman::Batting

Answer) The primary key is (playerID). Other keys are many duplicated values.
#Lahman::Batting

Lahman::Batting %>%
  count(playerID) %>%
  filter(n > 1) %>%
  nrow()
## [1] 14748

* babynames::babynames

Answer) The primary keys are (year, sex, name) since it only can be distinguished the babies. There might have many duplicated values.
#babynames::babynames

babynames::babynames %>%
  count(year, sex, name) %>%
  filter(n > 1) %>%
  nrow()
## [1] 0

* nasaweather::atmos

Answer) The primary keys are (lat, long, year, month). We can get the information with those keys.
#nasaweather::atmos

nasaweather::atmos %>%
  count(lat, long, year, month) %>%
  filter(n > 1) %>%
  nrow()
## [1] 0

* fueleconomy::vehicles

Answer) The primary key is (id). This key is only can be distinguished each data.
#fueleconomy::vehicles

fueleconomy::vehicles %>%
  count(id) %>%
  filter(n > 1) %>%
  nrow()
## [1] 0

* ggplot2::diamonds

Answer) The primary keys are (carat, cut, color, clarity). We can be distinguished each data with those keys.
#ggplot2::diamonds

ggplot2::diamonds %>%
  count(carat, cut, color, clarity) %>%
  filter(n > 1) %>%
  nrow()
## [1] 8080

Mutating joins

A mutating join allows you to combine variables from two tables.

Example:

flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 x 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1     5 EWR    IAH   N14228  UA     
##  2  2013     1     1     5 LGA    IAH   N24211  UA     
##  3  2013     1     1     5 JFK    MIA   N619AA  AA     
##  4  2013     1     1     5 JFK    BQN   N804JB  B6     
##  5  2013     1     1     6 LGA    ATL   N668DN  DL     
##  6  2013     1     1     5 EWR    ORD   N39463  UA     
##  7  2013     1     1     6 EWR    FLL   N516JB  B6     
##  8  2013     1     1     6 LGA    IAD   N829AS  EV     
##  9  2013     1     1     6 JFK    MCO   N593JB  B6     
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
## # … with 336,766 more rows

Suppose you wand to add the full airline name to the flights2 data.

You can combine airlines and flights2 data frames with left_join()

flights2 %>%
  select(-origin,-dest) %>%
  left_join(airlines,by = "carrier")
## # A tibble: 336,776 x 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 N3ALAA  AA      American Airlines Inc.  
## # … with 336,766 more rows

Understanding joins

Consider the data frames:

We can make this data as follows:

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)
x
## # A tibble: 3 x 2
##     key val_x
##   <dbl> <chr>
## 1     1 x1   
## 2     2 x2   
## 3     3 x3
y
## # A tibble: 3 x 2
##     key val_y
##   <dbl> <chr>
## 1     1 y1   
## 2     2 y2   
## 3     4 y3

This type of join is called an inner join which means pairs of observations are matched when their keys are equal.

x %>% 
  inner_join(y, by = "key")
## # A tibble: 2 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2
#> # A tibble: 2 x 3
#>     key val_x val_y
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1   
#> 2     2 x2    y2

notice: unmatched rows are not included in the result.

Outer Joins

While an inner join keeps only observations that appear in both tables and outer join keeps all observations that appear in one of the tables.

What Happens when we have duplicate keys

All the diagrams above assume that there are no duplicate keys in a data set. Is this even true for flights? Can you think of an example?

Here is a diagram from Wickham and Grolemund illustrating the situation

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2"
)
left_join(x, y, by = "key")
## # A tibble: 4 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     2 x3    y2   
## 4     1 x4    y1
#> # A tibble: 4 x 3
#>     key val_x val_y
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1   
#> 2     2 x2    y2   
#> 3     2 x3    y2   
#> 4     1 x4    y1

Note: in this case, key is a foreign key in x and a primary key in y.

If both tables have duplicate keys the keys do not uniquely identify an observation. In this case a join results in all possible combinations (cartesian product)

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     3, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     2, "y3",
     3, "y4"
)
left_join(x, y, by = "key")
## # A tibble: 6 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     2 x2    y3   
## 4     2 x3    y2   
## 5     2 x3    y3   
## 6     3 x4    y4

Defining Key Columns

In the above examples the key values were single variables and in left_jointhe ke was encoded by by = "key". But one can use other values for by= the default is by = NULL. What does this mean?

#let's get flights2 back to eliminate extraneous columns
flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)

flights2 %>%
  left_join(weather) #note I'm taking all the defaults
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 18
##     year month   day  hour origin dest  tailnum carrier  temp  dewp humid
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
##  2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
##  3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
##  4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
##  5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
##  6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
##  7  2013     1     1     6 EWR    FLL   N516JB  B6       37.9  28.0  67.2
##  8  2013     1     1     6 LGA    IAD   N829AS  EV       39.9  25.0  54.8
##  9  2013     1     1     6 JFK    MCO   N593JB  B6       37.9  27.0  64.3
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA       39.9  25.0  54.8
## # … with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>
## Joining, by = c("year", "month", "day", "hour", "origin")

It means that we join by mathcing on all common variables, see console message above.

On the other hand if match on a key, for example tailnum is a foreign key in flights for planes

flights2 %>%
  left_join(planes, by="tailnum")
## # A tibble: 336,776 x 16
##    year.x month   day  hour origin dest  tailnum carrier year.y type            
##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>           
##  1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixed wing mult…
##  2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixed wing mult…
##  3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixed wing mult…
##  4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixed wing mult…
##  5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixed wing mult…
##  6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixed wing mult…
##  7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixed wing mult…
##  8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixed wing mult…
##  9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixed wing mult…
## 10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA>            
## # … with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>

suppose we wanted to draw a map of all destination air ports for flights from NewYork in 2013

flights2%>%
  left_join(airports, c("dest"="faa"))
## # A tibble: 336,776 x 15
##     year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Georg…  30.0 -95.3    97
##  2  2013     1     1     5 LGA    IAH   N24211  UA      Georg…  30.0 -95.3    97
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      Miami…  25.8 -80.3     8
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>    NA    NA      NA
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Harts…  33.6 -84.4  1026
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Chica…  42.0 -87.9   668
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort …  26.1 -80.2     9
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      Washi…  38.9 -77.5   313
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      Orlan…  28.4 -81.3    96
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chica…  42.0 -87.9   668
## # … with 336,766 more rows, and 3 more variables: tz <dbl>, dst <chr>,
## #   tzone <chr>

Task 5 Homework - Exercises from 13.3.6

Answer) I couldn’t find Exercise 13.3.6, but I did all of Exercise 13 problems for extra credits in other RPubs. Thank you :)


This work is based on “R for Data Science” by Hadley and Grolemund. Tt is licensed under

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/us/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.