L08 Relational Data

Data Science 1 with R (STAT 301-1)

Author

YOUR NAME

Load Packages

Code
# Loading package(s)
library(tidyverse)
library(dplyr)
library(nycflights13)
library(Lahman)
library(babynames)
library(nasaweather)
library(fueleconomy)
library(ggplot2)
library(janitor)
data(flights)
data(airports)
data(weather)
data(Batting)
data(babynames)
data(atmos)
data(vehicles)
data(diamonds)
data(common)

Datasets

All datasets are found within R packages, which students should be able to identify and download as needed.

Exercises

Exercise 1

Consider the tables/datasets contained in nycflights13. Imagine you wanted to draw (approximately) the route each plane flies from its origin to its destination. What variables would you need? What tables would you need to combine?

Code
flights %>%
  left_join(airports, c("dest" = "faa")) %>%
  select(dest, origin, lat, lon)
# A tibble: 336,776 × 4
   dest  origin   lat   lon
   <chr> <chr>  <dbl> <dbl>
 1 IAH   EWR     30.0 -95.3
 2 IAH   LGA     30.0 -95.3
 3 MIA   JFK     25.8 -80.3
 4 BQN   JFK     NA    NA  
 5 ATL   LGA     33.6 -84.4
 6 ORD   EWR     42.0 -87.9
 7 FLL   EWR     26.1 -80.2
 8 IAD   LGA     38.9 -77.5
 9 MCO   JFK     28.4 -81.3
10 ORD   LGA     42.0 -87.9
# … with 336,766 more rows

Solution

To draw the route each plane flies from its origin to its destination, we would need lattitude, longitude, origin, and destination. Thus, we would need to combine the airports and flights tables.

Exercise 2

A relationship between weather and airports is possible. What is the relationship and how would it appear in a diagram (i.e., which variables should be matched)?

Code
weather %>%
  left_join(airports, c("origin" = "faa")) %>%
  select(origin, year:visib, lat, lon)
# A tibble: 26,115 × 16
   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_g…¹
   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>    <dbl>
 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4        NA
 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06       NA
 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5        NA
 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7        NA
 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7        NA
 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5        NA
 7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0        NA
 8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4        NA
 9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0        NA
10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8        NA
# … with 26,105 more rows, 5 more variables: precip <dbl>, pressure <dbl>,
#   visib <dbl>, lat <dbl>, lon <dbl>, and abbreviated variable name ¹​wind_gust

Origin and faa can be matched between weathers and airports. In a diagram, there should be a line connecting these two variables to one another.

Exercise 3

Add a surrogate key to flights.

Code
flights %>% 
  arrange(year, month, day, sched_dep_time, carrier, flight) %>%
  mutate(ID = row_number()) %>%
  select(ID, year, month, day, sched_dep_time, carrier, flight)
# A tibble: 336,776 × 7
      ID  year month   day sched_dep_time carrier flight
   <int> <int> <int> <int>          <int> <chr>    <int>
 1     1  2013     1     1            515 UA        1545
 2     2  2013     1     1            529 UA        1714
 3     3  2013     1     1            540 AA        1141
 4     4  2013     1     1            545 B6         725
 5     5  2013     1     1            558 UA        1696
 6     6  2013     1     1            559 B6        1806
 7     7  2013     1     1            600 AA         301
 8     8  2013     1     1            600 AA         707
 9     9  2013     1     1            600 B6          49
10    10  2013     1     1            600 B6          71
# … with 336,766 more rows

Exercise 4

For each of the following datasets, identify any key column[s] and specify whether they are primary or foreign keys. You might need to install some packages and read some documentation.

  • Lahman::Batting,
  • babynames::babynames
  • nasaweather::atmos
  • fueleconomy::vehicles
  • ggplot2::diamonds
Code
Batting %>%
  group_by(playerID, yearID, stint) %>%
  filter(n() > 1)
# A tibble: 0 × 22
# Groups:   playerID, yearID, stint [0]
# … with 22 variables: playerID <chr>, yearID <int>, stint <int>, teamID <fct>,
#   lgID <fct>, G <int>, AB <int>, R <int>, H <int>, X2B <int>, X3B <int>,
#   HR <int>, RBI <int>, SB <int>, CS <int>, BB <int>, SO <int>, IBB <int>,
#   HBP <int>, SH <int>, SF <int>, GIDP <int>
Code
atmos %>%
  group_by(lat, long, year, month) %>%
  filter(n() > 1)
# A tibble: 0 × 11
# Groups:   lat, long, year, month [0]
# … with 11 variables: lat <dbl>, long <dbl>, year <int>, month <int>,
#   surftemp <dbl>, temp <dbl>, pressure <dbl>, ozone <dbl>, cloudlow <dbl>,
#   cloudmid <dbl>, cloudhigh <dbl>
Code
babynames %>%
  group_by(sex, name, year) %>%
  filter(n() > 1)
# A tibble: 0 × 5
# Groups:   sex, name, year [0]
# … with 5 variables: year <dbl>, sex <chr>, name <chr>, n <int>, prop <dbl>
Code
vehicles %>%
  group_by(id) %>%
  filter(n() > 1)
# A tibble: 0 × 12
# Groups:   id [0]
# … with 12 variables: id <dbl>, make <chr>, model <chr>, year <dbl>,
#   class <chr>, trans <chr>, drive <chr>, cyl <dbl>, displ <dbl>, fuel <chr>,
#   hwy <dbl>, cty <dbl>
Code
diamonds %>%
  distinct() %>%
  nrow()
[1] 53794
Code
diamonds %>%
  nrow()
[1] 53940

For the batting, atmos, babynames, and vehicles datasets, a primary key is present to uniquely identify each data point, and the vehicles dataset is the only one with the primary key being one variable. For the diamonds dataset, though, there are less distinct rows than multiple rows, meaning there is no primary key.

Exercise 5

Is there a relationship between the age of a plane and its average arrival delay?

Code
planes %>%
  left_join(flights, by = "tailnum") %>%
  clean_names() %>%
  mutate(age = `year_y` - `year_x`) %>%
  filter(!is.na(age)) %>%
  group_by(age) %>%
  summarise(averagedelay = mean(arr_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = age, y= averagedelay)) + geom_point() + geom_smooth(method=lm, se=FALSE) + xlim (0, 40)

Solution

There appears to be a slight negative relationship between the age of a plane and its average arrival delay. I zoomed in on the x-axis to condense the plot to remove unusual values. There initially is a slight increase in average arrival delay as age inceases between 0-10 years, but the plot then shifts downwards to demonstrate the negative relationship.

Exercise 6

What weather conditions make it more likely to see a departure delay?

Code
flights %>%
  inner_join(weather, by= c("origin" = "origin", "year" = "year", "month" = "month", "day" = "day", "hour" = "hour")) %>%
  group_by(precip) %>%
  summarise(averagedelay = mean(dep_delay, na.rm = TRUE)) %>% 
  ggplot(aes(x = precip, y= averagedelay)) + geom_point() + geom_smooth(method=lm, se=FALSE) + ylim (0, 100)

Solution

When more precipitation is present, there is more likely to be a longer delay. This is a very logical connection to make, though the relationship is less steep if you use the ylim() function to zoom in on the y-axis.

Exercise 7

What does it mean for a flight to have a missing tailnum? What do the tail numbers that don’t have a matching record in planes have in common? Hint: one variable explains ~90% of the problems.

Code
flights %>%
  filter(is.na(`tailnum`))
# A tibble: 2,512 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     2       NA       1545      NA      NA    1910      NA AA     
 2  2013     1     2       NA       1601      NA      NA    1735      NA UA     
 3  2013     1     3       NA        857      NA      NA    1209      NA UA     
 4  2013     1     3       NA        645      NA      NA     952      NA UA     
 5  2013     1     4       NA        845      NA      NA    1015      NA 9E     
 6  2013     1     4       NA       1830      NA      NA    2044      NA 9E     
 7  2013     1     5       NA        840      NA      NA    1001      NA 9E     
 8  2013     1     7       NA        820      NA      NA     958      NA 9E     
 9  2013     1     8       NA       1645      NA      NA    1838      NA US     
10  2013     1     9       NA        755      NA      NA    1012      NA 9E     
# … with 2,502 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
Code
flights %>%
  anti_join(planes, by = "tailnum") %>%
  group_by(carrier) %>%
  summarise(totalmissing = n()) %>%
  arrange(desc(totalmissing))
# A tibble: 10 × 2
   carrier totalmissing
   <chr>          <int>
 1 MQ             25397
 2 AA             22558
 3 UA              1693
 4 9E              1044
 5 B6               830
 6 US               699
 7 FL               187
 8 DL               110
 9 F9                50
10 WN                38

The documentation states that American Airways and Envoy Air report fleet numbers rather than tail numbers, meaning they can’t be matched; this led be to group together by carrier after using the anti_join function to clear out every flight that matched by tailnum between the flights and planes dataset. Ultimately, it did turn out that these two carriers accounted for a vast majoirty of the missing tailnums.

Exercise 8

Filter flights to show only the flights of planes that flew at least 100 times.

Code
flights %>%
  filter(!is.na(tailnum)) %>%
  group_by(tailnum) %>%
  summarise(totalflights = n()) %>%
  filter(totalflights > 100)
# A tibble: 1,200 × 2
   tailnum totalflights
   <chr>          <int>
 1 N0EGMQ           371
 2 N10156           153
 3 N10575           289
 4 N11106           129
 5 N11107           148
 6 N11109           148
 7 N11113           138
 8 N11119           148
 9 N11121           154
10 N11127           124
# … with 1,190 more rows

Exercise 9

Combine fueleconomy::vehicles and fueleconomy::common to find the records for only the most common models.

Code
vehicles %>%
  semi_join(common, by = c("make", "model"))
# A tibble: 14,531 × 12
      id make  model    year class     trans drive   cyl displ fuel    hwy   cty
   <dbl> <chr> <chr>   <dbl> <chr>     <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
 1  1833 Acura Integra  1986 Subcompa… Auto… Fron…     4   1.6 Regu…    28    22
 2  1834 Acura Integra  1986 Subcompa… Manu… Fron…     4   1.6 Regu…    28    23
 3  3037 Acura Integra  1987 Subcompa… Auto… Fron…     4   1.6 Regu…    28    22
 4  3038 Acura Integra  1987 Subcompa… Manu… Fron…     4   1.6 Regu…    28    23
 5  4183 Acura Integra  1988 Subcompa… Auto… Fron…     4   1.6 Regu…    27    22
 6  4184 Acura Integra  1988 Subcompa… Manu… Fron…     4   1.6 Regu…    28    23
 7  5303 Acura Integra  1989 Subcompa… Auto… Fron…     4   1.6 Regu…    27    22
 8  5304 Acura Integra  1989 Subcompa… Manu… Fron…     4   1.6 Regu…    28    23
 9  6442 Acura Integra  1990 Subcompa… Auto… Fron…     4   1.8 Regu…    24    20
10  6443 Acura Integra  1990 Subcompa… Manu… Fron…     4   1.8 Regu…    26    21
# … with 14,521 more rows

Solution

The semi_join function only brings in the observations with matching makes and models, so we can use it to find the most common models.

Exercise 10

What does anti_join(flights, airports, by = c("dest" = "faa")) tell you?

What does anti_join(airports, flights, by = c("faa" = "dest")) tell you?

Code
anti_join(flights, airports, by = c("dest" = "faa"))
# A tibble: 7,602 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      544        545      -1    1004    1022     -18 B6     
 2  2013     1     1      615        615       0    1039    1100     -21 B6     
 3  2013     1     1      628        630      -2    1137    1140      -3 AA     
 4  2013     1     1      701        700       1    1123    1154     -31 UA     
 5  2013     1     1      711        715      -4    1151    1206     -15 B6     
 6  2013     1     1      820        820       0    1254    1310     -16 B6     
 7  2013     1     1      820        820       0    1249    1329     -40 DL     
 8  2013     1     1      840        845      -5    1311    1350     -39 AA     
 9  2013     1     1      909        810      59    1331    1315      16 AA     
10  2013     1     1      913        918      -5    1346    1416     -30 UA     
# … with 7,592 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
Code
anti_join(airports, flights, by = c("faa" = "dest"))
# A tibble: 1,357 × 8
   faa   name                             lat    lon   alt    tz dst   tzone    
   <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
 1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
 2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
 3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
 4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
 5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
 6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
 7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
 8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
 9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
# … with 1,347 more rows

Solution

‘anti_join(flights, airports, by = c(“dest” = “faa”))’ only shows flights in the airport dataset where the “faa” variable does not match with the destination; since the FAA list only shows flights inside the U.S, these would be foreign flights. anti_join(airports, flights, by = c("faa" = "dest")) only shows flights where the airport was not a destination for any flight in the flights dataset, and since the flights dataset only shows New York City flights, this would be every flight that did not leave from New York City.

Exercise 11

We might assume there’s a relationship between plane and airline where each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned above.

Code
flights %>%
  select(carrier, tailnum) %>%
  group_by(tailnum) %>%
  summarize(n = length(unique(carrier))) %>%
  filter(n > 1)
# A tibble: 18 × 2
   tailnum     n
   <chr>   <int>
 1 N146PQ      2
 2 N153PQ      2
 3 N176PQ      2
 4 N181PQ      2
 5 N197PQ      2
 6 N200PQ      2
 7 N228PQ      2
 8 N232PQ      2
 9 N933AT      2
10 N935AT      2
11 N977AT      2
12 N978AT      2
13 N979AT      2
14 N981AT      2
15 N989AT      2
16 N990AT      2
17 N994AT      2
18 <NA>        7

Solution

Since there were specific planes flown by multiple carrier, each plane is not flown by a single airline. Thus, we would reject this hypothesis.

Challenge

Challenge is optional for all students, but we recommend trying them out!

Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference with the weather data. Can you see any patterns?