All datasets are found within R packages, which students should be able to identify and download as needed.
Exercises
Exercise 1
Consider the tables/datasets contained in nycflights13. Imagine you wanted to draw (approximately) the route each plane flies from its origin to its destination. What variables would you need? What tables would you need to combine?
Code
flights %>%left_join(airports, c("dest"="faa")) %>%select(dest, origin, lat, lon)
# A tibble: 336,776 × 4
dest origin lat lon
<chr> <chr> <dbl> <dbl>
1 IAH EWR 30.0 -95.3
2 IAH LGA 30.0 -95.3
3 MIA JFK 25.8 -80.3
4 BQN JFK NA NA
5 ATL LGA 33.6 -84.4
6 ORD EWR 42.0 -87.9
7 FLL EWR 26.1 -80.2
8 IAD LGA 38.9 -77.5
9 MCO JFK 28.4 -81.3
10 ORD LGA 42.0 -87.9
# … with 336,766 more rows
Solution
To draw the route each plane flies from its origin to its destination, we would need lattitude, longitude, origin, and destination. Thus, we would need to combine the airports and flights tables.
Exercise 2
A relationship between weather and airports is possible. What is the relationship and how would it appear in a diagram (i.e., which variables should be matched)?
Code
weather %>%left_join(airports, c("origin"="faa")) %>%select(origin, year:visib, lat, lon)
# A tibble: 26,115 × 16
origin year month day hour temp dewp humid wind_dir wind_speed wind_g…¹
<chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0 NA
8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4 NA
9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0 NA
10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8 NA
# … with 26,105 more rows, 5 more variables: precip <dbl>, pressure <dbl>,
# visib <dbl>, lat <dbl>, lon <dbl>, and abbreviated variable name ¹wind_gust
Origin and faa can be matched between weathers and airports. In a diagram, there should be a line connecting these two variables to one another.
# A tibble: 336,776 × 7
ID year month day sched_dep_time carrier flight
<int> <int> <int> <int> <int> <chr> <int>
1 1 2013 1 1 515 UA 1545
2 2 2013 1 1 529 UA 1714
3 3 2013 1 1 540 AA 1141
4 4 2013 1 1 545 B6 725
5 5 2013 1 1 558 UA 1696
6 6 2013 1 1 559 B6 1806
7 7 2013 1 1 600 AA 301
8 8 2013 1 1 600 AA 707
9 9 2013 1 1 600 B6 49
10 10 2013 1 1 600 B6 71
# … with 336,766 more rows
Exercise 4
For each of the following datasets, identify any key column[s] and specify whether they are primary or foreign keys. You might need to install some packages and read some documentation.
# A tibble: 0 × 5
# Groups: sex, name, year [0]
# … with 5 variables: year <dbl>, sex <chr>, name <chr>, n <int>, prop <dbl>
Code
vehicles %>%group_by(id) %>%filter(n() >1)
# A tibble: 0 × 12
# Groups: id [0]
# … with 12 variables: id <dbl>, make <chr>, model <chr>, year <dbl>,
# class <chr>, trans <chr>, drive <chr>, cyl <dbl>, displ <dbl>, fuel <chr>,
# hwy <dbl>, cty <dbl>
Code
diamonds %>%distinct() %>%nrow()
[1] 53794
Code
diamonds %>%nrow()
[1] 53940
For the batting, atmos, babynames, and vehicles datasets, a primary key is present to uniquely identify each data point, and the vehicles dataset is the only one with the primary key being one variable. For the diamonds dataset, though, there are less distinct rows than multiple rows, meaning there is no primary key.
Exercise 5
Is there a relationship between the age of a plane and its average arrival delay?
There appears to be a slight negative relationship between the age of a plane and its average arrival delay. I zoomed in on the x-axis to condense the plot to remove unusual values. There initially is a slight increase in average arrival delay as age inceases between 0-10 years, but the plot then shifts downwards to demonstrate the negative relationship.
Exercise 6
What weather conditions make it more likely to see a departure delay?
When more precipitation is present, there is more likely to be a longer delay. This is a very logical connection to make, though the relationship is less steep if you use the ylim() function to zoom in on the y-axis.
Exercise 7
What does it mean for a flight to have a missing tailnum? What do the tail numbers that don’t have a matching record in planes have in common? Hint: one variable explains ~90% of the problems.
Code
flights %>%filter(is.na(`tailnum`))
# A tibble: 2,512 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 2 NA 1545 NA NA 1910 NA AA
2 2013 1 2 NA 1601 NA NA 1735 NA UA
3 2013 1 3 NA 857 NA NA 1209 NA UA
4 2013 1 3 NA 645 NA NA 952 NA UA
5 2013 1 4 NA 845 NA NA 1015 NA 9E
6 2013 1 4 NA 1830 NA NA 2044 NA 9E
7 2013 1 5 NA 840 NA NA 1001 NA 9E
8 2013 1 7 NA 820 NA NA 958 NA 9E
9 2013 1 8 NA 1645 NA NA 1838 NA US
10 2013 1 9 NA 755 NA NA 1012 NA 9E
# … with 2,502 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
Code
flights %>%anti_join(planes, by ="tailnum") %>%group_by(carrier) %>%summarise(totalmissing =n()) %>%arrange(desc(totalmissing))
# A tibble: 10 × 2
carrier totalmissing
<chr> <int>
1 MQ 25397
2 AA 22558
3 UA 1693
4 9E 1044
5 B6 830
6 US 699
7 FL 187
8 DL 110
9 F9 50
10 WN 38
The documentation states that American Airways and Envoy Air report fleet numbers rather than tail numbers, meaning they can’t be matched; this led be to group together by carrier after using the anti_join function to clear out every flight that matched by tailnum between the flights and planes dataset. Ultimately, it did turn out that these two carriers accounted for a vast majoirty of the missing tailnums.
Exercise 8
Filter flights to show only the flights of planes that flew at least 100 times.
# A tibble: 1,357 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
# … with 1,347 more rows
Solution
‘anti_join(flights, airports, by = c(“dest” = “faa”))’ only shows flights in the airport dataset where the “faa” variable does not match with the destination; since the FAA list only shows flights inside the U.S, these would be foreign flights. anti_join(airports, flights, by = c("faa" = "dest")) only shows flights where the airport was not a destination for any flight in the flights dataset, and since the flights dataset only shows New York City flights, this would be every flight that did not leave from New York City.
Exercise 11
We might assume there’s a relationship between plane and airline where each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned above.