Disclaimer: The contents of this document come from Chapter 10. Relational Database with dplyr of R for Data Science (Wickham & Grolemund, 2017). This document is prepared for CP6521 Advanced GIS, a graduate-level city planning elective course at Georgia Tech in Spring 2019. For any question, contact the instructor, Yongsung Lee, Ph.D. via yongsung.lee(at)gatech.edu.
This document is also published on RPubs.
install.packages("tidyverse", repos = "http://cran.us.r-project.org", dependencies = TRUE)
install.packages("nycflights13", repos = "http://cran.us.r-project.org", dependencies = TRUE)
library(tidyverse)
library(nycflights13)

0. WeeklyHW #3

No 2. Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?

airport_most_flights <- flights %>%
  filter(!is.na(dep_delay)==TRUE) %>% 
  group_by(dest) %>%
  summarise(count = n()) %>%
  mutate(rank = dense_rank(desc(count))) %>%  # ranking airports by # of flights 
  filter(rank <= 30) # top 30 destinations by # of flights in 2013 
## Warning: package 'bindrcpp' was built under R version 3.5.2
flights %>% 
  semi_join(airport_most_flights, by = "dest") %>% # extract observations common in both tables, no additional variables  
  group_by(month, dest) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  group_by(dest) %>%
  #filter(n()==12) %>% # not necessariy because the top 30 airports have flights for 12 months 
  ggplot(aes(x = factor(month), y = reorder(dest, avg_dep_delay), fill = avg_dep_delay)) +
  geom_tile() +
  labs(x="Month",y="Top 30 Airports from NY, 2013", fill="Delay in Depature") + 
  scale_fill_continuous(type = "viridis", direction = -1) 

No 4. Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price.

diamonds %>%
  count(cut(carat, c(0, 0.35, 0.53, 0.9, 1.13, 6))) %>%
  mutate(prop = round(n/sum(n)*100, 1))
## # A tibble: 5 x 3
##   `cut(carat, c(0, 0.35, 0.53, 0.9, 1.13, 6))`     n  prop
##   <fct>                                        <int> <dbl>
## 1 (0,0.35]                                     11058  20.5
## 2 (0.35,0.53]                                  10527  19.5
## 3 (0.53,0.9]                                   12017  22.3
## 4 (0.9,1.13]                                    9651  17.9
## 5 (1.13,6]                                     10687  19.8
diamonds %>% 
  filter(carat > 0) %>% 
  ggplot(aes(x = cut, y = log(price+1), color = cut(carat, c(0, 0.35, 0.53, 0.9, 1.13, 6)))) +
  geom_boxplot() +
  theme(legend.position="bottom") + 
  labs(x="Quality of Cutting",y="Price (log-tranformed)", color="Carat (size)")  

1. Intro

Terms:

  1. Relational tables: multiple tables that are somehow related to one another.

  2. Relational Database Management System (RDMS): their operations are not alawys designed for data analysis.

What we do:

  1. Mutating join, which add new variables to one data frame from matching observations in another.

  2. Filtering join, which filter observations from one data frame based on whether or not they match an observation in the other table.

3. Set operation, which treat observations as if they were set elements.

Why we do:

  1. When answering questinos, it’s rare that we fine answers from a single table. Rather, we need multiple relational tables.

2. An Overview of nycflights13

Four “additional”" relational tables (i.e., tibbles)

airlines
## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.
airports
## # A tibble: 1,458 x 8
##    faa   name                   lat    lon   alt    tz dst   tzone        
##    <chr> <chr>                <dbl>  <dbl> <int> <dbl> <chr> <chr>        
##  1 04G   Lansdowne Airport     41.1  -80.6  1044    -5 A     America/New_~
##  2 06A   Moton Field Municip~  32.5  -85.7   264    -6 A     America/Chic~
##  3 06C   Schaumburg Regional   42.0  -88.1   801    -6 A     America/Chic~
##  4 06N   Randall Airport       41.4  -74.4   523    -5 A     America/New_~
##  5 09J   Jekyll Island Airpo~  31.1  -81.4    11    -5 A     America/New_~
##  6 0A9   Elizabethton Munici~  36.4  -82.2  1593    -5 A     America/New_~
##  7 0G6   Williams County Air~  41.5  -84.5   730    -5 A     America/New_~
##  8 0G7   Finger Lakes Region~  42.9  -76.8   492    -5 A     America/New_~
##  9 0P2   Shoestring Aviation~  39.8  -76.6  1000    -5 U     America/New_~
## 10 0S9   Jefferson County In~  48.1 -123.    108    -8 A     America/Los_~
## # ... with 1,448 more rows
planes
## # A tibble: 3,322 x 9
##    tailnum  year type      manufacturer  model  engines seats speed engine
##    <chr>   <int> <chr>     <chr>         <chr>    <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wi~ EMBRAER       EMB-1~       2    55    NA Turbo~
##  2 N102UW   1998 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  3 N103US   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  4 N104UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  5 N10575   2002 Fixed wi~ EMBRAER       EMB-1~       2    55    NA Turbo~
##  6 N105UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  7 N107US   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  8 N108UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  9 N109UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
## 10 N110UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
## # ... with 3,312 more rows
weather
## # A tibble: 26,115 x 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # ... with 26,105 more rows, and 5 more variables: wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>

3. Keys

When joining tables, we join two tables a time and use keys, (a set of) variable(s) that is(are) common in two tables.

planes %>%
  count(tailnum) %>%
  filter(n>1)
## # A tibble: 0 x 2
## # ... with 2 variables: tailnum <chr>, n <int>
# when multiple variables work as primary keys 
weather %>%
  count(year, month, day, hour, origin) %>%
  filter(n>1)
## # A tibble: 3 x 6
##    year month   day  hour origin     n
##   <dbl> <dbl> <int> <int> <chr>  <int>
## 1  2013    11     3     1 EWR        2
## 2  2013    11     3     1 JFK        2
## 3  2013    11     3     1 LGA        2

What if there is no primary key? It’s possible, and we create a pseudo-primary key in this case, a surrogate key.

flights %>%
  count(year, month, day, flight) %>%
  filter(n>1) 
## # A tibble: 29,768 x 5
##     year month   day flight     n
##    <int> <int> <int>  <int> <int>
##  1  2013     1     1      1     2
##  2  2013     1     1      3     2
##  3  2013     1     1      4     2
##  4  2013     1     1     11     3
##  5  2013     1     1     15     2
##  6  2013     1     1     21     2
##  7  2013     1     1     27     4
##  8  2013     1     1     31     2
##  9  2013     1     1     32     2
## 10  2013     1     1     35     2
## # ... with 29,758 more rows
flights %>%
  count(year, month, day, tailnum) %>%
  filter(n>1)
## # A tibble: 64,928 x 5
##     year month   day tailnum     n
##    <int> <int> <int> <chr>   <int>
##  1  2013     1     1 N0EGMQ      2
##  2  2013     1     1 N11189      2
##  3  2013     1     1 N11536      2
##  4  2013     1     1 N11544      3
##  5  2013     1     1 N11551      2
##  6  2013     1     1 N12540      2
##  7  2013     1     1 N12567      2
##  8  2013     1     1 N13123      2
##  9  2013     1     1 N13538      3
## 10  2013     1     1 N13566      3
## # ... with 64,918 more rows

A fundemental relation is one-to-many. One-to-one is its special case. Many-to-many relations are modeled as a combination of one-to-many and many-to-one relations: e.g., airport and airline.

4. Mutating Joins

This is to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.

# first make a narrower table 
flights2 <- flights %>%
  select(year:day, hour, origin, dest, tailnum, carrier) 
flights2 
## # A tibble: 336,776 x 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1     5 EWR    IAH   N14228  UA     
##  2  2013     1     1     5 LGA    IAH   N24211  UA     
##  3  2013     1     1     5 JFK    MIA   N619AA  AA     
##  4  2013     1     1     5 JFK    BQN   N804JB  B6     
##  5  2013     1     1     6 LGA    ATL   N668DN  DL     
##  6  2013     1     1     5 EWR    ORD   N39463  UA     
##  7  2013     1     1     6 EWR    FLL   N516JB  B6     
##  8  2013     1     1     6 LGA    IAD   N829AS  EV     
##  9  2013     1     1     6 JFK    MCO   N593JB  B6     
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
## # ... with 336,766 more rows
# next, bring in variables in airline to flights2 
flights2 %>%
  select(-origin, -dest) %>%
  left_join(airlines, by = "carrier") # airline is the second argument, whose variables are brought in  
## # A tibble: 336,776 x 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 N3ALAA  AA      American Airlines Inc.  
## # ... with 336,766 more rows
# the same operation by base R functions ... 
flights2 %>% 
  select(-origin, -dest) %>%
  mutate(name = airlines$name[match(carrier, airlines$carrier)])
## # A tibble: 336,776 x 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 N3ALAA  AA      American Airlines Inc.  
## # ... with 336,766 more rows
Four Main Types of Joins

x <- tribble(
  ~key, ~val_x, 
  1, "x1", 
  2, "x2", 
  3, "x3"
)

y <- tribble(
  ~key, ~val_y, 
  1, "y1", 
  2, "y2", 
  4, "y3"
)
1. Inner Joins

Unmatched rows are not included in the result, which may be dangerous.

x %>%
  inner_join(y, by = "key")
## # A tibble: 2 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2
2. Outer Joins
  • A left join keeps all observations in x.
  • A right join keeps all observations in y.
  • A full join keeps all observations in x and y.
3. Duplicate Keys

What if keys do not uniquely identify observations?

If only one table has this problem…

x <- tribble(
  ~key, ~val_x, 
  1, "x1",
  2, "x2", 
  2, "x3", 
  1, "x4"
)
y <- tribble(
  ~key, ~val_y, 
  1, "y1", 
  2, "y2"
)
left_join(x, y, by = "key")
## # A tibble: 4 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     2 x3    y2   
## 4     1 x4    y1

If both tables have this problem, you get all possible combications, the Cartesian product.

x <- tribble(
  ~key, ~val_x, 
  1, "x1", 
  2, "x2", 
  2, "x3", 
  3, "x4"
)
y <- tribble(
  ~key, ~val_y, 
  1, "y1", 
  2, "y2", 
  2, "y3", 
  3, "y4"
)
left_join(x, y, by = "key")
## # A tibble: 6 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     2 x2    y3   
## 4     2 x3    y2   
## 5     2 x3    y3   
## 6     3 x4    y4
Defining the Key Columns
  • The default, by = NULL uses all variables that appear in both tables.
flights2 %>%
  left_join(weather)
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 18
##     year month   day  hour origin dest  tailnum carrier  temp  dewp humid
##    <dbl> <dbl> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
##  2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
##  3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
##  4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
##  5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
##  6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
##  7  2013     1     1     6 EWR    FLL   N516JB  B6       37.9  28.0  67.2
##  8  2013     1     1     6 LGA    IAD   N829AS  EV       39.9  25.0  54.8
##  9  2013     1     1     6 JFK    MCO   N593JB  B6       37.9  27.0  64.3
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA       39.9  25.0  54.8
## # ... with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>
  • A character vector, by = "some_key" uses only some of the common variables.
flights2 %>%
  left_join(planes, by = "tailnum")
## # A tibble: 336,776 x 16
##    year.x month   day  hour origin dest  tailnum carrier year.y type 
##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>
##  1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixe~
##  2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixe~
##  3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixe~
##  4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixe~
##  5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixe~
##  6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixe~
##  7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixe~
##  8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixe~
##  9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixe~
## 10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA> 
## # ... with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
  • A named character vector, by = c("a" = "b") matches variable a in table x to variable b in table y.
flights2 %>%
  left_join(airports, c("dest" = "faa"))
## # A tibble: 336,776 x 15
##     year month   day  hour origin dest  tailnum carrier name    lat   lon
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Geor~  30.0 -95.3
##  2  2013     1     1     5 LGA    IAH   N24211  UA      Geor~  30.0 -95.3
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      Miam~  25.8 -80.3
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>   NA    NA  
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Hart~  33.6 -84.4
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Chic~  42.0 -87.9
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort~  26.1 -80.2
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      Wash~  38.9 -77.5
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      Orla~  28.4 -81.3
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chic~  42.0 -87.9
## # ... with 336,766 more rows, and 4 more variables: alt <int>, tz <dbl>,
## #   dst <chr>, tzone <chr>
flights2 %>%
  left_join(airports, c("origin" = "faa"))
## # A tibble: 336,776 x 15
##     year month   day  hour origin dest  tailnum carrier name    lat   lon
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Newa~  40.7 -74.2
##  2  2013     1     1     5 LGA    IAH   N24211  UA      La G~  40.8 -73.9
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      John~  40.6 -73.8
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      John~  40.6 -73.8
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      La G~  40.8 -73.9
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Newa~  40.7 -74.2
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Newa~  40.7 -74.2
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      La G~  40.8 -73.9
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      John~  40.6 -73.8
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      La G~  40.8 -73.9
## # ... with 336,766 more rows, and 4 more variables: alt <int>, tz <dbl>,
## #   dst <chr>, tzone <chr>
Exercises
  1. Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:
install.packages("maps", repos = "http://cran.us.r-project.org", dependencies = TRUE)
library(maps)
airports %>%
  semi_join(flights, c("faa" = "dest")) %>%
  ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point() +
    coord_quickmap()

(Don’t worry if you don’t understand what semi_join() does — you’ll learn about it next.)

You might want to use the size or colour of the points to display the average delay for each airport.

  1. Add the location of the origin and destination (i.e. the lat and lon) to flights.

  2. Is there a relationship between the age of a plane and its delays?

  3. What weather conditions make it more likely to see a delay?

  4. What happened on June 13 2013? Display the spatial pattern of delays, and then use Google to cross-reference with the weather.

5. Filtering Joins

Filtering joins work the same way, but returning only matched observations, instead of appending new variables from the other table.

# Make a filtered summary table
# top 10 destination airports by the numnber of flights 
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
top_dest 
## # A tibble: 10 x 2
##    dest      n
##    <chr> <int>
##  1 ORD   17283
##  2 ATL   17215
##  3 LAX   16174
##  4 BOS   15508
##  5 MCO   14082
##  6 CLT   14064
##  7 SFO   13331
##  8 FLL   12055
##  9 MIA   11728
## 10 DCA    9705
# use base R 
# two issues: limited if we have multiple keys 
flights %>%
  filter(dest %in% top_dest$dest)
## # A tibble: 141,145 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      542            540         2      923
##  2  2013     1     1      554            600        -6      812
##  3  2013     1     1      554            558        -4      740
##  4  2013     1     1      555            600        -5      913
##  5  2013     1     1      557            600        -3      838
##  6  2013     1     1      558            600        -2      753
##  7  2013     1     1      558            600        -2      924
##  8  2013     1     1      558            600        -2      923
##  9  2013     1     1      559            559         0      702
## 10  2013     1     1      600            600         0      851
## # ... with 141,135 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
# use filtering joins 
flights %>%
  semi_join(top_dest)
## Joining, by = "dest"
## # A tibble: 141,145 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      542            540         2      923
##  2  2013     1     1      554            600        -6      812
##  3  2013     1     1      554            558        -4      740
##  4  2013     1     1      555            600        -5      913
##  5  2013     1     1      557            600        -3      838
##  6  2013     1     1      558            600        -2      753
##  7  2013     1     1      558            600        -2      924
##  8  2013     1     1      558            600        -2      923
##  9  2013     1     1      559            559         0      702
## 10  2013     1     1      600            600         0      851
## # ... with 141,135 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

anti_join() works in the opposite way.

anti_join() is useful when we want to check observations in a table, which do not have a match with the other table (i.e., join mismatches).

flights %>%
  anti_join(planes, by = "tailnum") %>% 
  count(tailnum, sort = TRUE)
## # A tibble: 722 x 2
##    tailnum     n
##    <chr>   <int>
##  1 <NA>     2512
##  2 N725MQ    575
##  3 N722MQ    513
##  4 N723MQ    507
##  5 N713MQ    483
##  6 N735MQ    396
##  7 N0EGMQ    371
##  8 N534MQ    364
##  9 N542MQ    363
## 10 N531MQ    349
## # ... with 712 more rows
Exercises
  1. What does it mean for a flight to have a missing tailnum? What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)

  2. Filter flights to only show flights with planes that have flown at least 100 flights.

3. Combine fueleconomy::vehicles and fueleconomy::common to find only the records for the most common models.

  1. Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?

  2. What does anti_join(flights, airports, by = c("dest" = "faa")) tell you? What does anti_join(airports, flights, by = c("faa" = "dest")) tell you?

  3. You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned above.