Load Libraries


library(tidyverse)
library(nycflights13)

Introduction


It’s rare that a data analysis involves only a single data source. Typically you can have multiple data sources which give multiple tables of data, and you must combine them to answer the questions that you’re interested in.

Collectively, multiple tables of data are called relational data because it is the relations, not just the individual data sets, that are important.

Relations are always defined between a pair of tables. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair.

Example of relational data - nycflights13


We will use the nycflights13 package to learn about relational data. Other than flights, nycflights13 contains four other tibbles that are related to the flights table that we used in data transformation:

airlines lets you look up the full carrier name from its abbreviated code:


airlines
## # A tibble: 16 Ă— 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

airports gives information about each airport, identified by the faa airport code:


airports
## # A tibble: 1,458 Ă— 8
##    faa   name                             lat    lon   alt    tz dst   tzone    
##    <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
## # ℹ 1,448 more rows

planes gives information about each plane, identified by its tailnum:


planes
## # A tibble: 3,322 Ă— 9
##    tailnum  year type              manufacturer model engines seats speed engine
##    <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## 10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## # ℹ 3,312 more rows

weather gives the weather at each NYC airport for each hour:


weather
## # A tibble: 26,115 Ă— 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # ℹ 26,105 more rows
## # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>

One way to show the relationships between the different tables is with a drawing:

Here the relations between tables are represented by key matching, which we will discuss in the next.

Lab Exercise: Explain how you understand the diagram above.

Keys


The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation.

In simple cases, a single variable is sufficient to identify an observation. For example, each plane is uniquely identified by its tailnum.

To verify this, we can check the number of rows in planes data set and the amount of unique values in planes$tailnum. If the two numbers are exactly the same, then tainum can be used to uniquely identify a plane.

planes
## # A tibble: 3,322 Ă— 9
##    tailnum  year type              manufacturer model engines seats speed engine
##    <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## 10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## # ℹ 3,312 more rows
nrow(planes)
## [1] 3322
length(unique(planes$tailnum))
## [1] 3322

Primary Key


In other cases, multiple variables may be needed. For example, to identify an observation in weather you need five variables: year, month, day, hour, and origin.

Primary Key: A primary key uniquely identifies an observation in its own table. For example, planes$tailnum is a primary key for planes because it uniquely identifies each plane in the planes table.

Verification of primary keys


Once you’ve identified the primary keys in your tables, it’s good practice to verify that they do indeed uniquely identify each observation. One way to do that is to count() the primary keys and look for entries where n is greater than one:

planes %>% 
  count(tailnum) %>% 
  filter(n > 1)
## # A tibble: 0 Ă— 2
## # ℹ 2 variables: tailnum <chr>, n <int>
weather %>% 
  count(year, month, day, hour, origin) %>% 
  filter(n > 1)
## # A tibble: 3 Ă— 6
##    year month   day  hour origin     n
##   <int> <int> <int> <int> <chr>  <int>
## 1  2013    11     3     1 EWR        2
## 2  2013    11     3     1 JFK        2
## 3  2013    11     3     1 LGA        2

So we see that in the weather data set, there are a few records that match the same time and the same airport. Let’s look at one of them in detail:

temp <- filter(weather, year == 2013, month == 11, day == 3, hour == 1)
glimpse(temp)
## Rows: 6
## Columns: 15
## $ origin     <chr> "EWR", "EWR", "JFK", "JFK", "LGA", "LGA"
## $ year       <int> 2013, 2013, 2013, 2013, 2013, 2013
## $ month      <int> 11, 11, 11, 11, 11, 11
## $ day        <int> 3, 3, 3, 3, 3, 3
## $ hour       <int> 1, 1, 1, 1, 1, 1
## $ temp       <dbl> 51.98, 50.00, 53.96, 51.98, 55.04, 53.96
## $ dewp       <dbl> 39.02, 39.02, 37.94, 37.94, 39.02, 39.92
## $ humid      <dbl> 61.15, 65.80, 54.51, 58.62, 54.67, 58.89
## $ wind_dir   <dbl> 310, 290, 320, 310, 330, 310
## $ wind_speed <dbl> 6.90468, 5.75390, 9.20624, 6.90468, 9.20624, 8.05546
## $ wind_gust  <dbl> NA, NA, NA, NA, NA, NA
## $ precip     <dbl> 0, 0, 0, 0, 0, 0
## $ pressure   <dbl> 1009.8, 1010.5, 1009.8, 1010.5, 1009.3, 1010.2
## $ visib      <dbl> 10, 10, 10, 10, 10, 10
## $ time_hour  <dttm> 2013-11-03 01:00:00, 2013-11-03 01:00:00, 2013-11-03 01:00:…

Now it is clear that there are indeed multiple records for 1:00am on Nov 3 in 2013 for all three airports. We may think how to handle these records. Since there aren’t many of such records, let’s ignore the few exceptions for now and still consider the combination of year, month, day, hour, and origin as a primary key for weather data.

No primary key available


Sometimes a table doesn’t have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it.

For example, there are some samples in the diamonds data set that are identical in all columns. This can be found by using the duplicated() function, which returns a vector of TRUE of FALSE of whether there are duplicates of each row.

sum(duplicated(diamonds))
## [1] 146
diamonds[duplicated(diamonds), ] # Verify the duplicates
## # A tibble: 146 Ă— 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.79 Ideal     G     SI1      62.3    57  2898  5.9   5.85  3.66
##  2  0.79 Ideal     G     SI1      62.3    57  2898  5.9   5.85  3.66
##  3  0.79 Ideal     G     SI1      62.3    57  2898  5.9   5.85  3.66
##  4  0.79 Ideal     G     SI1      62.3    57  2898  5.9   5.85  3.66
##  5  1.52 Good      E     I1       57.3    58  3105  7.53  7.42  4.28
##  6  1    Fair      E     SI2      67      53  3136  6.19  6.13  4.13
##  7  1    Fair      F     SI2      65.1    55  3265  6.26  6.23  4.07
##  8  0.9  Very Good I     VS2      58.4    62  3334  6.29  6.35  3.69
##  9  1    Ideal     E     SI2      62.9    56  3450  6.32  6.3   3.97
## 10  1    Fair      H     SI1      65.5    57  3511  6.26  6.21  4.08
## # ℹ 136 more rows

Another example


Another example is the flights data set. What’s the primary key in the flights table? You might think it would be the date plus the flight or tail number, but neither of those are unique:

flights %>% 
  count(year, month, day, flight) %>% 
  filter(n > 1)
## # A tibble: 29,768 Ă— 5
##     year month   day flight     n
##    <int> <int> <int>  <int> <int>
##  1  2013     1     1      1     2
##  2  2013     1     1      3     2
##  3  2013     1     1      4     2
##  4  2013     1     1     11     3
##  5  2013     1     1     15     2
##  6  2013     1     1     21     2
##  7  2013     1     1     27     4
##  8  2013     1     1     31     2
##  9  2013     1     1     32     2
## 10  2013     1     1     35     2
## # ℹ 29,758 more rows
flights %>% 
  count(year, month, day, tailnum) %>% 
  filter(n > 1)
## # A tibble: 64,928 Ă— 5
##     year month   day tailnum     n
##    <int> <int> <int> <chr>   <int>
##  1  2013     1     1 N0EGMQ      2
##  2  2013     1     1 N11189      2
##  3  2013     1     1 N11536      2
##  4  2013     1     1 N11544      3
##  5  2013     1     1 N11551      2
##  6  2013     1     1 N12540      2
##  7  2013     1     1 N12567      2
##  8  2013     1     1 N13123      2
##  9  2013     1     1 N13538      3
## 10  2013     1     1 N13566      3
## # ℹ 64,918 more rows

So the same flight or plane may fly more than once on the same day. They cannot be used as the primary key.

create a surrogate key


If a table lacks a primary key, it’s sometimes useful to add one with mutate() and row_number(). That makes it easier to match observations if you’ve done some filtering and want to check back in with the original data. This is called a surrogate key.

flights %>%
  mutate(id = row_number()) %>% # The `row_number()` function returns the row number of each observation
  select(id, everything()) %>%
  arrange(id) %>%
  print()
## # A tibble: 336,776 Ă— 20
##       id  year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1     1  2013     1     1      517            515         2      830
##  2     2  2013     1     1      533            529         4      850
##  3     3  2013     1     1      542            540         2      923
##  4     4  2013     1     1      544            545        -1     1004
##  5     5  2013     1     1      554            600        -6      812
##  6     6  2013     1     1      554            558        -4      740
##  7     7  2013     1     1      555            600        -5      913
##  8     8  2013     1     1      557            600        -3      709
##  9     9  2013     1     1      557            600        -3      838
## 10    10  2013     1     1      558            600        -2      753
## # ℹ 336,766 more rows
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

We successfully create a surrogate key for the flights table.

Lab homework (Required):


Answer whether there exists a primary key in the mpg data set with codes.

Foreign Key and Relation


A foreign key uniquely identifies an observation in another table. For example, flights$tailnum is a foreign key to the planes table because it appears in the flights table where it matches each flight to a unique plane in the planes data set.

A primary key and the corresponding foreign key in another table form a relation.

Relations are typically one-to-many. For example, each flight has one plane, but each plane has many flights. In other data, you’ll occasionally see a 1-to-1 relationship. You can think of this as a special case of 1-to-many. You can model many-to-many relations with a many-to-1 relation plus a 1-to-many relation. For example, in this data there’s a many-to-many relationship between airlines and airports: each airline flies to many airports; each airport hosts many airlines.

Lab Exercise:


Identify the following relationship to be one-to-many, one-to-one, or many-to-many

  1. origin in flights and faa in airports

  2. The variable student_name in a table student information for the college and a table grade score of our course.

Work with relational data


The most common place to find relational data is in a relational database management system (or RDBMS), a term that encompasses almost all modern databases. If you’ve used a database before, you’ve almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different.

Generally, dplyr is a little easier to use than SQL because dplyr is specialised to do data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren’t commonly needed for data analysis.

There are three families of verbs designed to work with relational data from a pair of tables:

  • Mutating joins, which add new variables to one data frame from matching observations in another.

  • Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.

  • Set operations, which treat observations as if they were set elements.

Mutating joins


The first tool we’ll look at for combining a pair of tables is the mutating join. A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.

Let’s work on a part of flights data as an example.

flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 Ă— 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1     5 EWR    IAH   N14228  UA     
##  2  2013     1     1     5 LGA    IAH   N24211  UA     
##  3  2013     1     1     5 JFK    MIA   N619AA  AA     
##  4  2013     1     1     5 JFK    BQN   N804JB  B6     
##  5  2013     1     1     6 LGA    ATL   N668DN  DL     
##  6  2013     1     1     5 EWR    ORD   N39463  UA     
##  7  2013     1     1     6 EWR    FLL   N516JB  B6     
##  8  2013     1     1     6 LGA    IAD   N829AS  EV     
##  9  2013     1     1     6 JFK    MCO   N593JB  B6     
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
## # ℹ 336,766 more rows

left_join()


Imagine you want to add the full airline name to the flights2 data. You can combine the airlines and flights2 data frames with left_join():

flights2 %>%
  select(-origin, -dest) %>% 
  left_join(airlines, by = "carrier")
## # A tibble: 336,776 Ă— 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 N3ALAA  AA      American Airlines Inc.  
## # ℹ 336,766 more rows

Here carrier is a foreign key in flights2 to airlines. What happened is that we keep all observations in flights2 (since it is left_join) , then add other columns of airlines (only one column name in this example) by matching the carrier column values.

This is hard to generalise when we need to match multiple variables. We will use the following visual presentations to help us understand the four mutating join functions: the inner join, and the three outer joins.

Understanding joins


To help us learn how joins work, let’s look at the following visual representation:

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

The coloured column represents the “key” variable: these are used to match the rows between the tables. The grey column represents the “value” column that is carried along for the ride. In these examples we use a single key variable, but the idea generalises in a straightforward way to multiple keys and multiple values.

Inner join


The simplest type of join is the inner join. An inner join matches pairs of observations whenever their keys are equal:

x %>% 
  inner_join(y, by = "key")
## # A tibble: 2 Ă— 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2

For inner join, unmatched rows are not included in the result. This means that generally inner joins are usually not appropriate for use in analysis because it’s too easy to lose observations.

Outer join


An inner join keeps observations that appear in both tables. An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:

  • A left join keeps all observations in x.
  • A right join keeps all observations in y.
  • A full join keeps all observations in x and y.

In outer joins, the values of new columns are filled with NA if there is no key matching. Another way to depict the different types of joins is with a Venn diagram:

The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.

Duplicate keys


The Venn diagram cannot well illustrate what will happen when there are duplicated keys. There are two possibilities:


  1. One table has duplicate keys. This is useful when you want to add in additional information as there is typically a one-to-many relationship.

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2"
)
left_join(x, y, by = "key")
## # A tibble: 4 Ă— 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     2 x3    y2   
## 4     1 x4    y1

Duplicate keys (Cont’d)


  1. Both tables have duplicate keys. This is usually an error because in neither table do the keys uniquely identify an observation. When you join duplicated keys, you get all possible combinations, the Cartesian product:

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     3, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     2, "y3",
     3, "y4"
)
left_join(x, y, by = "key")
## # A tibble: 6 Ă— 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     2 x2    y3   
## 4     2 x3    y2   
## 5     2 x3    y3   
## 6     3 x4    y4

Defining the key columns


So far, the pairs of tables have always been joined by a single variable, and that variable has the same name in both tables. That constraint was encoded by by = “key”. You can use other values for by to connect the tables in other ways:

  • The default, by = NULL, uses all variables that appear in both tables, the so called natural join. For example, the flights and weather tables match on their common variables: year, month, day, hour and origin.
flights2 %>% 
  left_join(weather)
## # A tibble: 336,776 Ă— 18
##     year month   day  hour origin dest  tailnum carrier  temp  dewp humid
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
##  2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
##  3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
##  4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
##  5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
##  6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
##  7  2013     1     1     6 EWR    FLL   N516JB  B6       37.9  28.0  67.2
##  8  2013     1     1     6 LGA    IAD   N829AS  EV       39.9  25.0  54.8
##  9  2013     1     1     6 JFK    MCO   N593JB  B6       37.9  27.0  64.3
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA       39.9  25.0  54.8
## # ℹ 336,766 more rows
## # ℹ 7 more variables: wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
  • A character vector, by = “x”. This is like a natural join, but uses only some of the common variables. For example, flights and planes have year variables, but they mean different things so we only want to join by tailnum.
flights2 %>% 
  left_join(planes, by = "tailnum")
## # A tibble: 336,776 Ă— 16
##    year.x month   day  hour origin dest  tailnum carrier year.y type            
##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>           
##  1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixed wing mult…
##  2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixed wing mult…
##  3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixed wing mult…
##  4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixed wing mult…
##  5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixed wing mult…
##  6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixed wing mult…
##  7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixed wing mult…
##  8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixed wing mult…
##  9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixed wing mult…
## 10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA>            
## # ℹ 336,766 more rows
## # ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
## #   seats <int>, speed <int>, engine <chr>

Note that the year variables (which appear in both input data frames, but are not constrained to be equal) are disambiguated in the output with a suffix. We can fix this by use rename().

flights2 %>% 
  left_join(planes, by = "tailnum") %>%
  rename("flights_year" = 'year.x', "plane_made_year" = 'year.y') %>%
  glimpse()
## Rows: 336,776
## Columns: 16
## $ flights_year    <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ hour            <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, …
## $ origin          <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA"…
## $ dest            <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD"…
## $ tailnum         <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39…
## $ carrier         <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", …
## $ plane_made_year <int> 1999, 1998, 1990, 2012, 1991, 2012, 2000, 1998, 2004, …
## $ type            <chr> "Fixed wing multi engine", "Fixed wing multi engine", …
## $ manufacturer    <chr> "BOEING", "BOEING", "BOEING", "AIRBUS", "BOEING", "BOE…
## $ model           <chr> "737-824", "737-824", "757-223", "A320-232", "757-232"…
## $ engines         <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, NA, 2, 2, 2…
## $ seats           <int> 149, 149, 178, 200, 178, 191, 200, 55, 200, NA, 200, 2…
## $ speed           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ engine          <chr> "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan", "T…
  • A named character vector: by = c(“a” = “b”). This will match variable a in table x to variable b in table y. The variables from x will be used in the output.

For example, if we want to draw a map we need to combine the flights data with the airports data which contains the location (lat and lon) of each airport. Each flight has an origin and destination airport, so we need to specify which one we want to join to:

flights2 %>% 
  left_join(airports, c("dest" = "faa"))
## # A tibble: 336,776 Ă— 15
##     year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Georg…  30.0 -95.3    97
##  2  2013     1     1     5 LGA    IAH   N24211  UA      Georg…  30.0 -95.3    97
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      Miami…  25.8 -80.3     8
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>    NA    NA      NA
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Harts…  33.6 -84.4  1026
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Chica…  42.0 -87.9   668
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort …  26.1 -80.2     9
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      Washi…  38.9 -77.5   313
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      Orlan…  28.4 -81.3    96
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chica…  42.0 -87.9   668
## # ℹ 336,766 more rows
## # ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>
flights2 %>% 
  left_join(airports, c("origin" = "faa"))
## # A tibble: 336,776 Ă— 15
##     year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Newar…  40.7 -74.2    18
##  2  2013     1     1     5 LGA    IAH   N24211  UA      La Gu…  40.8 -73.9    22
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      John …  40.6 -73.8    13
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      John …  40.6 -73.8    13
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      La Gu…  40.8 -73.9    22
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Newar…  40.7 -74.2    18
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Newar…  40.7 -74.2    18
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      La Gu…  40.8 -73.9    22
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      John …  40.6 -73.8    13
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      La Gu…  40.8 -73.9    22
## # ℹ 336,766 more rows
## # ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>

Lab Homework (Required):


Add airport information to the data set flights2 for both origin and destination. Rename your columns to differentiate information for dest and origin.

Filtering joins


Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:

  • semi_join(x, y) keeps all observations in x that have a match in y.

  • anti_join(x, y) drops all observations in x that have a match in y.

semi_join()


Semi-joins are useful for matching filtered summary tables back to the original rows. For example, imagine you’ve found the top ten most popular destinations:

top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
top_dest
## # A tibble: 10 Ă— 2
##    dest      n
##    <chr> <int>
##  1 ORD   17283
##  2 ATL   17215
##  3 LAX   16174
##  4 BOS   15508
##  5 MCO   14082
##  6 CLT   14064
##  7 SFO   13331
##  8 FLL   12055
##  9 MIA   11728
## 10 DCA    9705

Now you want to find each flight that went to one of those destinations. You could construct a filter yourself:

flights %>% 
  filter(dest %in% top_dest$dest)
## # A tibble: 141,145 Ă— 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      554            600        -6      812            837
##  3  2013     1     1      554            558        -4      740            728
##  4  2013     1     1      555            600        -5      913            854
##  5  2013     1     1      557            600        -3      838            846
##  6  2013     1     1      558            600        -2      753            745
##  7  2013     1     1      558            600        -2      924            917
##  8  2013     1     1      558            600        -2      923            937
##  9  2013     1     1      559            559         0      702            706
## 10  2013     1     1      600            600         0      851            858
## # ℹ 141,135 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

But it’s difficult to extend that approach to multiple variables. For example, imagine that you’d found the 10 days with highest average delays. How would you construct the filter statement that used year, month, and day to match it back to flights?

Instead you can use a semi-join, which connects the two tables like a mutating join, but instead of adding new columns, only keeps the rows in x that have a match in y:

flights %>% 
  semi_join(top_dest)
## # A tibble: 141,145 Ă— 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      554            600        -6      812            837
##  3  2013     1     1      554            558        -4      740            728
##  4  2013     1     1      555            600        -5      913            854
##  5  2013     1     1      557            600        -3      838            846
##  6  2013     1     1      558            600        -2      753            745
##  7  2013     1     1      558            600        -2      924            917
##  8  2013     1     1      558            600        -2      923            937
##  9  2013     1     1      559            559         0      702            706
## 10  2013     1     1      600            600         0      851            858
## # ℹ 141,135 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Graphically, a semi-join looks like this:

anti_join()


The inverse of a semi-join is an anti-join. An anti-join keeps the rows that don’t have a match:

Anti-joins are useful for diagnosing join mismatches. For example, when connecting flights and planes, you might be interested to know that there are many flights that don’t have a match in planes:

flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)
## # A tibble: 722 Ă— 2
##    tailnum     n
##    <chr>   <int>
##  1 <NA>     2512
##  2 N725MQ    575
##  3 N722MQ    513
##  4 N723MQ    507
##  5 N713MQ    483
##  6 N735MQ    396
##  7 N0EGMQ    371
##  8 N534MQ    364
##  9 N542MQ    363
## 10 N531MQ    349
## # ℹ 712 more rows

Example 1


The following code creates a graph of all destination airports in the US map:

airports %>%
  semi_join(flights, c("faa" = "dest")) %>%
  ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point() +
    coord_quickmap()

Here the x- and y- axis are longitude and latitude respectively. The two points outside the continental map of US refer to the airport in Alaska and Hawaii.

Now let’s compute the average arrival delay of each destination airport and mark it in the map.

flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(dest) %>%
  summarise(avg_delay = mean(arr_delay)) -> delay_data

airports %>%
  semi_join(delay_data, by = c("faa" = "dest")) %>%
  left_join(delay_data, by = c("faa" = "dest")) %>%
  ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point(aes(color = avg_delay)) +
    scale_color_gradient(   low = "green",   high = "red") +
    coord_quickmap()  

Lab Exercise:


Create a airport map similar to the one above with the color representing the number of flights per day from NYC to each airport.

Example 2


As another example, let’s study whether there is a relationship between the age of a plane and its delays?

To perform this analysis, we need to

  1. Filter flights data to remove all NA values in dep_delay and arr_delay.
  2. Left join flights data with planes data by tailnum.
  3. Compute the age of planes and add a new column by mutate()
  4. Fliter all data where the plane age is missing.
  5. Group the data by tail_num and compute the average delay for each plane.
  6. Create a scatter plot of average delay vs plane age.
flights3 <- flights %>%
  filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
  left_join(planes, by = 'tailnum') %>%
  rename("flights_year" = 'year.x', "plane_made_year" = 'year.y') %>%
  mutate(plane_age = 2013 - plane_made_year) %>%
  filter(!is.na(plane_age)) %>%
  group_by(tailnum) %>%
  summarise(avg_dep_delay = mean(dep_delay), avg_arr_delay = mean(arr_delay), plane_age = mean(plane_age))
  
cor(flights3$avg_arr_delay, flights3$plane_age)
## [1] -0.01639459
ggplot(flights3, aes(x = plane_age, y = avg_arr_delay)) + 
    geom_point() + 
    geom_smooth() + 
    ylim(0, 50) + 
    labs(title = "plane age vs departure delay time", x = "plane age (yrs)", y = "average arrival delay (minutes)") + 
    theme(plot.title = element_text(hjust = 0.5, size = rel(1.2), margin = margin(0,0,15,0)), axis.title.x = element_text(size = rel(1.0), margin = margin(10,0,0,0)), axis.title.y = element_text(size = rel(1.0), margin = margin(0,10,0,0)), axis.text = element_text(size = rel(1.0)), plot.margin = margin(1,1,1,1,"cm"))

As a conclusion, we don’t see significant correlation between the two variables.