Working with Relational Data

So far you have been working with data in one table. Usually stored as a tibble or a dataframe. But IRL (and even the the Metaverse :)) it is a rare thing to work with only one table. Most frequently this takes the froms of working with a relational base. And while there exist specialized languages, most notably SQL, to work with relational data bases, it is useful to be able to perform some of the same functions in your development language, whether that be R, Pyton, Julia, Matlap, Tableau, even Excel.

Last week you worked with the nycflights13 data base, but you worked only with the flights table. I believe I mentioned the fact that nycflights13 had other tables as well.

Task 1

Since this is a new project, you will have to install both nycflights13 and tidyverse. Use install.packages but do the installation in the console pane.

#Done

Load these packages in the first code chunck of this project.

#Done

Use the help system to answer the following questions:

How many tables are there in the database?

Your answer:

#There are 5 tables:
#airlines, airports, flights, planes, and weather

How many airports are represented in the table?

Your answer:

#airports
count(airports)

## # A tibble: 1 x 1
##       n
##   <int>
## 1  1458

What are the variables in planes table?

Your answer:

#planes
#tailnum, (manufactured) year, type, manufacturer/model, (number) engines/seats, speed, (type) engine

Introduction to relational data

You will find Chapter 13 in R for Data Science usefull background reading.

The following passage is from R4DS:

"To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data:

Mutating joins, which add new variables to one data frame from matching observations in another.
Filtering joins, which filter observations from one data frame base
Set operations, which treat observations as if they were set elements.

The most common place to find relational data is in a relational database management system (or RDBMS), a term that encompasses almost all modern databases. If you’ve used a database before, you’ve almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different. Generally, dplyr is a little easier to use than SQL because dplyr is specialised to do data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren’t commonly needed for data analysis.d on whether or not they match an observation in the other table.

Task 2

Examine the following table and verify that the associations depicted between the data tables is correct. (also from R4DS)

1. What information can be gained that would not be in the flights data table? (be specific)

#The timezone of the airport can be gotten from the airports table
#Information regarding the manufacturer and year made, number of seats and engines, engine type, etc. can be gotten from the planes table
#weather information for each day can be obtained from the weather table
#The full name of the airlines can be gotten from the airlines table

Suppose you wanted to approximate the route each plane flies from its origin to its destination. What variables would you need? What tables would you need to combine?

#I would need to know the loaction of the orgin and destination airport
#Then I could assume an "as the crow flies" path.
#To get this, I would have to combine the flights table with the airports table, matching the origin and destination airports with their lat and lon

There is connection between weather and airports, not shown in the figure above. What is the relationship and how would it appear?

#weather
#airports
#filter(airports, faa == 'EWR')

#You could tie the specific latitude and longitude of the weather described in the weather table by matching it with the given NYC area airport. Specifically, the origin column in airports is a foreign key matching the faa column in airports (faa being the primary key for the airports table). However, you could only do this with the NYC area airports as the weather table only has data for them.

weather only contains information for the origin (NYC) airports. If it contained weather records for all airports in the USA, what additional relation would it define with flights?

#flights
#weather

#If all airport weather was included in the weather table- year, month, day, hour, and origin/dest airport could be linked (via another join with the faa column in airports) between the flights and weather tables. This would make those columns a foreign key in the flights table. It would allow one to look at weather data from the origin and destination airports, giving insight into delays and cancellations etc.

Connecting variables are called keys. By definition a key is a variable or set of variables that uniquely identifies an observation.

Types of keys

primary key uniquely identifies an observation in its own table.
foreign key uniquely identifies an observation in another table.

Definition: A primary key and a matching foreign key in another table define a relation.

Taks 3

A variable can be both a primary and foreign key? Can you give an example from nycflights13?

#faa is a primary key for the airports table, but a foreign key for the weather table (designated "origin" in weather)

How can you verify that a key is a primary key? (Hint, you will probably need to write some code!)

#you could do a count of that primary key, n should never be greater than 1
count(airports, faa)

## # A tibble: 1,458 x 2
##    faa       n
##    <chr> <int>
##  1 04G       1
##  2 06A       1
##  3 06C       1
##  4 06N       1
##  5 09J       1
##  6 0A9       1
##  7 0G6       1
##  8 0G7       1
##  9 0P2       1
## 10 0S9       1
## # ... with 1,448 more rows

filter(count(airports, faa), n > 1)

## # A tibble: 0 x 2
## # ... with 2 variables: faa <chr>, n <int>

#each faa value only ever corresponds to, at most, 1 element

What is the primary key for flights?

#flights
#At first it is tempting to think tailnum will be a primary key
filter(flights, tailnum == 'N14228')

## # A tibble: 111 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     8     1435           1440        -5     1717           1746
##  3  2013     1     9      717            700        17      812            815
##  4  2013     1     9     1143           1144        -1     1425           1445
##  5  2013     1    13      835            824        11     1030            951
##  6  2013     1    16     1829           1730        59     2117           2023
##  7  2013     1    22     1902           1808        54     2214           2106
##  8  2013     1    23     1050           1056        -6     1143           1208
##  9  2013     1    23     1533           1529         4     1641           1645
## 10  2013     1    25      724            720         4     1000           1023
## # ... with 101 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

#however, it can be seen that there are many flights with the same tailnum- it just designates a route, not an individual flight

#There is no one column that is the primary key for flights
#Even combining- year, month, day with a flight or tailnum doesn't work, as some flights/tailnums are used multiple times
filter(count(flights, year, month, day, flight), n > 1)

## # A tibble: 29,768 x 5
##     year month   day flight     n
##    <int> <int> <int>  <int> <int>
##  1  2013     1     1      1     2
##  2  2013     1     1      3     2
##  3  2013     1     1      4     2
##  4  2013     1     1     11     3
##  5  2013     1     1     15     2
##  6  2013     1     1     21     2
##  7  2013     1     1     27     4
##  8  2013     1     1     31     2
##  9  2013     1     1     32     2
## 10  2013     1     1     35     2
## # ... with 29,758 more rows

filter(count(flights, year, month, day, tailnum), n > 1)

## # A tibble: 64,928 x 5
##     year month   day tailnum     n
##    <int> <int> <int> <chr>   <int>
##  1  2013     1     1 N0EGMQ      2
##  2  2013     1     1 N11189      2
##  3  2013     1     1 N11536      2
##  4  2013     1     1 N11544      3
##  5  2013     1     1 N11551      2
##  6  2013     1     1 N12540      2
##  7  2013     1     1 N12567      2
##  8  2013     1     1 N13123      2
##  9  2013     1     1 N13538      3
## 10  2013     1     1 N13566      3
## # ... with 64,918 more rows

#Therfore you have to use the rownum/index of the flight from the dataframe
mutate(flights, unique_flight_id = row_number()) %>%
  select(unique_flight_id, everything())

## # A tibble: 336,776 x 20
##    unique_flight_id  year month   day dep_time sched_dep_time dep_delay arr_time
##               <int> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1                1  2013     1     1      517            515         2      830
##  2                2  2013     1     1      533            529         4      850
##  3                3  2013     1     1      542            540         2      923
##  4                4  2013     1     1      544            545        -1     1004
##  5                5  2013     1     1      554            600        -6      812
##  6                6  2013     1     1      554            558        -4      740
##  7                7  2013     1     1      555            600        -5      913
##  8                8  2013     1     1      557            600        -3      709
##  9                9  2013     1     1      557            600        -3      838
## 10               10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

#—

If a table does not have a primary key then if a primary key is needed, then one can be created using mutate() and row_number(), such a created key is called a surrogate key.

#— ### Task 4

Are relations one-to-one or one-to-many?

#Most relations are one-to-many - the key(primary or foreign), links one value in a given table to multiple values in another table

Can there be a many-to-many relations?

#Yes, many-to-many relationships generally flow through a single connection though
#For example, you may have a table of students in a class and a table of student test scores- a many-to-many relationship
#but these two groups of values would be linked through a single class

What variables would you use to create a surrogate key for flights?

#I used the row_number() function to create a surrogate key for flights
#none of the variables could combine to uniquely identify every flight

Identify the keys in the following datasets:

Lahman::Batting
babynames::babynames
nasaweather::atmos
fueleconomy::vehicles
ggplot2::diamonds

library(Lahman)

## Warning: package 'Lahman' was built under R version 4.1.2

library(babynames)

## Warning: package 'babynames' was built under R version 4.1.2

library(nasaweather)

## 
## Attaching package: 'nasaweather'

## The following object is masked from 'package:dplyr':
## 
##     storms

library(fueleconomy)

## Warning: package 'fueleconomy' was built under R version 4.1.2

library(ggplot2)

Lahman::Batting

#playerID, yearID, and stint combine to make the primary key, i.e. no more than 1 instance of such a combination occurs
#for context, a player was chosen to show where they had multiple stints in a year, thus a specific stint would also need to be selected to uniquely identify a row
filter(count(Batting, playerID, yearID, stint), n > 1)

## [1] playerID yearID   stint    n       
## <0 rows> (or 0-length row.names)

filter(Batting, playerID == 'alyeabr01', yearID == 1972)

##    playerID yearID stint teamID lgID  G AB R H X2B X3B HR RBI SB CS BB SO IBB
## 1 alyeabr01   1972     1    OAK   AL 10 13 1 3   1   0  0   0  0  0  1  0   0
## 2 alyeabr01   1972     2    SLN   NL 13 19 0 3   1   0  0   1  0  0  0  6   0
## 3 alyeabr01   1972     3    OAK   AL 10 18 2 3   0   0  1   2  0  0  2  5   0
##   HBP SH SF GIDP
## 1   0  0  0    1
## 2   0  0  0    0
## 3   0  0  0    0

babynames::babynames

#babynames
filter(count(babynames, year, sex, name), n > 1)

## # A tibble: 0 x 4
## # ... with 4 variables: year <dbl>, sex <chr>, name <chr>, n <int>

#year, sex, and name combine to make a primary key for babynames

nasaweather::atmos

#atmos
filter(count(atmos, lat, long, year, month), n > 1)

## # A tibble: 0 x 5
## # ... with 5 variables: lat <dbl>, long <dbl>, year <int>, month <int>, n <int>

#lat, long, year, and month combine to make a primary key for atmos

fueleconomy::vehicles

#vehicles
filter(count(vehicles, id), n > 1)

## # A tibble: 0 x 2
## # ... with 2 variables: id <dbl>, n <int>

#id is the primary key for vehicles

ggplot2::diamonds

#diamonds
filter(count(diamonds, carat, cut, color, clarity, depth, table, price, x, y, z), n > 1)

## # A tibble: 143 x 11
##    carat cut       color clarity depth table price     x     y     z     n
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
##  1  0.3  Good      J     VS1      63.4    57   394  4.23  4.26  2.69     2
##  2  0.3  Very Good G     VS2      63      55   526  4.29  4.31  2.71     2
##  3  0.3  Very Good J     VS1      63.4    57   506  4.26  4.23  2.69     2
##  4  0.3  Premium   D     SI1      62.2    58   709  4.31  4.28  2.67     2
##  5  0.3  Ideal     G     VS2      63      55   675  4.31  4.29  2.71     2
##  6  0.3  Ideal     G     IF       62.1    55   863  4.32  4.35  2.69     2
##  7  0.3  Ideal     H     SI1      62.2    57   450  4.26  4.29  2.66     2
##  8  0.3  Ideal     H     SI1      62.2    57   450  4.27  4.28  2.66     2
##  9  0.31 Good      D     SI1      63.5    56   571  4.29  4.31  2.73     2
## 10  0.31 Very Good D     SI1      63.5    56   732  4.31  4.29  2.73     2
## # ... with 133 more rows

#no primary key can be made from the variables, thus using row_num is necessarry
mutate(diamonds, 'diamond_id' = row_number()) %>%
  select(diamond_id, everything())

## # A tibble: 53,940 x 11
##    diamond_id carat cut       color clarity depth table price     x     y     z
##         <int> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1          1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2          2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3          3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4          4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5          5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6          6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7          7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8          8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9          9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10         10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ... with 53,930 more rows

Mutating joins

A mutating join allows you to combine variables from two tables.

Example:

flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2

## # A tibble: 336,776 x 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1     5 EWR    IAH   N14228  UA     
##  2  2013     1     1     5 LGA    IAH   N24211  UA     
##  3  2013     1     1     5 JFK    MIA   N619AA  AA     
##  4  2013     1     1     5 JFK    BQN   N804JB  B6     
##  5  2013     1     1     6 LGA    ATL   N668DN  DL     
##  6  2013     1     1     5 EWR    ORD   N39463  UA     
##  7  2013     1     1     6 EWR    FLL   N516JB  B6     
##  8  2013     1     1     6 LGA    IAD   N829AS  EV     
##  9  2013     1     1     6 JFK    MCO   N593JB  B6     
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
## # ... with 336,766 more rows

Suppose you wand to add the full airline name to the flights2 data.

You can combine airlines and flights2 data frames with left_join()

flights2 %>%
  select(-origin,-dest) %>%
  left_join(airlines,by = "carrier")

## # A tibble: 336,776 x 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 N3ALAA  AA      American Airlines Inc.  
## # ... with 336,766 more rows

Understanding joins

Consider the data frames: We can make this data as follows:

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)
x

## # A tibble: 3 x 2
##     key val_x
##   <dbl> <chr>
## 1     1 x1   
## 2     2 x2   
## 3     3 x3

## # A tibble: 3 x 2
##     key val_y
##   <dbl> <chr>
## 1     1 y1   
## 2     2 y2   
## 3     4 y3

This type of join is called an inner join which means pairs of observations are matched when their keys are equal.

x %>% 
  inner_join(y, by = "key")

## # A tibble: 2 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2

#> # A tibble: 2 x 3
#>     key val_x val_y
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1   
#> 2     2 x2    y2

notice: unmatched rows are not included in the result.

Outer Joins

While an inner join keeps only observations that appear in both tables and outer join keeps all observations that appear in one of the tables.

What Happens when we have duplicate keys

All the diagrams above assume that there are no duplicate keys in a data set. Is this even true for flights? Can you think of an example?

#As seen from a question in a previous section, this is not true for flights. No combination of variables uniquely identified a flight. It was necessary to create a new flight_id key using row_num()

Here is a diagram from Wickham and Grolemund illustrating the situation

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2"
)
left_join(x, y, by = "key")

## # A tibble: 4 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     2 x3    y2   
## 4     1 x4    y1

#> # A tibble: 4 x 3
#>     key val_x val_y
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1   
#> 2     2 x2    y2   
#> 3     2 x3    y2   
#> 4     1 x4    y1

Note: in this case, key is a foreign key in x and a primary key in y.

If boh tables have duplicate keys the keys do not uniquely identify an observation. In this case a join results in all possible combinations (cartesian product)

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     3, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     2, "y3",
     3, "y4"
)
left_join(x, y, by = "key")

## # A tibble: 6 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     2 x2    y3   
## 4     2 x3    y2   
## 5     2 x3    y3   
## 6     3 x4    y4

Defining Key Columns

In the above examples the key values were single variables and in left_jointhe ke was encoded by by = "key". But one can use other values for by= the default is by = NULL. What does this mean?

#let's get flights2 back to eliminate extraneous columns
flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)

flights2 %>%
  left_join(weather) #note I'm taking all the defaults

## Joining, by = c("year", "month", "day", "hour", "origin")

## # A tibble: 336,776 x 18
##     year month   day  hour origin dest  tailnum carrier  temp  dewp humid
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
##  2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
##  3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
##  4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
##  5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
##  6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
##  7  2013     1     1     6 EWR    FLL   N516JB  B6       37.9  28.0  67.2
##  8  2013     1     1     6 LGA    IAD   N829AS  EV       39.9  25.0  54.8
##  9  2013     1     1     6 JFK    MCO   N593JB  B6       37.9  27.0  64.3
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA       39.9  25.0  54.8
## # ... with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>

It means that we join by mathcing on all common variables, see console message above.

On the other hand if match on a key, for example tailnum is a foreign key in flights for planes

flights2 %>%
  left_join(planes, by="tailnum")

## # A tibble: 336,776 x 16
##    year.x month   day  hour origin dest  tailnum carrier year.y type            
##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>           
##  1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixed wing mult~
##  2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixed wing mult~
##  3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixed wing mult~
##  4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixed wing mult~
##  5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixed wing mult~
##  6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixed wing mult~
##  7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixed wing mult~
##  8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixed wing mult~
##  9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixed wing mult~
## 10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA>            
## # ... with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>

suppose we wanted to draw a map of all destination air ports for flights from NewYork in 2013

flights2%>%
  left_join(airports, c("dest"="faa"))

## # A tibble: 336,776 x 15
##     year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Georg~  30.0 -95.3    97
##  2  2013     1     1     5 LGA    IAH   N24211  UA      Georg~  30.0 -95.3    97
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      Miami~  25.8 -80.3     8
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>    NA    NA      NA
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Harts~  33.6 -84.4  1026
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Chica~  42.0 -87.9   668
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort ~  26.1 -80.2     9
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      Washi~  38.9 -77.5   313
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      Orlan~  28.4 -81.3    96
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chica~  42.0 -87.9   668
## # ... with 336,766 more rows, and 3 more variables: tz <dbl>, dst <chr>,
## #   tzone <chr>

##Task 5 Homework - Exercises from 13.3.6

#Doing questions from 13.4.6

Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:

airports %>%
  semi_join(flights, c("faa" = "dest")) %>%
  ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point() +
    coord_quickmap()

…

#Computing average delay by destination
#flights3 <- group_by(flights, dest)
#summarise(flights3, avg_delay = mean(arr_delay, na.rm = TRUE))

flights %>%
  group_by(dest) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) -> delay_data
#delay_data

…

#joining with airports table
delay_data %>%
  left_join(airports, by = c('dest' = 'faa')) -> delay_data2
delay_data2

## # A tibble: 105 x 9
##    dest  avg_delay name                  lat    lon   alt    tz dst   tzone     
##    <chr>     <dbl> <chr>               <dbl>  <dbl> <dbl> <dbl> <chr> <chr>     
##  1 ABQ        4.38 Albuquerque Intern~  35.0 -107.   5355    -7 A     America/D~
##  2 ACK        4.85 Nantucket Mem        41.3  -70.1    48    -5 A     America/N~
##  3 ALB       14.4  Albany Intl          42.7  -73.8   285    -5 A     America/N~
##  4 ANC       -2.5  Ted Stevens Anchor~  61.2 -150.    152    -9 A     America/A~
##  5 ATL       11.3  Hartsfield Jackson~  33.6  -84.4  1026    -5 A     America/N~
##  6 AUS        6.02 Austin Bergstrom I~  30.2  -97.7   542    -6 A     America/C~
##  7 AVL        8.00 Asheville Regional~  35.4  -82.5  2165    -5 A     America/N~
##  8 BDL        7.05 Bradley Intl         41.9  -72.7   173    -5 A     America/N~
##  9 BGR        8.03 Bangor Intl          44.8  -68.8   192    -5 A     America/N~
## 10 BHM       16.9  Birmingham Intl      33.6  -86.8   644    -6 A     America/C~
## # ... with 95 more rows

…

#Displaying graph of delay data
delay_data2 %>%
  ggplot(aes(lon, lat, color = avg_delay)) +
    borders("state") +
    geom_point() +
    coord_quickmap()

## Warning: Removed 4 rows containing missing values (geom_point).

Add the location of the origin and destination (i.e. the lat and lon) to flights.

#joing flights with airports, adding lat and lon
select(airports, faa, lat, lon) -> latlon
#latlon <- rename(latlon, origin_lat = lat, origin_lon = lon)
#latlon

flights %>%
  left_join(latlon, by = c('origin' = 'faa')) %>%
  left_join(latlon, by = c('dest' = 'faa'), suffix = c('_origin', '_dest')) %>%
  select(origin, lat_origin, lon_origin, dest, lat_dest, lon_dest, everything())

## # A tibble: 336,776 x 23
##    origin lat_origin lon_origin dest  lat_dest lon_dest  year month   day
##    <chr>       <dbl>      <dbl> <chr>    <dbl>    <dbl> <int> <int> <int>
##  1 EWR          40.7      -74.2 IAH       30.0    -95.3  2013     1     1
##  2 LGA          40.8      -73.9 IAH       30.0    -95.3  2013     1     1
##  3 JFK          40.6      -73.8 MIA       25.8    -80.3  2013     1     1
##  4 JFK          40.6      -73.8 BQN       NA       NA    2013     1     1
##  5 LGA          40.8      -73.9 ATL       33.6    -84.4  2013     1     1
##  6 EWR          40.7      -74.2 ORD       42.0    -87.9  2013     1     1
##  7 EWR          40.7      -74.2 FLL       26.1    -80.2  2013     1     1
##  8 LGA          40.8      -73.9 IAD       38.9    -77.5  2013     1     1
##  9 JFK          40.6      -73.8 MCO       28.4    -81.3  2013     1     1
## 10 LGA          40.8      -73.9 ORD       42.0    -87.9  2013     1     1
## # ... with 336,766 more rows, and 14 more variables: dep_time <int>,
## #   sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

Is there a relationship between the age of a plane and its delays?

#select plane info to join
select(planes, tailnum, year) -> plane_data
plane_data <- rename(plane_data, plane_built = year)
#plane_data

#join plane data with flights and graph (arr_delay)
flights %>%
  left_join(plane_data, by = 'tailnum') %>%
  mutate(plane_age = year - plane_built) %>%
  group_by(plane_age) %>%
  summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  ggplot(mapping = aes(x = plane_age, y = avg_arr_delay)) +
  geom_point()

## Warning: Removed 1 rows containing missing values (geom_point).

#join plane data with flights and graph (dep_delay)
flights %>%
  left_join(plane_data, by = 'tailnum') %>%
  mutate(plane_age = year - plane_built) %>%
  group_by(plane_age) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(mapping = aes(x = plane_age, y = avg_dep_delay)) +
  geom_point()

## Warning: Removed 1 rows containing missing values (geom_point).

#This data would indicate that the most significant delays (both arr and dep) happen around the 10 year mark from a plane's construction.

What weather conditions make it more likely to see a delay?

#weather
#flights
#Natural join -- dep_delay with percepitation
left_join(flights, weather) %>%
  group_by(precip) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(mapping = aes(x = precip, y = avg_dep_delay)) +
  geom_point()

## Joining, by = c("year", "month", "day", "origin", "hour", "time_hour")

## Warning: Removed 1 rows containing missing values (geom_point).

#dep delay by temperature
left_join(flights, weather) %>%
  group_by(temp) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(mapping = aes(x = temp, y = avg_dep_delay)) +
  geom_point()

## Joining, by = c("year", "month", "day", "origin", "hour", "time_hour")

## Warning: Removed 1 rows containing missing values (geom_point).

#dep delay by wind speed
left_join(flights, weather) %>%
  group_by(wind_speed) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(mapping = aes(x = wind_speed, y = avg_dep_delay)) +
  geom_point()

## Joining, by = c("year", "month", "day", "origin", "hour", "time_hour")

## Warning: Removed 1 rows containing missing values (geom_point).

#dep delay by temperature
left_join(flights, weather) %>%
  group_by(visib) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(mapping = aes(x = visib, y = avg_dep_delay)) +
  geom_point()

## Joining, by = c("year", "month", "day", "origin", "hour", "time_hour")

## Warning: Removed 1 rows containing missing values (geom_point).

#From these graphs: 
#There is a correlation between increased precipitation and longer dep_delay, but it is not very tight
#temperature has no correlation until temps are greater than 75, then there is a slight correlation with increased dep_delay
#There is a pretty strong/tight correlation with wind speed
#There is a correlation with visibility below 2.5

What happened on June 13 2013? Display the spatial pattern of delays, and then use Google to cross-reference with the weather.

#according to google, there were a series of severe weather patterns that caused all kinds of storms across the US.
#https://en.wikipedia.org/wiki/June_12%E2%80%9313,_2013_derecho_series
#https://en.wikipedia.org/wiki/Portal:Current_events/2013_June_13

#filter for the given day, group and summarise for avg_arr_delay, join with location data, then graph
filter(flights, year == 2013, month == 6, day == 13) %>%
  group_by(dest) %>%
  summarise(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  left_join(airports, by = c('dest' = 'faa')) %>%
  ggplot(mapping = aes(x = lon, y = lat, color = avg_arr_delay)) +
  borders('state') +
  geom_point() +
  coord_quickmap()

## Warning: Removed 4 rows containing missing values (geom_point).

#percentage of detinations that saw more than 60 min avg. delay
filter(flights, year == 2013, month == 6, day == 13) %>%
  group_by(dest) %>%
  summarise(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) -> delay_table
#delay_table
sum(delay_table$avg_arr_delay > 60, na.rm = TRUE) / count(delay_table)

##           n
## 1 0.5505618

#~55% of airports in the US saw avg arrival delays greater than 60 min

###13.5.1 Exercises

What does it mean for a flight to have a missing tailnum? What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)

filter(flights, is.na(tailnum))

## # A tibble: 2,512 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     2       NA           1545        NA       NA           1910
##  2  2013     1     2       NA           1601        NA       NA           1735
##  3  2013     1     3       NA            857        NA       NA           1209
##  4  2013     1     3       NA            645        NA       NA            952
##  5  2013     1     4       NA            845        NA       NA           1015
##  6  2013     1     4       NA           1830        NA       NA           2044
##  7  2013     1     5       NA            840        NA       NA           1001
##  8  2013     1     7       NA            820        NA       NA            958
##  9  2013     1     8       NA           1645        NA       NA           1838
## 10  2013     1     9       NA            755        NA       NA           1012
## # ... with 2,502 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

#The flights without a tailnum also are missing any arrival and dep info
#It would appear that these are cancelled flights

anti_join(flights, planes, by = 'tailnum') %>%
  group_by(carrier) %>%
  count(carrier)

## # A tibble: 10 x 2
## # Groups:   carrier [10]
##    carrier     n
##    <chr>   <int>
##  1 9E       1044
##  2 AA      22558
##  3 B6        830
##  4 DL        110
##  5 F9         50
##  6 FL        187
##  7 MQ      25397
##  8 UA       1693
##  9 US        699
## 10 WN         38

#Most of the missing flights come from American (AA) and Envoy Air (MQ)
#From the package description:
#Plane metadata for all plane tailnumbers found in the FAA aircraft registry. American Airways (AA) and Envoy Air (MQ) report fleet numbers rather than tail numbers so can't be matched.

Filter flights to only show flights with planes that have flown at least 100 flights.

#filter for 100+ flights
filter(flights, !is.na(tailnum)) %>%
  group_by(tailnum) %>%
  count(tailnum) %>%
  filter(n >= 100) -> flights100
flights100

## # A tibble: 1,217 x 2
## # Groups:   tailnum [1,217]
##    tailnum     n
##    <chr>   <int>
##  1 N0EGMQ    371
##  2 N10156    153
##  3 N10575    289
##  4 N11106    129
##  5 N11107    148
##  6 N11109    148
##  7 N11113    138
##  8 N11119    148
##  9 N11121    154
## 10 N11127    124
## # ... with 1,207 more rows

#semi-join to only select flights which match filter
semi_join(flights, flights100, by = 'tailnum')

## # A tibble: 228,390 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      544            545        -1     1004           1022
##  4  2013     1     1      554            558        -4      740            728
##  5  2013     1     1      555            600        -5      913            854
##  6  2013     1     1      557            600        -3      709            723
##  7  2013     1     1      557            600        -3      838            846
##  8  2013     1     1      558            600        -2      849            851
##  9  2013     1     1      558            600        -2      853            856
## 10  2013     1     1      558            600        -2      923            937
## # ... with 228,380 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Combine fueleconomy::vehicles and fueleconomy::common to find only the records for the most common models.

#vehicles
#common
semi_join(vehicles, common, by = c('make', 'model'))

## # A tibble: 14,531 x 12
##       id make  model    year class   trans  drive    cyl displ fuel    hwy   cty
##    <dbl> <chr> <chr>   <dbl> <chr>   <chr>  <chr>  <dbl> <dbl> <chr> <dbl> <dbl>
##  1  1833 Acura Integra  1986 Subcom~ Autom~ Front~     4   1.6 Regu~    28    22
##  2  1834 Acura Integra  1986 Subcom~ Manua~ Front~     4   1.6 Regu~    28    23
##  3  3037 Acura Integra  1987 Subcom~ Autom~ Front~     4   1.6 Regu~    28    22
##  4  3038 Acura Integra  1987 Subcom~ Manua~ Front~     4   1.6 Regu~    28    23
##  5  4183 Acura Integra  1988 Subcom~ Autom~ Front~     4   1.6 Regu~    27    22
##  6  4184 Acura Integra  1988 Subcom~ Manua~ Front~     4   1.6 Regu~    28    23
##  7  5303 Acura Integra  1989 Subcom~ Autom~ Front~     4   1.6 Regu~    27    22
##  8  5304 Acura Integra  1989 Subcom~ Manua~ Front~     4   1.6 Regu~    28    23
##  9  6442 Acura Integra  1990 Subcom~ Autom~ Front~     4   1.8 Regu~    24    20
## 10  6443 Acura Integra  1990 Subcom~ Manua~ Front~     4   1.8 Regu~    26    21
## # ... with 14,521 more rows

Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?

#NOTE
#I only am looking at 24 hours
#I imagine doing some kind of mutate() with an if statement, saying, look for any days that are +/- 1 from the current day, then combine each as one 48-hour period add together (one for the + and one for the -)

#Find worst 24 hour delays
group_by(flights, origin, year, month, day) %>%
  summarize(avg_daily_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_daily_delay)) -> worst_daily_delay

## `summarise()` has grouped output by 'origin', 'year', 'month'. You can override using the `.groups` argument.

worst_daily_delay

## # A tibble: 1,095 x 5
## # Groups:   origin, year, month [36]
##    origin  year month   day avg_daily_delay
##    <chr>  <int> <int> <int>           <dbl>
##  1 LGA     2013     3     8           106. 
##  2 EWR     2013     3     8            97.8
##  3 LGA     2013     9     2            80.5
##  4 LGA     2013    12     5            75.2
##  5 JFK     2013     7    10            63.6
##  6 LGA     2013     7     1            62.4
##  7 EWR     2013    12     5            61.2
##  8 JFK     2013     7     1            59.6
##  9 EWR     2013     9    12            58.7
## 10 EWR     2013     5    23            58.2
## # ... with 1,085 more rows

#filter for the worst 24 hours
filter(worst_daily_delay, year == 2013, month == 3, day == 8) -> worst_day

#Join with weather data - shows all three airports' hourly weather from the worst delayed day...Then show the avg weather from the day for each airport
semi_join(weather, worst_day, by = c('origin', 'year', 'month', 'day')) %>%
  group_by(origin) %>%
  summarize(avg_wind_spd = mean(wind_speed, na.rm = TRUE), avg_visib = mean(visib, na.rm = TRUE), avg_prec = mean(precip, na.rm = TRUE))

## # A tibble: 3 x 4
##   origin avg_wind_spd avg_visib avg_prec
##   <chr>         <dbl>     <dbl>    <dbl>
## 1 EWR            13.5      4.53   0.0396
## 2 JFK            19.5      5.18   0.0075
## 3 LGA            17.7      5.02   0.0104

#From previous questions, delays were associate with visibility below 2.5
#Delays were also pretty linearly correlated with increasing windseepd above ~5
#Additionally, any percipitation above ~0.15 also had some correlation
#From the chart here, the only data that matches that previous work is higher wind speed

What does anti_join(flights, airports, by = c(“dest” = “faa”)) tell you? What does anti_join(airports, flights, by = c(“faa” = “dest”)) tell you?

#This shows flights were the destination airport had no match in the airports' faa column
anti_join(flights, airports, by = c("dest" = "faa"))

## # A tibble: 7,602 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      544            545        -1     1004           1022
##  2  2013     1     1      615            615         0     1039           1100
##  3  2013     1     1      628            630        -2     1137           1140
##  4  2013     1     1      701            700         1     1123           1154
##  5  2013     1     1      711            715        -4     1151           1206
##  6  2013     1     1      820            820         0     1254           1310
##  7  2013     1     1      820            820         0     1249           1329
##  8  2013     1     1      840            845        -5     1311           1350
##  9  2013     1     1      909            810        59     1331           1315
## 10  2013     1     1      913            918        -5     1346           1416
## # ... with 7,592 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

#This shows airports in the airports' faa column which were never destinations of the flights in the flights table
#It's not surprising that many of them are in the immediate NYC area
anti_join(airports, flights, by = c("faa" = "dest"))

## # A tibble: 1,357 x 8
##    faa   name                             lat    lon   alt    tz dst   tzone    
##    <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/~
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/~
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/~
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/~
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/~
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/~
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/~
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/~
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/~
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/~
## # ... with 1,347 more rows

You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned above.

#Off the top of my head, a plane may have been owned by multiple airlines
#The plane will only show up once in any airlines' data, but it may show up in multiple airlines
#additionally, similar to the batting table from before, a plane might have multiple "stints" with a given airlines if it was bought/sold several times

#distinct routes by carrier
flights %>%
  filter(!is.na(tailnum)) %>%
  distinct(tailnum, carrier) -> routes
routes

## # A tibble: 4,060 x 2
##    carrier tailnum
##    <chr>   <chr>  
##  1 UA      N14228 
##  2 UA      N24211 
##  3 AA      N619AA 
##  4 B6      N804JB 
##  5 DL      N668DN 
##  6 UA      N39463 
##  7 B6      N516JB 
##  8 EV      N829AS 
##  9 B6      N593JB 
## 10 AA      N3ALAA 
## # ... with 4,050 more rows

#tailnumbers, i.e. planes which have a route with more than 1 carrier
filter(count(routes, tailnum), n > 1) -> multi_carrier
multi_carrier

## # A tibble: 17 x 2
##    tailnum     n
##    <chr>   <int>
##  1 N146PQ      2
##  2 N153PQ      2
##  3 N176PQ      2
##  4 N181PQ      2
##  5 N197PQ      2
##  6 N200PQ      2
##  7 N228PQ      2
##  8 N232PQ      2
##  9 N933AT      2
## 10 N935AT      2
## 11 N977AT      2
## 12 N978AT      2
## 13 N979AT      2
## 14 N981AT      2
## 15 N989AT      2
## 16 N990AT      2
## 17 N994AT      2

This work is based on “R for Data Science” by Hadley and Grolemund. Tt is licensed under

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/us/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.

Week of November 15

Andrew Christian

Working with Relational Data

Task 1

Introduction to relational data

Task 2

Types of keys

Definition: A primary key and a matching foreign key in another table define a relation.

Taks 3

Mutating joins

Understanding joins

Outer Joins

What Happens when we have duplicate keys

Defining Key Columns