library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.2
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 4.1.2
So far you have been working with data in one table. Usually stored as a tibble or a dataframe. But IRL (and even the the Metaverse :)) it is a rare thing to work with only one table. Most frequently this takes the froms of working with a relational base. And while there exist specialized languages, most notably SQL, to work with relational data bases, it is useful to be able to perform some of the same functions in your development language, whether that be R, Pyton, Julia, Matlap, Tableau, even Excel.
Last week you worked with the nycflights13 data base, but you worked only with the flights table. I believe I mentioned the fact that nycflights13 had other tables as well.
nycflights13 and tidyverse. Use install.packages but do the installation in the console pane.#Done
#Done
Your answer:
#There are 5 tables:
#airlines, airports, flights, planes, and weather
Your answer:
#airports
count(airports)
## # A tibble: 1 x 1
## n
## <int>
## 1 1458
Your answer:
#planes
#tailnum, (manufactured) year, type, manufacturer/model, (number) engines/seats, speed, (type) engine
You will find Chapter 13 in R for Data Science usefull background reading.
The following passage is from R4DS:
"To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data:
Mutating joins, which add new variables to one data frame from matching observations in another.
Filtering joins, which filter observations from one data frame base
Set operations, which treat observations as if they were set elements.
The most common place to find relational data is in a relational database management system (or RDBMS), a term that encompasses almost all modern databases. If you’ve used a database before, you’ve almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different. Generally, dplyr is a little easier to use than SQL because dplyr is specialised to do data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren’t commonly needed for data analysis.d on whether or not they match an observation in the other table.
Examine the following table and verify that the associations depicted between the data tables is correct. (also from R4DS)
1. What information can be gained that would not be in the flights data table? (be specific)
#The timezone of the airport can be gotten from the airports table
#Information regarding the manufacturer and year made, number of seats and engines, engine type, etc. can be gotten from the planes table
#weather information for each day can be obtained from the weather table
#The full name of the airlines can be gotten from the airlines table
#I would need to know the loaction of the orgin and destination airport
#Then I could assume an "as the crow flies" path.
#To get this, I would have to combine the flights table with the airports table, matching the origin and destination airports with their lat and lon
weather and airports, not shown in the figure above. What is the relationship and how would it appear?#weather
#airports
#filter(airports, faa == 'EWR')
#You could tie the specific latitude and longitude of the weather described in the weather table by matching it with the given NYC area airport. Specifically, the origin column in airports is a foreign key matching the faa column in airports (faa being the primary key for the airports table). However, you could only do this with the NYC area airports as the weather table only has data for them.
weather only contains information for the origin (NYC) airports. If it contained weather records for all airports in the USA, what additional relation would it define with flights?#flights
#weather
#If all airport weather was included in the weather table- year, month, day, hour, and origin/dest airport could be linked (via another join with the faa column in airports) between the flights and weather tables. This would make those columns a foreign key in the flights table. It would allow one to look at weather data from the origin and destination airports, giving insight into delays and cancellations etc.
Connecting variables are called keys. By definition a key is a variable or set of variables that uniquely identifies an observation.
primary key uniquely identifies an observation in its own table.
foreign key uniquely identifies an observation in another table.
nycflights13?#faa is a primary key for the airports table, but a foreign key for the weather table (designated "origin" in weather)
#you could do a count of that primary key, n should never be greater than 1
count(airports, faa)
## # A tibble: 1,458 x 2
## faa n
## <chr> <int>
## 1 04G 1
## 2 06A 1
## 3 06C 1
## 4 06N 1
## 5 09J 1
## 6 0A9 1
## 7 0G6 1
## 8 0G7 1
## 9 0P2 1
## 10 0S9 1
## # ... with 1,448 more rows
filter(count(airports, faa), n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: faa <chr>, n <int>
#each faa value only ever corresponds to, at most, 1 element
flights?#flights
#At first it is tempting to think tailnum will be a primary key
filter(flights, tailnum == 'N14228')
## # A tibble: 111 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 8 1435 1440 -5 1717 1746
## 3 2013 1 9 717 700 17 812 815
## 4 2013 1 9 1143 1144 -1 1425 1445
## 5 2013 1 13 835 824 11 1030 951
## 6 2013 1 16 1829 1730 59 2117 2023
## 7 2013 1 22 1902 1808 54 2214 2106
## 8 2013 1 23 1050 1056 -6 1143 1208
## 9 2013 1 23 1533 1529 4 1641 1645
## 10 2013 1 25 724 720 4 1000 1023
## # ... with 101 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
#however, it can be seen that there are many flights with the same tailnum- it just designates a route, not an individual flight
#There is no one column that is the primary key for flights
#Even combining- year, month, day with a flight or tailnum doesn't work, as some flights/tailnums are used multiple times
filter(count(flights, year, month, day, flight), n > 1)
## # A tibble: 29,768 x 5
## year month day flight n
## <int> <int> <int> <int> <int>
## 1 2013 1 1 1 2
## 2 2013 1 1 3 2
## 3 2013 1 1 4 2
## 4 2013 1 1 11 3
## 5 2013 1 1 15 2
## 6 2013 1 1 21 2
## 7 2013 1 1 27 4
## 8 2013 1 1 31 2
## 9 2013 1 1 32 2
## 10 2013 1 1 35 2
## # ... with 29,758 more rows
filter(count(flights, year, month, day, tailnum), n > 1)
## # A tibble: 64,928 x 5
## year month day tailnum n
## <int> <int> <int> <chr> <int>
## 1 2013 1 1 N0EGMQ 2
## 2 2013 1 1 N11189 2
## 3 2013 1 1 N11536 2
## 4 2013 1 1 N11544 3
## 5 2013 1 1 N11551 2
## 6 2013 1 1 N12540 2
## 7 2013 1 1 N12567 2
## 8 2013 1 1 N13123 2
## 9 2013 1 1 N13538 3
## 10 2013 1 1 N13566 3
## # ... with 64,918 more rows
#Therfore you have to use the rownum/index of the flight from the dataframe
mutate(flights, unique_flight_id = row_number()) %>%
select(unique_flight_id, everything())
## # A tibble: 336,776 x 20
## unique_flight_id year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <int> <dbl> <int>
## 1 1 2013 1 1 517 515 2 830
## 2 2 2013 1 1 533 529 4 850
## 3 3 2013 1 1 542 540 2 923
## 4 4 2013 1 1 544 545 -1 1004
## 5 5 2013 1 1 554 600 -6 812
## 6 6 2013 1 1 554 558 -4 740
## 7 7 2013 1 1 555 600 -5 913
## 8 8 2013 1 1 557 600 -3 709
## 9 9 2013 1 1 557 600 -3 838
## 10 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
#—
If a table does not have a primary key then if a primary key is needed, then one can be created using mutate() and row_number(), such a created key is called a surrogate key.
#— ### Task 4
#Most relations are one-to-many - the key(primary or foreign), links one value in a given table to multiple values in another table
#Yes, many-to-many relationships generally flow through a single connection though
#For example, you may have a table of students in a class and a table of student test scores- a many-to-many relationship
#but these two groups of values would be linked through a single class
flights?#I used the row_number() function to create a surrogate key for flights
#none of the variables could combine to uniquely identify every flight
Lahman::Battingbabynames::babynamesnasaweather::atmosfueleconomy::vehiclesggplot2::diamondslibrary(Lahman)
## Warning: package 'Lahman' was built under R version 4.1.2
library(babynames)
## Warning: package 'babynames' was built under R version 4.1.2
library(nasaweather)
##
## Attaching package: 'nasaweather'
## The following object is masked from 'package:dplyr':
##
## storms
library(fueleconomy)
## Warning: package 'fueleconomy' was built under R version 4.1.2
library(ggplot2)
#playerID, yearID, and stint combine to make the primary key, i.e. no more than 1 instance of such a combination occurs
#for context, a player was chosen to show where they had multiple stints in a year, thus a specific stint would also need to be selected to uniquely identify a row
filter(count(Batting, playerID, yearID, stint), n > 1)
## [1] playerID yearID stint n
## <0 rows> (or 0-length row.names)
filter(Batting, playerID == 'alyeabr01', yearID == 1972)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB
## 1 alyeabr01 1972 1 OAK AL 10 13 1 3 1 0 0 0 0 0 1 0 0
## 2 alyeabr01 1972 2 SLN NL 13 19 0 3 1 0 0 1 0 0 0 6 0
## 3 alyeabr01 1972 3 OAK AL 10 18 2 3 0 0 1 2 0 0 2 5 0
## HBP SH SF GIDP
## 1 0 0 0 1
## 2 0 0 0 0
## 3 0 0 0 0
#babynames
filter(count(babynames, year, sex, name), n > 1)
## # A tibble: 0 x 4
## # ... with 4 variables: year <dbl>, sex <chr>, name <chr>, n <int>
#year, sex, and name combine to make a primary key for babynames
#atmos
filter(count(atmos, lat, long, year, month), n > 1)
## # A tibble: 0 x 5
## # ... with 5 variables: lat <dbl>, long <dbl>, year <int>, month <int>, n <int>
#lat, long, year, and month combine to make a primary key for atmos
#vehicles
filter(count(vehicles, id), n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: id <dbl>, n <int>
#id is the primary key for vehicles
#diamonds
filter(count(diamonds, carat, cut, color, clarity, depth, table, price, x, y, z), n > 1)
## # A tibble: 143 x 11
## carat cut color clarity depth table price x y z n
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
## 1 0.3 Good J VS1 63.4 57 394 4.23 4.26 2.69 2
## 2 0.3 Very Good G VS2 63 55 526 4.29 4.31 2.71 2
## 3 0.3 Very Good J VS1 63.4 57 506 4.26 4.23 2.69 2
## 4 0.3 Premium D SI1 62.2 58 709 4.31 4.28 2.67 2
## 5 0.3 Ideal G VS2 63 55 675 4.31 4.29 2.71 2
## 6 0.3 Ideal G IF 62.1 55 863 4.32 4.35 2.69 2
## 7 0.3 Ideal H SI1 62.2 57 450 4.26 4.29 2.66 2
## 8 0.3 Ideal H SI1 62.2 57 450 4.27 4.28 2.66 2
## 9 0.31 Good D SI1 63.5 56 571 4.29 4.31 2.73 2
## 10 0.31 Very Good D SI1 63.5 56 732 4.31 4.29 2.73 2
## # ... with 133 more rows
#no primary key can be made from the variables, thus using row_num is necessarry
mutate(diamonds, 'diamond_id' = row_number()) %>%
select(diamond_id, everything())
## # A tibble: 53,940 x 11
## diamond_id carat cut color clarity depth table price x y z
## <int> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # ... with 53,930 more rows
A mutating join allows you to combine variables from two tables.
Example:
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 x 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # ... with 336,766 more rows
Suppose you wand to add the full airline name to the flights2 data.
You can combine airlines and flights2 data frames with left_join()
flights2 %>%
select(-origin,-dest) %>%
left_join(airlines,by = "carrier")
## # A tibble: 336,776 x 7
## year month day hour tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr>
## 1 2013 1 1 5 N14228 UA United Air Lines Inc.
## 2 2013 1 1 5 N24211 UA United Air Lines Inc.
## 3 2013 1 1 5 N619AA AA American Airlines Inc.
## 4 2013 1 1 5 N804JB B6 JetBlue Airways
## 5 2013 1 1 6 N668DN DL Delta Air Lines Inc.
## 6 2013 1 1 5 N39463 UA United Air Lines Inc.
## 7 2013 1 1 6 N516JB B6 JetBlue Airways
## 8 2013 1 1 6 N829AS EV ExpressJet Airlines Inc.
## 9 2013 1 1 6 N593JB B6 JetBlue Airways
## 10 2013 1 1 6 N3ALAA AA American Airlines Inc.
## # ... with 336,766 more rows
Consider the data frames: We can make this data as follows:
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
3, "x3"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
4, "y3"
)
x
## # A tibble: 3 x 2
## key val_x
## <dbl> <chr>
## 1 1 x1
## 2 2 x2
## 3 3 x3
y
## # A tibble: 3 x 2
## key val_y
## <dbl> <chr>
## 1 1 y1
## 2 2 y2
## 3 4 y3
This type of join is called an inner join which means pairs of observations are matched when their keys are equal.
x %>%
inner_join(y, by = "key")
## # A tibble: 2 x 3
## key val_x val_y
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
#> # A tibble: 2 x 3
#> key val_x val_y
#> <dbl> <chr> <chr>
#> 1 1 x1 y1
#> 2 2 x2 y2
notice: unmatched rows are not included in the result.
While an inner join keeps only observations that appear in both tables and outer join keeps all observations that appear in one of the tables.
All the diagrams above assume that there are no duplicate keys in a data set. Is this even true for flights? Can you think of an example?
#As seen from a question in a previous section, this is not true for flights. No combination of variables uniquely identified a flight. It was necessary to create a new flight_id key using row_num()
Here is a diagram from Wickham and Grolemund illustrating the situation
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
1, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2"
)
left_join(x, y, by = "key")
## # A tibble: 4 x 3
## key val_x val_y
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
## 3 2 x3 y2
## 4 1 x4 y1
#> # A tibble: 4 x 3
#> key val_x val_y
#> <dbl> <chr> <chr>
#> 1 1 x1 y1
#> 2 2 x2 y2
#> 3 2 x3 y2
#> 4 1 x4 y1
Note: in this case, key is a foreign key in x and a primary key in y.
If boh tables have duplicate keys the keys do not uniquely identify an observation. In this case a join results in all possible combinations (cartesian product)
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
3, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
2, "y3",
3, "y4"
)
left_join(x, y, by = "key")
## # A tibble: 6 x 3
## key val_x val_y
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
## 3 2 x2 y3
## 4 2 x3 y2
## 5 2 x3 y3
## 6 3 x4 y4
In the above examples the key values were single variables and in left_jointhe ke was encoded by by = "key". But one can use other values for by= the default is by = NULL. What does this mean?
#let's get flights2 back to eliminate extraneous columns
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2 %>%
left_join(weather) #note I'm taking all the defaults
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 18
## year month day hour origin dest tailnum carrier temp dewp humid
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA 39.0 28.0 64.4
## 2 2013 1 1 5 LGA IAH N24211 UA 39.9 25.0 54.8
## 3 2013 1 1 5 JFK MIA N619AA AA 39.0 27.0 61.6
## 4 2013 1 1 5 JFK BQN N804JB B6 39.0 27.0 61.6
## 5 2013 1 1 6 LGA ATL N668DN DL 39.9 25.0 54.8
## 6 2013 1 1 5 EWR ORD N39463 UA 39.0 28.0 64.4
## 7 2013 1 1 6 EWR FLL N516JB B6 37.9 28.0 67.2
## 8 2013 1 1 6 LGA IAD N829AS EV 39.9 25.0 54.8
## 9 2013 1 1 6 JFK MCO N593JB B6 37.9 27.0 64.3
## 10 2013 1 1 6 LGA ORD N3ALAA AA 39.9 25.0 54.8
## # ... with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## # wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>
It means that we join by mathcing on all common variables, see console message above.
On the other hand if match on a key, for example tailnum is a foreign key in flights for planes
flights2 %>%
left_join(planes, by="tailnum")
## # A tibble: 336,776 x 16
## year.x month day hour origin dest tailnum carrier year.y type
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixed wing mult~
## 2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixed wing mult~
## 3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixed wing mult~
## 4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixed wing mult~
## 5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixed wing mult~
## 6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixed wing mult~
## 7 2013 1 1 6 EWR FLL N516JB B6 2000 Fixed wing mult~
## 8 2013 1 1 6 LGA IAD N829AS EV 1998 Fixed wing mult~
## 9 2013 1 1 6 JFK MCO N593JB B6 2004 Fixed wing mult~
## 10 2013 1 1 6 LGA ORD N3ALAA AA NA <NA>
## # ... with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## # model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
suppose we wanted to draw a map of all destination air ports for flights from NewYork in 2013
flights2%>%
left_join(airports, c("dest"="faa"))
## # A tibble: 336,776 x 15
## year month day hour origin dest tailnum carrier name lat lon alt
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA Georg~ 30.0 -95.3 97
## 2 2013 1 1 5 LGA IAH N24211 UA Georg~ 30.0 -95.3 97
## 3 2013 1 1 5 JFK MIA N619AA AA Miami~ 25.8 -80.3 8
## 4 2013 1 1 5 JFK BQN N804JB B6 <NA> NA NA NA
## 5 2013 1 1 6 LGA ATL N668DN DL Harts~ 33.6 -84.4 1026
## 6 2013 1 1 5 EWR ORD N39463 UA Chica~ 42.0 -87.9 668
## 7 2013 1 1 6 EWR FLL N516JB B6 Fort ~ 26.1 -80.2 9
## 8 2013 1 1 6 LGA IAD N829AS EV Washi~ 38.9 -77.5 313
## 9 2013 1 1 6 JFK MCO N593JB B6 Orlan~ 28.4 -81.3 96
## 10 2013 1 1 6 LGA ORD N3ALAA AA Chica~ 42.0 -87.9 668
## # ... with 336,766 more rows, and 3 more variables: tz <dbl>, dst <chr>,
## # tzone <chr>
##Task 5 Homework - Exercises from 13.3.6
#Doing questions from 13.4.6
airports %>%
semi_join(flights, c("faa" = "dest")) %>%
ggplot(aes(lon, lat)) +
borders("state") +
geom_point() +
coord_quickmap()
…
#Computing average delay by destination
#flights3 <- group_by(flights, dest)
#summarise(flights3, avg_delay = mean(arr_delay, na.rm = TRUE))
flights %>%
group_by(dest) %>%
summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) -> delay_data
#delay_data
…
#joining with airports table
delay_data %>%
left_join(airports, by = c('dest' = 'faa')) -> delay_data2
delay_data2
## # A tibble: 105 x 9
## dest avg_delay name lat lon alt tz dst tzone
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 ABQ 4.38 Albuquerque Intern~ 35.0 -107. 5355 -7 A America/D~
## 2 ACK 4.85 Nantucket Mem 41.3 -70.1 48 -5 A America/N~
## 3 ALB 14.4 Albany Intl 42.7 -73.8 285 -5 A America/N~
## 4 ANC -2.5 Ted Stevens Anchor~ 61.2 -150. 152 -9 A America/A~
## 5 ATL 11.3 Hartsfield Jackson~ 33.6 -84.4 1026 -5 A America/N~
## 6 AUS 6.02 Austin Bergstrom I~ 30.2 -97.7 542 -6 A America/C~
## 7 AVL 8.00 Asheville Regional~ 35.4 -82.5 2165 -5 A America/N~
## 8 BDL 7.05 Bradley Intl 41.9 -72.7 173 -5 A America/N~
## 9 BGR 8.03 Bangor Intl 44.8 -68.8 192 -5 A America/N~
## 10 BHM 16.9 Birmingham Intl 33.6 -86.8 644 -6 A America/C~
## # ... with 95 more rows
…
#Displaying graph of delay data
delay_data2 %>%
ggplot(aes(lon, lat, color = avg_delay)) +
borders("state") +
geom_point() +
coord_quickmap()
## Warning: Removed 4 rows containing missing values (geom_point).
#joing flights with airports, adding lat and lon
select(airports, faa, lat, lon) -> latlon
#latlon <- rename(latlon, origin_lat = lat, origin_lon = lon)
#latlon
flights %>%
left_join(latlon, by = c('origin' = 'faa')) %>%
left_join(latlon, by = c('dest' = 'faa'), suffix = c('_origin', '_dest')) %>%
select(origin, lat_origin, lon_origin, dest, lat_dest, lon_dest, everything())
## # A tibble: 336,776 x 23
## origin lat_origin lon_origin dest lat_dest lon_dest year month day
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <int> <int> <int>
## 1 EWR 40.7 -74.2 IAH 30.0 -95.3 2013 1 1
## 2 LGA 40.8 -73.9 IAH 30.0 -95.3 2013 1 1
## 3 JFK 40.6 -73.8 MIA 25.8 -80.3 2013 1 1
## 4 JFK 40.6 -73.8 BQN NA NA 2013 1 1
## 5 LGA 40.8 -73.9 ATL 33.6 -84.4 2013 1 1
## 6 EWR 40.7 -74.2 ORD 42.0 -87.9 2013 1 1
## 7 EWR 40.7 -74.2 FLL 26.1 -80.2 2013 1 1
## 8 LGA 40.8 -73.9 IAD 38.9 -77.5 2013 1 1
## 9 JFK 40.6 -73.8 MCO 28.4 -81.3 2013 1 1
## 10 LGA 40.8 -73.9 ORD 42.0 -87.9 2013 1 1
## # ... with 336,766 more rows, and 14 more variables: dep_time <int>,
## # sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
#select plane info to join
select(planes, tailnum, year) -> plane_data
plane_data <- rename(plane_data, plane_built = year)
#plane_data
#join plane data with flights and graph (arr_delay)
flights %>%
left_join(plane_data, by = 'tailnum') %>%
mutate(plane_age = year - plane_built) %>%
group_by(plane_age) %>%
summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
ggplot(mapping = aes(x = plane_age, y = avg_arr_delay)) +
geom_point()
## Warning: Removed 1 rows containing missing values (geom_point).
#join plane data with flights and graph (dep_delay)
flights %>%
left_join(plane_data, by = 'tailnum') %>%
mutate(plane_age = year - plane_built) %>%
group_by(plane_age) %>%
summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(mapping = aes(x = plane_age, y = avg_dep_delay)) +
geom_point()
## Warning: Removed 1 rows containing missing values (geom_point).
#This data would indicate that the most significant delays (both arr and dep) happen around the 10 year mark from a plane's construction.
#weather
#flights
#Natural join -- dep_delay with percepitation
left_join(flights, weather) %>%
group_by(precip) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(mapping = aes(x = precip, y = avg_dep_delay)) +
geom_point()
## Joining, by = c("year", "month", "day", "origin", "hour", "time_hour")
## Warning: Removed 1 rows containing missing values (geom_point).
#dep delay by temperature
left_join(flights, weather) %>%
group_by(temp) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(mapping = aes(x = temp, y = avg_dep_delay)) +
geom_point()
## Joining, by = c("year", "month", "day", "origin", "hour", "time_hour")
## Warning: Removed 1 rows containing missing values (geom_point).
#dep delay by wind speed
left_join(flights, weather) %>%
group_by(wind_speed) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(mapping = aes(x = wind_speed, y = avg_dep_delay)) +
geom_point()
## Joining, by = c("year", "month", "day", "origin", "hour", "time_hour")
## Warning: Removed 1 rows containing missing values (geom_point).
#dep delay by temperature
left_join(flights, weather) %>%
group_by(visib) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(mapping = aes(x = visib, y = avg_dep_delay)) +
geom_point()
## Joining, by = c("year", "month", "day", "origin", "hour", "time_hour")
## Warning: Removed 1 rows containing missing values (geom_point).
#From these graphs:
#There is a correlation between increased precipitation and longer dep_delay, but it is not very tight
#temperature has no correlation until temps are greater than 75, then there is a slight correlation with increased dep_delay
#There is a pretty strong/tight correlation with wind speed
#There is a correlation with visibility below 2.5
#according to google, there were a series of severe weather patterns that caused all kinds of storms across the US.
#https://en.wikipedia.org/wiki/June_12%E2%80%9313,_2013_derecho_series
#https://en.wikipedia.org/wiki/Portal:Current_events/2013_June_13
#filter for the given day, group and summarise for avg_arr_delay, join with location data, then graph
filter(flights, year == 2013, month == 6, day == 13) %>%
group_by(dest) %>%
summarise(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
left_join(airports, by = c('dest' = 'faa')) %>%
ggplot(mapping = aes(x = lon, y = lat, color = avg_arr_delay)) +
borders('state') +
geom_point() +
coord_quickmap()
## Warning: Removed 4 rows containing missing values (geom_point).
#percentage of detinations that saw more than 60 min avg. delay
filter(flights, year == 2013, month == 6, day == 13) %>%
group_by(dest) %>%
summarise(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) -> delay_table
#delay_table
sum(delay_table$avg_arr_delay > 60, na.rm = TRUE) / count(delay_table)
## n
## 1 0.5505618
#~55% of airports in the US saw avg arrival delays greater than 60 min
###13.5.1 Exercises
filter(flights, is.na(tailnum))
## # A tibble: 2,512 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 2 NA 1545 NA NA 1910
## 2 2013 1 2 NA 1601 NA NA 1735
## 3 2013 1 3 NA 857 NA NA 1209
## 4 2013 1 3 NA 645 NA NA 952
## 5 2013 1 4 NA 845 NA NA 1015
## 6 2013 1 4 NA 1830 NA NA 2044
## 7 2013 1 5 NA 840 NA NA 1001
## 8 2013 1 7 NA 820 NA NA 958
## 9 2013 1 8 NA 1645 NA NA 1838
## 10 2013 1 9 NA 755 NA NA 1012
## # ... with 2,502 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
#The flights without a tailnum also are missing any arrival and dep info
#It would appear that these are cancelled flights
anti_join(flights, planes, by = 'tailnum') %>%
group_by(carrier) %>%
count(carrier)
## # A tibble: 10 x 2
## # Groups: carrier [10]
## carrier n
## <chr> <int>
## 1 9E 1044
## 2 AA 22558
## 3 B6 830
## 4 DL 110
## 5 F9 50
## 6 FL 187
## 7 MQ 25397
## 8 UA 1693
## 9 US 699
## 10 WN 38
#Most of the missing flights come from American (AA) and Envoy Air (MQ)
#From the package description:
#Plane metadata for all plane tailnumbers found in the FAA aircraft registry. American Airways (AA) and Envoy Air (MQ) report fleet numbers rather than tail numbers so can't be matched.
#filter for 100+ flights
filter(flights, !is.na(tailnum)) %>%
group_by(tailnum) %>%
count(tailnum) %>%
filter(n >= 100) -> flights100
flights100
## # A tibble: 1,217 x 2
## # Groups: tailnum [1,217]
## tailnum n
## <chr> <int>
## 1 N0EGMQ 371
## 2 N10156 153
## 3 N10575 289
## 4 N11106 129
## 5 N11107 148
## 6 N11109 148
## 7 N11113 138
## 8 N11119 148
## 9 N11121 154
## 10 N11127 124
## # ... with 1,207 more rows
#semi-join to only select flights which match filter
semi_join(flights, flights100, by = 'tailnum')
## # A tibble: 228,390 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 544 545 -1 1004 1022
## 4 2013 1 1 554 558 -4 740 728
## 5 2013 1 1 555 600 -5 913 854
## 6 2013 1 1 557 600 -3 709 723
## 7 2013 1 1 557 600 -3 838 846
## 8 2013 1 1 558 600 -2 849 851
## 9 2013 1 1 558 600 -2 853 856
## 10 2013 1 1 558 600 -2 923 937
## # ... with 228,380 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
#vehicles
#common
semi_join(vehicles, common, by = c('make', 'model'))
## # A tibble: 14,531 x 12
## id make model year class trans drive cyl displ fuel hwy cty
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1833 Acura Integra 1986 Subcom~ Autom~ Front~ 4 1.6 Regu~ 28 22
## 2 1834 Acura Integra 1986 Subcom~ Manua~ Front~ 4 1.6 Regu~ 28 23
## 3 3037 Acura Integra 1987 Subcom~ Autom~ Front~ 4 1.6 Regu~ 28 22
## 4 3038 Acura Integra 1987 Subcom~ Manua~ Front~ 4 1.6 Regu~ 28 23
## 5 4183 Acura Integra 1988 Subcom~ Autom~ Front~ 4 1.6 Regu~ 27 22
## 6 4184 Acura Integra 1988 Subcom~ Manua~ Front~ 4 1.6 Regu~ 28 23
## 7 5303 Acura Integra 1989 Subcom~ Autom~ Front~ 4 1.6 Regu~ 27 22
## 8 5304 Acura Integra 1989 Subcom~ Manua~ Front~ 4 1.6 Regu~ 28 23
## 9 6442 Acura Integra 1990 Subcom~ Autom~ Front~ 4 1.8 Regu~ 24 20
## 10 6443 Acura Integra 1990 Subcom~ Manua~ Front~ 4 1.8 Regu~ 26 21
## # ... with 14,521 more rows
#NOTE
#I only am looking at 24 hours
#I imagine doing some kind of mutate() with an if statement, saying, look for any days that are +/- 1 from the current day, then combine each as one 48-hour period add together (one for the + and one for the -)
#Find worst 24 hour delays
group_by(flights, origin, year, month, day) %>%
summarize(avg_daily_delay = mean(dep_delay, na.rm = TRUE)) %>%
arrange(desc(avg_daily_delay)) -> worst_daily_delay
## `summarise()` has grouped output by 'origin', 'year', 'month'. You can override using the `.groups` argument.
worst_daily_delay
## # A tibble: 1,095 x 5
## # Groups: origin, year, month [36]
## origin year month day avg_daily_delay
## <chr> <int> <int> <int> <dbl>
## 1 LGA 2013 3 8 106.
## 2 EWR 2013 3 8 97.8
## 3 LGA 2013 9 2 80.5
## 4 LGA 2013 12 5 75.2
## 5 JFK 2013 7 10 63.6
## 6 LGA 2013 7 1 62.4
## 7 EWR 2013 12 5 61.2
## 8 JFK 2013 7 1 59.6
## 9 EWR 2013 9 12 58.7
## 10 EWR 2013 5 23 58.2
## # ... with 1,085 more rows
#filter for the worst 24 hours
filter(worst_daily_delay, year == 2013, month == 3, day == 8) -> worst_day
#Join with weather data - shows all three airports' hourly weather from the worst delayed day...Then show the avg weather from the day for each airport
semi_join(weather, worst_day, by = c('origin', 'year', 'month', 'day')) %>%
group_by(origin) %>%
summarize(avg_wind_spd = mean(wind_speed, na.rm = TRUE), avg_visib = mean(visib, na.rm = TRUE), avg_prec = mean(precip, na.rm = TRUE))
## # A tibble: 3 x 4
## origin avg_wind_spd avg_visib avg_prec
## <chr> <dbl> <dbl> <dbl>
## 1 EWR 13.5 4.53 0.0396
## 2 JFK 19.5 5.18 0.0075
## 3 LGA 17.7 5.02 0.0104
#From previous questions, delays were associate with visibility below 2.5
#Delays were also pretty linearly correlated with increasing windseepd above ~5
#Additionally, any percipitation above ~0.15 also had some correlation
#From the chart here, the only data that matches that previous work is higher wind speed
#This shows flights were the destination airport had no match in the airports' faa column
anti_join(flights, airports, by = c("dest" = "faa"))
## # A tibble: 7,602 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 544 545 -1 1004 1022
## 2 2013 1 1 615 615 0 1039 1100
## 3 2013 1 1 628 630 -2 1137 1140
## 4 2013 1 1 701 700 1 1123 1154
## 5 2013 1 1 711 715 -4 1151 1206
## 6 2013 1 1 820 820 0 1254 1310
## 7 2013 1 1 820 820 0 1249 1329
## 8 2013 1 1 840 845 -5 1311 1350
## 9 2013 1 1 909 810 59 1331 1315
## 10 2013 1 1 913 918 -5 1346 1416
## # ... with 7,592 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
#This shows airports in the airports' faa column which were never destinations of the flights in the flights table
#It's not surprising that many of them are in the immediate NYC area
anti_join(airports, flights, by = c("faa" = "dest"))
## # A tibble: 1,357 x 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/~
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/~
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/~
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/~
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/~
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/~
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/~
## 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/~
## 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/~
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/~
## # ... with 1,347 more rows
#Off the top of my head, a plane may have been owned by multiple airlines
#The plane will only show up once in any airlines' data, but it may show up in multiple airlines
#additionally, similar to the batting table from before, a plane might have multiple "stints" with a given airlines if it was bought/sold several times
#distinct routes by carrier
flights %>%
filter(!is.na(tailnum)) %>%
distinct(tailnum, carrier) -> routes
routes
## # A tibble: 4,060 x 2
## carrier tailnum
## <chr> <chr>
## 1 UA N14228
## 2 UA N24211
## 3 AA N619AA
## 4 B6 N804JB
## 5 DL N668DN
## 6 UA N39463
## 7 B6 N516JB
## 8 EV N829AS
## 9 B6 N593JB
## 10 AA N3ALAA
## # ... with 4,050 more rows
#tailnumbers, i.e. planes which have a route with more than 1 carrier
filter(count(routes, tailnum), n > 1) -> multi_carrier
multi_carrier
## # A tibble: 17 x 2
## tailnum n
## <chr> <int>
## 1 N146PQ 2
## 2 N153PQ 2
## 3 N176PQ 2
## 4 N181PQ 2
## 5 N197PQ 2
## 6 N200PQ 2
## 7 N228PQ 2
## 8 N232PQ 2
## 9 N933AT 2
## 10 N935AT 2
## 11 N977AT 2
## 12 N978AT 2
## 13 N979AT 2
## 14 N981AT 2
## 15 N989AT 2
## 16 N990AT 2
## 17 N994AT 2
This work is based on “R for Data Science” by Hadley and Grolemund. Tt is licensed under
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/us/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.