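Question One: Solve problem 13.5.1.5 from the textbook. Answer: The first line of code (25) returns the rows of flights whose destination code (dest) has no matching faa code in airports, that is, flights to destinations that are missing from the airports dataset. The second line of code (26) returns the rows of airports whose faa code never appears as a destination in flights, that is, airports that no flight in our data flew to. A missing destination does not by itself mean the flight left the country; it only means the airports table has no record for that code.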

anti_join(flights, airports, by = c("dest" = "faa"))
## # A tibble: 7,602 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      544            545        -1     1004           1022
##  2  2013     1     1      615            615         0     1039           1100
##  3  2013     1     1      628            630        -2     1137           1140
##  4  2013     1     1      701            700         1     1123           1154
##  5  2013     1     1      711            715        -4     1151           1206
##  6  2013     1     1      820            820         0     1254           1310
##  7  2013     1     1      820            820         0     1249           1329
##  8  2013     1     1      840            845        -5     1311           1350
##  9  2013     1     1      909            810        59     1331           1315
## 10  2013     1     1      913            918        -5     1346           1416
## # ℹ 7,592 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
anti_join(airports, flights, by = c("faa" = "dest"))
## # A tibble: 1,357 × 8
##    faa   name                             lat    lon   alt    tz dst   tzone    
##    <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
## # ℹ 1,347 more rows
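
To see which destinations are behind the first result, one option is to count the unmatched flights by destination code. This is only a minimal sketch using dplyr's anti_join() and count(); the name missing_dests is introduced here for illustration and is not part of the original answer.

```r
library(nycflights13)
library(dplyr)

# Flights whose dest code has no matching faa code in airports,
# tallied by destination (missing_dests is an illustrative name)
missing_dests <- flights %>%
  anti_join(airports, by = c("dest" = "faa")) %>%
  count(dest, sort = TRUE)

missing_dests
```

Looking up the handful of codes this returns is a quick way to check whether the unmatched flights are really international or just absent from the airports table.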

Question Two: Find the day during 2013 that had the longest average total delay (arrival + departure).
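Using the weather data, can you shed any light on what happened on that day? Answer: According to the data, the day with the longest average total delay was March 8, 2013, with an average combined delay of about 170 minutes. The weather records for that day suggest that decreased visibility throughout the day contributed to the delays.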

## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 24 × 16
## # Groups:   year, month [1]
##     year month   day avg_arr_dep origin  hour  temp  dewp humid wind_dir
##    <int> <int> <int>       <dbl> <chr>  <int> <dbl> <dbl> <dbl>    <dbl>
##  1  2013     3     8        170. EWR        0  33.8  32    95.8      330
##  2  2013     3     8        170. EWR        1  33.1  32    95.8      330
##  3  2013     3     8        170. EWR        2  33.1  30.9  91.7      330
##  4  2013     3     8        170. EWR        3  33.1  30.9  91.7      340
##  5  2013     3     8        170. EWR        4  33.8  30.9  91.7      340
##  6  2013     3     8        170. EWR        5  32    30.9  95.7      340
##  7  2013     3     8        170. EWR        6  32    30.0  92.3      350
##  8  2013     3     8        170. EWR        7  32    30.0  92.3      330
##  9  2013     3     8        170. EWR        8  32    30.2 100        340
## 10  2013     3     8        170. EWR        9  32    30.2  93.0      340
## 11  2013     3     8        170. EWR       10  32    30.9  95.7      340
## 12  2013     3     8        170. EWR       11  32    30.9  95.7      340
## 13  2013     3     8        170. EWR       12  33.8  32    95.8      320
## 14  2013     3     8        170. EWR       13  34.0  32    93.0      340
## 15  2013     3     8        170. EWR       14  34.0  33.1  96.5      320
## 16  2013     3     8        170. EWR       15  37.0  33.8  93.1      350
## 17  2013     3     8        170. EWR       16  39.0  30.9  72.5      360
## 18  2013     3     8        170. EWR       17  39.9  33.1  76.3      360
## 19  2013     3     8        170. EWR       18  41    30.0  64.7      360
## 20  2013     3     8        170. EWR       19  41    30.9  67.1      330
## 21  2013     3     8        170. EWR       20  39.9  32    73.1      340
## 22  2013     3     8        170. EWR       21  39.9  30.9  70.0      340
## 23  2013     3     8        170. EWR       22  39.0  28.9  66.8      350
## 24  2013     3     8        170. EWR       23  37.9  28.9  69.7      340
## # ℹ 6 more variables: wind_speed <dbl>, wind_gust <dbl>, precip <dbl>,
## #   pressure <dbl>, visib <dbl>, time_hour <dttm>
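
Because the visib column is cut off in the printed output, a per-airport summary of that day's weather may make the visibility claim easier to check. The following is a sketch under the assumption that averaging the hourly records per origin is a reasonable summary; worst_day and worst_day_weather are illustrative names.

```r
library(nycflights13)
library(dplyr)

# Day with the largest mean combined delay (arrival + departure)
worst_day <- flights %>%
  group_by(year, month, day) %>%
  summarise(avg_arr_dep = mean(arr_delay + dep_delay, na.rm = TRUE),
            .groups = "drop") %>%
  slice_max(avg_arr_dep, n = 1)

# Hourly weather for that day only, summarised per origin airport
worst_day_weather <- weather %>%
  semi_join(worst_day, by = c("year", "month", "day")) %>%
  group_by(origin) %>%
  summarise(
    mean_visib   = mean(visib, na.rm = TRUE),   # visibility in miles
    min_visib    = min(visib, na.rm = TRUE),
    total_precip = sum(precip, na.rm = TRUE),   # precipitation in inches
    .groups = "drop"
  )

worst_day_weather
```

Comparing these values with the same summary for an ordinary day would show whether the visibility on that date was actually unusual.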

Question Three: Using planes and flights find the airplane models with the fastest average speeds.
Which planes are they? Why or why not? Answer: The model with the highest average speed is the Boeing 777-222, with an average speed of 482.6254 miles per hour (distance in miles divided by air time in minutes, multiplied by 60).

flights%>%
  left_join(planes, by = "tailnum")%>%
  mutate(speed = distance/air_time*60)%>%
  group_by(model, manufacturer)%>%
  summarize(flights_speed_avg = mean(speed, na.rm = TRUE))%>%
  arrange(-flights_speed_avg)%>%
  head()
## `summarise()` has grouped output by 'model'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 3
## # Groups:   model [6]
##   model     manufacturer     flights_speed_avg
##   <chr>     <chr>                        <dbl>
## 1 777-222   BOEING                        483.
## 2 A330-243  AIRBUS                        480.
## 3 767-424ER BOEING                        467.
## 4 A321-231  AIRBUS INDUSTRIE              460.
## 5 A330-223  AIRBUS INDUSTRIE              458.
## 6 757-212   BOEING                        456.
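
An average speed is easier to trust when it rests on many flights, so it may help to report the number of flights behind each model's mean. This is a hedged variation on the code above, not a replacement for it: it uses inner_join() to drop flights with no matching plane record, and the names speed_by_model and n_flights are illustrative.

```r
library(nycflights13)
library(dplyr)

speed_by_model <- flights %>%
  # Keep only flights whose tailnum has a record in planes
  inner_join(select(planes, tailnum, model, manufacturer), by = "tailnum") %>%
  mutate(speed = distance / air_time * 60) %>%   # miles per hour
  group_by(model, manufacturer) %>%
  summarise(
    n_flights = sum(!is.na(speed)),              # flights with a usable speed
    flights_speed_avg = mean(speed, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(flights_speed_avg))

head(speed_by_model)
```

A model that ranks near the top with only a few flights is probably reflecting the routes it happened to fly rather than the aircraft itself.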

Question Four: Solve problem 13.4.6.3 from the textbook (on the relationship between the age of a plane and the length of the delays). Answer: According to the data, there is a relationship between plane age and delay. The mean departure delay is highest for planes in the middle of the age range (the center of our graph) and lower for the newest and oldest planes. One possible explanation is that this is around the time when planes start to need part repairs, although the data do not show the cause.

library(nycflights13)
library(tidyverse)
data(flights)
data(planes)
data(airports)
data(weather)
view(flights)
anti_join(flights, airports, by = c("dest" = "faa"))
anti_join(airports, flights, by = c("faa" = "dest"))
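# Average total delay (arrival + departure) per day, worst day first,
# then attach the hourly weather records for each day
weather_delay <- flights%>%
  group_by(year, month, day)%>%
  summarize(avg_arr_dep = mean(arr_delay + dep_delay, na.rm = TRUE))%>%
  arrange(-avg_arr_dep)%>%
  left_join(weather, by= c("year", "month", "day"))%>%
  # Keep the first 24 joined rows (here, the hourly EWR records for the worst day)
  head(24)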
print(weather_delay, n=24)
flights%>%
  left_join(planes, by = "tailnum")%>%
  mutate(speed = distance/air_time*60)%>%
  group_by(model, manufacturer)%>%
  summarize(flights_speed_avg = mean(speed, na.rm = TRUE))%>%
  arrange(-flights_speed_avg)%>%
  head()
plane_cohorts <- inner_join(flights,
  select(planes, tailnum, plane_year = year),
  by = "tailnum"
) %>%
  mutate(age = year - plane_year) %>%
  filter(!is.na(age)) %>%
  mutate(age = if_else(age > 25, 25L, age)) %>%
  group_by(age) %>%
  summarise(
    dep_delay_mean = mean(dep_delay, na.rm = TRUE),
    arr_delay_mean = mean(arr_delay, na.rm = TRUE),
  )
print(plane_cohorts, n=20)
ggplot(plane_cohorts, aes(x=age, y=dep_delay_mean))+
  geom_point()
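
The plot above shows only departure delay, although plane_cohorts also holds the mean arrival delay. As a sketch, both series could be drawn together to check that they follow the same pattern across plane age. The reshaping step with pivot_longer() and the names delay_long and delay_plot are additions for illustration; pmin() is used here as an equivalent way to cap age at 25.

```r
library(nycflights13)
library(dplyr)
library(tidyr)
library(ggplot2)

# Mean delays by plane age, with ages above 25 grouped into 25
plane_cohorts <- flights %>%
  inner_join(select(planes, tailnum, plane_year = year), by = "tailnum") %>%
  mutate(age = year - plane_year) %>%
  filter(!is.na(age)) %>%
  mutate(age = pmin(age, 25L)) %>%
  group_by(age) %>%
  summarise(
    dep_delay_mean = mean(dep_delay, na.rm = TRUE),
    arr_delay_mean = mean(arr_delay, na.rm = TRUE),
    .groups = "drop"
  )

# One row per (age, delay type) so both series share a single plot
delay_long <- plane_cohorts %>%
  pivot_longer(c(dep_delay_mean, arr_delay_mean),
               names_to = "delay_type", values_to = "mean_delay")

delay_plot <- ggplot(delay_long,
                     aes(x = age, y = mean_delay, colour = delay_type)) +
  geom_point() +
  geom_line() +
  labs(x = "Plane age (years, capped at 25)", y = "Mean delay (minutes)")

delay_plot
```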