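Question One: Solve problem 13.5.1.5 from the textbook. Answer: The first line of code (25) returns the flights whose destination code does not appear in the airports dataset, that is, flights to destinations that are missing from the FAA airport list. The second line of code (26) returns the airports in the FAA list that were never the destination of any flight in the flights dataset. Because flights only covers departures from the New York City airports in 2013, these are airports that received no direct flight from New York City that year. Neither result identifies flights that leave the country.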
anti_join(flights, airports, by = c("dest" = "faa"))
## # A tibble: 7,602 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 544 545 -1 1004 1022
## 2 2013 1 1 615 615 0 1039 1100
## 3 2013 1 1 628 630 -2 1137 1140
## 4 2013 1 1 701 700 1 1123 1154
## 5 2013 1 1 711 715 -4 1151 1206
## 6 2013 1 1 820 820 0 1254 1310
## 7 2013 1 1 820 820 0 1249 1329
## 8 2013 1 1 840 845 -5 1311 1350
## 9 2013 1 1 909 810 59 1331 1315
## 10 2013 1 1 913 918 -5 1346 1416
## # ℹ 7,592 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
anti_join(airports, flights, by = c("faa" = "dest"))
## # A tibble: 1,357 × 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
## 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
## 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
## # ℹ 1,347 more rows
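To make the direction of each anti_join() concrete, here is a minimal sketch on toy data. The tibbles toy_flights and toy_airports and their contents are hypothetical stand-ins, not rows from nycflights13; only anti_join() itself is the real dplyr function.

```r
library(dplyr)

# Hypothetical toy tables standing in for flights and airports
toy_flights <- tibble(
  flight = c(1, 2, 3, 4),
  dest   = c("ORD", "SJU", "ORD", "BQN")
)
toy_airports <- tibble(
  faa  = c("ORD", "LAX", "04G"),
  name = c("Chicago O'Hare", "Los Angeles Intl", "Lansdowne Airport")
)

# Rows of toy_flights whose dest has no match in toy_airports$faa
unmatched_flights <- anti_join(toy_flights, toy_airports, by = c("dest" = "faa"))

# Rows of toy_airports whose faa never appears as a dest in toy_flights
unused_airports <- anti_join(toy_airports, toy_flights, by = c("faa" = "dest"))

print(unmatched_flights)
print(unused_airports)
```

The first result keeps the flights to SJU and BQN, and the second keeps the LAX and 04G airports, which mirrors the two calls in the answer: the table given first decides which rows can be returned.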
Question Two: Find the day during 2013 that had the longest average
total delay (arrival + departure). Using the weather data, can you shed
any light on what happened on that day?
Answer: According to the data, the day with the longest average total
delay was March 8, 2013, with an average combined arrival and departure
delay of about 170 minutes. The weather data for that day points to
decreased visibility throughout the day as a likely cause.
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 24 × 16
## # Groups: year, month [1]
## year month day avg_arr_dep origin hour temp dewp humid wind_dir
## <int> <int> <int> <dbl> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2013 3 8 170. EWR 0 33.8 32 95.8 330
## 2 2013 3 8 170. EWR 1 33.1 32 95.8 330
## 3 2013 3 8 170. EWR 2 33.1 30.9 91.7 330
## 4 2013 3 8 170. EWR 3 33.1 30.9 91.7 340
## 5 2013 3 8 170. EWR 4 33.8 30.9 91.7 340
## 6 2013 3 8 170. EWR 5 32 30.9 95.7 340
## 7 2013 3 8 170. EWR 6 32 30.0 92.3 350
## 8 2013 3 8 170. EWR 7 32 30.0 92.3 330
## 9 2013 3 8 170. EWR 8 32 30.2 100 340
## 10 2013 3 8 170. EWR 9 32 30.2 93.0 340
## 11 2013 3 8 170. EWR 10 32 30.9 95.7 340
## 12 2013 3 8 170. EWR 11 32 30.9 95.7 340
## 13 2013 3 8 170. EWR 12 33.8 32 95.8 320
## 14 2013 3 8 170. EWR 13 34.0 32 93.0 340
## 15 2013 3 8 170. EWR 14 34.0 33.1 96.5 320
## 16 2013 3 8 170. EWR 15 37.0 33.8 93.1 350
## 17 2013 3 8 170. EWR 16 39.0 30.9 72.5 360
## 18 2013 3 8 170. EWR 17 39.9 33.1 76.3 360
## 19 2013 3 8 170. EWR 18 41 30.0 64.7 360
## 20 2013 3 8 170. EWR 19 41 30.9 67.1 330
## 21 2013 3 8 170. EWR 20 39.9 32 73.1 340
## 22 2013 3 8 170. EWR 21 39.9 30.9 70.0 340
## 23 2013 3 8 170. EWR 22 39.0 28.9 66.8 350
## 24 2013 3 8 170. EWR 23 37.9 28.9 69.7 340
## # ℹ 6 more variables: wind_speed <dbl>, wind_gust <dbl>, precip <dbl>,
## # pressure <dbl>, visib <dbl>, time_hour <dttm>
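The printed table only shows the hourly EWR rows, and the visib column is cut off in the output. One way to check the weather explanation more directly is to summarise the weather for the worst day at each origin airport. This is a sketch only: the names worst_day and worst_day_weather are introduced here for illustration, and it assumes nycflights13 and dplyr 1.0 or later (for slice_max()) are installed.

```r
library(dplyr)
library(nycflights13)

# Day with the largest mean of (arrival delay + departure delay)
worst_day <- flights %>%
  group_by(year, month, day) %>%
  summarise(
    avg_arr_dep = mean(arr_delay + dep_delay, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  slice_max(avg_arr_dep, n = 1)

# Weather on that day, summarised per origin airport
worst_day_weather <- weather %>%
  semi_join(worst_day, by = c("year", "month", "day")) %>%
  group_by(origin) %>%
  summarise(
    mean_visib      = mean(visib, na.rm = TRUE),
    min_visib       = min(visib, na.rm = TRUE),
    total_precip    = sum(precip, na.rm = TRUE),
    mean_wind_speed = mean(wind_speed, na.rm = TRUE)
  )

print(worst_day)
print(worst_day_weather)
```

Comparing mean_visib and total_precip here against the same summary for an ordinary day would show whether visibility on that day was actually unusual.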
Question Three: Using planes and flights find the airplane models
with the fastest average speeds.
Which planes are they? Why or why not? Answer: The model with the
highest average speed is the Boeing 777-222, with an average speed of
about 482.6 miles per hour (distance in miles divided by air time
converted from minutes to hours).
flights %>%
  left_join(planes, by = "tailnum") %>%
  mutate(speed = distance / air_time * 60) %>%
  group_by(model, manufacturer) %>%
  summarize(flights_speed_avg = mean(speed, na.rm = TRUE)) %>%
  arrange(-flights_speed_avg) %>%
  head()
## `summarise()` has grouped output by 'model'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 3
## # Groups: model [6]
## model manufacturer flights_speed_avg
## <chr> <chr> <dbl>
## 1 777-222 BOEING 483.
## 2 A330-243 AIRBUS 480.
## 3 767-424ER BOEING 467.
## 4 A321-231 AIRBUS INDUSTRIE 460.
## 5 A330-223 AIRBUS INDUSTRIE 458.
## 6 757-212 BOEING 456.
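An average speed is easier to trust when it rests on many flights, so a variant of the pipeline above can report how many flights contribute to each model's mean. This is a sketch under assumptions: the names speed_by_model, speed_mph, n_flights, and avg_speed_mph are introduced here for illustration, and inner_join() is swapped in for left_join() so that flights without a matching plane record are dropped instead of forming an NA group.

```r
library(dplyr)
library(nycflights13)

speed_by_model <- flights %>%
  inner_join(planes, by = "tailnum") %>%
  # distance is in miles and air_time in minutes, so * 60 gives miles per hour
  mutate(speed_mph = distance / air_time * 60) %>%
  group_by(manufacturer, model) %>%
  summarise(
    n_flights     = sum(!is.na(speed_mph)),  # flights with a usable speed
    avg_speed_mph = mean(speed_mph, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_speed_mph))

head(speed_by_model)
```

Models near the top with a very small n_flights are averages over only a few trips, so their ranking says more about the routes flown than about the aircraft.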
Question Four: Solve problem 13.4.6.3 from the textbook (on the relationship between the age of a plane and the length of the delays).
Answer: According to the data, there is a relationship between plane age and delay. The average delay is highest for middle-aged planes, near the center of our graph. One possible explanation is that this is around the age when planes start to need part repairs, although the data alone cannot confirm this.
library(nycflights13)
library(tidyverse)
data(flights)
data(planes)
data(airports)
data(weather)
view(flights)
anti_join(flights, airports, by = c("dest" = "faa"))
anti_join(airports, flights, by = c("faa" = "dest"))
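# Average total delay (arrival + departure) per day, worst day first,
# joined to the hourly weather records for each day
weather_delay <- flights %>%
  group_by(year, month, day) %>%
  summarize(avg_arr_dep = mean(arr_delay + dep_delay, na.rm = TRUE)) %>%
  arrange(-avg_arr_dep) %>%
  left_join(weather, by = c("year", "month", "day")) %>%
  # keep the first 24 joined rows (hourly EWR weather for the worst day)
  head(24)
print(weather_delay, n = 24)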
# Average speed in miles per hour for each plane model
flights %>%
  left_join(planes, by = "tailnum") %>%
  mutate(speed = distance / air_time * 60) %>%
  group_by(model, manufacturer) %>%
  summarize(flights_speed_avg = mean(speed, na.rm = TRUE)) %>%
  arrange(-flights_speed_avg) %>%
  head()
# Mean departure and arrival delay by plane age (in years at flight time)
plane_cohorts <- inner_join(flights,
  select(planes, tailnum, plane_year = year),
  by = "tailnum"
) %>%
  mutate(age = year - plane_year) %>%
  filter(!is.na(age)) %>%
  # pool planes older than 25 years into a single age group
  mutate(age = if_else(age > 25, 25L, age)) %>%
  group_by(age) %>%
  summarise(
    dep_delay_mean = mean(dep_delay, na.rm = TRUE),
    arr_delay_mean = mean(arr_delay, na.rm = TRUE)
  )
print(plane_cohorts, n = 20)
ggplot(plane_cohorts, aes(x = age, y = dep_delay_mean)) +
  geom_point()
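The plot for Question Four shows only the mean departure delay. A possible extension, sketched below, draws departure and arrival delay on the same axes so the age trend can be compared across both measures. The names plane_cohorts_long, delay_type, mean_delay, and age_delay_plot are introduced here for illustration, and pmin() is used as an equivalent way to cap age at 25; it assumes tidyr and ggplot2 are available alongside dplyr and nycflights13.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(nycflights13)

plane_cohorts_long <- flights %>%
  inner_join(select(planes, tailnum, plane_year = year), by = "tailnum") %>%
  # age in years at flight time, with planes older than 25 pooled at 25
  mutate(age = pmin(year - plane_year, 25L)) %>%
  filter(!is.na(age)) %>%
  group_by(age) %>%
  summarise(
    dep_delay_mean = mean(dep_delay, na.rm = TRUE),
    arr_delay_mean = mean(arr_delay, na.rm = TRUE)
  ) %>%
  # one row per (age, delay type) so both series share one plot
  pivot_longer(
    c(dep_delay_mean, arr_delay_mean),
    names_to = "delay_type",
    values_to = "mean_delay"
  )

age_delay_plot <- ggplot(
  plane_cohorts_long,
  aes(x = age, y = mean_delay, colour = delay_type)
) +
  geom_point() +
  geom_line() +
  labs(x = "Plane age (years)", y = "Mean delay (minutes)", colour = "Delay type")

print(age_delay_plot)
```

If both series rise and fall at similar ages, the pattern is less likely to be an artifact of one delay measure.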