I, Maria Dermit, hereby state that I have not communicated with or gained information in any way from my classmates or anyone other than the Professor or TA during this exam, and that all work is my own.
library(tidyverse)
library(nycflights13)
library(glue)
Question 1 (5 points) - What are the ten most common destinations for flights from NYC airports in 2013? Make a table that lists these in descending order of frequency and shows the number of flights heading to each airport.
flights %>%
count(dest, name = "number_flights_dest", sort = TRUE) %>%
slice_head(n = 10)
## # A tibble: 10 x 2
## dest number_flights_dest
## <chr> <int>
## 1 ORD 17283
## 2 ATL 17215
## 3 LAX 16174
## 4 BOS 15508
## 5 MCO 14082
## 6 CLT 14064
## 7 SFO 13331
## 8 FLL 12055
## 9 MIA 11728
## 10 DCA 9705
Which airlines have the most flights departing from NYC airports in 2013? Make a table that lists these in descending order of frequency and shows the number of flights for each airline. In your narrative mention the names of the airlines as well. Hint: You can use the airlines dataset to look up the airline name based on carrier code.
inner_join(flights,airlines) %>%
count(name, name="number_flights_airline", sort = TRUE)
## Joining, by = "carrier"
## # A tibble: 16 x 2
## name number_flights_airline
## <chr> <int>
## 1 United Air Lines Inc. 58665
## 2 JetBlue Airways 54635
## 3 ExpressJet Airlines Inc. 54173
## 4 Delta Air Lines Inc. 48110
## 5 American Airlines Inc. 32729
## 6 Envoy Air 26397
## 7 US Airways Inc. 20536
## 8 Endeavor Air Inc. 18460
## 9 Southwest Airlines Co. 12275
## 10 Virgin America 5162
## 11 AirTran Airways Corporation 3260
## 12 Alaska Airlines Inc. 714
## 13 Frontier Airlines Inc. 685
## 14 Mesa Airlines Inc. 601
## 15 Hawaiian Airlines Inc. 342
## 16 SkyWest Airlines Inc. 32
Consider only flights that have non-missing arrival delay information. Your answer should include the name of the carrier in addition to the carrier code and the values asked. a. Which carrier had the highest mean arrival delay? b. Which carrier had the lowest mean arrival delay?
# first check whether arr_delay actually contains NA values
# flights %>% arrange(desc(is.na(arr_delay)))
flights %>%
group_by(carrier) %>%
summarise(delay = mean(arr_delay, na.rm = TRUE)) %>%
inner_join(airlines) %>% arrange(desc(delay)) %>%
slice_head(n = 1)
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "carrier"
## # A tibble: 1 x 3
## carrier delay name
## <chr> <dbl> <chr>
## 1 F9 21.9 Frontier Airlines Inc.
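The question also asks for the carrier with the lowest mean arrival delay, which the output above does not show. A minimal sketch of how both extremes could be returned in one pass is given below. The `toy_flights` and `toy_airlines` tibbles and all their values are made up for illustration and are not taken from nycflights13.

```r
library(dplyr)

# Hypothetical toy data standing in for flights and airlines
toy_flights <- tibble(
  carrier   = c("AA", "AA", "BB", "BB", "CC", "CC"),
  arr_delay = c(10, 20, -5, NA, 40, 50)
)
toy_airlines <- tibble(
  carrier = c("AA", "BB", "CC"),
  name    = c("Airline A", "Airline B", "Airline C")
)

delay_extremes <- toy_flights %>%
  filter(!is.na(arr_delay)) %>%                 # keep flights with a recorded arrival delay
  group_by(carrier) %>%
  summarise(delay = mean(arr_delay), .groups = "drop") %>%
  inner_join(toy_airlines, by = "carrier") %>%  # attach the carrier name
  filter(delay == max(delay) | delay == min(delay)) %>%  # keep highest and lowest
  arrange(desc(delay))

delay_extremes
```

The first row is the carrier with the highest mean delay and the last row is the one with the lowest. Applied to the real tables, `toy_flights` and `toy_airlines` would be replaced by `flights` and `airlines`.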
What was the mean temperature at the origin airport on the day with the highest departure delay? Your answer should include the name of origin airport, the date with the highest departure delay, and the mean temperature on that day.
worst_day <-flights %>%
group_by(year,month,day) %>%
summarise(max_dep_delay=max(dep_delay,na.rm=TRUE)) %>%
arrange(desc(max_dep_delay)) %>%
ungroup() %>%
slice_head(n = 1)
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
# semi_join() keeps only the rows of x (weather) that have a match in y (worst_day)
weather_most_delayed <-
semi_join(weather, worst_day,
by = c("year", "month", "day" )) %>%
group_by(origin) %>%
summarise(mean_temp=mean(temp, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
weather_most_delayed
## # A tibble: 3 x 2
## origin mean_temp
## <chr> <dbl>
## 1 EWR 42.1
## 2 JFK 42.7
## 3 LGA 44.9
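The table above reports the mean temperature at all three NYC airports because the join uses only the date. Since the question asks about the origin airport of the most delayed flight, one possible refinement is to keep `origin` with the worst flight and include it in the join keys. The sketch below uses made-up `toy_flights` and `toy_weather` tibbles (hypothetical values, not nycflights13 data).

```r
library(dplyr)

# Hypothetical toy data standing in for flights and weather
toy_flights <- tibble(
  year = 2013, month = 1, day = c(8, 9, 9),
  origin    = c("EWR", "JFK", "LGA"),
  dep_delay = c(30, 600, NA)
)
toy_weather <- tibble(
  year = 2013, month = 1,
  day    = c(9, 9, 9, 9, 8),
  origin = c("JFK", "JFK", "LGA", "LGA", "JFK"),
  temp   = c(40, 44, 50, 52, 30)
)

# The single flight with the highest departure delay, keeping its origin
worst_flight <- toy_flights %>%
  filter(!is.na(dep_delay)) %>%
  slice_max(dep_delay, n = 1, with_ties = FALSE) %>%
  select(year, month, day, origin)

# Weather rows for that date AND that airport only
worst_weather <- toy_weather %>%
  semi_join(worst_flight, by = c("year", "month", "day", "origin")) %>%
  group_by(year, month, day, origin) %>%
  summarise(mean_temp = mean(temp, na.rm = TRUE), .groups = "drop")

worst_weather
```

With `origin` in the join keys, the result is a single row giving the date, the airport, and the mean temperature together.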
Consider breaking the day into four time intervals: 12:01am-6am, 6:01am-12pm, 12:01pm-6pm, 6:01pm-12am. a. Calculate the proportion of flights that are delayed at departure at each of these time intervals. b. Comment on how the likelihood of being delayed change throughout the day?
I define a flight as delayed if it departs more than 2 hours late.
#a
flights_intervals <- flights %>%
mutate(
interval= case_when(
dep_time <= 600 & dep_time >= 1 ~ "first_interval",
dep_time >= 601 & dep_time <= 1200 ~ "second_interval",
dep_time >= 1201 & dep_time <= 1800 ~ "third_interval",
dep_time >= 1801 & dep_time <= 2400 ~ "fourth_interval")
%>% factor(levels=c('first_interval', 'second_interval', 'third_interval',"fourth_interval")))
#checking that I did this correctly
#flights_intervals %>% count(interval)
#flights %>% ggplot(aes(dep_time))+geom_histogram()
flights_intervals %>%
group_by(interval) %>%
count(delayed_more_2h=dep_delay > 120) %>%
add_count(interval, wt = n,
name = "flights_interval") %>%
mutate(prop = n/flights_interval*100) %>%
drop_na()
## # A tibble: 8 x 5
## # Groups: interval [4]
## interval delayed_more_2h n flights_interval prop
## <fct> <lgl> <int> <int> <dbl>
## 1 first_interval FALSE 8672 9344 92.8
## 2 first_interval TRUE 672 9344 7.19
## 3 second_interval FALSE 121395 122082 99.4
## 4 second_interval TRUE 687 122082 0.563
## 5 third_interval FALSE 118562 120738 98.2
## 6 third_interval TRUE 2176 120738 1.80
## 7 fourth_interval FALSE 70169 76357 91.9
## 8 fourth_interval TRUE 6188 76357 8.10
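The same proportions can be obtained more compactly, because the mean of a logical vector is the share of `TRUE` values. The sketch below is one possible alternative, not the approach used above: it bins `dep_time` with `cut()` and summarises once per interval. The `toy` tibble and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data: departure time (HHMM) and departure delay (minutes)
toy <- tibble(
  dep_time  = c(30, 559, 700, 1130, 1300, 1759, 1900, 2359),
  dep_delay = c(150, 10, 0, 5, 130, 20, 200, 121)
)

toy_props <- toy %>%
  mutate(interval = cut(
    dep_time,
    breaks = c(0, 600, 1200, 1800, 2400),  # right-closed bins: (0,600], (600,1200], ...
    labels = c("first_interval", "second_interval",
               "third_interval", "fourth_interval")
  )) %>%
  group_by(interval) %>%
  summarise(
    n_flights    = n(),
    prop_delayed = mean(dep_delay > 120) * 100,  # percent delayed by more than 2 hours
    .groups = "drop"
  )

toy_props
```

This yields one row per interval rather than a `TRUE`/`FALSE` pair, which makes the intervals easier to compare.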
# b: how delays vary over the day
flights_delayed <- flights %>%
# group flight cancellation and flight delay into one level
mutate(delay = ifelse(dep_delay >= 120 | is.na(dep_delay), 1, 0))
# only about 5% of flights are delayed by 2 hours or more, or cancelled
flights_delayed %>% count(delay) %>% mutate(prop=n/sum(n))
## # A tibble: 2 x 3
## delay n prop
## <dbl> <int> <dbl>
## 1 0 318633 0.946
## 2 1 18143 0.0539
flights_delayed %>%
filter(delay == 1) %>%
group_by(hour) %>% summarize(n_delays = n()) %>%
ggplot(aes(x= hour, y = n_delays)) +
geom_point() +
geom_line(col = "blue")
## `summarise()` ungrouping output (override with `.groups` argument)
By scheduled departure hour, the number of flights delayed by 2 hours or more (or cancelled) peaks around 7 pm.
Find the flight with the longest air time. a. How long is this flight? b. What city did it fly to? c. How many seats does the plane that flew this flight have?
# Take the single longest flight (ranked by distance) once and reuse it
longest_flight <- flights %>%
arrange(desc(distance)) %>%
slice_head(n = 1)
longest_distance <- longest_flight %>% pull(distance)
longest_destination <- longest_flight %>% pull(dest)
longest_origin <- longest_flight %>% pull(origin)
# Look up the seat count of that plane (tail number N380HA) in planes
longest_seats <- planes %>% filter(tailnum == "N380HA") %>% pull(seats)
glue('The longest flight is {longest_origin} to {longest_destination}, which is {longest_distance} miles and has {longest_seats} seats.')
## The longest flight is JFK to HNL, which is 4983 miles and has 377 seats.
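The code above ranks flights by `distance`, while the question asks about air time; the two need not pick the same flight. A minimal sketch of ranking on `air_time` instead and looking up the seat count from the selected row's tail number follows. The `toy_flights` and `toy_planes` tibbles are hypothetical, and their values are invented so that the longest-distance and longest-air-time flights differ.

```r
library(dplyr)

# Hypothetical toy data standing in for flights and planes
toy_flights <- tibble(
  tailnum  = c("N1", "N2", "N3"),
  origin   = c("JFK", "EWR", "LGA"),
  dest     = c("HNL", "HNL", "ORD"),
  air_time = c(600, 650, NA),      # minutes in the air
  distance = c(5000, 4900, 700)    # miles
)
toy_planes <- tibble(
  tailnum = c("N1", "N2"),
  seats   = c(200, 250)
)

longest_by_air_time <- toy_flights %>%
  filter(!is.na(air_time)) %>%                      # drop flights with no recorded air time
  slice_max(air_time, n = 1, with_ties = FALSE) %>%
  left_join(toy_planes, by = "tailnum")             # seats come from the matched plane

longest_by_air_time
```

Taking the tail number from the selected row avoids hard-coding it, so the seat lookup stays correct if the ranking variable changes.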
Question 7 (15 pts) - The airports data frame contains information on a large number of primarily American airports. These data include location information for these airports in the form of latitude and longitude coordinates. In this question we limit our focus to the Contiguous United States. Visualize and describe the distribution of the longitudes of airports in the Contiguous United States. What does this tell you about the geographical distribution of these airports? Hint: You will first need to limit your analysis to the Contiguous United States. This Wikipedia article can help, but you’re welcomed to use other resources as well. Make sure to cite whatever resource you use.
I use the tzone information in the airports data and this Wikipedia article on time zone names.
flights_latlon <- flights %>%
inner_join(select(airports, origin = faa, origin_lat = lat, origin_lon = lon),
by = "origin"
) %>%
right_join(select(airports, dest = faa, dest_lat = lat, dest_lon = lon,tzone),
by = "dest"
)
# Checking that I am looking within the Contiguous United States
flights_latlon %>%
filter(!tzone %in% c("Pacific/Honolulu", "America/Anchorage")) %>%
ggplot(aes(
x = origin_lon, xend = dest_lon,
y = origin_lat, yend = dest_lat
)) +
borders("state") +
geom_segment(arrow = arrow(length = unit(0.1, "cm"))) +
coord_quickmap() +
labs(y = "Latitude", x = "Longitude")
## Warning: Removed 1102 rows containing missing values (geom_segment).
flights_latlon %>%
filter(!tzone %in% c("Pacific/Honolulu", "America/Anchorage")) %>%
select(dest_lon) %>%
ggplot()+
geom_freqpoly(aes(dest_lon), binwidth=1)
The frequency polygon of airport longitudes in the Contiguous United States shows that the central states have fewer airports than the coasts.
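Because `flights_latlon` has one row per flight, the frequency polygon above weights each destination by its number of flights. To describe the distribution of airports themselves, one option is to work from the airports table directly, one row per airport. The sketch below uses a made-up `toy_airports` tibble (hypothetical codes and coordinates) and the same two-time-zone filter; the real table may contain other non-contiguous or missing `tzone` values, which this sketch does not handle.

```r
library(dplyr)

# Hypothetical toy data standing in for airports
toy_airports <- tibble(
  faa   = c("AAA", "BBB", "CCC", "DDD", "EEE"),
  lon   = c(-74.2, -87.9, -118.4, -157.9, -149.9),
  tzone = c("America/New_York", "America/Chicago", "America/Los_Angeles",
            "Pacific/Honolulu", "America/Anchorage")
)

# One row per airport, excluding the Hawaii and Alaska time zones
contiguous <- toy_airports %>%
  filter(!tzone %in% c("Pacific/Honolulu", "America/Anchorage"))

# Count airports in 10-degree longitude bins
lon_counts <- contiguous %>%
  mutate(lon_bin = floor(lon / 10) * 10) %>%
  count(lon_bin)

lon_counts
```

The filtered table could equally be passed to `ggplot(aes(lon)) + geom_freqpoly(binwidth = 1)` to draw the distribution instead of tabulating it.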
Recreate the plot included below using the flights data. Once you have created the visualization, in no more than one paragraph, describe what you think the point of this visualization might be. Hint: The visualization uses the variable arrival, which is not included in the flights data frame. You will have to create arrival yourself, it is a categorical variable that is equal to “ontime” when arr_delay <= 0 and “delayed” when arr_delay > 0.
flights %>%
filter(dest %in% c("PHL","RDU")) %>%
filter(month==12) %>%
mutate(
Arrival= case_when(
arr_delay <= 0 ~ "ontime",
arr_delay > 0 ~ "delayed")) %>%
filter(!is.na(Arrival)) %>%
ggplot(aes(x=Arrival, y= dep_delay, color=dest))+
geom_boxplot()+
facet_grid(dest~origin)+
labs(title="On time performance of NYC flights",
subtitle="December 2013",
y="Departure delay",
color = "Destination")+
coord_cartesian( ylim = c(0,200))