NYC Flights Homework

Author

Oliver Kronen

library(ggplot2)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ lubridate 1.9.5     ✔ tibble    3.3.1
✔ purrr     1.2.1     ✔ tidyr     1.3.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights23)
data(flights)
flights_no_na <- flights |>
  filter(!is.na(distance) & !is.na(arr_delay) & !is.na(dep_delay))  
flights2 <- left_join(flights_no_na, airlines, by = "carrier")
# Drop the "Inc." / "Co." suffixes, then trim the trailing space they leave behind
flights2$name <- trimws(gsub("Inc\\.|Co\\.", "", flights2$name))
by_carrier <- flights2 |>
  group_by(carrier) |>  
  summarise(count = n(),   
            avg_dist = mean(distance), 
            avg_arr_delay = mean(arr_delay),  
            avg_dep_delay = mean(dep_delay), 
            .groups = "drop") |>  
  arrange(avg_arr_delay) |>
  filter(avg_dist < 3000)
head(by_carrier)
# A tibble: 6 × 5
  carrier count avg_dist avg_arr_delay avg_dep_delay
  <chr>   <int>    <dbl>         <dbl>         <dbl>
1 G4        667     723.       -5.88            3.98
2 YX      85431     485.       -4.64            4.11
3 9E      52204     487.       -2.23            7.38
4 AS       7734    2481.        0.0844         11.8 
5 MQ        354     725.        0.119          10.5 
6 DL      60364    1278.        1.64           15.0 
p1 <- by_carrier |>
  ggplot(aes(x = carrier, y = avg_dep_delay, fill = carrier)) +
  # geom_col() draws one bar per row at the given y value; geom_histogram() is meant for binning raw values
  geom_col(alpha = 0.9, colour = "white") +
  scale_fill_discrete(name = "Airline Carrier (Abbr)", labels = c("Endeavor Air (9E)", "American Airlines (AA)", "Alaska Airlines (AS)", "JetBlue Airways (B6)", "Delta Air Lines (DL)", "Frontier Airlines (F9)", "Allegiant Air (G4)", "Envoy Air (MQ)", "Spirit Airlines (NK)", "SkyWest Airlines (OO)", "United Airlines (UA)", "Southwest Airlines (WN)", "Republic Airways (YX)")) +
  labs(x = "Airline Carrier", y = "Average Departure Delay in Minutes", title = "Average Departure Delay of Airline Carriers", caption = "Source: FAA Aircraft Registry")
p1

Please note that although the plotting code itself contains no dplyr commands, it depends on the dplyr steps in the earlier code (filter, left_join, group_by, summarise, and arrange), which produced the by_carrier table used for the graph. The visualization is a bar graph of the average departure delay for each airline carrier. The y axis shows the average departure delay in minutes, running from 0 to 30. The x axis shows the carrier abbreviations, and the legend matches each colour to a carrier's name and abbreviation. A caption at the bottom gives the source of the information.

The graph shows that Frontier Airlines has the largest average departure delay, while Allegiant Air and Republic Airways have the smallest, at about 4 minutes each.
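A comparison like this can also be checked in code rather than read off the bars. The sketch below is only an illustration: `printed_rows` is a hypothetical stand-in that holds the six rows shown in the `head(by_carrier)` output, not the full table, so it can only confirm the ordering among those six carriers.

```r
library(dplyr)

# Hypothetical stand-in: the six rows printed by head(by_carrier) above.
# The full by_carrier table contains more carriers than this.
printed_rows <- tibble(
  carrier = c("G4", "YX", "9E", "AS", "MQ", "DL"),
  avg_dep_delay = c(3.98, 4.11, 7.38, 11.8, 10.5, 15.0)
)

# Keep the two rows with the smallest average departure delay
lowest <- printed_rows |>
  slice_min(avg_dep_delay, n = 2)

lowest
```

Applied to the full `by_carrier` table, the same call would list the lowest-delay carriers, and `slice_max(avg_dep_delay, n = 1)` would return the carrier with the largest delay.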

One aspect I would like to highlight is the legend. The labels were typed in by hand, because I could not figure out how to pass the carrier names into the plot through code. To find out which abbreviation went with which carrier name, I used the left_join function to attach the airlines table to the flights data.
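One possible way to build the legend from the joined names instead of typing them is to paste the name and abbreviation into a new column and map `fill` to it. This is a minimal sketch under assumptions, not the code used for the graph above: `by_carrier_demo` and `airlines_demo` are hypothetical stand-ins holding three carriers, with delays copied from the `head(by_carrier)` output and names written as they appear in the legend.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for by_carrier: three rows from the head() output
by_carrier_demo <- tibble(
  carrier = c("G4", "YX", "9E"),
  avg_dep_delay = c(3.98, 4.11, 7.38)
)

# Hypothetical stand-in for the airlines lookup table,
# with names written as they appear in the legend
airlines_demo <- tibble(
  carrier = c("9E", "G4", "YX"),
  name = c("Endeavor Air", "Allegiant Air", "Republic Airways")
)

# Attach the names, then build "Name (Abbr)" labels for the legend
labelled <- by_carrier_demo |>
  left_join(airlines_demo, by = "carrier") |>
  mutate(label = paste0(name, " (", carrier, ")"))

# Mapping fill to the label column lets ggplot2 write the legend entries
p_demo <- ggplot(labelled, aes(x = carrier, y = avg_dep_delay, fill = label)) +
  geom_col(alpha = 0.9, colour = "white") +
  labs(fill = "Airline Carrier (Abbr)")

p_demo
```

Because the labels come from the join, a carrier that is added or filtered out changes the legend automatically. Note that the legend entries are then ordered alphabetically by label rather than by abbreviation.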