NYC Flights HW

Author

AlineMayrink

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights23)
library(RColorBrewer)
head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2023     1     1        1           2038       203      328              3
2  2023     1     1       18           2300        78      228            135
3  2023     1     1       31           2344        47      500            426
4  2023     1     1       33           2140       173      238           2352
5  2023     1     1       36           2048       228      223           2252
6  2023     1     1      503            500         3      808            815
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# Keep flights with recorded delays of at most 600 minutes (10 hours),
# then convert the categorical columns to factors.
clean_flights <- flights |>
  filter(!is.na(dep_delay), !is.na(arr_delay)) |>
  filter(dep_delay <= 600, arr_delay <= 600) |>
  mutate(
    carrier = as.factor(carrier),
    month = as.factor(month),
    day = as.factor(day),
    origin = as.factor(origin),
    dest = as.factor(dest)
  )
nrow(clean_flights)
[1] 422308
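To see how much data the cleaning step discards, it can help to count the rows removed by each filter. The following is a minimal sketch in base R on a small made-up data frame; the name `toy_flights` and its values are illustrative assumptions, not taken from `flights`.

```r
# Toy data standing in for `flights`; the values are invented for illustration.
toy_flights <- data.frame(
  dep_delay = c(5, NA, 720, 30, -3),
  arr_delay = c(10, NA, 700, NA, -8)
)

# Step 1: keep rows where both delays were recorded.
has_delays <- !is.na(toy_flights$dep_delay) & !is.na(toy_flights$arr_delay)
step1 <- toy_flights[has_delays, ]

# Step 2: keep rows where both delays are at most 600 minutes.
step2 <- step1[step1$dep_delay <= 600 & step1$arr_delay <= 600, ]

# Rows removed at each step.
removed <- c(
  missing_delay = nrow(toy_flights) - nrow(step1),
  over_600_min  = nrow(step1) - nrow(step2)
)
print(removed)
```

Running the same two steps on `flights` instead of `toy_flights` would show how many rows are lost to missing values and how many to the 600-minute cutoff.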
# Average departure and arrival delay (minutes) per carrier.
avg_delay_by_carrier <- clean_flights |>
  group_by(carrier) |>
  summarize(
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
    avg_arr_delay = mean(arr_delay, na.rm = TRUE)
  ) |>
  # Translate carrier codes into readable names.
  # Codes not listed here are left unchanged by recode().
  mutate(
    carrier_name = recode(
      carrier,
      AA = "American Airlines",
      AS = "Alaska Airlines",
      B6 = "JetBlue Airways",
      DL = "Delta Air Lines",
      F9 = "Frontier Airlines",
      G4 = "Allegiant Air",
      HA = "Hawaiian Airlines",
      MQ = "Envoy Air",
      NK = "Spirit Air Lines",
      OO = "SkyWest Airlines",
      UA = "United Airlines",
      WN = "Southwest Airlines",
      YX = "Republic Airline"
    )
  ) |>
  # Fix the legend order to the order in which carriers appear.
  mutate(carrier_name = factor(carrier_name, levels = unique(carrier_name)))
sum(is.na(avg_delay_by_carrier$carrier_name))
[1] 0
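A zero here does not by itself confirm that every carrier code was translated: `recode()` leaves values it has no replacement for unchanged rather than turning them into `NA`. One alternative, sketched below in base R, is a named-vector lookup, where unmapped codes do come back as `NA` and can be listed. The two-entry `carrier_lookup` and the `codes` vector are hypothetical, and "9E" is an arbitrary code deliberately left out of the lookup.

```r
# Hypothetical, deliberately incomplete lookup table (code -> name).
carrier_lookup <- c(AA = "American Airlines", UA = "United Airlines")

# Made-up carrier codes; "9E" has no entry in the lookup.
codes <- c("AA", "UA", "9E")

# Indexing a named vector by code returns NA for codes without an entry.
carrier_name <- unname(carrier_lookup[codes])

# List the codes that still need a name.
unmapped <- unique(codes[is.na(carrier_name)])
print(unmapped)
```

With this approach the `NA` count becomes a meaningful check, because an unmapped code can no longer pass through silently.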
# One color per carrier, interpolated from the 12-color Set3 palette.
num_carriers <- length(unique(avg_delay_by_carrier$carrier_name))
carrier_colors <- colorRampPalette(brewer.pal(12, "Set3"))(num_carriers)
ggplot(avg_delay_by_carrier, aes(x = avg_dep_delay, y = avg_arr_delay, color = carrier_name)) +
  geom_point(size = 5, shape = 16, alpha = 0.7) +  # Larger, semi-transparent points
  labs(
    title = "Average Departure vs Arrival Delays by Air Carrier",
    x = "Average Departure Delay (minutes)",
    y = "Average Arrival Delay (minutes)",
    caption = "Source: nycflights23 package (Bureau of Transportation Statistics)",
    color = "Carrier Name"  # Legend title
  ) +
  scale_color_manual(values = carrier_colors) + 
  theme_minimal() +
  theme(
    legend.title = element_text(size = 12),  
    legend.text = element_text(size = 10),   
    axis.title = element_text(size = 12),    
    plot.title = element_text(size = 16, face = "bold"),  
    plot.caption = element_text(size = 10, color = "gray")  
  )

The scatter plot compares average departure delay with average arrival delay for each air carrier. Each point is one airline; the x-axis shows its average departure delay and the y-axis its average arrival delay, both in minutes. Airlines with higher average departure delays also tend to have higher average arrival delays, which points to a positive relationship between late takeoffs and late landings. With only one point per carrier, the plot shows this trend visually but does not by itself measure how strong the relationship is.
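The relationship between the two averages can be quantified rather than read off the plot. The following is a minimal sketch using base R's `cor()` on a made-up table that borrows the column names of `avg_delay_by_carrier`; the carrier labels and numbers in `toy_avg` are invented for illustration.

```r
# Toy carrier-level averages; the values are invented for illustration.
toy_avg <- data.frame(
  carrier = c("A", "B", "C", "D", "E"),
  avg_dep_delay = c(8, 12, 15, 21, 30),
  avg_arr_delay = c(2, 7, 6, 14, 25)
)

# Pearson correlation (the default method of cor()).
r <- cor(toy_avg$avg_dep_delay, toy_avg$avg_arr_delay)
n <- nrow(toy_avg)

cat("Pearson r =", round(r, 2), "across", n, "carriers\n")
```

Applied to `avg_delay_by_carrier`, the coefficient would rest on only about a dozen points, so it describes carriers, not individual flights; a flight-level correlation on `clean_flights` is a separate question.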

The colors come from the RColorBrewer package: the 12-color Set3 palette is interpolated with colorRampPalette() so that every carrier gets its own color. The color-coding makes it easier to compare airlines, although interpolated pastel shades can be hard to tell apart, so neighboring legend entries deserve a careful look. Carriers whose average delays sit far from the rest stand out as outliers, which makes it easy to spot airlines that struggle with on-time performance.
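Whether an interpolated palette really yields one color per carrier can be checked by inspecting the generated vector. The sketch below uses only base R's grDevices. The value 13 matches the number of carriers named in the recode list, the three `anchors` are pastel hex colors chosen for illustration, and `hcl.colors()` with the "Dark 3" palette is shown as one possible alternative, not what the plot uses.

```r
n_carriers <- 13

# Interpolate between a few pastel anchor colors (illustrative values).
anchors <- c("#8DD3C7", "#FFFFB3", "#BEBADA")
ramp_colors <- colorRampPalette(anchors)(n_carriers)

# Alternative: a qualitative HCL palette that ships with base R (>= 3.6).
hcl_alt <- hcl.colors(n_carriers, palette = "Dark 3")

# Check that the ramp produced one distinct hex code per carrier.
all_distinct <- anyDuplicated(ramp_colors) == 0
print(all_distinct)
```

Distinct hex codes are a necessary check, not a sufficient one: two neighboring shades can differ as strings and still look alike in a legend.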

Overall, the plot gives a quick comparison of punctuality across carriers, which could be useful to passengers, aviation analysts, and policymakers evaluating airline reliability. Further analysis could examine how external factors such as weather, airport congestion, or time of day affect these delays. Identifying seasonal patterns in delay frequency could also help airlines adjust scheduling and reduce disruptions.
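As a starting point for the seasonal analysis suggested above, delays can be averaged by month. The following is a minimal sketch in base R on a made-up data frame; `toy_delays` and its values are illustrative assumptions that reuse the `month` and `dep_delay` column names from `flights`.

```r
# Toy flight-level data; the values are invented for illustration.
toy_delays <- data.frame(
  month = c(1, 1, 2, 2, 7, 7),
  dep_delay = c(10, 20, 0, 4, 40, 60)
)

# Mean departure delay per month.
monthly <- aggregate(dep_delay ~ month, data = toy_delays, FUN = mean)
names(monthly)[2] <- "avg_dep_delay"

print(monthly)
```

The same call on `clean_flights` would give one average per calendar month, which could then be plotted to look for seasonal peaks.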