NYC Flights Visualization

Author

Zaid Hageman

Importing and looking at all the necessary data

First, we load the tidyverse, which provides dplyr for summarizing the data and ggplot2 for plotting it.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Next, we load nycflights23, the package that contains the flights data set.

library(nycflights23)

Now we can look at the first few rows with head() to see the data set's structure.

head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2023     1     1        1           2038       203      328              3
2  2023     1     1       18           2300        78      228            135
3  2023     1     1       31           2344        47      500            426
4  2023     1     1       33           2140       173      238           2352
5  2023     1     1       36           2048       228      223           2252
6  2023     1     1      503            500         3      808            815
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
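
Because head() truncates the printout to the columns that fit on screen, it can help to look at every column at once and to check how many flights have no recorded arrival delay before averaging. The following is a minimal sketch of that check, not part of the original analysis; the name missing_summary is only illustrative.

```r
library(dplyr)
library(nycflights23)

# glimpse() lists every column with its type, one column per line
glimpse(flights)

# Count all flights and those with no recorded arrival delay
# (missing_summary is a hypothetical name used for illustration)
missing_summary <- flights |>
  summarize(
    n_flights = n(),
    n_missing_arr_delay = sum(is.na(arr_delay))
  )

missing_summary
```

Flights counted in n_missing_arr_delay are the ones that na.rm = TRUE drops when the average delay is computed.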

How to represent average flight delay per carrier

First, we group the flights by carrier and compute each carrier's average arrival delay in minutes. Flights with a missing arr_delay are dropped from the average with na.rm = TRUE. The result is one row per carrier, which we can then plot to compare carriers.

avg_arrival_delay <- flights |>
  group_by(carrier) |>
  summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE))
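
An average is easier to interpret when we also know how many flights it is based on. As a hedged extension of the summary above, the sketch below adds flight counts per carrier and sorts the carriers from the highest to the lowest average delay. The name carrier_delay_summary and the two count columns are illustrative additions, not part of the original code.

```r
library(dplyr)
library(nycflights23)

# One row per carrier: flight counts alongside the average arrival delay
# (carrier_delay_summary is a hypothetical name used for illustration)
carrier_delay_summary <- flights |>
  group_by(carrier) |>
  summarize(
    n_flights = n(),                               # all flights, including those with missing arr_delay
    n_with_arr_delay = sum(!is.na(arr_delay)),     # flights that actually enter the average
    avg_arr_delay = mean(arr_delay, na.rm = TRUE)  # same calculation as avg_arrival_delay
  ) |>
  arrange(desc(avg_arr_delay))                     # highest average delay first

carrier_delay_summary
```

A carrier with only a small number of flights can have an average that shifts a lot from just a few unusual flights, so the counts give context for the comparison.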

The following code draws the bar plot from the summarized data.

# Carriers on the y axis, ordered by average delay; fill marks whether the average is positive
ggplot(avg_arrival_delay, aes(y = reorder(carrier, avg_arr_delay),
                              x = avg_arr_delay, fill = avg_arr_delay > 0)) +
  geom_col() +  # bar length is the precomputed average, same as geom_bar(stat = "identity")
  # Colors and labels are matched in order FALSE, TRUE; this assumes both groups are present
  scale_fill_manual(values = alpha(c("green", "red"), 0.7),
                    labels = c("On Time", "Delayed"),
                    name = "Arrival Status") +
  labs(title = "Average Arrival Delay by Carrier",
       x = "Average Arrival Delay (minutes)",
       y = "Carrier",
       caption = "Data source: nycflights23") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 15, face = "bold", hjust = 0.5),  # hjust centers the text
    axis.title = element_text(size = 13, face = "bold"),
    axis.text = element_text(size = 10)  # Slightly larger axis text for better readability
  )

Summary

This bar plot shows the average arrival delay for each carrier's flights departing from New York City in 2023. Carriers are on the y axis, ordered by average delay, and the average delay in minutes is on the x axis. Carriers with a positive average delay are colored red, indicating that their flights arrive late on average. Carriers with a zero or negative average delay are colored green, indicating that their flights arrive on time or early on average. In both cases the bar length shows by how much. This makes it easy to compare carriers at a glance. Notably, G4 shows a significant average delay that stands out from the other carriers. Because the value is an average over all of G4's flights, it suggests that many of them do not arrive on time, although a mean can also be pulled up by a small number of very long delays. Either way, it points to G4 as a carrier that should examine what is contributing to its delays.
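
A positive average does not by itself say how many flights were late, since a few very long delays can raise the mean. One way to check, sketched here as an assumption-laden follow-up rather than part of the original analysis, is to compute the share of flights that arrived late and the median delay for each carrier. The name delay_share and the definition of "late" as arr_delay > 0 are choices made for this illustration.

```r
library(dplyr)
library(nycflights23)

# Share of late arrivals and median delay per carrier
# (delay_share is a hypothetical name; "late" here means arr_delay > 0)
delay_share <- flights |>
  filter(!is.na(arr_delay)) |>             # keep only flights with a recorded arrival delay
  group_by(carrier) |>
  summarize(
    avg_arr_delay = mean(arr_delay),
    median_arr_delay = median(arr_delay),  # less sensitive to extreme delays than the mean
    share_late = mean(arr_delay > 0)       # proportion of flights that arrived late
  ) |>
  arrange(desc(share_late))

delay_share
```

If a carrier's mean is high but its median and share_late are modest, the average is probably driven by a few extreme delays rather than by most of its flights running late.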