NYC Flights Visualization

Author

Iris Wu

Load the library

library(tidyverse)
library(nycflights23)

data(flights)
data(airports)

Remove NA values

#Keep only flights with a recorded distance, arrival delay, and departure delay
flightsnoNA <- flights |>
  filter(!is.na(distance), !is.na(arr_delay), !is.na(dep_delay))
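
As a rough sketch of an equivalent approach, `tidyr::drop_na()` removes rows with missing values in the named columns, and comparing row counts shows how many flights the step discards. The `toy_flights` tibble and the `rows_dropped` name below are hypothetical stand-ins, not part of the original analysis.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for `flights`, with a missing value in three of four rows
toy_flights <- tibble(
  distance  = c(100, 200, NA, 400),
  arr_delay = c(5, NA, 10, -3),
  dep_delay = c(2, 7, 1, NA)
)

# Keep only rows where all three columns are present
toy_noNA <- toy_flights |>
  drop_na(distance, arr_delay, dep_delay)

# Number of rows removed by the step
rows_dropped <- nrow(toy_flights) - nrow(toy_noNA)
```

Checking the number of dropped rows on the real data is a quick way to see how much of the year's traffic the averages leave out.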

Combine the month, day, and year columns into one

#Name the new column "date"
flights2 <- flightsnoNA |>
  unite(date, year, month, day, sep = "-") |>
  #parse the combined string as a Date; the format matches the "-" separator
  mutate(newdate = as.Date(date, format = "%Y-%m-%d"))
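
An alternative worth considering, sketched here on a hypothetical `toy_flights` tibble, is `lubridate::make_date()`, which builds a Date directly from the numeric `year`, `month`, and `day` columns without going through a string. Unlike `unite()`, it keeps the three original columns.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for the year/month/day columns of `flights`
toy_flights <- tibble(
  year  = c(2023L, 2023L, 2023L),
  month = c(1L, 11L, 12L),
  day   = c(5L, 22L, 31L)
)

# Build a Date column directly from the numeric parts
toy_dated <- toy_flights |>
  mutate(newdate = make_date(year, month, day))
```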

Filter for dates from the day before Thanksgiving through New Year’s Eve

#Include Nov 22, the Wednesday before Thanksgiving, in the date range
flights_holidayseason <- flights2 |>
  filter(newdate >= as.Date("2023-11-22") & newdate <= as.Date("2023-12-31"))

Join the dataset with “airports”

flights_hs <- left_join(flights_holidayseason, airports, by = c("origin" = "faa")) |>
  #create a new column that spells out the airport codes
  mutate(airport = name)
#add the missing period after the middle initial
flights_hs$airport <- gsub("John F Kennedy", "John F. Kennedy", flights_hs$airport)
#remove columns made redundant by "airport" and "newdate"
flights_hs2 <- flights_hs |> select(-origin, -date)
#create a new column called "period"
flights_hs3 <- flights_hs2 |>
  mutate(period = "holiday season")

Group data by airport

by_airport <- flights_hs3 |>
  group_by(airport, period) |>
  summarize(avg_dep_delay = mean(dep_delay),
            .groups = "drop") 
head(by_airport)
# A tibble: 3 × 3
  airport                               period         avg_dep_delay
  <chr>                                 <chr>                  <dbl>
1 John F. Kennedy International Airport holiday season         10.4 
2 La Guardia Airport                    holiday season          6.21
3 Newark Liberty International Airport  holiday season          6.80

Create a dataset for comparison, repeating the steps above

This dataset covers January 1 to November 21, 2023. It excludes the holiday season (November 22 to December 31).

flights_year <- flights2 |>
  filter(newdate < as.Date("2023-11-22"))
flights_y2 <- left_join(flights_year, airports, by = c("origin" = "faa")) |>
  mutate(airport = name)
#clean up the data
flights_y2$airport <- gsub("John F Kennedy", "John F. Kennedy", flights_y2$airport)
#remove redundant columns
flights_y3 <- flights_y2 |> select(-origin, -date)
#create column for period
flights_y4 <- flights_y3 |>
  mutate(period = "rest of year")
by_airport2 <- flights_y4 |>
  group_by(airport, period) |>
  summarize(avg_dep_delay = mean(dep_delay),
            .groups = "drop")
head(by_airport2)
# A tibble: 3 × 3
  airport                               period       avg_dep_delay
  <chr>                                 <chr>                <dbl>
1 John F. Kennedy International Airport rest of year          16.4
2 La Guardia Airport                    rest of year          11.2
3 Newark Liberty International Airport  rest of year          16.2
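
Because the two datasets go through identical steps, one possible simplification is to label each flight's period in a single pipeline and summarize once. The sketch below is not the method used above. It runs on a hypothetical `toy` tibble and assumes every date falls in 2023, so anything on or after November 22 counts as holiday season.

```r
library(dplyr)

# Hypothetical stand-in for the joined flights data
toy <- tibble(
  airport   = c("A", "A", "A", "B", "B", "B"),
  newdate   = as.Date(c("2023-03-01", "2023-11-22", "2023-12-25",
                        "2023-06-15", "2023-11-21", "2023-12-31")),
  dep_delay = c(10, 4, 6, 20, 30, 5)
)

# First day of the holiday season as defined in this report
holiday_start <- as.Date("2023-11-22")

toy_summary <- toy |>
  # Label each flight instead of building two separate datasets
  mutate(period = if_else(newdate >= holiday_start,
                          "holiday season", "rest of year")) |>
  group_by(airport, period) |>
  summarize(avg_dep_delay = mean(dep_delay), .groups = "drop")
```

This yields one row per airport and period, the same shape that `bind_rows()` produces later from the two separate summaries.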

Data Visualization

#Combine the two datasets
flights_final <- bind_rows(by_airport, by_airport2)
#Remove redundant words ("International", "Airport") and the extra spaces they leave behind
flights_final$airport <- str_squish(gsub("International|Airport", "", flights_final$airport))
ggplot(flights_final, aes(x = airport, y = avg_dep_delay, fill = period)) +
  geom_col(position = "dodge") +
  scale_fill_manual(name = "Time Period",
                    labels = c("Holiday Season", "Rest of Year"),
                    values = c("red3", "orange")) +
  scale_y_continuous(limits = c(0, 20)) +
  labs(x = "NYC Airports",
       y = "Avg Departure Delay (minutes)",
       title = "Average Departure Delays of NYC Flights during \n 2023 Holiday Season* Compared to Rest of Year",
       subtitle = "*From Nov 22 to Dec 31",
       caption = "Source: FAA Aircraft Registry") +
  theme_gray() +
  theme(plot.title = element_text(hjust = .6, face = "bold"),
        plot.subtitle = element_text(hjust = .5, face = "italic"),
        plot.caption = element_text(hjust = .5, vjust = -2, face = "italic"),
        axis.text.x = element_text(vjust = -.5),
        axis.title.x = element_text(vjust = -2),
        axis.title.y = element_text(vjust = 2))

Description of Data Visualization

I wanted to see whether average departure delays of NYC flights were longer during the holiday season than during the rest of the year, so I created a bar chart comparing the two time periods. In this chart, the “holiday season” runs from November 22 to December 31 (New Year’s Eve). I included November 22 because it was the Wednesday before Thanksgiving in 2023, and that day is usually one of the busiest travel days of the year. The “rest of year” period runs from January 1 to November 21.

Since the holiday season is a busy time for travel, I had assumed that the increased number of flights would lead to longer delays. Surprisingly, the chart shows that average departure delays were shorter during the holiday season at all three airports.

I think this chart is most useful for making comparisons among and within the three NYC airports. For example, Newark Liberty had the greatest difference between its average departure delay during the holiday season and the rest of the year, while La Guardia had the smallest. Of the three, John F. Kennedy had the longest average departure delay during the rest of the year. These comparisons raise questions about practices each airport could implement to reduce departure delays during the holiday season or the entire year.
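
The comparisons in this description can be checked with a small calculation. The sketch below re-enters the rounded averages printed in the two summary tables by hand, so the differences are approximate. The `delays` and `delay_gap` names are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Rounded averages copied from the printed summaries (minutes)
delays <- tribble(
  ~airport,          ~period,          ~avg_dep_delay,
  "John F. Kennedy", "holiday season", 10.4,
  "La Guardia",      "holiday season", 6.21,
  "Newark Liberty",  "holiday season", 6.80,
  "John F. Kennedy", "rest of year",   16.4,
  "La Guardia",      "rest of year",   11.2,
  "Newark Liberty",  "rest of year",   16.2
)

# One row per airport, with the gap between the two periods
delay_gap <- delays |>
  pivot_wider(names_from = period, values_from = avg_dep_delay) |>
  mutate(difference = `rest of year` - `holiday season`) |>
  arrange(desc(difference))
```

With these rounded inputs, the gaps come to about 9.4 minutes for Newark Liberty, 6.0 for John F. Kennedy, and 5.0 for La Guardia, which matches the ordering described above.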

Use of AI Disclosure

I was stuck on how to make each bar on the chart represent a different time period (holiday season vs. rest of year), so I asked Google Gemini (2.5 Flash Version) for suggestions.

Prompt used: “Working in R studio, I have created two datasets, data1 and data2, one showing flights in x time period and another showing flights in y time period. I want to plot them in a grouped bar chart, with one bar showing x period and another y period (they should be different colors). What is the line of code I should start with?”

Answer from Google Gemini: “The essential first step is to combine your datasets and add a column to identify the time period for each observation. This prepares the data for ggplot2’s grouping aesthetic. The line of code you should start with is:

combined_data <- bind_rows(mutate(data1, time_period = "X"), mutate(data2, time_period = "Y"))”

I used Gemini’s suggestion to write the mutate(period = "holiday season") and mutate(period = "rest of year") code and to combine the two datasets using flights_final <- bind_rows(by_airport, by_airport2).