NYC Flights Visualization

Author

Senay Leul Kahsay

Loading in libraries and the dataset

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights23)
data(flights)

Cleaning the data

First, replace the carrier abbreviations with full airline names.

flights$carrier <- factor(flights$carrier, 
                        levels = c("9E", "AA", "AS", "B6", "DL", "F9", "G4", "HA", "MQ", "NK", "OO", "UA", "WN", "YX"), 
                        labels = c("Endeavor Air", "American Airlines", "Alaska Airlines", "JetBlue Airways", "Delta Airlines", "Frontier Airlines", "Allegiant Air", "Hawaiian Airlines", "Envoy Air", "Spirit Airlines", "Skywest Airlines", "United Airlines", "Southwest Airlines", "Republic Airline"))
flights |> count(carrier)
# A tibble: 14 × 2
   carrier                n
   <fct>              <int>
 1 Endeavor Air       54141
 2 American Airlines  40525
 3 Alaska Airlines     7843
 4 JetBlue Airways    66169
 5 Delta Airlines     61562
 6 Frontier Airlines   1286
 7 Allegiant Air        671
 8 Hawaiian Airlines    366
 9 Envoy Air            357
10 Spirit Airlines    15189
11 Skywest Airlines    6432
12 United Airlines    79641
13 Southwest Airlines 12385
14 Republic Airline   88785
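
One thing worth checking with this approach: factor() turns any value that is not listed in levels into NA without a warning. The count above has no NA row, so every carrier code was matched. The sketch below shows the behavior on a made-up vector; the code "ZZ" and all object names are hypothetical and not part of the dataset.

```r
# Toy carrier codes; "ZZ" is a made-up code that is not in the levels
codes <- c("AA", "DL", "ZZ")
known_codes <- c("AA", "DL")

recoded <- factor(codes,
                  levels = known_codes,
                  labels = c("American Airlines", "Delta Airlines"))

# Values missing from `levels` become NA silently
n_missing <- sum(is.na(recoded))

# Listing the unmatched codes before recoding makes the gap visible
unmatched <- setdiff(unique(codes), known_codes)
```

Running the same setdiff() on the original carrier column before recoding would list any abbreviation left out of the levels vector.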

Then do the same for the origin airport codes.

flights$origin <- factor(flights$origin, 
                         levels = c("EWR", "JFK", "LGA"), 
                         labels = c("Newark Liberty Airport", "John F. Kennedy Airport", "LaGuardia Airport"))

Visualization

For my visualization, I want to show the average departure delay, in minutes, for each airline at each of the three origin airports during August, which is typically one of the busiest travel months. To do this, I first filter the data down to August flights.

august_flights <- flights |> 
  filter(month==8) 

Next, I group the August flights by origin and carrier and compute the average departure delay for each group, ignoring missing values.

grouped_august_flights <- august_flights |>
  group_by(origin, carrier) |> 
  summarize(average_delay = mean(dep_delay, na.rm = TRUE))
`summarise()` has grouped output by 'origin'. You can override using the
`.groups` argument.
grouped_august_flights
# A tibble: 28 × 3
# Groups:   origin [3]
   origin                 carrier           average_delay
   <fct>                  <fct>                     <dbl>
 1 Newark Liberty Airport Endeavor Air               6.01
 2 Newark Liberty Airport American Airlines         11.8 
 3 Newark Liberty Airport Alaska Airlines           18.3 
 4 Newark Liberty Airport JetBlue Airways           16.7 
 5 Newark Liberty Airport Delta Airlines            13.3 
 6 Newark Liberty Airport Allegiant Air              7.91
 7 Newark Liberty Airport Spirit Airlines           25.8 
 8 Newark Liberty Airport Skywest Airlines           7.5 
 9 Newark Liberty Airport United Airlines           15.5 
10 Newark Liberty Airport Republic Airline           1.72
# ℹ 18 more rows
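
The summarise() message above says the result is still grouped by origin. That is harmless for plotting, but it can surprise later steps that expect an ungrouped table. As a minimal sketch, assuming a made-up toy tibble (the values and object names are hypothetical), passing .groups = "drop" returns an ungrouped result and silences the message:

```r
library(dplyr)

# Made-up toy data with the same column names as the flights table
toy <- tibble(
  origin    = c("EWR", "EWR", "JFK", "JFK"),
  carrier   = c("DL", "DL", "B6", "B6"),
  dep_delay = c(10, 20, 5, NA)
)

toy_means <- toy |>
  group_by(origin, carrier) |>
  summarize(average_delay = mean(dep_delay, na.rm = TRUE),
            .groups = "drop")  # drop all grouping from the result

# FALSE: the summary is a plain, ungrouped tibble
still_grouped <- is_grouped_df(toy_means)
```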

Finally, I create the visualization as a heat map, which can show a numeric value across two categorical variables.

ggp <- ggplot(grouped_august_flights, aes(origin, carrier)) +
  geom_tile(aes(fill=average_delay)) + 
  scale_fill_distiller(palette="RdYlBu") +
  theme_dark() +
  labs(x = "NYC Airports",
       y="Airlines",
       caption = "Source: Bureau of Transportation Statistics, via nycflights23",
       fill = "Average Departure Delay \n (in minutes)",
       title = "Average Flight Departure Delays of Airlines from \n Three Different Airports in August") 
ggp
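
Because the summary table only has rows for airport and airline pairs that actually occur, geom_tile() draws nothing for the missing pairs. One possible way to make those gaps explicit, sketched here on a made-up toy table (values and object names are hypothetical), is to fill in the missing combinations with tidyr::complete() and give missing values their own color through the na.value argument of scale_fill_distiller():

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up toy summary: the JFK / DL combination is absent
toy_means <- tibble(
  origin        = c("EWR", "EWR", "JFK"),
  carrier       = c("B6", "DL", "B6"),
  average_delay = c(10, 20, 30)
)

# Add one row per origin x carrier pair; new rows get NA for average_delay
toy_full <- toy_means |>
  ungroup() |>
  complete(origin, carrier)

# Tiles with NA are drawn in the na.value color instead of being left out
toy_plot <- ggplot(toy_full, aes(origin, carrier)) +
  geom_tile(aes(fill = average_delay)) +
  scale_fill_distiller(palette = "RdYlBu", na.value = "grey50")
```

With the real data, the same ungroup() and complete(origin, carrier) steps could be applied to grouped_august_flights before plotting.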

Summary

For my visualization, I chose to show the average departure delay in August for airlines flying out of Newark Liberty Airport, John F. Kennedy Airport, and LaGuardia Airport. I used a heat map because it can display a value across two categorical variables, here airport and airline. The fill color encodes the average departure delay in minutes.

I recommend taking these fill values with a grain of salt, because a mean is easily pulled upward by a few extreme delays. For Delta Airlines flights from Newark Liberty Airport in August, for example, the average departure delay was 13.28 minutes, yet the median was -1.5 minutes. The minimum was -16, meaning that flight left 16 minutes early, and the maximum was 1047 minutes, which is more than 17 hours. The averages therefore may not represent a typical flight.

Something that may stand out at first glance is the empty tiles for some airport and airline combinations. These appear because the data contain no flights for those combinations, at least for the month of August.

ss <- flights |> filter(origin == "Newark Liberty Airport", carrier == "Delta Airlines", month == 8)
summary(ss$dep_delay)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 -16.00   -5.00   -1.50   13.28    6.00 1047.00      10
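
Since the gap between the mean and the median above is large, it may help to report both for every group. The following is a rough sketch rather than part of the original analysis: summarize_delays() is a hypothetical helper name, and the toy data frame is made up purely to show the effect of one extreme delay.

```r
library(dplyr)

# Hypothetical helper: mean, median, and row count of departure delays
# for each origin / carrier pair
summarize_delays <- function(df) {
  df |>
    group_by(origin, carrier) |>
    summarize(
      mean_delay   = mean(dep_delay, na.rm = TRUE),
      median_delay = median(dep_delay, na.rm = TRUE),
      n_flights    = n(),  # counts rows, including those with a missing delay
      .groups = "drop"
    )
}

# Made-up toy data: a single 120-minute delay dominates the EWR / DL mean
toy <- data.frame(
  origin    = c("EWR", "EWR", "EWR", "JFK", "JFK"),
  carrier   = c("DL", "DL", "DL", "B6", "B6"),
  dep_delay = c(-5, -1, 120, 3, NA)
)

toy_summary <- summarize_delays(toy)
```

In the toy data, the EWR / DL group has a mean delay of 38 minutes but a median of -1. Calling summarize_delays(august_flights) would give the same comparison for the real August data, and median_delay could be used as an alternative fill value.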