Project 1: nyc_flights

Author

AskSalomon

Finding Nemo:

Packages:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
library(wesanderson)
library(alluvial)
library(ggalluvial)

Cleaning the flights dataset:

flights_nona <- flights |>
  filter(!is.na(distance) & !is.na(arr_delay))  
# remove na's for distance and arr_delay
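
This filter has a side effect worth keeping in mind: a flight with no recorded arrival (a cancelled flight, for example) has a missing `arr_delay` and is dropped here. A minimal sketch of how to count such flights per day in February, assuming that a missing `dep_time` marks a cancellation (the name `cancelled_feb` is only illustrative):

```r
library(dplyr)
library(nycflights13)

# Assumption: a flight with no recorded departure time was cancelled.
cancelled_feb <- flights |>
  filter(month == 2) |>
  group_by(day) |>
  summarise(
    scheduled = n(),                          # all scheduled flights that day
    cancelled = sum(is.na(dep_time)),         # flights that never departed
    share_cancelled = cancelled / scheduled
  )

# Days with the largest share of cancellations first
cancelled_feb |> arrange(desc(share_cancelled))
```

If the storm grounded flights, the affected days should sit at the top of this table.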

First idea: If New York can’t deal with a storm in 2022, it probably wasn’t any better in 2013

Last time I flew through New York, a thunderstorm grounded half the planes. Everything was terrible.

This dataset is probably still too big, so I will zero in on a single month: February.

Why? February is a wintery and stormy month, and best of all, boring. No significant holidays.

# filter() has no na.rm argument; missing values were already removed above
february_flights <- flights_nona |> filter(month == 2)

I then did some quick googling, and there was indeed a winter storm on the 8th. I therefore created a second set for that day, so I could look for the storm there in case it did not show up in the first set.

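# keep only 8 February (filter() has no na.rm argument)
february_789_flights <- flights_nona |> group_by(origin) |> filter(month == 2, day == 8)
# sort from the earliest to the latest departure relative to schedule
february_789_flights |> arrange(dep_delay)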
# A tibble: 455 × 19
# Groups:   origin [3]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     2     8     1126           1140       -14     1450           1445
 2  2013     2     8      552            605       -13      755            805
 3  2013     2     8     1517           1530       -13     1658           1725
 4  2013     2     8      653            705       -12      952            940
 5  2013     2     8      620            630       -10      753            800
 6  2013     2     8      812            822       -10     1021           1019
 7  2013     2     8     1340           1350       -10     1627           1518
 8  2013     2     8      551            600        -9      848            910
 9  2013     2     8      551            600        -9      647            703
10  2013     2     8      651            700        -9      812            807
# ℹ 445 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
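
The name `february_789_flights` suggests a window from the 7th to the 9th, while the filter above keeps only the 8th. A sketch of the wider window, assuming that was the intent (`feb_7_to_9` is an illustrative name), counting completed flights per airport and day:

```r
library(dplyr)
library(nycflights13)

# Same cleaning step as above: keep flights with a recorded arrival delay
feb_7_to_9 <- flights |>
  filter(!is.na(distance), !is.na(arr_delay)) |>
  filter(month == 2, day %in% 7:9)

# One row per airport and day, with the number of completed flights
feb_7_to_9 |> count(origin, day, name = "n_flights")
```

Comparing the counts for the 7th against the 8th and 9th gives a first impression of the storm before any seat numbers are involved.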

Adding the number of plane seats

My original plan was to relate the number of seats lost to the severity of the storm; however, I could not arrive at a good relation in time.

planes_seats <- planes |> select(tailnum, seats) |> filter(!is.na(seats))
feb_passengers <- february_flights |> 
  left_join(planes_seats, by = "tailnum") 
# total seats per airport and day; grouping by both columns in one step
# avoids the deprecated multi-row summarise() call
Seats_Day <- feb_passengers |>
  group_by(origin, day) |>
  summarise(total_seats_day = sum(seats, na.rm = TRUE), .groups = "drop")
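
Because the seats are attached with `left_join()`, any flight whose `tailnum` has no match in `planes` keeps a missing `seats` value and contributes nothing to the daily totals. A small sketch of how to check how many February flights that affects per airport (the name `seat_coverage` and its columns are illustrative):

```r
library(dplyr)
library(nycflights13)

seat_coverage <- flights |>
  filter(month == 2, !is.na(distance), !is.na(arr_delay)) |>
  left_join(select(planes, tailnum, seats), by = "tailnum") |>
  group_by(origin) |>
  summarise(
    n_flights = n(),
    without_seats = sum(is.na(seats)),           # flights with no seat count
    share_without_seats = without_seats / n_flights
  )

seat_coverage
```

A large share of unmatched flights at one airport would mean its seat totals understate the real capacity.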

Month of February

ggalluv <- Seats_Day |>
  ggplot(aes(x = day, y = total_seats_day, alluvium = origin)) + 
  theme_bw() +
  geom_alluvium(aes(fill = origin), color = "black", alpha = 1) +
  scale_fill_manual(values = wes_palette(n = 3, name = "FantasticFox1" )) + 
  scale_x_continuous(breaks = 1:28, limits = c(1,28)) +
  labs(title = "A storm hits New York on February 8th and 9th 2013",
       y = "Number of Seats", 
       x = "Days of the Month of February",
       fill = "Airport of Origin")
ggalluv

Not only does a storm appear in our data: this is no ordinary storm.

This, is Nemo.

https://theweek.com/articles/467870/everything-need-know-about-winter-storm-nemo

ggalluv <- Seats_Day |>
  ggplot(aes(x = day, y = total_seats_day, alluvium = origin)) + 
  theme_bw() +
  geom_alluvium(aes(fill = origin), color = "black", alpha = 1) +
  scale_fill_manual(values = wes_palette(n = 3, name = "FantasticFox1" )) + 
  scale_x_continuous(breaks = 1:28, limits = c(1,28)) +
  labs(title = "Finding Nemo: The effect of a winter storm on EWR, JFK and LGA",
       y = "Number of Seats", 
       x = "Days of the Month of February",
       fill = "Airport of Origin")
ggalluv

Conclusion

So I accidentally found a storm named “Nemo”. What I find particularly neat about this graph is how clearly the storm comes through in the data. My original idea was to use both passenger numbers and the number of aircraft to demonstrate that small and private aircraft are less resilient. However, passenger numbers also “weigh” the capacity of the airports more accurately, and I was worried that the signal of the February storms could be buried by fluctuations in the number of small passenger aircraft. To my delight, we can see how the number of potential passengers (represented by the total number of seats across all aircraft) that departed from JFK, EWR and LGA plummeted between the 8th and the 10th of February. As for why the numbers do not reach zero on those days, it turns out the airports were still operating until the late afternoon on the 8th and managed to return to operation later in the day on the 9th.
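
One way to put a number on the drop is sketched below, under the assumption that the median February day is a fair baseline for each airport. The names `storm_drop`, `baseline` and `relative_to_baseline` are illustrative and not part of the analysis above.

```r
library(dplyr)
library(nycflights13)

# Rebuild the daily seat totals so this sketch runs on its own
seats_day <- flights |>
  filter(month == 2, !is.na(distance), !is.na(arr_delay)) |>
  left_join(select(planes, tailnum, seats), by = "tailnum") |>
  group_by(origin, day) |>
  summarise(total_seats_day = sum(seats, na.rm = TRUE), .groups = "drop")

# Assumption: the median day of the month stands in for a "normal" day
storm_drop <- seats_day |>
  group_by(origin) |>
  mutate(
    baseline = median(total_seats_day),
    relative_to_baseline = total_seats_day / baseline
  ) |>
  ungroup() |>
  filter(day %in% 8:10)

storm_drop
```

Values of `relative_to_baseline` well below 1 on the 8th and 9th would match what the plot shows.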