NYC flights2023 Assignment

Author

Duchelle K

Load the libraries and feed data into the global environment

The data I will use come from ‘nycflights23’, an update of the now 10-year-old ‘nycflights13’ data package. ‘nycflights23’ contains information about all flights that departed from the three main New York City airports in 2023, along with metadata on airlines, airports, weather, and planes.

I will use the ‘flights’ dataset that is pre-built in the ‘nycflights23’ package.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights23)
data("flights") # loads the flights dataset into my global environment

To view the data structure, I use ‘str’.

# View the structure of data
str(flights)
tibble [435,352 × 19] (S3: tbl_df/tbl/data.frame)
 $ year          : int [1:435352] 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
 $ month         : int [1:435352] 1 1 1 1 1 1 1 1 1 1 ...
 $ day           : int [1:435352] 1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time      : int [1:435352] 1 18 31 33 36 503 520 524 537 547 ...
 $ sched_dep_time: int [1:435352] 2038 2300 2344 2140 2048 500 510 530 520 545 ...
 $ dep_delay     : num [1:435352] 203 78 47 173 228 3 10 -6 17 2 ...
 $ arr_time      : int [1:435352] 328 228 500 238 223 808 948 645 926 845 ...
 $ sched_arr_time: int [1:435352] 3 135 426 2352 2252 815 949 710 818 852 ...
 $ arr_delay     : num [1:435352] 205 53 34 166 211 -7 -1 -25 68 -7 ...
 $ carrier       : chr [1:435352] "UA" "DL" "B6" "B6" ...
 $ flight        : int [1:435352] 628 393 371 1053 219 499 996 981 206 225 ...
 $ tailnum       : chr [1:435352] "N25201" "N830DN" "N807JB" "N265JB" ...
 $ origin        : chr [1:435352] "EWR" "JFK" "JFK" "JFK" ...
 $ dest          : chr [1:435352] "SMF" "ATL" "BQN" "CHS" ...
 $ air_time      : num [1:435352] 367 108 190 108 80 154 192 119 258 157 ...
 $ distance      : num [1:435352] 2500 760 1576 636 488 ...
 $ hour          : num [1:435352] 20 23 23 21 20 5 5 5 5 5 ...
 $ minute        : num [1:435352] 38 0 44 40 48 0 10 30 20 45 ...
 $ time_hour     : POSIXct[1:435352], format: "2023-01-01 20:00:00" "2023-01-01 23:00:00" ...

Filter and sort data using dplyr package

Find all the flights that arrived earlier than their scheduled arrival time, that is, all the flights whose “arr_delay” is less than 0.

# Flights with "arr_delay" less than 0
early_arrival_flights <- flights |>
  filter(arr_delay < 0) |>
  arrange(arr_delay)
early_arrival_flights
# A tibble: 275,815 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2023    12    25      838            845        -7     1054           1231
 2  2023    12    26     1956           1959        -3     2208           2344
 3  2023     4    11      907            915        -8     1118           1250
 4  2023     6     7     1547           1548        -1     1823           1955
 5  2023    12    26     1659           1710       -11     1941           2112
 6  2023     4    11     1643           1649        -6     1846           2015
 7  2023    12    26      853            900        -7     1110           1238
 8  2023     7    23     1725           1730        -5     2014           2140
 9  2023    12    25     1511           1515        -4     1707           1833
10  2023    12    25     1949           1954        -5     2218           2344
# ℹ 275,805 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Notice that arrange() sorts in ascending order by default, so the most negative “arr_delay” values (the flights that arrived earliest relative to schedule) appear first. That is why the largest early arrivals lead the table even though I didn’t use arrange(desc()).
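
To make the sorting behaviour concrete, here is a minimal sketch on a small made-up data frame (the flight labels and delay values are hypothetical, not taken from the ‘flights’ dataset). It compares the default ascending order with arrange(desc()).

```r
library(dplyr)

# Toy data: hypothetical flights that all arrived early (negative delays)
toy_delays <- data.frame(
  flight    = c("A", "B", "C", "D"),
  arr_delay = c(-5, -40, -12, -1)
)

# Default ascending order: the most negative delay (earliest arrival) comes first
earliest_first <- toy_delays |>
  arrange(arr_delay)

# desc() reverses the order: flights closest to their schedule come first
closest_first <- toy_delays |>
  arrange(desc(arr_delay))

earliest_first
closest_first
```

With only negative values in the column, ascending order is the one that ranks flights from the biggest early arrival to the smallest.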

Group and summarize data

Calculate the total flights by carriers

# Calculate total flights by carriers
carriers_sum <- flights |>
  group_by(carrier) |>
  summarize(total_flights = n()) |>
  arrange(desc(total_flights))  ## total_flights is a new variable created

carriers_sum
# A tibble: 14 × 2
   carrier total_flights
   <chr>           <int>
 1 YX              88785
 2 UA              79641
 3 B6              66169
 4 DL              61562
 5 9E              54141
 6 AA              40525
 7 NK              15189
 8 WN              12385
 9 AS               7843
10 OO               6432
11 F9               1286
12 G4                671
13 HA                366
14 MQ                357
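
As a side note, dplyr’s count() can express the same group, summarize, and sort steps in one call. The sketch below uses a made-up data frame (hypothetical carrier values) rather than the real ‘flights’ data, so the numbers are only illustrative.

```r
library(dplyr)

# Toy data: one row per hypothetical flight
toy_flights <- data.frame(
  carrier = c("UA", "DL", "UA", "B6", "UA", "DL")
)

# count() groups by carrier, counts the rows, and sorts in descending order;
# `name` sets the name of the new count column
toy_counts <- toy_flights |>
  count(carrier, sort = TRUE, name = "total_flights")

toy_counts
```

Applied to ‘flights’, the same call should give a table equivalent to carriers_sum, though I kept the longer form above because it shows each step explicitly.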

Create an alluvial plot representing each carrier’s total flights by month throughout the year

Load the alluvial and ggalluvial packages

library(alluvial)
library(ggalluvial)

Examine the number of flights each carrier operated per month

# calculate total flights by carrier by month
monthly_flights <- flights |>
  group_by(carrier, month) |>
  summarize(total_flights = n()) |>
  arrange(month)
`summarise()` has grouped output by 'carrier'. You can override using the
`.groups` argument.
monthly_flights
# A tibble: 165 × 3
# Groups:   carrier [14]
   carrier month total_flights
   <chr>   <int>         <int>
 1 9E          1          3985
 2 AA          1          3574
 3 AS          1           542
 4 B6          1          5917
 5 DL          1          4836
 6 F9          1            92
 7 G4          1            42
 8 HA          1            31
 9 MQ          1             9
10 NK          1          1176
# ℹ 155 more rows
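
The summarise() message above appears because the result is still grouped by carrier after summarizing over two grouping variables. A minimal sketch of one way to silence it, using a made-up data frame (hypothetical values), is to pass the `.groups` argument explicitly:

```r
library(dplyr)

# Toy data: hypothetical flights with a carrier and a month
toy_flights <- data.frame(
  carrier = c("UA", "UA", "DL", "DL", "DL"),
  month   = c(1L, 2L, 1L, 1L, 2L)
)

# .groups = "drop" returns an ungrouped result and suppresses the message
toy_monthly <- toy_flights |>
  group_by(carrier, month) |>
  summarize(total_flights = n(), .groups = "drop") |>
  arrange(month)

toy_monthly
```

Whether to drop or keep the grouping depends on the next step. Here the later join and mutate work row by row, so either choice gives the same values.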

Notice that, in the annual totals above, the largest number of flights was operated by Republic Airways “YX”.

Rename the months from numbers to names

monthly_flights$month[monthly_flights$month == 1] <- "January"
monthly_flights$month[monthly_flights$month == 2] <- "February"
monthly_flights$month[monthly_flights$month == 3] <- "March"
monthly_flights$month[monthly_flights$month == 4] <- "April"
monthly_flights$month[monthly_flights$month == 5] <- "May"
monthly_flights$month[monthly_flights$month == 6] <- "June"
monthly_flights$month[monthly_flights$month == 7] <- "July"
monthly_flights$month[monthly_flights$month == 8] <- "August"
monthly_flights$month[monthly_flights$month == 9] <- "September"
monthly_flights$month[monthly_flights$month == 10] <- "October"
monthly_flights$month[monthly_flights$month == 11] <- "November"
monthly_flights$month[monthly_flights$month == 12] <- "December"
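
A more compact alternative to the twelve assignments above is base R’s built-in month.name vector, indexed by the month number. The sketch below uses a small made-up vector of month numbers instead of the real column.

```r
# Toy vector of month numbers (hypothetical values)
month_numbers <- c(1L, 2L, 12L, 7L)

# month.name is a built-in character vector of the twelve English month names;
# indexing it by month number returns the matching names
month_labels <- month.name[month_numbers]

month_labels
```

On the real data this would be a single line, monthly_flights$month <- month.name[monthly_flights$month], run in place of the twelve assignments (not after them, since the column would already be character by then).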

Create the alluvial

The mutate(month = factor(month, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))) line ensures that the month column is treated as a factor with levels ordered from January to December. Without it, the month names would be plotted in alphabetical order.

theme(plot.title = element_text(hjust = 0.5)) centers the title. The “hjust” parameter controls the horizontal justification of the title text, where 0 is left-aligned, 0.5 is centered, and 1 is right-aligned.
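
To illustrate why the explicit levels matter, here is a small sketch with a made-up vector of month names. It compares the default alphabetical ordering of a factor with the calendar ordering obtained by supplying the levels (here through the built-in month.name vector, which holds the same twelve names).

```r
# Toy vector of month names (hypothetical, deliberately out of order)
months_chr <- c("March", "January", "February")

# Default factor levels are sorted alphabetically
alphabetical <- as.character(sort(factor(months_chr)))

# Explicit levels put the months in calendar order
calendar <- as.character(sort(factor(months_chr, levels = month.name)))

alphabetical
calendar
```

ggplot2 draws a discrete x-axis in the order of the factor levels, which is why the plot needs the calendar ordering.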

# Merge annual total flights into monthly data
monthly_flights2 <- monthly_flights |>
  left_join(carriers_sum, by = "carrier", suffix = c("_monthly", "_annual")) |>
  mutate(monthly_rate = (total_flights_monthly/total_flights_annual)*100)

# Create the alluvial plot
ggalluv <- monthly_flights2 |>
  mutate(month = factor(month, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))) |>
  ggplot(aes(x = month, y = total_flights_monthly, alluvium = carrier ))+
  geom_alluvium(aes(fill = carrier, text = paste("Carrier:", carrier,
      "<br>Month:", month,
      "<br>Monthly Total:", total_flights_monthly,
      "<br>Annual Total:", total_flights_annual,
      "<br>Rate:", monthly_rate)),
                color = "white",
                width = .1,
                alpha = .8,
                decreasing = FALSE) + 
  scale_fill_manual(values = c("9E" = "darkgreen", "AA" = "#E6BBC7", "AS" = "#B6EEE2", "B6" = "#CAB8E5",
                               "DL" = "#20DE8B", "F9" = "red", "G4" = "darkblue", "HA" = "yellow",
                               "MQ" = "violet", "NK" = "#90CFFF", "OO" = "#FF0076", "UA" = "#FAEFAF",
                               "WN" = "#98FB98", "YX" = "#EA967C")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "NYC Monthly Flights by Carriers",
       x = "Month",
       y = "Monthly Total Flights",
       fill = "Carrier", 
       caption = "Source: FAA and Bureau of Transportation Statistics \n (https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236)") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0.5)) # plot.title and plot.caption center both title and caption.
Warning in geom_alluvium(aes(fill = carrier, text = paste("Carrier:", carrier,
: Ignoring unknown aesthetics: text
ggalluv  

Convert the plot into an interactive alluvial

To create an interactive alluvial plot with tooltips using ggplotly from the plotly library, the text aesthetic must be set in the ggplot object. ggplot2 itself ignores this aesthetic (hence the “Ignoring unknown aesthetics: text” warning above), but ggplotly uses it to build the tooltip.

# Load plotly
library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
# Convert ggplot object to an interactive plotly object
alluvial_plot_interactive <- ggplotly(ggalluv, tooltip = "text")

alluvial_plot_interactive
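
The tooltip mechanism used above can be seen in isolation with a much simpler plot. This is only a sketch on made-up data (the carrier code “XX” and the totals are hypothetical), using a scatter plot instead of an alluvial plot.

```r
library(ggplot2)
library(plotly)

# Toy data: hypothetical monthly totals for one made-up carrier
toy_totals <- data.frame(
  month   = 1:3,
  total   = c(10, 15, 12),
  carrier = "XX"
)

# The text aesthetic is not drawn by ggplot2; it only carries the tooltip
# content, and "<br>" inserts a line break inside the tooltip
toy_plot <- ggplot(toy_totals,
                   aes(x = month, y = total,
                       text = paste("Carrier:", carrier,
                                    "<br>Total:", total))) +
  geom_point()

# tooltip = "text" tells ggplotly to show only the text aesthetic on hover
toy_interactive <- ggplotly(toy_plot, tooltip = "text")

toy_interactive
```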

The alluvial plot titled ‘NYC Monthly Flights by Carriers’ provides an overall view of the flight volumes of the carriers operating out of New York City throughout 2023. The X-axis represents the months from January to December, while the Y-axis shows the total number of flights per month. Each colored curve corresponds to a different airline carrier, with tooltips providing the carrier, month, monthly total flights, annual total flights, and the monthly share of the annual total. The wider a carrier’s curve is, the higher its flight volume.

From the plot, we can observe that the five carriers with the highest flight volumes are Republic Airways “YX”, United Airlines “UA”, JetBlue Airways “B6”, Delta Air Lines “DL”, and Endeavor Air “9E”.

We notice that American Airlines “AA” consistently has a high number of flights each month, with a notable peak in March. Delta Air Lines “DL” also shows a significant volume, particularly in the second half of the year, while JetBlue Airways “B6” has a greater volume of flights in the first half of the year.

Frontier Airlines “F9”, Allegiant Air “G4”, Hawaiian Airlines “HA”, and Envoy Air (American Eagle) “MQ” are almost invisible because they have the fewest flights. Frontier Airlines “F9” is still slightly visible because it has more than one thousand annual flights, and its curve shows a small increase during the summer and the holiday season. Meanwhile, Allegiant Air “G4”, Hawaiian Airlines “HA”, and Envoy Air (American Eagle) “MQ” are barely visible on the plot since each has fewer than one thousand annual flights.

Another carrier that maintains a steady volume throughout the year is Southwest Airlines “WN”, while the carriers with the largest volumes of flights, such as Republic Airways “YX”, United Airlines “UA”, and Delta Air Lines “DL”, exhibit more variability. We can also observe that December is the lowest month in terms of flight volume for Republic Airways “YX”.

Overall, this visualization helps identify seasonal trends, compare carrier performance, and understand the dynamics of air travel in NYC over 2023.