NYC Flights Assignment

Author

Jorge Pineda

NYC Flight Homework

The nycflights23 package is an R data package containing all out-bound flights from NYC in 2023 plus useful metadata

This package was created using the anyflights R package.

The main dataset, flights, contains detailed flight-level information such as scheduled and actual departure/arrival times, delays (in minutes), distance (in miles), and more. The airlines dataset provides a key for airline carrier codes, matching carrier codes to their full names (for example “AA”for “American Airlines”).

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights23)
data(flights)
data(airlines)

Data Preparation

?flights #See units used
starting httpd help server ... done
glimpse(flights) #glimpse of dataset
Rows: 435,352
Columns: 19
$ year           <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 1, 18, 31, 33, 36, 503, 520, 524, 537, 547, 549, 551, 5…
$ sched_dep_time <int> 2038, 2300, 2344, 2140, 2048, 500, 510, 530, 520, 545, …
$ dep_delay      <dbl> 203, 78, 47, 173, 228, 3, 10, -6, 17, 2, -10, -9, -7, -…
$ arr_time       <int> 328, 228, 500, 238, 223, 808, 948, 645, 926, 845, 905, …
$ sched_arr_time <int> 3, 135, 426, 2352, 2252, 815, 949, 710, 818, 852, 901, …
$ arr_delay      <dbl> 205, 53, 34, 166, 211, -7, -1, -25, 68, -7, 4, -13, -14…
$ carrier        <chr> "UA", "DL", "B6", "B6", "UA", "AA", "B6", "AA", "UA", "…
$ flight         <int> 628, 393, 371, 1053, 219, 499, 996, 981, 206, 225, 800,…
$ tailnum        <chr> "N25201", "N830DN", "N807JB", "N265JB", "N17730", "N925…
$ origin         <chr> "EWR", "JFK", "JFK", "JFK", "EWR", "EWR", "JFK", "EWR",…
$ dest           <chr> "SMF", "ATL", "BQN", "CHS", "DTW", "MIA", "BQN", "ORD",…
$ air_time       <dbl> 367, 108, 190, 108, 80, 154, 192, 119, 258, 157, 164, 1…
$ distance       <dbl> 2500, 760, 1576, 636, 488, 1085, 1576, 719, 1400, 1065,…
$ hour           <dbl> 20, 23, 23, 21, 20, 5, 5, 5, 5, 5, 5, 6, 5, 6, 6, 6, 6,…
$ minute         <dbl> 38, 0, 44, 40, 48, 0, 10, 30, 20, 45, 59, 0, 59, 0, 0, …
$ time_hour      <dttm> 2023-01-01 20:00:00, 2023-01-01 23:00:00, 2023-01-01 2…

Find NA values columns (remove missing values before analysis)

colSums(is.na(flights)) %>% 
  .[. > 0]
 dep_time dep_delay  arr_time arr_delay   tailnum  air_time 
    10738     10738     11453     12534      1913     12534 
#identify and count missing values (NA) in each column

Filter out rows with NA values in key columns

flights_nona <- flights |>
  filter(!is.na(distance) & !is.na(arr_delay) & !is.na(dep_delay))
# remove na's for distance, arr_delay, departure delay

Join airline names to each flight

flights_named <- left_join(flights_nona, airlines, by = "carrier")

Join airline names to each flight

flights_named <- flights_nona |>
  left_join(airlines, by = "carrier")

Remove “Inc.” and “Co.” for cleaner labels

flights_named$name <- gsub("Inc\\.|Co\\.", "", flights_named$name)

Data visualization

Which airlines have the most unpredictable arrival delays? Are some more consistent?

Reorder airlines by median arrival delay (high to low)

ordered_names <- flights_named |>
  group_by(name) |>
  summarise(median_delay = median(arr_delay, na.rm = TRUE)) |>
  arrange(desc(median_delay)) |>
  pull(name)


flights_named$name <- factor(flights_named$name, levels = ordered_names)

Which Airlines Have the Most Unpredictable Arrival Delays?

This boxplot shows the spread and consistency of arrival delays for each airline. Airlines are sorted from top to bottom based on their median arrival delay, making it easy to compare both the typical delay and the variability between carriers. Outliers over 300 minutes are excluded to focus on the most relevant range of delays. In this order, airlines with the worst typical delays appear at the top, while the best-performing carriers are positioned at the bottom.

ggplot(flights_named, aes(x = fct_rev(name), y = arr_delay, fill = name)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.8) +
  coord_flip(ylim = c(-50, 300)) +
  scale_fill_viridis_d(option = "turbo", name = "Airline") +
  labs(
    title = "Variation in Arrival Delays by Airline (NYC Flights 2013)",
    x = "Airline (ordered by median delay)",
    y = "Arrival Delay (minutes)",
    caption = "Source: nycflights23 (flights + airlines datasets)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.caption = element_text(hjust = 0.1),  # shift caption slightly left
    legend.position = "right"
  )

Note: This plot excludes extremely delayed flights over 300 minutes (5 hours), which represent only 2,102 out of 422,818 total flights (~0.5%). The maximum recorded delay was 1812 minutes (more than 30 hours).
By focusing on this cleaner range (–50 to 300 mins), we highlight the central delay tendencies and variability for each airline.
Some airlines, like Hawaiian and Frontier, show both higher median delays and more variability, suggesting inconsistent performance.

Note

# Total number of non-missing arrival delay observations
total_flights <- nrow(flights_named)

# Flights with arrival delay > 300 minutes
extreme_delays <- flights_named |> 
  filter(arr_delay > 300)

num_extreme <- nrow(extreme_delays)

# Max arrival delay
max_delay <- max(flights_named$arr_delay, na.rm = TRUE)

# Percentage of flights with extreme delays
percentage_extreme <- (num_extreme / total_flights) * 100

# Print all results
cat("Total flights with valid arr_delay:", total_flights, "\n")
Total flights with valid arr_delay: 422818 
cat("Flights with arr_delay > 300 mins:", num_extreme, "\n")
Flights with arr_delay > 300 mins: 2102 
cat("Maximum arrival delay:", max_delay, "minutes\n")
Maximum arrival delay: 1812 minutes
cat("Percentage of flights > 300 mins delay:", round(percentage_extreme, 2), "%\n")
Percentage of flights > 300 mins delay: 0.5 %