NYC Flights Homework

Author

M Youssef

library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights23)
Warning: package 'nycflights23' was built under R version 4.5.2
data(flights)
data(airlines)
head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2023     1     1        1           2038       203      328              3
2  2023     1     1       18           2300        78      228            135
3  2023     1     1       31           2344        47      500            426
4  2023     1     1       33           2140       173      238           2352
5  2023     1     1       36           2048       228      223           2252
6  2023     1     1      503            500         3      808            815
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
str(flights)
tibble [435,352 × 19] (S3: tbl_df/tbl/data.frame)
 $ year          : int [1:435352] 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
 $ month         : int [1:435352] 1 1 1 1 1 1 1 1 1 1 ...
 $ day           : int [1:435352] 1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time      : int [1:435352] 1 18 31 33 36 503 520 524 537 547 ...
 $ sched_dep_time: int [1:435352] 2038 2300 2344 2140 2048 500 510 530 520 545 ...
 $ dep_delay     : num [1:435352] 203 78 47 173 228 3 10 -6 17 2 ...
 $ arr_time      : int [1:435352] 328 228 500 238 223 808 948 645 926 845 ...
 $ sched_arr_time: int [1:435352] 3 135 426 2352 2252 815 949 710 818 852 ...
 $ arr_delay     : num [1:435352] 205 53 34 166 211 -7 -1 -25 68 -7 ...
 $ carrier       : chr [1:435352] "UA" "DL" "B6" "B6" ...
 $ flight        : int [1:435352] 628 393 371 1053 219 499 996 981 206 225 ...
 $ tailnum       : chr [1:435352] "N25201" "N830DN" "N807JB" "N265JB" ...
 $ origin        : chr [1:435352] "EWR" "JFK" "JFK" "JFK" ...
 $ dest          : chr [1:435352] "SMF" "ATL" "BQN" "CHS" ...
 $ air_time      : num [1:435352] 367 108 190 108 80 154 192 119 258 157 ...
 $ distance      : num [1:435352] 2500 760 1576 636 488 ...
 $ hour          : num [1:435352] 20 23 23 21 20 5 5 5 5 5 ...
 $ minute        : num [1:435352] 38 0 44 40 48 0 10 30 20 45 ...
 $ time_hour     : POSIXct[1:435352], format: "2023-01-01 20:00:00" "2023-01-01 23:00:00" ...
flights_nona <- flights |>
  filter(!is.na(arr_delay) & !is.na(carrier))
# Use the NA-filtered data so no rows are dropped when drawing the boxplot
boxpl <- flights_nona |>
  ggplot(aes(x = carrier, y = arr_delay, fill = carrier)) +
  geom_boxplot() +
  labs(title = "Arrival Delay Distribution by Airline",
       caption = "Source: FAA Aircraft registry",
       x = "Airline Carrier",
       y = "Arrival delay (minutes)") +
  coord_flip()
boxpl

## I copied this code from the heatmaps code 
final_flights <- flights_nona |>
  select(carrier, arr_delay) |>
  left_join(airlines, by = "carrier")

# Drop the "Inc." / "Co." suffixes, then trim the trailing space they leave behind
final_flights$name <- trimws(gsub("Inc\\.|Co\\.", "", final_flights$name))
# My inclusion/exclusion criteria
final_flights |>
  count(carrier)  # Carriers kept below: B6, DL, 9E, AA, NK
# A tibble: 14 × 2
   carrier     n
   <chr>   <int>
 1 9E      52204
 2 AA      39750
 3 AS       7734
 4 B6      64280
 5 DL      60364
 6 F9       1218
 7 G4        667
 8 HA        362
 9 MQ        354
10 NK      14769
11 OO       6199
12 UA      77438
13 WN      12048
14 YX      85431
top_carrier <- final_flights |>
  filter(carrier %in% c("B6","DL","9E","AA","NK"))
boxpl1 <- top_carrier |>
  ggplot(aes(x=name, y = arr_delay, fill = name)) +
  geom_boxplot() +
  labs(title = "Arrival Delay Distribution by Airline",
       caption = "Source: FAA Aircraft registry",
       x = "Airline Carrier",
       y = "Arrival delay (minutes)",
       fill = "Airline") +
  # Unnamed values are assigned to the airline names in alphabetical order
  scale_fill_manual(values = c("#2ca02c", "#d62728", "#1f77b4", "#ff7f0e", "#9467bd")) +
  coord_flip()

boxpl1

In this visualization, I used a boxplot to show the distribution of arrival delays for different airlines. At first I tried to include all 14 airlines, but the plot looked crowded and was hard to read. Because of that, I decided to focus on five airlines: JetBlue Airways, Delta Air Lines, Endeavor Air, American Airlines, and Spirit Airlines. These are not strictly the five busiest carriers. In the count table, YX (85,431 flights) and UA (77,438) have the most flights, while AA (39,750) and NK (14,769) rank sixth and seventh. With fewer airlines on the plot, it became much easier to compare their delays.
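
If the carriers are meant to be chosen by flight count, the selection can be computed from the data instead of typed by hand. The following is a minimal sketch, not the code used for the plot above: `keep_top_carriers()` is a hypothetical helper, and the toy tibble is made-up data standing in for `final_flights`.

```r
library(dplyr)

# Hypothetical helper: keep only rows from the n carriers with the most flights.
keep_top_carriers <- function(df, n = 5) {
  top <- df |>
    count(carrier, name = "n_flights") |>
    slice_max(n_flights, n = n, with_ties = FALSE)
  df |>
    filter(carrier %in% top$carrier)
}

# Toy rows standing in for final_flights (made-up values, not real flights).
toy <- tibble(
  carrier   = c("YX", "YX", "YX", "UA", "UA", "B6"),
  arr_delay = c(5, -3, 12, 0, 40, -7)
)

top2 <- keep_top_carriers(toy, n = 2)
```

On the real data, `keep_top_carriers(final_flights, n = 5)` would presumably return the rows for whichever five carriers have the largest counts in the table above.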

Each airline is represented by a different color in the plot so that the boxes can be told apart at a glance. The colors also match the legend, which helps the reader quickly see which box belongs to which airline and highlights the differences between the airlines.
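
One way to make the color assignment explicit is to pass `scale_fill_manual()` a named vector, so each color is tied to a specific carrier rather than to its position. This is a sketch under assumptions: the pairing of colors to carrier codes below is my guess at the alphabetical-by-name order that the unnamed vector would produce, and the toy data frame is made-up data standing in for `top_carrier`.

```r
library(ggplot2)

# Assumed pairing of carrier code to color (hypothetical, for illustration).
carrier_colors <- c(
  "AA" = "#2ca02c",
  "DL" = "#d62728",
  "9E" = "#1f77b4",
  "B6" = "#ff7f0e",
  "NK" = "#9467bd"
)

# Toy rows standing in for top_carrier (made-up delays).
toy <- data.frame(
  carrier   = rep(c("B6", "DL"), each = 4),
  arr_delay = c(-5, 0, 3, 20, -10, -2, 1, 45)
)

# Colors are matched to carriers by name, so the order of the vector no longer matters.
p <- ggplot(toy, aes(x = carrier, y = arr_delay, fill = carrier)) +
  geom_boxplot() +
  scale_fill_manual(values = carrier_colors) +
  coord_flip()
```

In the plot above, `fill` is mapped to `name`, so the vector would need to be keyed by the airline names exactly as they appear in that column.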

One thing that stands out in the plot is the large number of outliers for American Airlines. For all five airlines, most flights have arrival delays close to zero minutes, meaning they arrive close to their scheduled time. However, American Airlines shows several points that are much farther from the rest of the data. These points represent flights with very large delays. This suggests that although many American Airlines flights arrive on time, some are delayed much longer than flights on the other airlines.
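
The visual impression of outliers could be checked numerically. The sketch below counts, per airline, the points lying more than 1.5 times the interquartile range beyond the quartiles, which is the rule `geom_boxplot()` uses by default to draw points beyond the whiskers. `count_outliers()` is a hypothetical helper, and the toy tibble is made-up data standing in for `top_carrier`.

```r
library(dplyr)

# Hypothetical helper: count boxplot-style outliers (1.5 * IQR rule) per airline.
count_outliers <- function(df) {
  df |>
    group_by(name) |>
    summarise(
      q1 = quantile(arr_delay, 0.25, names = FALSE),
      q3 = quantile(arr_delay, 0.75, names = FALSE),
      n_outliers = sum(arr_delay < q1 - 1.5 * (q3 - q1) |
                         arr_delay > q3 + 1.5 * (q3 - q1)),
      share_outliers = n_outliers / n(),
      .groups = "drop"
    )
}

# Toy rows standing in for top_carrier (made-up delays, no missing values).
toy <- tibble(
  name      = rep(c("A", "B"), each = 5),
  arr_delay = c(0, 1, 2, 3, 100, 0, 1, 2, 3, 4)
)

outlier_summary <- count_outliers(toy)
```

Comparing `share_outliers` rather than `n_outliers` would account for the airlines having very different numbers of flights.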