NYC Flights23

Author

Mamokotjo Letjama

The NYCFlights dataset

With more than 435,000 departures recorded for 2023, the NYC Flights dataset provides a powerful foundation for analyzing flight delays, airline performance, seasonal patterns, and operational efficiency across major carriers.

Load the nycflights23 package and datasets

library(nycflights23)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data(flights)
data(airlines)
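
The library() calls above assume both packages are already installed. A minimal sketch of a one-time setup step follows; missing_packages() is a hypothetical helper written for illustration, and the sketch assumes nycflights23 and tidyverse are available from your configured CRAN mirror. The install is guarded by interactive() so it only runs from the console, not while the document renders.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("nycflights23", "tidyverse"))

# Install only what is missing, and only in an interactive session.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```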

View the data structure

str(flights)
tibble [435,352 × 19] (S3: tbl_df/tbl/data.frame)
 $ year          : int [1:435352] 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
 $ month         : int [1:435352] 1 1 1 1 1 1 1 1 1 1 ...
 $ day           : int [1:435352] 1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time      : int [1:435352] 1 18 31 33 36 503 520 524 537 547 ...
 $ sched_dep_time: int [1:435352] 2038 2300 2344 2140 2048 500 510 530 520 545 ...
 $ dep_delay     : num [1:435352] 203 78 47 173 228 3 10 -6 17 2 ...
 $ arr_time      : int [1:435352] 328 228 500 238 223 808 948 645 926 845 ...
 $ sched_arr_time: int [1:435352] 3 135 426 2352 2252 815 949 710 818 852 ...
 $ arr_delay     : num [1:435352] 205 53 34 166 211 -7 -1 -25 68 -7 ...
 $ carrier       : chr [1:435352] "UA" "DL" "B6" "B6" ...
 $ flight        : int [1:435352] 628 393 371 1053 219 499 996 981 206 225 ...
 $ tailnum       : chr [1:435352] "N25201" "N830DN" "N807JB" "N265JB" ...
 $ origin        : chr [1:435352] "EWR" "JFK" "JFK" "JFK" ...
 $ dest          : chr [1:435352] "SMF" "ATL" "BQN" "CHS" ...
 $ air_time      : num [1:435352] 367 108 190 108 80 154 192 119 258 157 ...
 $ distance      : num [1:435352] 2500 760 1576 636 488 ...
 $ hour          : num [1:435352] 20 23 23 21 20 5 5 5 5 5 ...
 $ minute        : num [1:435352] 38 0 44 40 48 0 10 30 20 45 ...
 $ time_hour     : POSIXct[1:435352], format: "2023-01-01 20:00:00" "2023-01-01 23:00:00" ...
str(airlines)
tibble [14 × 2] (S3: tbl_df/tbl/data.frame)
 $ carrier: chr [1:14] "9E" "AA" "AS" "B6" ...
 $ name   : chr [1:14] "Endeavor Air Inc." "American Airlines Inc." "Alaska Airlines Inc." "JetBlue Airways" ...

Clean the data: remove rows with NAs in distance, arr_delay, or dep_delay

flights_nona <- flights |>
  filter(!is.na(distance) & !is.na(arr_delay) & !is.na(dep_delay))
flights_nona |>
  select(carrier, flight, month, dep_time, arr_time)
# A tibble: 422,818 × 5
   carrier flight month dep_time arr_time
   <chr>    <int> <int>    <int>    <int>
 1 UA         628     1        1      328
 2 DL         393     1       18      228
 3 B6         371     1       31      500
 4 B6        1053     1       33      238
 5 UA         219     1       36      223
 6 AA         499     1      503      808
 7 B6         996     1      520      948
 8 AA         981     1      524      645
 9 UA         206     1      537      926
10 NK         225     1      547      845
# ℹ 422,808 more rows
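
The outputs above imply that the filter dropped 435,352 − 422,818 = 12,534 rows. A minimal sketch of how one might check where the missing values sit and confirm the row count is shown below. It runs on a small made-up tibble (toy, hypothetical values) rather than the real data, and uses tidyr::drop_na() as an equivalent way to write the same filter.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for `flights`
toy <- tibble(
  carrier   = c("UA", "DL", "B6", "AA"),
  distance  = c(2500, 760, 1576, 1085),
  dep_delay = c(203, NA, 47, 3),
  arr_delay = c(205, NA, NA, -7)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Keep only rows where all three columns are present
toy_nona <- toy |>
  drop_na(distance, arr_delay, dep_delay)

# Rows removed by the cleaning step
rows_dropped <- nrow(toy) - nrow(toy_nona)
```

Replacing toy with flights should reproduce the row difference computed from the outputs above.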

Calculate the average departure time and arrival time per airline

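# dep_time and arr_time are HHMM-coded integers (e.g. 547 means 5:47),
# so these means are averages of the coded values, not of clock minutes
avg_times <- flights_nona |>
  group_by(carrier) |>
  summarize(avg_dep_time = mean(dep_time), avg_arr_time = mean(arr_time))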
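
Because dep_time and arr_time are stored as HHMM-coded integers, averaging them directly can give values that are not valid clock times: the mean of 559 and 601 is 580, which would read as "5:80". One possible workaround, sketched here on made-up data, is to convert to minutes after midnight before averaging. hhmm_to_minutes() is a hypothetical helper, and treating 2400 as midnight is an assumption about the coding.

```r
library(dplyr)

# Hypothetical helper: convert an HHMM-coded time (e.g. 547 = 5:47)
# to minutes after midnight; 2400 is assumed to mean midnight (0).
hhmm_to_minutes <- function(hhmm) {
  ((hhmm %/% 100) * 60 + hhmm %% 100) %% 1440
}

# Hypothetical toy data standing in for `flights_nona`
toy <- tibble(
  carrier  = c("AA", "AA", "B6", "B6"),
  dep_time = c(559, 601, 1830, 1900)
)

avg_minutes <- toy |>
  group_by(carrier) |>
  summarize(avg_dep_min = mean(hhmm_to_minutes(dep_time)))
```

Here AA averages 360 minutes, that is 6:00, which matches intuition for departures at 5:59 and 6:01. Averages of clock times can still mislead for carriers with many flights around midnight.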

Create a bar chart

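# geom_col() draws one segment per row and stacks them, so each bar's
# height is the sum of that carrier's HHMM-coded dep_time values
ggplot(flights_nona) +
  geom_col(mapping = aes(x = carrier, y = dep_time, fill = carrier)) +
  labs(
    x = "Carrier",
    y = "Departure Time",
    title = "Departure Times by Carrier",
    # the flights table comes from the Bureau of Transportation Statistics;
    # the FAA Aircraft Registry is the source of the planes table
    caption = "Source: Bureau of Transportation Statistics (nycflights23)"
  )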

What the graph shows

The bar chart compares departure times across airline carriers in 2023, with one color-coded bar per carrier. Because geom_col() stacks a segment for every flight, each bar's height is the sum of that carrier's dep_time values, so the tallest bars mostly identify the carriers with the most flights. Differences that remain after accounting for flight volume could indicate patterns in scheduling, delays, or operational efficiency.
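
To compare typical departure times rather than totals, one option is to summarize to one row per carrier before plotting and to label the bars with airline names. The sketch below is an illustration under stated assumptions, not the chart produced above: it runs on made-up flight rows (toy_flights) and a two-row stand-in for airlines (toy_airlines), and it converts HHMM-coded times to hours after midnight first.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for `flights_nona` and `airlines`
toy_flights <- tibble(
  carrier  = c("AA", "AA", "B6", "B6"),
  dep_time = c(559, 601, 1830, 1900)
)
toy_airlines <- tibble(
  carrier = c("AA", "B6"),
  name    = c("American Airlines Inc.", "JetBlue Airways")
)

# One row per carrier: mean departure time in hours after midnight
avg_dep <- toy_flights |>
  mutate(dep_min = (dep_time %/% 100) * 60 + dep_time %% 100) |>
  group_by(carrier) |>
  summarize(avg_dep_hour = mean(dep_min) / 60, .groups = "drop") |>
  left_join(toy_airlines, by = "carrier")

# With one row per carrier, geom_col() draws the average itself
p <- ggplot(avg_dep, aes(x = reorder(name, avg_dep_hour),
                         y = avg_dep_hour, fill = carrier)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    x = "Carrier",
    y = "Average departure time (hours after midnight)",
    title = "Average Departure Time by Carrier"
  )
```

Swapping toy_flights for flights_nona and toy_airlines for airlines should give one bar per carrier in the real data; print p to render the chart.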