NYC Flights

Author

Thiloni Konara

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights23)
library(RColorBrewer)
data("flights")
data("airlines")

A quick look at the column types and the first six rows

str(flights)
tibble [435,352 × 19] (S3: tbl_df/tbl/data.frame)
 $ year          : int [1:435352] 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
 $ month         : int [1:435352] 1 1 1 1 1 1 1 1 1 1 ...
 $ day           : int [1:435352] 1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time      : int [1:435352] 1 18 31 33 36 503 520 524 537 547 ...
 $ sched_dep_time: int [1:435352] 2038 2300 2344 2140 2048 500 510 530 520 545 ...
 $ dep_delay     : num [1:435352] 203 78 47 173 228 3 10 -6 17 2 ...
 $ arr_time      : int [1:435352] 328 228 500 238 223 808 948 645 926 845 ...
 $ sched_arr_time: int [1:435352] 3 135 426 2352 2252 815 949 710 818 852 ...
 $ arr_delay     : num [1:435352] 205 53 34 166 211 -7 -1 -25 68 -7 ...
 $ carrier       : chr [1:435352] "UA" "DL" "B6" "B6" ...
 $ flight        : int [1:435352] 628 393 371 1053 219 499 996 981 206 225 ...
 $ tailnum       : chr [1:435352] "N25201" "N830DN" "N807JB" "N265JB" ...
 $ origin        : chr [1:435352] "EWR" "JFK" "JFK" "JFK" ...
 $ dest          : chr [1:435352] "SMF" "ATL" "BQN" "CHS" ...
 $ air_time      : num [1:435352] 367 108 190 108 80 154 192 119 258 157 ...
 $ distance      : num [1:435352] 2500 760 1576 636 488 ...
 $ hour          : num [1:435352] 20 23 23 21 20 5 5 5 5 5 ...
 $ minute        : num [1:435352] 38 0 44 40 48 0 10 30 20 45 ...
 $ time_hour     : POSIXct[1:435352], format: "2023-01-01 20:00:00" "2023-01-01 23:00:00" ...
head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2023     1     1        1           2038       203      328              3
2  2023     1     1       18           2300        78      228            135
3  2023     1     1       31           2344        47      500            426
4  2023     1     1       33           2140       173      238           2352
5  2023     1     1       36           2048       228      223           2252
6  2023     1     1      503            500         3      808            815
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Converting months into their names

# Look up each month number (1-12) in a vector of labels
month_labels <- c("Jan","Feb","March","April","May","June","July","Aug","Sep","Oct","Nov","Dec")
flights$month <- month_labels[flights$month]

Changing the month column into a factor with levels in calendar order

flights$month <- factor(flights$month, levels=c("Jan","Feb","March","April","May","June","July","Aug","Sep","Oct","Nov","Dec"))
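Setting the levels explicitly matters because character labels otherwise sort alphabetically, which would scramble the months on a plot axis. A minimal sketch with made-up labels (the vector `m` is purely illustrative and not part of the flights data):

```r
# Made-up month labels in arbitrary order (illustration only)
m <- c("March", "Jan", "Feb", "Jan")

# Plain character vectors sort alphabetically
alphabetical <- sort(unique(m))   # "Feb" "Jan" "March"

# A factor with explicit levels sorts (and plots) in level order
f <- factor(m, levels = c("Jan", "Feb", "March"))
calendar <- as.character(sort(f)) # "Jan" "Jan" "Feb" "March"
```

ggplot2 uses the same level order for a discrete axis, so the factor conversion above is what should keep the bars in calendar order.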

Keeping main airlines and adding their full names

majors <- c("UA","DA","AA","B6","WN","US")

flights_named <- flights |>
  filter(carrier %in% majors) |>
  left_join(airlines, by = "carrier")
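Since `%in%` silently ignores codes that match nothing, it may be worth checking that every code in `majors` exists in the lookup table. A minimal sketch, using a small hypothetical stand-in table (`airlines_demo` and `majors_demo` are made up for illustration and are not the real `airlines` data):

```r
# Hypothetical stand-in for the airlines lookup table (not the real data)
airlines_demo <- data.frame(
  carrier = c("AA", "B6", "DL", "UA", "WN"),
  name    = c("American", "JetBlue", "Delta", "United", "Southwest")
)

majors_demo <- c("UA", "DA", "AA", "B6", "WN", "US")

# Codes requested in the filter that have no match in the lookup table
unmatched <- setdiff(majors_demo, airlines_demo$carrier)
unmatched   # "DA" "US" for this toy table
```

Running `setdiff(majors, airlines$carrier)` against the real tables would show whether any requested carrier is being dropped by the filter.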

Grouping the data set by airline and month, and arranging the counts in descending order

flights_months <- flights_named |>
  group_by(name,month) |>
  summarize(total = n()) |>
  arrange(desc(total))
`summarise()` has grouped output by 'name'. You can override using the
`.groups` argument.
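The message above appears because `summarize()` drops only the last grouping level, so the result is still grouped by `name`. If an ungrouped result is preferred, one option is to pass `.groups = "drop"`. A minimal sketch on a made-up table (`toy` is hypothetical, not the flights data):

```r
library(dplyr)

# Made-up airline/month rows (illustration only)
toy <- tibble(
  name  = c("A", "A", "A", "B", "B"),
  month = c("Jan", "Jan", "Feb", "Jan", "Feb")
)

by_month <- toy |>
  group_by(name, month) |>
  summarize(total = n(), .groups = "drop") |>  # ungrouped result, no message
  arrange(desc(total))
```

Keeping the grouping, as the original code does, is also fine here; the later `filter()` step behaves the same either way.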

Getting the top 3 airlines

top3_airlines <- flights_named |>
  group_by(name) |>
  summarize(total = n()) |>
  slice_max(order_by = total, n=3)
top3_airlines
# A tibble: 3 × 2
  name                   total
  <chr>                  <int>
1 United Air Lines Inc.  79641
2 JetBlue Airways        66169
3 American Airlines Inc. 40525

Keeping only the monthly counts for the top 3 airlines

flights_months_top3 <- flights_months |>
  filter(name %in% top3_airlines$name)
flights_months_top3
# A tibble: 36 × 3
# Groups:   name [3]
   name                  month total
   <chr>                 <fct> <int>
 1 United Air Lines Inc. March  7243
 2 United Air Lines Inc. May    6976
 3 United Air Lines Inc. Oct    6888
 4 United Air Lines Inc. April  6803
 5 United Air Lines Inc. July   6796
 6 United Air Lines Inc. Jan    6780
 7 JetBlue Airways       March  6595
 8 United Air Lines Inc. June   6576
 9 United Air Lines Inc. Aug    6548
10 United Air Lines Inc. Sep    6401
# ℹ 26 more rows

Plot

ggplot(flights_months_top3, aes(x = month, y = total, fill = name)) +
  geom_col(position = position_dodge(width = 0.9), width = 0.4) +
  labs(title = "Monthly Flight Volume for Top 3 Airlines (NYC, 2023)",
       x = "Month", y = "Number of Flights", fill = "Airline",
       caption = "Source: FAA Aircraft registry") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")
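Because the bars are colored through the `fill` aesthetic, the legend title is controlled by `labs(fill = ...)`; a `color` label has no effect on a fill legend. A minimal sketch with made-up counts (the `toy` data frame and airline names are hypothetical):

```r
library(ggplot2)

# Made-up counts for two hypothetical airlines (illustration only)
toy <- data.frame(
  month = factor(c("Jan", "Jan", "Feb", "Feb"), levels = c("Jan", "Feb")),
  name  = c("Airline A", "Airline B", "Airline A", "Airline B"),
  total = c(10, 7, 12, 9)
)

p <- ggplot(toy, aes(x = month, y = total, fill = name)) +
  geom_col(position = position_dodge(width = 0.9), width = 0.4) +
  labs(fill = "Airline")  # titles the fill legend
```

Calling `print(p)` draws the chart; without the `fill` label the legend would fall back to the column name `name`.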

Essay

This bar graph shows the monthly flight volume of the three busiest airlines, among the major carriers selected above, departing from New York City in 2023: United Air Lines, JetBlue Airways, and American Airlines. The x-axis represents the months from January to December, and the y-axis shows the number of flights. Each airline is represented by a different color to make comparisons easier. United Air Lines had the most flights over the year (79,641), followed by JetBlue Airways (66,169) and American Airlines (40,525). March was the busiest month for both United (7,243 flights) and JetBlue (6,595), and United's counts stayed high through the spring before dipping slightly in August and September. I chose this visualization because it clearly shows how airline activity changes over time and makes it easy to compare the top airlines in one view. One interesting thing I noticed is that United Air Lines appears to lead in every month, showing how dominant it was among these carriers in NYC’s flight traffic during 2023.