NYC_Flights

Author

O. Nseyo

Load in Required Libraries

Load the tidyverse, which provides the dplyr, tidyr, and ggplot2 functions used below.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Now we install the nycflights23 package (if it is not already installed) and load its flights data set into the global environment.

#install.packages("nycflights23")
library(nycflights23)
data(flights)

This data set, compiled by Chester Ismay, contains information about all flights that departed from the three main New York City airports in 2023. For more information on the data set, see: https://CRAN.R-project.org/package=nycflights23

Next, let us take a look at the first 6 rows of this data set to see what we are working with.

head(flights)

# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2023     1     1        1           2038       203      328              3
2  2023     1     1       18           2300        78      228            135
3  2023     1     1       31           2344        47      500            426
4  2023     1     1       33           2140       173      238           2352
5  2023     1     1       36           2048       228      223           2252
6  2023     1     1      503            500         3      808            815
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Looking at the global environment we know we have 435,352 observations and 19 variables

We use str() (“structure”) to inspect the column types and get to know the data before working on it.

str(flights)

tibble [435,352 × 19] (S3: tbl_df/tbl/data.frame)
 $ year          : int [1:435352] 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
 $ month         : int [1:435352] 1 1 1 1 1 1 1 1 1 1 ...
 $ day           : int [1:435352] 1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time      : int [1:435352] 1 18 31 33 36 503 520 524 537 547 ...
 $ sched_dep_time: int [1:435352] 2038 2300 2344 2140 2048 500 510 530 520 545 ...
 $ dep_delay     : num [1:435352] 203 78 47 173 228 3 10 -6 17 2 ...
 $ arr_time      : int [1:435352] 328 228 500 238 223 808 948 645 926 845 ...
 $ sched_arr_time: int [1:435352] 3 135 426 2352 2252 815 949 710 818 852 ...
 $ arr_delay     : num [1:435352] 205 53 34 166 211 -7 -1 -25 68 -7 ...
 $ carrier       : chr [1:435352] "UA" "DL" "B6" "B6" ...
 $ flight        : int [1:435352] 628 393 371 1053 219 499 996 981 206 225 ...
 $ tailnum       : chr [1:435352] "N25201" "N830DN" "N807JB" "N265JB" ...
 $ origin        : chr [1:435352] "EWR" "JFK" "JFK" "JFK" ...
 $ dest          : chr [1:435352] "SMF" "ATL" "BQN" "CHS" ...
 $ air_time      : num [1:435352] 367 108 190 108 80 154 192 119 258 157 ...
 $ distance      : num [1:435352] 2500 760 1576 636 488 ...
 $ hour          : num [1:435352] 20 23 23 21 20 5 5 5 5 5 ...
 $ minute        : num [1:435352] 38 0 44 40 48 0 10 30 20 45 ...
 $ time_hour     : POSIXct[1:435352], format: "2023-01-01 20:00:00" "2023-01-01 23:00:00" ...

Before displaying the summary, I will change the month values from numbers to names.

This should make it easier to spot a trend in a specific month.

df1_flights <- flights |>
  mutate(month = month.name[month])
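One thing to keep in mind is that month names stored as plain text sort alphabetically, not by calendar. The sketch below uses a small made-up tibble (`toy`, not the real flights data) to show the indexing trick with the built-in `month.name` vector and how a factor keeps calendar order.

```r
library(dplyr)

# Toy data standing in for `flights`; only the month column matters here
toy <- tibble(month = c(4L, 1L, 12L))

toy_named <- toy |>
  mutate(
    month_chr = month.name[month],                      # plain character names
    month_fct = factor(month_chr, levels = month.name)  # levels in calendar order
  )

sort(toy_named$month_chr)                # alphabetical order
as.character(sort(toy_named$month_fct))  # calendar order
```

This is why the monthly plot later in the report converts month to a factor before plotting.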

Next, I replace the carrier abbreviations with full airline names. I build a small lookup data frame, join it to the flights by carrier, and use mutate() to overwrite the carrier column with the full name. The select(-airline_name) step then removes the helper column that the join added. Any carrier code that is not in the lookup table ends up as NA after the join.

airline_name <- data.frame(
  carrier = c("UA", "DL", "B6", "AA", "WN", "AS", "F9", "NK", "HA"),
  airline_name = c("United Airlines", "Delta Air Lines", "JetBlue", "American Airlines",
                   "Southwest", "Alaska Airlines", "Frontier", "Spirit", "Hawaiian Airlines")
)


df2_flights <- df1_flights |>
  left_join(airline_name, by = "carrier") |>
  mutate(carrier = airline_name) |>
  select(-airline_name)
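A left join keeps every flight, but a carrier code with no match in the lookup table gets NA as its airline name. As a minimal sketch of one alternative (not what this report does), `coalesce()` can fall back to the original code when no full name is found. The tibbles `toy_flights` and `toy_lookup` are made up for illustration, and "YX" is left out of the lookup on purpose.

```r
library(dplyr)

# Toy data: "YX" is deliberately missing from the lookup table
toy_flights <- tibble(carrier = c("UA", "YX", "DL"))
toy_lookup <- tibble(
  carrier = c("UA", "DL"),
  airline_name = c("United Airlines", "Delta Air Lines")
)

toy_joined <- toy_flights |>
  left_join(toy_lookup, by = "carrier") |>
  # Fall back to the original code when no full name was found
  mutate(carrier = coalesce(airline_name, carrier)) |>
  select(-airline_name)

toy_joined$carrier
```

With this variant no carrier becomes NA, so unmatched airlines would stay in later summaries under their abbreviation.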

We want to look at the flights with the most delays, so using the select function I’ll keep only the variables that I want to look at.

df3_flights <- df2_flights |>
  select(month, carrier, origin, dest, dep_delay, arr_delay, flight)

Here I have mapped the airport of origin from its abbreviation to its full name.

df3_flights$origin[df3_flights$origin == "LGA"] <- "LaGuardia"
df3_flights$origin[df3_flights$origin == "EWR"] <- "Newark Liberty International"
df3_flights$origin[df3_flights$origin == "JFK"] <- "John F. Kennedy International"
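The same mapping can also be written as a single lookup with a named character vector, which scales better if more airports are added. This is only a sketch on a made-up vector (`toy_origin`); the name `airport_names` is introduced here for illustration.

```r
# Named character vector used as a lookup table (code -> full name)
airport_names <- c(
  LGA = "LaGuardia",
  EWR = "Newark Liberty International",
  JFK = "John F. Kennedy International"
)

# Toy origin codes standing in for df3_flights$origin
toy_origin <- c("EWR", "JFK", "LGA", "JFK")

# Index the lookup by code, then drop the names the indexing carries along
full_origin <- unname(airport_names[toy_origin])
full_origin
```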

Let’s take a look at the trimmed data set.

head(df3_flights)

# A tibble: 6 × 7
  month   carrier           origin              dest  dep_delay arr_delay flight
  <chr>   <chr>             <chr>               <chr>     <dbl>     <dbl>  <int>
1 January United Airlines   Newark Liberty Int… SMF         203       205    628
2 January Delta Air Lines   John F. Kennedy In… ATL          78        53    393
3 January JetBlue           John F. Kennedy In… BQN          47        34    371
4 January JetBlue           John F. Kennedy In… CHS         173       166   1053
5 January United Airlines   Newark Liberty Int… DTW         228       211    219
6 January American Airlines Newark Liberty Int… MIA           3        -7    499

Now let’s look at the summary statistics.

summary(df3_flights)

    month             carrier             origin              dest          
 Length:435352      Length:435352      Length:435352      Length:435352     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
   dep_delay         arr_delay            flight      
 Min.   : -50.00   Min.   : -97.000   Min.   :   1.0  
 1st Qu.:  -6.00   1st Qu.: -22.000   1st Qu.: 364.0  
 Median :  -2.00   Median : -10.000   Median : 734.0  
 Mean   :  13.84   Mean   :   4.345   Mean   : 785.2  
 3rd Qu.:  10.00   3rd Qu.:   9.000   3rd Qu.:1188.0  
 Max.   :1813.00   Max.   :1812.000   Max.   :1972.0  
 NA's   :10738     NA's   :12534

The summary shows missing values in both delay columns (10,738 in dep_delay and 12,534 in arr_delay), so we remove those rows.

df3_flights <- df3_flights |>
  filter(!is.na(dep_delay) & !is.na(arr_delay))
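As a small illustration of what this step does, the sketch below applies the same filter to a made-up tibble (`toy_delays`) and compares it with `tidyr::drop_na()`, which should give the same result when it is given the same two columns.

```r
library(dplyr)
library(tidyr)

# Toy data with one missing departure delay and one missing arrival delay
toy_delays <- tibble(
  dep_delay = c(5, NA, -3, 12),
  arr_delay = c(7, 2, NA, 10)
)

# Same logic as in the report: keep rows where both delays are present
by_filter <- toy_delays |>
  filter(!is.na(dep_delay) & !is.na(arr_delay))

# Alternative: drop rows with NA in either of the named columns
by_drop_na <- toy_delays |>
  drop_na(dep_delay, arr_delay)

nrow(toy_delays) - nrow(by_filter)  # number of rows removed
```

Comparing row counts before and after is a quick way to see how much data the cleaning step discards.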

We’ll check to make sure that there are no NA’s left

sum(is.na(df3_flights$dep_delay))

[1] 0

sum(is.na(df3_flights$arr_delay))

[1] 0

Now we build a facet plot to compare delays.

Using a facet plot, we want to see the average departure delay and the average arrival delay for each airline side by side.

df_facetPl <- df3_flights |>
  # Drop flights whose carrier had no full name in the lookup table
  filter(!is.na(carrier)) |>
  group_by(carrier) |>
  summarise(
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
    avg_arr_delay = mean(arr_delay, na.rm = TRUE)
  ) |>
  # One row per carrier and delay type, as needed for faceting
  pivot_longer(cols = c(avg_dep_delay, avg_arr_delay),
               names_to = "Delay_Type",
               values_to = "Average_Delay")

ggplot(df_facetPl, aes(x = reorder(carrier, Average_Delay), y = Average_Delay, fill = carrier)) +
  geom_col() +
  facet_wrap(~ Delay_Type, scales = "free") +  
  labs(title = "Comparison of Average Departure & Arrival Delays by Airline for 2023",
       x = "Airline Carrier",
       y = "Average Delay (mins)") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

This chart shows each airline’s average delay in minutes, with average arrival delay and average departure delay in separate panels.
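To make the reshaping behind the chart concrete, here is a minimal sketch on made-up data (`toy`, with two fictional carriers "A" and "B"). It shows how summarise() followed by pivot_longer() turns one row per carrier into one row per carrier and delay type, which is the shape facet_wrap() needs.

```r
library(dplyr)
library(tidyr)

# Toy flights for two fictional carriers
toy <- tibble(
  carrier   = c("A", "A", "B", "B"),
  dep_delay = c(10, 20, 0, 4),
  arr_delay = c(6, 10, -2, 2)
)

toy_long <- toy |>
  group_by(carrier) |>
  summarise(
    avg_dep_delay = mean(dep_delay),
    avg_arr_delay = mean(arr_delay)
  ) |>
  # Wide (one column per delay type) to long (one row per delay type)
  pivot_longer(
    cols = c(avg_dep_delay, avg_arr_delay),
    names_to = "Delay_Type",
    values_to = "Average_Delay"
  )

toy_long
```

Two carriers and two delay types give four rows, and the Delay_Type column is what the facets split on.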

Now I add the month variable to look at trends over the year.

df_facet_month <- df3_flights |>
  filter(!is.na(carrier)) |>
  mutate(month = factor(month, levels = c("January", "February", "March", "April", 
                                          "May", "June", "July", "August", 
                                          "September", "October", "November", "December"))) |>
  group_by(carrier, month) |>
  summarise(
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
    avg_arr_delay = mean(arr_delay, na.rm = TRUE)
  ) |>
  pivot_longer(cols = c(avg_dep_delay, avg_arr_delay), 
               names_to = "Delay_Type", 
               values_to = "Average_Delay")

`summarise()` has grouped output by 'carrier'. You can override using the
`.groups` argument.
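The message above is informational: after summarising by two grouping variables, dplyr drops only the last one (month) and keeps the result grouped by carrier. A minimal sketch on made-up data (`toy`) shows the difference between that default and an explicit `.groups = "drop"`, either of which silences the message.

```r
library(dplyr)

# Toy data with two grouping variables
toy <- tibble(
  carrier   = c("A", "A", "B"),
  month     = c("January", "February", "January"),
  dep_delay = c(10, 20, 4)
)

# Same behaviour as the default, stated explicitly: still grouped by carrier
still_grouped <- toy |>
  group_by(carrier, month) |>
  summarise(avg_dep_delay = mean(dep_delay), .groups = "drop_last")

# Fully ungrouped result
ungrouped <- toy |>
  group_by(carrier, month) |>
  summarise(avg_dep_delay = mean(dep_delay), .groups = "drop")

group_vars(still_grouped)
group_vars(ungrouped)
```

For plotting, either form works; dropping all groups mainly avoids surprises if more dplyr steps are added afterwards.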

ggplot(df_facet_month, aes(x = month, y = Average_Delay, fill = carrier)) +
  geom_col(position = "dodge") +
  facet_wrap(~ Delay_Type, scales = "free") + 
  labs(title = "Comparison of Departure & Arrival Delays by Month",
       x = "Month",
       y = "Average Delay (mins)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

This visual does not give much insight into individual airlines, but it does suggest a possible trend: the largest arrival and departure delays occur around June and July, possibly because more people fly for vacation during this period.
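One rough way to probe the vacation idea would be to count flights per month and see whether the busiest months line up with the largest delays. The data set records flights rather than passengers, so this is only a proxy. The sketch uses a made-up tibble (`toy`), and the column name `n_flights` is chosen here for illustration.

```r
library(dplyr)

# Toy data: one row per flight, with the month as text
toy <- tibble(month = c("June", "July", "July", "January", "June", "July"))

flights_per_month <- toy |>
  # Factor levels keep the months in calendar order
  mutate(month = factor(month, levels = month.name)) |>
  count(month, name = "n_flights")

flights_per_month
```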

The NYC flight dataset for the year 2023

My first step was to think about what questions the data could answer; then I started arranging the data.

After loading the dataset, I could see that it covered flights from January through December 2023, so I knew I would be able to take a detailed look at various parts of the data.

I chose to find out whether I could identify a trend in departure or arrival delays by breaking them down by different variables. Across the airline carriers, could I determine whether specific planes tend to arrive early or be delayed? In questions like this, a hidden variable, such as which pilot flew that day, could be at play. Depending on the plane type, the engine installed and the weight of passengers relative to fuel weight could also be important factors.
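The question about specific planes could be approached with the tailnum column of the original flights data (it was not kept in df3_flights). The sketch below runs on made-up data (`toy`); the threshold `min_flights` is a hypothetical choice, used so that planes with very few flights do not dominate the ranking.

```r
library(dplyr)

# Toy data: arrival delays for three fictional tail numbers
toy <- tibble(
  tailnum   = c("N1", "N1", "N1", "N2", "N2", "N3"),
  arr_delay = c(-5, -10, -3, 30, 50, 100)
)

min_flights <- 2  # hypothetical minimum number of flights per plane

by_plane <- toy |>
  group_by(tailnum) |>
  summarise(
    n_flights     = n(),
    avg_arr_delay = mean(arr_delay),
    .groups = "drop"
  ) |>
  filter(n_flights >= min_flights) |>
  arrange(avg_arr_delay)  # planes that tend to arrive earliest come first

by_plane
```

A low or high average here would only describe a pattern; it would not separate the plane itself from the hidden factors mentioned above.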

For example, most commercial planes could perform twice as well, or fly faster, so to speak, if their pilots used the engines beyond regulated guidelines. However, because flight is a delicate process, planes keep a wide margin of extra power in reserve for emergencies, and so they cruise using perhaps 50 to 60 percent of the engines’ capabilities.