2025-09-28

Dataset Overview

  • For this project, we will be exploring the Flight Delay Dataset — 2024

  • Source: https://www.kaggle.com/datasets/hrishitpatil/flight-data-2024.

  • This dataset includes a variety of variables:

    • Origin City/Airport Code

    • Destination City/Airport Code

    • Date and Time (including day of the week)

    • Length of Delay (in minutes)

    • Type of Delay

Key Questions to Explore

  • How do delays vary across different origin airports?

  • Which destinations experience the highest frequency or length of delays?

  • Do delays follow a pattern based on the month?

  • Which delay causes (weather, mechanical, security) contribute most to total delay time?

Importing the Data

Because this dataset has over 7 million rows, I chose to use the library data.table as it is built for larger data sets. I will be using tidyverse, plotly, and ggplot2 throughout this project.

According to the Keggle post, this dataset has been cleaned already.

library(data.table)
library(tidyverse)
library(plotly)
library(ggplot2)
library(knitr)
library(dplyr)
library(kableExtra)
library(tidyr)
library(scales)
setwd("C:/Users/bmoor/Downloads/archive (1)")
flight_data <- fread("flight_data_2024.csv")
data_dictionary <-fread("flight_data_2024_data_dictionary.csv")

Data Dictionary Summary

kable(data_dictionary[1:11,]) %>%
  kable_styling(font_size = 10)
column dtype null_pct example_value
year Int64 0 2024
month Int64 0 1
day_of_month Int64 0 1
day_of_week Int64 0 1
fl_date datetime64[ns] 0 2024-01-01 00:00:00
op_unique_carrier object 0 9E
op_carrier_fl_num float64 0 4814.0
origin object 0 JFK
origin_city_name object 0 New York, NY
origin_state_nm object 0 New York
dest object 0 DTW

Data Dictionary Summary Continued

kable(data_dictionary[11:21,]) %>%
  kable_styling(font_size = 10)
column dtype null_pct example_value
dest object 0.00 DTW
dest_city_name object 0.00 Detroit, MI
dest_state_nm object 0.00 Michigan
crs_dep_time Int64 0.00 1252
dep_time float64 1.31 1247.0
dep_delay float64 1.31 -5.0
taxi_out float64 1.35 31.0
wheels_off float64 1.35 1318.0
wheels_on float64 1.38 1442.0
taxi_in float64 1.38 7.0
crs_arr_time Int64 0.00 1508

Data Dictionary Summary Continued

kable(data_dictionary[22:32,]) %>%
  kable_styling(font_size = 10)
column dtype null_pct example_value
arr_time float64 1.38 1449.0
arr_delay float64 1.61 -19.0
cancelled int64 0.00 0
cancellation_code object 98.64 B
diverted int64 0.00 0
crs_elapsed_time float64 0.00 136.0
actual_elapsed_time float64 1.61 122.0
air_time float64 1.61 84.0
distance float64 0.00 509.0
carrier_delay int64 0.00 0
weather_delay int64 0.00 0

Data Dictionary Summary Continued

kable(data_dictionary[33:35,]) %>%
  kable_styling(font_size = 10)
column dtype null_pct example_value
nas_delay int64 0 0
security_delay int64 0 0
late_aircraft_delay int64 0 0

Flight Info Summary

Origin City Summary

summary(flight_data[,c("origin","origin_city_name")])
##     origin          origin_city_name  
##  Length:7079081     Length:7079081    
##  Class :character   Class :character  
##  Mode  :character   Mode  :character

Destination City Summary

summary(flight_data[,c("dest","dest_city_name")])
##      dest           dest_city_name    
##  Length:7079081     Length:7079081    
##  Class :character   Class :character  
##  Mode  :character   Mode  :character

Delay Measures Summary

summary(flight_data[,c("dep_delay","arr_delay")])
##    dep_delay         arr_delay       
##  Min.   : -96.00   Min.   :-126.000  
##  1st Qu.:  -6.00   1st Qu.: -15.000  
##  Median :  -2.00   Median :  -6.000  
##  Mean   :  12.68   Mean   :   7.098  
##  3rd Qu.:   9.00   3rd Qu.:   9.000  
##  Max.   :3777.00   Max.   :3803.000  
##  NA's   :92970     NA's   :113814

Delay Causes Summary

summary(flight_data[,c("carrier_delay","weather_delay","nas_delay",
                       "security_delay", "late_aircraft_delay")])
##  carrier_delay     weather_delay         nas_delay        security_delay     
##  Min.   :   0.00   Min.   :   0.0000   Min.   :   0.000   Min.   :0.000e+00  
##  1st Qu.:   0.00   1st Qu.:   0.0000   1st Qu.:   0.000   1st Qu.:0.000e+00  
##  Median :   0.00   Median :   0.0000   Median :   0.000   Median :0.000e+00  
##  Mean   :   5.06   Mean   :   0.8752   Mean   :   2.772   Mean   :2.542e-02  
##  3rd Qu.:   0.00   3rd Qu.:   0.0000   3rd Qu.:   0.000   3rd Qu.:0.000e+00  
##  Max.   :3689.00   Max.   :1804.0000   Max.   :2700.000   Max.   :1.164e+03  
##  late_aircraft_delay
##  Min.   :   0.000   
##  1st Qu.:   0.000   
##  Median :   0.000   
##  Mean   :   5.929   
##  3rd Qu.:   0.000   
##  Max.   :2690.000

Checking NA Values: Rows

flight_data %>% summarise(across(everything(), ~sum(is.na(.))))
##   year month day_of_month day_of_week fl_date op_unique_carrier
## 1    0     0            0           0       0                 0
##   op_carrier_fl_num origin origin_city_name origin_state_nm dest dest_city_name
## 1                 1      0                0               0    0              0
##   dest_state_nm crs_dep_time dep_time dep_delay taxi_out wheels_off wheels_on
## 1             0            0    92659     92970    95734      95734     97856
##   taxi_in crs_arr_time arr_time arr_delay cancelled cancellation_code diverted
## 1   97856            0    97854    113814         0                 0        0
##   crs_elapsed_time actual_elapsed_time air_time distance carrier_delay
## 1                1              113814   113814        0             0
##   weather_delay nas_delay security_delay late_aircraft_delay
## 1             0         0              0                   0

Checking NA Values: Columns

flight_data %>% filter(if_any(everything(), is.na))
##          year month day_of_month day_of_week    fl_date op_unique_carrier
##         <int> <int>        <int>       <int>     <IDat>            <char>
##      1:  2024     1            1           1 2024-01-01                AS
##      2:  2024     1            1           1 2024-01-01                AS
##      3:  2024     1            1           1 2024-01-01                AS
##      4:  2024     1            1           1 2024-01-01                AS
##      5:  2024     1            1           1 2024-01-01                DL
##     ---                                                                  
## 113811:  2024    12           31           2 2024-12-31                WN
## 113812:  2024    12           31           2 2024-12-31                WN
## 113813:  2024    12           31           2 2024-12-31                WN
## 113814:  2024    12           31           2 2024-12-31                WN
## 113815:  2024    12           31           2 2024-12-31                YX
##         op_carrier_fl_num origin             origin_city_name origin_state_nm
##                     <num> <char>                       <char>          <char>
##      1:              1023    PSC Pasco/Kennewick/Richland, WA      Washington
##      2:                30    SFO            San Francisco, CA      California
##      3:                41    JFK                 New York, NY        New York
##      4:               900    SEA                  Seattle, WA      Washington
##      5:              2348    SLC           Salt Lake City, UT            Utah
##     ---                                                                      
## 113811:              4316    BWI                Baltimore, MD        Maryland
## 113812:              4899    MDW                  Chicago, IL        Illinois
## 113813:               804    IND             Indianapolis, IN         Indiana
## 113814:               827    DEN                   Denver, CO        Colorado
## 113815:              4642    CMH                 Columbus, OH            Ohio
##           dest    dest_city_name dest_state_nm crs_dep_time dep_time dep_delay
##         <char>            <char>        <char>        <int>    <num>     <num>
##      1:    SEA       Seattle, WA    Washington          700       NA        NA
##      2:    JFK      New York, NY      New York         1038       NA        NA
##      3:    SFO San Francisco, CA    California         2020       NA        NA
##      4:    RDM  Bend/Redmond, OR        Oregon         2040       NA        NA
##      5:    SFO San Francisco, CA    California         2130     2133         3
##     ---                                                                       
## 113811:    SAN     San Diego, CA    California         1850     1930        40
## 113812:    CMH      Columbus, OH          Ohio         1450     1517        27
## 113813:    RSW    Fort Myers, FL       Florida          750      744        -6
## 113814:    BUR       Burbank, CA    California          840      839        -1
## 113815:    BOS        Boston, MA Massachusetts          700      655        -5
##         taxi_out wheels_off wheels_on taxi_in crs_arr_time arr_time arr_delay
##            <num>      <num>     <num>   <num>        <int>    <num>     <num>
##      1:       NA         NA        NA      NA          813       NA        NA
##      2:       NA         NA        NA      NA         1913       NA        NA
##      3:       NA         NA        NA      NA         2359       NA        NA
##      4:       NA         NA        NA      NA         2149       NA        NA
##      5:       NA         NA        NA      NA         2251       NA        NA
##     ---                                                                      
## 113811:       20       1950        NA      NA         2135       NA        NA
## 113812:       28       1545      2039       5         1700     2044        NA
## 113813:       10        754      1230       3         1025     1233        NA
## 113814:       24        903      1216       9         1010     1225        NA
## 113815:       15        710      1301       7          902     1308        NA
##         cancelled cancellation_code diverted crs_elapsed_time
##             <int>            <char>    <int>            <num>
##      1:         1                 B        0               73
##      2:         1                 A        0              335
##      3:         1                 A        0              399
##      4:         1                 B        0               69
##      5:         1                 A        0              141
##     ---                                                      
## 113811:         0                          1              345
## 113812:         0                          1               70
## 113813:         0                          1              155
## 113814:         0                          1              150
## 113815:         0                          1              122
##         actual_elapsed_time air_time distance carrier_delay weather_delay
##                       <num>    <num>    <num>         <int>         <int>
##      1:                  NA       NA      172             0             0
##      2:                  NA       NA     2586             0             0
##      3:                  NA       NA     2586             0             0
##      4:                  NA       NA      228             0             0
##      5:                  NA       NA      599             0             0
##     ---                                                                  
## 113811:                  NA       NA     2295             0             0
## 113812:                  NA       NA      283             0             0
## 113813:                  NA       NA      945             0             0
## 113814:                  NA       NA      850             0             0
## 113815:                  NA       NA      640             0             0
##         nas_delay security_delay late_aircraft_delay
##             <int>          <int>               <int>
##      1:         0              0                   0
##      2:         0              0                   0
##      3:         0              0                   0
##      4:         0              0                   0
##      5:         0              0                   0
##     ---                                             
## 113811:         0              0                   0
## 113812:         0              0                   0
## 113813:         0              0                   0
## 113814:         0              0                   0
## 113815:         0              0                   0

Data Wrangling, Munging, and Cleaning Conclusions:

Because this data set was cleaned prior to my downloading the files, there was minimal manipulation needed to analyze this dataset. I ran a few summaries and checked for rows and columns with missing fields. The rows did not return much information; however, the columns returned a ton of information. Which makes sense given that not every instance had a delay for all categories.

Comparing Origin City vs Departure Delays (Top 10)

Comparing Origin City vs Departure Delays (Bottom 10)

Comparing Origin City vs Arrival Delays (Top 10)

Comparing Origin City vs Arrival Delays (Bottom 10)

How do delays vary across different origin airports?

I was surprised to see the big airport hubs not be listed in either the top 10 or the bottom 10 for either arrival delays or departure delays. I expected these airports to be the ones that filtered to the top. Most the airports seems to be smaller regional airports. Many are in locations where large amounts of snowfall occur.

Destination with the longest length of Arrival Delay

Destination with the longest length of Departure Delay

Which destinations experience the highest frequency or length of delays?

In comparing the longest delay a passenger could see for both arrival and departure delays for destination cities. I was not surprised that the major US airports were listed in both instances. This makes sense due to the amount of traffic that flows through these hubs.

Comparison of Arrival Delays by Month

Delays by Month

I was surprised to see that July experienced the highest average of delays. I would have expected winter months or the holidays to be the times when we would experience the most delays. That said, passengers were, on average, delayed less than 30 minutes. I was not surprised to see the arrival and departure delays mirror one another. However, consistently, departure delays were greater than arrival delays.

Cause of Delay by Month

Which delay causes (weather, mechanical, security) contribute most to total delay time?

Throughout this dataset the a late aircraft delay was the most frequent cause for a delay. It was followed closely by a carrier delay. I was surprised to see that weather delays were one of the least common causes for a flight delay. Whatever happened in July continues in this analysis as well.

Average Delays Per Airline

Average Delay per Airline Continued

American Airlines had the highest average of delay minutes for its departures. They also had the highest average for arrival delays. The question now is, is this an airport issue or an airline policy issue? This data set does not include any information regarding the policies, however, if I worked for American Airlines and looked at this data set I would question is this a process issue that we have control over or is this a series of events at an airport outside of our control.

Conclusion

While there are definitely long delays the average delay for the year 2024, after reviewing over 7 million lines of data was less than 30 minutes. There are some airlines that may want to look at their internal policies to address any areas in which they can make their processes more efficient. Also the perception of flight delays around the holidays is not what the data suggests.

Works Cited