Introduction

How well do weather-related delays and flight volume predict the average delay time for flights?

Airline delays are common and a source of frustration for travelers, and they create challenges for airports and airlines. Understanding which factors are most strongly associated with longer delays can help airlines allocate their resources more efficiently and improve customers' travel experience.

I used the airline_delay dataset from OpenIntro, which summarizes delay statistics for U.S. airports in December 2019 and December 2020. Each row represents a combination of month, airline, and airport, and records the total number of arriving flights, the number of delayed flights, and the total delay time broken down by cause, such as traffic control, weather, and late aircraft. For my research question, I focus on three variables: flight volume (total arriving flights), weather delay in minutes, and average delay time, defined as total delayed minutes divided by the number of delayed flights. Because average delay time is a continuous outcome, the data are well suited to a multiple regression model with flight volume and weather delay as predictors.
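As a quick illustration of this definition, the sketch below computes the average delay time for two rows of the dataset (airports ABE and ATL for carrier 9E, December 2020), with the values copied from the head() output shown later in this report. The data frame name example_rows is an assumption made for this illustration and is not part of the original analysis.

```r
# Two rows copied from the head() output of airline_delay
# (example_rows is a hypothetical name used only for this illustration)
example_rows <- data.frame(
  airport   = c("ABE", "ATL"),
  arr_del15 = c(3, 445),      # number of delayed flights
  arr_delay = c(89, 30756)    # total arrival delay in minutes
)

# Average delay time = total delayed minutes / number of delayed flights
example_rows$avg_delay_time <- example_rows$arr_delay / example_rows$arr_del15

print(example_rows)
```

For these two rows the definition gives roughly 29.7 minutes per delayed flight at ABE and about 69.1 minutes at ATL.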

Data Analysis

Before modeling, I performed basic exploratory data analysis (EDA) using functions such as head(), str(), and summary(). I also used colSums(is.na()) to check for missing values and unique() to review the distinct categories in certain columns.

I also used several dplyr functions to clean and organize the dataset. Rows with missing values in my key variables were removed with filter() so that the analysis was based on complete observations. The same filter() call removed rows where no flights were delayed, since dividing by zero delayed flights would leave the average delay time undefined. I then used mutate() to create a new variable, avg_delay_time, calculated as the total arrival delay minutes divided by the number of flights delayed more than 15 minutes. After cleaning, I used select() to keep only the variables needed for my analysis, and applied group_by() and summarise() to compute summaries such as the average delay time by airline carrier. This process left the dataset clean, complete, and ready for building the multiple regression model.
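The reason for dropping rows with no delayed flights can be seen in a small sketch. The toy data frame below is invented for illustration (it is not taken from the dataset), and base R subsetting stands in for the dplyr filter() call so that the example has no package dependencies.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  arr_del15 = c(4, 0, NA),    # number of delayed flights
  arr_delay = c(120, 0, NA)   # total delay minutes
)

# Without filtering, 0 / 0 gives NaN and NA / NA gives NA
toy$arr_delay / toy$arr_del15

# Keep only complete rows with at least one delayed flight
keep <- !is.na(toy$arr_del15) & !is.na(toy$arr_delay) & toy$arr_del15 > 0
toy_clean <- toy[keep, ]

# The average is now defined for every remaining row
toy_clean$avg_delay_time <- toy_clean$arr_delay / toy_clean$arr_del15
toy_clean
```

Only the first toy row survives the filter, and its average delay time is well defined.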

# Importing the dataset

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
airline_delay <- read.csv("airline_delay.csv")
# Checking the Head, Structure, and Summary of the dataset

head(airline_delay)
##   year month carrier      carrier_name airport
## 1 2020    12      9E Endeavor Air Inc.     ABE
## 2 2020    12      9E Endeavor Air Inc.     ABY
## 3 2020    12      9E Endeavor Air Inc.     AEX
## 4 2020    12      9E Endeavor Air Inc.     AGS
## 5 2020    12      9E Endeavor Air Inc.     ALB
## 6 2020    12      9E Endeavor Air Inc.     ATL
##                                                  airport_name arr_flights
## 1 Allentown/Bethlehem/Easton, PA: Lehigh Valley International          44
## 2                      Albany, GA: Southwest Georgia Regional          90
## 3                    Alexandria, LA: Alexandria International          88
## 4                 Augusta, GA: Augusta Regional at Bush Field         184
## 5                            Albany, NY: Albany International          76
## 6       Atlanta, GA: Hartsfield-Jackson Atlanta International        5985
##   arr_del15 carrier_ct weather_ct nas_ct security_ct late_aircraft_ct
## 1         3       1.63       0.00   0.12           0             1.25
## 2         1       0.96       0.00   0.04           0             0.00
## 3         8       5.75       0.00   1.60           0             0.65
## 4         9       4.17       0.00   1.83           0             3.00
## 5        11       4.78       0.00   5.22           0             1.00
## 6       445     142.89      11.96 161.37           1           127.79
##   arr_cancelled arr_diverted arr_delay carrier_delay weather_delay nas_delay
## 1             0            1        89            56             0         3
## 2             0            0        23            22             0         1
## 3             0            1       338           265             0        45
## 4             0            0       508           192             0        92
## 5             1            0       692           398             0       178
## 6             5            0     30756         16390          1509      5060
##   security_delay late_aircraft_delay
## 1              0                  30
## 2              0                   0
## 3              0                  28
## 4              0                 224
## 5              0                 116
## 6             16                7781
str(airline_delay)
## 'data.frame':    3351 obs. of  21 variables:
##  $ year               : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ month              : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ carrier            : chr  "9E" "9E" "9E" "9E" ...
##  $ carrier_name       : chr  "Endeavor Air Inc." "Endeavor Air Inc." "Endeavor Air Inc." "Endeavor Air Inc." ...
##  $ airport            : chr  "ABE" "ABY" "AEX" "AGS" ...
##  $ airport_name       : chr  "Allentown/Bethlehem/Easton, PA: Lehigh Valley International" "Albany, GA: Southwest Georgia Regional" "Alexandria, LA: Alexandria International" "Augusta, GA: Augusta Regional at Bush Field" ...
##  $ arr_flights        : int  44 90 88 184 76 5985 142 147 84 150 ...
##  $ arr_del15          : int  3 1 8 9 11 445 14 10 14 19 ...
##  $ carrier_ct         : num  1.63 0.96 5.75 4.17 4.78 ...
##  $ weather_ct         : num  0 0 0 0 0 ...
##  $ nas_ct             : num  0.12 0.04 1.6 1.83 5.22 ...
##  $ security_ct        : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ late_aircraft_ct   : num  1.25 0 0.65 3 1 ...
##  $ arr_cancelled      : int  0 0 0 0 1 5 1 0 1 3 ...
##  $ arr_diverted       : int  1 0 1 0 0 0 0 1 1 0 ...
##  $ arr_delay          : int  89 23 338 508 692 30756 436 1070 2006 846 ...
##  $ carrier_delay      : int  56 22 265 192 398 16390 162 838 1164 423 ...
##  $ weather_delay      : int  0 0 0 0 0 1509 0 141 619 0 ...
##  $ nas_delay          : int  3 1 45 92 178 5060 182 24 223 389 ...
##  $ security_delay     : int  0 0 0 0 0 16 0 0 0 0 ...
##  $ late_aircraft_delay: int  30 0 28 224 116 7781 92 67 0 34 ...
summary(airline_delay)
##       year          month      carrier          carrier_name      
##  Min.   :2019   Min.   :12   Length:3351        Length:3351       
##  1st Qu.:2019   1st Qu.:12   Class :character   Class :character  
##  Median :2019   Median :12   Mode  :character   Mode  :character  
##  Mean   :2019   Mean   :12                                        
##  3rd Qu.:2020   3rd Qu.:12                                        
##  Max.   :2020   Max.   :12                                        
##                                                                   
##    airport          airport_name        arr_flights        arr_del15   
##  Length:3351        Length:3351        Min.   :    1.0   Min.   :   0  
##  Class :character   Class :character   1st Qu.:   35.0   1st Qu.:   5  
##  Mode  :character   Mode  :character   Median :   83.0   Median :  12  
##                                        Mean   :  298.3   Mean   :  51  
##                                        3rd Qu.:  194.5   3rd Qu.:  33  
##                                        Max.   :19713.0   Max.   :2289  
##                                        NA's   :8         NA's   :8     
##    carrier_ct       weather_ct         nas_ct         security_ct     
##  Min.   :  0.00   Min.   : 0.000   Min.   :   0.00   Min.   : 0.0000  
##  1st Qu.:  1.49   1st Qu.: 0.000   1st Qu.:   0.82   1st Qu.: 0.0000  
##  Median :  4.75   Median : 0.060   Median :   2.98   Median : 0.0000  
##  Mean   : 16.07   Mean   : 1.443   Mean   :  16.18   Mean   : 0.1373  
##  3rd Qu.: 12.26   3rd Qu.: 1.010   3rd Qu.:   8.87   3rd Qu.: 0.0000  
##  Max.   :697.00   Max.   :89.420   Max.   :1039.54   Max.   :17.3100  
##  NA's   :8        NA's   :8        NA's   :8         NA's   :8        
##  late_aircraft_ct arr_cancelled      arr_diverted       arr_delay     
##  Min.   :  0.00   Min.   :  0.000   Min.   : 0.0000   Min.   :     0  
##  1st Qu.:  0.90   1st Qu.:  0.000   1st Qu.: 0.0000   1st Qu.:   230  
##  Median :  3.28   Median :  0.000   Median : 0.0000   Median :   746  
##  Mean   : 17.17   Mean   :  2.885   Mean   : 0.5758   Mean   :  3334  
##  3rd Qu.: 10.24   3rd Qu.:  2.000   3rd Qu.: 0.0000   3rd Qu.:  2096  
##  Max.   :819.66   Max.   :224.000   Max.   :42.0000   Max.   :160383  
##  NA's   :8        NA's   :8         NA's   :8         NA's   :8       
##  carrier_delay     weather_delay       nas_delay       security_delay   
##  Min.   :    0.0   Min.   :    0.0   Min.   :    0.0   Min.   :  0.000  
##  1st Qu.:   68.5   1st Qu.:    0.0   1st Qu.:   21.5   1st Qu.:  0.000  
##  Median :  272.0   Median :    3.0   Median :  106.0   Median :  0.000  
##  Mean   : 1144.8   Mean   :  177.6   Mean   :  749.6   Mean   :  5.401  
##  3rd Qu.:  830.5   3rd Qu.:   82.0   3rd Qu.:  362.0   3rd Qu.:  0.000  
##  Max.   :55215.0   Max.   :14219.0   Max.   :82064.0   Max.   :553.000  
##  NA's   :8         NA's   :8         NA's   :8         NA's   :8        
##  late_aircraft_delay
##  Min.   :    0      
##  1st Qu.:   31      
##  Median :  205      
##  Mean   : 1257      
##  3rd Qu.:  724      
##  Max.   :75179      
##  NA's   :8
# Checking for any missing values

colSums(is.na(airline_delay))
##                year               month             carrier        carrier_name 
##                   0                   0                   0                   0 
##             airport        airport_name         arr_flights           arr_del15 
##                   0                   0                   8                   8 
##          carrier_ct          weather_ct              nas_ct         security_ct 
##                   8                   8                   8                   8 
##    late_aircraft_ct       arr_cancelled        arr_diverted           arr_delay 
##                   8                   8                   8                   8 
##       carrier_delay       weather_delay           nas_delay      security_delay 
##                   8                   8                   8                   8 
## late_aircraft_delay 
##                   8
# Removing rows with missing key variables and rows with no delayed flights

airline_delay_clean <- airline_delay |>
  filter(!is.na(arr_del15), !is.na(arr_delay), !is.na(weather_delay), arr_del15 > 0)
# Checking Dataset's categories

unique(airline_delay$carrier)
##  [1] "9E" "AA" "AS" "B6" "DL" "F9" "G4" "HA" "MQ" "NK" "OH" "OO" "YX" "UA" "WN"
## [16] "YV" "EV"
unique(airline_delay$airport)
##   [1] "ABE" "ABY" "AEX" "AGS" "ALB" "ATL" "ATW" "AVL" "AZO" "BDL" "BHM" "BIS"
##  [13] "BMI" "BNA" "BOS" "BQK" "BTR" "BTV" "BUF" "BWI" "CAE" "CHA" "CHO" "CHS"
##  [25] "CID" "CLE" "CLT" "CMH" "CRW" "CSG" "CVG" "CWA" "DAL" "DAY" "DCA" "DFW"
##  [37] "DHN" "DLH" "DSM" "DTW" "ECP" "ELM" "EVV" "EWR" "FAR" "FAY" "FLL" "FSD"
##  [49] "FWA" "GFK" "GNV" "GPT" "GRB" "GRR" "GSO" "GSP" "GTR" "HOU" "HSV" "IAD"
##  [61] "ICT" "ILM" "IND" "JAN" "JAX" "JFK" "LAN" "LEX" "LFT" "LGA" "LIT" "MBS"
##  [73] "MCI" "MCO" "MDT" "MEM" "MGM" "MIA" "MKE" "MLI" "MLU" "MOB" "MOT" "MSN"
##  [85] "MSP" "MSY" "MYR" "OAJ" "OKC" "OMA" "ORD" "ORF" "PHL" "PIT" "RAP" "RDU"
##  [97] "RIC" "ROA" "ROC" "RST" "RSW" "SAV" "SBN" "SDF" "SGF" "SHV" "SRQ" "STL"
## [109] "SYR" "TLH" "TPA" "TRI" "TUL" "TYS" "VLD" "VPS" "XNA" "ABQ" "AUS" "BOI"
## [121] "BUR" "BZN" "COS" "DEN" "EGE" "ELP" "EYW" "FAT" "GEG" "GUC" "HDN" "HNL"
## [133] "IAH" "JAC" "KOA" "LAS" "LAX" "LIH" "MTJ" "OGG" "ONT" "PBI" "PDX" "PHX"
## [145] "PSP" "PVD" "PWM" "RNO" "SAN" "SAT" "SEA" "SFO" "SJC" "SJU" "SLC" "SMF"
## [157] "SNA" "STT" "STX" "TUS" "ADK" "ADQ" "ANC" "BET" "BRW" "CDB" "CDV" "FAI"
## [169] "JNU" "KTN" "OAK" "OME" "OTZ" "PSG" "SCC" "SIT" "WRG" "YAK" "ACK" "HPN"
## [181] "BIL" "DAB" "FCA" "MDW" "MLB" "MSO" "PNS" "HRL" "ISP" "TTN" "AZA" "BGR"
## [193] "BLI" "BLV" "CKB" "EUG" "FNT" "GJT" "GRI" "GTF" "HGR" "HTS" "IAG" "IDA"
## [205] "LCK" "LRD" "MFE" "MFR" "MRY" "OGD" "OWB" "PBG" "PGD" "PIA" "PIE" "PSC"
## [217] "PSM" "PVU" "RDM" "RFD" "SCE" "SCK" "SFB" "SMX" "SPI" "STC" "SWF" "TOL"
## [229] "TVC" "USA" "ITO" "LGB" "ABI" "ACT" "ALO" "AMA" "AVP" "BPT" "BRO" "CLL"
## [241] "CMI" "COU" "CRP" "FSM" "GCK" "GGG" "GRK" "JLN" "LAW" "LBB" "LSE" "MHK"
## [253] "MQT" "SBA" "SBP" "SJT" "SPS" "STS" "SUX" "SWO" "TXK" "TYR" "ACY" "CAK"
## [265] "LBE" "ERI" "EWN" "LYH" "MHT" "PHF" "ABR" "ACV" "ALS" "APN" "ASE" "ATY"
## [277] "BFF" "BFL" "BGM" "BJI" "BRD" "BTM" "CDC" "CGI" "CIU" "CMX" "CNY" "COD"
## [289] "CPR" "CYS" "DDC" "DEC" "DIK" "DRO" "DVL" "EAR" "EAU" "EKO" "ESC" "FLG"
## [301] "GCC" "HIB" "HLN" "HOB" "HYS" "IMT" "INL" "ITH" "JMS" "JST" "LAR" "LBF"
## [313] "LBL" "LCH" "LNK" "LWB" "LWS" "MAF" "MEI" "MKG" "OGS" "OTH" "PAE" "PAH"
## [325] "PIB" "PIH" "PIR" "PLN" "PRC" "PUB" "RDD" "RHI" "RIW" "RKS" "ROW" "SAF"
## [337] "SGU" "SHD" "SHR" "SLN" "SUN" "TWF" "VCT" "VEL" "XWA" "YUM" "GUM" "SPN"
## [349] "HHH" "BFM" "PPG" "DBQ" "DRT" "PGV" "BQN" "HVN" "MMH" "ORH" "UIN" "PSE"
# Outcome variable

airline_delay_clean <- airline_delay_clean |>
  mutate(avg_delay_time = arr_delay / arr_del15)
# Selecting variables for modeling

airline_delay_final <- airline_delay_clean |>
  select(avg_delay_time, weather_delay, arr_flights, carrier, airport, year)
# Grouped summary

airline_delay_final |>
  group_by(carrier) |>
  summarise(mean_avg_delay = mean(avg_delay_time, na.rm = TRUE))
## # A tibble: 17 × 2
##    carrier mean_avg_delay
##    <chr>            <dbl>
##  1 9E                59.0
##  2 AA                57.9
##  3 AS                48.9
##  4 B6                80.3
##  5 DL                52.4
##  6 EV                76.5
##  7 F9                54.9
##  8 G4                71.1
##  9 HA                36.7
## 10 MQ                58.7
## 11 NK                57.6
## 12 OH                58.8
## 13 OO                80.3
## 14 UA                59.5
## 15 WN                42.6
## 16 YV                78.7
## 17 YX                62.4

Statistical Analysis (Multiple Regression)

To answer my research question, I used multiple linear regression to examine how well weather-related delays and flight volume predict the average delay time for flights. Multiple linear regression is appropriate for this analysis because the dependent variable, average delay time, is continuous and the model allows several predictors to be evaluated simultaneously. The final model includes weather delay minutes and total number of arriving flights as independent variables.

After fitting the model, I evaluated the regression output by examining the estimated coefficients, p-values, and the R-squared value. Diagnostic plots were used to assess whether the assumptions of multiple linear regression were reasonably satisfied. These diagnostics helped me determine whether the model provides a reliable description of the relationship between average delay time and the selected predictors.
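The quantities mentioned above can also be pulled out of a fitted model programmatically. The following is a minimal sketch on simulated data (the names x1, x2, y, and sim, and the simulated coefficients, are assumptions for illustration and not the airline data). It shows where the coefficients, p-values, and R-squared are stored in the object returned by summary().

```r
set.seed(1)

# Simulated data standing in for two predictors and a continuous outcome
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 5 + 2 * x1 - 1 * x2 + rnorm(n)
sim <- data.frame(y, x1, x2)

fit     <- lm(y ~ x1 + x2, data = sim)
fit_sum <- summary(fit)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(fit_sum)
coef_table[, "Estimate"]
coef_table[, "Pr(>|t|)"]

# Share of outcome variance explained by the model
fit_sum$r.squared
fit_sum$adj.r.squared
```

Reading these values directly from the summary object avoids copying numbers by hand from the printed output.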

# Multiple Regression Model

delay_model <- lm(avg_delay_time ~ weather_delay + arr_flights, data = airline_delay_final)

summary(delay_model)
## 
## Call:
## lm(formula = avg_delay_time ~ weather_delay + arr_flights, data = airline_delay_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -90.58  -21.12   -7.18    9.70 1488.17 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   62.956946   0.882373  71.350  < 2e-16 ***
## weather_delay  0.012157   0.001358   8.954  < 2e-16 ***
## arr_flights   -0.006186   0.001171  -5.281 1.37e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.8 on 3179 degrees of freedom
## Multiple R-squared:  0.0246, Adjusted R-squared:  0.02399 
## F-statistic: 40.09 on 2 and 3179 DF,  p-value: < 2.2e-16
# Diagnostic Plots

par(mfrow = c(2, 2))
plot(delay_model)

After fitting the multiple regression model, I examined diagnostic plots to evaluate whether the assumptions of multiple linear regression were reasonably satisfied. The Residuals vs. Fitted plot indicated no strong systematic pattern, suggesting that the linearity assumption is reasonable. The Normal Q-Q plot indicated that the residuals were approximately normal through most of their range, with deviations in the upper tail; the residual summary, with a maximum of 1488.17 against a minimum of -90.58, points to the same right skew. The Scale-Location plot suggested mild heteroscedasticity, although the variance of the residuals remained fairly stable across the fitted values. Lastly, the Residuals vs. Leverage plot revealed a small number of observations with higher leverage. Overall, the diagnostics suggest that the regression assumptions were reasonably met apart from the heavy upper tail, so the coefficient estimates should be interpreted with some caution.
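Numeric summaries can back up a visual reading of diagnostic plots. The sketch below uses simulated, right-skewed data rather than the airline data (all variable names and simulation settings are assumptions), and computes the residual skewness, leverage, and Cook's distance. The 2p/n and 4/n cutoffs are common rules of thumb, not hard thresholds.

```r
set.seed(42)

# Simulated data with right-skewed errors (illustration only)
n <- 300
x <- rexp(n)
y <- 10 + 3 * x + rexp(n, rate = 0.2)
fit <- lm(y ~ x)

res <- residuals(fit)

# Sample skewness of the residuals; values well above 0 indicate a long right tail
skew <- mean((res - mean(res))^3) / sd(res)^3

# Leverage and Cook's distance for each observation
lev   <- hatvalues(fit)
cooks <- cooks.distance(fit)

# Rules of thumb: leverage above 2p/n, Cook's distance above 4/n
p <- length(coef(fit))
high_leverage <- which(lev > 2 * p / n)
influential   <- which(cooks > 4 / n)

skew
length(high_leverage)
length(influential)
```

Applied to a fitted model such as delay_model, the same calls would give a count of high-leverage and influential rows to compare against what the plots suggest.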

Conclusion

This analysis examined how well weather-related delays and flight volume predict the average delay time for flights, using a multiple linear regression model. Weather-related delays were positively and statistically significantly associated with average delay time (an estimated 0.012 additional minutes of average delay per weather delay minute), indicating that greater weather disruption is associated with longer delays. Flight volume was also a statistically significant predictor: higher numbers of arriving flights were associated with slightly shorter average delay times (about 0.006 minutes less per additional arriving flight), suggesting that airports with greater traffic may manage delays more efficiently. The overall regression was statistically significant, but with an R-squared of 0.0246 the two predictors explain only about 2.5% of the variation in average delay time, so the model has limited predictive power while still offering some insight into the relationship between operational factors and flight delays.
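To make the size of these effects concrete, the sketch below plugs the coefficient estimates reported by summary(delay_model) into the fitted equation for the ATL row shown in the head() output (1509 weather delay minutes, 5985 arriving flights). The helper name predict_avg_delay is a hypothetical convenience function, not part of the original analysis.

```r
# Coefficient estimates copied from the summary(delay_model) output
b0        <- 62.956946   # intercept
b_weather <- 0.012157    # change per weather delay minute
b_flights <- -0.006186   # change per arriving flight

# Hypothetical helper: fitted average delay time in minutes
predict_avg_delay <- function(weather_delay, arr_flights) {
  b0 + b_weather * weather_delay + b_flights * arr_flights
}

# ATL row (carrier 9E, December 2020) from the head() output
fitted_atl   <- predict_avg_delay(weather_delay = 1509, arr_flights = 5985)
observed_atl <- 30756 / 445

fitted_atl
observed_atl
observed_atl - fitted_atl   # residual for this row
```

For this row the fitted value is about 44.3 minutes against an observed average of about 69.1 minutes, a gap of roughly 25 minutes, which is in line with the low R-squared in the regression output.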

These findings highlight the role of weather conditions in airline delay performance, while also suggesting that operational scale may play a role in mitigating delay severity. Future research could improve the model by adding variables such as airport congestion levels, staffing, aircraft types, and traffic control constraints. Expanding the analysis to include data from a longer period could also reveal seasonal patterns. Overall, this analysis provides a foundation for understanding factors associated with airline delays and offers an opportunity for deeper analysis.

Reference Section

https://www.openintro.org/data/index.php?data=airline_delay