How well do weather-related delays and flight volume predict the average delay time for flights?
Airline delays are a common occurrence and a source of frustration for travelers, as well as a challenge for airports and airlines. Understanding which factors are most strongly associated with longer delays can help airlines allocate their resources more efficiently and improve customers’ travel experience.
I used the airline_delay dataset from OpenIntro, which summarizes delay statistics for U.S. airports in December 2019 and December 2020. Each row represents a combination of month, airline, and airport, and records the total number of arriving flights, the number of delayed flights, and the total delay minutes broken down by causes such as air traffic control, weather, and late aircraft. For my research question, I focus only on the essential variables: flight volume (total arriving flights), weather delay in minutes, and average delay time, defined as the total delayed minutes divided by the number of delayed flights. Because average delay time is a continuous outcome, it is well suited to a multiple regression model with flight volume and weather delay as predictors.
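As a quick illustration of the outcome definition above, the formula can be written out with the dataset's column names and checked against one row. The worked value uses the Endeavor Air row for ATL shown in the head() output (arr_delay = 30756, arr_del15 = 445); it is only a hand check, not part of the model.

```latex
\[
\text{avg\_delay\_time} = \frac{\text{arr\_delay}}{\text{arr\_del15}},
\qquad
\text{e.g. } \frac{30756}{445} \approx 69.1 \text{ minutes per delayed flight}
\]
```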
Before using the dataset for modeling, I performed some basic exploratory data analysis (EDA) using functions such as head(), str(), and summary(). I also used colSums(is.na()) to check for missing values and unique() to review the distinct categories in certain columns.
I then used several dplyr functions to clean and organize the dataset. Rows with missing values in my key variables were removed with filter() so that the analysis was based on complete observations. In the same step, I removed rows where no flights were delayed, since dividing by zero delayed flights would leave the average delay time undefined. Next, I created a new variable, avg_delay_time, with mutate(), calculated by dividing the total arrival delay minutes (arr_delay) by the number of delayed flights (arr_del15, arrivals delayed by 15 minutes or more). After cleaning, I used select() to keep only the variables needed for my analysis, and I applied group_by() and summarise() to compute summaries such as the average delay time by airline carrier. This process helped ensure that the dataset was clean, complete, and ready for building the multiple regression model.
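To show why rows with zero delayed flights have to be dropped, here is a minimal sketch on a toy data frame. The values are hypothetical (not rows from airline_delay.csv), and it uses base R indexing rather than dplyr so that it runs without any packages.

```r
# Toy data (hypothetical values, not rows from airline_delay.csv)
toy <- data.frame(
  arr_del15 = c(4, 0, NA),   # number of delayed flights
  arr_delay = c(120, 0, NA)  # total delay minutes
)

# 0 / 0 gives NaN and missing inputs give NA, so the average is undefined there
toy$avg_delay_time <- toy$arr_delay / toy$arr_del15
print(toy$avg_delay_time)

# Keep only rows with a non-missing, positive count of delayed flights
toy_clean <- toy[!is.na(toy$arr_del15) & toy$arr_del15 > 0, ]
print(toy_clean)
```

Only the first toy row survives, with an average of 30 minutes per delayed flight. This is the same condition that arr_del15 > 0 enforces in the filter() call.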
# Importing the dataset
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
airline_delay <- read.csv("airline_delay.csv")
# Checking the Head, Structure, and Summary of the dataset
head(airline_delay)
## year month carrier carrier_name airport
## 1 2020 12 9E Endeavor Air Inc. ABE
## 2 2020 12 9E Endeavor Air Inc. ABY
## 3 2020 12 9E Endeavor Air Inc. AEX
## 4 2020 12 9E Endeavor Air Inc. AGS
## 5 2020 12 9E Endeavor Air Inc. ALB
## 6 2020 12 9E Endeavor Air Inc. ATL
## airport_name arr_flights
## 1 Allentown/Bethlehem/Easton, PA: Lehigh Valley International 44
## 2 Albany, GA: Southwest Georgia Regional 90
## 3 Alexandria, LA: Alexandria International 88
## 4 Augusta, GA: Augusta Regional at Bush Field 184
## 5 Albany, NY: Albany International 76
## 6 Atlanta, GA: Hartsfield-Jackson Atlanta International 5985
## arr_del15 carrier_ct weather_ct nas_ct security_ct late_aircraft_ct
## 1 3 1.63 0.00 0.12 0 1.25
## 2 1 0.96 0.00 0.04 0 0.00
## 3 8 5.75 0.00 1.60 0 0.65
## 4 9 4.17 0.00 1.83 0 3.00
## 5 11 4.78 0.00 5.22 0 1.00
## 6 445 142.89 11.96 161.37 1 127.79
## arr_cancelled arr_diverted arr_delay carrier_delay weather_delay nas_delay
## 1 0 1 89 56 0 3
## 2 0 0 23 22 0 1
## 3 0 1 338 265 0 45
## 4 0 0 508 192 0 92
## 5 1 0 692 398 0 178
## 6 5 0 30756 16390 1509 5060
## security_delay late_aircraft_delay
## 1 0 30
## 2 0 0
## 3 0 28
## 4 0 224
## 5 0 116
## 6 16 7781
str(airline_delay)
## 'data.frame': 3351 obs. of 21 variables:
## $ year : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ month : int 12 12 12 12 12 12 12 12 12 12 ...
## $ carrier : chr "9E" "9E" "9E" "9E" ...
## $ carrier_name : chr "Endeavor Air Inc." "Endeavor Air Inc." "Endeavor Air Inc." "Endeavor Air Inc." ...
## $ airport : chr "ABE" "ABY" "AEX" "AGS" ...
## $ airport_name : chr "Allentown/Bethlehem/Easton, PA: Lehigh Valley International" "Albany, GA: Southwest Georgia Regional" "Alexandria, LA: Alexandria International" "Augusta, GA: Augusta Regional at Bush Field" ...
## $ arr_flights : int 44 90 88 184 76 5985 142 147 84 150 ...
## $ arr_del15 : int 3 1 8 9 11 445 14 10 14 19 ...
## $ carrier_ct : num 1.63 0.96 5.75 4.17 4.78 ...
## $ weather_ct : num 0 0 0 0 0 ...
## $ nas_ct : num 0.12 0.04 1.6 1.83 5.22 ...
## $ security_ct : num 0 0 0 0 0 1 0 0 0 0 ...
## $ late_aircraft_ct : num 1.25 0 0.65 3 1 ...
## $ arr_cancelled : int 0 0 0 0 1 5 1 0 1 3 ...
## $ arr_diverted : int 1 0 1 0 0 0 0 1 1 0 ...
## $ arr_delay : int 89 23 338 508 692 30756 436 1070 2006 846 ...
## $ carrier_delay : int 56 22 265 192 398 16390 162 838 1164 423 ...
## $ weather_delay : int 0 0 0 0 0 1509 0 141 619 0 ...
## $ nas_delay : int 3 1 45 92 178 5060 182 24 223 389 ...
## $ security_delay : int 0 0 0 0 0 16 0 0 0 0 ...
## $ late_aircraft_delay: int 30 0 28 224 116 7781 92 67 0 34 ...
summary(airline_delay)
## year month carrier carrier_name
## Min. :2019 Min. :12 Length:3351 Length:3351
## 1st Qu.:2019 1st Qu.:12 Class :character Class :character
## Median :2019 Median :12 Mode :character Mode :character
## Mean :2019 Mean :12
## 3rd Qu.:2020 3rd Qu.:12
## Max. :2020 Max. :12
##
## airport airport_name arr_flights arr_del15
## Length:3351 Length:3351 Min. : 1.0 Min. : 0
## Class :character Class :character 1st Qu.: 35.0 1st Qu.: 5
## Mode :character Mode :character Median : 83.0 Median : 12
## Mean : 298.3 Mean : 51
## 3rd Qu.: 194.5 3rd Qu.: 33
## Max. :19713.0 Max. :2289
## NA's :8 NA's :8
## carrier_ct weather_ct nas_ct security_ct
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 1.49 1st Qu.: 0.000 1st Qu.: 0.82 1st Qu.: 0.0000
## Median : 4.75 Median : 0.060 Median : 2.98 Median : 0.0000
## Mean : 16.07 Mean : 1.443 Mean : 16.18 Mean : 0.1373
## 3rd Qu.: 12.26 3rd Qu.: 1.010 3rd Qu.: 8.87 3rd Qu.: 0.0000
## Max. :697.00 Max. :89.420 Max. :1039.54 Max. :17.3100
## NA's :8 NA's :8 NA's :8 NA's :8
## late_aircraft_ct arr_cancelled arr_diverted arr_delay
## Min. : 0.00 Min. : 0.000 Min. : 0.0000 Min. : 0
## 1st Qu.: 0.90 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 230
## Median : 3.28 Median : 0.000 Median : 0.0000 Median : 746
## Mean : 17.17 Mean : 2.885 Mean : 0.5758 Mean : 3334
## 3rd Qu.: 10.24 3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 2096
## Max. :819.66 Max. :224.000 Max. :42.0000 Max. :160383
## NA's :8 NA's :8 NA's :8 NA's :8
## carrier_delay weather_delay nas_delay security_delay
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.000
## 1st Qu.: 68.5 1st Qu.: 0.0 1st Qu.: 21.5 1st Qu.: 0.000
## Median : 272.0 Median : 3.0 Median : 106.0 Median : 0.000
## Mean : 1144.8 Mean : 177.6 Mean : 749.6 Mean : 5.401
## 3rd Qu.: 830.5 3rd Qu.: 82.0 3rd Qu.: 362.0 3rd Qu.: 0.000
## Max. :55215.0 Max. :14219.0 Max. :82064.0 Max. :553.000
## NA's :8 NA's :8 NA's :8 NA's :8
## late_aircraft_delay
## Min. : 0
## 1st Qu.: 31
## Median : 205
## Mean : 1257
## 3rd Qu.: 724
## Max. :75179
## NA's :8
#Checking for any missing values
colSums(is.na(airline_delay))
## year month carrier carrier_name
## 0 0 0 0
## airport airport_name arr_flights arr_del15
## 0 0 8 8
## carrier_ct weather_ct nas_ct security_ct
## 8 8 8 8
## late_aircraft_ct arr_cancelled arr_diverted arr_delay
## 8 8 8 8
## carrier_delay weather_delay nas_delay security_delay
## 8 8 8 8
## late_aircraft_delay
## 8
# Removing rows with missing values in the key variables and rows with no delayed flights
airline_delay_clean <- airline_delay |>
  filter(!is.na(arr_del15), !is.na(arr_delay), !is.na(weather_delay), arr_del15 > 0)
# Checking Dataset's categories
unique(airline_delay$carrier)
## [1] "9E" "AA" "AS" "B6" "DL" "F9" "G4" "HA" "MQ" "NK" "OH" "OO" "YX" "UA" "WN"
## [16] "YV" "EV"
unique(airline_delay$airport)
## [1] "ABE" "ABY" "AEX" "AGS" "ALB" "ATL" "ATW" "AVL" "AZO" "BDL" "BHM" "BIS"
## [13] "BMI" "BNA" "BOS" "BQK" "BTR" "BTV" "BUF" "BWI" "CAE" "CHA" "CHO" "CHS"
## [25] "CID" "CLE" "CLT" "CMH" "CRW" "CSG" "CVG" "CWA" "DAL" "DAY" "DCA" "DFW"
## [37] "DHN" "DLH" "DSM" "DTW" "ECP" "ELM" "EVV" "EWR" "FAR" "FAY" "FLL" "FSD"
## [49] "FWA" "GFK" "GNV" "GPT" "GRB" "GRR" "GSO" "GSP" "GTR" "HOU" "HSV" "IAD"
## [61] "ICT" "ILM" "IND" "JAN" "JAX" "JFK" "LAN" "LEX" "LFT" "LGA" "LIT" "MBS"
## [73] "MCI" "MCO" "MDT" "MEM" "MGM" "MIA" "MKE" "MLI" "MLU" "MOB" "MOT" "MSN"
## [85] "MSP" "MSY" "MYR" "OAJ" "OKC" "OMA" "ORD" "ORF" "PHL" "PIT" "RAP" "RDU"
## [97] "RIC" "ROA" "ROC" "RST" "RSW" "SAV" "SBN" "SDF" "SGF" "SHV" "SRQ" "STL"
## [109] "SYR" "TLH" "TPA" "TRI" "TUL" "TYS" "VLD" "VPS" "XNA" "ABQ" "AUS" "BOI"
## [121] "BUR" "BZN" "COS" "DEN" "EGE" "ELP" "EYW" "FAT" "GEG" "GUC" "HDN" "HNL"
## [133] "IAH" "JAC" "KOA" "LAS" "LAX" "LIH" "MTJ" "OGG" "ONT" "PBI" "PDX" "PHX"
## [145] "PSP" "PVD" "PWM" "RNO" "SAN" "SAT" "SEA" "SFO" "SJC" "SJU" "SLC" "SMF"
## [157] "SNA" "STT" "STX" "TUS" "ADK" "ADQ" "ANC" "BET" "BRW" "CDB" "CDV" "FAI"
## [169] "JNU" "KTN" "OAK" "OME" "OTZ" "PSG" "SCC" "SIT" "WRG" "YAK" "ACK" "HPN"
## [181] "BIL" "DAB" "FCA" "MDW" "MLB" "MSO" "PNS" "HRL" "ISP" "TTN" "AZA" "BGR"
## [193] "BLI" "BLV" "CKB" "EUG" "FNT" "GJT" "GRI" "GTF" "HGR" "HTS" "IAG" "IDA"
## [205] "LCK" "LRD" "MFE" "MFR" "MRY" "OGD" "OWB" "PBG" "PGD" "PIA" "PIE" "PSC"
## [217] "PSM" "PVU" "RDM" "RFD" "SCE" "SCK" "SFB" "SMX" "SPI" "STC" "SWF" "TOL"
## [229] "TVC" "USA" "ITO" "LGB" "ABI" "ACT" "ALO" "AMA" "AVP" "BPT" "BRO" "CLL"
## [241] "CMI" "COU" "CRP" "FSM" "GCK" "GGG" "GRK" "JLN" "LAW" "LBB" "LSE" "MHK"
## [253] "MQT" "SBA" "SBP" "SJT" "SPS" "STS" "SUX" "SWO" "TXK" "TYR" "ACY" "CAK"
## [265] "LBE" "ERI" "EWN" "LYH" "MHT" "PHF" "ABR" "ACV" "ALS" "APN" "ASE" "ATY"
## [277] "BFF" "BFL" "BGM" "BJI" "BRD" "BTM" "CDC" "CGI" "CIU" "CMX" "CNY" "COD"
## [289] "CPR" "CYS" "DDC" "DEC" "DIK" "DRO" "DVL" "EAR" "EAU" "EKO" "ESC" "FLG"
## [301] "GCC" "HIB" "HLN" "HOB" "HYS" "IMT" "INL" "ITH" "JMS" "JST" "LAR" "LBF"
## [313] "LBL" "LCH" "LNK" "LWB" "LWS" "MAF" "MEI" "MKG" "OGS" "OTH" "PAE" "PAH"
## [325] "PIB" "PIH" "PIR" "PLN" "PRC" "PUB" "RDD" "RHI" "RIW" "RKS" "ROW" "SAF"
## [337] "SGU" "SHD" "SHR" "SLN" "SUN" "TWF" "VCT" "VEL" "XWA" "YUM" "GUM" "SPN"
## [349] "HHH" "BFM" "PPG" "DBQ" "DRT" "PGV" "BQN" "HVN" "MMH" "ORH" "UIN" "PSE"
# Outcome variable
airline_delay_clean <- airline_delay_clean |>
  mutate(avg_delay_time = arr_delay / arr_del15)
# Selecting variables for modeling
airline_delay_final <- airline_delay_clean |>
  select(avg_delay_time, weather_delay, arr_flights, carrier, airport, year)
# Grouped summary
airline_delay_final |>
  group_by(carrier) |>
  summarise(mean_avg_delay = mean(avg_delay_time, na.rm = TRUE))
## # A tibble: 17 × 2
## carrier mean_avg_delay
## <chr> <dbl>
## 1 9E 59.0
## 2 AA 57.9
## 3 AS 48.9
## 4 B6 80.3
## 5 DL 52.4
## 6 EV 76.5
## 7 F9 54.9
## 8 G4 71.1
## 9 HA 36.7
## 10 MQ 58.7
## 11 NK 57.6
## 12 OH 58.8
## 13 OO 80.3
## 14 UA 59.5
## 15 WN 42.6
## 16 YV 78.7
## 17 YX 62.4
To answer my research question, I used multiple linear regression to examine how well weather-related delays and flight volume predict the average delay time for flights. Multiple linear regression is well suited to this analysis because the dependent variable, average delay time, is continuous and the model allows several predictors to be evaluated simultaneously. The final model includes weather delay minutes and the total number of arriving flights as independent variables.
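Written out, the model described above takes the following form. The symbols are my own notation rather than anything from the dataset: the index i denotes one carrier-airport-month row, and the error term is assumed to follow the usual linear regression assumptions.

```latex
\[
\text{avg\_delay\_time}_i
  = \beta_0
  + \beta_1\,\text{weather\_delay}_i
  + \beta_2\,\text{arr\_flights}_i
  + \varepsilon_i
\]
```

Here \(\beta_1\) is the expected change in average delay time (in minutes) for one additional weather-delay minute with flight volume held fixed, and \(\beta_2\) is the expected change for one additional arriving flight with weather delay held fixed.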
After fitting the model, I evaluated the regression output by examining the estimated coefficients, p-values, and the R-squared value. Diagnostic plots were used to assess whether the assumptions of multiple linear regression were reasonably satisfied. These diagnostics helped me determine whether the model provides a reliable description of the relationship between flight delays and the selected predictors.
#Multiple Regression Model
delay_model <- lm(avg_delay_time ~ weather_delay + arr_flights, data = airline_delay_final)
summary(delay_model)
##
## Call:
## lm(formula = avg_delay_time ~ weather_delay + arr_flights, data = airline_delay_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.58 -21.12 -7.18 9.70 1488.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.956946 0.882373 71.350 < 2e-16 ***
## weather_delay 0.012157 0.001358 8.954 < 2e-16 ***
## arr_flights -0.006186 0.001171 -5.281 1.37e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.8 on 3179 degrees of freedom
## Multiple R-squared: 0.0246, Adjusted R-squared: 0.02399
## F-statistic: 40.09 on 2 and 3179 DF, p-value: < 2.2e-16
# Diagnostic Plots
par(mfrow = c(2, 2))
plot(delay_model)
After fitting the multiple regression model, I examined diagnostic plots to evaluate whether the assumptions of multiple linear regression were reasonably satisfied. The Residuals vs. Fitted plot showed no strong systematic pattern, suggesting that the linearity assumption is appropriate. The Normal Q-Q plot indicated that the residuals were approximately normal through most of their range but deviated in the upper tail. This is consistent with the residual summary (a median of -7.18 against a maximum of 1488.17) and reflects a small number of very long average delays. The Scale-Location plot suggested mild heteroscedasticity, although the residual variance remained fairly stable across the fitted values. Lastly, the Residuals vs. Leverage plot revealed a small number of observations with higher leverage. Overall, the diagnostics suggest that the regression assumptions were reasonably met apart from the right-skewed upper tail, so the coefficients should be interpreted with some caution.
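The visual checks above can be complemented with a few numbers. The sketch below is illustrative only: it fits the same model formula to synthetic, right-skewed data (a hypothetical stand-in for airline_delay_final), and the 4 / n and 2p / n cutoffs are common rules of thumb rather than thresholds used in this report.

```r
set.seed(1)

# Synthetic stand-in for airline_delay_final (hypothetical values, right-skewed noise)
n <- 500
sim <- data.frame(
  weather_delay = rexp(n, rate = 1 / 150),
  arr_flights   = round(rexp(n, rate = 1 / 300)) + 1
)
sim$avg_delay_time <- 60 + 0.01 * sim$weather_delay -
  0.005 * sim$arr_flights + rexp(n, rate = 1 / 30)

sim_model <- lm(avg_delay_time ~ weather_delay + arr_flights, data = sim)

# Residual quantiles: a median below zero with a long upper range indicates right skew
res <- residuals(sim_model)
print(quantile(res, c(0, 0.25, 0.5, 0.75, 1)))

# Cook's distance: flag observations above the 4 / n rule of thumb
cooks <- cooks.distance(sim_model)
influential <- which(cooks > 4 / n)
print(length(influential))

# Leverage: flag hat values above twice the average leverage (p / n)
p <- length(coef(sim_model))
high_leverage <- which(hatvalues(sim_model) > 2 * p / n)
print(length(high_leverage))
```

Replacing sim_model with delay_model (and n with the number of rows used in the fit) would apply the same checks to the real model.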
This analysis examined how well weather-related delays and flight volume predict the average delay time for flights using a multiple linear regression model. Weather delay minutes were positively and significantly associated with average delay time (estimate 0.012, p < 0.001), indicating that greater weather disruption is associated with longer delays. Flight volume was also a statistically significant predictor (estimate -0.006, p < 0.001): higher numbers of arriving flights were associated with slightly shorter average delay times, suggesting that airports with greater traffic may manage delays more efficiently. The overall regression was statistically significant (F = 40.09, p < 0.001), but with an R-squared of 0.0246 the two predictors explain only about 2.5% of the variation in average delay time, so on their own they predict it weakly.
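To put the coefficient sizes in context, the fitted equation can be evaluated by hand. The sketch below copies the estimates from the summary(delay_model) output. The input values (100 weather-delay minutes, 1,000 arriving flights) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(delay_model) output
b0        <- 62.956946   # intercept
b_weather <- 0.012157    # change in average delay per weather-delay minute
b_flights <- -0.006186   # change in average delay per arriving flight

# Hypothetical carrier-airport-month used only for illustration
weather_delay <- 100
arr_flights   <- 1000

# Predicted average delay (minutes per delayed flight)
predicted <- b0 + b_weather * weather_delay + b_flights * arr_flights
print(round(predicted, 2))  # 57.99

# Effect sizes in context
print(b_weather * 100)   # 100 more weather-delay minutes: about +1.2 minutes
print(b_flights * 1000)  # 1,000 more arriving flights: about -6.2 minutes
```

With the fitted model in memory, predict(delay_model, newdata = data.frame(weather_delay = 100, arr_flights = 1000)) should give the same value up to rounding of the printed coefficients.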
These findings highlight the role of weather conditions in airline delay performance and suggest that operational scale may help mitigate delay severity. Future research could improve the model by adding variables such as airport congestion levels, staffing, aircraft types, and air traffic control constraints. Expanding the analysis to cover a longer time period could also reveal seasonal patterns. Overall, this analysis provides a foundation for understanding the factors associated with airline delays and offers an opportunity for deeper analysis.