Week 9 Data Dive

Airline_Delay_Post_COVID_2021_2023 <- read.csv("Airline_Delay_Post_COVID_2021_2023.csv")

Introduction

Last week, a simple linear regression model was built using arr_flights to predict total arrival delay (arr_delay). That model had an R² of 0.81, which was a strong result, but it only used flight volume as a predictor. There are clearly other factors that contribute to delays (weather, late aircraft, and so on) so this week the goal is to build a more complete model by adding more variables. After fitting the extended model, diagnostic plots are used to check whether the regression assumptions actually hold.

Revisiting the Simple Model (Week 8)

lm_model <- lm(arr_delay ~ arr_flights, data = Airline_Delay_Post_COVID_2021_2023)
summary(lm_model)

## 
## Call:
## lm(formula = arr_delay ~ arr_flights, data = Airline_Delay_Post_COVID_2021_2023)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -130360    -634    -200     259  172886 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 90.30061   26.71593    3.38             0.000725 ***
## arr_flights 12.42247    0.02833  438.47 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5324 on 44875 degrees of freedom
##   (34 observations deleted due to missingness)
## Multiple R-squared:  0.8108, Adjusted R-squared:  0.8108 
## F-statistic: 1.923e+05 on 1 and 44875 DF,  p-value: < 0.00000000000000022

The simple model gives R² = 0.8108, meaning flight volume explains about 81% of the variation in arrival delay. The slope is 12.42, so for each additional arriving flight, total delay goes up by around 12.4 minutes. That number seems high. It is probably picking up some of the effect that actually belongs to other delay causes, which will become clearer once those are added to the model.

Variable Selection

Variable 1: `weather_delay`

Weather is an obvious cause of delays. Storms, fog, and ice can disrupt an entire airport’s schedule. Looking at Week 8’s grouped bar chart, weather was actually the smallest contributor on average across airlines, but that doesn’t mean it’s unimportant. It’s worth including to separate the weather effect from the congestion effect already captured by arr_flights.

Multicollinearity concern: weather_delay is technically a part of arr_delay, so there is some built-in correlation between this predictor and the response. That’s worth keeping in mind when interpreting the results, but since weather and flight volume are measuring different things, it still makes sense to include it.

Decision: Include.

Variable 2: `late_aircraft_delay`

This variable captures the ripple effect of delays. When a plane arrives late, the next flight scheduled on that same aircraft is also delayed. This is different from congestion or weather; it’s purely an operational issue. Week 8’s bar chart showed that late aircraft delays were the biggest contributor to total delays across almost every airline, so leaving this out of the model would be a big omission.

Multicollinearity concern: Like weather_delay, this is also a component of arr_delay. Including multiple sub-components of the response variable will likely inflate R², which is something to flag when interpreting model fit.

Decision: Include.

Variable 3: Interaction Term: `arr_flights x weather_delay`

The idea here is that weather might hit harder at busier airports. A storm at a major hub with thousands of daily flights will cause far more disruption than the same storm at a small regional airport, simply because there are more flights to affect. An interaction term between arr_flights and weather_delay lets the model test whether that’s actually happening in the data.

Decision: Include. This is the interaction term the task asks us to try, and the reasoning behind it is grounded in how airports actually work.

The final model has four terms: arr_flights, weather_delay, late_aircraft_delay, and arr_flights:weather_delay.

Extended Regression Model

model2 <- lm(arr_delay ~ arr_flights + weather_delay + 
               late_aircraft_delay + arr_flights:weather_delay,
             data = Airline_Delay_Post_COVID_2021_2023)

summary(model2)

## 
## Call:
## lm(formula = arr_delay ~ arr_flights + weather_delay + late_aircraft_delay + 
##     arr_flights:weather_delay, data = Airline_Delay_Post_COVID_2021_2023)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32169   -315    -95    148  93818 
## 
## Coefficients:
##                              Estimate  Std. Error t value             Pr(>|t|)
## (Intercept)               71.09033454 10.82399543   6.568      0.0000000000516
## arr_flights                2.98747442  0.02224267 134.313 < 0.0000000000000002
## weather_delay              2.47500049  0.01645989 150.366 < 0.0000000000000002
## late_aircraft_delay        1.61090709  0.00379622 424.345 < 0.0000000000000002
## arr_flights:weather_delay  0.00003137  0.00000265  11.839 < 0.0000000000000002
##                              
## (Intercept)               ***
## arr_flights               ***
## weather_delay             ***
## late_aircraft_delay       ***
## arr_flights:weather_delay ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2074 on 44872 degrees of freedom
##   (34 observations deleted due to missingness)
## Multiple R-squared:  0.9713, Adjusted R-squared:  0.9713 
## F-statistic: 3.795e+05 on 4 and 44872 DF,  p-value: < 0.00000000000000022

What the Coefficients Tell Us

Intercept (70.9 minutes): When all predictors are zero, the model predicts a baseline delay of about 71 minutes. This likely reflects things like maintenance and staffing delays that exist regardless of flight volume.

arr_flights (2.987): This dropped a lot from the simple model, from 12.42 down to 2.99. That tells us the Week 8 slope was inflated because it was absorbing some of the effect that actually belongs to weather and late aircraft delays. Now that those are controlled for, the congestion effect looks more like 3 minutes per flight.

weather_delay (2.475): Each additional minute of weather delay is associated with about 2.5 more minutes of total arrival delay. Interestingly, this is a larger per-minute effect than late aircraft delay, which is a bit surprising given that weather was the smallest bar in Week 8’s chart. The difference is that weather events are less common but more disruptive when they do occur, probably because they hit multiple flights at once rather than one at a time.

late_aircraft_delay (1.611): Each additional minute of late aircraft delay adds about 1.6 minutes to total arrival delay. This makes sense. Late arrivals cause late departures, so there’s a multiplier effect built into airline scheduling.

Interaction arr_flights:weather_delay (0.00003137): This is highly statistically significant (p < 0.0000000000000002), which confirms that weather is slightly worse at busier airports. But the coefficient is tiny. The actual amplification effect is so small it wouldn’t really change any operational decisions. It’s real, but it’s not a big deal in practice.

R² = 0.9713: The model now explains about 97% of the variance in arrival delay, up from 81%. That is a large increase. However, this value should be interpreted with caution because weather_delay and late_aircraft_delay are components of arr_delay. This means the model is partially using pieces of the response to predict itself, which mechanically inflates R². As a result, the model functions more as a decomposition of delay sources than a true predictive model, and the reported fit likely overstates its performance on independent data.

R² Comparison

cat("Simple model R²:", round(summary(lm_model)$r.squared, 4), "\n")

## Simple model R²: 0.8108

cat("Extended model R²:", round(summary(model2)$r.squared, 4), "\n")

## Extended model R²: 0.9713

The model went from explaining 81% to 97% of the variance, a 16 percentage point improvement. As noted above, the structural overlap between the predictors and the response means this figure reflects how well the model fits the existing data rather than its true predictive ability.

Model Diagnostics

The five standard diagnostic plots are used to check whether the regression assumptions hold.

par(mfrow = c(2, 3))
plot(model2, which = 1:5)

Plot 1: Residuals vs Fitted: Linearity

What this checks: If the residuals are randomly scattered around zero with no clear pattern, the linearity assumption is met. A curve or systematic trend would suggest the relationship isn’t actually linear.

What the plot shows: For the bulk of the data, the residuals sit close to zero and the smooth line is fairly flat, which is good. The linearity assumption seems reasonable for most observations. However, three points (observations 34665, 32955, and 36381) sit far above the rest of the data, stretching the scale of the plot. These are extreme outliers that the model isn’t capturing well. They’re consistent with the long right tail in the arr_delay distribution from Week 8.

Severity: Mild for the main body of data. The deviation is coming from a small number of extreme cases, not a systematic pattern.

Confidence in assumption: Moderate. Linearity holds for most of the data but breaks down at the extremes.

Plot 2: Normal Q-Q: Normality of Residuals

What this checks: If the residuals follow a normal distribution, the points should fall along the 45° reference line. Departures, especially at the tails, indicate non-normality.

What the plot shows: Through the middle of the distribution, the points follow the reference line reasonably well. But the upper tail is a serious problem. Observations 34665, 32955, and 36381 have standardized residuals of around 45, 33, and 28 respectively. That’s an extremely heavy upper tail. The lower tail deviates slightly as well, but it’s much less dramatic. These are the same three observations flagged in Plot 1.

Severity: Substantial in the upper tail. With about 44,900 observations, the Central Limit Theorem means the coefficient estimates and p-values are still approximately valid. But prediction intervals at high delay values would not be trustworthy.

Confidence in assumption: Low-to-moderate. The assumption is okay for inference on the coefficients at this sample size, but the tail behavior is genuinely severe.

Plot 3: Scale-Location: Homoscedasticity

What this checks: If the spread of residuals stays roughly constant across fitted values, the constant variance assumption is met. A funnel or upward slope means variance is increasing (heteroscedasticity).

What the plot shows: The smooth line slopes upward clearly from left to right. Residual spread increases as fitted values get larger. This confirms heteroscedasticity. The same three extreme observations appear in the upper right corner. This pattern makes sense given what arr_delay actually measures: it’s a cumulative total of delay minutes, so larger totals naturally come with more variability. Small delays are predictable; large disruption events are not.

Severity: Moderate. The coefficient estimates are still unbiased under heteroscedasticity, but the standard errors are likely underestimated, which means the p-values are probably more significant-looking than they should be. The statistical significance of every predictor should be treated as approximate rather than exact.

Confidence in assumption: Low. The violation is clear and consistent across the range of fitted values.

Plot 4: Cook’s Distance: Influence

What this checks: Cook’s Distance measures how much each observation influences the overall model. It combines both the size of the residual and the leverage of the observation. Values above 1.0 are generally considered concerning, and values above 0.5 deserve attention.

What the plot shows: Observation 4983 has a Cook’s Distance of approximately 4.0, which is well above the threshold of 1.0. Observations 33720 and 34665 also come in above 1.0 at around 1.5 and 1.3 respectively. These are genuinely influential observations. If any of these were removed, the model’s coefficients would shift noticeably. Observation 4983 is particularly interesting because it doesn’t stand out in the residual plots. It is influential not because of a huge residual but because of its position in the predictor space (high leverage).

Severity: High. The presence of Cook’s D values well above 1.0 means the model results are not fully stable across all observations.

Confidence in assumption: Low. These observations need to be investigated before drawing firm conclusions from the model.

Plot 5: Residuals vs Leverage: Leverage

What this checks: Leverage measures how unusual an observation’s predictor values are. High leverage on its own isn’t necessarily a problem, but combined with a large residual it can be very influential. This plot also shows Cook’s distance boundary lines to help identify the most problematic points.

What the plot shows: Most observations have very low leverage, which is expected with over 44,000 data points. Observation 4983 stands out at leverage of approximately 0.12, much further right than everything else, and falls outside the Cook’s distance dashed lines. This confirms it as both high-leverage and influential. In contrast, observation 34665 has low leverage but an extremely large residual (around 45 standardized units). These two cases represent different problems: 4983 is influential because of where it sits in predictor space, while 34665 is an outlier purely in terms of how poorly the model predicts it.

This is a useful illustration of the difference between leverage and influence. High leverage doesn’t always mean high influence, and a large residual doesn’t always mean high leverage. But when both occur together, as with 4983, that’s the most concerning situation.

Severity: High for observation 4983 specifically.

Confidence in assumption: Low for model stability given observation 4983’s Cook’s Distance of approximately 4.

Overall Assessment

# Summary table of diagnostic findings
assumption_summary <- data.frame(
  Assumption = c("Linearity", "Normality", "Homoscedasticity", 
                 "No influential observations", "Leverage"),
  Diagnostic = c("Residuals vs Fitted", "Q-Q Plot", "Scale-Location", 
                 "Cook's Distance", "Residuals vs Leverage"),
  Verdict = c("Mostly satisfied; extreme observations deviate at upper end",
              "Substantially violated in upper tail; obs. 34665, 32955, 36381",
              "Violated; variance increases with fitted value",
              "Violated; obs. 4983 has Cook's D approx 4, others exceed 1.0",
              "Obs. 4983 is high-leverage and outside Cook's distance boundary"),
  Severity = c("Mild to Moderate", "Substantial", "Moderate", "High", "High")
)

knitr::kable(assumption_summary)

Assumption	Diagnostic	Verdict	Severity
Linearity	Residuals vs Fitted	Mostly satisfied; extreme observations deviate at upper end	Mild to Moderate
Normality	Q-Q Plot	Substantially violated in upper tail; obs. 34665, 32955, 36381	Substantial
Homoscedasticity	Scale-Location	Violated; variance increases with fitted value	Moderate
No influential observations	Cook’s Distance	Violated; obs. 4983 has Cook’s D approx 4, others exceed 1.0	High
Leverage	Residuals vs Leverage	Obs. 4983 is high-leverage and outside Cook’s distance boundary	High

The extended model explains much more variance than the simple model (97% vs 81%), and all four predictors are statistically significant. That said, the diagnostics reveal some real issues. Heteroscedasticity is present throughout, the upper tail of the residuals is far from normal, and three observations (especially 4983) are exerting disproportionate influence on the model. The results should be treated as useful and informative, but not definitive, until those observations are looked into and the variance issue is addressed.

Insights and Significance

1. The simple model’s slope was inflated due to missing variables. The arr_flights coefficient dropped from 12.42 to 2.99 once weather and late aircraft delays were added. The Week 8 model was giving too much credit to congestion. A chunk of that effect actually belongs to weather and operational delays. The extended model gives a cleaner picture of what each factor contributes.

2. Weather is more disruptive per minute than late aircraft delay, even though it contributes less total volume. The weather delay coefficient (2.475) is larger than the late aircraft coefficient (1.611) on a per-minute basis. This seems counterintuitive given Week 8’s bar chart, but it makes sense: weather events affect many flights at once, while late aircraft delays cascade one flight at a time. Weather is rarer but hits harder when it does occur.

3. The interaction between congestion and weather is real but practically small. The interaction term is statistically significant, meaning busier airports do see slightly worse weather effects. But the coefficient is so small (0.00003137) that it wouldn’t meaningfully change any real-world decisions. The large sample size is probably making this show up as significant even though the effect size is tiny.

4. Observation 4983 is a red flag. A Cook’s Distance of approximately 4 means this observation has a lot of pull over the model’s coefficients. It doesn’t stand out in the residual plots, which means it’s influential because of its unusual predictor values rather than a large error. It’s worth identifying what carrier, airport, and month this corresponds to. It could be a legitimate outlier event or it could point to a data issue.

Further Questions

What is observation 4983? Identifying the specific carrier, airport, and month for this record is the most important next step. If it’s a legitimate extreme event, it might be worth modeling disruption events separately from normal operations.
Would a log transformation of arr_delay fix the heteroscedasticity? Given how right-skewed the distribution is, this seems like the most straightforward thing to try next.
Do results hold across individual airports? This model pools everything together. Some airports might behave very differently from others, and grouping by airport could reveal patterns that are currently hidden.
Are there seasonal effects? Adding month as a categorical variable would help separate winter weather effects from summer travel demand, which are probably both contributing to the residual variance right now.