Last week, we found that departure delay alone explains about 83.7% of the variation in arrival delay, with a slope of approximately 1.02 , meaning departure delays are almost fully carried through to arrival. This week, we extend that model by adding up to three more variables, evaluate each candidate carefully, and then rigorously diagnose the final model using five diagnostic plots.
The guiding questions are:
We reuse df2 from last week (filtering on
dep_delay and arr_delay) and additionally drop
rows missing air_time, which we will consider as a new
predictor. This leaves us with 327346 observations.
##
## Call:
## lm(formula = arr_delay ~ dep_delay, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.587 -11.005 -1.883 8.938 201.938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.8994935 0.0330195 -178.7 <2e-16 ***
## dep_delay 1.0190929 0.0007864 1295.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.03 on 327344 degrees of freedom
## Multiple R-squared: 0.8369, Adjusted R-squared: 0.8369
## F-statistic: 1.679e+06 on 1 and 327344 DF, p-value: < 2.2e-16
Recap: Our SLR from last week gave us:
\[\widehat{\text{arr_delay}} = -5.90 + 1.02 \times \text{dep\_delay}\]
The model had R² ≈ 0.837 and a residual standard error of about 18 minutes. While the fit is strong, that 18-minute residual standard error tells us meaningful variation remains unexplained. We now ask: can additional variables reduce that error without causing problems like multicollinearity?
air_time (continuous)Flights that are in the air longer have more opportunity to make up
time by flying faster or taking a more direct routing. This “catch-up”
effect is well known in aviation operations. If it exists in the data,
we would expect a negative relationship between
air_time and arrival delay, even after controlling for
departure delay.
df2 |>
sample_n(5000) |>
ggplot(aes(x = air_time, y = arr_delay)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "lm", color = "steelblue") +
labs(
title = "Arrival Delay vs. Air Time",
x = "Air Time (minutes)",
y = "Arrival Delay (minutes)"
) +
theme_classic()## [1] -0.03529709
## [1] -0.02240508
The scatter plot and correlation confirm a weak
negative relationship with arrival delay, consistent
with the catch-up hypothesis. Importantly, the correlation between
air_time and dep_delay is very small, meaning
these two predictors are nearly independent — multicollinearity is not a
concern here.
Decision: Include air_time.
distance (continuous)Distance and air time are closely related, longer flights take more time in the air.
## [1] 0.9906496
The correlation between distance and
air_time is extremely high (near 1.0). Including both in
the same model would cause severe multicollinearity:
the coefficient estimates would become unstable, standard errors would
inflate dramatically, and the coefficients would be nearly impossible to
interpret. Between the two, air_time is the better choice
because it directly measures time available for recovery, whereas
distance is just a proxy for that.
Decision: Exclude distance due to near-perfect
multicollinearity with air_time.
peak_hours (binary variable)Flights departing during busy windows , early morning (7–9 AM) and evening rush (5–8 PM), may face greater congestion, more gate competition, and cascading delays from earlier flights. We create a binary indicator to capture this.
df2 <- df2 |>
mutate(
peak_hours = if_else(
(dep_time >= 700 & dep_time <= 900) | (dep_time >= 1700 & dep_time <= 2000),
1L, 0L
)
)
df2 |>
group_by(peak_hours) |>
summarise(
n = n(),
mean_arr_delay = round(mean(arr_delay), 2),
sd_arr_delay = round(sd(arr_delay), 2)
)## # A tibble: 2 × 4
## peak_hours n mean_arr_delay sd_arr_delay
## <int> <int> <dbl> <dbl>
## 1 0 215352 7.84 45.9
## 2 1 111994 5.08 42.0
df2 |>
mutate(period = if_else(peak_hours == 1, "Peak", "Off-Peak")) |>
ggplot(aes(x = period, y = arr_delay)) +
geom_boxplot() +
coord_cartesian(ylim = c(-50, 150)) +
labs(
title = "Arrival Delay by Departure Period",
x = "Departure Period",
y = "Arrival Delay (minutes)"
) +
theme_classic()##
## Welch Two Sample t-test
##
## data: arr_delay by peak_hours
## t = 17.278, df = 245093, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 2.448484 3.075048
## sample estimates:
## mean in group 0 mean in group 1
## 7.840252 5.078486
The t-test confirms a statistically significant difference in mean
arrival delay between peak and off-peak departures. The boxplot shows
that peak-hour flights tend to arrive later on average. The correlation
between peak_hours and dep_delay is also low,
so multicollinearity is not an issue.
Decision: Include peak_hours.
dep_delay × peak_hoursShould the effect of departure delay on arrival delay be different during peak hours? We can check this visually by plotting the two groups separately.
df2 |>
sample_n(5000) |>
mutate(period = if_else(peak_hours == 1, "Peak", "Off-Peak")) |>
ggplot(aes(x = dep_delay, y = arr_delay, color = period)) +
geom_point(alpha = 0.15) +
geom_smooth(method = "lm", se = FALSE) +
coord_cartesian(xlim = c(-20, 120), ylim = c(-60, 180)) +
scale_color_manual(values = c("Off-Peak" = "steelblue", "Peak" = "firebrick")) +
labs(
title = "dep_delay vs. arr_delay by Peak Hour Status",
x = "Departure Delay (minutes)",
y = "Arrival Delay (minutes)",
color = "Period"
) +
theme_classic()The two regression lines are nearly parallel which means the
relationship between departure delay and arrival delay is essentially
the same regardless of whether it is a peak hour departure. An
interaction term would add a parameter to the model without meaningfully
improving fit. The main effect of peak_hours as a simple
shift in the intercept is sufficient.
Decision: Exclude the interaction term, parallel slopes do not justify the added complexity.
Based on the reasoning above, our final model includes three terms:
dep_delay — the core predictor from last weekair_time — captures airborne delay recoverypeak_hours — captures congestion-related intercept
shifts##
## Call:
## lm(formula = arr_delay ~ dep_delay + air_time + peak_hours, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.515 -10.962 -1.778 8.809 203.145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.7685732 0.0635192 -75.073 < 2e-16 ***
## dep_delay 1.0186537 0.0007864 1295.402 < 2e-16 ***
## air_time -0.0069626 0.0003373 -20.643 < 2e-16 ***
## peak_hours -0.2228167 0.0666180 -3.345 0.000824 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.02 on 327342 degrees of freedom
## Multiple R-squared: 0.8371, Adjusted R-squared: 0.8371
## F-statistic: 5.607e+05 on 3 and 327342 DF, p-value: < 2.2e-16
## dep_delay air_time peak_hours
## 1.001203 1.007210 1.007506
All VIF values are close to 1, confirming that multicollinearity is not a problem. We can trust the coefficient estimates and their standard errors.
The fitted equation is:
\[\widehat{\text{arr_delay}} = \beta_0 + 1.02 \cdot \text{dep\_delay} - (\text{coef}) \cdot \text{air\_time} + (\text{coef}) \cdot \text{peak\_hours}\]
dep_delay: Each additional minute of
departure delay adds approximately 1.02 minutes to arrival delay —
almost the same as last week, which is reassuring.air_time: A negative coefficient
confirms that longer flights tend to recover some delay. This
supports the catch-up hypothesis.peak_hours: A positive coefficient
means peak-hour departures arrive later on average, even after
controlling for departure delay and air time.The R² increased slightly compared to last week’s SLR, and the residual standard error decreased. The new predictors add modest but meaningful explanatory power.
Regression relies on four core assumptions — Linearity, Independence, Constant Variance (Homoscedasticity), and Normality of residuals. We evaluate each using five diagnostic plots.
diag_data <- data.frame(
fitted = fitted(model_final),
residuals = residuals(model_final),
std_resid = rstandard(model_final),
leverage = hatvalues(model_final),
cooks_d = cooks.distance(model_final)
) |>
mutate(obs = row_number())This is the primary check for both linearity and homoscedasticity (constant variance). Ideally, residuals should be randomly scattered around zero with no pattern.
diag_data |>
sample_n(5000) |>
ggplot(aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.1) +
geom_hline(yintercept = 0, color = "firebrick", linetype = "dashed") +
geom_smooth(method = "loess", color = "steelblue", se = FALSE) +
labs(
title = "Plot 1: Residuals vs. Fitted Values",
subtitle = "Checks linearity and homoscedasticity",
x = "Fitted Values",
y = "Residuals"
) +
theme_classic()Linearity: The LOESS smoother is approximately flat and centered near zero for the bulk of the data. However, at higher fitted values the smoother begins to drift upward, suggesting the model slightly underestimates arrival delay for severely delayed flights. This is a mild non-linearity, not severe enough to invalidate the model for typical flights, but worth noting.
Homoscedasticity: There is a clear fan shape, where residual spread increases as fitted values increase. This is heteroscedasticity, a real violation of the constant variance assumption. Standard errors and p-values are technically invalid as a result, though the coefficient estimates themselves remain unbiased.
Severity: Moderate to high. The fan shape is pronounced and is the most significant issue in this model.
This plot checks whether the residuals follow a normal distribution by comparing their quantiles to theoretical normal quantiles. Points should fall along the diagonal reference line.
ggplot(diag_data, aes(sample = std_resid)) +
stat_qq(alpha = 0.1) +
stat_qq_line(color = "firebrick", linetype = "dashed") +
labs(
title = "Plot 2: Normal Q-Q Plot",
subtitle = "Checks normality of standardized residuals",
x = "Theoretical Quantiles",
y = "Standardized Residuals"
) +
theme_classic()The central bulk of the residuals follows the diagonal line reasonably well, which is encouraging. However, both tails — especially the upper tail — deviate significantly. This indicates heavy right skew: the model consistently underpredicts the worst delays.
Severity: Moderate. The normality assumption is violated in the tails. With over 300,000 observations, the Central Limit Theorem provides some reassurance that inference on coefficients is approximately valid. However, prediction intervals for extreme delays will be unreliable.
Further question: Is the heavy tail driven by specific carriers, routes, or weather events? These could be important omitted variables.
This plot is a second, cleaner check of homoscedasticity. We plot the square root of absolute standardized residuals against fitted values. A flat LOESS line would indicate constant variance.
diag_data |>
sample_n(5000) |>
mutate(sqrt_abs_resid = sqrt(abs(std_resid))) |>
ggplot(aes(x = fitted, y = sqrt_abs_resid)) +
geom_point(alpha = 0.15) +
geom_smooth(method = "loess", color = "steelblue", se = FALSE) +
labs(
title = "Plot 3: Scale-Location Plot",
subtitle = "Checks homoscedasticity — a flat line indicates constant variance",
x = "Fitted Values",
y = "√|Standardized Residuals|"
) +
theme_classic()The LOESS line increases from left to right, confirming the heteroscedasticity seen in Plot 1. Residual spread grows as predicted delay increases — the model is more precise for on-time flights and increasingly imprecise for severely delayed ones.
Severity: High. This is the clearest diagnostic
signal in our analysis. Any inference from standard errors, confidence
intervals, or p-values should be treated with caution. One possible
remedy would be to use heteroscedasticity-robust standard errors or to
apply a transformation to arr_delay.
This plot identifies observations that are both unusual (high residual) and highly influential (high leverage) — meaning they have outsized impact on the fitted coefficients. Cook’s Distance combines both of these into a single influence measure.
cooks_threshold <- 4 / nrow(df2)
diag_data |>
sample_n(5000) |>
ggplot(aes(x = leverage, y = std_resid)) +
geom_point(aes(color = cooks_d > cooks_threshold), alpha = 0.3) +
geom_hline(yintercept = c(-3, 3), color = "firebrick", linetype = "dashed") +
scale_color_manual(
values = c("FALSE" = "steelblue", "TRUE" = "red"),
labels = c("Normal", "High Influence"),
name = "Cook's D"
) +
labs(
title = "Plot 4: Residuals vs. Leverage",
subtitle = "Identifies high-leverage and high-influence observations",
x = "Leverage",
y = "Standardized Residuals"
) +
theme_classic()## Observations with Cook's D > 4/n: 18191
cat("As % of total observations:",
round(100 * mean(diag_data$cooks_d > cooks_threshold), 2), "%\n")## As % of total observations: 5.56 %
Most observations cluster at low leverage values, which is typical for a large, well-spread dataset. A small number of points exceed the Cook’s Distance threshold (flagged in red), but they represent a very small fraction of the data. With 300,000+ observations, even influential points are unlikely to meaningfully distort the coefficient estimates.
Severity: Low. We should examine what these flights are — extreme weather diversions, mechanical failures — as they may represent a qualitatively different situation that the model cannot capture.
This provides an intuitive visual check of the residual distribution and can reveal skewness, outliers, or multimodality that the Q-Q plot might obscure.
ggplot(diag_data, aes(x = residuals)) +
geom_histogram(bins = 100, fill = "steelblue", color = "white") +
geom_vline(xintercept = 0, color = "firebrick", linetype = "dashed") +
labs(
title = "Plot 5: Histogram of Residuals",
subtitle = "Checks normality and identifies skewness",
x = "Residuals (minutes)",
y = "Count"
) +
theme_classic()The histogram confirms the pattern from the Q-Q plot: residuals are roughly symmetric around zero for the central mass, but there is a pronounced right tail. This corresponds to flights that arrived much later than the model predicted — the extreme delay events that no amount of departure delay information can fully account for.
Severity: Moderate. The right skew does not invalidate the model for typical flights, but it means the normality assumption is violated for extreme cases. Prediction intervals on the high end of the delay distribution will be too narrow.
| Assumption | Plot(s) Used | Verdict | Severity |
|---|---|---|---|
| Linearity | Plot 1 (Residuals vs. Fitted) | Mild non-linearity at high fitted values | Low–Moderate |
| Independence | Domain knowledge | Assumed met — flights are largely independent | Low |
| Homoscedasticity | Plots 1 & 3 (Scale-Location) | Clear fan shape — variance grows with fitted values | High |
| Normality | Plots 2 & 5 (Q-Q, Histogram) | Moderate violation — heavy right tail | Moderate |
Extending the SLR from last week, we added air_time and
peak_hours as predictors. Both are theoretically motivated,
statistically significant, and free of multicollinearity (all VIF values
near 1). We excluded distance due to near-perfect
redundancy with air_time, and we excluded the interaction
term because the slopes were parallel across peak and off-peak
groups.
The model improves on last week’s SLR, but the diagnostic plots reveal important limitations. Heteroscedasticity is the primary concern — residual variance grows with fitted values, which undermines the validity of standard errors and hypothesis tests. The right-skewed residuals point to omitted variables that drive extreme delays, such as weather events, airport congestion, or mechanical issues.
This analysis reinforces a core principle: regression results should never be reported without diagnostics. Had we stopped at the model summary alone, we would have drawn confident conclusions from p-values and confidence intervals that are technically compromised by the constant variance violation.
Further questions for future analysis:
arr_delay
stabilize variance?