Data Dive: Multiple Regression & Diagnostics

Introduction

Last week, we found that departure delay alone explains about 83.7% of the variation in arrival delay, with a slope of approximately 1.02 , meaning departure delays are almost fully carried through to arrival. This week, we extend that model by adding up to three more variables, evaluate each candidate carefully, and then rigorously diagnose the final model using five diagnostic plots.

The guiding questions are:

Which additional predictors are theoretically justified and not redundant?
Does adding them improve the model without introducing multicolinearity?
Do the regression assumptions hold? If not, how severe are the violations?

Setup

library(tidyverse)
library(nycflights13)
library(car)   # for vif()

df2 <- flights |>
  filter(!is.na(dep_delay), !is.na(arr_delay), !is.na(air_time))

We reuse df2 from last week (filtering on dep_delay and arr_delay) and additionally drop rows missing air_time, which we will consider as a new predictor. This leaves us with 327346 observations.

Recap: The Simple Linear Regression Model

model_slr <- lm(arr_delay ~ dep_delay, data = df2)
summary(model_slr)

## 
## Call:
## lm(formula = arr_delay ~ dep_delay, data = df2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -107.587  -11.005   -1.883    8.938  201.938 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.8994935  0.0330195  -178.7   <2e-16 ***
## dep_delay    1.0190929  0.0007864  1295.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.03 on 327344 degrees of freedom
## Multiple R-squared:  0.8369, Adjusted R-squared:  0.8369 
## F-statistic: 1.679e+06 on 1 and 327344 DF,  p-value: < 2.2e-16

Recap: Our SLR from last week gave us:

\[\widehat{\text{arr_delay}} = -5.90 + 1.02 \times \text{dep\_delay}\]

The model had R² ≈ 0.837 and a residual standard error of about 18 minutes. While the fit is strong, that 18-minute residual standard error tells us meaningful variation remains unexplained. We now ask: can additional variables reduce that error without causing problems like multicollinearity?

Considering New Variables

Candidate 1: `air_time` (continuous)

Flights that are in the air longer have more opportunity to make up time by flying faster or taking a more direct routing. This “catch-up” effect is well known in aviation operations. If it exists in the data, we would expect a negative relationship between air_time and arrival delay, even after controlling for departure delay.

df2 |>
  sample_n(5000) |>
  ggplot(aes(x = air_time, y = arr_delay)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", color = "steelblue") +
  labs(
    title = "Arrival Delay vs. Air Time",
    x = "Air Time (minutes)",
    y = "Arrival Delay (minutes)"
  ) +
  theme_classic()

cor(df2$air_time, df2$arr_delay)

## [1] -0.03529709

cor(df2$air_time, df2$dep_delay)

## [1] -0.02240508

The scatter plot and correlation confirm a weak negative relationship with arrival delay, consistent with the catch-up hypothesis. Importantly, the correlation between air_time and dep_delay is very small, meaning these two predictors are nearly independent — multicollinearity is not a concern here.

Decision: Include air_time.

Candidate 2: `distance` (continuous)

Distance and air time are closely related, longer flights take more time in the air.

cor(df2$distance, df2$air_time)

## [1] 0.9906496

The correlation between distance and air_time is extremely high (near 1.0). Including both in the same model would cause severe multicollinearity: the coefficient estimates would become unstable, standard errors would inflate dramatically, and the coefficients would be nearly impossible to interpret. Between the two, air_time is the better choice because it directly measures time available for recovery, whereas distance is just a proxy for that.

Decision: Exclude distance due to near-perfect multicollinearity with air_time.

Candidate 3: `peak_hours` (binary variable)

Flights departing during busy windows , early morning (7–9 AM) and evening rush (5–8 PM), may face greater congestion, more gate competition, and cascading delays from earlier flights. We create a binary indicator to capture this.

df2 <- df2 |>
  mutate(
    peak_hours = if_else(
      (dep_time >= 700 & dep_time <= 900) | (dep_time >= 1700 & dep_time <= 2000),
      1L, 0L
    )
  )

df2 |>
  group_by(peak_hours) |>
  summarise(
    n = n(),
    mean_arr_delay = round(mean(arr_delay), 2),
    sd_arr_delay   = round(sd(arr_delay), 2)
  )

## # A tibble: 2 × 4
##   peak_hours      n mean_arr_delay sd_arr_delay
##        <int>  <int>          <dbl>        <dbl>
## 1          0 215352           7.84         45.9
## 2          1 111994           5.08         42.0

df2 |>
  mutate(period = if_else(peak_hours == 1, "Peak", "Off-Peak")) |>
  ggplot(aes(x = period, y = arr_delay)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(-50, 150)) +
  labs(
    title = "Arrival Delay by Departure Period",
    x = "Departure Period",
    y = "Arrival Delay (minutes)"
  ) +
  theme_classic()

t.test(arr_delay ~ peak_hours, data = df2)

## 
##  Welch Two Sample t-test
## 
## data:  arr_delay by peak_hours
## t = 17.278, df = 245093, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  2.448484 3.075048
## sample estimates:
## mean in group 0 mean in group 1 
##        7.840252        5.078486

The t-test confirms a statistically significant difference in mean arrival delay between peak and off-peak departures. The boxplot shows that peak-hour flights tend to arrive later on average. The correlation between peak_hours and dep_delay is also low, so multicollinearity is not an issue.

Decision: Include peak_hours.

Candidate 4: Interaction Term: `dep_delay × peak_hours`

Should the effect of departure delay on arrival delay be different during peak hours? We can check this visually by plotting the two groups separately.

df2 |>
  sample_n(5000) |>
  mutate(period = if_else(peak_hours == 1, "Peak", "Off-Peak")) |>
  ggplot(aes(x = dep_delay, y = arr_delay, color = period)) +
  geom_point(alpha = 0.15) +
  geom_smooth(method = "lm", se = FALSE) +
  coord_cartesian(xlim = c(-20, 120), ylim = c(-60, 180)) +
  scale_color_manual(values = c("Off-Peak" = "steelblue", "Peak" = "firebrick")) +
  labs(
    title = "dep_delay vs. arr_delay by Peak Hour Status",
    x = "Departure Delay (minutes)",
    y = "Arrival Delay (minutes)",
    color = "Period"
  ) +
  theme_classic()

The two regression lines are nearly parallel which means the relationship between departure delay and arrival delay is essentially the same regardless of whether it is a peak hour departure. An interaction term would add a parameter to the model without meaningfully improving fit. The main effect of peak_hours as a simple shift in the intercept is sufficient.

Decision: Exclude the interaction term, parallel slopes do not justify the added complexity.

The Final Model

Based on the reasoning above, our final model includes three terms:

dep_delay — the core predictor from last week
air_time — captures airborne delay recovery
peak_hours — captures congestion-related intercept shifts

model_final <- lm(arr_delay ~ dep_delay + air_time + peak_hours, data = df2)
summary(model_final)

## 
## Call:
## lm(formula = arr_delay ~ dep_delay + air_time + peak_hours, data = df2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -107.515  -10.962   -1.778    8.809  203.145 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -4.7685732  0.0635192  -75.073  < 2e-16 ***
## dep_delay    1.0186537  0.0007864 1295.402  < 2e-16 ***
## air_time    -0.0069626  0.0003373  -20.643  < 2e-16 ***
## peak_hours  -0.2228167  0.0666180   -3.345 0.000824 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.02 on 327342 degrees of freedom
## Multiple R-squared:  0.8371, Adjusted R-squared:  0.8371 
## F-statistic: 5.607e+05 on 3 and 327342 DF,  p-value: < 2.2e-16

Variance Inflation Factors

vif(model_final)

##  dep_delay   air_time peak_hours 
##   1.001203   1.007210   1.007506

All VIF values are close to 1, confirming that multicollinearity is not a problem. We can trust the coefficient estimates and their standard errors.

Coefficient Interpretation

The fitted equation is:

\[\widehat{\text{arr_delay}} = \beta_0 + 1.02 \cdot \text{dep\_delay} - (\text{coef}) \cdot \text{air\_time} + (\text{coef}) \cdot \text{peak\_hours}\]

dep_delay: Each additional minute of departure delay adds approximately 1.02 minutes to arrival delay — almost the same as last week, which is reassuring.
air_time: A negative coefficient confirms that longer flights tend to recover some delay. This supports the catch-up hypothesis.
peak_hours: A positive coefficient means peak-hour departures arrive later on average, even after controlling for departure delay and air time.

The R² increased slightly compared to last week’s SLR, and the residual standard error decreased. The new predictors add modest but meaningful explanatory power.

Model Diagnostics

Regression relies on four core assumptions — Linearity, Independence, Constant Variance (Homoscedasticity), and Normality of residuals. We evaluate each using five diagnostic plots.

diag_data <- data.frame(
  fitted    = fitted(model_final),
  residuals = residuals(model_final),
  std_resid = rstandard(model_final),
  leverage  = hatvalues(model_final),
  cooks_d   = cooks.distance(model_final)
) |>
  mutate(obs = row_number())

Plot 1: Residuals vs. Fitted Values

This is the primary check for both linearity and homoscedasticity (constant variance). Ideally, residuals should be randomly scattered around zero with no pattern.

diag_data |>
  sample_n(5000) |>
  ggplot(aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.1) +
  geom_hline(yintercept = 0, color = "firebrick", linetype = "dashed") +
  geom_smooth(method = "loess", color = "steelblue", se = FALSE) +
  labs(
    title = "Plot 1: Residuals vs. Fitted Values",
    subtitle = "Checks linearity and homoscedasticity",
    x = "Fitted Values",
    y = "Residuals"
  ) +
  theme_classic()

Linearity: The LOESS smoother is approximately flat and centered near zero for the bulk of the data. However, at higher fitted values the smoother begins to drift upward, suggesting the model slightly underestimates arrival delay for severely delayed flights. This is a mild non-linearity, not severe enough to invalidate the model for typical flights, but worth noting.

Homoscedasticity: There is a clear fan shape, where residual spread increases as fitted values increase. This is heteroscedasticity, a real violation of the constant variance assumption. Standard errors and p-values are technically invalid as a result, though the coefficient estimates themselves remain unbiased.

Severity: Moderate to high. The fan shape is pronounced and is the most significant issue in this model.

Plot 2: Normal Q-Q Plot

This plot checks whether the residuals follow a normal distribution by comparing their quantiles to theoretical normal quantiles. Points should fall along the diagonal reference line.

ggplot(diag_data, aes(sample = std_resid)) +
  stat_qq(alpha = 0.1) +
  stat_qq_line(color = "firebrick", linetype = "dashed") +
  labs(
    title = "Plot 2: Normal Q-Q Plot",
    subtitle = "Checks normality of standardized residuals",
    x = "Theoretical Quantiles",
    y = "Standardized Residuals"
  ) +
  theme_classic()

The central bulk of the residuals follows the diagonal line reasonably well, which is encouraging. However, both tails — especially the upper tail — deviate significantly. This indicates heavy right skew: the model consistently underpredicts the worst delays.

Severity: Moderate. The normality assumption is violated in the tails. With over 300,000 observations, the Central Limit Theorem provides some reassurance that inference on coefficients is approximately valid. However, prediction intervals for extreme delays will be unreliable.

Further question: Is the heavy tail driven by specific carriers, routes, or weather events? These could be important omitted variables.

Plot 3: Scale-Location Plot

This plot is a second, cleaner check of homoscedasticity. We plot the square root of absolute standardized residuals against fitted values. A flat LOESS line would indicate constant variance.

diag_data |>
  sample_n(5000) |>
  mutate(sqrt_abs_resid = sqrt(abs(std_resid))) |>
  ggplot(aes(x = fitted, y = sqrt_abs_resid)) +
  geom_point(alpha = 0.15) +
  geom_smooth(method = "loess", color = "steelblue", se = FALSE) +
  labs(
    title = "Plot 3: Scale-Location Plot",
    subtitle = "Checks homoscedasticity — a flat line indicates constant variance",
    x = "Fitted Values",
    y = "√|Standardized Residuals|"
  ) +
  theme_classic()

The LOESS line increases from left to right, confirming the heteroscedasticity seen in Plot 1. Residual spread grows as predicted delay increases — the model is more precise for on-time flights and increasingly imprecise for severely delayed ones.

Severity: High. This is the clearest diagnostic signal in our analysis. Any inference from standard errors, confidence intervals, or p-values should be treated with caution. One possible remedy would be to use heteroscedasticity-robust standard errors or to apply a transformation to arr_delay.

Plot 4: Residuals vs. Leverage (Cook’s Distance)

This plot identifies observations that are both unusual (high residual) and highly influential (high leverage) — meaning they have outsized impact on the fitted coefficients. Cook’s Distance combines both of these into a single influence measure.

cooks_threshold <- 4 / nrow(df2)

diag_data |>
  sample_n(5000) |>
  ggplot(aes(x = leverage, y = std_resid)) +
  geom_point(aes(color = cooks_d > cooks_threshold), alpha = 0.3) +
  geom_hline(yintercept = c(-3, 3), color = "firebrick", linetype = "dashed") +
  scale_color_manual(
    values = c("FALSE" = "steelblue", "TRUE" = "red"),
    labels = c("Normal", "High Influence"),
    name = "Cook's D"
  ) +
  labs(
    title = "Plot 4: Residuals vs. Leverage",
    subtitle = "Identifies high-leverage and high-influence observations",
    x = "Leverage",
    y = "Standardized Residuals"
  ) +
  theme_classic()

cat("Observations with Cook's D > 4/n:",
    sum(diag_data$cooks_d > cooks_threshold), "\n")

## Observations with Cook's D > 4/n: 18191

cat("As % of total observations:",
    round(100 * mean(diag_data$cooks_d > cooks_threshold), 2), "%\n")

## As % of total observations: 5.56 %

Most observations cluster at low leverage values, which is typical for a large, well-spread dataset. A small number of points exceed the Cook’s Distance threshold (flagged in red), but they represent a very small fraction of the data. With 300,000+ observations, even influential points are unlikely to meaningfully distort the coefficient estimates.

Severity: Low. We should examine what these flights are — extreme weather diversions, mechanical failures — as they may represent a qualitatively different situation that the model cannot capture.

Plot 5: Histogram of Residuals

This provides an intuitive visual check of the residual distribution and can reveal skewness, outliers, or multimodality that the Q-Q plot might obscure.

ggplot(diag_data, aes(x = residuals)) +
  geom_histogram(bins = 100, fill = "steelblue", color = "white") +
  geom_vline(xintercept = 0, color = "firebrick", linetype = "dashed") +
  labs(
    title = "Plot 5: Histogram of Residuals",
    subtitle = "Checks normality and identifies skewness",
    x = "Residuals (minutes)",
    y = "Count"
  ) +
  theme_classic()

The histogram confirms the pattern from the Q-Q plot: residuals are roughly symmetric around zero for the central mass, but there is a pronounced right tail. This corresponds to flights that arrived much later than the model predicted — the extreme delay events that no amount of departure delay information can fully account for.

Severity: Moderate. The right skew does not invalidate the model for typical flights, but it means the normality assumption is violated for extreme cases. Prediction intervals on the high end of the delay distribution will be too narrow.

Diagnostic Summary

Assumption	Plot(s) Used	Verdict	Severity
Linearity	Plot 1 (Residuals vs. Fitted)	Mild non-linearity at high fitted values	Low–Moderate
Independence	Domain knowledge	Assumed met — flights are largely independent	Low
Homoscedasticity	Plots 1 & 3 (Scale-Location)	Clear fan shape — variance grows with fitted values	High
Normality	Plots 2 & 5 (Q-Q, Histogram)	Moderate violation — heavy right tail	Moderate

Conclusion

Extending the SLR from last week, we added air_time and peak_hours as predictors. Both are theoretically motivated, statistically significant, and free of multicollinearity (all VIF values near 1). We excluded distance due to near-perfect redundancy with air_time, and we excluded the interaction term because the slopes were parallel across peak and off-peak groups.

The model improves on last week’s SLR, but the diagnostic plots reveal important limitations. Heteroscedasticity is the primary concern — residual variance grows with fitted values, which undermines the validity of standard errors and hypothesis tests. The right-skewed residuals point to omitted variables that drive extreme delays, such as weather events, airport congestion, or mechanical issues.

This analysis reinforces a core principle: regression results should never be reported without diagnostics. Had we stopped at the model summary alone, we would have drawn confident conclusions from p-values and confidence intervals that are technically compromised by the constant variance violation.

Further questions for future analysis:

Would a square-root or log transformation of arr_delay stabilize variance?
Would adding a weather severity indicator reduce the extreme right tail?
Is there significant carrier-level variation not captured by the current model?
How do results change using heteroscedasticity-robust standard errors?

Data Dive: Multiple Regression & Diagnostics

Anuska Acharya

2026-03-23

Introduction

Setup

Recap: The Simple Linear Regression Model

Considering New Variables

Candidate 1: `air_time` (continuous)

Candidate 2: `distance` (continuous)

Candidate 3: `peak_hours` (binary variable)

Candidate 4: Interaction Term: `dep_delay × peak_hours`

The Final Model

Variance Inflation Factors

Coefficient Interpretation

Model Diagnostics

Plot 1: Residuals vs. Fitted Values

Plot 2: Normal Q-Q Plot

Plot 3: Scale-Location Plot

Plot 4: Residuals vs. Leverage (Cook’s Distance)

Plot 5: Histogram of Residuals

Diagnostic Summary

Conclusion

Data Dive: Multiple Regression & Diagnostics

Anuska Acharya

2026-03-23

Introduction

Setup

Recap: The Simple Linear Regression Model

Considering New Variables

Candidate 1: air_time (continuous)

Candidate 2: distance (continuous)

Candidate 3: peak_hours (binary variable)

Candidate 4: Interaction Term: dep_delay × peak_hours

The Final Model

Variance Inflation Factors

Coefficient Interpretation

Model Diagnostics

Plot 1: Residuals vs. Fitted Values

Plot 2: Normal Q-Q Plot

Plot 3: Scale-Location Plot

Plot 4: Residuals vs. Leverage (Cook’s Distance)

Plot 5: Histogram of Residuals

Diagnostic Summary

Conclusion

Candidate 1: `air_time` (continuous)

Candidate 2: `distance` (continuous)

Candidate 3: `peak_hours` (binary variable)

Candidate 4: Interaction Term: `dep_delay × peak_hours`