This data dive builds and evaluates a generalized linear model to predict whether a flight arrives late. The goal is not only to fit a model, but also to diagnose its limitations, interpret its coefficients, and understand potential issues that may affect its reliability.
Response Variable: late (binary: arrival delay > 0)
Model Type: Logistic Regression
Explanatory Variables: dep_delay, distance, air_time
## Warning: package 'pROC' was built under R version 4.5.3
## Warning: package 'arm' was built under R version 4.5.3
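# Assumes tidyverse, pROC, and arm are already attached, and that flights
# comes from the nycflights13 package (the loading chunk is not shown).
df <- flights |>
  filter(!is.na(arr_delay), !is.na(dep_delay),
         !is.na(distance), !is.na(air_time)) |>
  mutate(late = arr_delay > 0)

# Logistic regression: late arrival as a function of the three predictors
model_glm <- glm(late ~ dep_delay + distance + air_time,
                 data = df,
                 family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_glm)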
##
## Call:
## glm(formula = late ~ dep_delay + distance + air_time, family = binomial,
## data = df)
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.294e+00  1.218e-02  -188.3   <2e-16 ***
## dep_delay    1.362e-01  6.114e-04   222.9   <2e-16 ***
## distance    -1.095e-02  6.109e-05  -179.3   <2e-16 ***
## air_time     8.447e-02  4.689e-04   180.2   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 442236 on 327345 degrees of freedom
## Residual deviance: 255133 on 327342 degrees of freedom
## AIC: 255141
##
## Number of Fisher Scoring iterations: 7
df |>
  ggplot(aes(x = dep_delay, y = as.numeric(late))) +
  geom_jitter(alpha = 0.1) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(
    title = "Probability of Late Arrival vs Departure Delay",
    x = "Departure Delay",
    y = "Probability of Late Arrival"
  ) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
The visualization shows a strong logistic relationship between departure delay and the probability of arriving late. As departure delay increases, the probability of late arrival rapidly approaches 1, indicating a strong predictive relationship.
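However, the model produces a warning that fitted probabilities are numerically 0 or 1 for some observations. The warning appears when the linear predictor is so large in magnitude that the fitted probability cannot be distinguished from 0 or 1 in floating-point arithmetic. Here that is most likely driven by flights with very long departure delays, for which the model is almost certain of a late arrival. This behavior is often described as near-separation, and it can indicate that the model is over-reliant on a single predictor.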
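To make the warning concrete, the sketch below plugs the rounded coefficients from the summary output into the inverse logit for one hypothetical flight. The flight's values (a 300-minute departure delay, 1000 miles, 150 minutes of air time) are illustrative assumptions, not a row from the data.

```r
# Rounded coefficients copied from the model summary.
b <- c(intercept = -2.294, dep_delay = 0.1362,
       distance = -0.01095, air_time = 0.08447)

# Hypothetical flight: 300-minute departure delay, 1000 miles, 150 minutes in the air.
eta <- b[["intercept"]] + b[["dep_delay"]] * 300 +
  b[["distance"]] * 1000 + b[["air_time"]] * 150

# plogis() is the inverse logit. For a linear predictor this large the
# result is within rounding error of 1 in double precision.
p_extreme <- plogis(eta)

print(eta)
print(p_extreme > 1 - 10 * .Machine$double.eps)
```

A linear predictor of about 40 on the log-odds scale is far beyond the range where the logistic curve still changes visibly, which is why such observations are reported as having probability 1.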
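# Fitted probabilities and ROC object (these two calls are reconstructed;
# the original chunk is not shown in the output)
predicted_probs <- predict(model_glm, type = "response")
roc_obj <- roc(df$late, predicted_probs)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
plot(roc_obj,
     main = paste("ROC Curve (AUC =", round(auc(roc_obj), 3), ")"),
     col = "steelblue",
     lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "gray")
The model achieves an AUC of 0.885, indicating strong discriminatory ability: a randomly chosen late flight receives a higher predicted probability than a randomly chosen on-time flight about 88.5% of the time. The curve bows well above the diagonal (the random-chance baseline), confirming that the model is informative. This is a solid result for a model with only three predictors.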
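AUC can be read as a pairwise ranking probability: the chance that a randomly chosen late flight is scored higher than a randomly chosen on-time flight. The sketch below checks this on a small made-up data set (the `late` and `score` vectors are illustrative, not taken from the flights data) and compares it with the equivalent rank-based formula.

```r
# Toy data (made up for illustration): outcome and predicted probability.
late  <- c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
score <- c(0.05, 0.10, 0.30, 0.35, 0.60, 0.70, 0.80, 0.95)

# Compare every late flight's score with every on-time flight's score.
pairs <- outer(score[late], score[!late], FUN = "-")

# AUC = share of pairs where the late flight scores higher (ties count half).
auc_pairs <- mean((pairs > 0) + 0.5 * (pairs == 0))

# Equivalent rank-based (Mann-Whitney) form.
n1 <- sum(late)
n0 <- sum(!late)
auc_rank <- (sum(rank(score)[late]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(c(pairwise = auc_pairs, rank_based = auc_rank))
```

Both calculations give the same value, which is why AUC describes ranking quality rather than the share of flights classified correctly at any particular threshold.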
binnedplot(predicted_probs, residuals(model_glm, type = "response"),
           main = "Binned Residual Plot",
           xlab = "Estimated Probability",
           ylab = "Average Residual")
The binned residual plot reveals that the model's fit is imperfect across the range of predicted probabilities. Ideally, approximately 95% of binned residuals should fall within the gray ±2 standard error bands, with points scattered randomly around zero. However, the plot shows systematic patterns: residuals dip below zero in the low-to-mid probability range and rise above zero at higher probabilities, with many points falling outside the bands entirely. This non-random pattern suggests the model is not capturing some underlying structure in the data, likely because the relationship between the predictors and late arrival is more complex than the logistic model assumes. This is consistent with the near-separation warning observed during model fitting, and the suspected multicollinearity between distance and air_time may also play a part. These diagnostic signals suggest that additional predictors such as weather conditions, carrier, or time of day may be needed to improve model fit.
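# Classify a flight as late when its predicted probability exceeds 0.5
predicted_class <- predicted_probs > 0.5
conf_matrix <- table(Predicted = predicted_class, Actual = df$late)
print(conf_matrix)
##          Actual
## Predicted  FALSE   TRUE
##     FALSE 179157  41135
##     TRUE   15185  91869
# Overall accuracy (call reconstructed from the printed value)
mean(predicted_class == df$late)
## [1] 0.8279496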
Overall accuracy at the 0.5 threshold is about 82.8%. The model is better at identifying on-time flights (correctly classifying ~92% of FALSE cases, its specificity) than late arrivals (~69% of TRUE cases, its sensitivity). The relatively high false negative rate, about 31% of actual late arrivals, means the model misses a meaningful share of them.
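The percentages quoted above follow directly from the four cells of the confusion matrix. The sketch below recomputes them from the printed counts; the variable names (`tn`, `fn`, `fp`, `tp`) are introduced here for illustration only.

```r
# Counts copied from the printed confusion matrix.
tn <- 179157  # predicted on time, actually on time
fn <- 41135   # predicted on time, actually late
fp <- 15185   # predicted late, actually on time
tp <- 91869   # predicted late, actually late

specificity <- tn / (tn + fp)  # share of on-time flights classified correctly
sensitivity <- tp / (tp + fn)  # share of late flights classified correctly
accuracy    <- (tp + tn) / (tp + tn + fp + fn)

print(round(c(specificity = specificity,
              sensitivity = sensitivity,
              accuracy = accuracy), 3))
```

These figures depend on the 0.5 cutoff. Lowering the threshold would trade some specificity for higher sensitivity.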
df |>
  mutate(pred_prob = predicted_probs) |>
  ggplot(aes(x = pred_prob, fill = late)) +
  geom_histogram(bins = 50, alpha = 0.7, position = "identity") +
  labs(title = "Distribution of Predicted Probabilities",
       x = "Predicted Probability of Late Arrival",
       fill = "Actually Late") +
  theme_classic()
This plot vividly illustrates the near-separation problem. Instead of a smooth spread, predictions are heavily clustered near 0 and 1 with a gap in the middle: the model is extremely confident rather than nuanced. On-time flights (pink) pile up near 0, and late flights (teal) spike sharply at 1.0, confirming that the dominance of dep_delay is pushing the model toward near-certain predictions.
Several issues are present in this model:
Near Separation
The warning indicates that some predicted probabilities are extremely close to 0 or 1. This suggests that the model is almost perfectly predicting outcomes for certain observations, which can lead to instability in coefficient estimates.
Dominance of a Single Predictor
Departure delay has a much stronger effect than other variables, which may reduce the usefulness of additional predictors in the model.
Multicollinearity
Distance and air time are closely related variables, which may introduce redundancy and affect coefficient estimates.
Missing Important Variables
The model does not include factors such as weather, carrier, or airport conditions, which are likely to influence arrival delays.
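One common way to quantify the multicollinearity concern is a variance inflation factor (VIF): regress one predictor on the others and compute 1 / (1 - R^2). The sketch below does this on simulated data in which air time is constructed to track distance closely. The simulated values and the assumed relationship are illustrative only and are not estimates from the flights data.

```r
set.seed(1)  # simulated data, not the flights data
n <- 500
distance  <- runif(n, 100, 2500)                    # miles
air_time  <- 20 + distance / 8 + rnorm(n, sd = 10)  # minutes, tied closely to distance
dep_delay <- rnorm(n, mean = 10, sd = 30)           # unrelated to the other two

# VIF for air_time: regress it on the other predictors and use 1 / (1 - R^2).
r2 <- summary(lm(air_time ~ distance + dep_delay))$r.squared
vif_air_time <- 1 / (1 - r2)

print(cor(distance, air_time))
print(vif_air_time)
```

A VIF well above the usual rule-of-thumb cutoffs (5 or 10) signals that the coefficient for that predictor is estimated with inflated variance. Running the same auxiliary regression on df would show how severe the problem is in the actual model.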
The coefficient for dep_delay is positive and highly significant. Its odds ratio is approximately 1.146, meaning that each additional minute of departure delay increases the odds of arriving late by about 14.6%, holding distance and air time constant.
This shows that even small increases in departure delay have a substantial impact on the likelihood of arriving late, making it the most important predictor in the model.
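The odds ratio is the exponentiated coefficient, and because the model is linear on the log-odds scale, odds ratios multiply across minutes. The sketch below uses the rounded estimate from the summary output; the 10-minute comparison is an added illustration.

```r
beta_dep_delay <- 0.1362  # rounded estimate from the summary output

or_1_min  <- exp(beta_dep_delay)       # odds ratio for one extra minute
or_10_min <- exp(10 * beta_dep_delay)  # odds ratio for ten extra minutes

print(round(c(one_minute = or_1_min, ten_minutes = or_10_min), 3))
```

Under this model, ten extra minutes of departure delay multiply the odds of a late arrival by close to four, which shows how quickly the per-minute effect compounds.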
# Wald 95% confidence interval for the dep_delay odds ratio
coef_summary <- summary(model_glm)$coefficients
beta <- coef_summary["dep_delay", "Estimate"]
se <- coef_summary["dep_delay", "Std. Error"]
lower <- beta - 1.96 * se
upper <- beta + 1.96 * se
exp(c(lower, upper))
## [1] 1.144588 1.147335
The 95% confidence interval for the odds ratio of departure delay is approximately (1.1446, 1.1473). Because this interval is entirely above 1, we have strong evidence that departure delay is associated with higher odds of arriving late.
The interval is very narrow, indicating that this effect is estimated with high precision, which is expected given the large sample size. A narrow interval reflects precision of the estimate, not necessarily that the effect is the same for every flight.
The model demonstrates that departure delay is the primary driver of late arrivals. As departure delay increases, the probability of arriving late rises rapidly, indicating strong delay propagation.
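While distance and air time are statistically significant, their per-unit coefficients are smaller, although the units differ (miles and minutes) so the comparison is only rough. This suggests that operational timing at departure matters more than route characteristics.
This finding matters because it suggests that improving on-time departures would likely be the most effective way to reduce late arrivals, although the model is observational and does not by itself establish a causal effect.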