This data dive builds and evaluates a generalized linear model to predict whether a flight arrives late. The goal is not only to fit a model, but also to diagnose its limitations, interpret its coefficients, and understand potential issues that may affect its reliability.
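Response Variable: late (binary: arrival delay > 0)
Model Type: Logistic Regression
Explanatory Variables: dep_delay, distance, air_time

library(tidyverse)
library(nycflights13)  # assumed source of the flights data

# Keep only flights with complete data for the response and all predictors
df <- flights |>
  filter(!is.na(arr_delay), !is.na(dep_delay),
         !is.na(distance), !is.na(air_time)) |>
  mutate(late = arr_delay > 0)

# Logistic regression: late arrival as a function of departure delay,
# distance, and air time
model_glm <- glm(late ~ dep_delay + distance + air_time,
                 data = df,
                 family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model_glm)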
##
## Call:
## glm(formula = late ~ dep_delay + distance + air_time, family = binomial,
## data = df)
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.294e+00  1.218e-02  -188.3   <2e-16 ***
## dep_delay    1.362e-01  6.114e-04   222.9   <2e-16 ***
## distance    -1.095e-02  6.109e-05  -179.3   <2e-16 ***
## air_time     8.447e-02  4.689e-04   180.2   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 442236 on 327345 degrees of freedom
## Residual deviance: 255133 on 327342 degrees of freedom
## AIC: 255141
##
## Number of Fisher Scoring iterations: 7
# Plot the 0/1 outcome against departure delay with a fitted logistic curve
df |>
  ggplot(aes(x = dep_delay, y = as.numeric(late))) +
  geom_jitter(alpha = 0.1) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(
    title = "Probability of Late Arrival vs Departure Delay",
    x = "Departure Delay (minutes)",
    y = "Probability of Late Arrival"
  ) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
The visualization shows a strong logistic relationship between departure delay and the probability of arriving late. As departure delay increases, the probability of late arrival rapidly approaches 1, indicating a strong predictive relationship.
However, the model produces a warning that fitted probabilities are numerically 0 or 1 for some observations. The model is extremely confident in these predictions, most likely because flights with very long departure delays are almost certain to arrive late. This behavior is often described as near-separation (or quasi-complete separation), and it can indicate that the model relies heavily on a single predictor.
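To see why long departure delays can trigger this warning, the following sketch plugs the rounded coefficients from the summary output into the logistic function. The route values (1000 miles, 150 minutes of air time) and the helper name `p_late` are illustrative assumptions, not values from the analysis. The tolerance `10 * .Machine$double.eps` is the one `glm.fit` checks before issuing the warning.

```r
# Rounded coefficients copied from the summary output
b <- c(intercept = -2.294, dep_delay = 0.1362,
       distance = -0.01095, air_time = 0.08447)

# Predicted probability of a late arrival for a hypothetical route
# (the distance and air_time defaults are illustrative assumptions)
p_late <- function(dep_delay, distance = 1000, air_time = 150) {
  eta <- b[["intercept"]] + b[["dep_delay"]] * dep_delay +
    b[["distance"]] * distance + b[["air_time"]] * air_time
  plogis(eta)  # inverse logit
}

delays <- c(0, 30, 60, 120, 300)  # departure delays in minutes
probs <- p_late(delays)
round(probs, 4)

# Tolerance glm.fit uses when deciding whether to warn
eps <- 10 * .Machine$double.eps
delays[probs > 1 - eps]
```

Under these assumptions only the 300-minute delay crosses the tolerance, which suggests the warning is driven by flights with very long departure delays rather than by a failure of the fit.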
Several issues are present in this model:
Near Separation
The warning indicates that some predicted probabilities are extremely
close to 0 or 1. This suggests that the model is almost perfectly
predicting outcomes for certain observations, which can lead to
instability in coefficient estimates.
Dominance of a Single Predictor
Departure delay has a much stronger effect than other variables, which
may reduce the usefulness of additional predictors in the
model.
Multicollinearity
Distance and air time are closely related variables, which may introduce
redundancy and affect coefficient estimates.
Missing Important Variables
The model does not include factors such as weather, carrier, or airport
conditions, which are likely to influence arrival delays.
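The multicollinearity concern can be checked with variance inflation factors (VIFs). The sketch below computes predictor-based VIFs in base R on simulated data, since the flight data is not reproduced here: the data frame `sim`, the helper `vif_for`, and the assumed relationship between distance and air time are all hypothetical. Replacing `sim` with the three predictor columns of `df` would give the values for the actual model.

```r
set.seed(1)
n <- 1000

# Simulated predictors (hypothetical, for illustration only)
distance  <- runif(n, min = 100, max = 2500)        # miles
air_time  <- 20 + distance / 8 + rnorm(n, sd = 10)  # minutes, nearly linear in distance
dep_delay <- rexp(n, rate = 1 / 15) - 5             # minutes, unrelated to the route
sim <- data.frame(dep_delay, distance, air_time)

# VIF for one predictor: regress it on the others and compute 1 / (1 - R^2)
vif_for <- function(var, data) {
  others <- setdiff(names(data), var)
  r2 <- summary(lm(reformulate(others, response = var), data = data))$r.squared
  1 / (1 - r2)
}

vifs <- sapply(names(sim), vif_for, data = sim)
round(vifs, 1)
```

A common rule of thumb treats VIFs above roughly 5 to 10 as a sign of problematic collinearity. In the simulated data, distance and air_time far exceed that range while dep_delay stays near 1.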
The coefficient for dep_delay is positive and highly significant. Its odds ratio is approximately 1.146, meaning that each additional minute of departure delay multiplies the odds of arriving late by about 1.146 (a 14.6% increase), holding distance and air time constant.
This shows that even small increases in departure delay have a substantial impact on the likelihood of arriving late, making it the most important predictor in the model.
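The odds ratio quoted above comes from exponentiating the coefficient, and because the model is linear on the log-odds scale, the effect compounds multiplicatively over several minutes. A small sketch using the rounded estimate from the summary output (the 10-minute comparison is an added illustration):

```r
beta_dep <- 0.1362  # rounded dep_delay estimate from the summary output

or_1min  <- exp(beta_dep)       # odds ratio for one extra minute
or_10min <- exp(10 * beta_dep)  # odds ratio for ten extra minutes

round(c(one_minute = or_1min, ten_minutes = or_10min), 3)
```

Ten extra minutes of departure delay therefore multiply the odds of a late arrival by roughly 3.9, which is the one-minute odds ratio raised to the tenth power. These are changes in odds, not in probability, and they hold distance and air time fixed.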
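# Wald 95% confidence interval for the dep_delay coefficient,
# computed on the log-odds scale and then exponentiated to an odds ratio
coef_summary <- summary(model_glm)$coefficients
beta <- coef_summary["dep_delay", "Estimate"]
se <- coef_summary["dep_delay", "Std. Error"]
lower <- beta - 1.96 * se
upper <- beta + 1.96 * se
exp(c(lower, upper))
## [1] 1.144588 1.147335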
The 95% confidence interval for the odds ratio of departure delay is approximately (1.1446, 1.1473). Because this interval is entirely above 1, we have strong evidence that departure delay increases the likelihood of arriving late.
The interval is very narrow, indicating that this effect is estimated with high precision, largely because of the large sample size (over 327,000 flights). A narrow interval reflects the precision of the estimate; it does not by itself show that the relationship is uniform across all flights.
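The manual calculation above is a Wald interval. As a cross-check, base R's `confint.default()` computes the same kind of interval directly from a fitted model. The sketch below demonstrates the equivalence on a small simulated logistic regression; the data and the names `x`, `y`, and `fit` are hypothetical stand-ins, not the flight model.

```r
set.seed(42)
n <- 2000

# Simulated data (hypothetical): x plays the role of dep_delay
x <- rnorm(n, mean = 10, sd = 20)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.1 * x))
fit <- glm(y ~ x, family = binomial)

# Manual Wald interval on the log-odds scale, then exponentiated
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
manual <- exp(est + c(-1, 1) * qnorm(0.975) * se)

# Same interval from the built-in Wald method
builtin <- exp(confint.default(fit, parm = "x", level = 0.95))

manual
builtin
```

Calling `confint()` on a `glm` object may instead return profile-likelihood intervals, which can differ slightly from the Wald interval, so `confint.default()` is the closer match to the hand calculation.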
The model demonstrates that departure delay is the primary driver of late arrivals. As departure delay increases, the probability of arriving late rises rapidly, indicating strong delay propagation.
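While distance and air time are also statistically significant, their coefficients have opposite signs, and because the two variables are closely related, their contributions partly offset each other. Their individual effects should therefore be interpreted with caution, but the results are consistent with operational timing at departure mattering more than route characteristics.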
This finding is important because it suggests that improving on-time departures would likely be the most effective way to reduce late arrivals.