Introduction

This data dive builds and evaluates a generalized linear model to predict whether a flight arrives late. The goal is not only to fit a model, but also to diagnose its limitations, interpret its coefficients, and understand potential issues that may affect its reliability.


Model Construction

Response Variable: late (binary: TRUE when arr_delay > 0)

Model Type: Logistic Regression

Explanatory Variables:

  • dep_delay
  • distance
  • air_time

library(tidyverse)
library(nycflights13)
df <- flights |>
  filter(!is.na(arr_delay), !is.na(dep_delay),
         !is.na(distance), !is.na(air_time)) |>
  mutate(late = arr_delay > 0)

model_glm <- glm(late ~ dep_delay + distance + air_time,
                 data = df,
                 family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_glm)
## 
## Call:
## glm(formula = late ~ dep_delay + distance + air_time, family = binomial, 
##     data = df)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.294e+00  1.218e-02  -188.3   <2e-16 ***
## dep_delay    1.362e-01  6.114e-04   222.9   <2e-16 ***
## distance    -1.095e-02  6.109e-05  -179.3   <2e-16 ***
## air_time     8.447e-02  4.689e-04   180.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 442236  on 327345  degrees of freedom
## Residual deviance: 255133  on 327342  degrees of freedom
## AIC: 255141
## 
## Number of Fisher Scoring iterations: 7

Model Diagnostics

df |>
  ggplot(aes(x = dep_delay, y = as.numeric(late))) +
  geom_jitter(alpha = 0.1) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(
    title = "Probability of Late Arrival vs Departure Delay",
    x = "Departure Delay",
    y = "Probability of Late Arrival"
  ) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

The plot overlays a logistic curve fitted to departure delay alone. The curve rises steeply: as departure delay increases, the fitted probability of a late arrival quickly approaches 1, indicating a strong predictive relationship.

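However, glm() warns that fitted probabilities are numerically 0 or 1 for some observations. This means the linear predictor is so extreme for those flights that the fitted probability cannot be distinguished from 0 or 1 in floating point. The warning can be a symptom of complete or quasi-complete separation, but the small standard errors and convergence in 7 Fisher scoring iterations suggest the estimates themselves are stable here. A more likely cause is flights with very long departure delays, for which a late arrival is almost certain. Either way, it signals that predictions lean heavily on a single predictor.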
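
The warning can be reproduced by hand from the coefficients printed in the summary. The sketch below plugs them into the inverse logit for a hypothetical flight (the 1,000-mile distance and 150-minute air time are illustrative assumptions, not values taken from the data) and compares each probability with the tolerance glm.fit uses for this warning, 10 * .Machine$double.eps.

```r
# Coefficients copied (rounded) from the summary(model_glm) output.
b <- c(intercept = -2.294, dep_delay = 0.1362,
       distance = -0.01095, air_time = 0.08447)

# Hypothetical flight: distance and air time are illustrative assumptions.
distance  <- 1000                     # miles
air_time  <- 150                      # minutes
dep_delay <- c(0, 30, 60, 120, 300)   # minutes

# Linear predictor (log-odds) and fitted probability via the inverse logit.
eta <- b[["intercept"]] + b[["dep_delay"]] * dep_delay +
  b[["distance"]] * distance + b[["air_time"]] * air_time
p_late <- plogis(eta)

# Tolerance used by glm.fit when deciding whether to issue the warning.
eps <- 10 * .Machine$double.eps
data.frame(dep_delay, eta, p_late, flagged = p_late > 1 - eps)
```

Under these assumptions only the 300-minute delay is flagged, which would be consistent with the warning coming from flights with very long departure delays rather than from unstable estimates.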


Issues with the Model

Several issues are present in this model:

  1. Near Separation
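    The warning indicates that some predicted probabilities are indistinguishable from 0 or 1, so the model is almost perfectly predicting outcomes for certain observations. When this reflects true (quasi-)separation, coefficient estimates can become unstable. The small standard errors in the summary suggest that has not happened here, but it is worth checking.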

  2. Dominance of a Single Predictor
    Departure delay has the largest z-statistic in the model, and the diagnostic plot shows that it alone nearly determines the outcome for long delays. This may limit how much the other predictors add.

  3. Multicollinearity
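    Distance and air time are closely related (longer routes take longer to fly), which introduces redundancy and makes their individual coefficients hard to interpret. The opposite signs of the two estimates are consistent with this.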

  4. Missing Important Variables
    The model does not include factors such as weather, carrier, or airport conditions, which are likely to influence arrival delays.
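
The multicollinearity concern in item 3 can be checked with variance inflation factors. The sketch below is a minimal base-R version; vif_manual is a hypothetical helper name, and the data are simulated stand-ins (air time generated as roughly proportional to distance), not the flights data.

```r
# Hypothetical helper: variance inflation factor for each predictor,
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on the remaining predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated stand-in data (an assumption, not the flights data):
# air time is roughly proportional to distance plus noise.
set.seed(1)
n <- 1000
sim <- data.frame(
  dep_delay = rnorm(n, mean = 10, sd = 30),
  distance  = runif(n, min = 100, max = 2500)
)
sim$air_time <- sim$distance / 8 + rnorm(n, mean = 20, sd = 10)

vifs <- vif_manual(sim, c("dep_delay", "distance", "air_time"))
round(vifs, 1)
```

On the report's data the same helper could be called as vif_manual(df, c("dep_delay", "distance", "air_time")). Values well above the common rule-of-thumb thresholds of 5 or 10 for distance and air_time would support the concern.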


Interpretation of a Coefficient

The coefficient for dep_delay is positive and highly significant. Its odds ratio is exp(0.1362) ≈ 1.146, meaning that, holding distance and air time constant, each additional minute of departure delay multiplies the odds of arriving late by about 1.146, an increase of about 14.6%.

This shows that even small increases in departure delay have a substantial impact on the likelihood of arriving late, making it the most important predictor in the model.
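
The odds-ratio arithmetic can be written out as a short worked example. It uses the dep_delay estimate rounded from the summary output; the 10-minute case is an added illustration of how the per-minute effect compounds.

```r
# Coefficient for dep_delay, rounded from the summary output.
beta_dep <- 0.1362

# Odds ratio for one extra minute of departure delay.
or_1min <- exp(beta_dep)

# Odds ratios compound multiplicatively: k extra minutes multiply
# the odds by exp(k * beta), not by k times the one-minute increase.
or_10min <- exp(10 * beta_dep)

# Percent change in the odds for each case.
pct_change <- 100 * (c(`1 min` = or_1min, `10 min` = or_10min) - 1)
round(pct_change, 1)
```

With the rounded coefficient, ten extra minutes of departure delay multiply the odds of a late arrival by roughly 3.9, which is why a per-minute effect that looks modest adds up quickly.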


Confidence Interval Insight

coef_summary <- summary(model_glm)$coefficients

beta <- coef_summary["dep_delay", "Estimate"]
se <- coef_summary["dep_delay", "Std. Error"]

lower <- beta - 1.96 * se
upper <- beta + 1.96 * se

exp(c(lower, upper))
## [1] 1.144588 1.147335

The 95% Wald confidence interval for the odds ratio of departure delay is approximately (1.1446, 1.1473). Because this interval lies entirely above 1, we have strong evidence that longer departure delays are associated with higher odds of arriving late.

The interval is very narrow, indicating that the effect is estimated with high precision, largely because of the sample size (over 327,000 flights). A narrow interval reflects the precision of the average effect under this model. It does not by itself show that the relationship is the same for every flight or that the model is correctly specified.
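
The same Wald calculation extends to the other predictors. The sketch below applies it to the estimates and standard errors printed in the summary; wald_or_ci is a hypothetical helper name, and the inputs are the rounded printed values, so the last digits can differ slightly from a calculation on the fitted model.

```r
# Estimates and standard errors copied from the summary output.
est <- c(dep_delay = 1.362e-01, distance = -1.095e-02, air_time = 8.447e-02)
se  <- c(dep_delay = 6.114e-04, distance =  6.109e-05, air_time = 4.689e-04)

# Hypothetical helper: Wald interval on the log-odds scale,
# exponentiated to the odds-ratio scale.
wald_or_ci <- function(est, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  cbind(odds_ratio = exp(est),
        lower = exp(est - z * se),
        upper = exp(est + z * se))
}

ci <- wald_or_ci(est, se)
round(ci, 4)
```

With the fitted model available, exp(confint.default(model_glm)) should give the same kind of Wald interval directly, whereas confint(model_glm) uses profile likelihood instead.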


Insights and Significance

The model indicates that departure delay is the strongest predictor of late arrivals. As departure delay increases, the probability of arriving late rises rapidly, indicating strong delay propagation.

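Distance and air time are also statistically significant, but with a sample this large, significance alone says little about practical importance. Their coefficients are hard to read individually: the two variables are closely related and carry opposite signs, so their contributions largely offset each other. The results are consistent with operational timing at departure mattering more than route characteristics, although the raw coefficients cannot establish this on their own.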
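
One way to put the three terms side by side is to compute each term's contribution to the log-odds for a single flight. The sketch below does this with the rounded coefficients from the summary; the flight itself (10-minute departure delay, 1,000 miles, 150 minutes in the air) is a hypothetical example, not a row from the data.

```r
# Slope coefficients copied (rounded) from the summary output.
b <- c(dep_delay = 0.1362, distance = -0.01095, air_time = 0.08447)

# Hypothetical flight (illustrative values, not taken from the data).
x <- c(dep_delay = 10, distance = 1000, air_time = 150)

# Contribution of each term to the linear predictor (log-odds).
contrib <- b * x[names(b)]

# Distance and air time describe the same route, so also report their net.
route_net <- contrib[["distance"]] + contrib[["air_time"]]
round(c(contrib, route_net = route_net), 3)
```

For this hypothetical flight the distance and air-time terms are individually large and opposite in sign, which is why the two route coefficients are easier to judge together than one at a time.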

This finding is significant because it suggests that improving on-time departures would likely be the most effective way to reduce late arrivals.


Further Questions

  • Do certain airlines perform better at recovering delays than others?
  • How does weather affect the probability of late arrival?
  • Does airport congestion contribute to delays?
  • Would adding categorical variables improve the model?
  • Can we model delay recovery instead of delay occurrence?