Data Dive: Model Building and Diagnosis

Introduction

This data dive builds and evaluates a generalized linear model to predict whether a flight arrives late. The goal is not only to fit a model, but also to diagnose its limitations, interpret its coefficients, and understand potential issues that may affect its reliability.

Model Construction

Response Variable: late (binary: TRUE when arrival delay > 0)

Model Type: Logistic Regression

Explanatory Variables

dep_delay
distance
air_time

library(tidyverse)
library(nycflights13)
library(pROC)

library(arm)

df <- flights |>
  filter(!is.na(arr_delay), !is.na(dep_delay),
         !is.na(distance), !is.na(air_time)) |>
  mutate(late = arr_delay > 0)

model_glm <- glm(late ~ dep_delay + distance + air_time,
                 data = df,
                 family = binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model_glm)

## 
## Call:
## glm(formula = late ~ dep_delay + distance + air_time, family = binomial, 
##     data = df)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.294e+00  1.218e-02  -188.3   <2e-16 ***
## dep_delay    1.362e-01  6.114e-04   222.9   <2e-16 ***
## distance    -1.095e-02  6.109e-05  -179.3   <2e-16 ***
## air_time     8.447e-02  4.689e-04   180.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 442236  on 327345  degrees of freedom
## Residual deviance: 255133  on 327342  degrees of freedom
## AIC: 255141
## 
## Number of Fisher Scoring iterations: 7

Model Diagnostics

df |>
  ggplot(aes(x = dep_delay, y = as.numeric(late))) +
  geom_jitter(alpha = 0.1) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(
    title = "Probability of Late Arrival vs Departure Delay",
    x = "Departure Delay",
    y = "Probability of Late Arrival"
  ) +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

The fitted curve shows a steep logistic relationship between departure delay and the probability of arriving late. As departure delay increases, the fitted probability of late arrival rapidly approaches 1, which points to a strong predictive relationship. Note that this curve uses departure delay alone, whereas model_glm also adjusts for distance and air time.

However, fitting produces a warning that fitted probabilities are numerically 0 or 1 for some observations. The model is extremely confident in those predictions, most likely because very long departure delays push the linear predictor far from zero. This behavior is often described as near-separation, and it signals that the model may be over-reliant on a single predictor. The small standard errors in the summary suggest that the coefficient estimates themselves remain stable.
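To see where the warning comes from, the sketch below plugs the rounded coefficients from the summary into the inverse logit. The route values (a 1,000-mile flight with 150 minutes of air time) and the delay values are illustrative assumptions, not rows from the data.

```r
# Rounded coefficients copied from the model summary above
b0 <- -2.294; b_dep <- 0.1362; b_dist <- -0.01095; b_air <- 0.08447

# Hypothetical route, chosen only for illustration
distance <- 1000   # miles
air_time <- 150    # minutes

dep_delay <- c(0, 30, 60, 300)   # minutes of departure delay
eta    <- b0 + b_dep * dep_delay + b_dist * distance + b_air * air_time
p_late <- plogis(eta)            # inverse logit: 1 / (1 + exp(-eta))

data.frame(dep_delay, eta, p_late)

# glm.fit() warns when a fitted probability lies within
# 10 * machine epsilon of 0 or 1
eps <- 10 * .Machine$double.eps
saturated <- p_late > 1 - eps
saturated
```

Under these assumptions only the 300-minute delay is flagged: its linear predictor is around 40, so the probability rounds to exactly 1 in double precision. A handful of very long delays is enough to trigger the warning.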

ROC Curve

predicted_probs <- predict(model_glm, type = "response")
roc_obj <- roc(df$late, predicted_probs)

## Setting levels: control = FALSE, case = TRUE

## Setting direction: controls < cases

plot(roc_obj,
     main = paste("ROC Curve (AUC =", round(auc(roc_obj), 3), ")"),
     col = "steelblue",
     lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "gray")

The model achieves an AUC of 0.885, indicating strong discriminatory ability. An AUC of 0.885 means that if we pick one late flight and one on-time flight at random, the model assigns the late flight the higher predicted probability about 88.5% of the time. The curve bows well above the diagonal (the random-chance baseline), confirming that the model carries real signal. This is a solid result for a model with only three predictors.
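The pairwise reading of AUC can be checked by hand. The sketch below is a minimal illustration on toy scores, not the flights data; auc_rank is a hypothetical helper name that uses the rank-sum (Mann-Whitney) identity, and the second calculation counts correctly ordered pairs directly.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) identity
auc_rank <- function(is_late, score) {
  r     <- rank(score)          # ties receive average ranks
  n_pos <- sum(is_late)
  n_neg <- sum(!is_late)
  (sum(r[is_late]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example, not the flights data
is_late <- c(FALSE, TRUE, FALSE, TRUE)
score   <- c(0.10, 0.20, 0.30, 0.40)

auc_toy <- auc_rank(is_late, score)

# Direct pair count: share of (late, on-time) pairs where the late
# flight has the higher score; ties count as half
pairs     <- expand.grid(pos = score[is_late], neg = score[!is_late])
auc_pairs <- mean((pairs$pos > pairs$neg) + 0.5 * (pairs$pos == pairs$neg))

c(rank_based = auc_toy, pair_based = auc_pairs)
```

Both calculations give 0.75 on the toy data, because three of the four late/on-time pairs are ordered correctly.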

Residual Diagnostics

binnedplot(predicted_probs, residuals(model_glm, type = "response"),
           main = "Binned Residual Plot",
           xlab = "Estimated Probability",
           ylab = "Average Residual")

The binned residual plot shows that the model's fit is imperfect across the range of predicted probabilities. Ideally, about 95% of the binned residuals would fall within the gray ±2 standard error bands, with points scattered randomly around zero. Instead, the plot shows a systematic pattern: residuals dip below zero in the low-to-mid probability range and rise above zero at higher probabilities, and many points fall outside the bands entirely.

This non-random pattern suggests the model is miscalibrated in those ranges and is not capturing some underlying structure in the data, most likely because the relationship between the predictors and late arrival is more complex than a logit that is linear in each predictor. The suspected multicollinearity between distance and air_time is a separate concern: it mainly affects how the two coefficients are estimated rather than producing a pattern like this one. Adding predictors such as weather conditions, carrier, or time of day, or allowing a nonlinear effect of dep_delay, may improve the fit.
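For readers who want to see what sits behind a binned residual plot, here is a rough base-R sketch of the calculation on simulated data. The data-generating process, the choice of 20 bins, and all object names are assumptions made for illustration; the band follows the usual 2 * sd / sqrt(n) rule of thumb and may differ in detail from what binnedplot draws.

```r
set.seed(1)
n <- 5000
x <- rnorm(n)
# Simulated truth with curvature that a logit linear in x cannot capture
y <- rbinom(n, 1, plogis(-1 + x + 0.8 * x^2))
fit <- glm(y ~ x, family = binomial)

p_hat <- fitted(fit)
res   <- y - p_hat   # response residuals: observed minus fitted probability

# 20 roughly equal-count bins along the fitted probability
breaks <- quantile(p_hat, probs = seq(0, 1, length.out = 21))
bins   <- cut(p_hat, breaks = breaks, include.lowest = TRUE)

binned <- data.frame(
  mean_p   = as.vector(tapply(p_hat, bins, mean)),
  mean_res = as.vector(tapply(res, bins, mean)),
  n        = as.vector(table(bins))
)
# Approximate +/- 2 standard error band for each bin's average residual
binned$band    <- 2 * as.vector(tapply(res, bins, sd)) / sqrt(binned$n)
binned$outside <- abs(binned$mean_res) > binned$band

binned
```

In a well-specified model only about one bin in twenty should be flagged in the outside column. A run of consecutive bins with the same sign, as described above, is the signature of a missing nonlinearity or omitted predictor.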

Confusion Matrix

# Classify a flight as late when its predicted probability exceeds 0.5
predicted_class <- predicted_probs > 0.5
conf_matrix <- table(Predicted = predicted_class, Actual = df$late)
print(conf_matrix)

##          Actual
## Predicted  FALSE   TRUE
##     FALSE 179157  41135
##     TRUE   15185  91869

# Accuracy
mean(predicted_class == df$late)

## [1] 0.8279496

The model is better at identifying on-time flights (specificity: ~92% of FALSE cases classified correctly) than late arrivals (sensitivity: ~69% of TRUE cases). The false negative rate of about 31% means that, at the 0.5 threshold, the model misses nearly a third of actual late arrivals.
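The percentages quoted here follow directly from the four cells of the confusion matrix. The short sketch below recomputes them from the printed counts and adds two figures not shown above: precision, and the accuracy of a naive rule that always predicts "on time". The variable names are illustrative.

```r
# Counts copied from the confusion matrix above
tn <- 179157; fn <- 41135   # predicted FALSE: actual FALSE / actual TRUE
fp <- 15185;  tp <- 91869   # predicted TRUE:  actual FALSE / actual TRUE
total <- tn + fn + fp + tp

specificity <- tn / (tn + fp)     # on-time flights classified correctly
sensitivity <- tp / (tp + fn)     # late flights classified correctly
precision   <- tp / (tp + fp)     # predicted-late flights that were late
accuracy    <- (tp + tn) / total
baseline    <- (tn + fp) / total  # accuracy of always predicting "on time"

round(c(specificity = specificity, sensitivity = sensitivity,
        precision = precision, accuracy = accuracy, baseline = baseline), 3)
```

The baseline is roughly 59%, so the model's 82.8% accuracy is a real improvement over guessing the majority class, though less dramatic than the raw accuracy figure suggests on its own.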

Predicted Probability Distribution

df |>
  mutate(pred_prob = predicted_probs) |>
  ggplot(aes(x = pred_prob, fill = late)) +
  geom_histogram(bins = 50, alpha = 0.7, position = "identity") +
  labs(title = "Distribution of Predicted Probabilities",
       x = "Predicted Probability of Late Arrival",
       fill = "Actually Late") +
  theme_classic()

This plot illustrates the near-separation problem. Instead of a smooth spread, predictions are heavily clustered near 0 and 1 with a gap in the middle, so the model is extremely confident rather than nuanced. On-time flights (pink) pile up near 0 and late flights (teal) spike sharply at 1.0, which suggests that the dominance of dep_delay is pushing the model toward near-certain predictions.

Issues with the Model

Several issues are present in this model:

Near Separation: The warning indicates that some fitted probabilities are numerically indistinguishable from 0 or 1. The model is almost perfectly predicting outcomes for those observations, which can destabilize coefficient estimates, although the small standard errors in the summary suggest the estimates here remain stable.
Dominance of a Single Predictor: Departure delay has the largest z value in the model and drives most of the predictions, which may limit how much the other predictors contribute.
Multicollinearity: Distance and air time are closely related, since longer routes take longer to fly. This redundancy makes their individual coefficients hard to interpret, and their opposite signs in the summary are consistent with it.
Missing Important Variables: The model does not include factors such as weather, carrier, or airport conditions, which are likely to influence arrival delays.
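One common way to quantify the multicollinearity concern is a variance inflation factor (VIF), which is 1 / (1 - R^2) from regressing each predictor on the others. The sketch below is an assumption-laden illustration: vif_manual is a hypothetical helper, and the data are simulated stand-ins built so that air time tracks distance, not the flights data.

```r
# Hypothetical helper: VIF of each column from regressing it on the others
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

set.seed(42)
n <- 1000
distance  <- runif(n, 100, 2500)               # simulated miles
air_time  <- distance / 8 + rnorm(n, sd = 10)  # simulated minutes, tied to distance
dep_delay <- rexp(n, rate = 1 / 15) - 5        # simulated, unrelated to the route

sim  <- data.frame(dep_delay, distance, air_time)
vifs <- vif_manual(sim)
round(vifs, 1)
```

In the simulated data the VIFs for distance and air_time are far above the common rule-of-thumb cutoff of 10, while dep_delay stays near 1. Running vif_manual(df[, c("dep_delay", "distance", "air_time")]) on the real data would show whether the same holds for the fitted model's predictors.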

Interpretation of a Coefficient

The coefficient for dep_delay is positive and highly significant. Its odds ratio is approximately 1.146, meaning that each additional minute of departure delay multiplies the odds of arriving late by about 1.146 (a 14.6% increase), holding distance and air time constant.

This shows that even small increases in departure delay have a substantial impact on the likelihood of arriving late, making it the most important predictor in the model.
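Because the model is linear on the log-odds scale, the per-minute odds ratio compounds multiplicatively over longer delays. The sketch below uses the rounded dep_delay estimate from the summary; the delay lengths are arbitrary examples.

```r
beta_dep <- 0.1362             # rounded dep_delay estimate from the summary
minutes  <- c(1, 5, 10, 30)    # example delay lengths, chosen for illustration

# Odds ratio for a k-minute increase is exp(beta * k), i.e. 1.146^k
odds_ratio <- exp(beta_dep * minutes)

data.frame(minutes, odds_ratio = round(odds_ratio, 2))
```

A 10-minute increase in departure delay multiplies the odds of a late arrival by roughly 3.9, other predictors held fixed. These are changes in odds, not probabilities: the change in probability depends on where on the logistic curve a flight starts.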

Confidence Interval Insight

coef_summary <- summary(model_glm)$coefficients

beta <- coef_summary["dep_delay", "Estimate"]
se <- coef_summary["dep_delay", "Std. Error"]

lower <- beta - 1.96 * se
upper <- beta + 1.96 * se

exp(c(lower, upper))

## [1] 1.144588 1.147335

The 95% confidence interval for the odds ratio of departure delay is approximately (1.1446, 1.1473). Because this interval is entirely above 1, we have strong evidence that departure delay increases the likelihood of arriving late.

The interval is very narrow, which reflects the very large sample size: the effect is estimated with high precision. Precision is not the same as good fit, however. A narrow interval says that the average effect of departure delay is pinned down tightly, not that the relationship is equally strong for every flight in the dataset.

Insights and Significance

The model demonstrates that departure delay is the primary driver of late arrivals. As departure delay increases, the probability of arriving late rises rapidly, indicating strong delay propagation.

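Distance and air time are also statistically significant, but because the two are strongly related, their individual coefficients are hard to interpret on their own. Departure delay is the clearest single signal, suggesting that operational timing at departure matters more than route characteristics.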

This finding is significant because it suggests that improving on-time departures would likely be the most effective way to reduce late arrivals.

Further Questions

Do certain airlines perform better at recovering delays than others?
How does weather affect the probability of late arrival?
Does airport congestion contribute to delays?
Would adding categorical variables such as carrier or origin airport improve the model?
Can we model delay recovery instead of delay occurrence?