Introduction

This data dive fits a generalized linear model, specifically a logistic regression, to a binary outcome in the nycflights13 dataset. The goal is to understand which factors are associated with a flight arriving late and to interpret the model coefficients in a meaningful, real-world context.

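library(tidyverse)
library(nycflights13)

# Keep flights with complete delay, distance, and air-time data,
# then flag flights that arrived later than scheduled
df <- flights |>
  filter(!is.na(arr_delay), !is.na(dep_delay), !is.na(distance), !is.na(air_time)) |>
  mutate(late = arr_delay > 0)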

Here, we define a binary outcome variable, late, where:

  • late = TRUE (treated as 1 by glm) if arr_delay > 0
  • late = FALSE (treated as 0) otherwise

This is a meaningful outcome because late arrivals directly impact passengers and airline operations.

Explanatory Variables

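For explanatory variables, we select three continuous predictors:

  • dep_delay: departure delay, in minutes
  • distance: distance between airports, in miles
  • air_time: time spent in the air, in minutes

Each is plausibly related to arrival delay: a late departure carries over directly, while distance and air time describe the route and the time spent in flight.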

Logistic Regression Model

model_glm <- glm(late ~ dep_delay + distance + air_time,
                 data = df,
                 family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_glm)
## 
## Call:
## glm(formula = late ~ dep_delay + distance + air_time, family = binomial, 
##     data = df)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.294e+00  1.218e-02  -188.3   <2e-16 ***
## dep_delay    1.362e-01  6.114e-04   222.9   <2e-16 ***
## distance    -1.095e-02  6.109e-05  -179.3   <2e-16 ***
## air_time     8.447e-02  4.689e-04   180.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 442236  on 327345  degrees of freedom
## Residual deviance: 255133  on 327342  degrees of freedom
## AIC: 255141
## 
## Number of Fisher Scoring iterations: 7
exp(coef(model_glm))
## (Intercept)   dep_delay    distance    air_time 
##   0.1009105   1.1459607   0.9891073   1.0881371

Interpretation of Coefficients

The logistic regression model estimates how different variables affect the probability that a flight arrives late.

The coefficient for dep_delay is positive and highly significant. Its odds ratio is approximately 1.146, meaning that, with distance and air time held fixed, each additional minute of departure delay multiplies the odds of arriving late by about 1.146, an increase of about 14.6%. It also has the largest z statistic in the model (222.9), which is consistent with departure delay being the strongest predictor of late arrival.
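Because the model is linear on the log-odds scale, the per-minute odds ratio compounds multiplicatively over longer delays. The following is a rough sketch using the rounded dep_delay estimate from the summary above; the 10- and 30-minute increments are illustrative choices, not values from the analysis.

```r
# Rounded dep_delay coefficient copied from the model summary above
beta_dep_delay <- 0.1362

# Illustrative delay increments in minutes (hypothetical choices)
extra_minutes <- c(1, 10, 30)

# Odds ratio for a k-minute increase is exp(k * beta), i.e. (per-minute OR)^k
odds_ratio <- exp(extra_minutes * beta_dep_delay)
names(odds_ratio) <- paste0(extra_minutes, "_min")
round(odds_ratio, 2)
```

Under these assumptions, a 10-minute increase in departure delay multiplies the odds of a late arrival by roughly 3.9, holding distance and air time fixed.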

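The coefficient for distance has an odds ratio slightly below 1 (≈ 0.989): with departure delay and air time held fixed, each additional mile lowers the odds of arriving late by about 1.1%. This may reflect that longer flights have more opportunity to recover delays during flight. Because distance and air time are closely related, this coefficient should be read together with the one for air_time rather than in isolation.

The coefficient for air_time has an odds ratio of approximately 1.088, meaning that, with distance and departure delay held fixed, each additional minute in the air raises the odds of arriving late by about 8.8%. This may be due to longer exposure to potential disruptions such as weather or air traffic.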

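All coefficients are highly statistically significant, indicating strong evidence that these variables are associated with the probability of late arrival. With 327,346 flights in the model, even small effects are estimated very precisely, so significance alone says little about practical importance.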
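To translate the coefficients above from odds into probabilities, the linear predictor can be passed through the inverse logit. The sketch below uses the rounded estimates from the summary, so it only approximates what predict() would return from the fitted model; the flight described (1,000 miles, 150 minutes in the air) and the helper predict_late are hypothetical, introduced for illustration only.

```r
# Rounded estimates copied from the model summary above
coefs <- c(intercept = -2.294, dep_delay = 0.1362,
           distance = -0.01095, air_time = 0.08447)

# Predicted probability of a late arrival for one flight (hypothetical helper)
predict_late <- function(dep_delay, distance, air_time) {
  eta <- coefs[["intercept"]] +
    coefs[["dep_delay"]] * dep_delay +
    coefs[["distance"]] * distance +
    coefs[["air_time"]] * air_time
  plogis(eta)  # inverse logit: 1 / (1 + exp(-eta))
}

# Hypothetical flight: 1,000 miles, 150 minutes in the air
p_on_time_departure <- predict_late(dep_delay = 0,  distance = 1000, air_time = 150)
p_15_min_late       <- predict_late(dep_delay = 15, distance = 1000, air_time = 150)
round(c(p_on_time_departure, p_15_min_late), 2)
```

For this hypothetical flight, a 15-minute departure delay moves the predicted probability of a late arrival from about 0.36 to about 0.81.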

The model produces a warning that some fitted probabilities are numerically 0 or 1. This means that for some observations the predicted probability cannot be distinguished from 0 or 1 in floating-point arithmetic, most likely flights with very large departure delays. The fit still converged in 7 Fisher scoring iterations with small standard errors, so the warning does not invalidate the model; it indicates that the relationship between departure delay and late arrival is very strong and that the outcome is almost perfectly separated for some observations.
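A quick calculation shows how such probabilities can arise. The sketch below plugs a hypothetical 300-minute departure delay into the rounded coefficients and compares the result with a tolerance of 10 * .Machine$double.eps. That tolerance is my understanding of the check behind the warning and should be treated as an assumption, as should the flight itself.

```r
# Rounded estimates copied from the model summary above
coefs <- c(intercept = -2.294, dep_delay = 0.1362,
           distance = -0.01095, air_time = 0.08447)

# Hypothetical flight: 300-minute departure delay, 1,000 miles, 150 minutes in the air
eta <- coefs[["intercept"]] + coefs[["dep_delay"]] * 300 +
  coefs[["distance"]] * 1000 + coefs[["air_time"]] * 150
p_extreme <- plogis(eta)

# Tolerance assumed to match the check behind the warning (an assumption)
tol <- 10 * .Machine$double.eps

eta                    # linear predictor, about 40 on the log-odds scale
p_extreme > 1 - tol    # TRUE: numerically indistinguishable from 1
```

A log-odds value near 40 corresponds to a probability so close to 1 that double-precision arithmetic rounds it to 1, which is the situation the warning describes.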

Confidence Interval for One Coefficient

coef_summary <- summary(model_glm)$coefficients

beta <- coef_summary["dep_delay", "Estimate"]
se <- coef_summary["dep_delay", "Std. Error"]

lower <- beta - 1.96 * se
upper <- beta + 1.96 * se

c(lower, upper)
## [1] 0.1350451 0.1374416
exp(c(lower, upper))
## [1] 1.144588 1.147335

Interpretation of Confidence Interval

The 95% confidence interval for the dep_delay coefficient, (0.1350, 0.1374), represents the range of plausible values for its effect on the log-odds of arriving late. Because the interval does not include zero, we have strong evidence that departure delay is a significant predictor.

When converted to odds ratios, the interval is (1.1446, 1.1473): each additional minute of departure delay multiplies the odds of arriving late by between about 1.145 and 1.147. This reinforces the conclusion that departure timing plays a critical role in flight performance.

The narrow confidence interval reflects the large sample size and indicates that the average effect of departure delay on late arrival is estimated very precisely. It describes the precision of the estimate, not how much the effect varies from flight to flight.
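The manual calculation above is a Wald interval, and base R's confint.default() computes the same quantity using the exact normal quantile instead of the rounded 1.96. The sketch below checks this on simulated toy data (not the flights data), so that it runs without the nycflights13 package; the sample size, coefficients, and variable names are arbitrary illustrative choices.

```r
set.seed(1)

# Simulated toy data (not the flights data): one predictor, binary outcome
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)

# Manual Wald interval with the exact normal quantile instead of 1.96
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
manual_ci <- est + c(-1, 1) * qnorm(0.975) * se

# Built-in Wald interval for the same coefficient
wald_ci <- confint.default(fit)["x", ]

all.equal(manual_ci, unname(wald_ci))
```

Note that confint() on a glm object returns profile-likelihood intervals rather than Wald intervals, so its output can differ slightly; with a sample as large as the flights data the two should be very close.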

Visualization

df |>
  ggplot(aes(x = dep_delay, y = as.numeric(late))) +
  geom_jitter(alpha = 0.1) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(
    title = "Probability of Late Arrival vs Departure Delay",
    x = "Departure Delay (minutes)",
    y = "Probability of Late Arrival"
  ) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Insights and Significance

The logistic regression model shows that departure delay is the most influential of the three predictors in determining whether a flight arrives late. Even small increases in departure delay substantially raise the probability of late arrival, as shown by the strong positive coefficient and narrow confidence interval. This indicates that delay propagation is a dominant pattern in the data.

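While distance and air time are statistically significant, their per-unit odds ratios (about 0.989 per mile and 1.088 per minute) are closer to 1 than that of departure delay. These per-unit effects are not directly comparable, because the predictors are measured in different units and distance and air time are strongly related to each other. Even so, the size of the departure-delay effect and its z statistic suggest that operational timing at departure plays a more critical role than route characteristics in predicting delays.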

From a practical perspective, this suggests that improving on-time departures would likely have the greatest impact on reducing late arrivals, although the model describes association rather than causation. Airlines and airports may therefore benefit from prioritizing the reduction of delays at departure rather than relying on in-flight recovery.

Further Questions

  • Do certain airlines have a higher probability of recovering delays than others?
  • How does weather influence the probability of a flight arriving late?
  • Does airport congestion contribute to delay propagation?
  • Would including categorical variables such as carrier or origin improve the model?
  • Are there thresholds of departure delay beyond which recovery becomes unlikely?