Introduction

This data dive explores how categorical and continuous variables influence a valuable outcome in the nycflights13 dataset. The response variable selected is arrival delay, which is operationally important for airlines and passengers. We first test whether arrival delay differs across airlines using ANOVA. Then, we build a linear regression model to evaluate whether departure delay predicts arrival delay.

Selecting the Response Variable

Response variable: arr_delay (arrival delay, continuous)

Why valuable?

Arrival delay is one of the most important measures of flight performance. It directly affects passenger satisfaction, missed connections, and operational efficiency.

library(tidyverse)
library(nycflights13)

df <- flights |>
  filter(!is.na(arr_delay))

Categorical Explanatory Variable

Explanatory variable: carrier

Because there are many airlines, we restrict the analysis to the top 6 carriers by number of flights.

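# Identify the six carriers with the most flights
top_carriers <- flights |>
  count(carrier, sort = TRUE) |>
  slice_head(n = 6) |>
  pull(carrier)

# Keep only those carriers and drop flights with no recorded arrival delay
df1 <- flights |>
  filter(carrier %in% top_carriers, !is.na(arr_delay))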

ANOVA Hypothesis

Null Hypothesis:

\[ H_0: \mu_{AA} = \mu_{DL} = \mu_{UA} = \cdots \]

Alternative Hypothesis:

At least one airline has a different mean arrival delay.

anova_model <- aov(arr_delay ~ carrier, data = df1)
summary(anova_model)

##                 Df    Sum Sq Mean Sq F value Pr(>F)    
## carrier          5   8095693 1619139   830.5 <2e-16 ***
## Residuals   267575 521692077    1950                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

df1 |>
  ggplot(aes(x = carrier, y = arr_delay)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(-50, 200)) +
  labs(
    title = "Arrival Delay by Airline",
    x = "Carrier",
    y = "Arrival Delay (minutes)"
  ) +
  theme_classic()

Interpretation

The ANOVA test yields an F-statistic of 830.5 with a p-value less than 2 × 10⁻¹⁶. Because the p-value is far below the conventional alpha level of 0.05, we reject the null hypothesis that all airlines have the same mean arrival delay. This provides strong statistical evidence that at least one airline’s mean arrival delay differs from the others.

The boxplot supports this conclusion by showing visible differences in median arrival delays and variability across carriers. Some airlines exhibit higher central delays and wider spread, suggesting differences in operational performance.
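
The differences visible in the boxplot can also be put into numbers. The sketch below is one way to do that; the helper name `summarise_delay` is hypothetical and not part of the original analysis, and it is demonstrated on a small made-up data frame so that the block runs on its own.

```r
library(dplyr)

# Hypothetical helper: per-carrier sample size, mean, median, and IQR
# of arrival delay (in minutes), sorted by mean delay
summarise_delay <- function(data) {
  data |>
    group_by(carrier) |>
    summarise(
      n = n(),
      mean_delay = mean(arr_delay),
      median_delay = median(arr_delay),
      iqr_delay = IQR(arr_delay),
      .groups = "drop"
    ) |>
    arrange(desc(mean_delay))
}

# Toy data for illustration only (not taken from nycflights13)
toy <- data.frame(
  carrier = c("AA", "AA", "AA", "DL", "DL", "DL"),
  arr_delay = c(-10, 0, 40, -5, 5, 15)
)
toy_summary <- summarise_delay(toy)
print(toy_summary)
```

On the real data, `summarise_delay(df1)` would return one row per carrier, which makes the size of the gaps in minutes easy to read alongside the boxplot.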

However, given the very large sample size, even small differences between airlines can become statistically significant. Therefore, while we can conclude that differences exist, further investigation would be needed to determine whether these differences are practically meaningful for passengers or operational decision-making.

The F-statistic of 830.5 compares between-group variability to within-group variability. A large F-value indicates that the differences between airline means are much larger than would be expected from random variation within airlines, which is why the p-value is extremely small. In real-world terms, airline choice is associated with differences in average arrival delay, which may inform passenger decisions, performance benchmarking, or operational strategy. However, the magnitude of the difference (in minutes) should be examined to determine practical importance.
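
As a check on this reasoning, the F-statistic can be recomputed by hand from the sums of squares and degrees of freedom printed in the ANOVA table. The same numbers give eta-squared, the share of total variation in arrival delay associated with carrier. This effect-size measure is added here for illustration and was not part of the original output; the variable names are arbitrary.

```r
# Values copied from the ANOVA table above
ss_carrier <- 8095693      # between-group sum of squares
ss_resid   <- 521692077    # within-group (residual) sum of squares
df_carrier <- 5
df_resid   <- 267575

# Mean squares: sum of squares divided by degrees of freedom
ms_carrier <- ss_carrier / df_carrier
ms_resid   <- ss_resid / df_resid

# F is the ratio of between-group to within-group variability
f_stat <- ms_carrier / ms_resid

# Eta-squared: proportion of total variation associated with carrier
eta_sq <- ss_carrier / (ss_carrier + ss_resid)

round(f_stat, 1)   # agrees with the reported F value up to rounding
round(eta_sq, 4)
```

Eta-squared comes out at roughly 0.015, meaning carrier accounts for about 1.5% of the variation in arrival delay among these flights. That is consistent with the caution above: a very small p-value does not by itself imply a large practical difference.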

Continuous Explanatory Variable for Regression

Explanatory variable: dep_delay (departure delay)

We expect a roughly linear relationship between departure delay and arrival delay.

df2 <- flights |>
  filter(!is.na(dep_delay), !is.na(arr_delay))

df2 |>
  ggplot(aes(x = dep_delay, y = arr_delay)) +
  geom_point(alpha = 0.2) +
  labs(
    title = "Arrival Delay vs Departure Delay",
    x = "Departure Delay (minutes)",
    y = "Arrival Delay (minutes)"
  ) +
  theme_classic()

Linear Regression Model

model <- lm(arr_delay ~ dep_delay, data = df2)
summary(model)

## 
## Call:
## lm(formula = arr_delay ~ dep_delay, data = df2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -107.587  -11.005   -1.883    8.938  201.938 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.8994935  0.0330195  -178.7   <2e-16 ***
## dep_delay    1.0190929  0.0007864  1295.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.03 on 327344 degrees of freedom
## Multiple R-squared:  0.8369, Adjusted R-squared:  0.8369 
## F-statistic: 1.679e+06 on 1 and 327344 DF,  p-value: < 2.2e-16

Interpretation

The linear regression model estimates the relationship between departure delay and arrival delay. The fitted equation is:

arr_delay = -5.90 + 1.02 × dep_delay

The slope coefficient for departure delay is approximately 1.02 and is highly statistically significant (p-value < 2 × 10⁻¹⁶). This indicates that for every additional minute of departure delay, arrival delay increases by approximately 1.02 minutes on average. In practical terms, departure delays carry through to arrival roughly minute for minute: a longer departure delay is not offset by extra time made up in the air.
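
To see what the fitted equation implies, the sketch below plugs a few departure delays into it. The coefficients are copied from the regression summary above; the helper name `predict_arr_delay` and the chosen delay values are illustrative assumptions.

```r
# Coefficients copied from the regression summary above
intercept <- -5.8994935
slope     <-  1.0190929

# Hypothetical helper: predicted mean arrival delay (minutes)
# for a given departure delay (minutes)
predict_arr_delay <- function(dep_delay) {
  intercept + slope * dep_delay
}

dep_delays <- c(0, 30, 60)
predicted <- predict_arr_delay(dep_delays)
data.frame(dep_delay = dep_delays, predicted_arr_delay = round(predicted, 1))
```

With the fitted `model` object, `predict(model, newdata = data.frame(dep_delay = c(0, 30, 60)))` should give the same values up to rounding.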

The intercept is approximately -5.90. This means that if a flight departs exactly on time (dep_delay = 0), the model predicts an average arrival delay of about -5.9 minutes, that is, an arrival roughly six minutes early. This suggests that flights typically make up a few minutes in the air under normal conditions.

The model fit is very strong, with an R² value of approximately 0.837. About 83.7% of the variability in arrival delay is explained by departure delay alone, which indicates a very strong linear relationship between the two variables.
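
For a regression with a single predictor, R² equals the squared Pearson correlation between the two variables, and the overall F-statistic equals the square of the slope's t value. Both identities can be checked against the output above as a quick consistency test. The symbol r for the correlation is introduced here and does not appear in the original output; it is positive because the slope is positive.

\[ r = \sqrt{R^2} = \sqrt{0.8369} \approx 0.915, \qquad t^2 = 1295.8^2 \approx 1.679 \times 10^6 \approx F \]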

The residual standard error is about 18 minutes, meaning that even after accounting for departure delay, observed arrival delays typically differ from the model's prediction by roughly 18 minutes because of other factors such as weather, air traffic congestion, or airline operational differences.
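
To make the 18-minute figure concrete, the sketch below builds a rough 95% band around a prediction using the reported residual standard error. It assumes approximately normal residuals with constant spread, which the skewed residual quantiles in the output above call into question, so the band is indicative only. The helper name `rough_band` is hypothetical.

```r
# Values copied from the regression summary above
intercept <- -5.8994935
slope     <-  1.0190929
resid_se  <- 18.03

# Hypothetical helper: point prediction plus an approximate band,
# assuming normal residuals (a simplification)
rough_band <- function(dep_delay, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  fit <- intercept + slope * dep_delay
  c(fit = fit, lower = fit - z * resid_se, upper = fit + z * resid_se)
}

band_30 <- rough_band(30)
round(band_30, 1)
```

For a flight leaving 30 minutes late, this gives a band of roughly -11 to 60 minutes around a predicted arrival delay of about 25 minutes. With the fitted `model` object, `predict(model, newdata, interval = "prediction")` is the standard way to obtain such an interval.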

The regression results suggest that minimizing departure delays is critical for reducing arrival delays. Because the slope is near 1, most departure delays propagate directly to arrival. Therefore, operational strategies that improve on-time departures are likely the most effective way to reduce overall arrival delays.

Conclusion

This analysis demonstrates that both categorical and continuous factors influence arrival delay in meaningful ways. The ANOVA results provide strong statistical evidence that mean arrival delay differs across airlines, suggesting that carrier choice is associated with performance differences. However, while these differences are statistically significant, their practical magnitude should be carefully evaluated before drawing strong operational conclusions.

The linear regression model reveals that departure delay is a powerful predictor of arrival delay, explaining approximately 84% of its variation. The estimated slope of about 1.02 indicates that most departure delays carry through to arrival, with minimal recovery time. This suggests that improving on-time departures would likely be the most effective strategy for reducing arrival delays overall.

Together, these findings highlight that while airline differences exist, systemic operational timing plays a dominant role in determining arrival performance. Future analyses could incorporate additional variables such as weather, airport congestion, or route distance to further refine the model and better understand remaining sources of variability.
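
One possible starting point for such an extension is sketched below. It adds route distance and carrier to the regression; the choice of predictors and the helper name `fit_extended_model` are assumptions for illustration, and no results are claimed. The function is exercised on simulated data so that the block runs on its own.

```r
# Hypothetical helper: multiple regression of arrival delay on
# departure delay, route distance, and carrier
fit_extended_model <- function(data) {
  lm(arr_delay ~ dep_delay + distance + carrier, data = data)
}

# Simulated data for illustration only (not real flight records)
set.seed(1)
n <- 200
sim <- data.frame(
  dep_delay = rexp(n, rate = 1 / 20),
  distance  = runif(n, min = 100, max = 2500),
  carrier   = sample(c("AA", "DL", "UA"), n, replace = TRUE)
)
sim$arr_delay <- -5 + sim$dep_delay - 0.002 * sim$distance + rnorm(n, sd = 10)

sim_fit <- fit_extended_model(sim)
names(coef(sim_fit))
```

On the real data, `fit_extended_model(df2)` would work as written because `flights` already contains `distance` and `carrier`. Adding weather would first require a join with the `weather` table that ships with nycflights13.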
