Modeling Flight Delays with Linear Regression

Motivation: Flight Delays

Airlines and passengers both care about on-time flights.
Question: Do longer flights tend to have longer arrival delays?
We will use a small simulated dataset of flights with:
- distance (in miles)
- seats (number of seats on the plane)
- arr_delay (arrival delay in minutes)
Goal: use linear regression to model the relationship between distance and arrival delay.

The Linear Regression Model

We model arrival delay as a linear function of distance:
\[\mathrm{arr\_delay} = \beta_0 + \beta_1 \cdot \mathrm{distance} + \epsilon\]

where:

\(\beta_0\): intercept (baseline delay when distance = 0)
\(\beta_1\): change in delay (minutes) for each extra mile of distance
\(\epsilon\): random error (other factors: weather, traffic, etc.)

We will estimate \(\beta_0\) and \(\beta_1\) from our flight data.
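As a minimal sketch of what "estimate from the data" means here: for a single predictor, the least-squares slope is the sample covariance of distance and delay divided by the sample variance of distance, and the intercept follows from the two means. The data below are simulated the same way as the flight data in this document; the seed value is an assumption added only so the sketch is reproducible.

```r
# Simulate distance and delay as in this document (the seed is an assumption)
set.seed(1)
n <- 50
distance <- runif(n, min = 200, max = 2500)
arr_delay <- 5 + 0.01 * distance + rnorm(n, mean = 0, sd = 10)

# Closed-form least-squares estimates for simple linear regression
b1 <- cov(distance, arr_delay) / var(distance)  # slope estimate
b0 <- mean(arr_delay) - b1 * mean(distance)     # intercept estimate

# lm() fits the same model, so its coefficients should match b0 and b1
fit <- lm(arr_delay ~ distance)
print(c(b0 = b0, b1 = b1))
print(coef(fit))
```

Both printed pairs should agree up to floating-point error, since `lm()` solves the same least-squares problem.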

Simulated Flight Data in R

Here we create a small dataset of 50 flights:

n <- 50

distance <- runif(n, min = 200, max = 2500) # miles
seats <- sample(c(100, 150, 180, 220), size = n, replace = TRUE)

# True relationship: base 5 min delay + 0.01 min per mile + noise

arr_delay <- 5 + 0.01 * distance + rnorm(n, mean = 0, sd = 10)

flights <- data.frame(distance, seats, arr_delay)

arr_delay is the response (Y)
distance and seats are aviation-related predictors (X)

ggplot #1: Arrival Delay vs Distance

Each point = one flight
The smooth line is the fitted regression line
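The plot itself is not reproduced in this text. A scatter plot of this kind could be drawn roughly as follows, assuming the ggplot2 package is installed. The seed, axis labels, and title are assumptions, so the points will not match the original figure exactly.

```r
library(ggplot2)

# Recreate the simulated flights (the seed is an assumption)
set.seed(1)
n <- 50
distance <- runif(n, min = 200, max = 2500)
seats <- sample(c(100, 150, 180, 220), size = n, replace = TRUE)
arr_delay <- 5 + 0.01 * distance + rnorm(n, mean = 0, sd = 10)
flights <- data.frame(distance, seats, arr_delay)

# One point per flight, plus the least-squares line from method = "lm"
p <- ggplot(flights, aes(x = distance, y = arr_delay)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Distance (miles)",
       y = "Arrival delay (minutes)",
       title = "Arrival Delay vs Distance")
print(p)
```

Setting `se = FALSE` hides the confidence band so only the fitted line is shown.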

Fitting the Regression Model in R

We fit the model:

\[\mathrm{arr\_delay} = \beta_0 + \beta_1 \cdot \mathrm{distance} + \epsilon\]

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	8.1166	2.9785	2.7251	0.0089
distance	0.0088	0.0019	4.5841	0.0000

Important output:

Estimated slope \(\hat{\beta}_1 = 0.0088\): each additional mile is associated with about 0.009 more minutes of arrival delay, on average
p-value for the slope = 0.000033 (printed as 0.0000 in the table): strong evidence of a linear relationship between distance and delay
\(R^2 =\) 0.304: about 30% of the variability in arrival delay is explained by distance
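The coefficient table above has the layout produced by `summary()` on an `lm()` fit. A minimal sketch of how such a table and the three quantities listed above could be obtained is shown below. The seed is an assumption, so the numbers it prints will differ from 0.0088, 0.000033, and 0.304.

```r
# Recreate the simulated flights (the seed is an assumption)
set.seed(1)
n <- 50
distance <- runif(n, min = 200, max = 2500)
seats <- sample(c(100, 150, 180, 220), size = n, replace = TRUE)
arr_delay <- 5 + 0.01 * distance + rnorm(n, mean = 0, sd = 10)
flights <- data.frame(distance, seats, arr_delay)

# Fit arr_delay = beta_0 + beta_1 * distance + error by least squares
fit <- lm(arr_delay ~ distance, data = flights)
s <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(s)
print(round(coef_table, 4))

# Pull out the three quantities discussed in the text
slope     <- coef_table["distance", "Estimate"]
p_value   <- coef_table["distance", "Pr(>|t|)"]
r_squared <- s$r.squared
print(c(slope = slope, p_value = p_value, r_squared = r_squared))
```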

Interpreting the Slope

Suppose the fitted model is:

\[\widehat{\mathrm{arr\_delay}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \mathrm{distance}\]

Interpretation of \(\hat{\beta}_1\):

Estimated change in arrival delay (minutes) for a one-mile increase in distance

Example:

With \(\hat{\beta}_1 = 0.0088\), a 100-mile increase in distance adds about

\[100 \times 0.0088 = 0.88\ \text{minutes}\]

to the expected arrival delay.
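The same arithmetic can be checked against a fitted model: the difference between the predicted delays for two flights 100 miles apart equals 100 times the estimated slope. The sketch below assumes a seed and two hypothetical flight distances (1000 and 1100 miles), so its slope will not be exactly 0.0088.

```r
# Recreate the simulated flights (the seed is an assumption)
set.seed(1)
n <- 50
distance <- runif(n, min = 200, max = 2500)
arr_delay <- 5 + 0.01 * distance + rnorm(n, mean = 0, sd = 10)
flights <- data.frame(distance, arr_delay)

fit <- lm(arr_delay ~ distance, data = flights)

# Two hypothetical flights that differ by 100 miles
new_flights <- data.frame(distance = c(1000, 1100))
pred <- predict(fit, newdata = new_flights)

# The difference in predicted delay equals 100 * estimated slope
diff_100 <- unname(pred[2] - pred[1])
slope_100 <- 100 * unname(coef(fit)["distance"])
print(c(diff_100 = diff_100, slope_100 = slope_100))
```

The choice of 1000 and 1100 miles does not matter: in a straight-line model the difference is the same for any pair of distances 100 miles apart.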

Residuals vs Fitted Values

Residuals help us check whether the linear model is reasonable. The residual for flight \(i\) is the observed delay minus the fitted delay:

\[e_i = y_i - \hat{y}_i\]

Interpreting the Residual Plot

Residuals should be randomly scattered around 0.
A strong pattern (curve or trend) would suggest the linear model may not be adequate.
Points far from 0 may indicate unusual flights (potential outliers).
If the vertical spread grows with fitted values, it could mean non-constant variance.
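A residuals-versus-fitted plot of the kind described above could be produced roughly as follows, assuming the ggplot2 package is installed. The seed and axis labels are assumptions, and `diagnostics` is a helper data frame introduced only for this sketch.

```r
library(ggplot2)

# Recreate the simulated flights (the seed is an assumption)
set.seed(1)
n <- 50
distance <- runif(n, min = 200, max = 2500)
arr_delay <- 5 + 0.01 * distance + rnorm(n, mean = 0, sd = 10)
flights <- data.frame(distance, arr_delay)

fit <- lm(arr_delay ~ distance, data = flights)

# Residuals e_i = y_i - y_hat_i, paired with the fitted values y_hat_i
diagnostics <- data.frame(
  fitted = unname(fitted(fit)),
  resid  = unname(resid(fit))
)

# Points should scatter randomly around the dashed zero line
p <- ggplot(diagnostics, aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted arrival delay (minutes)",
       y = "Residual (minutes)",
       title = "Residuals vs Fitted Values")
print(p)
```

Because the simulated noise has constant standard deviation and the true relationship is linear, this plot should show no strong curve or funnel shape.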

Delay vs Distance and Seats

Now add a second aviation variable: plane size (seats).

Rotate the plot to see how delay changes with both distance and seats
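The interactive plot is not reproduced in this text, and the tool used to make it is not stated. One way to draw a rotatable 3D scatter plot, assuming the plotly package is available, is sketched below. The seed and axis titles are also assumptions. Note that in the simulation, arr_delay is generated from distance only, so no trend along the seats axis is expected.

```r
# Recreate the simulated flights (the seed is an assumption)
set.seed(1)
n <- 50
distance <- runif(n, min = 200, max = 2500)
seats <- sample(c(100, 150, 180, 220), size = n, replace = TRUE)
arr_delay <- 5 + 0.01 * distance + rnorm(n, mean = 0, sd = 10)
flights <- data.frame(distance, seats, arr_delay)

# plotly is an assumed choice of package; skip the plot if it is not installed
if (requireNamespace("plotly", quietly = TRUE)) {
  p3d <- plotly::plot_ly(
    data = flights,
    x = ~distance, y = ~seats, z = ~arr_delay,
    type = "scatter3d", mode = "markers"
  )
  p3d <- plotly::layout(
    p3d,
    scene = list(
      xaxis = list(title = "Distance (miles)"),
      yaxis = list(title = "Seats"),
      zaxis = list(title = "Arrival delay (minutes)")
    )
  )
  # Display only in an interactive session, where the plot can be rotated
  if (interactive()) print(p3d)
}
```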

Conclusion

We used simple linear regression to study flight delays.
In this simulated data, longer flights tended to have longer arrival delays: about 0.88 extra minutes for each additional 100 miles.
Linear regression helps us:
- Quantify relationships between aviation variables
- Make predictions for new flights
- Check model assumptions using residual plots