Motivation: Flight Delays

  • Airlines and passengers both care about on-time flights.
  • Question: Do longer flights tend to have longer arrival delays?
  • We will use a small simulated dataset of flights with:
    • distance (in miles)
    • seats (number of seats on the plane)
    • arr_delay (arrival delay in minutes)
  • Goal: use linear regression to model the relationship between distance and arrival delay.

The Linear Regression Model

We model arrival delay as a linear function of distance:
\[\mathrm{arr\_delay} = \beta_0 + \beta_1 \cdot \mathrm{distance} + \epsilon\]

where:

  • \(\beta_0\): intercept (baseline delay when distance = 0)

  • \(\beta_1\): change in delay (minutes) for each extra mile of distance

  • \(\epsilon\): random error (other factors: weather, traffic, etc.)

We will estimate \(\beta_0\) and \(\beta_1\) from our flight data.
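
As a sketch of what "estimate" means here, assume the estimates are obtained by ordinary least squares (the method R's lm() function uses). They are then the values that minimize the sum of squared errors, which has the closed-form solution below. The symbols \(x_i\) and \(y_i\) are shorthand introduced here for the distance and arrival delay of flight \(i\), with \(\bar{x}\) and \(\bar{y}\) their sample means:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]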

Simulated Flight Data in R

Here we create a small dataset of 50 flights:

n <- 50  # number of simulated flights

# Predictors: flight distance (miles) and plane size (number of seats)
distance <- runif(n, min = 200, max = 2500)
seats <- sample(c(100, 150, 180, 220), size = n, replace = TRUE)

# True relationship: 5-minute baseline delay + 0.01 minutes per mile + noise
arr_delay <- 5 + 0.01 * distance + rnorm(n, mean = 0, sd = 10)

# Combine into one data frame, one row per flight
flights <- data.frame(distance, seats, arr_delay)

  • arr_delay is the response (Y)

  • distance and seats are aviation-related predictors (X)
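
Because the data are random, each run of the simulation gives different numbers. One way to make it repeatable is to fix the random seed first. The sketch below assumes an arbitrary seed of 123, which is not taken from the slides and will not reproduce the estimates reported later, and then runs two quick sanity checks on the result.

```r
# Fix the random seed so the simulation is repeatable.
# The seed value is an arbitrary choice (an assumption, not from the slides).
set.seed(123)

n <- 50
distance <- runif(n, min = 200, max = 2500)
seats <- sample(c(100, 150, 180, 220), size = n, replace = TRUE)
arr_delay <- 5 + 0.01 * distance + rnorm(n, mean = 0, sd = 10)
flights <- data.frame(distance, seats, arr_delay)

# Quick sanity checks on the simulated data
str(flights)      # structure: 50 observations of 3 variables
summary(flights)  # distance should fall between 200 and 2500 miles
```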

ggplot #1: Arrival Delay vs Distance

  • Each point = one flight

  • The smooth line is the fitted regression line
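
A plot like the one described could be drawn roughly as follows. This is a minimal sketch that assumes the ggplot2 package is installed and re-simulates the data with an arbitrary seed; the axis labels and title are illustrative choices, not taken from the original figure.

```r
library(ggplot2)

# Re-create the simulated flights (seed is an arbitrary assumption)
set.seed(123)
n <- 50
distance <- runif(n, min = 200, max = 2500)
seats <- sample(c(100, 150, 180, 220), size = n, replace = TRUE)
arr_delay <- 5 + 0.01 * distance + rnorm(n, mean = 0, sd = 10)
flights <- data.frame(distance, seats, arr_delay)

# Scatter plot with the least-squares line overlaid
p <- ggplot(flights, aes(x = distance, y = arr_delay)) +
  geom_point() +                                            # one point per flight
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # fitted regression line
  labs(title = "Arrival Delay vs Distance",
       x = "Distance (miles)",
       y = "Arrival delay (minutes)")
print(p)
```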

Fitting the Regression Model in R

We fit the model:

\[\mathrm{arr\_delay} = \beta_0 + \beta_1 \cdot \mathrm{distance} + \epsilon\]

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     8.1166      2.9785   2.7251    0.0089
distance        0.0088      0.0019   4.5841    0.0000

Important output:

  • Estimated slope \(\hat{\beta}_1 = 0.0088\): each additional mile is associated with about 0.009 more minutes of arrival delay, on average

  • p-value for the slope \(= 0.000033\) (displayed as 0.0000 in the table after rounding): strong evidence of a linear relationship between distance and delay

  • \(R^2 = 0.304\): about 30% of the variability in arrival delay is explained by distance
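
The quantities above can be pulled out of a fitted model along these lines. This is a sketch that re-simulates the data with an arbitrary seed (an assumption), so its numbers will differ from the ones reported on this slide; the variable names slope_hat, slope_p and r_squared are illustrative.

```r
# Re-create the simulated flights (seed is an arbitrary assumption)
set.seed(123)
n <- 50
distance <- runif(n, min = 200, max = 2500)
seats <- sample(c(100, 150, 180, 220), size = n, replace = TRUE)
arr_delay <- 5 + 0.01 * distance + rnorm(n, mean = 0, sd = 10)
flights <- data.frame(distance, seats, arr_delay)

# Fit the simple linear regression by least squares
fit <- lm(arr_delay ~ distance, data = flights)
fit_summary <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(fit_summary)
round(coef_table, 4)

# Pieces highlighted on the slide
slope_hat <- coef_table["distance", "Estimate"]   # estimated slope
slope_p   <- coef_table["distance", "Pr(>|t|)"]   # p-value for the slope
r_squared <- fit_summary$r.squared                # proportion of variance explained
```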

Interpreting the Slope

Suppose the fitted model is:

\[\widehat{\mathrm{arr\_delay}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \mathrm{distance}\]

Interpretation of \(\hat{\beta}_1\):

  • Estimated change in arrival delay (minutes) for a one-mile increase in distance

Example:

  • With \(\hat{\beta}_1 = 0.0088\), a 100-mile increase in distance is associated with an increase of about

    \[100 \times 0.0088 = 0.88\ \mathrm{minutes}\]

    in the expected arrival delay
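
The same arithmetic can be checked in R using the estimates reported in the model output. In this sketch, predict_delay is a hypothetical helper written for illustration, and the 1000- and 1100-mile flights are arbitrary example distances.

```r
# Estimates reported in the fitted-model output
beta0_hat <- 8.1166
beta1_hat <- 0.0088

# Hypothetical helper (not from the slides): predicted arrival delay in minutes
predict_delay <- function(distance_miles) {
  beta0_hat + beta1_hat * distance_miles
}

# Two example flights that differ by 100 miles
delay_1000 <- predict_delay(1000)
delay_1100 <- predict_delay(1100)

# Difference in predicted delay: 100 times the slope, about 0.88 minutes
delay_1100 - delay_1000
```

The intercept cancels in the difference, which is why only the slope matters when comparing two flights.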

Residuals vs Fitted Values

Residuals help us check whether the linear model is reasonable. For each flight, the residual is the observed delay minus the fitted delay:

\[e_i = y_i - \hat{y}_i\]

Interpreting the Residual Plot

  • Residuals should be randomly scattered around 0.

  • A strong pattern (curve or trend) would suggest the linear model may not be adequate.

  • Points far from 0 may indicate unusual flights (potential outliers).

  • If the vertical spread grows with fitted values, it could mean non-constant variance.
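
A residuals-versus-fitted plot for checking the points above might be produced as follows. This sketch uses base R graphics, which is an assumption (the slides do not say how the original plot was drawn), and re-simulates the data with an arbitrary seed.

```r
# Re-create the simulated flights (seed is an arbitrary assumption)
set.seed(123)
n <- 50
distance <- runif(n, min = 200, max = 2500)
seats <- sample(c(100, 150, 180, 220), size = n, replace = TRUE)
arr_delay <- 5 + 0.01 * distance + rnorm(n, mean = 0, sd = 10)
flights <- data.frame(distance, seats, arr_delay)

fit <- lm(arr_delay ~ distance, data = flights)

fitted_vals   <- fitted(fit)  # predicted delay for each flight
residual_vals <- resid(fit)   # observed delay minus predicted delay

# Residuals vs fitted values, with a dashed reference line at 0
plot(fitted_vals, residual_vals,
     xlab = "Fitted arrival delay (minutes)",
     ylab = "Residual (minutes)",
     main = "Residuals vs Fitted Values")
abline(h = 0, lty = 2)
```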

Delay vs Distance and Seats

Now add a second aviation variable: plane size (seats).

  • Rotate the plot to see how delay changes with both distance and seats
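
A rotatable 3D scatter plot of this kind could be built along the following lines. The slides do not say which package produced the original figure, so the use of plotly here is an assumption, as are the seed and the axis titles.

```r
library(plotly)

# Re-create the simulated flights (seed is an arbitrary assumption)
set.seed(123)
n <- 50
distance <- runif(n, min = 200, max = 2500)
seats <- sample(c(100, 150, 180, 220), size = n, replace = TRUE)
arr_delay <- 5 + 0.01 * distance + rnorm(n, mean = 0, sd = 10)
flights <- data.frame(distance, seats, arr_delay)

# 3D scatter: distance and seats on the horizontal axes, delay on the vertical axis
p3d <- plot_ly(flights,
               x = ~distance, y = ~seats, z = ~arr_delay,
               type = "scatter3d", mode = "markers")
p3d <- layout(p3d, scene = list(
  xaxis = list(title = "Distance (miles)"),
  yaxis = list(title = "Seats"),
  zaxis = list(title = "Arrival delay (minutes)")
))

# Display only in an interactive session, where the plot can be rotated
if (interactive()) print(p3d)
```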

Conclusion

  • We used simple linear regression to study flight delays.

  • Longer flights tended to have longer expected arrival delays: about 0.88 minutes more per additional 100 miles.

  • Linear regression helps us:

    • Quantify relationships between aviation variables.

    • Make predictions for new flights.

    • Check model assumptions using residual plots.