2025-10-16

What is Linear Regression?

Linear regression is a simple statistical method for predicting a dependent variable from one or more independent variables.

Ex.

  • Given a person’s height, could we predict what their weight is most likely to be?

If we had data on many people’s heights and weights, we could define an equation that “best fits” the data, then make predictions on future inputs. This is linear regression!

Example

With the given red data points, we can create a graph such as this. Now, if we are asked to guess the weight of someone who is 64 in tall, we can reasonably predict that they will weigh about 130 lbs.
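As a minimal sketch of this idea, the R code below fits a line to height and weight data and predicts the weight at 64 in. It uses R's built-in `women` dataset as a stand-in (an assumption for illustration; it is not the data behind the graph), so the prediction will not match the 130 lbs figure exactly.

```r
# Illustrative only: `women` ships with base R (heights in inches, weights in lbs).
# It stands in for the red data points, which are not available here.
fit <- lm(weight ~ height, data = women)

# Predict the weight of a new person who is 64 in tall
new_person <- data.frame(height = 64)
predicted_weight <- predict(fit, newdata = new_person)
print(predicted_weight)
```

With this dataset the prediction comes out at roughly 133 lbs, in the same range as the estimate read off the graph.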

Linear Regression, Mathematically

With one independent variable, we model a response \(Y\) as a linear function of a predictor \(X\): \[Y = \beta_0 + \beta_1 X\]

  • \(\beta_0\): y-intercept
  • \(\beta_1\): slope (change in \(Y\) for a one-unit increase in \(X\))

We can scale this up to include more independent variables, \(X_1, X_2, \dots, X_n\): \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n\]

  • \(\beta_0, \beta_1, \dots, \beta_n\): coefficients chosen to minimize the overall prediction error on the given data (in ordinary least squares, the sum of squared errors)
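To make "reduce the overall error" concrete: ordinary least squares picks the coefficients that minimize the sum of squared differences between observed and predicted values. When \(X^\top X\) is invertible, this has the closed-form solution \(\hat{\beta} = (X^\top X)^{-1} X^\top y\). The sketch below checks that formula against `lm()` on R's built-in `mtcars` data; the dataset and the choice of predictors (`wt`, `hp`) are assumptions made purely for illustration.

```r
# Illustrative data: predict fuel economy (mpg) from weight (wt) and horsepower (hp)
y <- mtcars$mpg
X <- cbind(1, mtcars$wt, mtcars$hp)   # leading column of 1s gives the intercept beta_0

# Solve the normal equations (X'X) beta = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Same coefficients from R's built-in fitting function
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Sum of squared errors for a given coefficient vector
sse <- function(beta) sum((y - X %*% beta)^2)

print(cbind(by_hand = as.vector(beta_hat), lm = unname(coef(fit))))
print(sse(beta_hat))
```

Nudging any coefficient away from `beta_hat` and re-evaluating `sse()` gives a larger value, which is what "best fit" means here.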

Assumptions

Linearity:
- The true relationship between \(X\) and \(Y\) is linear.
\[E[Y_i \mid X_i] = \beta_0 + \beta_1 X_i\]

Independence:
- Observations are independent of each other; one observation’s error does not influence another. Here \(\varepsilon_i = Y_i - (\beta_0 + \beta_1 X_i)\) denotes the error of observation \(i\).
\[\text{Cov}(\varepsilon_i, \varepsilon_j) = 0 \text{ for } i \neq j\]

Homoscedasticity (Constant Variance):
- The spread of errors is the same across all values of \(X\).
\[\text{Var}(\varepsilon_i) = \sigma^2\]
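These assumptions concern the unobserved errors \(\varepsilon_i\), so in practice they are usually checked informally through the residuals (observed minus fitted values). A minimal sketch, assuming R's built-in `women` dataset as a stand-in example: plot residuals against fitted values and look for curvature (a linearity problem) or a changing spread (a constant-variance problem). Independence is mostly a question of how the data were collected and is not something this plot can confirm.

```r
# Illustrative only: built-in height/weight data, not the data used in these slides
fit <- lm(weight ~ height, data = women)
diag_df <- data.frame(fitted = fitted(fit), residual = resid(fit))

# Residuals vs. fitted values, with a reference line at zero
plot(diag_df$fitted, diag_df$residual,
     xlab = "Fitted weight (lbs)", ylab = "Residual (lbs)",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)
```

In this particular dataset the residuals are positive at both ends and negative in the middle, a curved pattern suggesting that a straight line does not fully capture the relationship.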

Real Data Example: January 2013 NYC Flight Data
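The code in this example refers to a data frame `flights` and a fitted model `mod` that are not defined in this section. Below is a minimal sketch of one plausible setup. It assumes the data come from the `nycflights13` package and that `mod` regresses arrival delay on departure delay; both are assumptions inferred from the column names `dep_delay` and `arr_delay`, not confirmed by the source.

```r
library(dplyr)

# Assumption: January 2013 flights from the nycflights13 package,
# keeping only rows where both delays were recorded
flights <- filter(nycflights13::flights,
                  month == 1, !is.na(dep_delay), !is.na(arr_delay))

# Assumption: simple linear regression of arrival delay on departure delay
mod <- lm(arr_delay ~ dep_delay, data = flights)
coef(mod)
```

Dropping rows with missing delays up front keeps `fitted(mod)` the same length as `flights`, which a later `mutate()` call that attaches fitted values to every row relies on.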

Residuals

Residual: difference between the observed and predicted value,
\(r_i=y_i-\hat{y}_i\).

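# Add fitted values & residuals for all rows
# (assumes `mod` was fit on every row of `flights`, so the lengths match)
library(dplyr)  # provides mutate()
aug <- mutate(flights, predicted = fitted(mod), residual = resid(mod))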

# Show a few residual rows
head(aug[, c("dep_delay", "arr_delay", "predicted", "residual")], 6)
## # A tibble: 6 × 4
##   dep_delay arr_delay predicted residual
##       <dbl>     <dbl>     <dbl>    <dbl>
## 1         2        11  -2.04        13.0
## 2         4        20   0.00712     20.0
## 3         2        33  -2.04        35.0
## 4        -1       -18  -5.11       -12.9
## 5        -6       -25 -10.2        -14.8
## 6        -4        12  -8.18        20.2
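As a quick check of the definition against the first row of the output, taking arrival delay as the observed value:

\[r_1 = y_1 - \hat{y}_1 = 11 - (-2.04) = 13.04 \approx 13.0\]

The fourth row works the same way: \(-18 - (-5.11) = -12.89 \approx -12.9\). The printed values are rounded, so small differences in the last digit are expected. A positive residual means the flight arrived later than the model predicted; a negative residual means it arrived earlier.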

Plotting Residuals

Residuals close to zero indicate that the linear model is making accurate predictions, so it is a good sign that most of the points cluster around the horizontal line at zero (the x-axis).

3D Linear Regression

Regression can also be done in 3 dimensions, as shown when I add “air time” as a second predictor of arrival delay: with two predictors, the fitted model is a plane rather than a line. In fact, the math for linear regression works for as many dimensions as needed!
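A minimal sketch of the two-predictor model, assuming the data are the January 2013 flights from the `nycflights13` package (the source does not show its own code for this step, and the names `jan` and `mod_3d` are hypothetical):

```r
library(dplyr)

# Assumption: January 2013 flights from the nycflights13 package
jan <- filter(nycflights13::flights, month == 1)

# Two predictors: the fitted model is a plane in (dep_delay, air_time, arr_delay) space.
# lm() drops rows with missing values in any model variable by default.
mod_3d <- lm(arr_delay ~ dep_delay + air_time, data = jan)
coef(mod_3d)   # intercept plus one slope per predictor
```

Adding further predictors only means adding more terms to the right-hand side of the formula.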