2026-02-09

What is Simple Linear Regression?

Simple Linear Regression models the relationship between two continuous variables:

  • A dependent variable (response) \(Y\)
  • An independent variable (predictor) \(X\)

Goal: Find the straight line that minimizes the sum of squared residuals, i.e. the best-fitting line in the least-squares sense.

Real-world examples:

  • Predicting house price from square footage
  • Estimating exam score from hours studied
  • Forecasting sales based on advertising spend

The Mathematical Model

Defining the relationship

The population regression model is:

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, n\]

where:

  • \(Y_i\) is the observed response for the \(i\)-th observation
  • \(\beta_0\) is the y-intercept (expected value of \(Y\) when \(X = 0\))
  • \(\beta_1\) is the slope (expected change in \(Y\) per one-unit increase in \(X\))
  • \(\varepsilon_i \sim N(0, \sigma^2)\) are independent error terms

The fitted (estimated) regression line is:

\[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\]
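To make the distinction between the population model and the fitted line concrete, here is a minimal simulation sketch. The parameter values (beta0 = 2, beta1 = 0.5, sigma = 1), the sample size, and the range of X are illustrative assumptions, not values from this deck.

```r
# Simulate n observations from Y_i = beta0 + beta1 * X_i + eps_i.
# All parameter values below are illustrative assumptions.
set.seed(1)
n     <- 200
beta0 <- 2     # assumed true intercept
beta1 <- 0.5   # assumed true slope
sigma <- 1     # assumed error standard deviation

x   <- runif(n, min = 0, max = 10)
eps <- rnorm(n, mean = 0, sd = sigma)  # independent N(0, sigma^2) errors
y   <- beta0 + beta1 * x + eps

# The fitted line estimates the assumed parameters from the noisy sample.
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

The estimated coefficients should land near, but not exactly on, the assumed true values; the gap is sampling error.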

Estimating Parameters with OLS

Least Squares Estimation

The Ordinary Least Squares (OLS) method minimizes:

\[S(\beta_0, \beta_1) = \sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i)^2\]

Taking partial derivatives and solving gives:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{S_{XY}}{S_{XX}}\]

\[\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\]

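Under the Gauss-Markov assumptions (errors with mean zero, constant variance, and no correlation), these are the Best Linear Unbiased Estimators (BLUE). Normality of the errors is not required for this result.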
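The closed-form estimators can be checked by computing them directly. The sketch below uses R's built-in cars dataset (speed in mph, stopping distance in ft); the variable names s_xy, s_xx, beta0_hat, and beta1_hat are chosen here for illustration.

```r
x <- cars$speed
y <- cars$dist

# Sums of cross-products and squares around the means
s_xy <- sum((x - mean(x)) * (y - mean(y)))
s_xx <- sum((x - mean(x))^2)

beta1_hat <- s_xy / s_xx
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(intercept = beta0_hat, slope = beta1_hat)

# Cross-check against lm(); should return TRUE
all.equal(unname(coef(lm(dist ~ speed, data = cars))),
          c(beta0_hat, beta1_hat))
```

The hand-computed values should agree with the lm() estimates for this dataset (intercept about -17.58, slope about 3.93).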

Example: Cars Dataset

Stopping distance vs. speed

We use R’s built-in cars dataset, which contains 50 observations of speed (mph) and stopping distance (ft).

# Fit the linear model
model <- lm(dist ~ speed, data = cars)
summary(model)$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -17.579095  6.7584402 -2.601058 1.231882e-02
## speed         3.932409  0.4155128  9.463990 1.489836e-12

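Interpretation: Each additional 1 mph of speed is associated with an average increase in stopping distance of about 3.93 ft. The negative intercept (-17.58 ft) has no physical meaning, because a speed of 0 mph lies outside the observed range of the data.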

ggplot: Scatter Plot with Regression Line
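One way to draw a scatter plot with the fitted line is sketched below, assuming the ggplot2 package is installed. The axis labels and the object name p_scatter are choices made here, not taken from the original slide.

```r
library(ggplot2)

# Scatter plot of the cars data with the least-squares line
# and its 95% confidence band (se = TRUE).
p_scatter <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Speed (mph)",
       y = "Stopping distance (ft)",
       title = "Stopping distance vs. speed")

p_scatter
```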

ggplot: Residual Analysis
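A residuals-versus-fitted plot is a common way to carry out this analysis. The following is a minimal sketch, assuming ggplot2 is installed; the loess smoother is added here only as a visual aid for spotting curvature, and the names diag_df and p_resid are illustrative.

```r
library(ggplot2)

model <- lm(dist ~ speed, data = cars)

# Collect fitted values and residuals for plotting
diag_df <- data.frame(fitted = fitted(model), resid = resid(model))

p_resid <- ggplot(diag_df, aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(x = "Fitted values (ft)", y = "Residuals (ft)")

p_resid
```

If the linear model is adequate, the points should scatter around the dashed zero line with no clear trend or funnel shape.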

3D Plotly: SSE Loss Surface

OLS finds the minimum
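The claim that OLS sits at the bottom of the SSE surface can be checked numerically. The sketch below evaluates the sum of squared errors on a grid of candidate intercepts and slopes for the cars data; the grid ranges and resolution are arbitrary choices made here. The plotly call is guarded so the grid computation also runs without that package.

```r
x <- cars$speed
y <- cars$dist

# Sum of squared errors for one candidate (intercept, slope) pair
sse <- function(b0, b1) sum((y - b0 - b1 * x)^2)

b0_grid <- seq(-40, 5, length.out = 91)  # candidate intercepts
b1_grid <- seq(2, 6, length.out = 81)    # candidate slopes

# Rows index b0, columns index b1
sse_grid <- outer(b0_grid, b1_grid, Vectorize(sse))

# Grid point with the smallest SSE
idx <- which(sse_grid == min(sse_grid), arr.ind = TRUE)
grid_min <- c(b0 = b0_grid[idx[1, "row"]], b1 = b1_grid[idx[1, "col"]])
grid_min

# Optional 3D surface; plotly expects z rows to follow y and columns to follow x
if (requireNamespace("plotly", quietly = TRUE)) {
  fig <- plotly::plot_ly(x = b1_grid, y = b0_grid, z = sse_grid,
                         type = "surface")
  # print(fig) to render the surface interactively
}
```

The grid minimum should fall close to the lm() estimates, and no grid point can have a smaller SSE than the OLS solution itself.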

Hypothesis Testing

Evaluating the model

Testing if the slope is significant:

\[H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0\]

\[t = \frac{\hat{\beta}_1}{\text{SE}(\hat{\beta}_1)} = \frac{3.9324}{0.4155} = 9.46\]

Under \(H_0\), \(t\) follows a \(t\) distribution with \(n - 2 = 48\) degrees of freedom. The p-value is \(1.49 \times 10^{-12}\), far below 0.05, so we reject \(H_0\): speed is a statistically significant predictor of stopping distance.
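The test statistic and p-value can be reproduced from the coefficient table. This is a minimal sketch for the cars data; the names t_stat, dof, and p_val are illustrative.

```r
model <- lm(dist ~ speed, data = cars)
coefs <- summary(model)$coefficients

# t statistic: estimate divided by its standard error
t_stat <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]

# Residual degrees of freedom: n - 2 for simple linear regression
dof <- df.residual(model)

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = dof)

c(t = t_stat, df = dof, p = p_val)
```

These values should match the `t value` and `Pr(>|t|)` columns that summary() reports for speed.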

Coefficient of Determination

Goodness of fit

\[R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 0.6511\]

This means 65.1% of the variability in stopping distance is explained by speed.
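As a check on the formula, the coefficient of determination can be computed from the two sums of squares. The sketch below uses the cars data; ss_res, ss_tot, and r2 are names chosen here.

```r
model <- lm(dist ~ speed, data = cars)

ss_res <- sum(resid(model)^2)                    # residual sum of squares
ss_tot <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares

r2 <- 1 - ss_res / ss_tot
r2

# In simple linear regression, R^2 equals the squared sample correlation
all.equal(r2, cor(cars$speed, cars$dist)^2)
```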

Confidence & Prediction Intervals
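A confidence interval describes uncertainty about the mean response at a given speed, while a prediction interval covers a single new observation and is therefore wider. Both can be obtained with predict(); the speeds 5, 15, and 25 mph below are illustrative choices, not values from the original slides.

```r
model <- lm(dist ~ speed, data = cars)

# Illustrative speeds (mph) at which to evaluate the intervals
new_speeds <- data.frame(speed = c(5, 15, 25))

conf_int <- predict(model, newdata = new_speeds,
                    interval = "confidence", level = 0.95)
pred_int <- predict(model, newdata = new_speeds,
                    interval = "prediction", level = 0.95)

# Compare interval widths at each speed
widths <- data.frame(
  speed      = new_speeds$speed,
  conf_width = conf_int[, "upr"] - conf_int[, "lwr"],
  pred_width = pred_int[, "upr"] - pred_int[, "lwr"]
)
widths
```

Both widths are smallest near the mean speed (15.4 mph) and grow toward either end of the speed range.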

Key Assumptions & Takeaways

Assumptions of Simple Linear Regression:

  1. Linearity: The mean of \(Y\) is a linear function of \(X\)
  2. Independence: Observations (and hence errors) are independent of each other
  3. Homoscedasticity: The errors have constant variance, \(\text{Var}(\varepsilon_i) = \sigma^2\)
  4. Normality: The errors are normally distributed, \(\varepsilon_i \sim N(0, \sigma^2)\); the residuals are used to check this
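These assumptions are usually checked on the residuals of the fitted model. The following is a rough sketch using base R only; the choice of the Shapiro-Wilk test for normality is an assumption made here, and graphical checks are often preferred over any single test.

```r
model <- lm(dist ~ speed, data = cars)
res   <- resid(model)

# Normality: Shapiro-Wilk test on the residuals
sw <- shapiro.test(res)
sw

# Linearity and constant variance: the four standard diagnostic plots
# (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage)
op <- par(mfrow = c(2, 2))
plot(model)
par(op)
```

Independence generally cannot be verified from the residuals alone; it depends on how the data were collected.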

Summary of our analysis:

  • Speed is a statistically significant predictor of stopping distance (\(p < 0.001\))
  • The model explains ~65% of the variance (\(R^2 \approx 0.65\))
  • Both confidence and prediction intervals widen as speed moves away from its sample mean (15.4 mph)
  • The residual plot suggests possible non-linearity; a quadratic term might improve the fit
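Whether a quadratic term helps can be examined by fitting both models and comparing them. This is a minimal sketch, assuming a nested-model F test via anova() is an acceptable comparison; the names linear_fit and quad_fit are illustrative.

```r
linear_fit <- lm(dist ~ speed, data = cars)
quad_fit   <- lm(dist ~ speed + I(speed^2), data = cars)

# F test of the linear model against the quadratic model
anova(linear_fit, quad_fit)

# Adjusted R^2 penalizes the extra parameter
c(linear    = summary(linear_fit)$adj.r.squared,
  quadratic = summary(quad_fit)$adj.r.squared)
```

A small p-value in the anova() table would support keeping the quadratic term; otherwise the simpler linear model is preferable.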

References

  • Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied Linear Statistical Models (5th ed.). McGraw-Hill.
  • R Core Team (2024). R: A Language and Environment for Statistical Computing. https://www.R-project.org/
  • Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag.
  • Sievert, C. (2020). Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC.