What Is Simple Linear Regression?

Simple linear regression models the linear relationship between two variables:

  • A response variable \(Y\) (dependent)
  • A single predictor variable \(X\) (independent)

Goal: find the best-fitting line through the data to predict \(Y\) from \(X\).

Applications:

  • Predicting house prices from size
  • Estimating crop yield from rainfall
  • Forecasting sales from advertising spend

The Model

The population model is:

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]

| Symbol | Meaning |
|---|---|
| \(\beta_0\) | Intercept: value of \(Y\) when \(X = 0\) |
| \(\beta_1\) | Slope: change in \(Y\) per unit increase in \(X\) |
| \(\varepsilon_i\) | Error term, \(\varepsilon_i \sim N(0, \sigma^2)\) |

The fitted line is: \(\quad \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\)

Least Squares Estimation

We estimate \(\beta_0\) and \(\beta_1\) by minimizing the Residual Sum of Squares:

\[\text{RSS} = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2\]

The OLS closed-form solutions are:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{S_{XY}}{S_{XX}}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\]
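
As a brief sketch of where these solutions come from: setting the partial derivatives of RSS with respect to \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to zero gives the two normal equations

\[\frac{\partial\,\text{RSS}}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n}(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0, \qquad \frac{\partial\,\text{RSS}}{\partial \hat{\beta}_1} = -2\sum_{i=1}^{n} X_i\,(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0\]

The first equation rearranges to \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\). Substituting this into the second and using \(\sum_{i=1}^{n}(X_i - \bar{X}) = 0\) and \(\sum_{i=1}^{n}(Y_i - \bar{Y}) = 0\) yields

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i (Y_i - \bar{Y})}{\sum_{i=1}^{n} X_i (X_i - \bar{X})} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{S_{XY}}{S_{XX}}\]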

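Under the Gauss-Markov conditions (errors with mean zero, constant variance, and no correlation), these estimators are BLUE (Best Linear Unbiased Estimators): they have the smallest variance among all linear unbiased estimators. Normality of the errors is not required for this result.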

Example: Car Speed vs. Stopping Distance

Using R’s built-in cars dataset (n = 50): \(X\) = Speed (mph), \(Y\) = Stopping distance (ft).

model <- lm(dist ~ speed, data = cars)
round(summary(model)$coefficients, 4)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584 -2.6011   0.0123
## speed         3.9324     0.4155  9.4640   0.0000

The fitted model is: \(\quad \hat{\text{dist}} = -17.58 + 3.93 \times \text{speed}\)

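Each additional mph of speed is associated with about 3.93 ft more stopping distance on average. The negative intercept has no physical meaning, because speed = 0 lies outside the observed range of 4 to 25 mph.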
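
As a cross-check, the closed-form formulas \(\hat{\beta}_1 = S_{XY}/S_{XX}\) and \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\) can be evaluated by hand on the same data. This is a minimal sketch; the variable names (`x`, `y`, `s_xy`, `s_xx`, `b0`, `b1`, `fit`) and the 20 mph query speed are illustrative choices, not part of the original slides.

```r
# Hand computation of the OLS estimates for the built-in cars data
x <- cars$speed
y <- cars$dist

s_xy <- sum((x - mean(x)) * (y - mean(y)))  # S_XY
s_xx <- sum((x - mean(x))^2)                # S_XX

b1 <- s_xy / s_xx              # slope estimate
b0 <- mean(y) - b1 * mean(x)   # intercept estimate

round(c(intercept = b0, slope = b1), 4)

# Cross-check against lm(): the two sets of estimates should agree
fit <- lm(dist ~ speed, data = cars)
all.equal(unname(coef(fit)), c(b0, b1))

# Predicted stopping distance at an illustrative speed of 20 mph,
# computed from the formula and via predict()
b0 + b1 * 20
predict(fit, newdata = data.frame(speed = 20))
```

The rounded values should match the Estimate column of the coefficient table above (-17.5791 and 3.9324).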

ggplot: Scatter Plot with Regression Line

ggplot: Residuals vs. Fitted Values
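
The code behind this plot is not shown on the slide. One possible way to draw it, assuming the same `lm(dist ~ speed)` fit and ggplot2, is sketched below; the names `fit`, `diag_df`, and `p_resid`, as well as the axis labels, are assumptions.

```r
library(ggplot2)

fit <- lm(dist ~ speed, data = cars)

# Collect fitted values and residuals in a data frame for plotting
diag_df <- data.frame(fitted = fitted(fit), resid = residuals(fit))

p_resid <- ggplot(diag_df, aes(x = fitted, y = resid)) +
  geom_point() +
  # Reference line at zero: residuals should scatter evenly around it
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs. Fitted Values",
       x = "Fitted distance (ft)", y = "Residual (ft)") +
  theme_minimal()

p_resid
```

A curved pattern in such a plot would suggest non-linearity, and a funnel shape would suggest unequal variance.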

Interactive 3D Plot with Plotly

Goodness of Fit & Inference

R² — coefficient of determination:

\[R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}\]

round(summary(model)$r.squared, 4)
## [1] 0.6511

About 65% of the variation in stopping distance is explained by its linear relationship with speed.
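
The definition \(R^2 = 1 - \text{RSS}/\text{TSS}\) can be checked directly. The sketch below recomputes it from the residuals; `fit`, `rss`, `tss`, and `r2` are illustrative names.

```r
fit <- lm(dist ~ speed, data = cars)

rss <- sum(residuals(fit)^2)                 # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)  # total sum of squares

r2 <- 1 - rss / tss
round(r2, 4)

# In simple linear regression, R^2 equals the squared sample correlation
all.equal(r2, cor(cars$speed, cars$dist)^2)
```

The rounded value should agree with `summary(model)$r.squared` shown above (0.6511).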

Hypothesis test for slope: \(H_0: \beta_1 = 0\) vs. \(H_1: \beta_1 \neq 0\). Under \(H_0\), the test statistic \(t = \hat{\beta}_1 / \text{SE}(\hat{\beta}_1)\) follows a \(t\) distribution with \(n - 2\) degrees of freedom, \(t \sim t_{n-2}\).
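
To make the test concrete, the statistic and its two-sided p-value can be recomputed from the coefficient table. This is a sketch under the same model; `fit`, `est`, `se_b1`, `t_stat`, and `p_val` are illustrative names.

```r
fit <- lm(dist ~ speed, data = cars)
est <- summary(fit)$coefficients

b1    <- est["speed", "Estimate"]
se_b1 <- est["speed", "Std. Error"]
df    <- df.residual(fit)   # n - 2 residual degrees of freedom

t_stat <- b1 / se_b1
# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_val  <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

c(t = t_stat, p = p_val)
```

The statistic should match the t value of 9.4640 reported for speed in the coefficient table. The p-value is far below 0.0001, which is why it prints as 0.0000 after rounding to four decimals.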

round(confint(model, "speed", level = 0.95), 4)
##       2.5 % 97.5 %
## speed 3.097 4.7679

We are 95% confident the true slope is between 3.10 and 4.77 ft/mph.
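
The interval has the form \(\hat{\beta}_1 \pm t_{0.975,\,n-2}\,\text{SE}(\hat{\beta}_1)\). A minimal sketch of the same calculation by hand follows; `fit`, `est`, `t_crit`, and `ci` are illustrative names.

```r
fit <- lm(dist ~ speed, data = cars)
est <- summary(fit)$coefficients

b1     <- est["speed", "Estimate"]
se_b1  <- est["speed", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))  # upper 2.5% point of t with n - 2 df

ci <- b1 + c(-1, 1) * t_crit * se_b1  # lower and upper limits
round(ci, 4)
```

The limits should agree with the `confint()` output above.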

R Code: Fitting the Model

# Load data and fit model
data(cars)
model <- lm(dist ~ speed, data = cars)
summary(model)

# Scatter plot with regression line
library(ggplot2)
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "red") +
  labs(title = "Speed vs. Stopping Distance",
       x = "Speed (mph)", y = "Distance (ft)") +
  theme_minimal()

# Confidence interval for slope
confint(model, "speed", level = 0.95)

Assumptions & Summary

Assumptions (LINE):

  1. Linearity: the mean of \(Y\) is a linear function of \(X\)
  2. Independence: the errors (and hence the observations) are independent
  3. Normality: errors \(\varepsilon_i \sim N(0, \sigma^2)\)
  4. Equal variance: \(\text{Var}(\varepsilon_i) = \sigma^2\) for all \(i\)
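
These assumptions are usually checked on the residuals of the fitted model. The sketch below uses base R's built-in diagnostic plots for `lm` objects and a Shapiro-Wilk test as one possible set of checks. The names `fit`, `op`, and `sw` are illustrative, and independence cannot be verified this way because it depends on how the data were collected.

```r
fit <- lm(dist ~ speed, data = cars)

# Base R diagnostic plots for lm objects:
#   1 = residuals vs. fitted (linearity)
#   2 = normal Q-Q           (normality)
#   3 = scale-location       (equal variance)
op <- par(mfrow = c(1, 3))
plot(fit, which = 1:3)
par(op)  # restore the previous plotting layout

# Shapiro-Wilk test on the residuals as a rough numerical check of normality
sw <- shapiro.test(residuals(fit))
sw$p.value
```

A small p-value from the Shapiro-Wilk test would indicate departure from normality, though the plots are generally more informative than any single test.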

Key formulas:

| Concept | Formula |
|---|---|
| Model | \(Y = \beta_0 + \beta_1 X + \varepsilon\) |
| Slope | \(\hat\beta_1 = S_{XY}/S_{XX}\) |
| Intercept | \(\hat\beta_0 = \bar Y - \hat\beta_1 \bar X\) |
| Fit | \(R^2 = 1 - \text{RSS}/\text{TSS}\) |
| Test statistic | \(t = \hat\beta_1 / \text{SE}(\hat\beta_1)\) |