The Idea

Simple linear regression fits a straight line to data: one predictor, one response. We use R's built-in cars dataset, where speed (mph) predicts stopping distance (ft).

The Data

head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

The Model

We assume the response is a line plus random noise.

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

\(\beta_0\) is the intercept. \(\beta_1\) is the slope. \(\varepsilon_i\) is the random error for observation \(i\), assumed to have mean zero.
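
To make the model concrete, here is a small simulation sketch. The intercept, slope, noise level, and sample size below are made-up illustration values, not estimates from the cars data, and the noise is assumed to be normal only for convenience.

```r
# Simulate from y_i = beta0 + beta1 * x_i + eps_i
# (beta0, beta1, sigma, and n_sim are arbitrary illustration values)
set.seed(1)
n_sim <- 200
beta0 <- 2
beta1 <- 0.5
sigma <- 1

x_sim   <- runif(n_sim, 0, 10)                 # predictor values
eps_sim <- rnorm(n_sim, mean = 0, sd = sigma)  # mean-zero noise
y_sim   <- beta0 + beta1 * x_sim + eps_sim     # line plus noise

# Fitting a line should give estimates close to beta0 and beta1
coef(lm(y_sim ~ x_sim))
```

The estimates will not equal 2 and 0.5 exactly, because the noise moves the fitted line a little away from the true one.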

Least Squares

We pick the line that makes the sum of squared errors (SSE) as small as possible.

\[ \text{SSE} = \sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2 \]
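
As a quick illustration of the formula, SSE can be written as a small R function and evaluated for any candidate line. The function name `sse_line` and the two candidate lines are hypothetical choices made for this sketch.

```r
# SSE for a candidate line with intercept b0 and slope b1
sse_line <- function(b0, b1, x, y) {
  sum((y - b0 - b1 * x)^2)
}

# Two hand-picked candidate lines on the cars data
sse_a <- sse_line(0, 3, cars$speed, cars$dist)    # through the origin, slope 3
sse_b <- sse_line(-18, 4, cars$speed, cars$dist)  # negative intercept, slope 4

c(sse_a = sse_a, sse_b = sse_b)
```

The line with the smaller SSE fits these data better in the least-squares sense. Least squares looks for the pair of values with the smallest SSE of all.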

Setting the partial derivatives of SSE with respect to \(\beta_0\) and \(\beta_1\) to zero and solving gives:

\[ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]
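
The steps behind these formulas can be sketched as follows. Set both partial derivatives of SSE to zero:

\[ \frac{\partial\,\text{SSE}}{\partial \beta_0} = -2\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right) = 0, \quad \frac{\partial\,\text{SSE}}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i\left(y_i - \beta_0 - \beta_1 x_i\right) = 0 \]

Dividing the first equation by \(-2n\) gives \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). Substituting that into the second equation:

\[ \sum_{i=1}^{n} x_i\left[(y_i - \bar{y}) - \hat{\beta}_1 (x_i - \bar{x})\right] = 0 \]

Deviations from a mean sum to zero, so replacing \(x_i\) by \(x_i - \bar{x}\) in front of the bracket changes nothing:

\[ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \hat{\beta}_1 \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

Dividing by \(\sum (x_i - \bar{x})^2\) gives the slope formula.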

Compute It By Hand

x <- cars$speed
y <- cars$dist
n <- length(x)

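# Slope: sum of cross-deviations over sum of squared x-deviations
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: forces the line through the point (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)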

c(intercept = b0, slope = b1)
##  intercept      slope 
## -17.579095   3.932409

Same Result With lm

fit <- lm(dist ~ speed, data = cars)
coef(fit)
## (Intercept)       speed 
##  -17.579095    3.932409

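The two match. lm solves the same least-squares problem, though internally it uses a QR decomposition rather than the closed-form formulas.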
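
Two other routes to the same numbers, shown as a sketch: the slope as sample covariance over sample variance, and the normal equations in matrix form. The variable names `b1_cov`, `X`, and `b_mat` are introduced here for illustration.

```r
x <- cars$speed
y <- cars$dist

# Slope as cov(x, y) / var(x); the 1 / (n - 1) factors cancel
b1_cov <- cov(x, y) / var(x)

# Normal equations in matrix form: (X'X) b = X'y
X <- cbind(1, x)  # design matrix with a column of ones for the intercept
b_mat <- drop(solve(crossprod(X), crossprod(X, y)))

b1_cov
b_mat
```

Both should agree with coef(fit) up to floating-point error.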

Scatter Plot With the Line

library(ggplot2)

ggplot(cars, aes(speed, dist)) +
  geom_point() +
  geom_abline(intercept = b0, slope = b1, color = "blue")

Residuals

cars$pred <- b0 + b1 * cars$speed
cars$resid <- cars$dist - cars$pred

ggplot(cars, aes(pred, resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "blue")

Points scatter around zero. No strong trend or curvature stands out, so a straight line is a reasonable fit.
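
Part of that behavior is guaranteed by the fit itself. With an intercept in the model, least-squares residuals always sum to zero and are orthogonal to the predictor, so the plot is useful for spotting patterns rather than for checking that the residuals center on zero. A small check, using the hypothetical names `fit_chk` and `r`:

```r
fit_chk <- lm(dist ~ speed, data = cars)
r <- resid(fit_chk)

sum(r)               # zero up to floating-point error
sum(r * cars$speed)  # also zero: residuals are orthogonal to speed
```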

Cost Surface in 3D

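library(plotly)

# Grid of candidate intercepts and slopes around the least-squares solution
b0s <- seq(-40, 10, length.out = 50)
b1s <- seq(1, 6, length.out = 50)
# sse[i, j] holds the SSE for intercept b0s[i] and slope b1s[j]
sse <- matrix(0, length(b0s), length(b1s))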

for (i in seq_along(b0s)) {
  for (j in seq_along(b1s)) {
    sse[i, j] <- sum((y - (b0s[i] + b1s[j] * x))^2)
  }
}

plot_ly(x = b1s, y = b0s, z = sse) %>%
  add_surface() %>%
  layout(scene = list(
    xaxis = list(title = "slope"),
    yaxis = list(title = "intercept"),
    zaxis = list(title = "SSE")))
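
The surface is a bowl, and its lowest point should sit near the least-squares estimates. As a rough check, the sketch below rebuilds the same grid and locates the cell with the smallest SSE. The helper `sse_at` and the names `sse_grid` and `idx` are assumptions made for this example.

```r
x <- cars$speed
y <- cars$dist

b0s <- seq(-40, 10, length.out = 50)
b1s <- seq(1, 6, length.out = 50)

# Same grid as the loop version; rows index b0s, columns index b1s
sse_at <- function(b0, b1) sum((y - (b0 + b1 * x))^2)
sse_grid <- outer(b0s, b1s, Vectorize(sse_at))

# Row and column of the smallest SSE on the grid
idx <- arrayInd(which.min(sse_grid), dim(sse_grid))
c(intercept = b0s[idx[1, 1]], slope = b1s[idx[1, 2]])
```

The grid minimum only approximates the exact solution, since the true minimum generally falls between grid points. A finer grid narrows the gap.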

Predict

new <- data.frame(speed = c(10, 20, 30))
predict(fit, new)
##         1         2         3 
##  21.74499  61.06908 100.39317

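Higher speed means longer predicted stopping distance. The last value is an extrapolation: 30 mph lies above the fastest observed speed of 25 mph.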
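
Point predictions say nothing about their uncertainty. As a sketch, predict can also return prediction intervals for a single new observation. The names `fit_pi` and `new_speeds` are introduced here, and the 95% level is an arbitrary but common choice.

```r
fit_pi <- lm(dist ~ speed, data = cars)
new_speeds <- data.frame(speed = c(10, 20, 30))

# Columns: fit (point prediction), lwr and upr (interval bounds)
pred_int <- predict(fit_pi, new_speeds, interval = "prediction", level = 0.95)
pred_int
```

These intervals rely on the usual linear-model assumptions (independent, roughly normal errors with constant variance), which the residual plot only partly checks.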

Summary

Fit a line by minimizing the sum of squared errors. The hand calculation and lm agree. The slope says predicted stopping distance rises about 3.9 feet for each additional mph of speed. Residuals show no strong pattern, so the line is a reasonable fit.