Simple linear regression fits a straight line to data. One predictor, one response. We use R's built-in cars dataset. Speed (mph) predicts stopping distance (ft).
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
summary(cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
We assume the response is a line plus random noise.
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
\(\beta_0\) is the intercept. \(\beta_1\) is the slope. \(\varepsilon_i\) is the error.
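To see what the model claims, it can help to generate data from it and check that a fit recovers the line. The sketch below is illustrative only: the "true" values `beta0 = 5` and `beta1 = 3`, the sample size, and the normally distributed noise are all assumptions made for the example, not part of the cars analysis.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical "true" values, chosen only for illustration
beta0 <- 5
beta1 <- 3
n_sim <- 200

x_sim   <- runif(n_sim, min = 0, max = 10)   # predictor values
eps_sim <- rnorm(n_sim, mean = 0, sd = 2)    # noise, assumed normal here
y_sim   <- beta0 + beta1 * x_sim + eps_sim   # response = line + noise

# The fitted coefficients should land near beta0 and beta1
sim_fit <- lm(y_sim ~ x_sim)
coef(sim_fit)
```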
We pick the line that makes the sum of squared errors (SSE) as small as possible.
\[ \text{SSE} = \sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2 \]
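The criterion is easy to write as a function. In this minimal sketch, `sse_line` is a hypothetical helper name, and the two candidate lines are picked only to show that a better line gives a smaller SSE.

```r
# Hypothetical helper: sum of squared errors for a candidate line
sse_line <- function(b0, b1, x, y) {
  sum((y - b0 - b1 * x)^2)
}

x <- cars$speed
y <- cars$dist

# Compare two candidate lines; a smaller SSE means a closer fit
sse_flat   <- sse_line(mean(y), 0, x, y)   # horizontal line at the mean of y
sse_sloped <- sse_line(-17.6, 3.9, x, y)   # a rising line through the point cloud
c(flat = sse_flat, sloped = sse_sloped)
```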
Solving gives:
\[ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]
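The step from the SSE to these formulas is short. Set both partial derivatives to zero:

\[ \frac{\partial\,\text{SSE}}{\partial \beta_0} = -2\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right) = 0, \quad \frac{\partial\,\text{SSE}}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i\left(y_i - \beta_0 - \beta_1 x_i\right) = 0 \]

The first equation gives \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). Substituting that into the second, and using \(\sum (x_i - \bar{x}) = 0\) to replace \(x_i\) with \(x_i - \bar{x}\), leaves

\[ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \hat{\beta}_1 \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

which rearranges to the slope formula.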
x <- cars$speed
y <- cars$dist
n <- length(x)
# Slope: cross-products of deviations over squared deviations of x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: forces the line through the point (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)
c(intercept = b0, slope = b1)
##  intercept      slope
## -17.579095   3.932409
fit <- lm(dist ~ speed, data = cars)
coef(fit)
## (Intercept)       speed
##  -17.579095    3.932409
The two match. lm returns the same least-squares estimates.
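The agreement can also be checked in code rather than by eye. The sketch below adds a matrix form of the same calculation. The normal equations are an illustrative equivalent, not a copy of lm's internals, since lm works through a QR decomposition rather than forming X'X. The name `b_mat` is introduced here for the example.

```r
x <- cars$speed
y <- cars$dist

# Closed-form estimates from the formulas above
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Same estimates from the normal equations (X'X) b = X'y
X <- cbind(1, x)                               # intercept column plus speed
b_mat <- solve(crossprod(X), crossprod(X, y))

fit <- lm(dist ~ speed, data = cars)

# all.equal compares within a small numerical tolerance
all.equal(unname(coef(fit)), c(b0, b1))
all.equal(unname(coef(fit)), as.numeric(b_mat))
```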
library(ggplot2)
# Scatterplot with the fitted line
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  geom_abline(intercept = b0, slope = b1, color = "blue")
# Fitted values and residuals, then residuals against fitted values
cars$pred <- b0 + b1 * cars$speed
cars$resid <- cars$dist - cars$pred
ggplot(cars, aes(pred, resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "blue")
Points scatter around zero. No clear pattern. The line fits.
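Two numerical checks go with the residual plot. For any least-squares line with an intercept, the residuals sum to zero and have zero cross-product with the predictor, so these checks confirm the arithmetic rather than the model. Whether a pattern remains is still a judgement made from the plot. The names `sum_res`, `cross`, and `rse` are introduced here for illustration.

```r
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Both quantities are zero up to floating-point error for any
# least-squares fit that includes an intercept
sum_res <- sum(res)
cross   <- sum(res * cars$speed)
c(sum = sum_res, cross = cross)

# Residual standard error: typical size of a residual, in feet
rse <- sqrt(sum(res^2) / df.residual(fit))
rse
```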
# Evaluate SSE over a grid of candidate intercepts and slopes
b0s <- seq(-40, 10, length.out = 50)
b1s <- seq(1, 6, length.out = 50)
sse <- matrix(0, length(b0s), length(b1s))
for (i in seq_along(b0s)) {
  for (j in seq_along(b1s)) {
    sse[i, j] <- sum((y - (b0s[i] + b1s[j] * x))^2)
  }
}
# Rows of sse follow b0s (y axis), columns follow b1s (x axis)
library(plotly)
plot_ly(x = b1s, y = b0s, z = sse) %>%
  add_surface() %>%
  layout(scene = list(
    xaxis = list(title = "slope"),
    yaxis = list(title = "intercept"),
    zaxis = list(title = "SSE")))
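The bottom of the surface is the least-squares solution, and it can be located numerically. This sketch rebuilds the same grid so that it runs on its own, picks the grid cell with the smallest SSE, then refines it with `optim` (Nelder-Mead by default). Using a general-purpose optimizer here is an illustration of minimizing SSE directly, not how lm computes its answer. The names `grid_min` and `opt` are introduced for the example.

```r
x <- cars$speed
y <- cars$dist

# Rebuild the SSE grid so this block runs on its own
b0s <- seq(-40, 10, length.out = 50)
b1s <- seq(1, 6, length.out = 50)
sse <- outer(b0s, b1s,
             Vectorize(function(b0, b1) sum((y - (b0 + b1 * x))^2)))

# Locate the smallest SSE on the grid (rows index b0s, columns index b1s)
idx <- which(sse == min(sse), arr.ind = TRUE)[1, ]
grid_min <- c(intercept = b0s[idx[["row"]]], slope = b1s[idx[["col"]]])

# Refine with a general-purpose optimizer, starting from the grid minimum
opt <- optim(grid_min, function(b) sum((y - (b[1] + b[2] * x))^2))
grid_min
opt$par
```

The grid answer is only as fine as the grid spacing. The optimizer should move it closer to the coefficients reported by lm.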
new <- data.frame(speed = c(10, 20, 30))
predict(fit, new)
##         1         2         3
##  21.74499  61.06908 100.39317
Faster speed means longer stopping distance. The fastest car in the data travels at 25 mph, so the prediction at 30 is an extrapolation.
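A prediction is just the fitted line evaluated at a new speed, which the sketch below reproduces by hand. It also asks `predict` for 95% prediction intervals and marks which speeds fall outside the observed range. The 95% level and the `extrapolated` column are choices made for this example.

```r
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = c(10, 20, 30))

# Point predictions by hand: intercept + slope * speed
by_hand <- coef(fit)[[1]] + coef(fit)[[2]] * new$speed

# Same predictions with 95% prediction intervals for a single new car
pred <- predict(fit, new, interval = "prediction", level = 0.95)

# Mark speeds outside the observed range of the data
new$extrapolated <- new$speed < min(cars$speed) | new$speed > max(cars$speed)
cbind(new, pred)
```

The intervals describe where one new observation is expected to fall, so they are wider than the uncertainty in the line itself.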
Fit a line by minimizing squared error. The hand math and lm agree. The slope says stopping distance rises about 3.9 feet for each additional mph of speed. Residuals show no clear pattern, so the line is a reasonable fit.