- Predict a numeric outcome Y from one or more features X.
- Examples: predict MPG from weight of a car, predict house price from size, etc.
- Today: simple linear regression with one predictor (and a quick note on diagnostics).
library(ggplot2) library(dplyr)
## ## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats': ## ## filter, lag
## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union
library(readr) library(broom) library(plotly)
## ## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2': ## ## last_plot
## The following object is masked from 'package:stats': ## ## filter
## The following object is masked from 'package:graphics': ## ## layout
library(knitr) set.seed(301)
We assume a linear relationship:
\[ Y_i \,=\, \beta_0 \, + \, \beta_1 X_i \, + \, \varepsilon_i, \qquad \text{with}\quad \varepsilon_i \sim \text{i.i.d. } (0, \sigma^2). \]
We’ll use a small, reproducible synthetic dataset that mimics a negative linear trend.
n <- 40 X <- runif(n, 1, 5) # "weight"-like predictor Y <- 45 - 5.5*X + rnorm(n, 0, 2.0) # "mpg"-like response with noise dat <- tibble(x = X, y = Y) head(dat) %>% kable()
| x | y |
|---|---|
| 3.388425 | 24.55554 |
| 1.528924 | 37.29958 |
| 1.008548 | 41.20038 |
| 4.015882 | 22.75446 |
| 3.456588 | 22.58406 |
| 2.734676 | 30.85291 |
ggplot(dat, aes(x = x, y = y)) +
geom_point(alpha = 0.8) +
geom_smooth(method = "lm", se = TRUE) +
labs(title = "Scatter with Fitted Regression Line",
x = "x (predictor)",
y = "y (response)",
caption = "geom_smooth(method='lm') adds the least-squares line and 95% CI")
## `geom_smooth()` using formula = 'y ~ x'
mod <- lm(y ~ x, data = dat) tidy(mod) %>% kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 46.305 | 0.970 | 47.739 | 0 |
| x | -5.891 | 0.296 | -19.906 | 0 |
glance(mod) %>% kable(digits = 3)
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.912 | 0.91 | 1.745 | 396.23 | 0 | 1 | -77.998 | 161.996 | 167.063 | 115.689 | 38 | 40 |
tidy() shows coefficients (estimate, std. error, t-value, p-value).glance() shows model-level stats (R², adj. R², RMSE-ish via sigma, etc.).If \(\hat{\beta}_1 < 0\), then higher \(x\) predicts lower \(y\).
A 1-unit increase in \(x\) changes \(y\) by approximately \(\hat{\beta}_1\) units on average.
The residual for observation \(i\) is: \[ e_i \,=\, y_i - \hat{y}_i, \qquad \text{where}\quad \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i. \]
aug <- augment(mod)
ggplot(aug, aes(.fitted, .resid)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_point(alpha = 0.85) +
labs(title = "Residuals vs Fitted",
x = "Fitted values",
y = "Residuals",
caption = "Random scatter around 0 suggests linearity & homoscedasticity.")
# Build an interactive scatter with fitted line
p_base <- ggplot(dat, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Interactive Scatter with Fitted Line",
x = "x (predictor)",
y = "y (response)")
ggplotly(p_base)
## `geom_smooth()` using formula = 'y ~ x'
new_x <- tibble(x = seq(min(dat$x), max(dat$x), length.out = 50)) pred <- predict(mod, newdata = new_x, interval = "prediction", level = 0.95) %>% as_tibble() pred_all <- bind_cols(new_x, pred) head(pred_all) %>% kable(digits = 2)
| x | fit | lwr | upr |
|---|---|---|---|
| 1.01 | 40.36 | 36.57 | 44.16 |
| 1.09 | 39.89 | 36.11 | 43.68 |
| 1.17 | 39.42 | 35.66 | 43.19 |
| 1.25 | 38.95 | 35.20 | 42.70 |
| 1.33 | 38.48 | 34.74 | 42.22 |
| 1.41 | 38.01 | 34.29 | 41.73 |
ggplot() +
geom_point(data = dat, aes(x, y), alpha = 0.8) +
geom_line(data = pred_all, aes(x, fit)) +
geom_ribbon(data = pred_all, aes(x, ymin = lwr, ymax = upr), alpha = 0.2) +
labs(title = "Fitted Line with 95% Prediction Band",
x = "x",
y = "y",
caption = "Prediction intervals are wider than confidence bands; they capture new points.")
For valid inference in OLS, we typically assume:
If these are violated, use robust methods, transforms, or different models.