library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(broom)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(knitr)
set.seed(301)

Why Linear Regression?

  • Predict a numeric outcome Y from one or more features X.
  • Examples: predict MPG from weight of a car, predict house price from size, etc.
  • Today: simple linear regression with one predictor (and a quick note on diagnostics).

The Model (Math)

We assume a linear relationship:

\[ Y_i \,=\, \beta_0 \, + \, \beta_1 X_i \, + \, \varepsilon_i, \qquad \text{with}\quad \varepsilon_i \sim \text{i.i.d. } (0, \sigma^2). \]

  • \(\beta_0\): intercept (where the line crosses when \(X=0\))
  • \(\beta_1\): slope (change in \(Y\) for one unit increase in \(X\))
  • Fit \(\beta_0, \beta_1\) by least squares (minimize sum of squared residuals).

The Data

We’ll use a small, reproducible synthetic dataset that mimics a negative linear trend.

n <- 40
X <- runif(n, 1, 5)                     # "weight"-like predictor
Y <- 45 - 5.5*X + rnorm(n, 0, 2.0)      # "mpg"-like response with noise
dat <- tibble(x = X, y = Y)

head(dat) %>% kable()
x y
3.388425 24.55554
1.528924 37.29958
1.008548 41.20038
4.015882 22.75446
3.456588 22.58406
2.734676 30.85291

Visualizing the Relationship (ggplot #1)

ggplot(dat, aes(x = x, y = y)) +
  geom_point(alpha = 0.8) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Scatter with Fitted Regression Line",
       x = "x (predictor)",
       y = "y (response)",
       caption = "geom_smooth(method='lm') adds the least-squares line and 95% CI")
## `geom_smooth()` using formula = 'y ~ x'

Fit the Model (R code)

mod <- lm(y ~ x, data = dat)
tidy(mod) %>% kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) 46.305 0.970 47.739 0
x -5.891 0.296 -19.906 0
glance(mod) %>% kable(digits = 3)
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.912 0.91 1.745 396.23 0 1 -77.998 161.996 167.063 115.689 38 40
  • tidy() shows coefficients (estimate, std. error, t-value, p-value).
  • glance() shows model-level stats (R², adj. R², RMSE-ish via sigma, etc.).

Interpretation Slide (Math)

If \(\hat{\beta}_1 < 0\), then higher \(x\) predicts lower \(y\).
A 1-unit increase in \(x\) changes \(y\) by approximately \(\hat{\beta}_1\) units on average.

The residual for observation \(i\) is: \[ e_i \,=\, y_i - \hat{y}_i, \qquad \text{where}\quad \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i. \]

Diagnostics: Residuals vs Fitted (ggplot #2)

aug <- augment(mod)
ggplot(aug, aes(.fitted, .resid)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_point(alpha = 0.85) +
  labs(title = "Residuals vs Fitted",
       x = "Fitted values",
       y = "Residuals",
       caption = "Random scatter around 0 suggests linearity & homoscedasticity.")

Interactive Plot (plotly)

# Build an interactive scatter with fitted line
p_base <- ggplot(dat, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Interactive Scatter with Fitted Line",
       x = "x (predictor)",
       y = "y (response)")

ggplotly(p_base)
## `geom_smooth()` using formula = 'y ~ x'

Bonus: Prediction Bands & New Predictions (Code)

new_x <- tibble(x = seq(min(dat$x), max(dat$x), length.out = 50))
pred <- predict(mod, newdata = new_x, interval = "prediction", level = 0.95) %>% as_tibble()
pred_all <- bind_cols(new_x, pred)

head(pred_all) %>% kable(digits = 2)
x fit lwr upr
1.01 40.36 36.57 44.16
1.09 39.89 36.11 43.68
1.17 39.42 35.66 43.19
1.25 38.95 35.20 42.70
1.33 38.48 34.74 42.22
1.41 38.01 34.29 41.73

Putting It Together (ggplot + ribbons)

ggplot() +
  geom_point(data = dat, aes(x, y), alpha = 0.8) +
  geom_line(data = pred_all, aes(x, fit)) +
  geom_ribbon(data = pred_all, aes(x, ymin = lwr, ymax = upr), alpha = 0.2) +
  labs(title = "Fitted Line with 95% Prediction Band",
       x = "x",
       y = "y",
       caption = "Prediction intervals are wider than confidence bands; they capture new points.")

Assumptions (Math)

For valid inference in OLS, we typically assume:

  1. Linearity in parameters: \( E[Y|X] = \beta_0 + \beta_1 X \).
  2. Independence of errors: \( \varepsilon_i \) are independent.
  3. Homoscedasticity: \( \operatorname{Var}(\varepsilon_i) = \sigma^2 \) is constant.
  4. Normality of errors (approximately) for small-sample inference.

If these are violated, use robust methods, transforms, or different models.