Linear Regression (DAT 301 HW3)

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(broom)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(knitr)
set.seed(301)

Why Linear Regression?

Predict a numeric outcome Y from one or more features X.
Examples: predict MPG from weight of a car, predict house price from size, etc.
Today: simple linear regression with one predictor (and a quick note on diagnostics).

The Model (Math)

We assume a linear relationship:

\[ Y_i \,=\, \beta_0 \, + \, \beta_1 X_i \, + \, \varepsilon_i, \qquad \text{with}\quad \varepsilon_i \sim \text{i.i.d. } (0, \sigma^2). \]

\(\beta_0\): intercept (where the line crosses when \(X=0\))
\(\beta_1\): slope (change in \(Y\) for one unit increase in \(X\))
Fit \(\beta_0, \beta_1\) by least squares (minimize sum of squared residuals).

The Data

We’ll use a small, reproducible synthetic dataset that mimics a negative linear trend.

n <- 40
X <- runif(n, 1, 5)                     # "weight"-like predictor
Y <- 45 - 5.5*X + rnorm(n, 0, 2.0)      # "mpg"-like response with noise
dat <- tibble(x = X, y = Y)

head(dat) %>% kable()

x	y
3.388425	24.55554
1.528924	37.29958
1.008548	41.20038
4.015882	22.75446
3.456588	22.58406
2.734676	30.85291

Visualizing the Relationship (ggplot #1)

ggplot(dat, aes(x = x, y = y)) +
  geom_point(alpha = 0.8) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Scatter with Fitted Regression Line",
       x = "x (predictor)",
       y = "y (response)",
       caption = "geom_smooth(method='lm') adds the least-squares line and 95% CI")

## `geom_smooth()` using formula = 'y ~ x'

Fit the Model (R code)

mod <- lm(y ~ x, data = dat)
tidy(mod) %>% kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	46.305	0.970	47.739	0
x	-5.891	0.296	-19.906	0

glance(mod) %>% kable(digits = 3)

r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.912	0.91	1.745	396.23	0	1	-77.998	161.996	167.063	115.689	38	40

tidy() shows coefficients (estimate, std. error, t-value, p-value).
glance() shows model-level stats (R², adj. R², RMSE-ish via sigma, etc.).

Interpretation Slide (Math)

If \(\hat{\beta}_1 < 0\), then higher \(x\) predicts lower \(y\).
A 1-unit increase in \(x\) changes \(y\) by approximately \(\hat{\beta}_1\) units on average.

The residual for observation \(i\) is: \[ e_i \,=\, y_i - \hat{y}_i, \qquad \text{where}\quad \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i. \]

Diagnostics: Residuals vs Fitted (ggplot #2)

aug <- augment(mod)
ggplot(aug, aes(.fitted, .resid)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_point(alpha = 0.85) +
  labs(title = "Residuals vs Fitted",
       x = "Fitted values",
       y = "Residuals",
       caption = "Random scatter around 0 suggests linearity & homoscedasticity.")

Interactive Plot (plotly)

# Build an interactive scatter with fitted line
p_base <- ggplot(dat, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Interactive Scatter with Fitted Line",
       x = "x (predictor)",
       y = "y (response)")

ggplotly(p_base)

## `geom_smooth()` using formula = 'y ~ x'

Bonus: Prediction Bands & New Predictions (Code)

new_x <- tibble(x = seq(min(dat$x), max(dat$x), length.out = 50))
pred <- predict(mod, newdata = new_x, interval = "prediction", level = 0.95) %>% as_tibble()
pred_all <- bind_cols(new_x, pred)

head(pred_all) %>% kable(digits = 2)

x	fit	lwr	upr
1.01	40.36	36.57	44.16
1.09	39.89	36.11	43.68
1.17	39.42	35.66	43.19
1.25	38.95	35.20	42.70
1.33	38.48	34.74	42.22
1.41	38.01	34.29	41.73

Putting It Together (ggplot + ribbons)

ggplot() +
  geom_point(data = dat, aes(x, y), alpha = 0.8) +
  geom_line(data = pred_all, aes(x, fit)) +
  geom_ribbon(data = pred_all, aes(x, ymin = lwr, ymax = upr), alpha = 0.2) +
  labs(title = "Fitted Line with 95% Prediction Band",
       x = "x",
       y = "y",
       caption = "Prediction intervals are wider than confidence bands; they capture new points.")

Assumptions (Math)

For valid inference in OLS, we typically assume:

Linearity in parameters: \( E[Y|X] = \beta_0 + \beta_1 X \).
Independence of errors: \( \varepsilon_i \) are independent.
Homoscedasticity: \( \operatorname{Var}(\varepsilon_i) = \sigma^2 \) is constant.
Normality of errors (approximately) for small-sample inference.

If these are violated, use robust methods, transforms, or different models.