Simple Linear Regression — Concepts & Example

Agenda

What is Simple Linear Regression (SLR)?
Assumptions (Gauss–Markov)
Example with mtcars (predicting MPG from Weight)
Fitting the model in R (lm)
Visuals: ggplot (scatter + fit), residuals
Visuals: interactive Plotly
Interpreting coefficients & \(R^2\)
Diagnostics & limitations

What is SLR? (Model)

We model a numerical response \(Y\) with a single predictor \(X\): \[ Y = \beta_0 + \beta_1 X + \varepsilon, \quad \mathbb{E}[\varepsilon]=0,\ \operatorname{Var}(\varepsilon)=\sigma^2 \]

Goal. Estimate \(\beta_0\) and \(\beta_1\) from data, then use the fitted line to explain or predict \(Y\).

Least Squares Estimators \[ \hat{\beta}_1 = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. \]
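
As a quick check of the closed-form estimators above, here is a minimal sketch that computes them by hand and compares them with `lm()`. It assumes the built-in `mtcars` data used in the example later in the deck; the names `x`, `y`, `b0`, and `b1` are illustrative only.

```r
# Built-in mtcars data: x = weight (1000 lbs), y = miles per gallon
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form least squares estimates from the formulas above
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Compare with lm(); the two rows should agree up to floating-point error
rbind(by_hand = c(intercept = b0, slope = b1),
      lm      = unname(coef(lm(y ~ x))))
```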

Assumptions (Gauss–Markov)

Linearity: the mean of \(Y\) is approximately a linear function of \(X\)
Independence: errors are uncorrelated across observations
Homoscedasticity: errors have constant variance \(\sigma^2\)

\[ \mathbb{E}[\varepsilon_i]=0,\ \operatorname{Var}(\varepsilon_i)=\sigma^2 \]

Assumption for Inference

Normality of errors (for t-tests / confidence intervals)

\[ \varepsilon_i \stackrel{iid}{\sim} N(0, \sigma^2) \]
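
One common way to check the normality assumption is a normal Q-Q plot of the residuals, optionally paired with a Shapiro-Wilk test. The sketch below assumes the same MPG-on-weight model that the mtcars example fits; the variable names `fit` and `res` are illustrative, and with only 32 observations the test has limited power, so treat the result as a rough guide.

```r
# Same model as the mtcars example: mpg regressed on weight
fit <- lm(mpg ~ wt, data = mtcars)
res <- resid(fit)

# Normal Q-Q plot: points close to the dashed line suggest approximate normality
qqnorm(res, main = "Normal Q-Q Plot of Residuals")
qqline(res, lty = 2)

# Shapiro-Wilk test as a rough numeric companion to the plot
shapiro.test(res)
```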

Example Dataset: `mtcars`

We’ll predict miles per gallon (MPG) from weight (in units of 1000 lbs) using the 32 cars in R’s built-in `mtcars` data.

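# Plotting libraries: ggplot2 for static plots, plotly for the interactive version
library(ggplot2)
library(plotly)
theme_set(theme_minimal(base_size = 16))

# Keep only the two variables we need and give them readable names
df <- mtcars[, c("mpg", "wt")]
names(df) <- c("MPG", "Weight")
head(df)     # first six rows
summary(df)  # quartiles and means for each column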

Fit the Model in R (Code)

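# Ordinary least squares fit of MPG on Weight
fit <- lm(MPG ~ Weight, data = df)
summary(fit)

# Store per-observation quantities for the diagnostic plots later on
fitted_vals <- fitted(fit)           # predicted MPG
resid_vals  <- resid(fit)            # observed minus predicted MPG
hat_vals    <- hatvalues(fit)        # leverage of each car
cooks_vals  <- cooks.distance(fit)   # influence of each car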

Model Fit: Key Numbers

## R^2: 0.7528 | Adj R^2: 0.7446 | p-value (slope): 1.294e-10 | Residual SD: 3.0459
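
The summary line above can be assembled from the components of `summary(fit)`. The following is a sketch of one way to do it, not necessarily the code behind the slide; the helper names `s`, `r2`, `adj_r2`, `p_slope`, and `resid_sd` are assumptions made for illustration.

```r
# Rebuild the example data and model so this block runs on its own
df  <- setNames(mtcars[, c("mpg", "wt")], c("MPG", "Weight"))
fit <- lm(MPG ~ Weight, data = df)
s   <- summary(fit)

r2       <- s$r.squared                             # R^2
adj_r2   <- s$adj.r.squared                         # adjusted R^2
p_slope  <- s$coefficients["Weight", "Pr(>|t|)"]    # p-value for the slope
resid_sd <- s$sigma                                 # residual standard error

cat(sprintf("R^2: %.4f | Adj R^2: %.4f | p-value (slope): %.3e | Residual SD: %.4f\n",
            r2, adj_r2, p_slope, resid_sd))
```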

ggplot #1 — Scatter with Fitted Line

ggplot(df, aes(x = Weight, y = MPG)) +
  geom_point(alpha = 0.85) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "MPG vs Weight with OLS Fit",
       subtitle = "Heavier cars tend to have lower fuel efficiency",
       x = "Weight (1000 lbs)",
       y = "Miles per Gallon")

## `geom_smooth()` using formula = 'y ~ x'

ggplot #2 — Residuals vs Fitted

data.frame(Fitted = fitted_vals, Residuals = resid_vals) |>
  ggplot(aes(x = Fitted, y = Residuals)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_point(alpha = 0.85) +
  labs(title = "Residuals vs Fitted",
       subtitle = "Check for nonlinearity or heteroscedasticity",
       x = "Fitted MPG",
       y = "Residuals")

Interactive Plotly — Explore Points

p <- ggplot(df, aes(Weight, MPG)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Interactive MPG vs Weight",
       x = "Weight (1000 lbs)", y = "MPG")
ggplotly(p)

## `geom_smooth()` using formula = 'y ~ x'

Interpreting Coefficients & \(R^2\)

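Slope (\(\hat{\beta}_1\)): expected change in MPG per additional 1000 lbs of Weight (here about -5.34 MPG)
Intercept (\(\hat{\beta}_0\)): expected MPG when Weight = 0; this extrapolates far below the observed weights (about 1.5 to 5.4 thousand lbs), so it has no practical interpretation
\(R^2\): fraction of the variance in MPG explained by Weight (here about 0.75)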

## (Intercept)      Weight 
##   37.285126   -5.344472
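
To make these interpretations concrete, the sketch below checks two identities numerically: the slope equals the difference between predictions one unit (1000 lbs) apart, and in simple linear regression \(R^2\) equals both \(1 - SSE/SST\) and the squared correlation between MPG and Weight. The weights 3 and 4 and the names `pred`, `slope_check`, `r2_manual`, and `r2_cor` are arbitrary choices for illustration.

```r
# Rebuild the example data and model so this block runs on its own
df  <- setNames(mtcars[, c("mpg", "wt")], c("MPG", "Weight"))
fit <- lm(MPG ~ Weight, data = df)

# Slope as the difference in predicted MPG between a 4000 lb and a 3000 lb car
pred <- predict(fit, newdata = data.frame(Weight = c(3, 4)))
slope_check <- unname(pred[2] - pred[1])

# R^2 as 1 - SSE/SST, and as the squared correlation
sse <- sum(resid(fit)^2)
sst <- sum((df$MPG - mean(df$MPG))^2)
r2_manual <- 1 - sse / sst
r2_cor    <- cor(df$MPG, df$Weight)^2

c(slope_check = slope_check, r2_manual = r2_manual, r2_cor = r2_cor)
```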

Inference: CI for Slope

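Under the normal-error assumption, a \((1-\alpha)\times100\%\) confidence interval for \(\beta_1\) is \[ \hat{\beta}_1 \pm t_{\alpha/2,\,n-2} \cdot SE(\hat{\beta}_1), \] where \(t_{\alpha/2,\,n-2}\) is the upper \(\alpha/2\) quantile of the t distribution with \(n-2\) degrees of freedom. For the mtcars fit at the 95% level: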

##                 2.5 %    97.5 %
## (Intercept) 33.450500 41.119753
## Weight      -6.486308 -4.202635
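
The interval formula can be applied by hand and compared with `confint()`. This is a minimal sketch assuming a 95% level; the names `alpha`, `est`, `se`, `t_crit`, and `ci_manual` are illustrative.

```r
# Rebuild the example data and model so this block runs on its own
df  <- setNames(mtcars[, c("mpg", "wt")], c("MPG", "Weight"))
fit <- lm(MPG ~ Weight, data = df)

alpha  <- 0.05
est    <- coef(summary(fit))["Weight", "Estimate"]     # slope estimate
se     <- coef(summary(fit))["Weight", "Std. Error"]   # its standard error
t_crit <- qt(1 - alpha / 2, df = df.residual(fit))     # n - 2 degrees of freedom

# Estimate plus or minus critical value times standard error
ci_manual <- est + c(-1, 1) * t_crit * se
ci_manual

# Built-in equivalent; should match ci_manual
confint(fit, "Weight", level = 1 - alpha)
```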

Model Diagnostics & Limitations (1/2)

A single predictor may omit important variables (e.g., horsepower or cylinder count), so the Weight slope may be confounded
Nonlinearity → consider transforms or polynomial terms
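
One way to probe nonlinearity is to add a squared term and compare the nested models with an F-test. The sketch below is only an illustration of that idea; adding `I(Weight^2)` goes beyond simple linear regression, and the names `fit_lin` and `fit_quad` are assumptions.

```r
# Rebuild the example data so this block runs on its own
df <- setNames(mtcars[, c("mpg", "wt")], c("MPG", "Weight"))

fit_lin  <- lm(MPG ~ Weight, data = df)                 # straight line
fit_quad <- lm(MPG ~ Weight + I(Weight^2), data = df)   # adds a squared term

# F-test for whether the squared term improves the fit
anova(fit_lin, fit_quad)

# Adjusted R^2 penalizes the extra parameter, so it is a fairer comparison than R^2
c(linear    = summary(fit_lin)$adj.r.squared,
  quadratic = summary(fit_quad)$adj.r.squared)
```

A small p-value in the `anova()` table would suggest curvature that the straight line misses; the residuals-vs-fitted plot is the visual counterpart of this check.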

Model Diagnostics & Limitations (2/2)

Heteroscedasticity → consider robust SE or transform response
Influential points → check leverage/Cook’s distance

data.frame(Leverage = hat_vals, CooksD = cooks_vals) |>
  ggplot(aes(x = Leverage, y = CooksD)) +
  geom_point(alpha = 0.85) +
  labs(title = "Leverage vs Cook's Distance",
       x = "Leverage (hat values)",
       y = "Cook's Distance")
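
To turn the plot into a short list of cars worth a closer look, one option is to flag observations against rule-of-thumb cutoffs. The thresholds used here (leverage above \(2p/n\), Cook's distance above \(4/n\)) are common conventions rather than hard rules, and the column names `HighLeverage` and `HighCooksD` are made up for this sketch.

```r
# Rebuild the example data and model so this block runs on its own
df  <- setNames(mtcars[, c("mpg", "wt")], c("MPG", "Weight"))
fit <- lm(MPG ~ Weight, data = df)

n <- nrow(df)            # number of observations
p <- length(coef(fit))   # number of estimated coefficients (intercept + slope)

infl <- data.frame(Leverage = hatvalues(fit), CooksD = cooks.distance(fit))

# Rule-of-thumb flags (conventions, not formal tests)
infl$HighLeverage <- infl$Leverage > 2 * p / n
infl$HighCooksD   <- infl$CooksD > 4 / n

# Cars exceeding either cutoff
infl[infl$HighLeverage | infl$HighCooksD, ]
```

A flagged point is not necessarily a problem; refitting without it and comparing coefficients is a simple way to see how much it matters.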

TL;DR

SLR fits a straight line: \(Y = \beta_0 + \beta_1 X + \varepsilon\)
Use visuals (scatter + fit), metrics (\(R^2\), p-values), and diagnostics
Keep assumptions in mind; validate with plots and domain context