Simple Linear Regression

2025-09-16

Introduction

Regression models the relationship between a predictor and an outcome; simple linear regression uses a single predictor.
Example: predict fuel efficiency (mpg) from engine size (liters), optionally adding vehicle weight as a second predictor.

The Model (Math 1)

We assume a linear model: \[ Y = \beta_0 + \beta_1 X + \epsilon, \] where \(Y\) is mpg, \(X\) is engine size, \(\beta_0\) is the intercept, \(\beta_1\) the slope, and \(\epsilon\) the random error.

Setup & Data

library(ggplot2)
library(plotly)

set.seed(123)
n <- 50
engine_size <- runif(n, 1.0, 5.0)
weight <- runif(n, 2000, 4000)
mpg <- 40 - 3*engine_size - 0.002*weight + rnorm(n, 0, 2)
df <- data.frame(engine_size, weight, mpg)

# Fit models
fit1 <- lm(mpg ~ engine_size, data = df)
fit2 <- lm(mpg ~ engine_size + weight, data = df)
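The second model, fit2, is fitted above but not inspected later in the deck. The following is a minimal sketch of one way to compare it with fit1. It repeats the data simulation so that it runs on its own; the object name r2 is introduced here only for illustration.

```r
# Recreate the simulated data from the setup (same seed, same call order)
set.seed(123)
n <- 50
engine_size <- runif(n, 1.0, 5.0)
weight <- runif(n, 2000, 4000)
mpg <- 40 - 3*engine_size - 0.002*weight + rnorm(n, 0, 2)
df <- data.frame(engine_size, weight, mpg)

fit1 <- lm(mpg ~ engine_size, data = df)
fit2 <- lm(mpg ~ engine_size + weight, data = df)

# Coefficients of the two-predictor model (intercept, engine_size, weight)
coef(fit2)

# R-squared cannot decrease when a predictor is added to a nested model
r2 <- c(simple = summary(fit1)$r.squared,
        with_weight = summary(fit2)$r.squared)
r2

# F-test of the nested models: does weight improve the fit?
anova(fit1, fit2)
```

Because weight enters the simulated mpg with a true coefficient of -0.002, its estimate in fit2 would be expected to land near that value, up to sampling noise.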

Scatterplot: Engine Size vs MPG

ggplot(df, aes(engine_size, mpg)) +
  geom_point() +
  labs(title = "Engine Size vs MPG",
       x = "Engine Size (L)", y = "MPG")

Regression Fit: Adding a Line

ggplot(df, aes(engine_size, mpg)) +
  geom_point(alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Linear Regression Fit",
       x = "Engine Size (L)", y = "MPG")

## `geom_smooth()` using formula = 'y ~ x'

Model Summary

coef(fit1)

## (Intercept) engine_size 
##   35.031109   -3.209846
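For a single predictor, the OLS estimates have a closed form: the slope is the sample covariance of X and Y divided by the sample variance of X, and the intercept makes the line pass through the point of means. The sketch below checks this against lm(); it repeats the data simulation so that it runs on its own, and the names x, y, b0, and b1 are illustrative.

```r
# Recreate the simulated data from the setup (same seed, same call order)
set.seed(123)
n <- 50
engine_size <- runif(n, 1.0, 5.0)
weight <- runif(n, 2000, 4000)
mpg <- 40 - 3*engine_size - 0.002*weight + rnorm(n, 0, 2)
df <- data.frame(engine_size, weight, mpg)
fit1 <- lm(mpg ~ engine_size, data = df)

x <- df$engine_size
y <- df$mpg

# Slope: sample covariance over sample variance of the predictor
b1 <- cov(x, y) / var(x)
# Intercept: forces the line through (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)

# Should agree with lm() up to floating-point tolerance
all.equal(unname(coef(fit1)), c(b0, b1))
```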

summary(fit1)$r.squared

## [1] 0.7647601
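R-squared is the share of the variation in mpg that the fitted line accounts for: one minus the residual sum of squares over the total sum of squares. In simple regression it also equals the squared correlation between X and Y. A minimal sketch of both checks follows, repeating the data simulation so that it runs on its own; sse, sst, and r2_manual are illustrative names.

```r
# Recreate the simulated data from the setup (same seed, same call order)
set.seed(123)
n <- 50
engine_size <- runif(n, 1.0, 5.0)
weight <- runif(n, 2000, 4000)
mpg <- 40 - 3*engine_size - 0.002*weight + rnorm(n, 0, 2)
df <- data.frame(engine_size, weight, mpg)
fit1 <- lm(mpg ~ engine_size, data = df)

# Residual (unexplained) and total sums of squares
sse <- sum(residuals(fit1)^2)
sst <- sum((df$mpg - mean(df$mpg))^2)

r2_manual <- 1 - sse / sst
r2_manual

# Both comparisons should return TRUE
all.equal(r2_manual, summary(fit1)$r.squared)
all.equal(r2_manual, cor(df$engine_size, df$mpg)^2)
```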

3D Scatter (plotly)

plot_ly(
  df, x = ~engine_size, y = ~weight, z = ~mpg,
  type = "scatter3d", mode = "markers"
)

The Fitted Equation

Let \(\hat{\beta}_0\) and \(\hat{\beta}_1\) be the OLS estimates. The fitted line is: \[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X. \] For this dataset (rounded), \[ \hat{Y} \approx 35.03 - 3.21\,X, \] so each additional liter of engine size is associated with about 3.21 fewer mpg.
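The fitted equation can be used to predict mpg for engine sizes that are not in the data. The sketch below uses predict() and confirms that it matches plugging values into the equation by hand. The engine sizes 2.0 and 3.5 and the names new_cars, pred, and manual are hypothetical choices for illustration; the data simulation is repeated so that the block runs on its own.

```r
# Recreate the simulated data from the setup (same seed, same call order)
set.seed(123)
n <- 50
engine_size <- runif(n, 1.0, 5.0)
weight <- runif(n, 2000, 4000)
mpg <- 40 - 3*engine_size - 0.002*weight + rnorm(n, 0, 2)
df <- data.frame(engine_size, weight, mpg)
fit1 <- lm(mpg ~ engine_size, data = df)

# Hypothetical new cars, inside the simulated 1.0 to 5.0 liter range
new_cars <- data.frame(engine_size = c(2.0, 3.5))

# Point predictions from the fitted line
pred <- predict(fit1, newdata = new_cars)

# The same predictions by hand: beta0_hat + beta1_hat * x
manual <- coef(fit1)[1] + coef(fit1)[2] * new_cars$engine_size
all.equal(unname(pred), unname(manual))

# 95% prediction intervals for an individual new car
predict(fit1, newdata = new_cars, interval = "prediction")
```

Predictions are most trustworthy inside the range of engine sizes used to fit the model; extrapolating beyond it assumes the straight-line relationship continues to hold.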

Code to Recreate Plots (Code-only slide)

# Fit and plot
fit1 <- lm(mpg ~ engine_size, data = df)
ggplot(df, aes(engine_size, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# 3D plot
library(plotly)
plot_ly(df, x=~engine_size, y=~weight, z=~mpg,
        type="scatter3d", mode="markers")

Conclusion

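Simple linear regression captures a straight-line relationship between a predictor and a response. Here, larger engines are associated with lower mpg (slope about -3.21, R-squared about 0.76).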
Extensions: add predictors (multiple regression), check residuals/assumptions, try nonlinear models.
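As a starting point for the residual check mentioned above, the sketch below draws two common diagnostic plots with base R graphics: residuals against fitted values (looking for curvature or changing spread) and a normal Q-Q plot of the residuals. It is one possible check under the assumptions of this simulated example, not a complete diagnostic workflow; res and fitted_vals are illustrative names, and the data simulation is repeated so that the block runs on its own.

```r
# Recreate the simulated data from the setup (same seed, same call order)
set.seed(123)
n <- 50
engine_size <- runif(n, 1.0, 5.0)
weight <- runif(n, 2000, 4000)
mpg <- 40 - 3*engine_size - 0.002*weight + rnorm(n, 0, 2)
df <- data.frame(engine_size, weight, mpg)
fit1 <- lm(mpg ~ engine_size, data = df)

res <- residuals(fit1)
fitted_vals <- fitted(fit1)

# Residuals vs fitted: a patternless band around zero supports linearity
# and constant error variance
plot(fitted_vals, res,
     xlab = "Fitted MPG", ylab = "Residual",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line support normally distributed errors
qqnorm(res)
qqline(res)
```

With an intercept in the model, OLS residuals sum to zero and are uncorrelated with the fitted values by construction, so the plots are read for shape and spread rather than for an overall trend.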