Simple Linear Regression

2026-04-12

What is Simple Linear Regression?

Simple Linear Regression models the linear relationship between:

A response variable \(Y\) (continuous, numeric)
A single predictor variable \(X\) (numeric)

The goal is to find the “best fit” line through the data: the line that minimizes the sum of squared prediction errors.

Example: Can we predict a car’s fuel efficiency (mpg) from its weight?

The Model

The population regression model is:

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]

where:

\(\beta_0\) = intercept (expected value of \(Y\) when \(X = 0\))
\(\beta_1\) = slope (expected change in \(Y\) per unit increase in \(X\))
\(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\) = random error term, assumed independent across observations with constant variance \(\sigma^2\)

The fitted (estimated) model is:

\[\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\]
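
To make the distinction between the population model and the fitted model concrete, here is a minimal simulation sketch. The true parameter values, the sample size, and the range of \(X\) are hypothetical, chosen only for illustration; they do not come from any dataset in these slides.

```r
# Simulate data from Y = beta0 + beta1 * X + eps, with eps ~ N(0, sigma^2).
# All numeric values below are hypothetical, chosen only for illustration.
set.seed(42)
n     <- 200
beta0 <- 2.0
beta1 <- 0.5
sigma <- 1.0

x   <- runif(n, min = 0, max = 10)
eps <- rnorm(n, mean = 0, sd = sigma)
y   <- beta0 + beta1 * x + eps

# Fit the model and compare the estimates with the true parameters
sim_fit <- lm(y ~ x)
rbind(true = c(beta0, beta1), estimated = coef(sim_fit))
```

The estimates should land close to, but not exactly on, the true values, which is the gap between \(\beta\) and \(\hat{\beta}\).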

Ordinary Least Squares (OLS) Estimation

We estimate \(\hat{\beta}_0\) and \(\hat{\beta}_1\) by minimizing the Residual Sum of Squares:

\[\text{RSS} = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n}(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2\]
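
As a sketch of where the estimates come from, set the partial derivatives of RSS with respect to each coefficient to zero (the normal equations):

\[\frac{\partial\,\text{RSS}}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n}(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0\]

\[\frac{\partial\,\text{RSS}}{\partial \hat{\beta}_1} = -2\sum_{i=1}^{n}X_i(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0\]

The first equation gives \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\). Substituting this into the second equation gives \(\sum_{i=1}^{n} X_i(Y_i - \bar{Y}) = \hat{\beta}_1 \sum_{i=1}^{n} X_i(X_i - \bar{X})\). Because \(\sum_{i=1}^{n}(Y_i - \bar{Y}) = 0\) and \(\sum_{i=1}^{n}(X_i - \bar{X}) = 0\), each \(X_i\) outside the parentheses can be replaced by \(X_i - \bar{X}\) without changing either sum.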

The closed-form OLS solutions are:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{S_{XY}}{S_{XX}}\]

\[\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\]
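
The formulas above can be checked by hand on a small dataset. This is a minimal sketch: the five data points are hypothetical, and the variable names (`s_xy`, `s_xx`, `beta1_hat`, `beta0_hat`) are chosen here to mirror the notation.

```r
# Hypothetical toy data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

s_xy <- sum((x - mean(x)) * (y - mean(y)))  # S_XY = 19.9
s_xx <- sum((x - mean(x))^2)                # S_XX = 10

beta1_hat <- s_xy / s_xx                    # slope, about 1.99
beta0_hat <- mean(y) - beta1_hat * mean(x)  # intercept, about 0.05

# Cross-check against lm(); this should return TRUE
toy_fit <- lm(y ~ x)
all.equal(unname(coef(toy_fit)), c(beta0_hat, beta1_hat))
```

In practice `lm()` does this work; the hand computation is only meant to show that the closed-form expressions and the fitted coefficients agree.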

Example: `mtcars` Dataset

We use the built-in R dataset mtcars to predict mpg (miles per gallon) from wt (weight in 1000 lbs).

model <- lm(mpg ~ wt, data = mtcars)
summary(model)$coefficients

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 37.285126   1.877627 19.857575 8.241799e-19
## wt          -5.344472   0.559101 -9.559044 1.293959e-10

Interpretation:

For every 1,000 lb increase in weight, mpg decreases by ~5.34 on average
A zero-weight car would get ~37.29 mpg (an extrapolation far outside the observed weights of roughly 1,500 to 5,400 lb, so not meaningful here)
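
The interpretation above can be turned into a prediction inside the observed weight range. This sketch refits the same model so that it runs on its own; the 3,000 lb car (`new_car`) is a hypothetical input chosen for illustration.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Predicted mpg for a hypothetical 3,000 lb car (wt = 3)
new_car <- data.frame(wt = 3)
predict(model, newdata = new_car)   # 37.285 - 5.344 * 3, about 21.25

# 95% confidence intervals for the intercept and slope
confint(model, level = 0.95)

# The t value in the coefficient table is Estimate / Std. Error
coefs <- summary(model)$coefficients
coefs["wt", "Estimate"] / coefs["wt", "Std. Error"]   # about -9.56
```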

Scatter Plot with Regression Line

(Figure: scatter plot of mpg against wt with the fitted regression line.)

Residual Diagnostics

The residuals show no strong pattern against the fitted values → linearity and constant variance assumptions look reasonable.
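
One common way to draw such diagnostics in base R is sketched below. It is not necessarily the code behind the original slide's plot; the axis labels and titles are choices made here.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Residuals vs fitted values: look for curvature (non-linearity)
# or a funnel shape (non-constant variance)
plot(fitted(model), residuals(model),
     xlab = "Fitted mpg", ylab = "Residual",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the residuals, to check the normal-error assumption
qqnorm(residuals(model))
qqline(residuals(model))
```

Calling `plot(model)` produces a similar set of diagnostic plots with less code.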

3D Interactive Plot: MPG, Weight & Horsepower

(Figure: interactive 3D scatter plot of mpg, weight, and horsepower.)

R Code: Building the Model

# Load the plotting package and the dataset
library(ggplot2)
data(mtcars)

# Fit simple linear regression
model <- lm(mpg ~ wt, data = mtcars)

# View summary
summary(model)

# Plot with ggplot2
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "#8C1D40", size = 3) +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  labs(title = "MPG vs Weight",
       x = "Weight (1000 lbs)", y = "MPG") +
  theme_minimal()

Model Evaluation: \(R^2\) and RMSE

The coefficient of determination \(R^2\) measures the proportion of variance in \(Y\) explained by \(X\):

\[R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}\]

r2   <- summary(model)$r.squared
rmse <- sqrt(mean(residuals(model)^2))
cat("R-squared:", round(r2, 4), "| RMSE:", round(rmse, 4), "mpg")

## R-squared: 0.7528 | RMSE: 2.9492 mpg

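\(R^2 \approx 0.75\) means ~75% of the variability in fuel efficiency is explained by weight alone. The RMSE of ~2.95 mpg is the typical size of a prediction error, in the units of the response; because it divides by \(n\), it is slightly smaller than the residual standard error (~3.05 mpg) reported by summary(model), which divides by \(n - 2\).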
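
The \(R^2\) formula can also be evaluated directly from the residuals, as a check on the value that `summary()` reports. This sketch refits the model so that it runs on its own; `rss`, `tss`, and `r2_manual` are names chosen here.

```r
model <- lm(mpg ~ wt, data = mtcars)

rss <- sum(residuals(model)^2)                 # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares

r2_manual <- 1 - rss / tss
r2_manual                                      # about 0.7528

# With a single predictor, R^2 equals the squared sample correlation
cor(mtcars$wt, mtcars$mpg)^2
```

The last line only holds for simple linear regression with an intercept; with more predictors, \(R^2\) is no longer the square of a single pairwise correlation.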

Summary

Simple Linear Regression fits the line \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\) using OLS
The slope \(\hat{\beta}_1\) estimates the average change in \(Y\) per unit increase in \(X\)
Residual plots help check model assumptions
\(R^2\) measures goodness of fit
In our example, weight is a strong negative predictor of mpg (\(R^2 \approx 0.75\), \(p < 0.001\))

Simple Linear Regression is the foundation for most regression modeling, so master it first!