What is Simple Linear Regression?

Simple Linear Regression models the linear relationship between two variables:

  • A response variable \(Y\) — the outcome we want to predict
  • A predictor variable \(X\) — the variable used for prediction

Core idea: find the straight line through the data that minimizes the sum of squared prediction errors.

Used widely in Statistics, Data Science, Economics, Biology, and Engineering.

The Model Equation

The simple linear regression model is:

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]

  Symbol               Meaning
  \(\beta_0\)          Intercept: value of \(Y\) when \(X = 0\)
  \(\beta_1\)          Slope: change in \(Y\) per unit increase in \(X\)
  \(\varepsilon_i\)    Error term, \(\varepsilon_i \sim \mathcal{N}(0,\, \sigma^2)\)

The predicted value: \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\)
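To make the model equation concrete, the sketch below simulates data from it and then recovers the coefficients with lm(). The sample size, the "true" parameter values, and the seed are arbitrary choices made for illustration; they do not come from any real dataset.

```r
# Simulate from Y_i = beta0 + beta1 * X_i + eps_i with normal errors.
# All numeric settings below are hypothetical, chosen only for illustration.
set.seed(42)

n     <- 200
beta0 <- 2      # assumed true intercept
beta1 <- 0.5    # assumed true slope
sigma <- 1      # assumed error standard deviation

x   <- runif(n, min = 0, max = 10)       # predictor values
eps <- rnorm(n, mean = 0, sd = sigma)    # error term, N(0, sigma^2)
y   <- beta0 + beta1 * x + eps           # response generated by the model

# Fit a straight line by least squares and inspect the estimates
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

The estimates should land close to the assumed values of 2 and 0.5, though not exactly on them, because each simulated sample carries its own random error.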

Estimating Coefficients — OLS

We estimate coefficients by Ordinary Least Squares, minimizing:

\[\text{SSE} = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2\]
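A brief sketch of why this minimization can be solved exactly: substituting \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\) into SSE and setting both partial derivatives to zero gives two linear equations (the normal equations) in the two unknown coefficients:

\[\frac{\partial\,\text{SSE}}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n}\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0\]

\[\frac{\partial\,\text{SSE}}{\partial \hat{\beta}_1} = -2\sum_{i=1}^{n}X_i\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0\]

Dividing the first equation by \(-2n\) gives \(\bar{Y} = \hat{\beta}_0 + \hat{\beta}_1\,\bar{X}\), so the fitted line always passes through the point \((\bar{X}, \bar{Y})\).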

The closed-form solutions are:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\,\bar{X}\]
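As a check on these formulas, the sketch below evaluates them directly and compares the result with lm(). It uses R's built-in mtcars data with wt as \(X\) and mpg as \(Y\); the variable names x_bar, beta1_hat, and so on are illustrative.

```r
# Evaluate the closed-form OLS formulas by hand on the built-in mtcars data
x <- mtcars$wt    # predictor X
y <- mtcars$mpg   # response Y

x_bar <- mean(x)
y_bar <- mean(y)

# Slope: sum of cross-deviations over sum of squared X deviations
beta1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Intercept: forces the line through (x_bar, y_bar)
beta0_hat <- y_bar - beta1_hat * x_bar

manual_estimates <- c(beta0_hat, beta1_hat)

# Compare with the coefficients returned by lm()
lm_estimates <- unname(coef(lm(mpg ~ wt, data = mtcars)))
all.equal(manual_estimates, lm_estimates)
```

The two sets of estimates agree up to floating-point tolerance, since lm() solves the same least-squares problem (internally it uses a QR decomposition rather than these formulas).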

Our Example: mtcars Dataset

We model fuel efficiency (mpg, in miles per US gallon) using car weight (wt, in units of 1000 lbs):

\[\widehat{\text{mpg}} = \hat{\beta}_0 + \hat{\beta}_1 \times \text{wt}\]

─────────────────────────────────────
  Intercept  (β₀) :  37.2851
  Slope      (β₁) :  -5.3445
  R-squared       :   0.7528
  p-value  (β₁)   :  1.29e-10
─────────────────────────────────────

Each additional 1000 lbs of weight is associated with a decrease of about 5.34 mpg, on average.
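The sketch below shows one way to pull these summary numbers out of the fitted model and to see the slope as a difference between two predictions. The two car weights (3 and 4, meaning 3000 and 4000 lbs) are hypothetical values picked for illustration.

```r
# Fit mpg on wt (weight in units of 1000 lbs) using the built-in mtcars data
model <- lm(mpg ~ wt, data = mtcars)

# Extract the quantities reported in the summary box
fit_summary <- summary(model)
estimates   <- coef(model)                                  # intercept and slope
r_squared   <- fit_summary$r.squared
slope_p     <- fit_summary$coefficients["wt", "Pr(>|t|)"]   # p-value for the slope

# Predicted mpg for two hypothetical cars that differ by one unit of wt
new_cars  <- data.frame(wt = c(3, 4))
predicted <- predict(model, newdata = new_cars)

# The gap between the two predictions equals the slope estimate
slope_from_predictions <- unname(diff(predicted))
slope_from_predictions
```

Because the model is a straight line, the same gap appears between any two cars that are 1000 lbs apart, wherever they sit in the weight range.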

Scatter Plot with Regression Line

Residual Diagnostics

R Code for This Analysis

library(ggplot2)
library(dplyr)
library(tibble)

# Step 1: Load data
data <- mtcars %>% rownames_to_column("car")

# Step 2: Fit model
model <- lm(mpg ~ wt, data = data)

# Step 3: View results
summary(model)

# Step 4: Add diagnostics
data <- data %>%
  mutate(fitted_values = fitted(model),
         residuals     = resid(model))

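# Step 5: Plot the data with the fitted line and its 95% confidence band
ggplot(data, aes(x = wt, y = mpg)) +
  geom_point(color = "#8C1D40", size = 3) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon") +
  theme_minimal()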

3D Interactive Plot (Plotly)

mpg vs. weight and horsepower from mtcars

Goodness of Fit — R²

\(R^2\) measures the proportion of the variance in \(Y\) that the model explains:

\[R^2 = 1 - \frac{\text{SSE}}{\text{SST}} = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}\]
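The formula can be checked by computing SSE and SST by hand. The sketch below does this for the mpg ~ wt model on the built-in mtcars data; the names sse, sst, and r2_manual are illustrative.

```r
# Compute R-squared from its definition and compare with summary()
model <- lm(mpg ~ wt, data = mtcars)
y     <- mtcars$mpg

sse <- sum((y - fitted(model))^2)   # variation left unexplained by the line
sst <- sum((y - mean(y))^2)         # total variation around the mean of Y

r2_manual  <- 1 - sse / sst
r2_summary <- summary(model)$r.squared

all.equal(r2_manual, r2_summary)
```

For a simple linear regression with an intercept, this value also equals the squared correlation between the two variables, cor(mtcars$wt, mtcars$mpg)^2.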

Key Takeaways

Results from our model:

  • \(\hat{\beta}_1 \approx -5.34\): each additional 1000 lbs of weight is associated with about 5.3 fewer mpg
  • \(\hat{\beta}_0 \approx 37.29\): predicted mpg at zero weight, an extrapolation with no physical meaning
  • \(R^2 \approx 0.75\): weight explains about 75% of the variance in fuel efficiency

Four assumptions to always verify:

  1. Linearity — relationship is truly linear
  2. Independence — observations are independent
  3. Homoscedasticity — residuals have constant variance
  4. Normality — residuals are approximately normal

Always inspect residual plots — a high R² alone is not enough!
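A minimal sketch of two common residual plots follows, using base R graphics rather than ggplot2 to keep it dependency-free. The choice of plots and the panel layout are illustrative, not a complete diagnostic workflow.

```r
# Residual diagnostics for the mpg ~ wt model on the built-in mtcars data
model <- lm(mpg ~ wt, data = mtcars)
res   <- resid(model)    # observed minus fitted values
fit   <- fitted(model)

# Draw two panels side by side, then restore the previous layout
old_par <- par(mfrow = c(1, 2))

# Linearity and homoscedasticity: look for curvature or a funnel shape
plot(fit, res,
     xlab = "Fitted mpg", ylab = "Residual",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Normality: points should fall close to the reference line
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)

par(old_par)
```

These plots speak to linearity, constant variance, and normality. Independence cannot be read off them and has to be judged from how the data were collected.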