What is Simple Linear Regression?

Simple Linear Regression is a statistical method used to model the relationship between:

  • One independent variable (predictor) → \(X\)
  • One dependent variable (response) → \(Y\)

It draws the best-fit straight line through a scatter of data points.

Real-world uses:

  • Predicting house prices from square footage
  • Estimating fuel efficiency from car weight
  • Forecasting sales from advertising spend

The Math Behind It

The model equation is:

\[Y = \beta_0 + \beta_1 X + \varepsilon\]

Where:

  • \(Y\) = response (dependent) variable, the quantity we want to predict
  • \(X\) = predictor (explanatory) variable
  • \(\beta_0\) = intercept — value of \(Y\) when \(X = 0\)
  • \(\beta_1\) = slope — change in \(Y\) for a one-unit increase in \(X\)
  • \(\varepsilon\) = error term — random noise (assumed: \(\varepsilon \sim N(0, \sigma^2)\))
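To make the model equation concrete, here is a minimal simulation sketch. The parameter values (`beta0`, `beta1`, `sigma`), the sample size, and the seed are all made up for illustration; none of them come from the dataset used later.

```r
# Simulate data from Y = beta0 + beta1 * X + epsilon with made-up parameters
set.seed(42)                           # arbitrary seed, for reproducibility only
n     <- 1000
beta0 <- 2                             # hypothetical true intercept
beta1 <- 0.5                           # hypothetical true slope
sigma <- 1                             # hypothetical error standard deviation

x   <- runif(n, min = 0, max = 10)     # predictor values
eps <- rnorm(n, mean = 0, sd = sigma)  # error term, N(0, sigma^2)
y   <- beta0 + beta1 * x + eps         # response generated by the model

# Fitting a line to the simulated data should give estimates near beta0 and beta1
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

The estimates will not match the true values exactly, because of the noise term, but they should land close to 2 and 0.5.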

Estimating the Coefficients

We use the Ordinary Least Squares (OLS) method to estimate \(\beta_0\) and \(\beta_1\).

OLS minimizes the sum of squared residuals:

\[\text{Minimize} \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2\]

The formulas for the estimates are:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\]

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
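The two formulas can be checked by hand in a few lines of R. This sketch uses the `wt` and `mpg` columns of the built-in mtcars data (the same data this deck works with); the variable names are my own.

```r
# Closed-form OLS estimates, computed directly from the formulas
x <- mtcars$wt
y <- mtcars$mpg

x_bar <- mean(x)
y_bar <- mean(y)

beta1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)  # slope
beta0_hat <- y_bar - beta1_hat * x_bar                            # intercept

manual <- c(intercept = beta0_hat, slope = beta1_hat)
manual

# The hand-computed values should agree with what lm() reports
all.equal(unname(manual), unname(coef(lm(mpg ~ wt, data = mtcars))))
```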

Our Dataset: mtcars

I am going to use R’s built-in mtcars dataset:

  • wt — weight of the car (1000 lbs) → predictor \(X\)
  • mpg — fuel efficiency (miles per gallon) → response \(Y\)
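A preview of these two columns can be produced with a call along these lines. The exact command used for the output on this slide is not shown, so this is an assumption.

```r
# First eight rows of the two columns used in this example
preview <- head(mtcars[, c("wt", "mpg")], 8)
preview
```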
##                      wt  mpg
## Mazda RX4         2.620 21.0
## Mazda RX4 Wag     2.875 21.0
## Datsun 710        2.320 22.8
## Hornet 4 Drive    3.215 21.4
## Hornet Sportabout 3.440 18.7
## Valiant           3.460 18.1
## Duster 360        3.570 14.3
## Merc 240D         3.190 24.4

Question: Can we predict a car’s fuel efficiency based on its weight?

Scatter Plot: Weight vs Fuel Efficiency
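A scatter plot like this one can be sketched in base R as follows. The original figure may well have been drawn with a different plotting package; the labels and styling here are my own choices.

```r
# Scatter plot of fuel efficiency against weight
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Fuel efficiency (mpg)",
     main = "Weight vs Fuel Efficiency",
     pch  = 19)

# The correlation coefficient summarises the strength of the linear trend
r <- cor(mtcars$wt, mtcars$mpg)
r
```

A strongly negative correlation is what we would expect if heavier cars tend to get fewer miles per gallon.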

Fitting the Regression Line
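One way to draw the fitted line over the data, again in base R. This is a sketch rather than the code behind the original figure; the object name `fit` and the colours are assumptions.

```r
# Fit the line, then overlay it on the scatter plot
fit <- lm(mpg ~ wt, data = mtcars)

plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)",
     ylab = "Fuel efficiency (mpg)",
     pch  = 19)
abline(fit, col = "red", lwd = 2)  # abline() reads intercept and slope from the fit
```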

The R Code That Fits This Model

# Fit a simple linear regression model
model <- lm(mpg ~ wt, data = mtcars)

# View model summary
summary(model)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Understanding the Output

From the summary output:

  • Intercept \(\hat{\beta}_0 \approx 37.29\) — estimated mpg when weight is zero (theoretical baseline)
  • Slope \(\hat{\beta}_1 \approx -5.34\) — each extra 1000 lbs reduces fuel efficiency by ~5.34 mpg
  • R-squared \(\approx 0.753\) — weight explains about 75% of variation in fuel efficiency
  • p-value < 0.001 — the relationship is statistically significant

So the fitted model is:

\[\hat{\text{mpg}} = 37.29 - 5.34 \times \text{wt}\]
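The numbers quoted from the summary can also be pulled out of the model object directly, which is handy when they feed into later calculations. A short sketch (the model is refit here so the snippet runs on its own):

```r
model <- lm(mpg ~ wt, data = mtcars)

coefs <- coef(model)                 # intercept and slope
r2    <- summary(model)$r.squared    # proportion of variance explained
ci    <- confint(model, level = 0.95)  # 95% confidence intervals for both coefficients

round(coefs, 2)
round(r2, 3)
ci

# In simple linear regression, R-squared equals the squared correlation of X and Y
all.equal(r2, cor(mtcars$wt, mtcars$mpg)^2)
```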

Residuals: How Good is the Fit?

A residual is the difference between observed and predicted values:

\[e_i = y_i - \hat{y}_i\]

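In a residual plot, the points should scatter randomly around zero. A clear pattern, such as a curve or a funnel shape, suggests the straight-line model is missing some structure in the data.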
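A residuals-versus-fitted plot takes only a few lines in base R. The sketch below also recomputes the residual standard error from the residuals; the plot styling is my own choice.

```r
model <- lm(mpg ~ wt, data = mtcars)
res   <- resid(model)   # e_i = y_i - y_hat_i

# Residuals against fitted values, with a reference line at zero
plot(fitted(model), res,
     xlab = "Fitted mpg",
     ylab = "Residual",
     pch  = 19)
abline(h = 0, lty = 2)

# With an intercept in the model, OLS residuals sum to zero (up to rounding)
sum(res)

# Residual standard error: sqrt(SSR / residual degrees of freedom)
rse <- sqrt(sum(res^2) / df.residual(model))
rse
```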

3D Interactive Plot: Weight, MPG and Residuals
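An interactive 3D view of this kind could be built with the plotly package. That the original used plotly is an assumption on my part, and the snippet skips the plot if the package is not installed.

```r
model <- lm(mpg ~ wt, data = mtcars)

# One row per car: weight, observed mpg, and residual from the fitted line
plot_data <- data.frame(
  wt       = mtcars$wt,
  mpg      = mtcars$mpg,
  residual = resid(model)
)

if (requireNamespace("plotly", quietly = TRUE)) {
  # 3D scatter: x = weight, y = mpg, z = residual
  plotly::plot_ly(plot_data,
                  x = ~wt, y = ~mpg, z = ~residual,
                  type = "scatter3d", mode = "markers")
} else {
  message("Package 'plotly' is not installed; skipping the 3D plot.")
}
```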

Key Assumptions of Linear Regression

For OLS estimates to be valid, we assume:

  1. Linearity — the relationship between \(X\) and \(Y\) is linear
  2. Independence — observations are independent of each other
  3. Homoscedasticity — constant variance of errors: \(\text{Var}(\varepsilon_i) = \sigma^2\)
  4. Normality — errors are normally distributed: \(\varepsilon \sim N(0, \sigma^2)\)

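Violations matter in different ways: a nonlinear relationship biases the estimates, while dependent observations or non-constant variance make the standard errors (and so the p-values) unreliable. Normality matters mainly for inference in small samples.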
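Some of these assumptions can be checked informally from the fitted model. The sketch below uses base R's built-in diagnostic plots plus a Shapiro-Wilk test on the residuals; this is one common choice of checks, not the only one, and independence usually has to be judged from how the data were collected.

```r
model <- lm(mpg ~ wt, data = mtcars)

old_par <- par(mfrow = c(1, 2))
plot(model, which = 1)  # residuals vs fitted: look for curvature or a funnel shape
plot(model, which = 2)  # normal Q-Q plot: points near the line suggest normal residuals
par(old_par)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(resid(model))
sw
```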

Making Predictions

Once we have the model, we can predict fuel efficiency for new car weights:

# Predict mpg for cars weighing 2, 3, and 4 thousand lbs
new_weights <- data.frame(wt = c(2, 3, 4))
preds <- predict(model, newdata = new_weights, interval = "confidence")
cbind(new_weights, round(preds, 2))
##   wt   fit   lwr   upr
## 1  2 26.60 24.82 28.37
## 2  3 21.25 20.12 22.38
## 3  4 15.91 14.49 17.32

For example, a car weighing 3000 lbs is predicted to get approximately 21.25 mpg, with a 95% confidence interval for the mean of 20.12 to 22.38 mpg.
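Note that `interval = "confidence"` describes uncertainty about the average mpg of all cars at a given weight. For a single new car, a prediction interval is usually the more relevant quantity, and it is wider because it also includes the car-to-car noise. A sketch comparing the two at 3000 lbs, with the point prediction also worked out by hand:

```r
model   <- lm(mpg ~ wt, data = mtcars)
new_car <- data.frame(wt = 3)   # 3000 lbs, in the dataset's units of 1000 lbs

# Point prediction by hand: intercept + slope * weight
by_hand <- coef(model)[["(Intercept)"]] + coef(model)[["wt"]] * 3
by_hand

# Interval for the mean response vs interval for one new observation
conf_int <- predict(model, newdata = new_car, interval = "confidence")
pred_int <- predict(model, newdata = new_car, interval = "prediction")

rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```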

Summary

Simple Linear Regression gives us a powerful tool to:

  • Quantify relationships between variables
  • Predict outcomes from new input values
  • Understand how much one variable explains another

Key takeaways from our mtcars example:

  • Weight is a significant predictor of fuel efficiency
  • The model explains ~75% of the variance
  • Every extra 1000 lbs reduces fuel efficiency by ~5.34 mpg

Limitations: Linear regression assumes a straight-line relationship. For curved patterns, consider polynomial or nonlinear regression.
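As a quick illustration of the polynomial option, a quadratic term can be added with `I(wt^2)` and the two fits compared. This is a sketch of the idea, not a recommendation that the quadratic model is the right one for these data.

```r
linear_fit <- lm(mpg ~ wt, data = mtcars)
quad_fit   <- lm(mpg ~ wt + I(wt^2), data = mtcars)  # adds a squared weight term

# Adjusted R-squared penalises the extra parameter, so it is a fairer comparison
c(linear    = summary(linear_fit)$adj.r.squared,
  quadratic = summary(quad_fit)$adj.r.squared)

# F-test for whether the quadratic term improves on the straight line
anova(linear_fit, quad_fit)
```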