Simple Linear Regression: From Model to Inference

Agenda

What simple linear regression is
Model assumptions
How we find the line (OLS)
Visualizing fit & residuals
Testing if the slope matters
Key takeaways

The Model (Math)

We assume a straight-line relationship between a predictor \(x\) and a response \(y\): \[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

Where:
\(\beta_0\): intercept (expected value of y when x = 0)
\(\beta_1\): slope (expected change in y for a one-unit increase in x)
\(\varepsilon_i\): random error (variation in y not explained by x)

The best-fit line minimizes the sum of squared differences between actual and predicted values. This is ordinary least squares (OLS).
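
For a single predictor, the minimizing values have a closed form: \(\hat\beta_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\) and \(\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}\). The sketch below computes them by hand on R's built-in mtcars data (mpg as response, wt as predictor). The variable names x, y, b0, and b1 are illustrative choices, not part of the original slides.

```r
# Closed-form OLS estimates for simple linear regression on mtcars
x <- mtcars$wt
y <- mtcars$mpg

# Slope: sum of cross-deviations divided by sum of squared deviations of x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: forces the line through the point (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)

# lm() should give the same values up to floating-point error
coef(lm(mpg ~ wt, data = mtcars))
```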

Model Assumptions

For linear regression to be valid, we assume:

Linearity – the mean of y is a straight-line function of x.
Independence – each observation is independent of the others.
Equal variance – the errors have the same spread at every value of x.
Normality – the errors are roughly normally distributed.
No big outliers – no extreme points pulling the line.

If these don’t hold, the estimates, standard errors, and p-values may be unreliable.

Fitting the Model in R

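library(tidyverse)  # attaches dplyr (%>% and select) and ggplot2

data(mtcars)
# Keep only the response (mpg) and the predictor (wt, weight in 1000 lbs)
df <- mtcars %>% select(mpg, wt)
# Fit mpg = beta_0 + beta_1 * wt by ordinary least squares
fit <- lm(mpg ~ wt, data = df)
summary(fit)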

## 
## Call:
## lm(formula = mpg ~ wt, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
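
One way to read the coefficient table is to turn it into a prediction. The sketch below predicts mpg for a hypothetical car with wt = 3 (3,000 lbs), which should come out at roughly 37.29 - 5.34 × 3 ≈ 21.25 mpg. The names new_car, pred, and by_hand are illustrative, and the model is refit on mtcars so the block runs on its own.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Hypothetical new observation: a car weighing 3,000 lbs (wt is in 1000 lbs)
new_car <- data.frame(wt = 3)

# Point prediction plus a 95% confidence interval for the mean mpg at wt = 3
pred <- predict(fit, newdata = new_car, interval = "confidence", level = 0.95)
pred

# Same point prediction from the coefficients: intercept + slope * wt
by_hand <- unname(coef(fit)[1] + coef(fit)[2] * 3)
by_hand
```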

The Regression Line
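
A minimal sketch of how the fitted line could be drawn over the data, assuming ggplot2 is installed. The axis labels and the object name p are illustrative choices; geom_smooth(method = "lm") refits the same least-squares line internally.

```r
library(ggplot2)

# Scatter plot of the raw data with the least-squares line overlaid
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
print(p)
```

Setting se = TRUE instead would add a confidence band around the line.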

How Good is the Fit? (R²)

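R² is the proportion of the variation in y that is explained by x: \[ R^2 = 1 - \frac{SSE}{SST} \]
Here SSE is the sum of squared residuals, \(\sum (y_i - \hat{y}_i)^2\), and SST is the total sum of squares, \(\sum (y_i - \bar{y})^2\).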

\(R^2 = 0\): line explains nothing.
\(R^2 = 1\): perfect fit.

Here, R² ≈ 0.75: about 75% of the variation in MPG is explained by car weight.
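
The formula can be checked by hand. The sketch below refits the model on mtcars so it runs on its own, computes SSE and SST directly, and compares the result with the value reported by summary(). The names sse, sst, and r2 are illustrative.

```r
fit <- lm(mpg ~ wt, data = mtcars)
y <- mtcars$mpg

sse <- sum(resid(fit)^2)        # variation left unexplained by the line
sst <- sum((y - mean(y))^2)     # total variation of y around its mean
r2 <- 1 - sse / sst
r2

# Same quantity as reported by summary()
summary(fit)$r.squared

# With one predictor, R-squared equals the squared correlation of x and y
cor(mtcars$wt, mtcars$mpg)^2
```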

Checking Residuals
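
A minimal sketch of two common residual plots, using base R graphics and refitting the model on mtcars so the block runs on its own. The two-panel layout is an illustrative choice.

```r
fit <- lm(mpg ~ wt, data = mtcars)

op <- par(mfrow = c(1, 2))  # two panels side by side

# Residuals vs fitted values: look for curvature or a funnel shape
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line suggest roughly normal residuals
qqnorm(resid(fit))
qqline(resid(fit))

par(op)  # restore the previous plotting layout
```

Calling plot(fit) produces R's built-in set of diagnostic plots for the same model.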

Is the Slope Significant?

We test whether \(\beta_1 = 0\): \[ H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0 \]

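If the p-value is below 0.05, we reject \(H_0\) at the 5% level and call the slope statistically significant. Here the p-value is about 1.29e-10, so the data give strong evidence of a linear association between weight and MPG. This does not by itself show that weight causes the change in MPG.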
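
The test statistic is the slope estimate divided by its standard error, compared against a t distribution with n - 2 degrees of freedom. The sketch below reproduces the t value and p-value from the summary table by hand. It refits the model on mtcars so it runs on its own, and the names t_stat and p_value are illustrative.

```r
fit <- lm(mpg ~ wt, data = mtcars)
coefs <- summary(fit)$coefficients

b1 <- coefs["wt", "Estimate"]
se_b1 <- coefs["wt", "Std. Error"]

# t statistic for H0: beta_1 = 0
t_stat <- b1 / se_b1
df <- df.residual(fit)  # n - 2 = 30 for mtcars

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
c(t = t_stat, p = p_value)

# 95% confidence interval for the slope
confint(fit, "wt", level = 0.95)
```

If the confidence interval excludes 0, the two-sided test at the 5% level rejects \(H_0\) as well.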

Quick Checks

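Linearity: do the residuals scatter randomly around zero, with no curve?
Equal variance: is the spread of the residuals even across fitted values?
Normality: are the residuals roughly bell-shaped?
Outliers: are there any extreme points?

These quick checks help confirm whether the model assumptions are reasonable.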

Takeaways

Regression finds the least-squares line for predicting y from x.
The slope is the expected change in y for a one-unit increase in x.
R² is the share of the variation in y that the model explains.
Always check residuals to make sure the model assumptions are reasonable.