Agenda

  • What simple linear regression is
  • Model assumptions
  • How we find the line (OLS)
  • Visualizing fit & residuals
  • Testing if the slope matters
  • Key takeaways

The Model (Math)

We assume a straight-line relationship between a predictor \(x\) and a response \(y\): \[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

Where:

  • \(\beta_0\): intercept (the value of y when x = 0)
  • \(\beta_1\): slope (the change in y per one-unit increase in x)
  • \(\varepsilon_i\): random error (variation not explained by the line)

The best-fit line is the one whose intercept and slope minimize the sum of squared differences between the observed and predicted values of y; this is ordinary least squares (OLS).
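As a sanity check, the OLS estimates can be computed directly from sample moments. A minimal sketch using the mtcars data that is fitted later in these notes (the slope is cov(x, y) / var(x), the intercept follows from the means):

x <- mtcars$wt                   # predictor: car weight (1000 lbs)
y <- mtcars$mpg                  # response: fuel economy (mpg)

b1 <- cov(x, y) / var(x)         # OLS slope
b0 <- mean(y) - b1 * mean(x)     # OLS intercept
c(intercept = b0, slope = b1)    # matches coef(lm(mpg ~ wt)) shown below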

Model Assumptions

For linear regression to be valid, we assume:

  1. Linearity – relationship between x and y is straight.
  2. Independence – each data point is independent.
  3. Equal variance – residuals have similar spread.
  4. Normality – residuals are roughly normal.
  5. No influential outliers – no extreme points unduly pulling the fitted line.

If these assumptions are violated, the estimates, standard errors, and p-values may be unreliable.

Fitting the Model in R

library(tidyverse)                 # loads dplyr and ggplot2, used below

data(mtcars)                       # built-in dataset of 32 car models
df <- mtcars %>% select(mpg, wt)   # keep fuel economy (mpg) and weight (wt, 1000 lbs)
fit <- lm(mpg ~ wt, data = df)     # fit the simple linear regression by OLS
summary(fit)                       # coefficients, tests, and R-squared
## 
## Call:
## lm(formula = mpg ~ wt, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
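Once the model is fitted, predictions follow directly from predict(). A quick usage sketch; the value wt = 3 (a 3,000 lb car) is just an illustrative choice, not from the slides:

predict(fit, newdata = data.frame(wt = 3))                            # point prediction, about 21.3 mpg
predict(fit, newdata = data.frame(wt = 3), interval = "confidence")   # with a 95% CI for the mean response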

The Regression Line
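The fitted line can be overlaid on a scatter plot of the data. A minimal ggplot2 sketch of how such a plot could be produced (the exact styling of the original figure may differ):

ggplot(df, aes(x = wt, y = mpg)) +
  geom_point() +                              # observed data
  geom_smooth(method = "lm", se = FALSE) +    # OLS fit: mpg ~ wt
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")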

How Good is the Fit? (R²)

The R² value gives the proportion of the variation in y that is explained by the model: \[ R^2 = 1 - \frac{SSE}{SST} \] where SSE is the sum of squared residuals and SST is the total sum of squares of y about its mean.

  • \(R^2 = 0\): line explains nothing.
  • \(R^2 = 1\): perfect fit.

Here, R² ≈ 0.75, so about 75% of the variation in MPG is explained by car weight.
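To confirm the value from the summary output, R² can be extracted directly or recomputed from its definition (a quick sketch):

summary(fit)$r.squared                    # about 0.753

sse <- sum(residuals(fit)^2)              # sum of squared residuals
sst <- sum((df$mpg - mean(df$mpg))^2)     # total sum of squares
1 - sse / sst                             # same value as above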

Checking Residuals
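The residual diagnostics on this slide can be reproduced with the built-in plotting methods for lm objects; a minimal sketch (assumed layout, not necessarily the original figure):

par(mfrow = c(1, 2))
plot(fit, which = 1)   # residuals vs fitted: check linearity and equal variance
plot(fit, which = 2)   # normal Q-Q plot: check normality of residuals
par(mfrow = c(1, 1))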

Is the Slope Significant?

We test whether \(\beta_1 = 0\): \[ H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0 \]

If the p-value is below 0.05, the slope is statistically significant: there is evidence of a real linear relationship between weight and MPG rather than chance variation. Here the p-value for the slope is about 1.3e-10, far below 0.05.
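The test statistic and p-value come straight from the coefficient table; a small sketch showing how to pull them out, along with a confidence interval for the slope:

coef(summary(fit))["wt", ]   # estimate, std. error, t value, Pr(>|t|)
confint(fit, "wt")           # 95% CI for the slope; it excludes 0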

Quick Checks

  • Linearity: residuals look random?
  • Equal variance: spread looks even?
  • Normality: residuals roughly bell-shaped?
  • Outliers: any extreme points?

These quick checks help confirm whether the model is adequate; a few numeric versions are sketched below.
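Beyond eyeballing the plots, a few quick numeric checks can flag problems. A sketch; the cutoffs used here (|standardized residual| > 2, Cook's distance > 4/n) are common rules of thumb rather than hard thresholds:

shapiro.test(residuals(fit))                  # formal check of residual normality
which(abs(rstandard(fit)) > 2)                # unusually large standardized residuals
which(cooks.distance(fit) > 4 / nrow(df))     # potentially influential points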

Takeaways

  • Regression finds the best-fit line for predicting y from x.
  • The slope shows how much y changes when x changes.
  • R² shows how well the model explains the data.
  • Always check residuals to make sure the model makes sense.