2026-03-05

Introduction

Linear Regression is a statistical method that we can use to answer questions like:

  • How does Y change when X changes?
  • How can we predict Y based on X?

Simple linear regression uses a single independent variable. In other words, it takes the form:

\(y = mx+b\)

Alternatively, writing the intercept as \(\beta_0\) and the slope as \(\beta_1\):

\(y = \beta_0 + \beta_1x\)

Example 1

library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth(), labs()
ggplot(cars, aes(speed, dist)) + geom_point() + geom_smooth(method = lm) +
  labs(x = 'Speed (mph)', y = 'Stopping Distance (ft)',
       title = 'Stopping distance vs. Speed')
## `geom_smooth()` using formula = 'y ~ x'
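
The line drawn by geom_smooth(method = lm) is the one that lm() fits to the same data. As a minimal sketch, assuming we want the intercept and slope as numbers, we can fit the model directly and read off \(\beta_0\) and \(\beta_1\). The variable names and the speed of 15 mph are only illustrative choices.

```r
# Fit dist = beta_0 + beta_1 * speed on the built-in cars data set.
fit <- lm(dist ~ speed, data = cars)

beta_0 <- coef(fit)[["(Intercept)"]]  # intercept
beta_1 <- coef(fit)[["speed"]]        # slope

# Estimated stopping distance at 15 mph (an arbitrary illustrative speed),
# computed by hand and with predict() so the two can be compared.
manual_pred <- beta_0 + beta_1 * 15
builtin_pred <- predict(fit, newdata = data.frame(speed = 15))
print(c(manual = manual_pred, predict = unname(builtin_pred)))
```

Both numbers should agree, since predict() evaluates the same equation.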

How do we decide which line is the best fit?

Linear regression chooses the line that minimizes the error between the actual and the estimated y values. For a single point, that error (the residual) is:

\(y_{actual} - y_{estimated}\)

However, adding up these raw differences penalizes every error only in proportion to its size, and positive and negative errors can cancel each other out. We want to penalize an estimate more heavily the further it is from the actual y value.

So, linear regression finds the line that minimizes the total squared error, a method called Ordinary Least Squares (OLS):

\(\sum (y_{actual} - y_{estimated})^2\)
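
To make the sum above concrete, here is a small sketch on the cars data. The helper sse() is a hypothetical function written for this illustration, not part of R. It computes the total squared error for any candidate line, so we can check that the OLS line scores lower than a slightly shifted one, and that the standard closed-form solution for one predictor matches lm().

```r
# Total squared error of the line y = b0 + b1 * x on the cars data.
# sse() is a helper defined here for illustration only.
sse <- function(b0, b1) {
  sum((cars$dist - (b0 + b1 * cars$speed))^2)
}

fit <- lm(dist ~ speed, data = cars)
b <- unname(coef(fit))  # b[1] = intercept, b[2] = slope

sse_ols   <- sse(b[1], b[2])      # error of the fitted line
sse_other <- sse(b[1] + 1, b[2])  # same slope, intercept shifted by 1

# Closed-form OLS estimates for a single predictor.
slope_cf     <- cov(cars$speed, cars$dist) / var(cars$speed)
intercept_cf <- mean(cars$dist) - slope_cf * mean(cars$speed)

print(c(ols = sse_ols, shifted = sse_other))
print(all.equal(c(intercept_cf, slope_cf), b))
```

Any other choice of intercept and slope would give a total squared error at least as large as sse_ols.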

Interpreting Significance/Accuracy

So, sure. Linear regression will find the line that minimizes the total squared error, but how do we know how accurate or how statistically significant that line is?

We can calculate a p-value for the slope of the regression line. If p < 0.05, we can reject the null hypothesis (that the slope is zero, meaning there is no linear relationship between x and y) and say that the relationship is statistically significant. Significance alone does not tell us how accurately the line predicts y; measures such as R-squared and the residual standard error describe that.
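
As a sketch of where that p-value comes from, the slope's p-value in R's output is a two-sided t-test of the estimate against zero. The example below uses the cars data again and recomputes the value by hand. The variable names are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)
coef_table <- coef(summary(fit))  # matrix of estimates, std. errors, t, p

est <- coef_table["speed", "Estimate"]
se  <- coef_table["speed", "Std. Error"]

# t statistic for H0: slope = 0, then the two-sided p-value
# using the residual degrees of freedom of the model.
t_stat   <- est / se
p_manual <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

p_table <- coef_table["speed", "Pr(>|t|)"]
print(all.equal(p_manual, p_table))
```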

Example 2

## `geom_smooth()` using formula = 'y ~ x'

Example 2 Expanded

mod = lm(airquality$Wind ~ airquality$Temp)
summary(mod)
## 
## Call:
## lm(formula = airquality$Wind ~ airquality$Temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5784 -2.4489 -0.2261  1.9853  9.7398 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     23.23369    2.11239  10.999  < 2e-16 ***
## airquality$Temp -0.17046    0.02693  -6.331 2.64e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.142 on 151 degrees of freedom
## Multiple R-squared:  0.2098, Adjusted R-squared:  0.2045 
## F-statistic: 40.08 on 1 and 151 DF,  p-value: 2.642e-09

Using R's built-in summary function, we can obtain a p-value for the slope, which here is 2.64e-09, far below 0.05. This means the slope in the previous slide is unlikely to be due to random noise alone and is statistically significant.
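
Rather than reading the numbers off the printed summary, we can pull them out of the summary object. The sketch below refits the same model with the data argument, an equivalent and more common way to write it, so the coefficient row is named Temp instead of airquality$Temp. The variable names are illustrative.

```r
# Same model as above, written with a formula and a data argument.
mod <- lm(Wind ~ Temp, data = airquality)
s <- summary(mod)

slope_p   <- s$coefficients["Temp", "Pr(>|t|)"]  # p-value for the slope
r_squared <- s$r.squared                         # share of variance explained

# 95% confidence interval for the slope.
slope_ci <- confint(mod, "Temp", level = 0.95)

print(slope_p)
print(r_squared)
print(slope_ci)
```

The R-squared of about 0.21 in the output above is a reminder that a very small p-value does not by itself mean the line predicts Wind closely: Temp explains roughly a fifth of the variation in Wind.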

Final Example - Interactive Plot