Why Regression?

  • Predict a numeric outcome and quantify how it relates to one or more predictors.
  • We’ll use the built-in mtcars dataset.
  • Topics covered today:
    • Simple Linear Regression (SLR)
    • Multiple Linear Regression (MLR)
    • Inference for coefficients
    • Diagnostics
    • An interactive 3D view of the data

The Data (mtcars)

  • Outcome: mpg (miles per gallon)
  • Predictors we’ll use: wt (weight, 1000 lbs), hp (horsepower)
  • Quick peek:
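library(dplyr)   # provides %>% and select()
library(knitr)   # provides kable()

# Keep the outcome and the two predictors, then show the first six rows
mtcars %>% 
  dplyr::select(mpg, wt, hp) %>% 
  head() %>% 
  kable()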
Model                mpg     wt   hp
Mazda RX4           21.0  2.620  110
Mazda RX4 Wag       21.0  2.875  110
Datsun 710          22.8  2.320   93
Hornet 4 Drive      21.4  3.215  110
Hornet Sportabout   18.7  3.440  175
Valiant             18.1  3.460  105

Model (Math)

We begin with the SLR model: \[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \varepsilon_i \sim \text{i.i.d. } N(0, \sigma^2). \]

The OLS estimates minimize the residual sum of squares (RSS): \[ \text{RSS}(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2. \]
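
As a quick check of the RSS criterion, the sketch below computes the SLR estimates for mpg ~ wt from the standard closed-form solution (slope = S_xy / S_xx, intercept = ȳ − slope · x̄) and compares them with `lm()`. The helper `rss()` and the variable names are illustrative choices, not part of the original slides.

```r
# Closed-form OLS estimates for mpg ~ wt, using base R only.
x <- mtcars$wt
y <- mtcars$mpg

# Slope: sum of cross-products over sum of squares of x.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: the fitted line passes through (mean(x), mean(y)).
b0 <- mean(y) - b1 * mean(x)

# RSS as a function of candidate coefficients (illustrative helper).
rss <- function(beta0, beta1) sum((y - beta0 - beta1 * x)^2)

c(intercept = b0, slope = b1)
rss(b0, b1)        # RSS at the OLS estimates
rss(b0, b1 + 0.1)  # a perturbed slope gives a larger RSS

# Cross-check against lm()
coef(lm(mpg ~ wt, data = mtcars))
```

The hand-computed values should agree with the `lm()` coefficients up to floating-point error.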

Inference for Slope (Math)

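To test \(H_0: \beta_1 = 0\) vs \(H_a: \beta_1 \neq 0\), we use \[ t = \frac{\hat{\beta}_1}{\text{SE}(\hat{\beta}_1)} \sim t_{n-2} \quad \text{under } H_0, \] and a \((1-\alpha)\times 100\%\) CI is \[ \hat{\beta}_1 \pm t_{\alpha/2,\, n-2}\, \text{SE}(\hat{\beta}_1), \] where \(t_{\alpha/2,\, n-2}\) is the upper \(\alpha/2\) quantile of the \(t_{n-2}\) distribution.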
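
The test statistic and interval can be reproduced by hand for the mpg ~ wt fit. This is a minimal sketch using base R; the object names (`fit`, `t_stat`, `ci`) and the choice α = 0.05 are assumptions made for illustration.

```r
# Slope inference for mpg ~ wt, computed step by step.
fit <- lm(mpg ~ wt, data = mtcars)
est <- coef(summary(fit))["wt", "Estimate"]
se  <- coef(summary(fit))["wt", "Std. Error"]
df  <- df.residual(fit)                      # n - 2

# Two-sided t test of H0: beta_1 = 0
t_stat <- est / se
p_val  <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# (1 - alpha) * 100% confidence interval
alpha  <- 0.05                               # assumed level
t_crit <- qt(1 - alpha / 2, df = df)
ci     <- est + c(-1, 1) * t_crit * se

c(t = t_stat, p = p_val)
ci
confint(fit, "wt", level = 1 - alpha)        # built-in equivalent
```

The manual interval should match `confint()`, and `t_stat` should match the `t value` column that `summary()` reports for `wt`.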

SLR Example: mpg ~ wt (ggplot)

## `geom_smooth()` using formula = 'y ~ x'
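
The plotting chunk for this slide is not shown. A plausible reconstruction, assuming the ggplot2 package is installed, is a scatter plot of mpg against wt with an OLS line from `geom_smooth(method = "lm")`; the axis labels, title, and object name `p_slr` are illustrative.

```r
# Sketch of an mpg vs. wt scatter plot with a fitted OLS line.
# Guarded so the script still runs where ggplot2 is unavailable.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_slr <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    geom_smooth(method = "lm", se = TRUE) +   # line plus 95% confidence band
    labs(x = "Weight (1000 lbs)",
         y = "Miles per gallon",
         title = "mpg vs. wt with fitted regression line")
  print(p_slr)
}
```

Because no `formula` argument is passed, `geom_smooth()` reports that it is using `y ~ x`, which is the message printed above.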

Residual Diagnostics (ggplot)
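
The diagnostic plots themselves are not reproduced here. As a hedged sketch, two common checks for the mpg ~ wt fit are residuals against fitted values (for linearity and constant variance) and a normal Q-Q plot (for normality of errors). The names `slr_fit`, `diag_df`, `p_rvf`, and `p_qq` are hypothetical, and ggplot2 is assumed to be installed.

```r
# Residual diagnostics for the simple regression mpg ~ wt.
slr_fit <- lm(mpg ~ wt, data = mtcars)
diag_df <- data.frame(fitted = fitted(slr_fit),
                      resid  = resid(slr_fit))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Residuals vs. fitted: look for curvature or a funnel shape.
  p_rvf <- ggplot(diag_df, aes(x = fitted, y = resid)) +
    geom_point() +
    geom_hline(yintercept = 0, linetype = "dashed") +
    labs(x = "Fitted values", y = "Residuals")

  # Normal Q-Q plot: points near the line support the normality assumption.
  p_qq <- ggplot(diag_df, aes(sample = resid)) +
    stat_qq() +
    stat_qq_line() +
    labs(x = "Theoretical quantiles", y = "Sample quantiles")

  print(p_rvf)
  print(p_qq)
}
```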

Multiple Regression + 3D View (plotly)

We add hp to build a multiple regression: \[ \text{mpg} = \beta_0 + \beta_1\,\text{wt} + \beta_2\,\text{hp} + \varepsilon. \]
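
A minimal sketch of fitting this model and viewing the three variables together follows. The 3D scatter uses `plotly::plot_ly()` with `type = "scatter3d"` and assumes the plotly package is installed; the object names `mlr_fit` and `p3d` and the axis titles are illustrative, and the original slide may have drawn the view differently.

```r
# Multiple regression of mpg on weight and horsepower.
mlr_fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(mlr_fit)

# Interactive 3D scatter of the raw data (skipped if plotly is missing).
if (requireNamespace("plotly", quietly = TRUE)) {
  p3d <- plotly::plot_ly(mtcars, x = ~wt, y = ~hp, z = ~mpg,
                         type = "scatter3d", mode = "markers")
  p3d <- plotly::layout(p3d, scene = list(
    xaxis = list(title = "wt (1000 lbs)"),
    yaxis = list(title = "hp"),
    zaxis = list(title = "mpg")
  ))
  p3d
}
```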

R Code (reproducibility)

slr <- lm(mpg ~ wt, data = mtcars)   # simple linear regression
summary(slr)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
mlr <- lm(mpg ~ wt + hp, data = mtcars)   # multiple linear regression
summary(mlr)
## 
## Call:
## lm(formula = mpg ~ wt + hp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
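
One way to compare the two fits directly is a nested-model F test with `anova()`, alongside the adjusted R-squared values. This is an illustrative sketch rather than a step from the original slides; `slr_fit`, `mlr_fit`, and `cmp` are hypothetical names.

```r
# Does adding hp improve on the weight-only model?
slr_fit <- lm(mpg ~ wt, data = mtcars)
mlr_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Nested F test: H0 is that the hp coefficient equals zero.
cmp <- anova(slr_fit, mlr_fit)
cmp

# Adjusted R-squared for each model
c(slr = summary(slr_fit)$adj.r.squared,
  mlr = summary(mlr_fit)$adj.r.squared)
```

With a single added predictor, the F statistic equals the square of the t value for hp, so its p-value matches the one in the hp row of the coefficient table.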

Takeaways

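  • Heavier cars tend to have lower mpg: each additional 1000 lbs is associated with about 5.3 fewer mpg in the SLR, and about 3.9 fewer with hp held fixed.
  • Adding hp explains additional variation in mpg: adjusted R-squared rises from 0.745 to 0.815.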
  • Always check assumptions using residual diagnostics.
  • Share your interactive slides on RPubs!