Multiple Linear Regression (Part 2)

M. Drew LaMar
February 11, 2019


https://xkcd.com/1725/

Introduction to Multiple Linear Regression

Definition: Multiple regression extends simple two-variable regression to the case that still has one response but many predictors (denoted \( x_1 \), \( x_2 \), \( x_3 \), …).

The method is motivated by scenarios where many variables may be simultaneously connected to an output.

Our Example

We're going to look at auction data for the Mario Kart Wii game.

The Data

'data.frame':   141 obs. of  5 variables:
 $ price      : num  51.5 37 45.5 44 71 ...
 $ cond       : Factor w/ 2 levels "used","new": 2 1 2 2 2 2 1 2 1 1 ...
 $ stock_photo: Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
 $ duration   : int  3 7 3 3 1 3 1 1 3 7 ...
 $ wheels     : int  1 1 1 1 2 0 0 2 1 1 ...

Price vs. condition

mario_kart %>% ggplot(aes(x = cond, y = price)) + geom_point(position = position_jitter(width = 0.1))

[Figure: jittered scatterplot of price by condition]

Price vs. condition

[Figure: price vs. condition]

Price vs. condition

mdl_cond <- lm(price ~ cond, data = mario_kart)  # fit price on condition only
summary(mdl_cond)

Call:
lm(formula = price ~ cond, data = mario_kart)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.8911  -5.8311   0.1289   4.1289  22.1489 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   42.871      0.814  52.668  < 2e-16 ***
condnew       10.900      1.258   8.662 1.06e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.371 on 139 degrees of freedom
Multiple R-squared:  0.3506,    Adjusted R-squared:  0.3459 
F-statistic: 75.03 on 1 and 139 DF,  p-value: 1.056e-14
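Reading the coefficients: with a single two-level factor as the predictor, the intercept is the mean price of the reference level (used) and the condnew estimate is the difference between the two group means. So the output above puts the average used price near 42.87 and the average new price near 42.871 + 10.900 = 53.771. The sketch below checks this relationship on a tiny made-up data set; the `toy` prices are hypothetical and are not taken from `mario_kart`.

```r
# Hypothetical toy data: two used and two new listings (NOT from mario_kart)
toy <- data.frame(
  price = c(40, 44, 50, 56),
  cond  = factor(c("used", "used", "new", "new"), levels = c("used", "new"))
)

toy_fit <- lm(price ~ cond, data = toy)

# Intercept = mean of the reference level ("used");
# condnew   = mean("new") - mean("used")
toy_coef  <- coef(toy_fit)
toy_means <- tapply(toy$price, toy$cond, mean)

print(toy_coef)
print(toy_means)
```

Here the used mean is 42 and the new mean is 53, so the fit returns an intercept of 42 and a condnew coefficient of 11.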

Multiple linear regression

A multiple regression model is a linear model with many predictors. In general, we write the model as \[ \hat{y} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{k}x_{k} \] when there are \( k \) predictors.

All predictors included

mdl_full <- lm(price ~ cond + stock_photo + duration + wheels, data = mario_kart)
summary(mdl_full)

Call:
lm(formula = price ~ cond + stock_photo + duration + wheels, 
    data = mario_kart)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.3788  -2.9854  -0.9654   2.6915  14.0346 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    36.21097    1.51401  23.917  < 2e-16 ***
condnew         5.13056    1.05112   4.881 2.91e-06 ***
stock_photoyes  1.08031    1.05682   1.022    0.308    
duration       -0.02681    0.19041  -0.141    0.888    
wheels          7.28518    0.55469  13.134  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.901 on 136 degrees of freedom
Multiple R-squared:  0.719, Adjusted R-squared:  0.7108 
F-statistic: 87.01 on 4 and 136 DF,  p-value: < 2.2e-16
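To see how the fitted equation is used, here is a minimal sketch that plugs one listing into the estimates printed above. The coefficients are copied from the summary output; the listing itself (new, stock photo, 3-day auction, 2 wheels) is a hypothetical example, and the names `b`, `x`, and `predicted_price` are illustrative.

```r
# Coefficient estimates copied from the summary(mdl_full) output
b <- c(intercept      = 36.21097,
       condnew        =  5.13056,
       stock_photoyes =  1.08031,
       duration       = -0.02681,
       wheels         =  7.28518)

# Hypothetical listing: 1 for the intercept, new (1), stock photo (1),
# 3-day auction, 2 wheels
x <- c(1, 1, 1, 3, 2)

# y-hat = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4
predicted_price <- sum(b * x)
print(round(predicted_price, 2))
```

This gives a predicted price of about 56.91. With the fitted model object available, `predict(mdl_full, newdata = ...)` would do the same calculation.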

How good is the fit?

For only one predictor variable:

\[ R^2 = 1- \frac{\textrm{Variability in residuals}}{\textrm{Variability in outcome}} \]

For multiple predictor variables, we use the adjusted \( R^2 \), where \( n \) is the number of observations and \( k \) is the number of predictors:

\[ \hat{R}^2 = 1- \frac{\textrm{Variability in residuals}}{\textrm{Variability in outcome}} \times \frac{n-1}{n-k-1} \]
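As a quick check of the formula, the sketch below recomputes the adjusted value from the (rounded) Multiple R-squared values printed in the two summaries, using \( n = 141 \) observations. The helper name `adjusted_r2` is illustrative, not part of base R.

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors k.
# (1 - r2) is the ratio "variability in residuals / variability in outcome".
adjusted_r2 <- function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

adj_cond <- adjusted_r2(0.3506, n = 141, k = 1)  # price ~ cond
adj_full <- adjusted_r2(0.719,  n = 141, k = 4)  # all four predictors

print(round(c(cond_only = adj_cond, full = adj_full), 3))
```

The results (about 0.346 and 0.711) agree with the reported 0.3459 and 0.7108 up to the rounding of the inputs.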

The adjusted \( R^2 \) value is important for model selection.

Discuss: What is model selection and why is it important?

Model Selection

Four techniques:

  • Backward elimination: start from the full model and remove, one at a time, the variable…
    • … whose removal gives the largest increase in \( \hat{R}^2 \) (if no removal increases it, stop), or
    • … with the largest \( P \)-value greater than 0.05 (if no \( P \)-values are greater than 0.05, stop)
  • Forward selection: start with no predictors and add, one at a time, the variable…
    • … giving the largest improvement in \( \hat{R}^2 \) (if no improvement is possible, stop), or
    • … with the smallest \( P \)-value below the significance level (if no variable is significant, stop)
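The adjusted \( R^2 \) version of backward elimination can be sketched as follows. This is a minimal illustration on simulated data (the data frame `sim`, its columns, and the helpers `adj_r2` and `backward_adj_r2` are all hypothetical), not the procedure applied to the Mario Kart data.

```r
# Simulated stand-in data: y depends on x1 and x2; x3 is pure noise
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 2 + 3 * sim$x1 - 2 * sim$x2 + rnorm(n)

# Adjusted R^2 of the model y ~ <predictors> (intercept-only if none are left)
adj_r2 <- function(predictors, data) {
  terms <- if (length(predictors) > 0) predictors else "1"
  fit <- lm(reformulate(terms, response = "y"), data = data)
  summary(fit)$adj.r.squared
}

# Drop the variable whose removal raises adjusted R^2 the most;
# stop as soon as no removal gives an increase.
backward_adj_r2 <- function(predictors, data) {
  current <- adj_r2(predictors, data)
  while (length(predictors) > 0) {
    candidates <- sapply(predictors,
                         function(p) adj_r2(setdiff(predictors, p), data))
    if (max(candidates) <= current) break
    predictors <- setdiff(predictors, names(which.max(candidates)))
    current <- max(candidates)
  }
  predictors
}

kept <- backward_adj_r2(c("x1", "x2", "x3"), sim)
print(kept)
```

Forward selection follows the same pattern in reverse: start from an empty predictor set and add the candidate that raises adjusted \( R^2 \) the most.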

Model Selection

Note: There is no guarantee that forward and backward selection will lead to the same final model. If they differ, choose the model with the higher adjusted \( R^2 \) value.

Note: When the sole goal is to improve prediction accuracy, use the adjusted \( R^2 \) criterion. This is commonly the case in machine learning applications.

Note: When we care about understanding which variables are statistically significant predictors of the response, or if there is interest in producing a simpler model at the potential cost of a little prediction accuracy, then the \( P \)-value approach is preferred.
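For comparison, here is a minimal sketch of backward elimination using the \( P \)-value criterion. It again runs on simulated data; `sim`, `backward_p`, and the cutoff argument `alpha` are hypothetical names, and the sketch assumes numeric predictors so that each predictor has exactly one coefficient row.

```r
# Simulated stand-in data: y depends on x1 and x2; x3 is pure noise
set.seed(2)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + 1.5 * sim$x2 + rnorm(n)

# Repeatedly drop the predictor with the largest P-value above alpha;
# stop when every remaining P-value is at or below alpha.
backward_p <- function(predictors, data, alpha = 0.05) {
  while (length(predictors) > 0) {
    fit <- lm(reformulate(predictors, response = "y"), data = data)
    # P-values for the slopes only (row 1 is the intercept)
    p <- summary(fit)$coefficients[-1, "Pr(>|t|)"]
    names(p) <- predictors  # one row per numeric predictor, in formula order
    if (max(p) <= alpha) break
    predictors <- setdiff(predictors, names(which.max(p)))
  }
  predictors
}

kept_p <- backward_p(c("x1", "x2", "x3"), sim)
print(kept_p)
```

Because each step refits the model, a variable's \( P \)-value can change after another variable is removed, which is why variables are dropped one at a time.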

Model Validation and Diagnostics

Okay, I've got a best-fit model (i.e., “Good squeezy squeezy”) BUT…

ARE MY ASSUMPTIONS MET? (i.e. is it the right fruit?!?!)