Multiple Linear Regression (Part 2)

M. Drew LaMar
February 12, 2020

Introduction to Multiple Linear Regression

Definition: Multiple regression extends simple two-variable regression to the case that still has one response but many predictors (denoted \( x_1 \), \( x_2 \), \( x_3 \), …).

The method is motivated by scenarios where many variables may be simultaneously connected to an output.

Our Example

We're going to look at auction data for the Mario Kart Wii game.

The Data

'data.frame':   141 obs. of  5 variables:
 $ price      : num  51.5 37 45.5 44 71 ...
 $ cond       : Factor w/ 2 levels "used","new": 2 1 2 2 2 2 1 2 1 1 ...
 $ stock_photo: Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
 $ duration   : int  3 7 3 3 1 3 1 1 3 7 ...
 $ wheels     : int  1 1 1 1 2 0 0 2 1 1 ...

Price vs. condition

plot of chunk unnamed-chunk-2

Price vs. condition

summary(mdl_cond)


Call:
lm(formula = price ~ cond, data = mario_kart)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.8911  -5.8311   0.1289   4.1289  22.1489 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   42.871      0.814  52.668  < 2e-16 ***
condnew       10.900      1.258   8.662 1.06e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.371 on 139 degrees of freedom
Multiple R-squared:  0.3506,    Adjusted R-squared:  0.3459 
F-statistic: 75.03 on 1 and 139 DF,  p-value: 1.056e-14

Multiple linear regression

A multiple regression model is a linear model with many predictors. In general, we write the model as \[ \hat{y} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{k}x_{k} \] when there are \( k \) predictors.

All predictors included

mdl_full <- lm(price ~ cond + stock_photo + duration + wheels, data = mario_kart)
summary(mdl_full)


Call:
lm(formula = price ~ cond + stock_photo + duration + wheels, 
    data = mario_kart)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.3788  -2.9854  -0.9654   2.6915  14.0346 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    36.21097    1.51401  23.917  < 2e-16 ***
condnew         5.13056    1.05112   4.881 2.91e-06 ***
stock_photoyes  1.08031    1.05682   1.022    0.308    
duration       -0.02681    0.19041  -0.141    0.888    
wheels          7.28518    0.55469  13.134  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.901 on 136 degrees of freedom
Multiple R-squared:  0.719, Adjusted R-squared:  0.7108 
F-statistic: 87.01 on 4 and 136 DF,  p-value: < 2.2e-16

How good is the fit?

For only one predictor variable:

\[ R^2 = 1- \frac{\textrm{Variability in residuals}}{\textrm{Variability in outcome}} \]

For multiple predictor variables, we use the adjusted R²:

\[ \hat{R}^2 = 1- \frac{\textrm{Variability in residuals}}{\textrm{Variability in outcome}} \times \frac{n-1}{n-k-1} \]

The adjusted R² value is important for model selection.

Discuss: What is model selection and why is it important?

Model Selection

Four techniques:

Backward elimination, eliminating variables with…
- … largest increase in \( \hat{R}^2 \) (if no increase, stop)
- … largest \( P \)-value greater than 0.05 (if no \( P \)-values greater than 0.05, stop)
Forward selection by adding variables with…
- … largest improvement in \( \hat{R}^2 \) (if no improvement possible, stop), or
- … smallest \( P \)-values less than significance level (if no variables significant, stop)

Model Selection

Note: No guarantee that forward and backward selection will lead to same final model. In this case, choose model with highest adjusted \( R^2 \) value.

Note: When the sole goal is to improve prediction accuracy, use adjusted \( R^2 \) technique. This is commonly the case in machine learning applications.

Note: When we care about understanding which variables are statistically significant predictors of the response, or if there is interest in producing a simpler model at the potential cost of a little prediction accuracy, then the \( P \)-value approach is preferred.

Finding the winning model (Step 1)

Let's use backward elimination with the adjusted \( R^2 \) method:

full <- lm(price ~ cond + stock_photo + wheels + duration, data = mario_kart)
summary(full)


Call:
lm(formula = price ~ cond + stock_photo + wheels + duration, 
    data = mario_kart)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.3788  -2.9854  -0.9654   2.6915  14.0346 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    36.21097    1.51401  23.917  < 2e-16 ***
condnew         5.13056    1.05112   4.881 2.91e-06 ***
stock_photoyes  1.08031    1.05682   1.022    0.308    
wheels          7.28518    0.55469  13.134  < 2e-16 ***
duration       -0.02681    0.19041  -0.141    0.888    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.901 on 136 degrees of freedom
Multiple R-squared:  0.719, Adjusted R-squared:  0.7108 
F-statistic: 87.01 on 4 and 136 DF,  p-value: < 2.2e-16

Finding the winning model (Step 2)

Let's use backward elimination with the adjusted \( R^2 \) method:

First, compute models eliminating each variable individually

full_cond <- lm(price ~ stock_photo + wheels + duration, data = mario_kart)
full_stock_photo <- lm(price ~ cond + wheels + duration, data = mario_kart)
full_wheels <- lm(price ~ cond + stock_photo + duration, data = mario_kart)
full_duration <- lm(price ~ cond + stock_photo + wheels, data = mario_kart)

Second, look at the adjusted \( R^2 \) values for each

Full - cond: 0.66257465806698
Full - stock_photo: 0.71066728367746
Full - wheels: 0.348698623692822
Full - duration: 0.712831545419501

Are any of them larger than \( R^2 \) = 0.7107622 for the full model?

Finding the winning model (Step 3)

Let's use backward elimination with the adjusted \( R^2 \) method:

“Full - duration” wins, so now compute models eliminating variables from that one

full_duration_cond <- lm(price ~ stock_photo + wheels, data = mario_kart)
full_duration_stock_photo <- lm(price ~ cond + wheels, data = mario_kart)
full_duration_wheels <- lm(price ~ cond + stock_photo, data = mario_kart)

Second, look at the adjusted \( R^2 \) values for each

Full - duration - cond: 0.658720517517627
Full - duration - stock_photo: 0.712409794516323
Full - duration - wheels: 0.341432692541391

Are any of them larger than \( R^2 \) = 0.7128315 for the full model?

Winning model

Using backward elimination with the adjusted \( R^2 \) method:

winner <- lm(price ~ cond + stock_photo + wheels, data = mario_kart)
summary(winner)


Call:
lm(formula = price ~ cond + stock_photo + wheels, data = mario_kart)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.454  -2.959  -0.949   2.712  14.061 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     36.0483     0.9745  36.990  < 2e-16 ***
condnew          5.1763     0.9961   5.196 7.21e-07 ***
stock_photoyes   1.1177     1.0192   1.097    0.275    
wheels           7.2984     0.5448  13.397  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.884 on 137 degrees of freedom
Multiple R-squared:  0.719, Adjusted R-squared:  0.7128 
F-statistic: 116.8 on 3 and 137 DF,  p-value: < 2.2e-16

Model Validation and Diagnostics

Okay, I've got a best fit model (i.e “Good squeezy squeezy”) BUT…

ARE MY ASSUMPTIONS MET? (i.e. is it the right fruit?!?!)

Wrapping up Multiple Linear Regression

These were our steps for fitting a linear model:

Model Building: What type(s) of model to use?
Model Selection: Of these model types, which model fits the data the best?
Model Validation: Does this resulting model meet the assumptions?

Assumptions of Multiple Linear Regression

the residuals of the model are nearly normal,
the variability of the residuals is nearly constant,
the residuals are independent, and
each variable is linearly related to the outcome.

Nearly normal residuals?

the residuals of the model are nearly normal,
the variability of the residuals is nearly constant,
the residuals are independent, and
each variable is linearly related to the outcome.

Nearly normal residuals?

Do a qqplot of the residuals:

qqnorm(residuals(winner))

plot of chunk unnamed-chunk-13

Constant variability?

the residuals of the model are nearly normal,
the variability of the residuals is nearly constant,
the residuals are independent, and
each variable is linearly related to the outcome.

Constant variability?

Residuals vs. fitted values:

plot(fitted(winner), residuals(winner))

plot of chunk unnamed-chunk-14

Independence??

the residuals of the model are nearly normal,
the variability of the residuals is nearly constant,
the residuals are independent, and
each variable is linearly related to the outcome.

Independence?

Residuals vs order of collection (ID):

plot(rownames(mario_kart), residuals(winner))

plot of chunk unnamed-chunk-15

Linearity

the residuals of the model are nearly normal,
the variability of the residuals is nearly constant,
the residuals are independent, and
each variable is linearly related to the outcome.

Linearity

par(mfrow=c(1,3))
boxplot(residuals(winner) ~ cond, data = mario_kart)
boxplot(residuals(winner) ~ stock_photo, data = mario_kart)
plot(residuals(winner) ~ wheels, data = mario_kart)

plot of chunk unnamed-chunk-16