Multiple Linear Regression

M. Drew LaMar
February 10, 2020

Class announcements

  • Reading Assignments
    • OpenStats, Chapter 8: Introduction to linear regression (for Homework #3)
    • OpenStats, Chapter 9: Multiple and logistic regression (this week's lectures)
  • Homework #3 up on Blackboard after class (due Monday, February 17, 11:59pm)

Example: Biting lizards

Male lizards in the species Crotaphytus collaris use their jaws as weapons during territorial interactions. Lappin and Husak (2005) tested whether weapon performance (bite force) predicted territory size in this species.

Example: Biting lizards

plot of chunk unnamed-chunk-2

Example: Biting lizards

Computing best-fit line using R

(biteRegression <- lm(territory.area ~ bite, data = biteData))

Call:
lm(formula = territory.area ~ bite, data = biteData)

Coefficients:
(Intercept)         bite  
     -31.54        11.68  

Example: Biting lizards

Bonus!!! With lm, can add best-fit line to plot.

# Need to adjust margins to see axis labels
par(mar=c(4.5,5.0,2,2))

# Scatter plot
plot(biteData, pch=16, col="firebrick", cex=1.5, cex.lab=1.5, xlab="Bite force (N)", ylab=expression("Territory area" ~ (m^2)))

# Add in the best-fit line
abline(biteRegression, lwd=3)

Example: Biting lizards

Bonus!!! With lm, can add best-fit line to plot.

plot of chunk unnamed-chunk-5

Predicted values and residuals

Definition: The predicted value of \( Y \) (denoted \( \hat{Y} \), or \( \mu_{Y\, |\, X} \)) from a regression line estimates the mean value of \( Y \) for all individuals having a given value of \( X \).

Definition: Residuals measure the scatter of points above and below the least-squares regression line, and are denoted by

\[ r_{i} = \hat{Y}_{i} - Y_{i}, \]
where \( \hat{Y}_{i} = a + bX_{i} \).

Predicted values and residuals

Prediction values

We can predict what the mean value of \( Y \) is for values of the explanatory variable \( X \) not represented in our data, as long as we are within the range of values of the data.

The function predict accomplishes this, and even gives us a standard error for our estimate.

(pred_5.1 <- predict(biteRegression, data.frame(bite = 5.1), se.fit = TRUE))
$fit
       1 
28.01492 

$se.fit
[1] 2.163259

$df
[1] 9

$residual.scale
[1] 5.788413

Prediction values

plot of chunk unnamed-chunk-7

Residual plot

Definition: a residual plot is a scatter plot of the residuals \( (\hat{Y}_{i}-Y_{i}) \) against the \( X_{i} \), the values of the explanatory variable.

plot of chunk unnamed-chunk-8

Residual plots

# Get residuals from regression output
biteData$res = resid(biteRegression)

# Plot residuals
plot(res ~ bite, data=biteData, pch=16, cex=1.5, cex.lab=1.5, col="firebrick", xlab="Bite force (N)", ylab="Residuals")

# Add a horizontal line at zero
abline(h=0, lty=2)

Residual plots to check assumptions

hist(biteData$res)

plot of chunk unnamed-chunk-10

qqnorm(biteData$res)

plot of chunk unnamed-chunk-11

Linear regression

summary(biteRegression)

Linear regression


Call:
lm(formula = territory.area ~ bite, data = biteData)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.0472 -4.5101 -0.5504  3.6689 10.2237 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -31.539     23.513  -1.341   0.2127  
bite          11.677      4.848   2.409   0.0393 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.788 on 9 degrees of freedom
Multiple R-squared:  0.3919,    Adjusted R-squared:  0.3244 
F-statistic: 5.801 on 1 and 9 DF,  p-value: 0.03934

\( P \)-value is less than 0.05, so we can reject the null hypothesis that the slope \( \beta = 0 \).

Variation explained by explanatory variable

We can measure how well the line “fits” the data by estimating the \( R^{2} \) value, i.e.

\[ R^{2} = \frac{\sigma^{2}_{\mathrm{regression}}}{\sigma^{2}_{\mathrm{response}}} = \frac{\sigma^{2}_{\mathrm{response}} - \sigma^{2}_{\mathrm{residual}}}{\sigma^{2}_{\mathrm{response}}}. \]

This also can be said to measure the fraction of variation in \( Y \) that is “explained” by \( X \).

Variation explained by explanatory variable

Basic idea is:

  • If \( R^{2} \) is close to 1, then \( X \) is explaining most of the variation in \( Y \), and any other variation which could be caused by other sources is negligible in comparison.
  • If \( R^{2} \) is close to 0, then \( X \) is explaining very little of the variation in \( Y \), and the remaining variation is caused by other sources not accounted for or measured in the system of study.

Variation explained by explanatory variable

For the lizard example,

biteRegSummary <- summary(biteRegression)
biteRegSummary$r.squared
[1] 0.3919418

Thus, 39% of the variation in territory area is explained by bite force.

Annotate plot

# Need to adjust margins to see axis labels
par(mar=c(4.5,5.0,2,2))

# Scatter plot
plot(territory.area ~ bite, pch=16, col="firebrick", cex=1.5, cex.lab=1.5, xlab="Bite force (N)", ylab=expression("Territory area" ~ (m^2)), data=biteData)

# Add in the best-fit line
abline(biteRegression, lwd=3)

# Text
text(5, 30.5, expression(R^{2} ~ "=" ~ 0.39), cex=1.5)
text(5.14, 31.7, expression(y ~ "=" -31.54 + 11.68), cex=1.5)

Annotate plot

plot of chunk unnamed-chunk-16

Summary of a regression in R

summary(biteRegression)

Summary of a regression in R


Call:
lm(formula = territory.area ~ bite, data = biteData)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.0472 -4.5101 -0.5504  3.6689 10.2237 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -31.539     23.513  -1.341   0.2127  
bite          11.677      4.848   2.409   0.0393 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.788 on 9 degrees of freedom
Multiple R-squared:  0.3919,    Adjusted R-squared:  0.3244 
F-statistic: 5.801 on 1 and 9 DF,  p-value: 0.03934

Introduction to Multiple Linear Regression

Definition: Multiple regression extends simple two-variable regression to the case that still has one response but many predictors (denoted \( x_1 \), \( x_2 \), \( x_3 \), …).

The method is motivated by scenarios where many variables may be simultaneously connected to an output.

Assumptions of Multiple Linear Regression

  1. the residuals of the model are nearly normal,
  2. the variability of the residuals is nearly constant,
  3. the residuals are independent, and
  4. each variable is linearly related to the outcome.

Our Example

We're going to look at auction data for the Mario Kart Wii game.

The Data

mario_kart <- marioKart %>% 
  dplyr::select(price = totalPr, 
         cond, 
         stock_photo = stockPhoto, 
         duration, wheels) %>% 
  mutate(cond = forcats::fct_relevel(cond, c("used", "new"))) %>%
  filter(price < 100)
str(mario_kart)

The Data

'data.frame':   141 obs. of  5 variables:
 $ price      : num  51.5 37 45.5 44 71 ...
 $ cond       : Factor w/ 2 levels "used","new": 2 1 2 2 2 2 1 2 1 1 ...
 $ stock_photo: Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
 $ duration   : int  3 7 3 3 1 3 1 1 3 7 ...
 $ wheels     : int  1 1 1 1 2 0 0 2 1 1 ...

Price vs. condition

mario_kart %>% 
  ggplot(aes(x = cond, y = price)) + 
  geom_point(position = position_jitter(width = 0.1))

plot of chunk unnamed-chunk-20

Price vs. condition

plot of chunk unnamed-chunk-21

Price vs. condition

summary(mdl_cond)

Call:
lm(formula = price ~ cond, data = mario_kart)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.8911  -5.8311   0.1289   4.1289  22.1489 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   42.871      0.814  52.668  < 2e-16 ***
condnew       10.900      1.258   8.662 1.06e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.371 on 139 degrees of freedom
Multiple R-squared:  0.3506,    Adjusted R-squared:  0.3459 
F-statistic: 75.03 on 1 and 139 DF,  p-value: 1.056e-14

Multiple linear regression

A multiple regression model is a linear model with many predictors. In general, we write the model as \[ \hat{y} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{k}x_{k} \] when there are \( k \) predictors.

All predictors included

mdl_full <- lm(price ~ cond + stock_photo + duration + wheels, data = mario_kart)
summary(mdl_full)

Call:
lm(formula = price ~ cond + stock_photo + duration + wheels, 
    data = mario_kart)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.3788  -2.9854  -0.9654   2.6915  14.0346 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    36.21097    1.51401  23.917  < 2e-16 ***
condnew         5.13056    1.05112   4.881 2.91e-06 ***
stock_photoyes  1.08031    1.05682   1.022    0.308    
duration       -0.02681    0.19041  -0.141    0.888    
wheels          7.28518    0.55469  13.134  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.901 on 136 degrees of freedom
Multiple R-squared:  0.719, Adjusted R-squared:  0.7108 
F-statistic: 87.01 on 4 and 136 DF,  p-value: < 2.2e-16