Exploring the Geometry of Multiple Linear Regression

PART I: Warm-Up

Example: Predicting Book Weights for Amazon Shipping

When you buy a book from Amazon, you get a quote for how much it costs to ship, based on the weight of the book. If you didn’t know the weight of a book, what other characteristics could you measure to help predict it?

# install.packages("DAAG")  # run once if DAAG is not yet installed
library(DAAG)
## Loading required package: lattice
data(allbacks)
books <- allbacks[, c(3, 1, 4)]  # keep the weight, volume, and cover columns

head(books)
##   weight volume cover
## 1    800    885    hb
## 2    950   1016    hb
## 3   1050   1125    hb
## 4    350    239    hb
## 5    750    701    hb
## 6    600    641    hb

First, make a plot:

library(tidyverse)
ggplot(books, aes(x=volume, y=weight))+
  geom_point()

Now let’s fit a simple linear regression model:

m1 <- lm(weight ~ volume, data = books)
summary(m1)
## 
## Call:
## lm(formula = weight ~ volume, data = books)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -189.97 -109.86   38.08  109.73  145.57 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 107.67931   88.37758   1.218    0.245    
## volume        0.70864    0.09746   7.271 6.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123.9 on 13 degrees of freedom
## Multiple R-squared:  0.8026, Adjusted R-squared:  0.7875 
## F-statistic: 52.87 on 1 and 13 DF,  p-value: 6.262e-06

What did we learn about the relationship between weight and volume in this model?

Add the regression line to the plot:

library(tidyverse)
ggplot(books, aes(x=volume, y=weight))+
  geom_point()+
  geom_abline(intercept = m1$coef[1], slope = m1$coef[2],
              col = "orchid", lwd=1)

Review SLR:

Q1: What is the equation for the line?

\[ \hat{y} = 107.7 + 0.709 x \]

\[ \widehat{\text{weight}} = 107.7 + 0.709 \cdot \text{volume} \]
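
These numbers are rounded versions of the fitted coefficients, which we can pull straight from the model object:

coef(m1)  # (Intercept) ≈ 107.679, volume ≈ 0.709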

Q2: Does this appear to be a reasonable setting to apply linear regression?

We need to check:

  1. Linear trend
  2. Independent observations
  3. Normal residuals
  4. Equal variance

Make a residual plot:

ggplot(m1, aes(x = .fitted, y = .stdresid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red")

Make a QQ plot:

qqnorm(m1$residuals)
qqline(m1$residuals)
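
If you prefer to stay in ggplot2 (version 3.0.0 or later), a minimal sketch of the same check using the standardized residuals:

ggplot(m1, aes(sample = .stdresid)) +
  stat_qq() +
  stat_qq_line()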

Q3: Is volume a significant predictor?

summary(m1)
## 
## Call:
## lm(formula = weight ~ volume, data = books)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -189.97 -109.86   38.08  109.73  145.57 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 107.67931   88.37758   1.218    0.245    
## volume        0.70864    0.09746   7.271 6.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123.9 on 13 degrees of freedom
## Multiple R-squared:  0.8026, Adjusted R-squared:  0.7875 
## F-statistic: 52.87 on 1 and 13 DF,  p-value: 6.262e-06
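
A 95% confidence interval for the slope carries the same information as the t-test; a quick check:

confint(m1, "volume", level = 0.95)  # interval excludes 0, consistent with the small p-value
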
Q4: How much of the variation in weight is explained by the model containing volume?

anova(m1)
## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## volume     1 812132  812132  52.868 6.262e-06 ***
## Residuals 13 199701   15362                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
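
Note that R-squared can be recovered from this table as the model sum of squares over the total; a quick check:

ss <- anova(m1)[["Sum Sq"]]  # model SS (812132) and residual SS (199701)
ss[1] / sum(ss)              # ≈ 0.8026, matching Multiple R-squared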

Fitting the Model with a Categorical Indicator (Different Intercepts)

Maybe the type of book cover (hardback or paperback) also affects weight:

ggplot(books, aes(x=volume, y=weight, color=cover))+
  geom_point()

Let’s make a model that includes cover:

m2 <- lm(weight ~ volume + cover, data = books)
summary(m2)
## 
## Call:
## lm(formula = weight ~ volume + cover, data = books)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.10  -32.32  -16.10   28.93  210.95 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  197.96284   59.19274   3.344 0.005841 ** 
## volume         0.71795    0.06153  11.669  6.6e-08 ***
## coverpb     -184.04727   40.49420  -4.545 0.000672 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.2 on 12 degrees of freedom
## Multiple R-squared:  0.9275, Adjusted R-squared:  0.9154 
## F-statistic: 76.73 on 2 and 12 DF,  p-value: 1.455e-07

How do we interpret these estimates?

Add the parallel lines to the plot:

ggplot(books, aes(x=volume, y=weight, color=cover))+
  geom_point()+
  geom_abline(intercept = m2$coef[1], slope = m2$coef[2], col = "red") +
  geom_abline(intercept = m2$coef[1] + m2$coef[3], slope = m2$coef[2], col = "blue")

The coefficient on the dummy variable tells us:

  • How much vertical separation there is between the two lines
  • How much weight is expected to change (here, to decrease by about 184) when cover goes from 0 (hardback) to 1 (paperback) and volume is held fixed.

Each \(\hat{\beta}_i\) tells you how much you expect \(Y\) to change for a one-unit increase in \(X_i\), holding all other variables constant.
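
To see this concretely, we can predict the weight of a hardback and a paperback of equal volume; the difference equals the coverpb estimate. A minimal sketch (a volume of 1000 is an arbitrary choice):

new_books <- data.frame(volume = c(1000, 1000), cover = c("hb", "pb"))
predict(m2, newdata = new_books)        # two fitted weights
diff(predict(m2, newdata = new_books))  # ≈ -184, the coverpb estimate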

Questions:

Q1: Is the difference between cover types significant?

summary(m2)
## 
## Call:
## lm(formula = weight ~ volume + cover, data = books)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.10  -32.32  -16.10   28.93  210.95 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  197.96284   59.19274   3.344 0.005841 ** 
## volume         0.71795    0.06153  11.669  6.6e-08 ***
## coverpb     -184.04727   40.49420  -4.545 0.000672 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.2 on 12 degrees of freedom
## Multiple R-squared:  0.9275, Adjusted R-squared:  0.9154 
## F-statistic: 76.73 on 2 and 12 DF,  p-value: 1.455e-07
Q2: How much of the variation in weight is explained by a model containing both volume and cover?

anova(m2)
## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## volume     1 812132  812132 132.809  7.58e-08 ***
## cover      1 126320  126320  20.657 0.0006719 ***
## Residuals 12  73381    6115                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
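
Because m1 is nested inside m2, a partial F-test also answers whether adding cover improves the fit; note that its F value matches the cover row above:

anova(m1, m2)  # partial F-test for adding cover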

Fitting the Model with a Categorical Interaction (Different Intercepts and Slopes)

m3 <- lm(weight ~ volume + cover + volume:cover, data = books)
summary(m3)
## 
## Call:
## lm(formula = weight ~ volume + cover + volume:cover, data = books)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -89.67 -32.07 -21.82  17.94 215.91 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     161.58654   86.51918   1.868   0.0887 .  
## volume            0.76159    0.09718   7.837 7.94e-06 ***
## coverpb        -120.21407  115.65899  -1.039   0.3209    
## volume:coverpb   -0.07573    0.12802  -0.592   0.5661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.41 on 11 degrees of freedom
## Multiple R-squared:  0.9297, Adjusted R-squared:  0.9105 
## F-statistic:  48.5 on 3 and 11 DF,  p-value: 1.245e-06

Add lines to the plot:

ggplot(books, aes(x=volume, y=weight, color=cover))+
  geom_point()+
  geom_abline(intercept = m3$coef[1], slope = m3$coef[2], col = "red") +
  geom_abline(intercept = m3$coef[1] + m3$coef[3], slope = m3$coef[2]+m3$coef[4], col = "blue")

Question:

Q1: Do we have evidence that the two types of books have different relationships between volume and weight?
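
A partial F-test comparing the parallel-lines model to the interaction model answers this directly; with a single added term, its p-value matches the interaction term’s t-test:

anova(m2, m3)  # tests whether the slopes differ by cover type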

Summary: Take home messages

  • There is a statistically significant relationship between volume and weight.
  • There is a statistically significant difference in weight between paperback and hardcover books, when controlling for volume.
  • There is no strong evidence that the relationship between volume and weight differs between paperbacks and hardbacks.

PART II: Multiple Numeric Predictors

Multiple Regression

Allows us to create a model that explains one numerical response variable as a linear function of many explanatory variables, which can be both numerical and categorical.

We posit the true model:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon; \quad \epsilon \sim N(0, \sigma^2) \]
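
To make the model concrete, here is a minimal sketch that simulates one dataset from this model with p = 2 and arbitrary coefficient values, then recovers them with lm():

set.seed(1)                                     # arbitrary seed, for reproducibility
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 1)  # beta_0 = 1, beta_1 = 2, beta_2 = -0.5
coef(lm(y ~ x1 + x2))                           # estimates should land near (1, 2, -0.5)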

We use the data to estimate our fitted model:

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \ldots + \hat{\beta}_p X_p \]

Estimating \(\beta_0, \beta_1, \ldots, \beta_p\)

In least-squares regression, we’re still finding the estimates that minimize the sum of squared residuals.

\[ e_i = y_i - \hat{y}_i \]

\[ \sum_{i = 1}^n e_i^2 \]
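
For m1 from Part I, this quantity is exactly the Residuals Sum Sq in the anova table:

sum(resid(m1)^2)  # ≈ 199701, matching anova(m1)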

And yes, they have a closed-form solution with linear algebra.

\[ \hat{\beta} = (X'X)^{-1}X'Y \]
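
As a sanity check, here is a minimal sketch that computes \(\hat{\beta}\) by matrix algebra for the Part I book model and compares it with lm():

X <- model.matrix(weight ~ volume + cover, data = books)  # design matrix: intercept, volume, coverpb
y <- books$weight
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y              # (X'X)^{-1} X'Y
cbind(beta_hat, coef(m2))                                 # the two columns should agree

Example: NYC Zagat Ratings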

What determines the price of a meal?

Let’s look at the relationship between price, food rating, and decor rating.

nyc <- read.csv("http://andrewpbray.github.io/data/nyc.csv")
dim(nyc)
## [1] 168   7
head(nyc)
##   Case          Restaurant Price Food Decor Service East
## 1    1 Daniella Ristorante    43   22    18      20    0
## 2    2  Tello's Ristorante    32   20    19      19    0
## 3    3          Biricchino    34   21    13      18    0
## 4    4             Bottino    41   20    20      17    0
## 5    5          Da Umberto    54   24    19      21    0
## 6    6            Le Madri    52   22    22      21    0
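
A scatterplot matrix gives a quick look at the pairwise relationships before we model them; a minimal sketch in base R:

pairs(nyc[, c("Price", "Food", "Decor")])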

Model 1: Food + Decor

m1 <- lm(Price ~ Food + Decor, data = nyc)
summary(m1)
## 
## Call:
## lm(formula = Price ~ Food + Decor, data = nyc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.945  -3.766  -0.153   3.701  18.757 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -24.5002     4.7230  -5.187 6.19e-07 ***
## Food          1.6461     0.2615   6.294 2.68e-09 ***
## Decor         1.8820     0.1919   9.810  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.788 on 165 degrees of freedom
## Multiple R-squared:  0.6167, Adjusted R-squared:  0.6121 
## F-statistic: 132.7 on 2 and 165 DF,  p-value: < 2.2e-16
anova(m1)
## Analysis of Variance Table
## 
## Response: Price
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Food        1 5670.3  5670.3 169.262 < 2.2e-16 ***
## Decor       1 3223.7  3223.7  96.228 < 2.2e-16 ***
## Residuals 165 5527.5    33.5                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 2: Food + Decor + East

Does the price depend on where the restaurant is located in Manhattan?

m2 <- lm(Price ~ Food + Decor + East, data = nyc)
summary(m2)
## 
## Call:
## lm(formula = Price ~ Food + Decor + East, data = nyc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.0451  -3.8809   0.0389   3.3918  17.7557 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -24.0269     4.6727  -5.142 7.67e-07 ***
## Food          1.5363     0.2632   5.838 2.76e-08 ***
## Decor         1.9094     0.1900  10.049  < 2e-16 ***
## East          2.0670     0.9318   2.218   0.0279 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.72 on 164 degrees of freedom
## Multiple R-squared:  0.6279, Adjusted R-squared:  0.6211 
## F-statistic: 92.24 on 3 and 164 DF,  p-value: < 2.2e-16
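
One way to picture the East effect is to draw the two parallel lines in the Price-vs-Decor plane, holding Food fixed at its median (an arbitrary but convenient choice):

med_food <- median(nyc$Food)
ggplot(nyc, aes(x = Decor, y = Price, color = factor(East))) +
  geom_point() +
  geom_abline(intercept = m2$coef[1] + m2$coef[2] * med_food,
              slope = m2$coef[3], col = "red") +
  geom_abline(intercept = m2$coef[1] + m2$coef[2] * med_food + m2$coef[4],
              slope = m2$coef[3], col = "blue")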

Model 3: Food + Decor*East

m3 <- lm(Price ~ Food + Decor + East + Decor:East, data = nyc)
summary(m3)
## 
## Call:
## lm(formula = Price ~ Food + Decor + East + Decor:East, data = nyc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.7855  -3.6649   0.3785   3.7292  17.6358 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -29.3971     6.3770  -4.610 8.10e-06 ***
## Food          1.6634     0.2822   5.895 2.09e-08 ***
## Decor         2.0695     0.2298   9.006 5.42e-16 ***
## East          9.6616     6.2184   1.554    0.122    
## Decor:East   -0.4346     0.3518  -1.235    0.219    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.711 on 163 degrees of freedom
## Multiple R-squared:  0.6313, Adjusted R-squared:  0.6223 
## F-statistic: 69.78 on 4 and 163 DF,  p-value: < 2.2e-16
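
As in Part I, a partial F-test between the nested models addresses this directly:

anova(m2, m3)  # tests whether the Decor slope differs by location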

Summary: Take home messages

  • The East term was significant in model 2, suggesting that there is a significant relationship between location and price.
  • That term became nonsignificant when we allowed the slope of Decor to vary with location, and that difference in slopes was also nonsignificant.
  • Notice that the slope estimate for a given variable will almost always change depending on which other variables are in the model.