Exploring the Geometry of Multiple Linear Regression

PART I: Warm-Up

Example: Predicting Book Weights for Amazon Shipping

When you buy a book from Amazon, you get a quote for how much it costs to ship, based on the weight of the book. If you didn’t know the weight of a book, what other characteristics could you measure to help predict it?

# install.packages("DAAG")  # run once if DAAG is not yet installed
library(DAAG)
## Loading required package: lattice
data(allbacks)
books <- allbacks[, c(3, 1, 4)]  # keep the weight, volume, and cover columns

head(books)
##   weight volume cover
## 1    800    885    hb
## 2    950   1016    hb
## 3   1050   1125    hb
## 4    350    239    hb
## 5    750    701    hb
## 6    600    641    hb

First, make a plot:

library(tidyverse)
ggplot(books, aes(x=volume, y=weight))+
  geom_point()

Now let’s fit a simple linear regression model:

m1 <- lm(weight ~ volume, data = books)
summary(m1)
## 
## Call:
## lm(formula = weight ~ volume, data = books)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -189.97 -109.86   38.08  109.73  145.57 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 107.67931   88.37758   1.218    0.245    
## volume        0.70864    0.09746   7.271 6.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123.9 on 13 degrees of freedom
## Multiple R-squared:  0.8026, Adjusted R-squared:  0.7875 
## F-statistic: 52.87 on 1 and 13 DF,  p-value: 6.262e-06

What did we learn about the relationship between weight and volume in this model?

Add the regression line to the plot:

library(tidyverse)
ggplot(books, aes(x=volume, y=weight))+
  geom_point()+
  geom_abline(intercept = m1$coef[1], slope = m1$coef[2],
              col = "orchid", lwd=1)

Review SLR:

Q1: What is the equation for the line?

\[ \hat{y} = 107.7 + 0.709 x \]

\[ \widehat{\text{weight}} = 107.7 + 0.709 \cdot \text{volume} \]
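
These numbers are rounded versions of the fitted coefficients, which we can pull straight from the model object:

coef(m1)  # (Intercept) ≈ 107.679, volume ≈ 0.709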

Q2: Does this appear to be a reasonable setting to apply linear regression?

We need to check:

  1. Linear trend
  2. Independent observations
  3. Normal residuals
  4. Equal variance

Make a residual plot:

ggplot(m1, aes(x = .fitted, y = .stdresid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red")

Make a QQ plot:

qqnorm(m1$residuals)
qqline(m1$residuals)
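
If you prefer to stay in ggplot2 (version 3.0.0 or later), a minimal sketch of the same check using the standardized residuals:

ggplot(m1, aes(sample = .stdresid)) +
  stat_qq() +
  stat_qq_line()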

Q3: Is volume a significant predictor?

summary(m1)
## 
## Call:
## lm(formula = weight ~ volume, data = books)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -189.97 -109.86   38.08  109.73  145.57 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 107.67931   88.37758   1.218    0.245    
## volume        0.70864    0.09746   7.271 6.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123.9 on 13 degrees of freedom
## Multiple R-squared:  0.8026, Adjusted R-squared:  0.7875 
## F-statistic: 52.87 on 1 and 13 DF,  p-value: 6.262e-06
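
A 95% confidence interval for the slope carries the same information as the t-test; a quick check:

confint(m1, "volume", level = 0.95)  # interval excludes 0, consistent with the small p-value
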
Q4: How much of the variation in weight is explained by the model containing volume?

anova(m1)
## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## volume     1 812132  812132  52.868 6.262e-06 ***
## Residuals 13 199701   15362                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
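
Note that R-squared can be recovered from this table as the model sum of squares over the total; a quick check:

ss <- anova(m1)[["Sum Sq"]]  # model SS (812132) and residual SS (199701)
ss[1] / sum(ss)              # ≈ 0.8026, matching Multiple R-squared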

Fitting the Model with a Categorical Indicator (Different Intercepts)

Maybe the type of book cover (hardback or paperback) also affects weight:

ggplot(books, aes(x=volume, y=weight, color=cover))+
  geom_point()

Let’s make a model that includes cover:

m2 <- lm(weight ~ volume + cover, data = books)
summary(m2)
## 
## Call:
## lm(formula = weight ~ volume + cover, data = books)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.10  -32.32  -16.10   28.93  210.95 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  197.96284   59.19274   3.344 0.005841 ** 
## volume         0.71795    0.06153  11.669  6.6e-08 ***
## coverpb     -184.04727   40.49420  -4.545 0.000672 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.2 on 12 degrees of freedom
## Multiple R-squared:  0.9275, Adjusted R-squared:  0.9154 
## F-statistic: 76.73 on 2 and 12 DF,  p-value: 1.455e-07

How do we interpret these estimates?

Add the parallel lines to the plot:

ggplot(books, aes(x=volume, y=weight, color=cover))+
  geom_point()+
  geom_abline(intercept = m2$coef[1], slope = m2$coef[2], col = "red") +
  geom_abline(intercept = m2$coef[1] + m2$coef[3], slope = m2$coef[2], col = "blue")

The coefficient on the dummy variable tells us:

  • How much vertical separation there is between the two lines
  • How much weight is expected to change (here, to decrease by about 184) when cover goes from 0 (hardback) to 1 (paperback) and volume is held fixed.

Each \(\hat{\beta}_i\) tells you how much you expect \(Y\) to change for a one-unit increase in \(X_i\), holding all other variables constant.
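
To see this concretely, we can predict the weight of a hardback and a paperback of equal volume; the difference equals the coverpb estimate. A minimal sketch (a volume of 1000 is an arbitrary choice):

new_books <- data.frame(volume = c(1000, 1000), cover = c("hb", "pb"))
predict(m2, newdata = new_books)        # two fitted weights
diff(predict(m2, newdata = new_books))  # ≈ -184, the coverpb estimate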

Questions:

Q1: Is the difference between cover types significant?

summary(m2)
## 
## Call:
## lm(formula = weight ~ volume + cover, data = books)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.10  -32.32  -16.10   28.93  210.95 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  197.96284   59.19274   3.344 0.005841 ** 
## volume         0.71795    0.06153  11.669  6.6e-08 ***
## coverpb     -184.04727   40.49420  -4.545 0.000672 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.2 on 12 degrees of freedom
## Multiple R-squared:  0.9275, Adjusted R-squared:  0.9154 
## F-statistic: 76.73 on 2 and 12 DF,  p-value: 1.455e-07
Q2: How much of the variation in weight is explained by a model containing both volume and cover?

anova(m2)
## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## volume     1 812132  812132 132.809  7.58e-08 ***
## cover      1 126320  126320  20.657 0.0006719 ***
## Residuals 12  73381    6115                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
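
Because m1 is nested inside m2, a partial F-test also answers whether adding cover improves the fit; note that its F value matches the cover row above:

anova(m1, m2)  # partial F-test for adding cover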

Fitting the Model with a Categorical Interaction (Different Intercepts and Slopes)

m3 <- lm(weight ~ volume + cover + volume:cover, data = books)
summary(m3)
## 
## Call:
## lm(formula = weight ~ volume + cover + volume:cover, data = books)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -89.67 -32.07 -21.82  17.94 215.91 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     161.58654   86.51918   1.868   0.0887 .  
## volume            0.76159    0.09718   7.837 7.94e-06 ***
## coverpb        -120.21407  115.65899  -1.039   0.3209    
## volume:coverpb   -0.07573    0.12802  -0.592   0.5661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.41 on 11 degrees of freedom
## Multiple R-squared:  0.9297, Adjusted R-squared:  0.9105 
## F-statistic:  48.5 on 3 and 11 DF,  p-value: 1.245e-06

Add lines to the plot:

ggplot(books, aes(x=volume, y=weight, color=cover))+
  geom_point()+
  geom_abline(intercept = m3$coef[1], slope = m3$coef[2], col = "red") +
  geom_abline(intercept = m3$coef[1] + m3$coef[3], slope = m3$coef[2]+m3$coef[4], col = "blue")

Question:

Q1: Do we have evidence that the two types of books have different relationships between volume and weight?
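
A partial F-test comparing the parallel-lines model to the interaction model answers this directly; with a single added term, its p-value matches the interaction term’s t-test:

anova(m2, m3)  # tests whether the slopes differ by cover type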

Summary: Take home messages

  • There is a statistically significant relationship between volume and weight.
  • There is a statistically significant difference in weight between paperback and hardcover books, when controlling for volume.
  • There is no strong evidence that the relationship between volume and weight differs between paperbacks and hardbacks.

PART II: Multiple Numeric Predictors

Multiple Regression

Allows us to create a model that explains one numerical response variable as a linear function of many explanatory variables, which can be both numerical and categorical.

We posit the true model:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon; \quad \epsilon \sim N(0, \sigma^2) \]
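
To make the model concrete, here is a minimal sketch that simulates one dataset from this model with p = 2 and arbitrary coefficient values, then recovers them with lm():

set.seed(1)                                     # arbitrary seed, for reproducibility
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 1)  # beta_0 = 1, beta_1 = 2, beta_2 = -0.5
coef(lm(y ~ x1 + x2))                           # estimates should land near (1, 2, -0.5)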

We use the data to estimate our fitted model:

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \ldots + \hat{\beta}_p X_p \]

Estimating \(\beta_0, \beta_1, \ldots, \beta_p\)

In least-squares regression, we’re still finding the estimates that minimize the sum of squared residuals.

\[ e_i = y_i - \hat{y}_i \]

\[ \sum_{i = 1}^n e_i^2 \]
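
For m1 from Part I, this quantity is exactly the Residuals Sum Sq in the anova table:

sum(resid(m1)^2)  # ≈ 199701, matching anova(m1)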

And yes, they have a closed-form solution with linear algebra.

\[ \hat{\beta} = (X'X)^{-1}X'Y \]
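
As a sanity check, here is a minimal sketch that computes \(\hat{\beta}\) by matrix algebra for the Part I book model and compares it with lm():

X <- model.matrix(weight ~ volume + cover, data = books)  # design matrix: intercept, volume, coverpb
y <- books$weight
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y              # (X'X)^{-1} X'Y
cbind(beta_hat, coef(m2))                                 # the two columns should agree

Example: NYC Zagat Ratings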

What determines the price of a meal?

Let’s look at the relationship between price, food rating, and decor rating.

nyc <- read.csv("http://andrewpbray.github.io/data/nyc.csv")
dim(nyc)
## [1] 168   7
head(nyc)
##   Case          Restaurant Price Food Decor Service East
## 1    1 Daniella Ristorante    43   22    18      20    0
## 2    2  Tello's Ristorante    32   20    19      19    0
## 3    3          Biricchino    34   21    13      18    0
## 4    4             Bottino    41   20    20      17    0
## 5    5          Da Umberto    54   24    19      21    0
## 6    6            Le Madri    52   22    22      21    0
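
A scatterplot matrix gives a quick look at the pairwise relationships before we model them; a minimal sketch in base R:

pairs(nyc[, c("Price", "Food", "Decor")])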

Model 1: Food + Decor

m1 <- lm(Price ~ Food + Decor, data = nyc)
summary(m1)
## 
## Call:
## lm(formula = Price ~ Food + Decor, data = nyc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.945  -3.766  -0.153   3.701  18.757 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -24.5002     4.7230  -5.187 6.19e-07 ***
## Food          1.6461     0.2615   6.294 2.68e-09 ***
## Decor         1.8820     0.1919   9.810  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.788 on 165 degrees of freedom
## Multiple R-squared:  0.6167, Adjusted R-squared:  0.6121 
## F-statistic: 132.7 on 2 and 165 DF,  p-value: < 2.2e-16
anova(m1)
## Analysis of Variance Table
## 
## Response: Price
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Food        1 5670.3  5670.3 169.262 < 2.2e-16 ***
## Decor       1 3223.7  3223.7  96.228 < 2.2e-16 ***
## Residuals 165 5527.5    33.5                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 2: Food + Decor + East

Does the price depend on where the restaurant is located in Manhattan?

m2 <- lm(Price ~ Food + Decor + East, data = nyc)
summary(m2)
## 
## Call:
## lm(formula = Price ~ Food + Decor + East, data = nyc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.0451  -3.8809   0.0389   3.3918  17.7557 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -24.0269     4.6727  -5.142 7.67e-07 ***
## Food          1.5363     0.2632   5.838 2.76e-08 ***
## Decor         1.9094     0.1900  10.049  < 2e-16 ***
## East          2.0670     0.9318   2.218   0.0279 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.72 on 164 degrees of freedom
## Multiple R-squared:  0.6279, Adjusted R-squared:  0.6211 
## F-statistic: 92.24 on 3 and 164 DF,  p-value: < 2.2e-16
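
One way to picture the East effect is to draw the two parallel lines in the Price-vs-Decor plane, holding Food fixed at its median (an arbitrary but convenient choice):

med_food <- median(nyc$Food)
ggplot(nyc, aes(x = Decor, y = Price, color = factor(East))) +
  geom_point() +
  geom_abline(intercept = m2$coef[1] + m2$coef[2] * med_food,
              slope = m2$coef[3], col = "red") +
  geom_abline(intercept = m2$coef[1] + m2$coef[2] * med_food + m2$coef[4],
              slope = m2$coef[3], col = "blue")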

Model 3: Food + Decor*East

m3 <- lm(Price ~ Food + Decor + East + Decor:East, data = nyc)
summary(m3)
## 
## Call:
## lm(formula = Price ~ Food + Decor + East + Decor:East, data = nyc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.7855  -3.6649   0.3785   3.7292  17.6358 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -29.3971     6.3770  -4.610 8.10e-06 ***
## Food          1.6634     0.2822   5.895 2.09e-08 ***
## Decor         2.0695     0.2298   9.006 5.42e-16 ***
## East          9.6616     6.2184   1.554    0.122    
## Decor:East   -0.4346     0.3518  -1.235    0.219    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.711 on 163 degrees of freedom
## Multiple R-squared:  0.6313, Adjusted R-squared:  0.6223 
## F-statistic: 69.78 on 4 and 163 DF,  p-value: < 2.2e-16
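
As in Part I, a partial F-test between the nested models addresses this directly:

anova(m2, m3)  # tests whether the Decor slope differs by location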

Summary: Take home messages

  • The East term was significant in model 2, suggesting that there is a significant relationship between location and price.
  • That term became nonsignificant when we allowed the slope of Decor to vary with location, and that difference in slopes was also nonsignificant.
  • Notice that the slope estimate for a given variable will almost always change depending on which other variables are in the model.