When you buy a book off of Amazon, you get a quote for how much it costs to ship. This is based on the weight of the book. If you didn’t know the weight a book, what other characteristics of it could you measure to help predict weight?
#install.packages("DAAG")
library(DAAG)
## Loading required package: lattice
data(allbacks)
books <- allbacks[, c(3, 1, 4)]
head(books)
## weight volume cover
## 1 800 885 hb
## 2 950 1016 hb
## 3 1050 1125 hb
## 4 350 239 hb
## 5 750 701 hb
## 6 600 641 hb
First make a plot:
library(tidyverse)
ggplot(books, aes(x=volume, y=weight))+
geom_point()
Now lets fit a simple linear regression model:
m1 <- lm(weight ~ volume, data = books)
summary(m1)
##
## Call:
## lm(formula = weight ~ volume, data = books)
##
## Residuals:
## Min 1Q Median 3Q Max
## -189.97 -109.86 38.08 109.73 145.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 107.67931 88.37758 1.218 0.245
## volume 0.70864 0.09746 7.271 6.26e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 123.9 on 13 degrees of freedom
## Multiple R-squared: 0.8026, Adjusted R-squared: 0.7875
## F-statistic: 52.87 on 1 and 13 DF, p-value: 6.262e-06
What did we learn about the relationship between weight and volume in this model?
Add the regression line to the plot:
library(tidyverse)
ggplot(books, aes(x=volume, y=weight))+
geom_point()+
geom_abline(intercept = m1$coef[1], slope = m1$coef[2],
col = "orchid", lwd=1)
\[ \hat{y} = 107.7 + 0.708 x \] \[ \widehat{weight} = 107.7 + 0.708 volume \]
We need to check:
Make a residual plot:
ggplot(m1, aes(x=.fitted, y=.stdresid))+
geom_point()+
geom_abline(slope=0, intercept=0, col="red")
Make a qqplot:
qqnorm(m1$residuals)
qqline(m1$residuals)
m1 <- lm(weight ~ volume, data = books)
summary(m1)
##
## Call:
## lm(formula = weight ~ volume, data = books)
##
## Residuals:
## Min 1Q Median 3Q Max
## -189.97 -109.86 38.08 109.73 145.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 107.67931 88.37758 1.218 0.245
## volume 0.70864 0.09746 7.271 6.26e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 123.9 on 13 degrees of freedom
## Multiple R-squared: 0.8026, Adjusted R-squared: 0.7875
## F-statistic: 52.87 on 1 and 13 DF, p-value: 6.262e-06
anova(m1)
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## volume 1 812132 812132 52.868 6.262e-06 ***
## Residuals 13 199701 15362
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Maybe the type of book cover (hardback or paperback) also affects weight:
ggplot(books, aes(x=volume, y=weight, color=cover))+
geom_point()
Let’s make a model that includes cover:
m2 <- lm(weight ~ volume + cover, data = books)
summary(m2)
##
## Call:
## lm(formula = weight ~ volume + cover, data = books)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.10 -32.32 -16.10 28.93 210.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 197.96284 59.19274 3.344 0.005841 **
## volume 0.71795 0.06153 11.669 6.6e-08 ***
## coverpb -184.04727 40.49420 -4.545 0.000672 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 78.2 on 12 degrees of freedom
## Multiple R-squared: 0.9275, Adjusted R-squared: 0.9154
## F-statistic: 76.73 on 2 and 12 DF, p-value: 1.455e-07
How do we interpret these estimates?
Add the parallel lines to the plot:
ggplot(books, aes(x=volume, y=weight, color=cover))+
geom_point()+
geom_abline(intercept = m2$coef[1], slope = m2$coef[2], col = "red") +
geom_abline(intercept = m2$coef[1] + m2$coef[3], slope = m2$coef[2], col = "blue")
The slope corresponding to the dummy variable tell us:
weight
is expected to increase if cover
goes from 0 to 1 and volume
is left unchanged.Each \(\hat{\beta}_i\) tells you how much you expect the \(Y\) to change when you change the \(X_i\), while holding all other variables constant.
summary(m2)
##
## Call:
## lm(formula = weight ~ volume + cover, data = books)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.10 -32.32 -16.10 28.93 210.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 197.96284 59.19274 3.344 0.005841 **
## volume 0.71795 0.06153 11.669 6.6e-08 ***
## coverpb -184.04727 40.49420 -4.545 0.000672 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 78.2 on 12 degrees of freedom
## Multiple R-squared: 0.9275, Adjusted R-squared: 0.9154
## F-statistic: 76.73 on 2 and 12 DF, p-value: 1.455e-07
anova(m2)
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## volume 1 812132 812132 132.809 7.58e-08 ***
## cover 1 126320 126320 20.657 0.0006719 ***
## Residuals 12 73381 6115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m3 <- lm(weight ~ volume + cover + volume:cover, data = books)
summary(m3)
##
## Call:
## lm(formula = weight ~ volume + cover + volume:cover, data = books)
##
## Residuals:
## Min 1Q Median 3Q Max
## -89.67 -32.07 -21.82 17.94 215.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 161.58654 86.51918 1.868 0.0887 .
## volume 0.76159 0.09718 7.837 7.94e-06 ***
## coverpb -120.21407 115.65899 -1.039 0.3209
## volume:coverpb -0.07573 0.12802 -0.592 0.5661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.41 on 11 degrees of freedom
## Multiple R-squared: 0.9297, Adjusted R-squared: 0.9105
## F-statistic: 48.5 on 3 and 11 DF, p-value: 1.245e-06
Add lines to the plot:
ggplot(books, aes(x=volume, y=weight, color=cover))+
geom_point()+
geom_abline(intercept = m3$coef[1], slope = m3$coef[2], col = "red") +
geom_abline(intercept = m3$coef[1] + m3$coef[3], slope = m3$coef[2]+m3$coef[4], col = "blue")
Allows us create a model to explain one \(numerical\) variable, the response, as a linear function of many explanatory variables that can be both \(numerical\) and \(categorical\).
We posit the true model:
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon; \quad \epsilon \sim N(0, \sigma^2) \]
We use the data to estimate our fitted model:
\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \ldots + \hat{\beta}_p X_p \] Estimating \(\beta_0, \beta_1\) etc.
In least-squares regression, we’re still finding the estimates that minimize the sum of squared residuals.
\[ e_i = y_i - \hat{y}_i \]
\[ \sum_{i = 1}^n e_i^2 \]
And yes, they have a closed-form solution with linear algebra.
\[ \hat{\beta} = (X'X)^{-1}X'Y \] ### Example: NYC Zagat Ratings
What determines the price of a meal?
Let’s look at the relationship between price, food rating, and decor rating.
nyc <- read.csv("http://andrewpbray.github.io/data/nyc.csv")
dim(nyc)
## [1] 168 7
head(nyc)
## Case Restaurant Price Food Decor Service East
## 1 1 Daniella Ristorante 43 22 18 20 0
## 2 2 Tello's Ristorante 32 20 19 19 0
## 3 3 Biricchino 34 21 13 18 0
## 4 4 Bottino 41 20 20 17 0
## 5 5 Da Umberto 54 24 19 21 0
## 6 6 Le Madri 52 22 22 21 0
m1 <- lm(Price ~ Food + Decor, data = nyc)
summary(m1)
##
## Call:
## lm(formula = Price ~ Food + Decor, data = nyc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.945 -3.766 -0.153 3.701 18.757
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -24.5002 4.7230 -5.187 6.19e-07 ***
## Food 1.6461 0.2615 6.294 2.68e-09 ***
## Decor 1.8820 0.1919 9.810 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.788 on 165 degrees of freedom
## Multiple R-squared: 0.6167, Adjusted R-squared: 0.6121
## F-statistic: 132.7 on 2 and 165 DF, p-value: < 2.2e-16
anova(m1)
## Analysis of Variance Table
##
## Response: Price
## Df Sum Sq Mean Sq F value Pr(>F)
## Food 1 5670.3 5670.3 169.262 < 2.2e-16 ***
## Decor 1 3223.7 3223.7 96.228 < 2.2e-16 ***
## Residuals 165 5527.5 33.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Does the price depend on where the restaurant is located in Manhattan?
m2 <- lm(Price ~ Food + Decor + East, data = nyc)
summary(m2)
##
## Call:
## lm(formula = Price ~ Food + Decor + East, data = nyc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.0451 -3.8809 0.0389 3.3918 17.7557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -24.0269 4.6727 -5.142 7.67e-07 ***
## Food 1.5363 0.2632 5.838 2.76e-08 ***
## Decor 1.9094 0.1900 10.049 < 2e-16 ***
## East 2.0670 0.9318 2.218 0.0279 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.72 on 164 degrees of freedom
## Multiple R-squared: 0.6279, Adjusted R-squared: 0.6211
## F-statistic: 92.24 on 3 and 164 DF, p-value: < 2.2e-16
m3 <- lm(Price ~ Food + Decor + East + Decor:East, data = nyc)
summary(m3)
##
## Call:
## lm(formula = Price ~ Food + Decor + East + Decor:East, data = nyc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.7855 -3.6649 0.3785 3.7292 17.6358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29.3971 6.3770 -4.610 8.10e-06 ***
## Food 1.6634 0.2822 5.895 2.09e-08 ***
## Decor 2.0695 0.2298 9.006 5.42e-16 ***
## East 9.6616 6.2184 1.554 0.122
## Decor:East -0.4346 0.3518 -1.235 0.219
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.711 on 163 degrees of freedom
## Multiple R-squared: 0.6313, Adjusted R-squared: 0.6223
## F-statistic: 69.78 on 4 and 163 DF, p-value: < 2.2e-16
East
term was significant in model 2, suggesting that there is a significant relationship between location and price.Decor
to vary with location, and that difference in slopes was also nonsignificant.