Linear Models with Multiple Explanatory Variables

Recall Example: Scottish Hill Races
Races<-read.table("http://stat4ds.rwth-aachen.de/data/ScotsRaces.dat", header=TRUE)
head(Races,3) # timeM for men, timeW for women
              race distance climb  timeM  timeW
1       AnTeallach     10.6 1.062  74.68  89.72
2     ArrocharAlps     25.0 2.400 187.32 222.03
3 BaddinsgillRound     16.4 0.650  87.18 102.48
fit.dc<-lm(timeW~distance + climb,data=Races)
summary(fit.dc)
Call:
lm(formula = timeW ~ distance + climb, data = Races)
Residuals:
    Min      1Q  Median      3Q     Max
-37.209  -8.637   0.235   8.504  33.901
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.5997     3.4680   -4.21 8.02e-05 ***
distance      5.0362     0.1683   29.92  < 2e-16 ***
climb        35.5610     3.7002    9.61 4.22e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.96 on 65 degrees of freedom
Multiple R-squared: 0.9641, Adjusted R-squared: 0.963
F-statistic: 872.6 on 2 and 65 DF, p-value: < 2.2e-16
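To make the fitted equation concrete, the sketch below plugs the rounded coefficients printed above into the prediction equation for the first race shown by head(Races,3) (AnTeallach: distance 10.6, climb 1.062). The names b, x_new, pred and resid_1 are illustrative and not part of the original code. Because the coefficients are rounded, the result agrees with fitted(fit.dc) only to about three decimals.

```r
# Rounded estimates copied from summary(fit.dc) above
b <- c(intercept = -14.5997, distance = 5.0362, climb = 35.5610)

# AnTeallach, as printed by head(Races, 3); the leading 1 multiplies the intercept
x_new <- c(1, 10.6, 1.062)

pred    <- sum(b * x_new)  # predicted women's record time
obs     <- 89.72           # observed timeW for AnTeallach
resid_1 <- obs - pred      # residual = observed - predicted

round(c(predicted = pred, residual = resid_1), 2)
```

The prediction is about 76.55 and the residual about 13.17, in line with the fitval and res values printed for observation 1 in the Cook's distance example.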

Interpreting Effects in Multiple Regression Models
Association and Causation




Confounding, Spuriousness and Conditional Independence

See also: "Confounding" (Wikipedia)
(see Sections 6.2.3 and 6.2.4 of the book)
Example: Modeling Crime Rate in Florida

Florida<-read.table("http://stat4ds.rwth-aachen.de/data/Florida.dat", header=TRUE)
head(Florida,2)
   County Crime Income   HS Urban
1 ALACHUA   104   22.1 82.7  73.2
2   BAKER    20   25.8 64.1  21.5
summary(lm(Crime~HS,data=Florida))
Call:
lm(formula = Crime ~ HS, data = Florida)
Residuals:
   Min     1Q Median     3Q    Max
-43.74 -21.36  -4.82  17.42  82.27
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -50.8569    24.4507  -2.080   0.0415 *
HS            1.4860     0.3491   4.257 6.81e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 25.12 on 65 degrees of freedom
Multiple R-squared: 0.218, Adjusted R-squared: 0.206
F-statistic: 18.12 on 1 and 65 DF, p-value: 6.806e-05
summary(lm(Crime~HS + Urban, data=Florida))
Call:
lm(formula = Crime ~ HS + Urban, data = Florida)
Residuals:
    Min      1Q  Median      3Q     Max
-34.693 -15.742  -6.226  15.812  50.678
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  59.1181    28.3653   2.084   0.0411 *
HS           -0.5834     0.4725  -1.235   0.2214
Urban         0.6825     0.1232   5.539 6.11e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.82 on 64 degrees of freedom
Multiple R-squared: 0.4714, Adjusted R-squared: 0.4549
F-statistic: 28.54 on 2 and 64 DF, p-value: 1.379e-09
cor(Florida$HS,Florida$Urban) # urbanization is strongly positively correlated with HS
cor(Florida$Crime,Florida$Urban)
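In the Florida output, the HS coefficient changes from 1.4860 to -0.5834 once Urban enters the model. The sketch below reproduces that kind of sign reversal on synthetic data, not the Florida data. The variable names, the seed and every numeric constant are assumptions chosen only so that a confounder (urban) raises both the explanatory variable (hs) and the response (crime).

```r
set.seed(1)  # arbitrary seed, for reproducibility only
n <- 1000

# Synthetic data (NOT the Florida data): urban drives both hs and crime
urban <- runif(n, 0, 100)
hs    <- 60 + 0.25 * urban + rnorm(n, sd = 3)
crime <- 20 + 0.80 * urban - 0.60 * hs + rnorm(n, sd = 10)
sim   <- data.frame(crime, hs, urban)

fit_marg <- lm(crime ~ hs, data = sim)          # ignores the confounder
fit_cond <- lm(crime ~ hs + urban, data = sim)  # adjusts for the confounder

# Slope of hs: positive marginally, negative once urban is held fixed
c(marginal = unname(coef(fit_marg)["hs"]),
  adjusted = unname(coef(fit_cond)["hs"]))
```

The marginal slope is positive only because hs acts as a proxy for urban. Conditional on urban, the simulated effect of hs is negative by construction.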


Least Squares Estimates in Multiple Regression
Later, we will use linear algebra to solve this set of equations.
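The equations referred to here are not reproduced in this text. Assuming they are the usual least squares normal equations, (X'X)b = X'y with X the model matrix, a minimal sketch of the linear algebra solution looks as follows. The data are simulated and the names x1, x2, X and beta_hat are illustrative.

```r
set.seed(2)  # arbitrary seed; the data below are simulated
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

X <- cbind(1, x1, x2)  # model matrix: a column of ones for the intercept

# Solve the normal equations (X'X) b = X'y for b
beta_hat <- solve(crossprod(X), crossprod(X, y))

# lm() should return the same estimates up to numerical error
fit <- lm(y ~ x1 + x2)
all.equal(as.vector(beta_hat), unname(coef(fit)))
```

lm() itself uses a QR decomposition rather than forming X'X, which is numerically more stable. For a small, well-conditioned problem like this one the two routes agree.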
Interaction between Explanatory Variables in Their Effects

Recall Example: Scottish Hill Races
summary(lm(timeW ~ distance + climb + distance:climb, data=Races))
Call:
lm(formula = timeW ~ distance + climb + distance:climb, data = Races)
Residuals:
    Min      1Q  Median      3Q     Max
-41.295  -7.589  -0.743   6.090  29.513
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     -5.0162     6.6830  -0.751  0.45565
distance         4.3682     0.4332  10.083 7.61e-15 ***
climb           23.9446     7.8579   3.047  0.00335 **
distance:climb   0.6582     0.3943   1.669  0.09993 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.78 on 64 degrees of freedom
Multiple R-squared: 0.9656, Adjusted R-squared: 0.964
F-statistic: 598.7 on 3 and 64 DF, p-value: < 2.2e-16
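With the distance:climb term in the model, the effect of distance is no longer a single number: it equals 4.3682 + 0.6582 * climb. The sketch below evaluates that slope at a few climb values using the rounded estimates printed above. The function name slope_distance and the chosen climb values are illustrative.

```r
# Rounded estimates copied from the interaction model output above
b_distance <- 4.3682
b_inter    <- 0.6582

# Effect on timeW of one additional unit of distance at a given climb
slope_distance <- function(climb) b_distance + b_inter * climb

slopes <- slope_distance(c(0, 1, 2))
slopes
```

The slope for distance rises from 4.3682 at climb 0 to 5.0264 at climb 1 and 5.6846 at climb 2. The interaction estimate has a p-value of about 0.10, so the evidence for this pattern is weak.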
Cook’s Distance: Detecting Unusual and Influential Observations
See also: "Leverage (statistics)" (Wikipedia)

fit.dc<-lm(timeW~distance + climb,data=Races)
res<-residuals(fit.dc)
fitval<-fitted(fit.dc)
leverage<-hatvalues(fit.dc)
cooks.ds<-cooks.distance(fit.dc)
tail(sort(cooks.ds),3) # three largest Cook's distances
       55         2        41
0.1392402 0.2162927 9.0682767
hist(res) # Histogram display of residuals (not shown)
plot(cooks.ds) # Plot of Cook's distance values (not shown)
out<-cbind(Races$race, Races$timeW, fitval, res, leverage, cooks.ds, rank(cooks.ds))
out[c(1,41,68),] # print output for the 1st, 41st and 68th observations
                            fitval             res
1  "AnTeallach"    "89.72"  "76.5495971370441" "13.1704028629559"
41 "HighlandFling" "490.05" "456.148689887272" "33.9013101127278"
68 "Yetholm"       "71.55"  "76.8897752048972" "-5.33977520489718"
   leverage             cooks.ds
1  "0.0256060611271758" "0.00799666069978033"  "46"
41 "0.630433759585242"  "9.0682766524142"      "68"
68 "0.0163750766634276" "0.000824911126512746" "21"
# The largest Cook's distance (9.07, observation 41) has rank 68 among the 68 observations
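Cook's distance combines the size of a residual with the leverage of the observation. Using the standard formula D = e^2 / (p * s^2) * h / (1 - h)^2, with p the number of regression coefficients and s the residual standard error, the sketch below recomputes the value for observation 41 from the numbers printed above. The function name cooks_from_parts is illustrative, and the inputs are rounded, so the result matches cooks.distance() only approximately.

```r
# Cook's distance from residual e, leverage h, residual SE s, and p coefficients
cooks_from_parts <- function(e, h, s, p) {
  (e^2 / (p * s^2)) * h / (1 - h)^2
}

s <- 13.96  # residual standard error from summary(fit.dc)
p <- 3      # intercept, distance, climb

# Observation 41 (HighlandFling): residual and leverage as printed above
d41 <- cooks_from_parts(e = 33.9013, h = 0.630434, s = s, p = p)

# Observation 1 (AnTeallach), for comparison
d1 <- cooks_from_parts(e = 13.1704, h = 0.025606, s = s, p = p)

c(obs41 = d41, obs1 = d1)
```

Both residuals are moderate, but the leverage of 0.63 for observation 41 makes the factor h / (1 - h)^2 very large, which is why its Cook's distance dominates.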
fit.dc2<-lm(timeW~distance + climb,data=Races[-41,]) # refit without observation 41
summary(fit.dc2)
Call:
lm(formula = timeW ~ distance + climb, data = Races[-41, ])
Residuals:
     Min       1Q   Median       3Q      Max
-28.4353  -6.1442  -0.1459   7.0130  29.8179
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -8.931      3.281  -2.723  0.00834 **
distance       4.172      0.240  17.383  < 2e-16 ***
climb         43.852      3.715  11.806  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.23 on 64 degrees of freedom
Multiple R-squared: 0.952, Adjusted R-squared: 0.9505
F-statistic: 634.3 on 2 and 64 DF, p-value: < 2.2e-16
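One way to summarize how influential observation 41 is would be to compare the estimates with and without it. The sketch below does this with the rounded coefficients printed in the two summaries. The vector names with_41, without_41 and pct_change are illustrative.

```r
# Rounded estimates copied from summary(fit.dc) and summary(fit.dc2)
with_41    <- c(intercept = -14.5997, distance = 5.0362, climb = 35.5610)
without_41 <- c(intercept = -8.931,   distance = 4.172,  climb = 43.852)

# Percentage change in each estimate after dropping observation 41
pct_change <- 100 * (without_41 - with_41) / abs(with_41)
round(pct_change, 1)
```

Dropping a single race lowers the distance effect by roughly 17% and raises the climb effect by roughly 23%, consistent with its very large Cook's distance.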
Exercises

