##Chapter 3, problem number 7.
In the punting data,we find the average distance punted and hang times of 10 punts of an American football as related to various measures of leg strength for 13 volunteers.
###a.Fit a regression model with Distance as the response and the right and left leg strengths and flexibilities as predictors. Which predictors are significant at the 5% level?
data(punting, package="faraway")
lmod7a <- lm(Distance ~ RStr + LStr + RFlex + LFlex ,punting)
lmod7a
Call:
lm(formula = Distance ~ RStr + LStr + RFlex + LFlex, data = punting)
Coefficients:
(Intercept) RStr LStr RFlex LFlex
-79.6236 0.5116 -0.1862 2.3745 -0.5277
summary(lmod7a)
Call:
lm(formula = Distance ~ RStr + LStr + RFlex + LFlex, data = punting)
Residuals:
Min 1Q Median 3Q Max
-23.941 -8.958 -4.441 13.523 17.016
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -79.6236 65.5935 -1.214 0.259
RStr 0.5116 0.4856 1.054 0.323
LStr -0.1862 0.5130 -0.363 0.726
RFlex 2.3745 1.4374 1.652 0.137
LFlex -0.5277 0.8255 -0.639 0.541
Residual standard error: 16.33 on 8 degrees of freedom
Multiple R-squared: 0.7365, Adjusted R-squared: 0.6047
F-statistic: 5.59 on 4 and 8 DF, p-value: 0.01902
From the summary,we can know that no variables’ p-value<0.05 which means that none predictors are statistically significant at the 5% level.
###b.Use an F-test to determine whether collectively these four predictors have a relationship to the response.
nullmod7 <- lm(Distance ~ 1,punting)
anova(nullmod7,lmod7a)
Analysis of Variance Table
Model 1: Distance ~ 1
Model 2: Distance ~ RStr + LStr + RFlex + LFlex
Res.Df RSS Df Sum of Sq F Pr(>F)
1 12 8093.3
2 8 2132.6 4 5960.7 5.5899 0.01902 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We can see that the p-value of the F test is less than 0.05, so we can reject the null hypothesis. This means that these four predictors have a linear relationship to the response.
###c.Relative to the model in (a), test whether the right and left leg strengths have the same effect.
lmod7c<- lm(Distance ~ I(RStr + LStr) + RFlex + LFlex ,punting)
summary(lmod7c)
Call:
lm(formula = Distance ~ I(RStr + LStr) + RFlex + LFlex, data = punting)
Residuals:
Min 1Q Median 3Q Max
-21.698 -9.494 -5.155 9.081 20.611
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -71.2694 63.1447 -1.129 0.288
I(RStr + LStr) 0.1741 0.1940 0.898 0.393
RFlex 2.3137 1.4013 1.651 0.133
LFlex -0.5772 0.8035 -0.718 0.491
Residual standard error: 15.94 on 9 degrees of freedom
Multiple R-squared: 0.7174, Adjusted R-squared: 0.6232
F-statistic: 7.615 on 3 and 9 DF, p-value: 0.00769
anova(lmod7c,lmod7a)
Analysis of Variance Table
Model 1: Distance ~ I(RStr + LStr) + RFlex + LFlex
Model 2: Distance ~ RStr + LStr + RFlex + LFlex
Res.Df RSS Df Sum of Sq F Pr(>F)
1 9 2287.4
2 8 2132.6 1 154.72 0.5804 0.468
Because the p-value of the F test is larger than 0.05, we cannot reject the null hypothesis which means that the right and left leg strengths don’t have the same effect.
###d.Construct a 95% confidence region for (βRStr,βLStr). Explain how the test in (c) relates to this region.
confint(lmod7a, c("RStr", "LStr"))
2.5 % 97.5 %
RStr -0.6080871 1.6313618
LStr -1.3690973 0.9966981
From test C,we know that βRStr doesn’t equal to βLStr and their 95% confidence regions are significantly different, which is verified by it.
###e.Fit a model to test the hypothesis that it is total leg strength defined by adding the right and left leg strengths that is sufficient to predict the response in com- parison to using individual left and right leg strengths.
lmod7e <- lm(Distance ~ RStr + LStr, punting)
summary(lmod7e)
Call:
lm(formula = Distance ~ RStr + LStr, data = punting)
Residuals:
Min 1Q Median 3Q Max
-29.280 -9.583 3.147 10.266 26.450
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.8490 33.0334 0.389 0.705
RStr 0.7208 0.4913 1.467 0.173
LStr 0.2011 0.4883 0.412 0.689
Residual standard error: 17.24 on 10 degrees of freedom
Multiple R-squared: 0.6327, Adjusted R-squared: 0.5592
F-statistic: 8.611 on 2 and 10 DF, p-value: 0.00669
lmod7total<- lm(Distance ~I(RStr + LStr),punting)
summary(lmod7total)
Call:
lm(formula = Distance ~ I(RStr + LStr), data = punting)
Residuals:
Min 1Q Median 3Q Max
-27.632 -11.531 2.171 8.443 30.672
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.0936 31.8838 0.442 0.66703
I(RStr + LStr) 0.4601 0.1082 4.252 0.00136 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 16.68 on 11 degrees of freedom
Multiple R-squared: 0.6217, Adjusted R-squared: 0.5874
F-statistic: 18.08 on 1 and 11 DF, p-value: 0.001361
anova(lmod7e,lmod7total)
Analysis of Variance Table
Model 1: Distance ~ RStr + LStr
Model 2: Distance ~ I(RStr + LStr)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 10 2973.1
2 11 3061.3 -1 -88.281 0.2969 0.5978
Because the p-value of the F test is larger than 0.05, we cannot reject the null hypothesis which means that total leg strength is not sufficient to predict the response in comparison to using individual left and right leg strengths.
###h.Fit a model with Hang as the response and the same four predictors. Can we make a test to compare this model to that used in (a)? Explain.
lmod7h<- lm(Hang ~ RStr + LStr + RFlex + LFlex ,punting)
lmod7h
Call:
lm(formula = Hang ~ RStr + LStr + RFlex + LFlex, data = punting)
Coefficients:
(Intercept) RStr LStr RFlex LFlex
-0.225239 0.005153 0.007697 0.019404 0.004614
summary(lmod7h)
Call:
lm(formula = Hang ~ RStr + LStr + RFlex + LFlex, data = punting)
Residuals:
Min 1Q Median 3Q Max
-0.36297 -0.13528 -0.07849 0.09938 0.35893
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.225239 1.032784 -0.218 0.833
RStr 0.005153 0.007645 0.674 0.519
LStr 0.007697 0.008077 0.953 0.369
RFlex 0.019404 0.022631 0.857 0.416
LFlex 0.004614 0.012998 0.355 0.732
Residual standard error: 0.2571 on 8 degrees of freedom
Multiple R-squared: 0.8156, Adjusted R-squared: 0.7235
F-statistic: 8.848 on 4 and 8 DF, p-value: 0.004925
No, we can’t make a F-test to compare model(a) and model(b) casue they are not nest-models. While we can compare by their values of R-squared.For model(h), the R-squared is 0.8156 which is higher than that in model(a) which means that four predictors have stronger linear relationship with response “Hang” rather than “Distance”.
##Chapter 4, problem number 1.For the prostate data, fit a model with lpsa as the response and the other variables as predictors. ###a.Suppose a new patient with the following values arrives:Predict the lpsa for this patient along with an appropriate 95% CI.
data(prostate, package="faraway")
lmod4a<- lm(lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45 ,prostate)
summary(lmod4a)
Call:
lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + lcp +
gleason + pgg45, data = prostate)
Residuals:
Min 1Q Median 3Q Max
-1.7331 -0.3713 -0.0170 0.4141 1.6381
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.669337 1.296387 0.516 0.60693
lcavol 0.587022 0.087920 6.677 2.11e-09 ***
lweight 0.454467 0.170012 2.673 0.00896 **
age -0.019637 0.011173 -1.758 0.08229 .
lbph 0.107054 0.058449 1.832 0.07040 .
svi 0.766157 0.244309 3.136 0.00233 **
lcp -0.105474 0.091013 -1.159 0.24964
gleason 0.045142 0.157465 0.287 0.77503
pgg45 0.004525 0.004421 1.024 0.30886
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7084 on 88 degrees of freedom
Multiple R-squared: 0.6548, Adjusted R-squared: 0.6234
F-statistic: 20.86 on 8 and 88 DF, p-value: < 2.2e-16
newp <- data.frame(lcavol= 1.44692, lweight= 3.62301, age= 65.00000, lbph= 0.30010, svi= 0.00000, lcp= -0.79851,gleason= 7.00000,pgg45= 15.00000)
predict (lmod4a , newp,interval="confidence")
fit lwr upr
1 2.389053 2.172437 2.605669
###b.Repeat the last question for a patient with the same values except that he is age 20. Explain why the CI is wider.
newb <- data.frame(lcavol= 1.44692, gleason= 7.00000, lweight= 3.62301, pgg45= 15.00000, age= 20, lbph= 0.30010, svi= 0.00000, lcp= -0.79851)
predict(lmod4a, newb,interval="confidence")
fit lwr upr
1 3.272726 2.260444 4.285007
Since the age 20 lies far outside of the original data, so that CI is wider.
###c.For the model of the previous question, remove all the predictors that are not significant at the 5% level. Now recompute the predictions of the previous question. Are the CIs wider or narrower? Which predictions would you prefer? Explain.
lmod42c <- lm(lpsa ~ lcavol + lweight + svi, data = prostate)
summary(lmod42c)
Call:
lm(formula = lpsa ~ lcavol + lweight + svi, data = prostate)
Residuals:
Min 1Q Median 3Q Max
-1.72964 -0.45764 0.02812 0.46403 1.57013
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.26809 0.54350 -0.493 0.62298
lcavol 0.55164 0.07467 7.388 6.3e-11 ***
lweight 0.50854 0.15017 3.386 0.00104 **
svi 0.66616 0.20978 3.176 0.00203 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7168 on 93 degrees of freedom
Multiple R-squared: 0.6264, Adjusted R-squared: 0.6144
F-statistic: 51.99 on 3 and 93 DF, p-value: < 2.2e-16
newb <- data.frame(lcavol= 1.44692, gleason= 7.00000, lweight= 3.62301, pgg45= 15.00000, age= 20, lbph= 0.30010, svi= 0.00000, lcp= -0.79851)
predict(lmod42c, newb, interval="confidence")
fit lwr upr
1 2.372534 2.197274 2.547794
anova(lmod4a,lmod42c)
Analysis of Variance Table
Model 1: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason +
pgg45
Model 2: lpsa ~ lcavol + lweight + svi
Res.Df RSS Df Sum of Sq F Pr(>F)
1 88 44.163
2 93 47.785 -5 -3.6218 1.4434 0.2167
We can see that the CIs are narrower. Since the p-value of the T test is larger than 0.05, we cannot reject the null hypothesis. I would prefer the original model (with lpsa as the response and the other variables as predictors).
##Chapter 4, problem number 2. Using the teengamb data, fit a model with gamble as the response and the other variables as predictors.
###a.Predict the amount that a male with average (given these data) status, income and verbal score would gamble along with an appropriate 95% CI.
data(teengamb, package="faraway")
lmod42a <- lm(gamble ~ sex + status + income + verbal , teengamb)
summary(lmod42a)
Call:
lm(formula = gamble ~ sex + status + income + verbal, data = teengamb)
Residuals:
Min 1Q Median 3Q Max
-51.082 -11.320 -1.451 9.452 94.252
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.55565 17.19680 1.312 0.1968
sex -22.11833 8.21111 -2.694 0.0101 *
status 0.05223 0.28111 0.186 0.8535
income 4.96198 1.02539 4.839 1.79e-05 ***
verbal -2.95949 2.17215 -1.362 0.1803
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.69 on 42 degrees of freedom
Multiple R-squared: 0.5267, Adjusted R-squared: 0.4816
F-statistic: 11.69 on 4 and 42 DF, p-value: 1.815e-06
x <- model.matrix(lmod42a)
mean <- apply(x, 2, mean)
mean["sex"] <- 0
predict(lmod42a, data.frame(t(mean)), interval="confidence")
fit lwr upr
1 28.24252 18.78277 37.70227
###b.Repeat the prediction for a male with maximal values (for this data) of status, income and verbal score. Which CI is wider and why is this result expected?
max <- apply(x, 2, max)
max["sex"] <- 0
predict(lmod42a, data.frame(t(max)), interval="confidence")
fit lwr upr
1 71.30794 42.23237 100.3835
The CI of b is wider. The maximal value lies far outside from the original data.