1a
library(alr4)
## Loading required package: car
## Loading required package: carData
## Loading required package: effects
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
attach(BGSall)
library(ggplot2)
ggplot(BGSall, aes(x = HT2, y = HT18, color = factor(Sex))) + geom_point(shape = 1) + geom_smooth(method = "lm", se = FALSE)
A linear relationship appears appropriate for both groups, and the slopes look similar between males and females.
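As a quick numerical check on this visual impression, one can fit a separate line within each sex and compare the fitted slopes; this is a sketch, not part of the original output:
by(BGSall, BGSall$Sex, function(d) coef(lm(HT18 ~ HT2, data = d)))  # slope for each sex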
1b
BGS.lm = lm(HT18 ~ HT2*Sex, data = BGSall)
summary(BGS.lm)
##
## Call:
## lm(formula = HT18 ~ HT2 * Sex, data = BGSall)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2921 -3.0481 -0.0345 3.1263 12.9258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.95704 16.48559 4.850 3.42e-06 ***
## HT2 1.12052 0.18642 6.011 1.69e-08 ***
## Sex -18.98222 22.80001 -0.833 0.407
## HT2:Sex 0.08941 0.25940 0.345 0.731
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.99 on 132 degrees of freedom
## Multiple R-squared: 0.6887, Adjusted R-squared: 0.6816
## F-statistic: 97.34 on 3 and 132 DF, p-value: < 2.2e-16
Since the p-value for the interaction HT2:Sex is 0.731, there is no evidence that the slope differs between the two groups. Hence, the parallel regression model should suffice.
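The same conclusion follows from comparing the interaction model to the parallel model with a partial F-test; a sketch (output omitted), whose p-value equals the 0.731 reported above:
anova(update(BGS.lm, . ~ . - HT2:Sex), BGS.lm)  # tests H0: no interaction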
1c
BGSp.lm = lm(HT18 ~ HT2+Sex, data = BGSall)
confint(BGSp.lm)
## 2.5 % 97.5 %
## (Intercept) 53.2604815 98.492691
## HT2 0.9111399 1.422249
## Sex -12.8416680 -9.417779
We are 95% confident that, at any given height at age 2, females are on average between 9.42 and 12.84 centimeters shorter than males at age 18.
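To illustrate the constant gap, we can predict HT18 for both sexes at the same age-2 height; a sketch using a hypothetical HT2 of 90 cm (newdat is an illustrative name; Sex is coded 0 = male, 1 = female):
newdat <- data.frame(HT2 = c(90, 90), Sex = c(0, 1))
predict(BGSp.lm, newdata = newdat)  # predictions differ by the Sex coefficient, about -11.13 cm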
2a
library(alr4)
# Forward selection
mantel.0 <- lm(Y ~ 1, data = mantel)
mantel.fw <- step(mantel.0, scope = list(upper = ~ X1 + X2 + X3), direction = 'forward')
## Start: AIC=9.59
## Y ~ 1
##
## Df Sum of Sq RSS AIC
## + X3 1 20.6879 2.1121 -0.3087
## + X1 1 8.6112 14.1888 9.2151
## + X2 1 8.5064 14.2936 9.2519
## <none> 22.8000 9.5866
##
## Step: AIC=-0.31
## Y ~ X3
##
## Df Sum of Sq RSS AIC
## <none> 2.1121 -0.30875
## + X2 1 0.066328 2.0458 1.53172
## + X1 1 0.064522 2.0476 1.53613
# Backward selection
mantel1 <- lm(Y~ X1 + X2 + X3, data = mantel) # full model
mantel1.bw <- step(mantel.fw, scope = list(lower = ~ 1), direction = 'backward')
## Start: AIC=-0.31
## Y ~ X3
##
## Df Sum of Sq RSS AIC
## <none> 2.1121 -0.3087
## - X3 1 20.688 22.8000 9.5866
The final model from forward selection is Y ~ X3. Backward selection, started from that forward result, leaves it unchanged, so the final model is again Y ~ X3.
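Note that the backward step above starts from the forward-selection result rather than from the full model mantel1 defined earlier. A true backward elimination would start from the full model; a sketch (output not shown), which with these data can retain X1 and X2 instead, consistent with the best-subsets results in part 2b:
step(mantel1, scope = list(lower = ~ 1), direction = 'backward')  # backward from the full model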
2b
library(leaps)
library(alr4)
attach(mantel)
mod.reg <- regsubsets(cbind(mantel$X1, mantel$X2, mantel$X3), mantel$Y)
summary.reg <- summary(mod.reg)
names(summary.reg)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
summary.reg$which
## (Intercept) a b c
## 1 TRUE FALSE FALSE TRUE
## 2 TRUE TRUE TRUE FALSE
## 3 TRUE TRUE TRUE TRUE
summary.reg$rsq
## [1] 0.9073635 1.0000000 1.0000000
summary.reg$adjr2
## [1] 0.8764847 1.0000000 1.0000000
summary.reg$cp
## [1] 1.838726e+27 6.802184e+00 4.000000e+00
summary.reg$bic
## [1] -8.676486 -312.170229 -319.351963
Best subsets regression gives a different picture than stepwise selection. The best single-predictor model uses X3, but the best two-predictor model, X1 and X2, fits essentially perfectly (R^2 and adjusted R^2 both equal 1). Mallows' Cp and BIC are both optimized by the full three-variable model, so by these criteria the most important predictors are X1, X2, and X3, not X3 alone.
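leaps also provides a plot method for comparing these criteria visually; a sketch, not part of the original output:
plot(mod.reg, scale = 'bic')    # one row per model; darker shading indicates better BIC
plot(mod.reg, scale = 'adjr2')  # same display using adjusted R^2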
3a
attach(lathe1)
lathe1.lm = lm(Life~poly(Feed, 2) + poly(Speed, 2) + Feed:Speed, data = lathe1)
boxCox(lathe1.lm)
The profile log-likelihood is maximized near λ = 0, so a log transformation of Life is appropriate; the models below therefore use log(Life).
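A numerical estimate of λ, with a confidence interval, can serve as a cross-check; a sketch using car::powerTransform (car is loaded with alr4), not part of the original output:
summary(powerTransform(lathe1.lm))  # lambda-hat near 0 supports log(Life)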
3b
lathe1F.lm = lm(log(Life)~Speed+Feed+I(Speed^2)+I(Feed^2)+Speed*Feed, data = lathe1)
lathe1R.lm = lm(log(Life)~Feed+I(Feed^2), data = lathe1)
anova(lathe1R.lm,lathe1F.lm)
## Analysis of Variance Table
##
## Model 1: log(Life) ~ Feed + I(Feed^2)
## Model 2: log(Life) ~ Speed + Feed + I(Speed^2) + I(Feed^2) + Speed * Feed
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17 32.300
## 2 14 1.237 3 31.063 117.22 3.726e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: β1 = β11 = β12 = 0 (all terms involving Speed); H1: at least one is not equal to 0. The p-value for this F-test is 3.726e-10, which is less than α = 0.05, so we reject the null hypothesis that the Speed terms are all equal to 0. So, there is significant evidence that these predictors are useful.
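As an arithmetic check (a sketch, not part of the original output; F.stat is just an illustrative name), the F statistic can be rebuilt from the RSS values in the table:
F.stat <- ((32.300 - 1.237) / 3) / (1.237 / 14)  # (drop in RSS per df) / full-model MSE
F.stat  # approximately 117.2, matching the ANOVA table up to rounding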
3c We are testing whether the terms involving Speed (Speed, Speed^2, and Speed:Feed) are useful. From the above test, we found strong evidence that at least one of them is useful.
3d
lathe1F.lm = lm(log(Life)~Speed+I(Speed^2)+I(Feed^2), data = lathe1)
lathe1R.lm = lm(log(Life)~Feed+I(Feed^2), data = lathe1)
anova(lathe1R.lm,lathe1F.lm)
## Analysis of Variance Table
##
## Model 1: log(Life) ~ Feed + I(Feed^2)
## Model 2: log(Life) ~ Speed + I(Speed^2) + I(Feed^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17 32.300
## 2 16 8.772 1 23.528 42.915 6.675e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value for the F-test is 6.675e-06, which is much less than α = 0.05, so we reject the smaller model; there is significant evidence in favor of the model containing the Speed terms. (Note, though, that these two models are not nested, since the reduced model contains Feed while the larger one does not, so this F-test should be interpreted with caution.)
3e
lathe2.lm = lm(log(Life)~Speed+Feed+I(Speed^2)+I(Feed^2)+Speed*Feed, data = lathe1)
influenceIndexPlot(lathe2.lm)
The two most influential cases are 9 and 10, due to their large Cook's distances. They also have the largest leverages; however, two other points share the same leverage values, so their influence is likely driven mainly by their large standardized residuals.
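The visual impression from the index plots can be confirmed numerically; a sketch, not part of the original output:
sort(cooks.distance(lathe2.lm), decreasing = TRUE)[1:4]  # cases 9 and 10 should top the list
hatvalues(lathe2.lm)[c(9, 10)]  # their leverages, shared with other points in this design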
lathe.rmv = lathe1[-c(9,10),]
lathe3.lm = lm(log(Life)~Speed+Feed+I(Speed^2)+I(Feed^2)+Speed*Feed, data = lathe.rmv)
summary(lathe3.lm)
##
## Call:
## lm(formula = log(Life) ~ Speed + Feed + I(Speed^2) + I(Feed^2) +
## Speed * Feed, data = lathe.rmv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39963 -0.14660 0.00387 0.14917 0.32783
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.18809 0.08241 14.417 6.11e-09 ***
## Speed -1.43300 0.08241 -17.388 7.10e-10 ***
## Feed -0.79023 0.06729 -11.743 6.15e-08 ***
## I(Speed^2) 0.28022 0.12363 2.267 0.042700 *
## I(Feed^2) 0.42244 0.09217 4.583 0.000629 ***
## Speed:Feed -0.07286 0.08241 -0.884 0.394025
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2331 on 12 degrees of freedom
## Multiple R-squared: 0.9759, Adjusted R-squared: 0.9658
## F-statistic: 97.07 on 5 and 12 DF, p-value: 2.804e-09
The coefficient most affected is the main effect for Speed, while the others stay mostly the same. Also, the standard errors are uniformly smaller using the reduced data set.
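A side-by-side listing makes the change in the Speed coefficient explicit; a sketch, not part of the original output:
round(cbind(all.data = coef(lathe2.lm), cases.9.10.removed = coef(lathe3.lm)), 3)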
4abc
library(leaps)
library(faraway)
##
## Attaching package: 'faraway'
## The following objects are masked from 'package:alr4':
##
## cathedral, pipeline, twins
## The following objects are masked from 'package:car':
##
## logit, vif
attach(divusa)
div.reg <- regsubsets(cbind(year, unemployed, femlab, marriage, birth, military), divorce)
summary.reg <- summary(div.reg)
names(summary.reg)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
summary.reg$which
## (Intercept) year unemployed femlab marriage birth military
## 1 TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE TRUE FALSE TRUE FALSE
## 3 TRUE FALSE FALSE TRUE TRUE TRUE FALSE
## 4 TRUE TRUE FALSE TRUE TRUE TRUE FALSE
## 5 TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
summary.reg$adjr2
## [1] 0.8265403 0.8720158 0.9105579 0.9208807 0.9289506 0.9287914
summary.reg$cp
## [1] 109.695444 62.001274 22.692257 12.998703 5.841314 7.000000
summary.reg$bic
## [1] -127.2216 -147.3224 -171.6165 -177.7776 -182.7945 -179.3707
Using adjusted R^2, Mallows' Cp, and BIC as criteria in our best subsets regression, we found that the most important predictors are year, femlab, marriage, birth, and military: all three criteria are optimized by the five-variable model, and unemployed adds nothing. The best model is therefore divorce ~ year + femlab + marriage + birth + military.
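The same conclusion can be reached programmatically, and the chosen five-variable model refit; a sketch (div.best is an illustrative name), with the answers implied by the output above shown as comments:
which.max(summary.reg$adjr2)  # 5
which.min(summary.reg$cp)     # 5
which.min(summary.reg$bic)    # 5
div.best <- lm(divorce ~ year + femlab + marriage + birth + military, data = divusa)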