Students are encouraged to work together on homework. However, sharing, copying or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible.
Final submissions must be uploaded to our Compass 2g site on the Homework page. No email, hardcopy, or late submissions will be accepted.
Submit a single .zip file, named hw09_yourNetID.zip, which contains:
- Your RMarkdown file, which should be saved as hw09_yourNetID.Rmd. For example, hw09_dunger.Rmd.
- The result of knitting your RMarkdown file, saved as hw09_yourNetID.html. For example, hw09_dunger.html.
The .html file will be considered a “report,” which is the material that will determine the majority of your grade. Be sure to visibly include all R code and output that is relevant to answering the exercises. (You do not need to include irrelevant code you tried that resulted in error or did not answer the question correctly.)
If you use this .Rmd file as a template, be sure to remove the directions section. Consider removing eval = FALSE from any code chunks provided in the template, if you would like to run that code as part of your assignment.
Your .Rmd file should be written such that, if it is placed in a folder with any data you are asked to import, it will knit properly without modification. You may use R for each of the exercises.

Exercise 1 (longley Macroeconomic Data)

The data set longley from the faraway package contains macroeconomic data for predicting employment.
library(faraway)
View(longley)
?longley
(a) Find the correlation between each of the variables in the dataset.
round(cor(longley),2)
## GNP.deflator GNP Unemployed Armed.Forces Population Year Employed
## GNP.deflator 1.00 0.99 0.62 0.46 0.98 0.99 0.97
## GNP 0.99 1.00 0.60 0.45 0.99 1.00 0.98
## Unemployed 0.62 0.60 1.00 -0.18 0.69 0.67 0.50
## Armed.Forces 0.46 0.45 -0.18 1.00 0.36 0.42 0.46
## Population 0.98 0.99 0.69 0.36 1.00 0.99 0.96
## Year 0.99 1.00 0.67 0.42 0.99 1.00 0.97
## Employed 0.97 0.98 0.50 0.46 0.96 0.97 1.00
(b) Fit a model with Employed as the response and the remaining variables as predictors. Calculate the variance inflation factor for each of the predictors. What is the largest VIF? Do any of the VIFs suggest multicollinearity?
longley_model = lm(Employed ~ ., data = longley )
vif(longley_model)
## GNP.deflator GNP Unemployed Armed.Forces Population Year
## 135.532 1788.513 33.619 3.589 399.151 758.981
The largest VIF belongs to GNP, at about 1788.51. Since a VIF greater than 5 is commonly taken as cause for concern, and most of the predictors far exceed that threshold, the model has a serious multicollinearity problem.
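Each VIF is just \(1/(1 - R_j^2)\), where \(R_j^2\) is the \(R^2\) from regressing predictor \(j\) on the other predictors. As an illustrative check (not part of the original solution), the VIF for GNP can be reproduced directly:
# regress GNP on the other predictors (response excluded) and apply VIF = 1 / (1 - R^2)
gnp_r2 = summary(lm(GNP ~ . - Employed, data = longley))$r.squared
1 / (1 - gnp_r2)  # approximately 1788.5, matching vif(longley_model)["GNP"]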
(c) What proportion of observed variation in Population is explained by a linear relationship with the other predictors?
fit1_model = lm(Employed ~ . - Population, data = longley)
fit2_model = lm(Population ~ . - Employed, data = longley)
summary(fit2_model)$r.squared
## [1] 0.9975
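Note that this \(R^2\) and the VIF for Population reported in part (b) carry the same information: \(1/(1 - 0.9975) \approx 400\), which matches the VIF of about 399 up to rounding.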
(d) Calculate the partial correlation coefficient for Population and Employed with the effects of the other predictors removed.
cor(resid(fit2_model),resid(fit1_model))
## [1] -0.07514
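The residual vectors used above are exactly those of an added-variable plot; a minimal sketch of such a plot (illustrative, not part of the original solution) is:
# added-variable (partial regression) plot for Population:
# x-axis: Population adjusted for the other predictors; y-axis: Employed adjusted for the other predictors
plot(resid(fit2_model), resid(fit1_model),
     xlab = "Population | others", ylab = "Employed | others",
     main = "Added-Variable Plot for Population")
abline(lm(resid(fit1_model) ~ resid(fit2_model)), lwd = 2)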
(e) Fit a new model with Employed as the response and the predictors from the model in (b) which were significant. (Use \(\alpha = 0.05\).) Calculate the variance inflation factor for each of the predictors. What is the largest VIF? Do any of the VIFs suggest multicollinearity?
fitnew_model = lm(Employed ~ Year + Armed.Forces + Unemployed, data = longley)
summary(fitnew_model)
##
## Call:
## lm(formula = Employed ~ Year + Armed.Forces + Unemployed, data = longley)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5729 -0.1199 0.0409 0.1398 0.7530
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.80e+03 6.86e+01 -26.18 5.9e-12 ***
## Year 9.56e-01 3.55e-02 26.92 4.2e-12 ***
## Armed.Forces -7.72e-03 1.84e-03 -4.20 0.0012 **
## Unemployed -1.47e-02 1.67e-03 -8.79 1.4e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.332 on 12 degrees of freedom
## Multiple R-squared: 0.993, Adjusted R-squared: 0.991
## F-statistic: 555 on 3 and 12 DF, p-value: 3.92e-13
vif(fitnew_model)
## Year Armed.Forces Unemployed
## 3.891 2.223 3.318
The largest VIF is about 3.89, for Year. None of the VIFs suggest multicollinearity, since all of them are less than 5.
(f) Use an \(F\)-test to compare the models in parts (b) and (e). Report the following:
anova(fitnew_model,longley_model)
## Analysis of Variance Table
##
## Model 1: Employed ~ Year + Armed.Forces + Unemployed
## Model 2: Employed ~ GNP.deflator + GNP + Unemployed + Armed.Forces + Population +
## Year
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 12 1.323
## 2 9 0.836 3 0.487 1.75 0.23
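As a quick arithmetic check, the F statistic reported below can be recovered from the table above:
\[
F = \frac{(1.323 - 0.836)/3}{0.836/9} \approx 1.75.
\]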
Null hypothesis: \(H_0: \beta_{\text{GNP.deflator}} = \beta_{\text{GNP}} = \beta_{\text{Population}} = 0\)
Test statistic: \(F = 1.7465\)
F distribution: 3 degrees of freedom in the numerator and 9 degrees of freedom in the denominator.
p-value: 0.227
Decision: Fail to reject \(H_0\) at \(\alpha = 0.05\).
Since we fail to reject \(H_0\), I would prefer the smaller model from part (e).
(g) Check the assumptions of the model chosen in part (f). Do any assumptions appear to be violated?
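The helper plot_fitted_resid() used below is not defined in this report; a minimal sketch of such a function, assuming it simply plots fitted values against residuals as in the course template, is:
plot_fitted_resid = function(model) {
  # scatter of fitted values versus residuals, with a horizontal reference line at zero
  plot(fitted(model), resid(model), pch = 20,
       xlab = "Fitted", ylab = "Residuals", main = "Fitted versus Residuals")
  abline(h = 0, lwd = 2)
}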
plot_fitted_resid(fitnew_model)
Based on the fitted versus residuals plot, the model assumptions appear to be violated.
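A complementary check (illustrative only, not run in the original report) would be formal tests of the constant variance and normality assumptions:
# Breusch-Pagan test for constant variance and Shapiro-Wilk test for normality of the errors
library(lmtest)
bptest(fitnew_model)
shapiro.test(resid(fitnew_model))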
Exercise 2 (odor Chemical Data)

Use the odor data from the faraway package for this question.
(a) Fit a complete second order model with odor as the response and the three other variables as predictors. That is, use each first order term, their two-way interactions, and the quadratic term for each of the predictors. Perform the significance of the regression test. Use a level of \(\alpha = 0.10\). Report the following:
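In the code below, odor ~ .^2 expands to the three first order terms plus their two-way interactions, and the I() terms add the quadratics. An equivalent, fully spelled-out formula (illustrative only) would be:
# explicit version of the complete second order model
lm(odor ~ temp + gas + pack + temp:gas + temp:pack + gas:pack +
     I(temp^2) + I(gas^2) + I(pack^2), data = odor)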
library(faraway)
data(odor)
odor_model = lm(odor ~.^2 + I(temp^2) + I(gas^2) + I(pack^2),data = odor)
summary(odor_model)
##
## Call:
## lm(formula = odor ~ .^2 + I(temp^2) + I(gas^2) + I(pack^2), data = odor)
##
## Residuals:
## 1 2 3 4 5 6 7 8 9 10
## -20.625 -6.875 6.875 20.625 15.500 1.750 -1.750 -15.500 5.125 -22.375
## 11 12 13 14 15
## 22.375 -5.125 -0.333 -4.333 4.667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30.67 12.98 -2.36 0.0645 .
## temp -12.13 7.95 -1.53 0.1876
## gas -17.00 7.95 -2.14 0.0854 .
## pack -21.37 7.95 -2.69 0.0433 *
## I(temp^2) 32.08 11.70 2.74 0.0407 *
## I(gas^2) 47.83 11.70 4.09 0.0095 **
## I(pack^2) 6.08 11.70 0.52 0.6252
## temp:gas 8.25 11.24 0.73 0.4959
## temp:pack 1.50 11.24 0.13 0.8990
## gas:pack -1.75 11.24 -0.16 0.8824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.5 on 5 degrees of freedom
## Multiple R-squared: 0.882, Adjusted R-squared: 0.67
## F-statistic: 4.15 on 9 and 5 DF, p-value: 0.0657
Null hypothesis: \(H_0\): all slope coefficients are zero (the significance of regression test).
Test statistic: \(F = 4.152\)
F distribution: 9 degrees of freedom in the numerator and 5 degrees of freedom in the denominator.
p-value: 0.06569
Decision: Reject \(H_0\) at \(\alpha = 0.10\), since the p-value (0.06569) is less than 0.10.
(b) Fit a model with the same response, but now excluding any interaction terms. So, include all linear and quadratic terms. Compare this model to the model in (a) using an appropriate test. Use a level of \(\alpha = 0.10\). Report the following:
odor_model_2 = lm(odor~. + I(temp^2) + I(gas^2) + I(pack^2), data = odor)
summary(odor_model_2)
##
## Call:
## lm(formula = odor ~ . + I(temp^2) + I(gas^2) + I(pack^2), data = odor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.62 -9.62 -1.38 4.02 28.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30.67 10.84 -2.83 0.0222 *
## temp -12.13 6.64 -1.83 0.1052
## gas -17.00 6.64 -2.56 0.0336 *
## pack -21.37 6.64 -3.22 0.0122 *
## I(temp^2) 32.08 9.77 3.28 0.0111 *
## I(gas^2) 47.83 9.77 4.90 0.0012 **
## I(pack^2) 6.08 9.77 0.62 0.5509
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.8 on 8 degrees of freedom
## Multiple R-squared: 0.868, Adjusted R-squared: 0.769
## F-statistic: 8.79 on 6 and 8 DF, p-value: 0.00362
anova(odor_model_2,odor_model)
## Analysis of Variance Table
##
## Model 1: odor ~ temp + gas + pack + I(temp^2) + I(gas^2) + I(pack^2)
## Model 2: odor ~ (temp + gas + pack)^2 + I(temp^2) + I(gas^2) + I(pack^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8 2820
## 2 5 2526 3 294 0.19 0.9
Null hypothesis: \(H_0: \beta_{\text{temp:gas}} = \beta_{\text{temp:pack}} = \beta_{\text{gas:pack}} = 0\) (the three interaction coefficients are zero).
Test statistic: \(F = 0.1936\)
F distribution: 3 degrees of freedom in the numerator and 5 degrees of freedom in the denominator.
p-value: 0.8965
Decision: Fail to reject \(H_0\) at \(\alpha = 0.10\), so the smaller model without the interaction terms is preferred.
(c) Report the proportion of the observed variation of odor explained by the two previous models.
summary(odor_model)$r.squared
## [1] 0.882
summary(odor_model_2)$r.squared
## [1] 0.8683
(d) Use adjusted \(R^2\) to pick from the two models. Report both values. Does this decision match the decision made in part (b)?
summary(odor_model)$adj.r.squared
## [1] 0.6696
summary(odor_model_2)$adj.r.squared
## [1] 0.7695
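These values can be verified from the definition \(R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}\): for the model without interactions, with \(n = 15\) observations and \(p = 6\) predictors, \(1 - (1 - 0.8683)\cdot 14/8 \approx 0.7695\).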
The decision matches the decision made in part (b): both prefer the model without the interaction terms.
Exercise 3 (teengamb Gambling Data)

The teengamb dataset from the faraway package contains data related to teenage gambling in Britain.
(a) Fit an additive model with gamble as the response and the other variables as predictors. Use backward AIC variable selection to determine a good model. When writing your final report, you may wish to use trace = 0 inside of step() to minimize unneeded output. (This advice is also useful for future questions which use step().)
library(faraway)
data(teengamb)
teenmodel = lm(gamble ~., data = teengamb)
teenmodel_mod_back_aic = step(teenmodel,direction = "backward",trace = 0)
teenmodel_mod_back_aic
##
## Call:
## lm(formula = gamble ~ sex + income + verbal, data = teengamb)
##
## Coefficients:
## (Intercept) sex income verbal
## 24.14 -22.96 4.90 -2.75
(b) Use backward BIC variable selection to determine a good model.
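# backward BIC: same step() search as AIC, but with penalty k = log(n) instead of the default k = 2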
n = length(resid(teenmodel))
teenmodel_mod_back_bic = step(teenmodel, direction = "backward", k = log(n))
## Start: AIC=307.4
## gamble ~ sex + status + income + verbal
##
## Df Sum of Sq RSS AIC
## - status 1 18 21642 304
## - verbal 1 956 22580 306
## <none> 21624 307
## - sex 1 3736 25360 311
## - income 1 12056 33680 324
##
## Step: AIC=303.6
## gamble ~ sex + income + verbal
##
## Df Sum of Sq RSS AIC
## - verbal 1 1140 22781 302
## <none> 21642 304
## - sex 1 5788 27429 311
## - income 1 13236 34878 322
##
## Step: AIC=302.2
## gamble ~ sex + income
##
## Df Sum of Sq RSS AIC
## <none> 22781 302
## - sex 1 5227 28009 308
## - income 1 15310 38091 322
teenmodel_mod_back_bic
##
## Call:
## lm(formula = gamble ~ sex + income, data = teengamb)
##
## Coefficients:
## (Intercept) sex income
## 4.04 -21.63 5.17
(c) Use a statistical test to compare these two models. Use a level of \(\alpha = 0.10\). Report the following:
anova(teenmodel_mod_back_bic,teenmodel_mod_back_aic)
## Analysis of Variance Table
##
## Model 1: gamble ~ sex + income
## Model 2: gamble ~ sex + income + verbal
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 44 22781
## 2 43 21642 1 1140 2.26 0.14
Null hypothesis: \(H_0: \beta_{\text{verbal}} = 0\) (the coefficient of verbal, the only term in the AIC model but not the BIC model, is zero).
Test statistic: \(F = 2.2646\)
F distribution: 1 degree of freedom in the numerator and 43 degrees of freedom in the denominator.
p-value: 0.1397
Decision: Fail to reject \(H_0\) at \(\alpha = 0.10\), so the smaller model chosen by BIC (gamble ~ sex + income) is preferred.
(d) Fit a model with gamble as the response and the other variables as predictors with all possible interactions, up to and including a four-way interaction. Use backward AIC variable selection to determine a good model.
?teengamb
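The formula sex * status * income * verbal used below expands to all main effects plus every two-, three-, and four-way interaction; an equivalent specification (illustrative only) is:
# all interactions up to and including the four-way interaction
lm(gamble ~ (sex + status + income + verbal)^4, data = teengamb)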
fit_new = lm(gamble ~ sex*status*income*verbal, data = teengamb)
fit_new_aic = step(fit_new,direction = "backward",trace = 0)
fit_new_aic
##
## Call:
## lm(formula = gamble ~ sex + status + income + sex:income + status:income,
## data = teengamb)
##
## Coefficients:
## (Intercept) sex status income sex:income
## -29.87 15.18 0.59 13.89 -9.10
## status:income
## -0.17
(e) Compare the values of Adjusted \(R^2\) for each of the five previous models. Which model is the “best” model out of the five? Justify your answer.
summary(teenmodel)$adj.r.squared
## [1] 0.4816
summary(teenmodel_mod_back_aic)$adj.r.squared
## [1] 0.4933
summary(teenmodel_mod_back_bic)$adj.r.squared
## [1] 0.4787
summary(fit_new)$adj.r.squared
## [1] 0.5607
summary(fit_new_aic)$adj.r.squared
## [1] 0.6329
By adjusted \(R^2\), the best of the five models is the fifth one: the model selected by backward AIC from the full four-way interaction model, with an adjusted \(R^2\) of about 0.633.