Directions

Students are encouraged to work together on homework. However, sharing, copying or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible.

Final submissions must be uploaded to our Compass 2g site on the Homework page. No email, hardcopy, or late submissions will be accepted.

Assignment

Exercise 1 (longley Macroeconomic data)

The data set longley from the faraway package contains macroeconomic data for predicting employment.

library(faraway)
View(longley)
?longley

(a) Find the correlation between each of the variables in the dataset.

round(cor(longley),2)
##              GNP.deflator  GNP Unemployed Armed.Forces Population Year Employed
## GNP.deflator         1.00 0.99       0.62         0.46       0.98 0.99     0.97
## GNP                  0.99 1.00       0.60         0.45       0.99 1.00     0.98
## Unemployed           0.62 0.60       1.00        -0.18       0.69 0.67     0.50
## Armed.Forces         0.46 0.45      -0.18         1.00       0.36 0.42     0.46
## Population           0.98 0.99       0.69         0.36       1.00 0.99     0.96
## Year                 0.99 1.00       0.67         0.42       0.99 1.00     0.97
## Employed             0.97 0.98       0.50         0.46       0.96 0.97     1.00

(b) Fit a model with Employed as the response and the remaining variables as predictors. Calculate the variance inflation factor for each of the predictors. What is the largest VIF? Do any of the VIFs suggest multicollinearity?

longley_model = lm(Employed ~ ., data = longley)
vif(longley_model)
## GNP.deflator          GNP   Unemployed Armed.Forces   Population         Year 
##      135.532     1788.513       33.619        3.589      399.151      758.981

The largest VIF is that of GNP, approximately 1788.51. Using the common rule of thumb that a VIF greater than 5 is cause for concern, there is a serious multicollinearity issue: every predictor except Armed.Forces has a VIF well above 5.
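
As a quick check on how many predictors exceed that threshold, the VIFs above 5 can be pulled out directly; a small sketch using the vif() function from faraway loaded above:

# show only the predictors whose VIF exceeds the rule-of-thumb cutoff of 5
longley_vifs = vif(longley_model)
longley_vifs[longley_vifs > 5]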

(c) What proportion of observed variation in Population is explained by a linear relationship with the other predictors?

# Employed regressed on everything except Population (residuals used in part (d))
fit1_model = lm(Employed ~ . - Population, data = longley)
# Population regressed on the other predictors
fit2_model = lm(Population ~ . - Employed, data = longley)
summary(fit2_model)$r.squared
## [1] 0.9975

About 99.75% of the observed variation in Population is explained by a linear relationship with the other predictors.
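
As a sanity check, the same proportion can be computed directly from its definition, \(R^2 = 1 - \text{SSE} / \text{SST}\); a minimal sketch:

# R^2 of fit2_model computed by hand from the sums of squares
sse = sum(resid(fit2_model) ^ 2)
sst = sum((longley$Population - mean(longley$Population)) ^ 2)
1 - sse / sst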

(d) Calculate the partial correlation coefficient for Population and Employed with the effects of the other predictors removed.

# correlation of the two residual vectors gives the partial correlation
cor(resid(fit2_model), resid(fit1_model))
## [1] -0.07514
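
As a cross-check, the same partial correlation can be recovered from the \(t\) statistic of the Population coefficient in the full model from part (b), using the standard identity \(r = t / \sqrt{t^2 + \text{df}}\), where df is the residual degrees of freedom of the full model; a minimal sketch:

# partial correlation from the t statistic of Population in the full model
t_pop = summary(longley_model)$coefficients["Population", "t value"]
t_pop / sqrt(t_pop ^ 2 + df.residual(longley_model))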

(e) Fit a new model with Employed as the response and the predictors from the model in (b) which were significant. (Use \(\alpha = 0.05\).) Calculate the variance inflation factor for each of the predictors. What is the largest VIF? Do any of the VIFs suggest multicollinearity?

fitnew_model = lm(Employed ~ Year + Armed.Forces + Unemployed, data = longley)
summary(fitnew_model)
## 
## Call:
## lm(formula = Employed ~ Year + Armed.Forces + Unemployed, data = longley)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5729 -0.1199  0.0409  0.1398  0.7530 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.80e+03   6.86e+01  -26.18  5.9e-12 ***
## Year          9.56e-01   3.55e-02   26.92  4.2e-12 ***
## Armed.Forces -7.72e-03   1.84e-03   -4.20   0.0012 ** 
## Unemployed   -1.47e-02   1.67e-03   -8.79  1.4e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.332 on 12 degrees of freedom
## Multiple R-squared:  0.993,  Adjusted R-squared:  0.991 
## F-statistic:  555 on 3 and 12 DF,  p-value: 3.92e-13
vif(fitnew_model)
##         Year Armed.Forces   Unemployed 
##        3.891        2.223        3.318

The largest VIF is that of Year, approximately 3.89. None of the VIFs suggest multicollinearity, since all of them are less than 5.

(f) Use an \(F\)-test to compare the models in parts (b) and (e). Report the following:

  • The null hypothesis.
  • The test statistic.
  • The distribution of the test statistic under the null hypothesis.
  • The p-value.
  • A decision.
  • Which model you prefer, (b) or (e).
anova(fitnew_model,longley_model)
## Analysis of Variance Table
## 
## Model 1: Employed ~ Year + Armed.Forces + Unemployed
## Model 2: Employed ~ GNP.deflator + GNP + Unemployed + Armed.Forces + Population + 
##     Year
##   Res.Df   RSS Df Sum of Sq    F Pr(>F)
## 1     12 1.323                         
## 2      9 0.836  3     0.487 1.75   0.23

\(H_0: \beta_{\text{GNP.deflator}} = \beta_{\text{GNP}} = \beta_{\text{Population}} = 0\), i.e., the coefficients of the three predictors dropped from the model in (b) are all zero.

Test statistic : F = 1.7465

F distribution : 3 degrees of freedom in the numerator and 9 degrees of freedom in the denominator.

p-value : 0.227

Decision : Fail to Reject \(H_0\) at \(\alpha = 0.01\)

I would prefer model (e), since the test gives no evidence that the three additional predictors in (b) significantly improve the model.

(g) Check the assumptions of the model chosen in part (f). Do any assumptions appear to be violated?

plot_fitted_resid(fitnew_model)

Based on the fitted-versus-residuals plot, the model assumptions appear to be violated.
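
plot_fitted_resid() appears to be a helper defined outside this document; if it is not available, equivalent diagnostics can be produced with base R, plus the Breusch-Pagan test from the lmtest package (assumed to be installed). A minimal sketch:

# fitted-versus-residuals plot and normal Q-Q plot with base R
plot(fitted(fitnew_model), resid(fitnew_model),
     xlab = "Fitted", ylab = "Residuals", main = "Fitted versus Residuals")
abline(h = 0, lty = 2)
qqnorm(resid(fitnew_model))
qqline(resid(fitnew_model))

# formal checks: constant variance (Breusch-Pagan) and normality (Shapiro-Wilk)
library(lmtest)
bptest(fitnew_model)
shapiro.test(resid(fitnew_model))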

Exercise 2 (odor Chemical Data)

Use the odor data from the faraway package for this question.

(a) Fit a complete second order model with odor as the response and the three other variables as predictors. That is, use each first-order term, their two-way interactions, and the quadratic term for each of the predictors. Perform the significance of the regression test. Use a level of \(\alpha = 0.10\). Report the following:

  • The test statistic.
  • The distribution of the test statistic under the null hypothesis.
  • The p-value.
  • A decision.
library(faraway)
data(odor)
# first-order terms, all two-way interactions, and quadratic terms
odor_model = lm(odor ~ .^2 + I(temp^2) + I(gas^2) + I(pack^2), data = odor)
summary(odor_model)
## 
## Call:
## lm(formula = odor ~ .^2 + I(temp^2) + I(gas^2) + I(pack^2), data = odor)
## 
## Residuals:
##       1       2       3       4       5       6       7       8       9      10 
## -20.625  -6.875   6.875  20.625  15.500   1.750  -1.750 -15.500   5.125 -22.375 
##      11      12      13      14      15 
##  22.375  -5.125  -0.333  -4.333   4.667 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -30.67      12.98   -2.36   0.0645 . 
## temp          -12.13       7.95   -1.53   0.1876   
## gas           -17.00       7.95   -2.14   0.0854 . 
## pack          -21.37       7.95   -2.69   0.0433 * 
## I(temp^2)      32.08      11.70    2.74   0.0407 * 
## I(gas^2)       47.83      11.70    4.09   0.0095 **
## I(pack^2)       6.08      11.70    0.52   0.6252   
## temp:gas        8.25      11.24    0.73   0.4959   
## temp:pack       1.50      11.24    0.13   0.8990   
## gas:pack       -1.75      11.24   -0.16   0.8824   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.5 on 5 degrees of freedom
## Multiple R-squared:  0.882,  Adjusted R-squared:  0.67 
## F-statistic: 4.15 on 9 and 5 DF,  p-value: 0.0657

Test statistic : F = 4.152

F distribution : 9 degrees of freedom in the numerator and 5 degrees of freedom in the denominator.

p-value : 0.06569

Decision : Reject \(H_0\) at \(\alpha = 0.10\), since the p-value (0.0657) is less than 0.10.
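
The test statistic and its p-value can also be extracted programmatically from the summary object rather than read off the printout; a small sketch:

# overall F statistic (value, numerator df, denominator df) and its p-value
f_stat = summary(odor_model)$fstatistic
f_stat
pf(f_stat[1], f_stat[2], f_stat[3], lower.tail = FALSE)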

(b) Fit a model with the same response, but now excluding any interaction terms. So, include all linear and quadratic terms. Compare this model to the model in (a) using an appropriate test. Use a level of \(\alpha = 0.10\). Report the following:

  • The test statistic.
  • The distribution of the test statistic under the null hypothesis.
  • The p-value.
  • A decision.
# all linear and quadratic terms, but no interaction terms
odor_model_2 = lm(odor ~ . + I(temp^2) + I(gas^2) + I(pack^2), data = odor)
summary(odor_model_2)
## 
## Call:
## lm(formula = odor ~ . + I(temp^2) + I(gas^2) + I(pack^2), data = odor)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20.62  -9.62  -1.38   4.02  28.88 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -30.67      10.84   -2.83   0.0222 * 
## temp          -12.13       6.64   -1.83   0.1052   
## gas           -17.00       6.64   -2.56   0.0336 * 
## pack          -21.37       6.64   -3.22   0.0122 * 
## I(temp^2)      32.08       9.77    3.28   0.0111 * 
## I(gas^2)       47.83       9.77    4.90   0.0012 **
## I(pack^2)       6.08       9.77    0.62   0.5509   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.8 on 8 degrees of freedom
## Multiple R-squared:  0.868,  Adjusted R-squared:  0.769 
## F-statistic: 8.79 on 6 and 8 DF,  p-value: 0.00362
anova(odor_model_2,odor_model)
## Analysis of Variance Table
## 
## Model 1: odor ~ temp + gas + pack + I(temp^2) + I(gas^2) + I(pack^2)
## Model 2: odor ~ (temp + gas + pack)^2 + I(temp^2) + I(gas^2) + I(pack^2)
##   Res.Df  RSS Df Sum of Sq    F Pr(>F)
## 1      8 2820                         
## 2      5 2526  3       294 0.19    0.9

Test statistic : F = 0.1936

F distribution : 3 degrees of freedom in the numerator and 5 degrees of freedom in the denominator.

p-value : 0.8965

Decision : Fail to Reject \(H_0\) at \(\alpha = 0.10\)
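
The same comparison can be reproduced by hand from the two residual sums of squares, which makes the degrees of freedom explicit; a minimal sketch:

# nested F test computed directly from the two fitted models
sse_null = sum(resid(odor_model_2) ^ 2)  # no interaction terms
sse_full = sum(resid(odor_model) ^ 2)    # complete second-order model
df_diff  = df.residual(odor_model_2) - df.residual(odor_model)
f_val = ((sse_null - sse_full) / df_diff) / (sse_full / df.residual(odor_model))
f_val
pf(f_val, df_diff, df.residual(odor_model), lower.tail = FALSE)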

(c) Report the proportion of the observed variation of odor explained by the two previous models.

summary(odor_model)$r.squared
## [1] 0.882
summary(odor_model_2)$r.squared
## [1] 0.8683

The full second-order model explains about 88.2% of the observed variation in odor, while the model without interaction terms explains about 86.8%.

(d) Use adjusted \(R^2\) to pick from the two models. Report both values. Does this decision match the decision made in part (b)?

summary(odor_model)$adj.r.squared
## [1] 0.6696
summary(odor_model_2)$adj.r.squared
## [1] 0.7695

Adjusted \(R^2\) prefers the model without interaction terms (0.7695 versus 0.6696), which matches the decision made in part (b).

Exercise 3 (teengamb Gambling Data)

The teengamb dataset from the faraway package contains data related to teenage gambling in Britain.

(a) Fit an additive model with gamble as the response and the other variables as predictors. Use backward AIC variable selection to determine a good model. When writing your final report, you may wish to use trace = 0 inside of step() to minimize unneeded output. (This advice is also useful for future questions which use step().)

library(faraway)
data(teengamb)
teenmodel = lm(gamble ~ ., data = teengamb)
teenmodel_mod_back_aic = step(teenmodel, direction = "backward", trace = 0)
teenmodel_mod_back_aic
## 
## Call:
## lm(formula = gamble ~ sex + income + verbal, data = teengamb)
## 
## Coefficients:
## (Intercept)          sex       income       verbal  
##       24.14       -22.96         4.90        -2.75
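
To see what the selection step gained, the AIC of the full additive model and of the selected model can be compared directly; extractAIC() reports the same criterion that step() uses. A short sketch:

# extractAIC() returns c(equivalent df, AIC); smaller AIC is better
extractAIC(teenmodel)
extractAIC(teenmodel_mod_back_aic)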

(b) Use backward BIC variable selection to determine a good model.

n = length(resid(teenmodel))
teenmodel_mod_back_bic = step(teenmodel, direction = "backward", k = log(n))
## Start:  AIC=307.4
## gamble ~ sex + status + income + verbal
## 
##          Df Sum of Sq   RSS AIC
## - status  1        18 21642 304
## - verbal  1       956 22580 306
## <none>                21624 307
## - sex     1      3736 25360 311
## - income  1     12056 33680 324
## 
## Step:  AIC=303.6
## gamble ~ sex + income + verbal
## 
##          Df Sum of Sq   RSS AIC
## - verbal  1      1140 22781 302
## <none>                21642 304
## - sex     1      5788 27429 311
## - income  1     13236 34878 322
## 
## Step:  AIC=302.2
## gamble ~ sex + income
## 
##          Df Sum of Sq   RSS AIC
## <none>                22781 302
## - sex     1      5227 28009 308
## - income  1     15310 38091 322
teenmodel_mod_back_bic
## 
## Call:
## lm(formula = gamble ~ sex + income, data = teengamb)
## 
## Coefficients:
## (Intercept)          sex       income  
##        4.04       -21.63         5.17

(c) Use a statistical test to compare these two models. Use a level of \(\alpha = 0.10\). Report the following:

  • The test statistic.
  • The distribution of the test statistic under the null hypothesis.
  • The p-value.
  • A decision.
anova(teenmodel_mod_back_bic,teenmodel_mod_back_aic)
## Analysis of Variance Table
## 
## Model 1: gamble ~ sex + income
## Model 2: gamble ~ sex + income + verbal
##   Res.Df   RSS Df Sum of Sq    F Pr(>F)
## 1     44 22781                         
## 2     43 21642  1      1140 2.26   0.14

Test statistic : F = 2.2646

F distribution : 1 degree of freedom in the numerator and 43 degrees of freedom in the denominator.

p-value : 0.1397

Decision : Fail to Reject \(H_0\) at \(\alpha = 0.10\)

(d) Fit a model with gamble as the response and the other variables as predictors with all possible interactions, up to and including a four-way interaction. Use backward AIC variable selection to determine a good model.

fit_new = lm(gamble ~ sex * status * income * verbal, data = teengamb)
fit_new_aic = step(fit_new, direction = "backward", trace = 0)
fit_new_aic
## 
## Call:
## lm(formula = gamble ~ sex + status + income + sex:income + status:income, 
##     data = teengamb)
## 
## Coefficients:
##   (Intercept)            sex         status         income     sex:income  
##        -29.87          15.18           0.59          13.89          -9.10  
## status:income  
##         -0.17

(e) Compare the values of Adjusted \(R^2\) for each of the five previous models. Which model is the “best” model out of the five? Justify your answer.

summary(teenmodel)$adj.r.squared
## [1] 0.4816
summary(teenmodel_mod_back_aic)$adj.r.squared
## [1] 0.4933
summary(teenmodel_mod_back_bic)$adj.r.squared
## [1] 0.4787
summary(fit_new)$adj.r.squared
## [1] 0.5607
summary(fit_new_aic)$adj.r.squared
## [1] 0.6329

The best model out of the five is the fifth model, the interaction model selected by backward AIC in part (d), since it has the largest adjusted \(R^2\) (approximately 0.633).
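
A compact way to make this comparison is to collect the five adjusted \(R^2\) values into a named vector and pick the largest; a small sketch (the names are arbitrary labels):

# adjusted R^2 for all five models; the largest value identifies the "best" model
adj_r2 = c(
  additive     = summary(teenmodel)$adj.r.squared,
  back_aic     = summary(teenmodel_mod_back_aic)$adj.r.squared,
  back_bic     = summary(teenmodel_mod_back_bic)$adj.r.squared,
  interactions = summary(fit_new)$adj.r.squared,
  int_back_aic = summary(fit_new_aic)$adj.r.squared
)
which.max(adj_r2)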