Directions

Students are encouraged to work together on homework. However, sharing, copying or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible.

Final submissions must be uploaded to our Compass 2g site on the Homework page. No email, hardcopy, or late submissions will be accepted.

Assignment

Exercise 1 (longley Macroeconomic data)

The data set longley from the faraway package contains macroeconomic data for predicting employment.

library(faraway)
View(longley)
?longley

(a) Find the correlation between each of the variables in the dataset.

round(cor(longley),2)
##              GNP.deflator  GNP Unemployed Armed.Forces Population Year Employed
## GNP.deflator         1.00 0.99       0.62         0.46       0.98 0.99     0.97
## GNP                  0.99 1.00       0.60         0.45       0.99 1.00     0.98
## Unemployed           0.62 0.60       1.00        -0.18       0.69 0.67     0.50
## Armed.Forces         0.46 0.45      -0.18         1.00       0.36 0.42     0.46
## Population           0.98 0.99       0.69         0.36       1.00 0.99     0.96
## Year                 0.99 1.00       0.67         0.42       0.99 1.00     0.97
## Employed             0.97 0.98       0.50         0.46       0.96 0.97     1.00

(b) Fit a model with Employed as the response and the remaining variables as predictors. Calculate the variance inflation factor for each of the predictors. What is the largest VIF? Do any of the VIFs suggest multicollinearity?

longley_model = lm(Employed ~ ., data = longley)
vif(longley_model)
## GNP.deflator          GNP   Unemployed Armed.Forces   Population         Year 
##      135.532     1788.513       33.619        3.589      399.151      758.981

The largest VIF is that of GNP, approximately 1788.51. Using the common rule of thumb that a VIF greater than 5 is cause for concern, there is a serious multicollinearity issue: every predictor except Armed.Forces has a VIF well above 5.
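
As a quick check on how many predictors exceed that threshold, the VIFs above 5 can be pulled out directly; a small sketch using the vif() function from faraway loaded above:

# show only the predictors whose VIF exceeds the rule-of-thumb cutoff of 5
longley_vifs = vif(longley_model)
longley_vifs[longley_vifs > 5]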

(c) What proportion of observed variation in Population is explained by a linear relationship with the other predictors?

# Employed regressed on everything except Population (residuals used in part (d))
fit1_model = lm(Employed ~ . - Population, data = longley)
# Population regressed on the other predictors
fit2_model = lm(Population ~ . - Employed, data = longley)
summary(fit2_model)$r.squared
## [1] 0.9975

About 99.75% of the observed variation in Population is explained by a linear relationship with the other predictors.
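
As a sanity check, the same proportion can be computed directly from its definition, \(R^2 = 1 - \text{SSE} / \text{SST}\); a minimal sketch:

# R^2 of fit2_model computed by hand from the sums of squares
sse = sum(resid(fit2_model) ^ 2)
sst = sum((longley$Population - mean(longley$Population)) ^ 2)
1 - sse / sst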

(d) Calculate the partial correlation coefficient for Population and Employed with the effects of the other predictors removed.

# correlation of the two residual vectors gives the partial correlation
cor(resid(fit2_model), resid(fit1_model))
## [1] -0.07514
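
As a cross-check, the same partial correlation can be recovered from the \(t\) statistic of the Population coefficient in the full model from part (b), using the standard identity \(r = t / \sqrt{t^2 + \text{df}}\), where df is the residual degrees of freedom of the full model; a minimal sketch:

# partial correlation from the t statistic of Population in the full model
t_pop = summary(longley_model)$coefficients["Population", "t value"]
t_pop / sqrt(t_pop ^ 2 + df.residual(longley_model))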

(e) Fit a new model with Employed as the response and the predictors from the model in (b) which were significant. (Use \(\alpha = 0.05\).) Calculate the variance inflation factor for each of the predictors. What is the largest VIF? Do any of the VIFs suggest multicollinearity?

fitnew_model = lm(Employed ~ Year + Armed.Forces + Unemployed, data = longley)
summary(fitnew_model)
## 
## Call:
## lm(formula = Employed ~ Year + Armed.Forces + Unemployed, data = longley)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5729 -0.1199  0.0409  0.1398  0.7530 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.80e+03   6.86e+01  -26.18  5.9e-12 ***
## Year          9.56e-01   3.55e-02   26.92  4.2e-12 ***
## Armed.Forces -7.72e-03   1.84e-03   -4.20   0.0012 ** 
## Unemployed   -1.47e-02   1.67e-03   -8.79  1.4e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.332 on 12 degrees of freedom
## Multiple R-squared:  0.993,  Adjusted R-squared:  0.991 
## F-statistic:  555 on 3 and 12 DF,  p-value: 3.92e-13
vif(fitnew_model)
##         Year Armed.Forces   Unemployed 
##        3.891        2.223        3.318

The largest VIF is that of Year, approximately 3.89. None of the VIFs suggest multicollinearity, since all of them are less than 5.

(f) Use an \(F\)-test to compare the models in parts (b) and (e). Report the following:

  • The null hypothesis.
  • The test statistic.
  • The distribution of the test statistic under the null hypothesis.
  • The p-value.
  • A decision.
  • Which model you prefer, (b) or (e).
anova(fitnew_model,longley_model)
## Analysis of Variance Table
## 
## Model 1: Employed ~ Year + Armed.Forces + Unemployed
## Model 2: Employed ~ GNP.deflator + GNP + Unemployed + Armed.Forces + Population + 
##     Year
##   Res.Df   RSS Df Sum of Sq    F Pr(>F)
## 1     12 1.323                         
## 2      9 0.836  3     0.487 1.75   0.23

\(H_0: \beta_{\text{GNP.deflator}} = \beta_{\text{GNP}} = \beta_{\text{Population}} = 0\), i.e., the coefficients of the three predictors dropped from the model in (b) are all zero.

Test statistic : F = 1.7465

F distribution : 3 degrees of freedom in the numerator and 9 degrees of freedom in the denominator.

p-value : 0.227

Decision : Fail to Reject \(H_0\) at \(\alpha = 0.01\)

I would prefer model (e), since the test gives no evidence that the three additional predictors in (b) significantly improve the model.

(g) Check the assumptions of the model chosen in part (f). Do any assumptions appear to be violated?

plot_fitted_resid(fitnew_model)

Based on the fitted-versus-residuals plot, the model assumptions appear to be violated.
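
plot_fitted_resid() appears to be a helper defined outside this document; if it is not available, equivalent diagnostics can be produced with base R, plus the Breusch-Pagan test from the lmtest package (assumed to be installed). A minimal sketch:

# fitted-versus-residuals plot and normal Q-Q plot with base R
plot(fitted(fitnew_model), resid(fitnew_model),
     xlab = "Fitted", ylab = "Residuals", main = "Fitted versus Residuals")
abline(h = 0, lty = 2)
qqnorm(resid(fitnew_model))
qqline(resid(fitnew_model))

# formal checks: constant variance (Breusch-Pagan) and normality (Shapiro-Wilk)
library(lmtest)
bptest(fitnew_model)
shapiro.test(resid(fitnew_model))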

Exercise 2 (odor Chemical Data)

Use the odor data from the faraway package for this question.

(a) Fit a complete second order model with odor as the response and the three other variables as predictors. That is, use each first-order term, their two-way interactions, and the quadratic term for each of the predictors. Perform the significance of the regression test. Use a level of \(\alpha = 0.10\). Report the following:

  • The test statistic.
  • The distribution of the test statistic under the null hypothesis.
  • The p-value.
  • A decision.
library(faraway)
data(odor)
# first-order terms, all two-way interactions, and quadratic terms
odor_model = lm(odor ~ .^2 + I(temp^2) + I(gas^2) + I(pack^2), data = odor)
summary(odor_model)
## 
## Call:
## lm(formula = odor ~ .^2 + I(temp^2) + I(gas^2) + I(pack^2), data = odor)
## 
## Residuals:
##       1       2       3       4       5       6       7       8       9      10 
## -20.625  -6.875   6.875  20.625  15.500   1.750  -1.750 -15.500   5.125 -22.375 
##      11      12      13      14      15 
##  22.375  -5.125  -0.333  -4.333   4.667 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -30.67      12.98   -2.36   0.0645 . 
## temp          -12.13       7.95   -1.53   0.1876   
## gas           -17.00       7.95   -2.14   0.0854 . 
## pack          -21.37       7.95   -2.69   0.0433 * 
## I(temp^2)      32.08      11.70    2.74   0.0407 * 
## I(gas^2)       47.83      11.70    4.09   0.0095 **
## I(pack^2)       6.08      11.70    0.52   0.6252   
## temp:gas        8.25      11.24    0.73   0.4959   
## temp:pack       1.50      11.24    0.13   0.8990   
## gas:pack       -1.75      11.24   -0.16   0.8824   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.5 on 5 degrees of freedom
## Multiple R-squared:  0.882,  Adjusted R-squared:  0.67 
## F-statistic: 4.15 on 9 and 5 DF,  p-value: 0.0657

Test statistic : F = 4.152

F distribution : 9 degrees of freedom in the numerator and 5 degrees of freedom in the denominator.

p-value : 0.06569

Decision : Reject \(H_0\) at \(\alpha = 0.10\), since the p-value (0.0657) is less than 0.10.
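
The test statistic and its p-value can also be extracted programmatically from the summary object rather than read off the printout; a small sketch:

# overall F statistic (value, numerator df, denominator df) and its p-value
f_stat = summary(odor_model)$fstatistic
f_stat
pf(f_stat[1], f_stat[2], f_stat[3], lower.tail = FALSE)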

(b) Fit a model with the same response, but now excluding any interaction terms. So, include all linear and quadratic terms. Compare this model to the model in (a) using an appropriate test. Use a level of \(\alpha = 0.10\). Report the following:

  • The test statistic.
  • The distribution of the test statistic under the null hypothesis.
  • The p-value.
  • A decision.
# all linear and quadratic terms, but no interaction terms
odor_model_2 = lm(odor ~ . + I(temp^2) + I(gas^2) + I(pack^2), data = odor)
summary(odor_model_2)
## 
## Call:
## lm(formula = odor ~ . + I(temp^2) + I(gas^2) + I(pack^2), data = odor)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20.62  -9.62  -1.38   4.02  28.88 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -30.67      10.84   -2.83   0.0222 * 
## temp          -12.13       6.64   -1.83   0.1052   
## gas           -17.00       6.64   -2.56   0.0336 * 
## pack          -21.37       6.64   -3.22   0.0122 * 
## I(temp^2)      32.08       9.77    3.28   0.0111 * 
## I(gas^2)       47.83       9.77    4.90   0.0012 **
## I(pack^2)       6.08       9.77    0.62   0.5509   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.8 on 8 degrees of freedom
## Multiple R-squared:  0.868,  Adjusted R-squared:  0.769 
## F-statistic: 8.79 on 6 and 8 DF,  p-value: 0.00362
anova(odor_model_2,odor_model)
## Analysis of Variance Table
## 
## Model 1: odor ~ temp + gas + pack + I(temp^2) + I(gas^2) + I(pack^2)
## Model 2: odor ~ (temp + gas + pack)^2 + I(temp^2) + I(gas^2) + I(pack^2)
##   Res.Df  RSS Df Sum of Sq    F Pr(>F)
## 1      8 2820                         
## 2      5 2526  3       294 0.19    0.9

Test statistic : F = 0.1936

F distribution : 3 degrees of freedom in the numerator and 5 degrees of freedom in the denominator.

p-value : 0.8965

Decision : Fail to Reject \(H_0\) at \(\alpha = 0.10\)
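
The same comparison can be reproduced by hand from the two residual sums of squares, which makes the degrees of freedom explicit; a minimal sketch:

# nested F test computed directly from the two fitted models
sse_null = sum(resid(odor_model_2) ^ 2)  # no interaction terms
sse_full = sum(resid(odor_model) ^ 2)    # complete second-order model
df_diff  = df.residual(odor_model_2) - df.residual(odor_model)
f_val = ((sse_null - sse_full) / df_diff) / (sse_full / df.residual(odor_model))
f_val
pf(f_val, df_diff, df.residual(odor_model), lower.tail = FALSE)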

(c) Report the proportion of the observed variation of odor explained by the two previous models.

summary(odor_model)$r.squared
## [1] 0.882
summary(odor_model_2)$r.squared
## [1] 0.8683

The full second-order model explains about 88.2% of the observed variation in odor, while the model without interaction terms explains about 86.8%.

(d) Use adjusted \(R^2\) to pick from the two models. Report both values. Does this decision match the decision made in part (b)?

summary(odor_model)$adj.r.squared
## [1] 0.6696
summary(odor_model_2)$adj.r.squared
## [1] 0.7695

Adjusted \(R^2\) prefers the model without interaction terms (0.7695 versus 0.6696), which matches the decision made in part (b).

Exercise 3 (teengamb Gambling Data)

The teengamb dataset from the faraway package contains data related to teenage gambling in Britain.

(a) Fit an additive model with gamble as the response and the other variables as predictors. Use backward AIC variable selection to determine a good model. When writing your final report, you may wish to use trace = 0 inside of step() to minimize unneeded output. (This advice is also useful for future questions which use step().)

library(faraway)
data(teengamb)
teenmodel = lm(gamble ~ ., data = teengamb)
teenmodel_mod_back_aic = step(teenmodel, direction = "backward", trace = 0)
teenmodel_mod_back_aic
## 
## Call:
## lm(formula = gamble ~ sex + income + verbal, data = teengamb)
## 
## Coefficients:
## (Intercept)          sex       income       verbal  
##       24.14       -22.96         4.90        -2.75
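
To see what the selection step gained, the AIC of the full additive model and of the selected model can be compared directly; extractAIC() reports the same criterion that step() uses. A short sketch:

# extractAIC() returns c(equivalent df, AIC); smaller AIC is better
extractAIC(teenmodel)
extractAIC(teenmodel_mod_back_aic)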

(b) Use backward BIC variable selection to determine a good model.

n = length(resid(teenmodel))
teenmodel_mod_back_bic = step(teenmodel, direction = "backward", k = log(n))
## Start:  AIC=307.4
## gamble ~ sex + status + income + verbal
## 
##          Df Sum of Sq   RSS AIC
## - status  1        18 21642 304
## - verbal  1       956 22580 306
## <none>                21624 307
## - sex     1      3736 25360 311
## - income  1     12056 33680 324
## 
## Step:  AIC=303.6
## gamble ~ sex + income + verbal
## 
##          Df Sum of Sq   RSS AIC
## - verbal  1      1140 22781 302
## <none>                21642 304
## - sex     1      5788 27429 311
## - income  1     13236 34878 322
## 
## Step:  AIC=302.2
## gamble ~ sex + income
## 
##          Df Sum of Sq   RSS AIC
## <none>                22781 302
## - sex     1      5227 28009 308
## - income  1     15310 38091 322
teenmodel_mod_back_bic
## 
## Call:
## lm(formula = gamble ~ sex + income, data = teengamb)
## 
## Coefficients:
## (Intercept)          sex       income  
##        4.04       -21.63         5.17

(c) Use a statistical test to compare these two models. Use a level of \(\alpha = 0.10\). Report the following:

  • The test statistic.
  • The distribution of the test statistic under the null hypothesis.
  • The p-value.
  • A decision.
anova(teenmodel_mod_back_bic,teenmodel_mod_back_aic)
## Analysis of Variance Table
## 
## Model 1: gamble ~ sex + income
## Model 2: gamble ~ sex + income + verbal
##   Res.Df   RSS Df Sum of Sq    F Pr(>F)
## 1     44 22781                         
## 2     43 21642  1      1140 2.26   0.14

Test statistic : F = 2.2646

F distribution : 1 degree of freedom in the numerator and 43 degrees of freedom in the denominator.

p-value : 0.1397

Decision : Fail to Reject \(H_0\) at \(\alpha = 0.10\)

(d) Fit a model with gamble as the response and the other variables as predictors with all possible interactions, up to and including a four-way interaction. Use backward AIC variable selection to determine a good model.

fit_new = lm(gamble ~ sex * status * income * verbal, data = teengamb)
fit_new_aic = step(fit_new, direction = "backward", trace = 0)
fit_new_aic
## 
## Call:
## lm(formula = gamble ~ sex + status + income + sex:income + status:income, 
##     data = teengamb)
## 
## Coefficients:
##   (Intercept)            sex         status         income     sex:income  
##        -29.87          15.18           0.59          13.89          -9.10  
## status:income  
##         -0.17

(e) Compare the values of Adjusted \(R^2\) for each of the five previous models. Which model is the “best” model out of the five? Justify your answer.

summary(teenmodel)$adj.r.squared
## [1] 0.4816
summary(teenmodel_mod_back_aic)$adj.r.squared
## [1] 0.4933
summary(teenmodel_mod_back_bic)$adj.r.squared
## [1] 0.4787
summary(fit_new)$adj.r.squared
## [1] 0.5607
summary(fit_new_aic)$adj.r.squared
## [1] 0.6329

The best model out of the five is the fifth model, the interaction model selected by backward AIC in part (d), since it has the largest adjusted \(R^2\) (approximately 0.633).
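
A compact way to make this comparison is to collect the five adjusted \(R^2\) values into a named vector and pick the largest; a small sketch (the names are arbitrary labels):

# adjusted R^2 for all five models; the largest value identifies the "best" model
adj_r2 = c(
  additive     = summary(teenmodel)$adj.r.squared,
  back_aic     = summary(teenmodel_mod_back_aic)$adj.r.squared,
  back_bic     = summary(teenmodel_mod_back_bic)$adj.r.squared,
  interactions = summary(fit_new)$adj.r.squared,
  int_back_aic = summary(fit_new_aic)$adj.r.squared
)
which.max(adj_r2)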