Question 1

library(faraway)
## Warning: package 'faraway' was built under R version 3.6.3
data(sat)
sat
##                expend ratio salary takers verbal math total
## Alabama         4.405  17.2 31.144      8    491  538  1029
## Alaska          8.963  17.6 47.951     47    445  489   934
## Arizona         4.778  19.3 32.175     27    448  496   944
## Arkansas        4.459  17.1 28.934      6    482  523  1005
## California      4.992  24.0 41.078     45    417  485   902
## Colorado        5.443  18.4 34.571     29    462  518   980
## Connecticut     8.817  14.4 50.045     81    431  477   908
## Delaware        7.030  16.6 39.076     68    429  468   897
## Florida         5.718  19.1 32.588     48    420  469   889
## Georgia         5.193  16.3 32.291     65    406  448   854
## Hawaii          6.078  17.9 38.518     57    407  482   889
## Idaho           4.210  19.1 29.783     15    468  511   979
## Illinois        6.136  17.3 39.431     13    488  560  1048
## Indiana         5.826  17.5 36.785     58    415  467   882
## Iowa            5.483  15.8 31.511      5    516  583  1099
## Kansas          5.817  15.1 34.652      9    503  557  1060
## Kentucky        5.217  17.0 32.257     11    477  522   999
## Louisiana       4.761  16.8 26.461      9    486  535  1021
## Maine           6.428  13.8 31.972     68    427  469   896
## Maryland        7.245  17.0 40.661     64    430  479   909
## Massachusetts   7.287  14.8 40.795     80    430  477   907
## Michigan        6.994  20.1 41.895     11    484  549  1033
## Minnesota       6.000  17.5 35.948      9    506  579  1085
## Mississippi     4.080  17.5 26.818      4    496  540  1036
## Missouri        5.383  15.5 31.189      9    495  550  1045
## Montana         5.692  16.3 28.785     21    473  536  1009
## Nebraska        5.935  14.5 30.922      9    494  556  1050
## Nevada          5.160  18.7 34.836     30    434  483   917
## New Hampshire   5.859  15.6 34.720     70    444  491   935
## New Jersey      9.774  13.8 46.087     70    420  478   898
## New Mexico      4.586  17.2 28.493     11    485  530  1015
## New York        9.623  15.2 47.612     74    419  473   892
## North Carolina  5.077  16.2 30.793     60    411  454   865
## North Dakota    4.775  15.3 26.327      5    515  592  1107
## Ohio            6.162  16.6 36.802     23    460  515   975
## Oklahoma        4.845  15.5 28.172      9    491  536  1027
## Oregon          6.436  19.9 38.555     51    448  499   947
## Pennsylvania    7.109  17.1 44.510     70    419  461   880
## Rhode Island    7.469  14.7 40.729     70    425  463   888
## South Carolina  4.797  16.4 30.279     58    401  443   844
## South Dakota    4.775  14.4 25.994      5    505  563  1068
## Tennessee       4.388  18.6 32.477     12    497  543  1040
## Texas           5.222  15.7 31.223     47    419  474   893
## Utah            3.656  24.3 29.082      4    513  563  1076
## Vermont         6.750  13.8 35.406     68    429  472   901
## Virginia        5.327  14.6 33.987     65    428  468   896
## Washington      5.906  20.2 36.151     48    443  494   937
## West Virginia   6.107  14.8 31.944     17    448  484   932
## Wisconsin       6.930  15.9 37.746      9    501  572  1073
## Wyoming         6.160  14.9 31.285     10    476  525  1001
fit = lm(total ~ expend + ratio + salary + takers, data = sat)

Part a

plot(fitted(fit), resid(fit), col = "blue")
abline(h = 0, col = "orange", lwd = 2)

library(lmtest)
## Warning: package 'lmtest' was built under R version 3.6.3
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(fit)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit
## BP = 2.1587, df = 4, p-value = 0.7066

We do not reject a constant variance assumption for our model given the result of the BP-Test and a pattern of constant variance in our graph.

Part b

hist(resid(fit))

qqnorm(resid(fit))
qqline(resid(fit))

shapiro.test(resid(fit))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit)
## W = 0.97691, p-value = 0.4304

We do not reject a normality assumption of the model given the p value of the Shapiro test and the evidence of normality on both the histogram and the quaartile plot.

Part c

plot(hatvalues(fit), rstandard(fit))

n = 50; p = 5; lim = 2 * p / n
hatvalues(fit)[hatvalues(fit) > lim]
##  California Connecticut  New Jersey        Utah 
##   0.2821179   0.2254519   0.2220978   0.2921128

Large leverage points in the model include California, Connecticut, New Jersey, and Utah.

Part d

hatvalues(fit)[abs(rstandard(fit)) >2]
## New Hampshire  North Dakota          Utah West Virginia 
##    0.08283540    0.08104629    0.29211280    0.06206536

Outliers in the model include New Hampshire, North Dakota, Utah, and West Virginia.

Part e

cook = cooks.distance(fit)
cook[cook > 4/50]
##          Utah West Virginia 
##     0.4715287     0.1081395

Influential points include Utah and West Virginia.

Part f

summary(fit)
## 
## Call:
## lm(formula = total ~ expend + ratio + salary + takers, data = sat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.531 -20.855  -1.746  15.979  66.571 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1045.9715    52.8698  19.784  < 2e-16 ***
## expend         4.4626    10.5465   0.423    0.674    
## ratio         -3.6242     3.2154  -1.127    0.266    
## salary         1.6379     2.3872   0.686    0.496    
## takers        -2.9045     0.2313 -12.559 2.61e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared:  0.8246, Adjusted R-squared:  0.809 
## F-statistic: 52.88 on 4 and 45 DF,  p-value: < 2.2e-16

Because ‘expend’ is the coefficient with the highest p-value greater than 0.10 we will eliminate it from the model.

fit2 = update(fit, .~. - expend)
summary(fit2)
## 
## Call:
## lm(formula = total ~ ratio + salary + takers, data = sat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -89.244 -21.485  -0.798  17.685  68.262 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1057.8982    44.3287  23.865   <2e-16 ***
## ratio         -4.6394     2.1215  -2.187   0.0339 *  
## salary         2.5525     1.0045   2.541   0.0145 *  
## takers        -2.9134     0.2282 -12.764   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.41 on 46 degrees of freedom
## Multiple R-squared:  0.8239, Adjusted R-squared:  0.8124 
## F-statistic: 71.72 on 3 and 46 DF,  p-value: < 2.2e-16

All of the remaining coefficients have a p-value less than 0.10, so the ideal model contains the coefficients ‘ratio’, ‘salary’, and ‘takers.’

Part g

fit_back_aic = step(fit, direction = "backward")
## Start:  AIC=353.48
## total ~ expend + ratio + salary + takers
## 
##          Df Sum of Sq    RSS    AIC
## - expend  1       191  48315 351.67
## - salary  1       503  48627 352.00
## - ratio   1      1359  49483 352.87
## <none>                 48124 353.48
## - takers  1    168688 216812 426.74
## 
## Step:  AIC=351.67
## total ~ ratio + salary + takers
## 
##          Df Sum of Sq    RSS    AIC
## <none>                 48315 351.67
## - ratio   1      5023  53338 354.62
## - salary  1      6782  55097 356.24
## - takers  1    171126 219441 425.34
summary(fit_back_aic)
## 
## Call:
## lm(formula = total ~ ratio + salary + takers, data = sat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -89.244 -21.485  -0.798  17.685  68.262 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1057.8982    44.3287  23.865   <2e-16 ***
## ratio         -4.6394     2.1215  -2.187   0.0339 *  
## salary         2.5525     1.0045   2.541   0.0145 *  
## takers        -2.9134     0.2282 -12.764   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.41 on 46 degrees of freedom
## Multiple R-squared:  0.8239, Adjusted R-squared:  0.8124 
## F-statistic: 71.72 on 3 and 46 DF,  p-value: < 2.2e-16

The Backward AIC Elimination yielded that the best model includes ‘ratio’, ‘salary’, and ‘takers.’ The AIC = 351.67

Part h

fit_start = lm(total ~ 1, data = sat)
fit_forw_aic = step(fit_start,
                    total ~ expend + ratio + salary + takers,
                    direction = "forward")
## Start:  AIC=432.5
## total ~ 1
## 
##          Df Sum of Sq    RSS    AIC
## + takers  1    215875  58433 357.18
## + salary  1     53078 221230 423.75
## + expend  1     39722 234586 426.68
## <none>                274308 432.50
## + ratio   1      1811 272497 434.17
## 
## Step:  AIC=357.18
## total ~ takers
## 
##          Df Sum of Sq   RSS    AIC
## + expend  1    8913.1 49520 350.91
## + salary  1    5094.8 53338 354.62
## + ratio   1    3336.2 55097 356.24
## <none>                58433 357.18
## 
## Step:  AIC=350.91
## total ~ takers + expend
## 
##          Df Sum of Sq   RSS    AIC
## <none>                49520 350.91
## + ratio   1    892.74 48627 352.00
## + salary  1     37.52 49483 352.87
summary(fit_forw_aic)
## 
## Call:
## lm(formula = total ~ takers + expend, data = sat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -88.400 -22.884   1.968  19.142  68.755 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 993.8317    21.8332  45.519  < 2e-16 ***
## takers       -2.8509     0.2151 -13.253  < 2e-16 ***
## expend       12.2865     4.2243   2.909  0.00553 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.46 on 47 degrees of freedom
## Multiple R-squared:  0.8195, Adjusted R-squared:  0.8118 
## F-statistic: 106.7 on 2 and 47 DF,  p-value: < 2.2e-16

When using the forward AIC elimination, we see that the best model has predictors ‘takers’, and ‘expend.’ AIC = 350.91.

Part i

Given the two methods of elimination, we will choose the model with predictors ‘takers’, and ‘expend.’ This is because the AIC for the Forward AIC method is less than backward AIC method. -> 350.91 < 351.67.

Part j

fit.g = lm(total ~ ratio + salary + takers, data = sat)
fit.h = lm(total ~ expend + takers, data = sat)
summary(fit.g)$adj.r.squared
## [1] 0.8123772
summary(fit.h)$adj.r.squared
## [1] 0.8117906

The model in Part g which uses predictors ‘ratio’, ‘salary’, and ‘takers’ is the best model when adjusted R squared. 0.8123772 > 0.8117906