library(faraway)
## Warning: package 'faraway' was built under R version 3.6.3
data(sat)
sat
## expend ratio salary takers verbal math total
## Alabama 4.405 17.2 31.144 8 491 538 1029
## Alaska 8.963 17.6 47.951 47 445 489 934
## Arizona 4.778 19.3 32.175 27 448 496 944
## Arkansas 4.459 17.1 28.934 6 482 523 1005
## California 4.992 24.0 41.078 45 417 485 902
## Colorado 5.443 18.4 34.571 29 462 518 980
## Connecticut 8.817 14.4 50.045 81 431 477 908
## Delaware 7.030 16.6 39.076 68 429 468 897
## Florida 5.718 19.1 32.588 48 420 469 889
## Georgia 5.193 16.3 32.291 65 406 448 854
## Hawaii 6.078 17.9 38.518 57 407 482 889
## Idaho 4.210 19.1 29.783 15 468 511 979
## Illinois 6.136 17.3 39.431 13 488 560 1048
## Indiana 5.826 17.5 36.785 58 415 467 882
## Iowa 5.483 15.8 31.511 5 516 583 1099
## Kansas 5.817 15.1 34.652 9 503 557 1060
## Kentucky 5.217 17.0 32.257 11 477 522 999
## Louisiana 4.761 16.8 26.461 9 486 535 1021
## Maine 6.428 13.8 31.972 68 427 469 896
## Maryland 7.245 17.0 40.661 64 430 479 909
## Massachusetts 7.287 14.8 40.795 80 430 477 907
## Michigan 6.994 20.1 41.895 11 484 549 1033
## Minnesota 6.000 17.5 35.948 9 506 579 1085
## Mississippi 4.080 17.5 26.818 4 496 540 1036
## Missouri 5.383 15.5 31.189 9 495 550 1045
## Montana 5.692 16.3 28.785 21 473 536 1009
## Nebraska 5.935 14.5 30.922 9 494 556 1050
## Nevada 5.160 18.7 34.836 30 434 483 917
## New Hampshire 5.859 15.6 34.720 70 444 491 935
## New Jersey 9.774 13.8 46.087 70 420 478 898
## New Mexico 4.586 17.2 28.493 11 485 530 1015
## New York 9.623 15.2 47.612 74 419 473 892
## North Carolina 5.077 16.2 30.793 60 411 454 865
## North Dakota 4.775 15.3 26.327 5 515 592 1107
## Ohio 6.162 16.6 36.802 23 460 515 975
## Oklahoma 4.845 15.5 28.172 9 491 536 1027
## Oregon 6.436 19.9 38.555 51 448 499 947
## Pennsylvania 7.109 17.1 44.510 70 419 461 880
## Rhode Island 7.469 14.7 40.729 70 425 463 888
## South Carolina 4.797 16.4 30.279 58 401 443 844
## South Dakota 4.775 14.4 25.994 5 505 563 1068
## Tennessee 4.388 18.6 32.477 12 497 543 1040
## Texas 5.222 15.7 31.223 47 419 474 893
## Utah 3.656 24.3 29.082 4 513 563 1076
## Vermont 6.750 13.8 35.406 68 429 472 901
## Virginia 5.327 14.6 33.987 65 428 468 896
## Washington 5.906 20.2 36.151 48 443 494 937
## West Virginia 6.107 14.8 31.944 17 448 484 932
## Wisconsin 6.930 15.9 37.746 9 501 572 1073
## Wyoming 6.160 14.9 31.285 10 476 525 1001
fit = lm(total ~ expend + ratio + salary + takers, data = sat)
plot(fitted(fit), resid(fit), col = "blue")
abline(h = 0, col = "orange", lwd = 2)
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.6.3
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(fit)
##
## studentized Breusch-Pagan test
##
## data: fit
## BP = 2.1587, df = 4, p-value = 0.7066
We do not reject a constant variance assumption for our model given the result of the BP-Test and a pattern of constant variance in our graph.
hist(resid(fit))
qqnorm(resid(fit))
qqline(resid(fit))
shapiro.test(resid(fit))
##
## Shapiro-Wilk normality test
##
## data: resid(fit)
## W = 0.97691, p-value = 0.4304
We do not reject a normality assumption of the model given the p value of the Shapiro test and the evidence of normality on both the histogram and the quaartile plot.
plot(hatvalues(fit), rstandard(fit))
n = 50; p = 5; lim = 2 * p / n
hatvalues(fit)[hatvalues(fit) > lim]
## California Connecticut New Jersey Utah
## 0.2821179 0.2254519 0.2220978 0.2921128
Large leverage points in the model include California, Connecticut, New Jersey, and Utah.
hatvalues(fit)[abs(rstandard(fit)) >2]
## New Hampshire North Dakota Utah West Virginia
## 0.08283540 0.08104629 0.29211280 0.06206536
Outliers in the model include New Hampshire, North Dakota, Utah, and West Virginia.
cook = cooks.distance(fit)
cook[cook > 4/50]
## Utah West Virginia
## 0.4715287 0.1081395
Influential points include Utah and West Virginia.
summary(fit)
##
## Call:
## lm(formula = total ~ expend + ratio + salary + takers, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.531 -20.855 -1.746 15.979 66.571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1045.9715 52.8698 19.784 < 2e-16 ***
## expend 4.4626 10.5465 0.423 0.674
## ratio -3.6242 3.2154 -1.127 0.266
## salary 1.6379 2.3872 0.686 0.496
## takers -2.9045 0.2313 -12.559 2.61e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared: 0.8246, Adjusted R-squared: 0.809
## F-statistic: 52.88 on 4 and 45 DF, p-value: < 2.2e-16
Because ‘expend’ is the coefficient with the highest p-value greater than 0.10 we will eliminate it from the model.
fit2 = update(fit, .~. - expend)
summary(fit2)
##
## Call:
## lm(formula = total ~ ratio + salary + takers, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -89.244 -21.485 -0.798 17.685 68.262
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1057.8982 44.3287 23.865 <2e-16 ***
## ratio -4.6394 2.1215 -2.187 0.0339 *
## salary 2.5525 1.0045 2.541 0.0145 *
## takers -2.9134 0.2282 -12.764 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.41 on 46 degrees of freedom
## Multiple R-squared: 0.8239, Adjusted R-squared: 0.8124
## F-statistic: 71.72 on 3 and 46 DF, p-value: < 2.2e-16
All of the remaining coefficients have a p-value less than 0.10, so the ideal model contains the coefficients ‘ratio’, ‘salary’, and ‘takers.’
fit_back_aic = step(fit, direction = "backward")
## Start: AIC=353.48
## total ~ expend + ratio + salary + takers
##
## Df Sum of Sq RSS AIC
## - expend 1 191 48315 351.67
## - salary 1 503 48627 352.00
## - ratio 1 1359 49483 352.87
## <none> 48124 353.48
## - takers 1 168688 216812 426.74
##
## Step: AIC=351.67
## total ~ ratio + salary + takers
##
## Df Sum of Sq RSS AIC
## <none> 48315 351.67
## - ratio 1 5023 53338 354.62
## - salary 1 6782 55097 356.24
## - takers 1 171126 219441 425.34
summary(fit_back_aic)
##
## Call:
## lm(formula = total ~ ratio + salary + takers, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -89.244 -21.485 -0.798 17.685 68.262
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1057.8982 44.3287 23.865 <2e-16 ***
## ratio -4.6394 2.1215 -2.187 0.0339 *
## salary 2.5525 1.0045 2.541 0.0145 *
## takers -2.9134 0.2282 -12.764 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.41 on 46 degrees of freedom
## Multiple R-squared: 0.8239, Adjusted R-squared: 0.8124
## F-statistic: 71.72 on 3 and 46 DF, p-value: < 2.2e-16
The Backward AIC Elimination yielded that the best model includes ‘ratio’, ‘salary’, and ‘takers.’ The AIC = 351.67
fit_start = lm(total ~ 1, data = sat)
fit_forw_aic = step(fit_start,
total ~ expend + ratio + salary + takers,
direction = "forward")
## Start: AIC=432.5
## total ~ 1
##
## Df Sum of Sq RSS AIC
## + takers 1 215875 58433 357.18
## + salary 1 53078 221230 423.75
## + expend 1 39722 234586 426.68
## <none> 274308 432.50
## + ratio 1 1811 272497 434.17
##
## Step: AIC=357.18
## total ~ takers
##
## Df Sum of Sq RSS AIC
## + expend 1 8913.1 49520 350.91
## + salary 1 5094.8 53338 354.62
## + ratio 1 3336.2 55097 356.24
## <none> 58433 357.18
##
## Step: AIC=350.91
## total ~ takers + expend
##
## Df Sum of Sq RSS AIC
## <none> 49520 350.91
## + ratio 1 892.74 48627 352.00
## + salary 1 37.52 49483 352.87
summary(fit_forw_aic)
##
## Call:
## lm(formula = total ~ takers + expend, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.400 -22.884 1.968 19.142 68.755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 993.8317 21.8332 45.519 < 2e-16 ***
## takers -2.8509 0.2151 -13.253 < 2e-16 ***
## expend 12.2865 4.2243 2.909 0.00553 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.46 on 47 degrees of freedom
## Multiple R-squared: 0.8195, Adjusted R-squared: 0.8118
## F-statistic: 106.7 on 2 and 47 DF, p-value: < 2.2e-16
When using the forward AIC elimination, we see that the best model has predictors ‘takers’, and ‘expend.’ AIC = 350.91.
Given the two methods of elimination, we will choose the model with predictors ‘takers’, and ‘expend.’ This is because the AIC for the Forward AIC method is less than backward AIC method. -> 350.91 < 351.67.
fit.g = lm(total ~ ratio + salary + takers, data = sat)
fit.h = lm(total ~ expend + takers, data = sat)
summary(fit.g)$adj.r.squared
## [1] 0.8123772
summary(fit.h)$adj.r.squared
## [1] 0.8117906
The model in Part g which uses predictors ‘ratio’, ‘salary’, and ‘takers’ is the best model when adjusted R squared. 0.8123772 > 0.8117906