## expend ratio salary takers
## Min. :3.656 Min. :13.80 Min. :25.99 Min. : 4.00
## 1st Qu.:4.882 1st Qu.:15.22 1st Qu.:30.98 1st Qu.: 9.00
## Median :5.768 Median :16.60 Median :33.29 Median :28.00
## Mean :5.905 Mean :16.86 Mean :34.83 Mean :35.24
## 3rd Qu.:6.434 3rd Qu.:17.57 3rd Qu.:38.55 3rd Qu.:63.00
## Max. :9.774 Max. :24.30 Max. :50.05 Max. :81.00
## verbal math total
## Min. :401.0 Min. :443.0 Min. : 844.0
## 1st Qu.:427.2 1st Qu.:474.8 1st Qu.: 897.2
## Median :448.0 Median :497.5 Median : 945.5
## Mean :457.1 Mean :508.8 Mean : 965.9
## 3rd Qu.:490.2 3rd Qu.:539.5 3rd Qu.:1032.0
## Max. :516.0 Max. :592.0 Max. :1107.0
## 'data.frame': 50 obs. of 7 variables:
## $ expend: num 4.41 8.96 4.78 4.46 4.99 ...
## $ ratio : num 17.2 17.6 19.3 17.1 24 18.4 14.4 16.6 19.1 16.3 ...
## $ salary: num 31.1 48 32.2 28.9 41.1 ...
## $ takers: int 8 47 27 6 45 29 81 68 48 65 ...
## $ verbal: int 491 445 448 482 417 462 431 429 420 406 ...
## $ math : int 538 489 496 523 485 518 477 468 469 448 ...
## $ total : int 1029 934 944 1005 902 980 908 897 889 854 ...
##
## Call:
## lm(formula = "total ~ expend + salary + ratio + takers", data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.531 -20.855 -1.746 15.979 66.571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1045.9715 52.8698 19.784 < 2e-16 ***
## expend 4.4626 10.5465 0.423 0.674
## salary 1.6379 2.3872 0.686 0.496
## ratio -3.6242 3.2154 -1.127 0.266
## takers -2.9045 0.2313 -12.559 2.61e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared: 0.8246, Adjusted R-squared: 0.809
## F-statistic: 52.88 on 4 and 45 DF, p-value: < 2.2e-16
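The preceding output is consistent with calls to summary(), str(), and a full linear model fit, but the generating chunks are not echoed. The following is a sketch of that setup, assuming the sat data comes from the faraway package and that the fitted model is stored as lmod1 (the name used in the diagnostics below).
# sketch of the non-echoed setup (faraway is assumed to supply the sat data)
library(faraway)
data(sat)
summary(sat)    # numeric summaries shown above
str(sat)        # structure shown above
lmod1 <- lm(formula = "total ~ expend + salary + ratio + takers", data = sat)
summary(lmod1)  # model summary shown above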
plot(y=sat$total, x=sat$takers, ylab="Total score", xlab="Takers", main="SAT Total score vs. takers, state by state")
abline(lm(sat$total ~ sat$takers))
While the meaning of each data element is not defined, the values for “takers” range from a low of 4 in Mississippi and Utah to a high of 80 and 81 in Massachusetts and Connecticut, respectively. Perhaps this is the percentage of students in each state who take the SAT. The higher scores in states where very few students take the test suggest that only the “better” students in those states, the ones seriously considering applying to more selective colleges that may require the SAT, sit for the exam, producing a high state-wide average. Conversely, the lower average scores in “high taker” states such as Massachusetts and Connecticut suggest that a broader group of students, including many who are not applying to such colleges, take the test, pulling the state-wide average down.
Perform regression diagnostics on this model to answer the following questions.
Display any plots that are relevant.
Do not provide any plots about which you have nothing to say.
Suggest possible improvements or corrections to the model where appropriate.
plot(fitted(lmod1), residuals(lmod1), xlab="Fitted", ylab="Residuals", main="SAT: Fitted vs. Residuals")
abline(h=0)
The question is whether the error variance is constant across the range of fitted values, or whether it differs between the two groups of states (fitted values above and below 950) that are compared below.
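A sketch of the (non-echoed) calls consistent with the output that follows: regressions of the absolute and square-root-absolute residuals on the fitted values, and an F test comparing residual variances for fitted values above and below 950 (the 950 split is taken from the output itself).
# regress |residuals| and sqrt(|residuals|) on fitted values to look for a trend in spread
summary(lm(abs(residuals(lmod1)) ~ fitted(lmod1)))
summary(lm(sqrt(abs(residuals(lmod1))) ~ fitted(lmod1)))
# compare residual variances for high vs. low fitted values
var.test(residuals(lmod1)[fitted(lmod1) > 950], residuals(lmod1)[fitted(lmod1) < 950])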
##
## Call:
## lm(formula = abs(residuals(lmod1)) ~ fitted(lmod1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.827 -14.323 -5.001 6.204 65.754
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.60384 40.82315 0.260 0.796
## fitted(lmod1) 0.01386 0.04216 0.329 0.744
##
## Residual standard error: 20.05 on 48 degrees of freedom
## Multiple R-squared: 0.002247, Adjusted R-squared: -0.01854
## F-statistic: 0.1081 on 1 and 48 DF, p-value: 0.7438
##
## Call:
## lm(formula = sqrt(abs(residuals(lmod1))) ~ fitted(lmod1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3097 -1.2366 -0.2234 0.9929 5.0337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6484524 4.0660807 1.143 0.259
## fitted(lmod1) -0.0001637 0.0041994 -0.039 0.969
##
## Residual standard error: 1.997 on 48 degrees of freedom
## Multiple R-squared: 3.166e-05, Adjusted R-squared: -0.0208
## F-statistic: 0.00152 on 1 and 48 DF, p-value: 0.9691
##
## F test to compare two variances
##
## data: residuals(lmod1)[fitted(lmod1) > 950] and residuals(lmod1)[fitted(lmod1) < 950]
## F = 1.651, num df = 27, denom df = 21, p-value = 0.2419
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.7071032 3.6928150
## sample estimates:
## ratio of variances
## 1.65098
The outlier at the bottom left is West Virginia, with a residual of -90.531.
The three points at the top right are Utah, New Hampshire, and North Dakota, all of which have residuals between 65.7 and 66.6.
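A sketch of how these points can be identified from the residuals themselves (rather than interactively on the plot):
# most negative residual and the three largest, with state names
head(sort(residuals(lmod1)), 1)
tail(sort(residuals(lmod1)), 3)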
The graph does not appear to show evidence of lack of Normality.
The text warns that histograms may not be suitable. While the above graph roughly resembles the desired bell-shaped curve, the fact that there are only 50 data points makes it difficult to choose appropriate bins for a histogram.
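The plots referred to above are not reproduced here; assuming they were a Q-Q plot and a histogram of the residuals, a sketch of the calls that would draw them, followed by the Shapiro-Wilk test whose output appears below:
# Q-Q plot of the residuals with a reference line
qqnorm(residuals(lmod1), main = "SAT model: Normal Q-Q plot of residuals")
qqline(residuals(lmod1))
# histogram of the residuals (only 50 observations, so bins are coarse)
hist(residuals(lmod1), xlab = "Residuals", main = "SAT model: residual histogram")
# formal test of Normality
shapiro.test(residuals(lmod1))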
##
## Shapiro-Wilk normality test
##
## data: residuals(lmod1)
## W = 0.97691, p-value = 0.4304
Because the p-value is high, we fail to reject the null hypothesis of Normality.
Even if the residuals had failed the test, the text notes that
“the effects of non-normality are mitigated by large sample sizes. For smaller sample sizes, formal tests lack power.”
According to the text,
“A leverage point is extreme in the predictor space. It has the potential to influence the fit, but does not necessarily do so.”
“The value of \(h_i\) depends only on \(X\) and not \(y\) so leverages contain only partial information about a case.”
\(\hat{y} = X(X^TX)^{-1}X^Ty = Hy\), where \(H = X(X^TX)^{-1}X^T\) is the “hat matrix”.
\(\hat{\epsilon} = (I-H)\epsilon\)
\(var[\hat{\epsilon}] = var[(I-H)\epsilon] = (I-H)\sigma^2\), assuming that \(var[\epsilon] = \sigma^2 I\).
\(h_i = H_{ii}\) are called leverages. Since \(var[\hat{\epsilon}_i] = (1-h_i)\sigma^2\), a large leverage \(h_i\) makes \(var[\hat{\epsilon}_i]\) small.
Since \(\sum_i{h_i} = p\), an average value for \(h_i\) is \(\frac{p}{n}\).
A rough rule is that leverages of more than \(\frac{2p}{n}\) should be looked at more closely.
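A sketch of the leverage computation consistent with the output below: the number of parameters p, the 2p/n threshold, the leverages that exceed it, the corresponding residuals, and the data for those states.
# leverages (diagonal of the hat matrix) and the 2p/n rule of thumb
hatv <- hatvalues(lmod1)
p <- length(coef(lmod1))
n <- nrow(sat)
p               # number of parameters
2 * p / n       # rule-of-thumb threshold
hatv[hatv > 2 * p / n]              # high-leverage states
residuals(lmod1)[hatv > 2 * p / n]  # their residuals
sat[hatv > 2 * p / n, ]             # their data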
## [1] 5
## [1] 0.2
## California Connecticut New Jersey Utah
## 0.2821179 0.2254519 0.2220978 0.2921128
## California Connecticut New Jersey Utah
## -15.84797 28.16407 -13.74758 65.76608
## expend ratio salary takers verbal math total
## California 4.992 24.0 41.078 45 417 485 902
## Connecticut 8.817 14.4 50.045 81 431 477 908
## New Jersey 9.774 13.8 46.087 70 420 478 898
## Utah 3.656 24.3 29.082 4 513 563 1076
Utah and California have the highest leverages.
The next two (unlabeled) points in the upper right are Connecticut and New Jersey.
These leverages do not appear extreme enough, on their own, to substantially affect the fit.
“Some observations do not fit the model well. Such points are called outliers.”
The method of “studentized residuals” refits the model with each observation excluded one at a time, so that each residual is scaled by an estimate of the error standard deviation computed without that observation:
stud = rstudent(lmod1)
orderedstud = stud[order(stud)]
# smallest studentized residuals
t(t(head(orderedstud,3)))
## [,1]
## West Virginia -3.124428
## Nevada -1.732004
## South Carolina -1.468832
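The next block of output shows the three largest studentized residuals; a sketch of the (non-echoed) call that would produce it:
# largest studentized residuals
t(t(tail(orderedstud, 3)))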
## [,1]
## New Hampshire 2.190006
## North Dakota 2.213686
## Utah 2.529587
Because there are 50 data points and 5 model parameters, the studentized residuals have 44 degrees of freedom. The Bonferroni critical value is -3.5258013.
Because the studentized residuals above all lie within the bounds of \(\pm 3.5258013\), these points are not considered outliers.
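A sketch of the Bonferroni critical value computation (n = 50 observations, p = 5 parameters, so n - p - 1 = 44 degrees of freedom):
# Bonferroni-corrected critical value for the studentized residuals
n <- nrow(sat)
p <- length(coef(lmod1))
qt(0.05 / (2 * n), n - p - 1)   # approximately -3.526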
“Other observations change the fit of the model in a substantive manner. These are called influential observations.”
“An influential point is one whose removal from the dataset would cause a large change in the fit. An influential point may or may not be an outlier and may or may not have large leverage but it will tend to have at least one of these two properties.”
Cook’s Distance (“Cook’s D”) is a common way to identify influential points.
Utah has the largest Cook’s D.
# Cook's distances for each observation
cook = cooks.distance(lmod1)
# sort the Cook's distance results
ordercook = cook[order(cook)]
# which are the largest?
t(t(tail(ordercook)))
## [,1]
## Alaska 0.04581008
## Connecticut 0.05574806
## North Dakota 0.07954292
## New Hampshire 0.07989442
## West Virginia 0.10813954
## Utah 0.47152866
# Drop Utah from the data set and re-fit:
sat_exUtah <- subset(sat, rownames(sat) != "Utah")
lmod_exUtah <- lm(formula = "total ~ expend + salary + ratio + takers", data = sat_exUtah)
summary(lmod_exUtah)
##
## Call:
## lm(formula = "total ~ expend + salary + ratio + takers", data = sat_exUtah)
##
## Residuals:
## Min 1Q Median 3Q Max
## -92.118 -18.402 1.808 14.890 67.669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1093.8460 53.4226 20.475 <2e-16 ***
## expend -0.9427 10.1922 -0.092 0.927
## salary 3.0964 2.3283 1.330 0.190
## ratio -7.6391 3.4279 -2.229 0.031 *
## takers -2.9308 0.2188 -13.397 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.9 on 44 degrees of freedom
## Multiple R-squared: 0.8396, Adjusted R-squared: 0.825
## F-statistic: 57.58 on 4 and 44 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = "total ~ expend + salary + ratio + takers", data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.531 -20.855 -1.746 15.979 66.571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1045.9715 52.8698 19.784 < 2e-16 ***
## expend 4.4626 10.5465 0.423 0.674
## salary 1.6379 2.3872 0.686 0.496
## ratio -3.6242 3.2154 -1.127 0.266
## takers -2.9045 0.2313 -12.559 2.61e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared: 0.8246, Adjusted R-squared: 0.809
## F-statistic: 52.88 on 4 and 45 DF, p-value: < 2.2e-16
# How much does dropping Utah from the dataset change the means of each variable?
# means of each variable, including Utah
colMeans(sat)
## expend ratio salary takers verbal math total
## 5.90526 16.85800 34.82892 35.24000 457.14000 508.78000 965.92000
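The next two output blocks (the means excluding Utah, and Utah’s own row) are consistent with the following calls; a sketch:
# means of each variable, excluding Utah
colMeans(sat_exUtah)
# Utah's row of the data
sat["Utah", ]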
## expend ratio salary takers verbal math
## 5.951163 16.706122 34.946204 35.877551 456.000000 507.673469
## total
## 963.673469
## expend ratio salary takers verbal math total
## Utah 3.656 24.3 29.082 4 513 563 1076
Utah has the lowest number of “takers” and one of the highest values for “total.”
Omitting it from the data causes “ratio” to become significant at 95% confidence.
The states are indexed alphabetically. Utah is index 44.
There are substantial changes to the betas when Utah is excluded from the analysis.
Other states:
# compute the influence measures once and keep the result
infl <- influence.measures(lmod1)
# rows flagged as influential on at least one measure
infl$is.inf[rowSums(infl$is.inf) > 0, ]
## dfb.1_ dfb.expn dfb.slry dfb.rati dfb.tkrs dffit cov.r
## California FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## New Jersey FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## New York FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## Utah FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## West Virginia FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## cook.d hat
## California FALSE FALSE
## New Jersey FALSE FALSE
## New York FALSE FALSE
## Utah FALSE FALSE
## West Virginia FALSE FALSE
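The numeric values of these influence measures, shown next, are consistent with the following call on the infl object defined above; a sketch:
# numeric influence measures for the flagged states
infl$infmat[rowSums(infl$is.inf) > 0, ]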
## dfb.1_ dfb.expn dfb.slry dfb.rati dfb.tkrs
## California 0.17494825 0.11879268 -0.12778683 -0.11353062 -0.037695710
## New Jersey 0.08810643 -0.16153804 0.09430634 -0.05161047 0.001040469
## New York 0.03944191 -0.04697844 0.02374747 -0.02265524 -0.003107751
## Utah -0.95829725 0.54239648 -0.64657132 1.32142482 0.120462842
## West Virginia -0.24200911 -0.30021099 0.25200030 0.10725005 0.320984440
## dffit cov.r cook.d hat
## California -0.35584939 1.5028022 0.025713040 0.28211791
## New Jersey -0.25247497 1.4024378 0.012972641 0.22209778
## New York -0.07991023 1.3798444 0.001305355 0.19157520
## Utah 1.62496100 0.8016725 0.471528660 0.29211280
## West Virginia -0.80372769 0.4380370 0.108139543 0.06206536
The flatness of the lines associated with three of the predictors (expend, salary, and ratio) reflects their lack of significance.
The model may be improved by dropping them.
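A sketch of one possible follow-up: per-predictor plots (termplot is offered only as a plausible source of the “lines” mentioned above) and a reduced model keeping only takers, compared against the full fit. The name lmod_takers is introduced here for illustration.
# component plots for each predictor, with partial residuals overlaid
termplot(lmod1, partial.resid = TRUE)
# reduced model with expend, salary and ratio dropped
lmod_takers <- lm(total ~ takers, data = sat)
summary(lmod_takers)
# joint F test: do expend, salary and ratio add anything beyond takers?
anova(lmod_takers, lmod1)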