Chapter 6, Diagnostics, Exercise 6.1, Page 97

6.1 Using the sat dataset, fit a model with the total SAT score as the response and expend, salary, ratio and takers as predictors.

library(faraway)
data(sat)
summary(sat)

##      expend          ratio           salary          takers     
##  Min.   :3.656   Min.   :13.80   Min.   :25.99   Min.   : 4.00  
##  1st Qu.:4.882   1st Qu.:15.22   1st Qu.:30.98   1st Qu.: 9.00  
##  Median :5.768   Median :16.60   Median :33.29   Median :28.00  
##  Mean   :5.905   Mean   :16.86   Mean   :34.83   Mean   :35.24  
##  3rd Qu.:6.434   3rd Qu.:17.57   3rd Qu.:38.55   3rd Qu.:63.00  
##  Max.   :9.774   Max.   :24.30   Max.   :50.05   Max.   :81.00  
##      verbal           math           total       
##  Min.   :401.0   Min.   :443.0   Min.   : 844.0  
##  1st Qu.:427.2   1st Qu.:474.8   1st Qu.: 897.2  
##  Median :448.0   Median :497.5   Median : 945.5  
##  Mean   :457.1   Mean   :508.8   Mean   : 965.9  
##  3rd Qu.:490.2   3rd Qu.:539.5   3rd Qu.:1032.0  
##  Max.   :516.0   Max.   :592.0   Max.   :1107.0

str(sat)

## 'data.frame':    50 obs. of  7 variables:
##  $ expend: num  4.41 8.96 4.78 4.46 4.99 ...
##  $ ratio : num  17.2 17.6 19.3 17.1 24 18.4 14.4 16.6 19.1 16.3 ...
##  $ salary: num  31.1 48 32.2 28.9 41.1 ...
##  $ takers: int  8 47 27 6 45 29 81 68 48 65 ...
##  $ verbal: int  491 445 448 482 417 462 431 429 420 406 ...
##  $ math  : int  538 489 496 523 485 518 477 468 469 448 ...
##  $ total : int  1029 934 944 1005 902 980 908 897 889 854 ...

lmod1 = lm(formula = "total ~ expend + salary + ratio + takers", data=sat)
summary(lmod1)

## 
## Call:
## lm(formula = "total ~ expend + salary + ratio + takers", data = sat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.531 -20.855  -1.746  15.979  66.571 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1045.9715    52.8698  19.784  < 2e-16 ***
## expend         4.4626    10.5465   0.423    0.674    
## salary         1.6379     2.3872   0.686    0.496    
## ratio         -3.6242     3.2154  -1.127    0.266    
## takers        -2.9045     0.2313 -12.559 2.61e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared:  0.8246, Adjusted R-squared:  0.809 
## F-statistic: 52.88 on 4 and 45 DF,  p-value: < 2.2e-16

The results of this regression indicate that the variables expend, salary, and ratio are not statistically significant.

The only significant variable is takers, which has a strong negative association with the state-by-state total SAT score.

plot(y=sat$total, x=sat$takers, ylab="Total score", xlab="Takers", main="SAT Total score vs. takers, state by state")
abline(lm(sat$total ~ sat$takers))

While the meaning of each data element is not defined, the values for “takers” range from a low of 4 in Mississippi and Utah to a high of 80 and 81 in Massachusetts and Connecticut, respectively. Perhaps this is the percentage of students in each state who take the SAT. The higher scores in states where it appears that very few students take the test suggest that only those “better” students (in such states) who are seriously considering making application to more prestigious colleges (which may require that candidates take the SAT) sit for the exam, resulting in a high state-wide average. On the other hand, the lower average scores in “high taker” states such as Massachusetts and Connecticut suggests that a wider group of students, including many who are not candidates for prestigious colleges, are taking the test, bringing down their statewide average score.

Perform regression diagnostics on this model to answer the following questions.
Display any plots that are relevant.
Do not provide any plots about which you have nothing to say.
Suggest possible improvements or corrections to the model where appropriate.

(a) Check the constant variance assumption for the errors.

plot(fitted(lmod1),residuals(lmod1),xlab="Fitted",ylab="|Residuals|",main = "SAT: Fitted vs. Residuals")
abline(h=0)

plot(fitted(lmod1),abs(residuals(lmod1)),xlab="Fitted",ylab="|Residuals|",main = "SAT: Fitted vs. abs(Residuals)")

Clearly there are two clusters among the fitted scores: those states where the total is below 950, and those where the total is above.

The question is, are the variances the same across each of the two groups, or do the variances differ?

Does a square root transformation help?

plot(fitted(lmod1),sqrt(abs(residuals(lmod1))), xlab="Fitted",ylab=expression(sqrt(hat(epsilon))),main="SAT: fittted vs. sqrt(residuals)")

Does a regression of the absolute values of the residuals on the fitted values indicate non-constant variance?

lmod2 = lm(abs(residuals(lmod1)) ~ fitted(lmod1) )
summary(lmod2)

## 
## Call:
## lm(formula = abs(residuals(lmod1)) ~ fitted(lmod1))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.827 -14.323  -5.001   6.204  65.754 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   10.60384   40.82315   0.260    0.796
## fitted(lmod1)  0.01386    0.04216   0.329    0.744
## 
## Residual standard error: 20.05 on 48 degrees of freedom
## Multiple R-squared:  0.002247,   Adjusted R-squared:  -0.01854 
## F-statistic: 0.1081 on 1 and 48 DF,  p-value: 0.7438

What about a regression of the square root of the absolute values of the residuals on the fitted values??

lmod3 = lm(sqrt(abs(residuals(lmod1))) ~ fitted(lmod1) )
summary(lmod3)

## 
## Call:
## lm(formula = sqrt(abs(residuals(lmod1))) ~ fitted(lmod1))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3097 -1.2366 -0.2234  0.9929  5.0337 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    4.6484524  4.0660807   1.143    0.259
## fitted(lmod1) -0.0001637  0.0041994  -0.039    0.969
## 
## Residual standard error: 1.997 on 48 degrees of freedom
## Multiple R-squared:  3.166e-05,  Adjusted R-squared:  -0.0208 
## F-statistic: 0.00152 on 1 and 48 DF,  p-value: 0.9691

In both of the above two cases, the high p-values indicate that we fail to reject non-constant variance.

plot(lmod2)

plot(lmod3)

Is there a difference between the variances of the residuals among those states where the fitted value is below 950, vs, above 950?

var.test(residuals(lmod1)[fitted(lmod1)>950],
         residuals(lmod1)[fitted(lmod1)<950])

## 
##  F test to compare two variances
## 
## data:  residuals(lmod1)[fitted(lmod1) > 950] and residuals(lmod1)[fitted(lmod1) < 950]
## F = 1.651, num df = 27, denom df = 21, p-value = 0.2419
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.7071032 3.6928150
## sample estimates:
## ratio of variances 
##            1.65098

Because of the high p-value we fail to reject the hypothesis that the variances differ between groups.

(Additionally note that the 95-percent CI contains 1.0).

(b) Check the normality assumption.

Check normality using a QQ-plot:

qqnorm(residuals(lmod1),ylab="Residuals",main="")
qqline(residuals(lmod1))

The outlier at the bottom left is West Virginia, -90.531 .

The three items at the top right are Utah, New Hampshire, and North Dakota, all of which have residuals between 65.7 and 66.6 .

The graph does not appear to show evidence of lack of Normality.

Check for Normality using a histogram

hist(residuals(lmod1),xlab="Residuals",main="", breaks = 10)

The text warns that histographs may not be suitable. WHile the above graph slightly resembles the desired bell-shaped curve, the fact that there are only 50 data points makes it difficult to have appropriate bins for a histogram.

Shapiro-Wilk test for Normality

shapiro.test(residuals(lmod1))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(lmod1)
## W = 0.97691, p-value = 0.4304

Because the p-value is high, we fail to reject the null hypothesis of Normality.
Even if the test were to fail, the text notes that
“the effects of non-normality are mitigated by large sample sizes. For smaller sample sizes, formal tests lack power.”

(c) Check for large leverage points.

According to the text,

“A leverage point is extreme in the predictor space. It has the potential to influence the fit, but does not necessarily do so.”
“The value of \(h_i\) depends only on \(X\) and not \(y\) so leverages contain only partial information about a case.”

\(\hat{y} = X(X^TX)^{-1}X^Ty = Hy\) , where \(H = X(X^TX)^{-1}X^T\) is the “hat matrix” .

\(\hat{\epsilon} = (I-H)\epsilon\)

\(var[\hat{\epsilon}] = var[(I-H)\epsilon] = (I-H)\sigma^2\) assuming that \(var[\epsilon] = \sigma^2 I\) .

\(h_i = H_{ii}\) are called leverages. Since \(var[\hat{\epsilon_i}] = (I-h_i)\sigma^2\) , a large leverage \(h_i\) will make \(var[\hat{\epsilon}]\) small.

Since \(\sum{h_i} = p\), an average value for \(h_i\) is \(\frac{p}{n}\) .
A rough rule is that leverages of more than \(\frac{2p}{n}\)should be looked at more closely.

hatv <- hatvalues(lmod1)
p = sum(hatv)
p

## [1] 5

two_p_over_n = 2*p/50
two_p_over_n

## [1] 0.2

hatv[hatv>two_p_over_n]

##  California Connecticut  New Jersey        Utah 
##   0.2821179   0.2254519   0.2220978   0.2921128

residuals(lmod1)[hatv>two_p_over_n]

##  California Connecticut  New Jersey        Utah 
##   -15.84797    28.16407   -13.74758    65.76608

sat[hatv>two_p_over_n,]

##             expend ratio salary takers verbal math total
## California   4.992  24.0 41.078     45    417  485   902
## Connecticut  8.817  14.4 50.045     81    431  477   908
## New Jersey   9.774  13.8 46.087     70    420  478   898
## Utah         3.656  24.3 29.082      4    513  563  1076

states = row.names(sat)
halfnorm(hatv,labs=states,ylab="Leverages")

Utah and California have the highest leverages.
The next two (unlabeled) points in the upper right are Connecticut and New Jersey.

QQnorm plot of standardized residuals

qqnorm(rstandard(lmod1), main="Standardized Residuals")
abline(0,1)

The above points do not appear to be extreme enough to substantially affect the results as leverage points.

(d) Check for outliers.

“Some observations do not fit the model well. Such points are called outliers.”

The method of “studentized residuals” recomputes the results, excluding each point one-at-a-time,

stud = rstudent(lmod1)
orderedstud = stud[order(stud)]
# smallest studentized residuals
t(t(head(orderedstud,3)))

##                     [,1]
## West Virginia  -3.124428
## Nevada         -1.732004
## South Carolina -1.468832

# largest studentized residuals
t(t(tail(orderedstud,3)))

##                   [,1]
## New Hampshire 2.190006
## North Dakota  2.213686
## Utah          2.529587

# Bonferroni critical value
bonferroni = qt(.05/(50*2),44)

Because there are 50 data points and 5 variables, there are 44 degress of freedom. The Bonferroni critical value is -3.5258013 .

Because the above studentized critical values are within the bounds of \(\pm -3.5258013\) , these points are not considered outliers.

(e) Check for influential points.

“Other observations change the fit of the model in a substantive manner. These are called influential observations.”

“An influential point is one whose removal from the dataset would cause a large change in the fit. An influential point may or may not be an outlier and may or may not have large leverage but it will tend to have at least one of these two properties.”

Cook’s Distance (“Cook’s D”) is a common way to identify influential points.

cook <- cooks.distance(lmod1)
halfnorm(cook,3,labs=states,ylab="Cook’s distances")

Utah has the largest Cook’s D.

# sort the cook results
ordercook = cook[order(cook)]

# which are the largest?
t(t(tail(ordercook)))

##                     [,1]
## Alaska        0.04581008
## Connecticut   0.05574806
## North Dakota  0.07954292
## New Hampshire 0.07989442
## West Virginia 0.10813954
## Utah          0.47152866

# Drop Utah from the data set and re-fit:
sat_exUtah <- subset(sat,rownames(sat)!="Utah")
lmod_exUtah <- lm(formula = "total ~ expend + salary + ratio + takers",data = sat_exUtah)
summary(lmod_exUtah)

## 
## Call:
## lm(formula = "total ~ expend + salary + ratio + takers", data = sat_exUtah)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -92.118 -18.402   1.808  14.890  67.669 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1093.8460    53.4226  20.475   <2e-16 ***
## expend        -0.9427    10.1922  -0.092    0.927    
## salary         3.0964     2.3283   1.330    0.190    
## ratio         -7.6391     3.4279  -2.229    0.031 *  
## takers        -2.9308     0.2188 -13.397   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.9 on 44 degrees of freedom
## Multiple R-squared:  0.8396, Adjusted R-squared:  0.825 
## F-statistic: 57.58 on 4 and 44 DF,  p-value: < 2.2e-16

# Review the results when Utah was included:
summary(lmod1)

## 
## Call:
## lm(formula = "total ~ expend + salary + ratio + takers", data = sat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.531 -20.855  -1.746  15.979  66.571 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1045.9715    52.8698  19.784  < 2e-16 ***
## expend         4.4626    10.5465   0.423    0.674    
## salary         1.6379     2.3872   0.686    0.496    
## ratio         -3.6242     3.2154  -1.127    0.266    
## takers        -2.9045     0.2313 -12.559 2.61e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared:  0.8246, Adjusted R-squared:  0.809 
## F-statistic: 52.88 on 4 and 45 DF,  p-value: < 2.2e-16

# How much does dropping Utah from the dataset change the means of each variable?
# means of each variable, including Utah
colMeans(sat)

##    expend     ratio    salary    takers    verbal      math     total 
##   5.90526  16.85800  34.82892  35.24000 457.14000 508.78000 965.92000

# means of each column, omitting Utah
colMeans(sat_exUtah)

##     expend      ratio     salary     takers     verbal       math 
##   5.951163  16.706122  34.946204  35.877551 456.000000 507.673469 
##      total 
## 963.673469

# What were Utah's values?
subset(sat,rownames(sat)=="Utah")

##      expend ratio salary takers verbal math total
## Utah  3.656  24.3 29.082      4    513  563  1076

Utah has the lowest number of “takers” and one of the highest values for “total.”
Omitting it from the data causes “ratio” to become significant at 95% confidence.

plot(dfbeta(lmod1)[,2],ylab="Change in SAT coef")
abline(h=0)

The states are indexed alphabetically. Utah is index 44.

There are substantial changes to the betas when Utah is excluded from the analysis.

Other states:

Influence measures are demonstrated according to the following table:

#influence.measures(lmod1)

influence.measures(lmod1)$is.inf[rowSums(influence.measures(lmod1)$is.inf)>0,]

##               dfb.1_ dfb.expn dfb.slry dfb.rati dfb.tkrs dffit cov.r
## California     FALSE    FALSE    FALSE    FALSE    FALSE FALSE  TRUE
## New Jersey     FALSE    FALSE    FALSE    FALSE    FALSE FALSE  TRUE
## New York       FALSE    FALSE    FALSE    FALSE    FALSE FALSE  TRUE
## Utah           FALSE    FALSE    FALSE     TRUE    FALSE  TRUE FALSE
## West Virginia  FALSE    FALSE    FALSE    FALSE    FALSE FALSE  TRUE
##               cook.d   hat
## California     FALSE FALSE
## New Jersey     FALSE FALSE
## New York       FALSE FALSE
## Utah           FALSE FALSE
## West Virginia  FALSE FALSE

influence.measures(lmod1)$infmat[rowSums(influence.measures(lmod1)$is.inf)>0,]

##                    dfb.1_    dfb.expn    dfb.slry    dfb.rati     dfb.tkrs
## California     0.17494825  0.11879268 -0.12778683 -0.11353062 -0.037695710
## New Jersey     0.08810643 -0.16153804  0.09430634 -0.05161047  0.001040469
## New York       0.03944191 -0.04697844  0.02374747 -0.02265524 -0.003107751
## Utah          -0.95829725  0.54239648 -0.64657132  1.32142482  0.120462842
## West Virginia -0.24200911 -0.30021099  0.25200030  0.10725005  0.320984440
##                     dffit     cov.r      cook.d        hat
## California    -0.35584939 1.5028022 0.025713040 0.28211791
## New Jersey    -0.25247497 1.4024378 0.012972641 0.22209778
## New York      -0.07991023 1.3798444 0.001305355 0.19157520
## Utah           1.62496100 0.8016725 0.471528660 0.29211280
## West Virginia -0.80372769 0.4380370 0.108139543 0.06206536

(f) Check the structure of the relationship between the predictors and the response.

# standard diagnostic plot
plot(lmod1)

# Term plot of partial residuals:
termplot(lmod1,partial.resid = T)

The flatness of the lines associated with three of the variables (expend, salary, and ratio) reflect their lack of significance.

The model may be improved by dropping them.

MY621-LMR-CH6-EX1-P97

Michael Y.

9/22/2019