Ex 6.3
require(faraway)## Loading required package: faraway
Q.
For the prostate data, fit a model with lpsa as the response and the other variables as predictors. Answer the questions posed in the first question.
(a) Check the constant variance assumption for the errors.
head(prostate)## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.7695 50 -1.386294 0 -1.38629 6 0 -0.43078
## 2 -0.9942523 3.3196 58 -1.386294 0 -1.38629 6 0 -0.16252
## 3 -0.5108256 2.6912 74 -1.386294 0 -1.38629 7 20 -0.16252
## 4 -1.2039728 3.2828 58 -1.386294 0 -1.38629 6 0 -0.16252
## 5 0.7514161 3.4324 62 -1.386294 0 -1.38629 6 0 0.37156
## 6 -1.0498221 3.2288 50 -1.386294 0 -1.38629 6 0 0.76547
Column description
lpsa - log(prostate specific antigen)
For more details, check here
lmod <- lm (lpsa ~ ., data=prostate)
summary(lmod)##
## Call:
## lm(formula = lpsa ~ ., data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7331 -0.3713 -0.0170 0.4141 1.6381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.669337 1.296387 0.516 0.60693
## lcavol 0.587022 0.087920 6.677 2.11e-09 ***
## lweight 0.454467 0.170012 2.673 0.00896 **
## age -0.019637 0.011173 -1.758 0.08229 .
## lbph 0.107054 0.058449 1.832 0.07040 .
## svi 0.766157 0.244309 3.136 0.00233 **
## lcp -0.105474 0.091013 -1.159 0.24964
## gleason 0.045142 0.157465 0.287 0.77503
## pgg45 0.004525 0.004421 1.024 0.30886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7084 on 88 degrees of freedom
## Multiple R-squared: 0.6548, Adjusted R-squared: 0.6234
## F-statistic: 20.86 on 8 and 88 DF, p-value: < 2.2e-16
plot(fitted(lmod), residuals(lmod), xlab = "Fitted", ylab = "Residuals")
abline(h=0)The plot shows there is a constant variance for the errors.
Let’s examine the constant variance assumption more closely.
plot(fitted(lmod), sqrt(abs(residuals(lmod))), xlab = "Fitted", ylab=expression(sqrt(hat(epsilon))))The plot above shows approximately constant variation.
sumary(lm(sqrt(abs(residuals(lmod))) ~ fitted(lmod)))## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.703381 0.090607 7.7630 9.475e-12
## fitted(lmod) -0.021990 0.034232 -0.6424 0.5222
##
## n = 97, p = 2, Residual SE = 0.31328, R-Squared = 0
-> Doesn’t seem to have non-constant variance.
(b) Check the normality assumption.
qqnorm(residuals(lmod), ylab = "Residuals")
qqline(residuals(lmod))You can tell that both the left tail and the right tail we see some deviations from the normal.
shapiro.test(residuals(lmod))##
## Shapiro-Wilk normality test
##
## data: residuals(lmod)
## W = 0.99113, p-value = 0.7721
The p-value in Shapiro-Wilk Normality test tells us that there is no evidence that null hypothesis (the data came from a normally distributed distribution) can not be rejected.
(c) Check for large leverage points.
hatv <- hatvalues(lmod)
head(hatv)## 1 2 3 4 5 6
## 0.07873101 0.06758053 0.13596177 0.07766218 0.03499946 0.08331908
sum(hatv)## [1] 9
length(coefficients(lmod))## [1] 9
We verify that the sum of the leverages is indeed 9 — the number of parameters in the model.
subjects <- row.names(prostate)
halfnorm(hatv, labs = subjects, ylab="Leverages")qqnorm(rstandard(lmod))
abline(0,1)As for Standardized Residuals in the plot above, an absolute value of 2 would be large but not exceptional for a standardized residual whereas a value of 4 would be very unusual under the standard normal. So there is a total of 6 large leverage points.
(d) Check for outliers.
Let’s compute studentized residuals for the prostate data and pick out the largest:
stud <- rstudent(lmod)
stud[which.max(abs(stud))]## 39
## -2.61698
The largest residual of -2.62 is rather large for a standard normal scale, but is it an outlier? Compute the Bonferroni critical value:
qt(.05/(50*2),44)## [1] -3.525801
Since the value 2.61698 is smaller than 3.53 in absolute value, 39 is not an outlier.
(e) Check for influential points.
cook <- cooks.distance(lmod)
halfnorm(cook, 3, labs = subjects, ylab = "Cook's distance")We’ve now identified the largest 3 subjects for this prostate cancer study. We now exclude the largest subject (#32) and see how fit changes.
lmod1<-lm(lpsa ~ ., data=prostate, subset = (cook < max(cook)))
sumary(lmod1)## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1718627 1.3288221 0.1293 0.897391
## lcavol 0.5653328 0.0884722 6.3899 7.93e-09
## lweight 0.6216627 0.2020165 3.0773 0.002793
## age -0.0212715 0.0111459 -1.9085 0.059630
## lbph 0.0955905 0.0585289 1.6332 0.106037
## svi 0.7604232 0.2425957 3.1345 0.002347
## lcp -0.1059870 0.0903647 -1.1729 0.244045
## gleason 0.0506884 0.1563842 0.3241 0.746620
## pgg45 0.0044683 0.0043898 1.0179 0.311554
##
## n = 96, p = 9, Residual SE = 0.70336, R-Squared = 0.66
Compared to the full fit:
sumary(lmod)## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6693367 1.2963875 0.5163 0.606934
## lcavol 0.5870218 0.0879203 6.6767 2.111e-09
## lweight 0.4544674 0.1700124 2.6731 0.008955
## age -0.0196372 0.0111727 -1.7576 0.082293
## lbph 0.1070540 0.0584492 1.8316 0.070398
## svi 0.7661573 0.2443091 3.1360 0.002329
## lcp -0.1054743 0.0910135 -1.1589 0.249638
## gleason 0.0451416 0.1574645 0.2867 0.775033
## pgg45 0.0045252 0.0044212 1.0235 0.308860
##
## n = 97, p = 9, Residual SE = 0.70842, R-Squared = 0.65
We have plotted the change in the third parameter estimate, lweight, when a case is left out.
# dfbeta: This suite of functions can be used to compute some of the regression (leave-one-out deletion) diagnostics for linear and generalized linear models discussed in Belsley, Kuh and Welsch (1980), Cook and Weisberg (1982), etc.
plot(dfbeta(lmod)[,3],ylab="Change in lweight coef")
abline(h=0)There is no significant changes in the p-values between the model that excludes the largest subject and the full model. But the coefficient estimate for lweight changed by nearly 37% with just the exclusion of the largest subject. With that, I concluded that the largest subject (#32) is an influential point.
(f) Check the structure of the relationship between the predictors and the response.
summary(prostate)## lcavol lweight age lbph
## Min. :-1.3471 Min. :2.375 Min. :41.00 Min. :-1.3863
## 1st Qu.: 0.5128 1st Qu.:3.376 1st Qu.:60.00 1st Qu.:-1.3863
## Median : 1.4469 Median :3.623 Median :65.00 Median : 0.3001
## Mean : 1.3500 Mean :3.653 Mean :63.87 Mean : 0.1004
## 3rd Qu.: 2.1270 3rd Qu.:3.878 3rd Qu.:68.00 3rd Qu.: 1.5581
## Max. : 3.8210 Max. :6.108 Max. :79.00 Max. : 2.3263
## svi lcp gleason pgg45
## Min. :0.0000 Min. :-1.3863 Min. :6.000 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:-1.3863 1st Qu.:6.000 1st Qu.: 0.00
## Median :0.0000 Median :-0.7985 Median :7.000 Median : 15.00
## Mean :0.2165 Mean :-0.1794 Mean :6.753 Mean : 24.38
## 3rd Qu.:0.0000 3rd Qu.: 1.1786 3rd Qu.:7.000 3rd Qu.: 40.00
## Max. :1.0000 Max. : 2.9042 Max. :9.000 Max. :100.00
## lpsa
## Min. :-0.4308
## 1st Qu.: 1.7317
## Median : 2.5915
## Mean : 2.4784
## 3rd Qu.: 3.0564
## Max. : 5.5829
plot(prostate$age)prostate$age## [1] 50 58 74 58 62 50 64 58 47 63 65 63 63 67 57 66 70 66 41 70 59 60 59 63 69
## [26] 68 65 67 67 65 65 65 71 54 63 64 73 64 68 56 60 68 62 61 66 61 79 68 43 70
## [51] 68 64 64 68 59 66 47 49 70 61 73 63 72 66 64 61 68 72 69 72 60 77 69 60 69
## [76] 68 72 78 69 63 66 57 77 65 60 64 58 62 65 76 68 61 68 44 52 68 68
Realized that age 55 would be a good cutoff point.
l <-residuals(lm(lpsa~., data=prostate))
a <-residuals(lm(age~lcavol+lweight+lbph+svi+lcp+gleason+pgg45, data=prostate))
plot(a, l, xlab ="Age residuals", ylab = "log(prostate specific antigen) residuals")
# abline(l,a)
abline(0, coef(lmod)['age'])coef(lm(l~a))## (Intercept) a
## -2.705426e-16 7.003921e-18
coef(lmod)## (Intercept) lcavol lweight age lbph svi
## 0.669336698 0.587021826 0.454467424 -0.019637176 0.107054031 0.766157326
## lcp gleason pgg45
## -0.105474263 0.045141598 0.004525231
Let’s try to construct a partial residual plot.
termplot(lmod, partial.resid = T, terms = 1)We now construct 2 groups, one is greater than or equal to 55 years of age. The other is strictly less than 55.
lmodx <- lm(lpsa~., data = prostate, subset = (age>=55))
lmody <- lm(lpsa~., data = prostate, subset = (age< 55))
sumary(lmodx)## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0156469 1.4409033 0.0109 0.991364
## lcavol 0.5834430 0.0921044 6.3346 1.409e-08
## lweight 0.4371205 0.1691919 2.5836 0.011646
## age -0.0015432 0.0156670 -0.0985 0.921787
## lbph 0.1060509 0.0578093 1.8345 0.070394
## svi 0.7034850 0.2423294 2.9030 0.004804
## lcp -0.1559469 0.0917346 -1.7000 0.093120
## gleason -0.0410577 0.1563744 -0.2626 0.793582
## pgg45 0.0082465 0.0044896 1.8368 0.070049
##
## n = 87, p = 9, Residual SE = 0.68860, R-Squared = 0.63
sumary(lmody)##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.0622439 53.8561672 1.2638 0.3337
## lcavol 0.5325474 0.3583797 1.4860 0.2756
## lweight 2.3174689 1.3590219 1.7052 0.2303
## age 0.0141106 0.0754149 0.1871 0.8688
## lbph -0.0020277 0.3801935 -0.0053 0.9962
## svi -13.2448104 8.2037138 -1.6145 0.2478
## lcp 6.7593744 4.2628645 1.5856 0.2537
## gleason -10.9262719 7.9589355 -1.3728 0.3035
##
## n = 10, p = 8, Residual SE = 0.74034, R-Squared = 0.96
prostate$age_group <- ifelse(prostate$age >= 55, "Older", "Younger")
require(ggplot2)## Loading required package: ggplot2
Plotting lpsa against age by age_group
ggplot(prostate, aes(x = age, y = lpsa, shape = age_group)) +
geom_point()ggplot(prostate, aes(x = age, y = lpsa, shape = age_group)) +
geom_point() +
facet_grid(~ age_group) +
stat_smooth(method = "lm")## `geom_smooth()` using formula 'y ~ x'
Note the regression line with a 95% confidence bands, making it clear how relationships differ between 2 groups.