Ex 6.3

require(faraway)
## Loading required package: faraway

Q.

For the prostate data, fit a model with lpsa as the response and the other variables as predictors. Answer the questions posed in the first question.

(a) Check the constant variance assumption for the errors.

head(prostate)
##       lcavol lweight age      lbph svi      lcp gleason pgg45     lpsa
## 1 -0.5798185  2.7695  50 -1.386294   0 -1.38629       6     0 -0.43078
## 2 -0.9942523  3.3196  58 -1.386294   0 -1.38629       6     0 -0.16252
## 3 -0.5108256  2.6912  74 -1.386294   0 -1.38629       7    20 -0.16252
## 4 -1.2039728  3.2828  58 -1.386294   0 -1.38629       6     0 -0.16252
## 5  0.7514161  3.4324  62 -1.386294   0 -1.38629       6     0  0.37156
## 6 -1.0498221  3.2288  50 -1.386294   0 -1.38629       6     0  0.76547

Column description

lpsa - log(prostate specific antigen)

For descriptions of the remaining columns, see the dataset documentation (?prostate in the faraway package).

lmod <- lm(lpsa ~ ., data = prostate)
summary(lmod)
## 
## Call:
## lm(formula = lpsa ~ ., data = prostate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7331 -0.3713 -0.0170  0.4141  1.6381 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.669337   1.296387   0.516  0.60693    
## lcavol       0.587022   0.087920   6.677 2.11e-09 ***
## lweight      0.454467   0.170012   2.673  0.00896 ** 
## age         -0.019637   0.011173  -1.758  0.08229 .  
## lbph         0.107054   0.058449   1.832  0.07040 .  
## svi          0.766157   0.244309   3.136  0.00233 ** 
## lcp         -0.105474   0.091013  -1.159  0.24964    
## gleason      0.045142   0.157465   0.287  0.77503    
## pgg45        0.004525   0.004421   1.024  0.30886    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7084 on 88 degrees of freedom
## Multiple R-squared:  0.6548, Adjusted R-squared:  0.6234 
## F-statistic: 20.86 on 8 and 88 DF,  p-value: < 2.2e-16
plot(fitted(lmod), residuals(lmod), xlab = "Fitted", ylab = "Residuals")
abline(h=0)

The residuals versus fitted values plot shows no obvious trend or funnel shape, which suggests the errors have constant variance.

Let’s examine the constant variance assumption more closely.

plot(fitted(lmod), sqrt(abs(residuals(lmod))), xlab = "Fitted", ylab=expression(sqrt(hat(epsilon))))

The plot above shows approximately constant variation.

sumary(lm(sqrt(abs(residuals(lmod))) ~ fitted(lmod)))
##               Estimate Std. Error t value  Pr(>|t|)
## (Intercept)   0.703381   0.090607  7.7630 9.475e-12
## fitted(lmod) -0.021990   0.034232 -0.6424    0.5222
## 
## n = 97, p = 2, Residual SE = 0.31328, R-Squared = 0

The slope in this regression of sqrt(|residuals|) on the fitted values is not significant (p = 0.52), so there is no evidence of non-constant variance.
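As an extra, more formal check (not part of the original solution, and assuming the lmtest package is installed), a Breusch-Pagan test can be run on the fitted model:

require(lmtest)
# Breusch-Pagan test: the null hypothesis is constant error variance,
# so a large p-value gives no evidence of heteroscedasticity
bptest(lmod)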

(b) Check the normality assumption.

qqnorm(residuals(lmod), ylab = "Residuals")
qqline(residuals(lmod))

Both the left and right tails show some deviation from the normal reference line, though the departure is mild.

shapiro.test(residuals(lmod))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(lmod)
## W = 0.99113, p-value = 0.7721

The large p-value (0.77) from the Shapiro-Wilk test means we cannot reject the null hypothesis that the residuals come from a normal distribution, so there is no evidence against normality.
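A histogram of the residuals (an optional extra check, not in the original write-up) gives another quick visual impression of the error distribution:

# Histogram of residuals; a roughly symmetric, bell-shaped histogram supports the normality assumption
hist(residuals(lmod), breaks = 20, main = "", xlab = "Residuals")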

(c) Check for large leverage points.

hatv <- hatvalues(lmod)
head(hatv)
##          1          2          3          4          5          6 
## 0.07873101 0.06758053 0.13596177 0.07766218 0.03499946 0.08331908
sum(hatv)
## [1] 9
length(coefficients(lmod))
## [1] 9

We verify that the sum of the leverages is indeed 9 — the number of parameters in the model.

subjects <- row.names(prostate)
halfnorm(hatv, labs = subjects, ylab="Leverages")

qqnorm(rstandard(lmod))
abline(0,1)

For the standardized residuals in the plot above, an absolute value of 2 would be large but not exceptional, whereas a value of 4 would be very unusual under the standard normal. Judging from these plots, there appear to be about 6 points with large leverage.
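A common rule of thumb (my addition, not part of the original answer) is to flag leverages larger than 2p/n for a closer look:

p <- length(coefficients(lmod))  # 9 parameters
n <- nrow(prostate)              # 97 observations
hatv[hatv > 2 * p / n]           # cases whose leverage exceeds 2p/n (about 0.186)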

(d) Check for outliers.

Let’s compute studentized residuals for the prostate data and pick out the largest:

stud <- rstudent(lmod)
stud[which.max(abs(stud))]
##       39 
## -2.61698

The largest studentized residual, -2.62, is fairly large on a standard normal scale, but is it an outlier? Compute the Bonferroni critical value:

# For the prostate data, n = 97 and the studentized residuals have n - p - 1 = 87 degrees of freedom
qt(.05/(97*2), 87)

This Bonferroni critical value comes out to roughly -3.6. Since |-2.61698| is well below 3.6, observation 39 is not flagged as an outlier.
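Equivalently (an optional cross-check of my own, assuming the car package is installed), car::outlierTest reports the Bonferroni-adjusted p-value for the most extreme studentized residual:

require(car)
# Reports the largest absolute studentized residual together with its Bonferroni-adjusted p-value
outlierTest(lmod)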

(e) Check for influential points.

cook <- cooks.distance(lmod)
halfnorm(cook, 3, labs = subjects, ylab = "Cook's distance")

The half-normal plot labels the three subjects with the largest Cook's distances. We now exclude the case with the largest Cook's distance (subject 32) and see how the fit changes.

lmod1<-lm(lpsa ~ ., data=prostate, subset = (cook < max(cook)))
sumary(lmod1)
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.1718627  1.3288221  0.1293 0.897391
## lcavol       0.5653328  0.0884722  6.3899 7.93e-09
## lweight      0.6216627  0.2020165  3.0773 0.002793
## age         -0.0212715  0.0111459 -1.9085 0.059630
## lbph         0.0955905  0.0585289  1.6332 0.106037
## svi          0.7604232  0.2425957  3.1345 0.002347
## lcp         -0.1059870  0.0903647 -1.1729 0.244045
## gleason      0.0506884  0.1563842  0.3241 0.746620
## pgg45        0.0044683  0.0043898  1.0179 0.311554
## 
## n = 96, p = 9, Residual SE = 0.70336, R-Squared = 0.66

Compared to the full fit:

sumary(lmod)
##               Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  0.6693367  1.2963875  0.5163  0.606934
## lcavol       0.5870218  0.0879203  6.6767 2.111e-09
## lweight      0.4544674  0.1700124  2.6731  0.008955
## age         -0.0196372  0.0111727 -1.7576  0.082293
## lbph         0.1070540  0.0584492  1.8316  0.070398
## svi          0.7661573  0.2443091  3.1360  0.002329
## lcp         -0.1054743  0.0910135 -1.1589  0.249638
## gleason      0.0451416  0.1574645  0.2867  0.775033
## pgg45        0.0045252  0.0044212  1.0235  0.308860
## 
## n = 97, p = 9, Residual SE = 0.70842, R-Squared = 0.65

Next, we plot the change in the lweight coefficient (the third parameter estimate) when each case is left out in turn.

# dfbeta gives the leave-one-out change in each regression coefficient (Belsley, Kuh and Welsch, 1980)
plot(dfbeta(lmod)[,3],ylab="Change in lweight coef")
abline(h=0)

There are no major changes in the p-values between the model that excludes the largest case and the full model, but the coefficient estimate for lweight changes by nearly 37% (from 0.454 to 0.622) when that single case is removed. On that basis, I conclude that the largest case (subject 32) is an influential point.
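As a broader sweep (my own addition), R's built-in influence.measures collects DFBETAS, DFFITS, covariance ratios, Cook's distances and hat values in one table, and its summary marks the cases it considers noteworthy:

# Cases flagged as potentially influential by any of the standard criteria are starred
summary(influence.measures(lmod))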

(f) Check the structure of the relationship between the predictors and the response.

summary(prostate)
##      lcavol           lweight           age             lbph        
##  Min.   :-1.3471   Min.   :2.375   Min.   :41.00   Min.   :-1.3863  
##  1st Qu.: 0.5128   1st Qu.:3.376   1st Qu.:60.00   1st Qu.:-1.3863  
##  Median : 1.4469   Median :3.623   Median :65.00   Median : 0.3001  
##  Mean   : 1.3500   Mean   :3.653   Mean   :63.87   Mean   : 0.1004  
##  3rd Qu.: 2.1270   3rd Qu.:3.878   3rd Qu.:68.00   3rd Qu.: 1.5581  
##  Max.   : 3.8210   Max.   :6.108   Max.   :79.00   Max.   : 2.3263  
##       svi              lcp             gleason          pgg45       
##  Min.   :0.0000   Min.   :-1.3863   Min.   :6.000   Min.   :  0.00  
##  1st Qu.:0.0000   1st Qu.:-1.3863   1st Qu.:6.000   1st Qu.:  0.00  
##  Median :0.0000   Median :-0.7985   Median :7.000   Median : 15.00  
##  Mean   :0.2165   Mean   :-0.1794   Mean   :6.753   Mean   : 24.38  
##  3rd Qu.:0.0000   3rd Qu.: 1.1786   3rd Qu.:7.000   3rd Qu.: 40.00  
##  Max.   :1.0000   Max.   : 2.9042   Max.   :9.000   Max.   :100.00  
##       lpsa        
##  Min.   :-0.4308  
##  1st Qu.: 1.7317  
##  Median : 2.5915  
##  Mean   : 2.4784  
##  3rd Qu.: 3.0564  
##  Max.   : 5.5829
plot(prostate$age)

prostate$age
##  [1] 50 58 74 58 62 50 64 58 47 63 65 63 63 67 57 66 70 66 41 70 59 60 59 63 69
## [26] 68 65 67 67 65 65 65 71 54 63 64 73 64 68 56 60 68 62 61 66 61 79 68 43 70
## [51] 68 64 64 68 59 66 47 49 70 61 73 63 72 66 64 61 68 72 69 72 60 77 69 60 69
## [76] 68 72 78 69 63 66 57 77 65 60 64 58 62 65 76 68 61 68 44 52 68 68

Looking at these ages, 55 appears to be a reasonable cutoff for splitting the subjects into two groups (we use it below). First, though, let's examine the relationship between age and the response after adjusting for the other predictors, using an added-variable (partial regression) plot.

# Added-variable (partial regression) plot for age: regress both lpsa and age on all
# the other predictors, then plot the two sets of residuals against each other
l <- residuals(lm(lpsa ~ lcavol + lweight + lbph + svi + lcp + gleason + pgg45, data = prostate))
a <- residuals(lm(age ~ lcavol + lweight + lbph + svi + lcp + gleason + pgg45, data = prostate))
plot(a, l, xlab = "Age residuals", ylab = "log(prostate specific antigen) residuals")
abline(0, coef(lmod)['age'])

coef(lm(l ~ a))

The intercept of this regression is numerically zero and its slope reproduces the age coefficient from the full fit (about -0.0196), as the added-variable construction guarantees. Compare with the full-model coefficients:
coef(lmod)
##  (Intercept)       lcavol      lweight          age         lbph          svi 
##  0.669336698  0.587021826  0.454467424 -0.019637176  0.107054031  0.766157326 
##          lcp      gleason        pgg45 
## -0.105474263  0.045141598  0.004525231

Let’s try to construct a partial residual plot.

termplot(lmod, partial.resid = T, terms = 1)
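For reference (a by-hand sketch of my own), the partial residuals that termplot displays for the first term, lcavol, can be constructed directly: they are the ordinary residuals plus the lcavol contribution to the fit.

# Partial residuals for lcavol (termplot centers the term, so this version differs only by a vertical shift)
pr <- residuals(lmod) + coef(lmod)['lcavol'] * prostate$lcavol
plot(prostate$lcavol, pr, xlab = "lcavol", ylab = "Partial residuals for lcavol")
abline(0, coef(lmod)['lcavol'])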

We now split the subjects into two groups, those aged 55 or older and those strictly younger than 55, and refit the model within each group.

lmodx <- lm(lpsa~., data = prostate, subset = (age>=55))
lmody <- lm(lpsa~., data = prostate, subset = (age< 55))

sumary(lmodx)
##               Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  0.0156469  1.4409033  0.0109  0.991364
## lcavol       0.5834430  0.0921044  6.3346 1.409e-08
## lweight      0.4371205  0.1691919  2.5836  0.011646
## age         -0.0015432  0.0156670 -0.0985  0.921787
## lbph         0.1060509  0.0578093  1.8345  0.070394
## svi          0.7034850  0.2423294  2.9030  0.004804
## lcp         -0.1559469  0.0917346 -1.7000  0.093120
## gleason     -0.0410577  0.1563744 -0.2626  0.793582
## pgg45        0.0082465  0.0044896  1.8368  0.070049
## 
## n = 87, p = 9, Residual SE = 0.68860, R-Squared = 0.63
sumary(lmody)
## 
## Coefficients: (1 not defined because of singularities)
##                Estimate  Std. Error t value Pr(>|t|)
## (Intercept)  68.0622439  53.8561672  1.2638   0.3337
## lcavol        0.5325474   0.3583797  1.4860   0.2756
## lweight       2.3174689   1.3590219  1.7052   0.2303
## age           0.0141106   0.0754149  0.1871   0.8688
## lbph         -0.0020277   0.3801935 -0.0053   0.9962
## svi         -13.2448104   8.2037138 -1.6145   0.2478
## lcp           6.7593744   4.2628645  1.5856   0.2537
## gleason     -10.9262719   7.9589355 -1.3728   0.3035
## 
## n = 10, p = 8, Residual SE = 0.74034, R-Squared = 0.96
prostate$age_group <- ifelse(prostate$age >= 55, "Older", "Younger")
require(ggplot2)
## Loading required package: ggplot2

Plotting lpsa against age by age_group

ggplot(prostate, aes(x = age, y = lpsa, shape = age_group)) +
  geom_point()

ggplot(prostate, aes(x = age, y = lpsa, shape = age_group)) +
  geom_point() +
  facet_grid(~ age_group) +
  stat_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Note the regression lines with 95% confidence bands, which make clear how the relationship between age and lpsa differs between the two groups.
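The separate fit for the younger group rests on only 10 observations (and one coefficient could not even be estimated), so it is very unstable. A more formal way to compare what the faceted plot shows (an optional sketch of my own, not part of the original answer) is to fit a single model that lets the age slope differ by group and test the interaction:

# Does the lpsa ~ age relationship differ between the two age groups?
lmod_int <- lm(lpsa ~ age * age_group, data = prostate)
summary(lmod_int)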