a). Backward elimination
Backward elimination is the simplest of all variable selection procedures and can be easily implemented without special software. We start with all the predictors in the model and remove the predictor with the largest p-value that exceeds α_crit (here 0.05). We refit and repeat until every remaining predictor has a p-value below α_crit; at that point all "nonsignificant" predictors have been removed and the selection is complete.
From the regression analysis below, we first remove gleason, which has the largest p-value (0.77503 > 0.05), and rerun the analysis. In the new fit, lcp has the largest p-value (0.25127 > 0.05), so we remove it next. We then remove pgg45 (p-value 0.25331 > 0.05), followed by age (p-value 0.169528 > 0.05), and finally lbph (p-value 0.11213 > 0.05). At that point the remaining predictors (lcavol, lweight and svi) are all significant and the elimination stops.
We notice that the adjusted R^2 of 0.6234 for the full model is reduced only slightly, to 0.6144, in the final model. Thus the removal of five predictors causes only a minor reduction in fit. It is important to understand that the variables omitted from the model may still be related to the response: fitting the five removed predictors on their own (last regression below) shows that lcp, in particular, has a clear association with lpsa. Thus one failing of backward elimination is that it cannot reliably distinguish between important and unimportant predictors.
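As a sketch, the whole procedure can also be automated with a short loop (the names alpha_crit and lmod_auto below are just illustrative); the step-by-step removals that follow do the same thing by hand and end at the same three-predictor model.
library(faraway)                      # provides the prostate data
data(prostate)
alpha_crit <- 0.05
lmod_auto <- lm(lpsa ~ ., prostate)   # start from the full model
repeat {
  pvals <- summary(lmod_auto)$coefficients[-1, "Pr(>|t|)"]  # p-values, intercept excluded
  if (max(pvals) <= alpha_crit) break                       # stop once everything is "significant"
  worst <- names(which.max(pvals))                          # predictor with the largest p-value
  lmod_auto <- update(lmod_auto, as.formula(paste(". ~ . -", worst)))
}
summary(lmod_auto)                    # ends with lcavol, lweight and svi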
library(faraway)
data(prostate)
lmod <- lm(lpsa ~., prostate)
summary(lmod)
##
## Call:
## lm(formula = lpsa ~ ., data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7331 -0.3713 -0.0170 0.4141 1.6381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.669337 1.296387 0.516 0.60693
## lcavol 0.587022 0.087920 6.677 2.11e-09 ***
## lweight 0.454467 0.170012 2.673 0.00896 **
## age -0.019637 0.011173 -1.758 0.08229 .
## lbph 0.107054 0.058449 1.832 0.07040 .
## svi 0.766157 0.244309 3.136 0.00233 **
## lcp -0.105474 0.091013 -1.159 0.24964
## gleason 0.045142 0.157465 0.287 0.77503
## pgg45 0.004525 0.004421 1.024 0.30886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7084 on 88 degrees of freedom
## Multiple R-squared: 0.6548, Adjusted R-squared: 0.6234
## F-statistic: 20.86 on 8 and 88 DF, p-value: < 2.2e-16
lmod <- update(lmod, . ~ . - gleason)
summary(lmod)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + lcp +
## pgg45, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.73117 -0.38137 -0.01728 0.43364 1.63513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.953926 0.829439 1.150 0.25319
## lcavol 0.591615 0.086001 6.879 8.07e-10 ***
## lweight 0.448292 0.167771 2.672 0.00897 **
## age -0.019336 0.011066 -1.747 0.08402 .
## lbph 0.107671 0.058108 1.853 0.06720 .
## svi 0.757734 0.241282 3.140 0.00229 **
## lcp -0.104482 0.090478 -1.155 0.25127
## pgg45 0.005318 0.003433 1.549 0.12488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7048 on 89 degrees of freedom
## Multiple R-squared: 0.6544, Adjusted R-squared: 0.6273
## F-statistic: 24.08 on 7 and 89 DF, p-value: < 2.2e-16
lmod <- update(lmod, . ~ . - lcp)
summary(lmod)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + pgg45,
## data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77711 -0.41708 0.00002 0.40676 1.59681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.980085 0.830665 1.180 0.24116
## lcavol 0.545770 0.076431 7.141 2.31e-10 ***
## lweight 0.449450 0.168078 2.674 0.00890 **
## age -0.017470 0.010967 -1.593 0.11469
## lbph 0.105755 0.058191 1.817 0.07249 .
## svi 0.641666 0.219757 2.920 0.00442 **
## pgg45 0.003528 0.003068 1.150 0.25331
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7061 on 90 degrees of freedom
## Multiple R-squared: 0.6493, Adjusted R-squared: 0.6259
## F-statistic: 27.77 on 6 and 90 DF, p-value: < 2.2e-16
lmod <- update(lmod, . ~ . - pgg45)
summary(lmod)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.83505 -0.39396 0.00414 0.46336 1.57888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.95100 0.83175 1.143 0.255882
## lcavol 0.56561 0.07459 7.583 2.77e-11 ***
## lweight 0.42369 0.16687 2.539 0.012814 *
## age -0.01489 0.01075 -1.385 0.169528
## lbph 0.11184 0.05805 1.927 0.057160 .
## svi 0.72095 0.20902 3.449 0.000854 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7073 on 91 degrees of freedom
## Multiple R-squared: 0.6441, Adjusted R-squared: 0.6245
## F-statistic: 32.94 on 5 and 91 DF, p-value: < 2.2e-16
lmod <- update(lmod, . ~ . - age)
summary(lmod)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + lbph + svi, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.82653 -0.42270 0.04362 0.47041 1.48530
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.14554 0.59747 0.244 0.80809
## lcavol 0.54960 0.07406 7.422 5.64e-11 ***
## lweight 0.39088 0.16600 2.355 0.02067 *
## lbph 0.09009 0.05617 1.604 0.11213
## svi 0.71174 0.20996 3.390 0.00103 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7108 on 92 degrees of freedom
## Multiple R-squared: 0.6366, Adjusted R-squared: 0.6208
## F-statistic: 40.29 on 4 and 92 DF, p-value: < 2.2e-16
lmod <- update(lmod, . ~ . - lbph)
summary(lmod)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + svi, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.72964 -0.45764 0.02812 0.46403 1.57013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.26809 0.54350 -0.493 0.62298
## lcavol 0.55164 0.07467 7.388 6.3e-11 ***
## lweight 0.50854 0.15017 3.386 0.00104 **
## svi 0.66616 0.20978 3.176 0.00203 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7168 on 93 degrees of freedom
## Multiple R-squared: 0.6264, Adjusted R-squared: 0.6144
## F-statistic: 51.99 on 3 and 93 DF, p-value: < 2.2e-16
summary(lm(lpsa ~ gleason + lcp + pgg45 + age + lbph, prostate))  # the five removed predictors fit on their own
##
## Call:
## lm(formula = lpsa ~ gleason + lcp + pgg45 + age + lbph, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0633 -0.5770 -0.0574 0.5861 2.3477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.615915 1.529515 1.056 0.2935
## gleason 0.102677 0.207699 0.494 0.6222
## lcp 0.398056 0.091097 4.370 3.3e-05 ***
## pgg45 0.002105 0.005904 0.357 0.7222
## age 0.002753 0.014668 0.188 0.8516
## lbph 0.133615 0.072286 1.848 0.0678 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9608 on 91 degrees of freedom
## Multiple R-squared: 0.3433, Adjusted R-squared: 0.3072
## F-statistic: 9.512 on 5 and 91 DF, p-value: 2.527e-07
b). AIC
The leaps package exhaustively searches all possible combinations of the predictors. For each model size p, it finds the subset of variables that produces the minimum RSS. From the analysis, we see that the best one-predictor model uses lcavol, the best two-predictor model adds lweight, and so on. For a least-squares model with p predictors, AIC = n log(RSS/n) + 2(p + 1) up to an additive constant, with n = 97 here. From the figure below, the AIC is minimized by a model with four to five predictors (lcavol, lweight, lbph, svi and possibly age), somewhat larger than the three-predictor model reached by backward elimination.
require(leaps)
## Loading required package: leaps
b <- regsubsets(lpsa ~., prostate)
rs <- summary(b)
rs$which
## (Intercept) lcavol lweight age lbph svi lcp gleason pgg45
## 1 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## 4 TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
## 5 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## 8 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# AIC = n*log(RSS/n) + 2*(p+1) up to a constant; here n = 97 observations and
# the candidate models have 2 to 9 parameters (intercept plus 1 to 8 predictors)
AIC <- 97 * log(rs$rss/97) + (2:9) * 2
plot(1:8, AIC, xlab = "Number of Predictors", ylab = "AIC")
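As a cross-check, base R's step() carries out a backward search guided by AIC. It drops one term at a time rather than examining every subset, but its choice should agree closely with the exhaustive search above; a minimal sketch:
step_mod <- step(lm(lpsa ~ ., prostate), direction = "backward", trace = 0)
summary(step_mod)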
c). Adjusted R^2
The model that the adjusted R^2 criterion selects is model 7, which uses lcavol, lweight, age, lbph, svi, lcp and pgg45 as predictors (every predictor except gleason).
plot(1:8, rs$adjr2, xlab = "Number of Predictors", ylab = "Adjusted R-squared")
which.max(rs$adjr2)
## [1] 7
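As a sketch of what this criterion is doing, adjusted R^2 can be recomputed directly from the RSS values returned by regsubsets, using 1 - (RSS/(n - p - 1))/(TSS/(n - 1)) for a model with p predictors (variable names below are just illustrative):
n <- nrow(prostate)                                  # 97 observations
p <- 1:8                                             # number of predictors in each model
tss <- sum((prostate$lpsa - mean(prostate$lpsa))^2)  # total sum of squares of lpsa
adjr2_byhand <- 1 - (rs$rss/(n - p - 1)) / (tss/(n - 1))
max(abs(adjr2_byhand - rs$adjr2))                    # should be essentially zero
which.max(adjr2_byhand)                              # should again point to model 7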
d). Mallows Cp
One final criterion is Mallows' Cp statistic. A good model should predict well, so the average mean squared error of prediction is a natural criterion, and Cp estimates it (scaled by the error variance of the full model). Models whose Cp values fall on or below the line Cp = p fit adequately; from the plot, model 7 meets this condition.
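The Cp plot referred to above can be produced from the regsubsets summary; a minimal sketch, plotting Cp against the number of parameters with the reference line Cp = p:
plot(2:9, rs$cp, xlab = "Number of Parameters", ylab = "Cp statistic")
abline(0, 1)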
The best model may therefore be model 7, which is picked by both Mallows' Cp and the adjusted R^2 criterion. This model excludes only gleason and has an adjusted R^2 of 0.6273.