Q7

In the lab, we applied random forests to the “Boston” data using “mtry = 6” and using “ntree = 25” and “ntree = 500”. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for “mtry” and “ntree”. Describe the results obtained.

First: split the data into training and test sets

library(MASS)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
Boston.train <- Boston[train, -14]
Boston.test <- Boston[-train, -14]
Y.train <- Boston[train, 14]
Y.test <- Boston[-train, 14]

Second: fit the models

rf.boston1 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, mtry = ncol(Boston) - 1, ntree = 500)
rf.boston2 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, mtry = (ncol(Boston) - 1) / 2, ntree = 500)
rf.boston3 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, mtry = sqrt(ncol(Boston) - 1), ntree = 500)

Third: visualize the results

plot(1:500, rf.boston1$test$mse, col = "green", type = "l", xlab = "Number of Trees", ylab = "Test MSE", ylim = c(18, 30))
lines(1:500, rf.boston2$test$mse, col = "red", type = "l")
lines(1:500, rf.boston3$test$mse, col = "blue", type = "l")
legend("topright", c("m = p", "m = p/2", "m = sqrt(p)"), col = c("green", "red", "blue"), cex = 1, lty = 1)

Q9

This problem involves the “OJ” data set which is part of the “ISLR” package.

a. Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

library(ISLR)
set.seed(1)
train = sample(1:nrow(OJ),800)
OJ.train = OJ[train,]
OJ.test = OJ[-train,]

b. Fit a tree to the training data, with “Purchase” as the response and the other variables except for “Buy” as predictors. Use the “summary()” function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

library(tree)
tree.oj = tree(Purchase~.,data=OJ.train)
summary(tree.oj)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ.train)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"
## [5] "PctDiscMM"    
## Number of terminal nodes:  9 
## Residual mean deviance:  0.7432 = 587.8 / 791 
## Misclassification error rate: 0.1588 = 127 / 800

The fitted tree has 9 terminal nodes and a training error rate of 0.1588.

c. Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

tree.oj
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1073.00 CH ( 0.60625 0.39375 )  
##    2) LoyalCH < 0.5036 365  441.60 MM ( 0.29315 0.70685 )  
##      4) LoyalCH < 0.280875 177  140.50 MM ( 0.13559 0.86441 )  
##        8) LoyalCH < 0.0356415 59   10.14 MM ( 0.01695 0.98305 ) *
##        9) LoyalCH > 0.0356415 118  116.40 MM ( 0.19492 0.80508 ) *
##      5) LoyalCH > 0.280875 188  258.00 MM ( 0.44149 0.55851 )  
##       10) PriceDiff < 0.05 79   84.79 MM ( 0.22785 0.77215 )  
##         20) SpecialCH < 0.5 64   51.98 MM ( 0.14062 0.85938 ) *
##         21) SpecialCH > 0.5 15   20.19 CH ( 0.60000 0.40000 ) *
##       11) PriceDiff > 0.05 109  147.00 CH ( 0.59633 0.40367 ) *
##    3) LoyalCH > 0.5036 435  337.90 CH ( 0.86897 0.13103 )  
##      6) LoyalCH < 0.764572 174  201.00 CH ( 0.73563 0.26437 )  
##       12) ListPriceDiff < 0.235 72   99.81 MM ( 0.50000 0.50000 )  
##         24) PctDiscMM < 0.196197 55   73.14 CH ( 0.61818 0.38182 ) *
##         25) PctDiscMM > 0.196197 17   12.32 MM ( 0.11765 0.88235 ) *
##       13) ListPriceDiff > 0.235 102   65.43 CH ( 0.90196 0.09804 ) *
##      7) LoyalCH > 0.764572 261   91.20 CH ( 0.95785 0.04215 ) *

We pick the node labeled 8, a terminal node (marked by the asterisk). Its split criterion is LoyalCH < 0.0356; the branch contains 59 observations, has a deviance of 10.14, and its overall prediction is MM. Less than 2% of the observations in this branch take the value CH, while the remaining 98% take the value MM.

d. Create a plot of the tree, and interpret the results.

plot(tree.oj)
text(tree.oj,pretty=0)

We see that the most important indicator of “Purchase” appears to be “LoyalCH”, since the first split separates observations by the intensity of customer brand loyalty to CH. In fact, the top three nodes all split on “LoyalCH”.

e. Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

tree.pred = predict(tree.oj,OJ.test,type="class")
table(tree.pred,OJ.test$Purchase)
##          
## tree.pred  CH  MM
##        CH 160  38
##        MM   8  64
1-(160+64)/270
## [1] 0.1703704

The test error rate is about 17%.
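
The same test error rate can be computed directly from the predictions (reusing tree.pred from above):

mean(tree.pred != OJ.test$Purchase)   # proportion of misclassified test observations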

f. Apply the “cv.tree()” function to the training set in order to determine the optimal tree size.

cv.oj = cv.tree(tree.oj,FUN=prune.misclass)
cv.oj
## $size
## [1] 9 8 7 4 2 1
## 
## $dev
## [1] 150 150 149 158 172 315
## 
## $k
## [1]       -Inf   0.000000   3.000000   4.333333  10.500000 151.000000
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

The optimal tree size is 7, which attains the lowest cross-validation error (dev = 149; with FUN = prune.misclass, dev reports the number of CV misclassifications rather than the deviance).
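
To extract the optimal size programmatically rather than reading it off the printout:

cv.oj$size[which.min(cv.oj$dev)]   # tree size with the lowest CV error (7 here)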

g. Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

plot(cv.oj$size,cv.oj$dev,type='b',xlab='Tree size',ylab='CV classification error')

h. Which tree size corresponds to the lowest cross-validated classification error rate? A tree of size 7.

i. Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

prune.oj = prune.misclass(tree.oj,best=7)
plot(prune.oj)
text(prune.oj,pretty=0)

j. Compare the training error rates between the pruned and unpruned trees. Which is higher?

summary(tree.oj)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ.train)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"
## [5] "PctDiscMM"    
## Number of terminal nodes:  9 
## Residual mean deviance:  0.7432 = 587.8 / 791 
## Misclassification error rate: 0.1588 = 127 / 800
summary(prune.oj)
## 
## Classification tree:
## snip.tree(tree = tree.oj, nodes = c(4L, 10L))
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "ListPriceDiff" "PctDiscMM"    
## Number of terminal nodes:  7 
## Residual mean deviance:  0.7748 = 614.4 / 793 
## Misclassification error rate: 0.1625 = 130 / 800

The training misclassification error rate is slightly higher for the pruned tree (0.1625 vs. 0.1588).

k. Compare the test error rates between the pruned and unpruned trees. Which is higher?

prune.pred = predict(prune.oj,OJ.test,type='class')
table(prune.pred,OJ.test$Purchase)
##           
## prune.pred  CH  MM
##         CH 160  36
##         MM   8  66
1-(160+66)/270
## [1] 0.162963

In this case, pruning decreased the test error rate to about 16.3% (from about 17% for the unpruned tree), and it also produced a more interpretable tree.

Q10

We now use boosting to predict “Salary” in the “Hitters” data set.

a. Remove the observations for whom the salary information is unknown, and then log-transform the salaries.

Hitters = na.omit(Hitters)
Hitters$Salary = log(Hitters$Salary)

b. Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.

train = 1:200
Hitters.train = Hitters[train,]
Hitters.test = Hitters[-train,]

c. Perform boosting on the training set with 1000 trees for a range of values of the shrinkage parameter λ. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis.

library(gbm)
## Loaded gbm 2.1.8
set.seed(1)
pows = seq(-10,-0.2,by=0.1)
lambdas = 10^pows
train.err = rep(NA,length(lambdas))
for (i in 1:length(lambdas)){
  boost.hitters = gbm(Salary~.,data=Hitters.train,distribution='gaussian',n.trees=1000,shrinkage=lambdas[i])
  pred.train = predict(boost.hitters,Hitters.train,n.trees=1000)
  train.err[i] = mean((pred.train-Hitters.train$Salary)^2)
}
plot(lambdas,train.err,type="b",xlab="Shrinkage values",ylab="Training MSE")
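
Because the shrinkage values span several orders of magnitude, a log-scaled x-axis can make the curve easier to read; a one-line variant of the plot above:

plot(lambdas, train.err, type = "b", log = "x", xlab = "Shrinkage values (log scale)", ylab = "Training MSE")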

d. Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.

set.seed(1)
test.err = rep(NA,length(lambdas))
for (i in 1:length(lambdas)){
  boost.hitters = gbm(Salary~.,data=Hitters.train,distribution="gaussian",n.trees=1000,shrinkage=lambdas[i])
  yhat = predict(boost.hitters,Hitters.test,n.trees=1000)
  test.err[i] = mean((yhat-Hitters.test$Salary)^2)
}
plot(lambdas,test.err,type="b",xlab="Shrinkage values",ylab="Test MSE")

min(test.err)
## [1] 0.2540265
lambdas[which.min(test.err)]
## [1] 0.07943282

The minimum test MSE is about 0.25, and it is obtained for λ = 0.079.

e. Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6.

# Linear regression
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-1
fit1 = lm(Salary~.,data=Hitters.train)
pred1 = predict(fit1,Hitters.test)
mean((pred1-Hitters.test$Salary)^2)
## [1] 0.4917959
# Ridge regression
x = model.matrix(Salary~.,data=Hitters.train)
x.test = model.matrix(Salary~.,data=Hitters.test)
y = Hitters.train$Salary
fit2 = glmnet(x,y,alpha=0)
pred2 = predict(fit2,s=0.01,newx=x.test)
mean((pred2-Hitters.test$Salary)^2)
## [1] 0.4570283

The test MSE of boosting (about 0.25) is lower than that of both linear regression (about 0.49) and ridge regression (about 0.46).
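
The ridge fit above uses a fixed penalty of s = 0.01; as a sketch, one could instead select the penalty by cross-validation with cv.glmnet (reusing x, y, and x.test from above; cv.ridge and pred.cv are illustrative names):

set.seed(1)
cv.ridge = cv.glmnet(x, y, alpha = 0)                     # 10-fold CV over a grid of ridge penalties
pred.cv = predict(cv.ridge, s = cv.ridge$lambda.min, newx = x.test)
mean((pred.cv - Hitters.test$Salary)^2)                   # test MSE at the CV-selected penalty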

f. Which variables appear to be the most important predictors in the boosted model?

library(gbm)
boost.hitters = gbm(Salary~.,data=Hitters.train,distribution='gaussian',n.trees=1000,shrinkage=lambdas[which.min(test.err)])
summary(boost.hitters)

##                 var    rel.inf
## CAtBat       CAtBat 20.8404970
## CRBI           CRBI 12.3158959
## Walks         Walks  7.4186037
## PutOuts     PutOuts  7.1958539
## Years         Years  6.3104535
## CWalks       CWalks  6.0221656
## CHmRun       CHmRun  5.7759763
## CHits         CHits  4.8914360
## AtBat         AtBat  4.2187460
## RBI             RBI  4.0812410
## Hits           Hits  4.0117255
## Assists     Assists  3.8786634
## HmRun         HmRun  3.6386178
## CRuns         CRuns  3.3230296
## Errors       Errors  2.6369128
## Runs           Runs  2.2048386
## Division   Division  0.5347342
## NewLeague NewLeague  0.4943540
## League       League  0.2062551

“CAtBat” is by far the most important variable.
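
A partial dependence plot gives a rough view of the marginal effect of this top-ranked variable on log salary; a short sketch using plot.gbm:

plot(boost.hitters, i = "CAtBat")   # partial dependence of log(Salary) on CAtBat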

g. Now apply bagging to the training set. What is the test set MSE for this approach?

set.seed(1)
bag.hitters = randomForest(Salary~.,data=Hitters.train,mtry=19,ntree=500)
yhat.bag = predict(bag.hitters,newdata=Hitters.test)
mean((yhat.bag-Hitters.test$Salary)^2)
## [1] 0.2299324

The test MSE for bagging (mtry = 19 uses all 19 predictors) is about 0.23, which is slightly lower than the test MSE for boosting.

Q11

This question uses the “Caravan” data set.

a. Create a training set consisting of the first 1000 observations, and a test set consisting of the remaining observations.

set.seed(1)
train = 1:1000
Caravan.train = Caravan[train,]
Caravan.test = Caravan[-train,]

b. Fit a boosting model to the training set with “Purchase” as the response and the other variables as predictors. Use 1000 trees, and a shrinkage value of 0.01. Which predictors appear to be most important?

set.seed(1)
boost.caravan = gbm(Purchase~.,data=Caravan.train,distribution="gaussian",n.trees=1000,shrinkage=0.01)
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
## variable 50: PVRAAUT has no variation.
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
## variable 71: AVRAAUT has no variation.
summary(boost.caravan)

##               var     rel.inf
## PPERSAUT PPERSAUT 13.51824557
## MKOOPKLA MKOOPKLA 10.24062778
## MOPLHOOG MOPLHOOG  7.32689780
## MBERMIDD MBERMIDD  6.35820558
## PBRAND     PBRAND  4.98826360
## ABRAND     ABRAND  4.54504653
## MGODGE     MGODGE  4.26496875
## MINK3045 MINK3045  4.13253907
## PWAPART   PWAPART  3.15612877
## MOSTYPE   MOSTYPE  2.79099985
## MAUT1       MAUT1  2.76929763
## MAUT2       MAUT2  1.99879666
## MSKA         MSKA  1.94618539
## MBERARBG MBERARBG  1.89917331
## PBYSTAND PBYSTAND  1.88591514
## MINKGEM   MINKGEM  1.87131472
## MGODOV     MGODOV  1.81673309
## MGODPR     MGODPR  1.80814745
## MFWEKIND MFWEKIND  1.67884570
## MSKC         MSKC  1.65075962
## MBERHOOG MBERHOOG  1.53559951
## MSKB1       MSKB1  1.43339514
## MZFONDS   MZFONDS  1.37139649
## MOPLMIDD MOPLMIDD  1.10617074
## MRELGE     MRELGE  1.09039794
## MINK7512 MINK7512  1.08772012
## MGODRK     MGODRK  1.03126657
## MINK4575 MINK4575  1.02492795
## MHHUUR     MHHUUR  0.86853689
## MRELOV     MRELOV  0.80356854
## MFGEKIND MFGEKIND  0.80335689
## MZPART     MZPART  0.69824614
## MBERARBO MBERARBO  0.60909852
## MHKOOP     MHKOOP  0.57696613
## APERSAUT APERSAUT  0.56707821
## MGEMOMV   MGEMOMV  0.55589456
## MAUT0       MAUT0  0.54748481
## PMOTSCO   PMOTSCO  0.43362597
## MSKB2       MSKB2  0.43075446
## MSKD         MSKD  0.42751490
## MINK123M MINK123M  0.40920707
## MINKM30   MINKM30  0.36996576
## MOSHOOFD MOSHOOFD  0.33336325
## MBERBOER MBERBOER  0.28967068
## MFALLEEN MFALLEEN  0.28877552
## MGEMLEEF MGEMLEEF  0.20084195
## MOPLLAAG MOPLLAAG  0.15750616
## MBERZELF MBERZELF  0.11203381
## PLEVEN     PLEVEN  0.11030994
## MRELSA     MRELSA  0.04500507
## MAANTHUI MAANTHUI  0.03322830
## PWABEDR   PWABEDR  0.00000000
## PWALAND   PWALAND  0.00000000
## PBESAUT   PBESAUT  0.00000000
## PVRAAUT   PVRAAUT  0.00000000
## PAANHANG PAANHANG  0.00000000
## PTRACTOR PTRACTOR  0.00000000
## PWERKT     PWERKT  0.00000000
## PBROM       PBROM  0.00000000
## PPERSONG PPERSONG  0.00000000
## PGEZONG   PGEZONG  0.00000000
## PWAOREG   PWAOREG  0.00000000
## PZEILPL   PZEILPL  0.00000000
## PPLEZIER PPLEZIER  0.00000000
## PFIETS     PFIETS  0.00000000
## PINBOED   PINBOED  0.00000000
## AWAPART   AWAPART  0.00000000
## AWABEDR   AWABEDR  0.00000000
## AWALAND   AWALAND  0.00000000
## ABESAUT   ABESAUT  0.00000000
## AMOTSCO   AMOTSCO  0.00000000
## AVRAAUT   AVRAAUT  0.00000000
## AAANHANG AAANHANG  0.00000000
## ATRACTOR ATRACTOR  0.00000000
## AWERKT     AWERKT  0.00000000
## ABROM       ABROM  0.00000000
## ALEVEN     ALEVEN  0.00000000
## APERSONG APERSONG  0.00000000
## AGEZONG   AGEZONG  0.00000000
## AWAOREG   AWAOREG  0.00000000
## AZEILPL   AZEILPL  0.00000000
## APLEZIER APLEZIER  0.00000000
## AFIETS     AFIETS  0.00000000
## AINBOED   AINBOED  0.00000000
## ABYSTAND ABYSTAND  0.00000000

“PPERSAUT” and “MKOOPKLA” are the two most important predictors.
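
Since “Purchase” is a two-level factor, a more conventional setup would recode it as 0/1 and fit with distribution = "bernoulli" rather than "gaussian"; a minimal sketch (Caravan.train2 and boost.caravan2 are illustrative names):

Caravan.train2 = Caravan.train
Caravan.train2$Purchase = ifelse(Caravan.train2$Purchase == "Yes", 1, 0)
boost.caravan2 = gbm(Purchase~., data = Caravan.train2, distribution = "bernoulli", n.trees = 1000, shrinkage = 0.01)
summary(boost.caravan2)   # relative influence ranking, as above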

Q12

Apply boosting, bagging, and random forests to a data set of your choice. Be sure to fit models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods like linear or logistic regression? Which of these approaches yields the best performance?

We will use the “Weekly” data set from the “ISLR” package to predict the “Direction” variable.

library(ISLR)
set.seed(1)
train = sample(nrow(Weekly),nrow(Weekly)/2)
Weekly$Direction = ifelse(Weekly$Direction=="Up",1,0)
Weekly.train = Weekly[train,]
Weekly.test = Weekly[-train,]

We begin with logistic regression.

logit.fit = glm(Direction~.-Year-Today,data=Weekly.train,family='binomial')
logit.probs = predict(logit.fit,newdata=Weekly.test,type='response')
logit.pred = ifelse(logit.probs>0.5,1,0)
table(Weekly.test$Direction,logit.pred)
##    logit.pred
##       0   1
##   0  82 144
##   1 100 219
1-(82+219)/(82+144+100+219)
## [1] 0.4477064

The test classification error is about 0.448. We continue with boosting.

boost.fit = gbm(Direction~.-Year-Today,data=Weekly.train,distribution='bernoulli',n.trees=5000)
boost.probs = predict(boost.fit,Weekly.test,n.trees=5000)
boost.pred = ifelse(boost.probs>0.5,1,0)
table(Weekly.test$Direction,boost.pred)
##    boost.pred
##       0   1
##   0 126 100
##   1 180 139
1-(126+139)/(126+100+180+139)
## [1] 0.5137615

The test classification error is about 0.514.
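
One caveat: for a Bernoulli model, predict.gbm returns predictions on the log-odds scale by default, so the 0.5 cutoff above is not a probability threshold. A variant that thresholds on the probability scale (a sketch; boost.probs2 and boost.pred2 are illustrative names, results not shown):

boost.probs2 = predict(boost.fit, Weekly.test, n.trees = 5000, type = "response")   # predicted probabilities
boost.pred2 = ifelse(boost.probs2 > 0.5, 1, 0)
mean(boost.pred2 != Weekly.test$Direction)   # test classification error

We continue with bagging.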

bag.fit = randomForest(Direction~.-Year-Today,data=Weekly.train,mtry=6)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
bag.probs = predict(bag.fit,newdata=Weekly.test)
bag.pred = ifelse(bag.probs>0.5,1,0)
table(Weekly.test$Direction,bag.pred)
##    bag.pred
##       0   1
##   0 101 125
##   1 125 194
1-(101+194)/(101+125+125+194)
## [1] 0.4587156

The test classification error is about 0.459. We end with random forests.

rf.fit <- randomForest(Direction ~ . - Year - Today, data = Weekly.train, mtry = 2)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
rf.probs <- predict(rf.fit, newdata = Weekly.test)
rf.pred <- ifelse(rf.probs > 0.5, 1, 0)
table(Weekly.test$Direction, rf.pred)
##    rf.pred
##       0   1
##   0 103 123
##   1 125 194
1-(103+194)/(103+123+125+194)
## [1] 0.4550459

The test classification error is about 0.455.
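
The warnings above appear because “Direction” was recoded as numeric 0/1, so randomForest fits regression trees whose predictions we threshold at 0.5. A classification-forest variant would keep the factor response (a sketch; Weekly.f, rf.class, and rf.class.pred are illustrative names):

Weekly.f = ISLR::Weekly                     # reload the data with Direction still a factor
rf.class = randomForest(Direction~.-Year-Today, data = Weekly.f[train,], mtry = 2)
rf.class.pred = predict(rf.class, newdata = Weekly.f[-train,])
mean(rf.class.pred != Weekly.f$Direction[-train])   # test classification error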

We may conclude that, on this split, logistic regression gives the lowest test classification error (about 0.448), followed by random forests (0.455), bagging (0.459), and boosting (0.514).