Q7
In the lab, we applied random forests to the “Boston” data using “mtry = 6” and using “ntree = 25” and “ntree = 500”. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for “mtry” and “ntree”. Describe the results obtained.
First: split the data into a training set and a test set.
library(MASS)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
Boston.train <- Boston[train, -14]
Boston.test <- Boston[-train, -14]
Y.train <- Boston[train, 14]
Y.test <- Boston[-train, 14]
Second: fit random forests with three different values of mtry.
rf.boston1 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, mtry = ncol(Boston) - 1, ntree = 500)
rf.boston2 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, mtry = (ncol(Boston) - 1) / 2, ntree = 500)
rf.boston3 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, mtry = sqrt(ncol(Boston) - 1), ntree = 500)
Third: plot the test MSE against the number of trees.
plot(1:500, rf.boston1$test$mse, col = "green", type = "l", xlab = "Number of Trees", ylab = "Test MSE", ylim = c(18, 30))
lines(1:500, rf.boston2$test$mse, col = "red", type = "l")
lines(1:500, rf.boston3$test$mse, col = "blue", type = "l")
legend("topright", c("m = p", "m = p/2", "m = sqrt(p)"), col = c("green", "red", "blue"), cex = 1, lty = 1)- 对单课树,Test MSE非常高,拟合效果很差
- 随着树的增加,Test MSE减小
- 使用所有预测变量,容易导致过拟合,得到的Test MSE比使用部分变量的高
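To cover a more comprehensive grid, we can refit the forest for every candidate value of mtry and record the full test-MSE curve over the number of trees. This is only a sketch using the objects defined above (it refits 13 forests, so it takes a while, and the exact curves depend on the random seed):
mtry.grid <- 1:(ncol(Boston) - 1)                     # all candidate values of mtry
test.mse <- matrix(NA, nrow = 500, ncol = length(mtry.grid))
for (m in mtry.grid) {
  rf.fit <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, mtry = m, ntree = 500)
  test.mse[, m] <- rf.fit$test$mse                    # test MSE after 1, ..., 500 trees
}
matplot(1:500, test.mse, type = "l", lty = 1, col = rainbow(length(mtry.grid)), xlab = "Number of Trees", ylab = "Test MSE")
legend("topright", legend = paste("m =", mtry.grid), col = rainbow(length(mtry.grid)), lty = 1, cex = 0.6)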
Q9
This problem involves the “OJ” data set which is part of the “ISLR” package.
a.Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
library(ISLR)
set.seed(1)
train = sample(1:nrow(OJ),800)
OJ.train = OJ[train,]
OJ.test = OJ[-train,]
b.Fit a tree to the training data, with “Purchase” as the response and the other variables except for “Buy” as predictors. Use the “summary()” function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate ? How many terminal nodes does the tree have ?
library(tree)
tree.oj = tree(Purchase~.,data=OJ.train)
summary(tree.oj)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH" "ListPriceDiff"
## [5] "PctDiscMM"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7432 = 587.8 / 791
## Misclassification error rate: 0.1588 = 127 / 800
The fitted tree has 9 terminal nodes and a training error rate of 0.1588.
c.Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
tree.oj
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1073.00 CH ( 0.60625 0.39375 )
## 2) LoyalCH < 0.5036 365 441.60 MM ( 0.29315 0.70685 )
## 4) LoyalCH < 0.280875 177 140.50 MM ( 0.13559 0.86441 )
## 8) LoyalCH < 0.0356415 59 10.14 MM ( 0.01695 0.98305 ) *
## 9) LoyalCH > 0.0356415 118 116.40 MM ( 0.19492 0.80508 ) *
## 5) LoyalCH > 0.280875 188 258.00 MM ( 0.44149 0.55851 )
## 10) PriceDiff < 0.05 79 84.79 MM ( 0.22785 0.77215 )
## 20) SpecialCH < 0.5 64 51.98 MM ( 0.14062 0.85938 ) *
## 21) SpecialCH > 0.5 15 20.19 CH ( 0.60000 0.40000 ) *
## 11) PriceDiff > 0.05 109 147.00 CH ( 0.59633 0.40367 ) *
## 3) LoyalCH > 0.5036 435 337.90 CH ( 0.86897 0.13103 )
## 6) LoyalCH < 0.764572 174 201.00 CH ( 0.73563 0.26437 )
## 12) ListPriceDiff < 0.235 72 99.81 MM ( 0.50000 0.50000 )
## 24) PctDiscMM < 0.196197 55 73.14 CH ( 0.61818 0.38182 ) *
## 25) PctDiscMM > 0.196197 17 12.32 MM ( 0.11765 0.88235 ) *
## 13) ListPriceDiff > 0.235 102 65.43 CH ( 0.90196 0.09804 ) *
## 7) LoyalCH > 0.764572 261 91.20 CH ( 0.95785 0.04215 ) *
We pick node 8, which is a terminal node (marked with an asterisk). Its split criterion is LoyalCH < 0.036; the branch contains 59 observations with a deviance of 10.14, and the overall prediction for the branch is MM. Less than 2% of the observations in this branch take the value CH, while the remaining 98% take the value MM.
d.Create a plot of the tree, and interpret the results.
plot(tree.oj)
text(tree.oj,pretty=0)
The most important indicator of “Purchase” appears to be “LoyalCH”: the first split separates customers by the intensity of their brand loyalty to CH, and the top three splits all involve “LoyalCH”.
e.Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate ?
tree.pred = predict(tree.oj,OJ.test,type="class")
table(tree.pred,OJ.test$Purchase)
##
## tree.pred CH MM
## CH 160 38
## MM 8 64
1-(160+64)/270
## [1] 0.1703704
The test error rate is about 17%.
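Equivalently, the test error rate can be computed directly from the prediction vector (a quick check using the objects above):
mean(tree.pred != OJ.test$Purchase)   # proportion of misclassified test observations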
f.Apply the “cv.tree()” function to the training set in order to determine the optimal size tree.
cv.oj = cv.tree(tree.oj,FUN=prune.misclass)
cv.oj
## $size
## [1] 9 8 7 4 2 1
##
## $dev
## [1] 150 150 149 158 172 315
##
## $k
## [1] -Inf 0.000000 3.000000 4.333333 10.500000 151.000000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
The optimal tree size is 7, which attains the lowest cross-validation deviance (149 misclassified observations).
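The same answer can be extracted programmatically (a quick check; if several sizes tie for the minimum, which.min returns the first one):
cv.oj$size[which.min(cv.oj$dev)]   # tree size with the lowest CV error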
g.Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(cv.oj$size,cv.oj$dev,type='b',xlab='Tree size',ylab='Deviance')
h.Which tree size corresponds to the lowest cross-validated classification error rate ?
A tree size of 7 corresponds to the lowest cross-validated classification error rate.
i.Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
prune.oj = prune.misclass(tree.oj,best=7)
plot(prune.oj)
text(prune.oj,pretty=0)
j.Compare the training error rates between the pruned and unpruned trees. Which is higher ?
summary(tree.oj)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH" "ListPriceDiff"
## [5] "PctDiscMM"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7432 = 587.8 / 791
## Misclassification error rate: 0.1588 = 127 / 800
summary(prune.oj)
##
## Classification tree:
## snip.tree(tree = tree.oj, nodes = c(4L, 10L))
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "ListPriceDiff" "PctDiscMM"
## Number of terminal nodes: 7
## Residual mean deviance: 0.7748 = 614.4 / 793
## Misclassification error rate: 0.1625 = 130 / 800
The training misclassification error rate is slightly higher for the pruned tree (0.1625 vs. 0.1588).
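As a quick check, the two training error rates can also be computed directly from the fitted trees (a small sketch using the objects above):
mean(predict(tree.oj, OJ.train, type = "class") != OJ.train$Purchase)    # unpruned tree
mean(predict(prune.oj, OJ.train, type = "class") != OJ.train$Purchase)   # pruned tree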
k.Compare the test error rates between the pruned and unpruned trees. Which is higher ?
prune.pred = predict(prune.oj,OJ.test,type='class')
table(prune.pred,OJ.test$Purchase)
##
## prune.pred CH MM
## CH 160 36
## MM 8 66
1-(160+66)/270
## [1] 0.162963
In this case, pruning decreased the test error rate to about 16.3% (from 17.0%), and it also produced a more interpretable tree.
Q10
We now use boosting to predict “Salary” in the “Hitters” data set.
a.Remove the observations for whom the salary information is unknown, and then log-transform the salaries.
Hitters = na.omit(Hitters)
Hitters$Salary = log(Hitters$Salary)
b.Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.
train = 1:200
Hitters.train = Hitters[train,]
Hitters.test = Hitters[-train,]
c.Perform boosting on the training set with 1000 trees for a range of values of the shrinkage parameter λ. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis.
library(gbm)
## Loaded gbm 2.1.8
set.seed(1)
pows = seq(-10,-0.2,by=0.1)
lambdas = 10^pows
train.err = rep(NA,length(lambdas))
for (i in 1:length(lambdas)){
boost.hitters = gbm(Salary~.,data=Hitters.train,distribution='gaussian',n.trees=1000,shrinkage=lambdas[i])
pred.train = predict(boost.hitters,Hitters.train,n.trees=1000)
train.err[i] = mean((pred.train-Hitters.train$Salary)^2)
}
plot(lambdas,train.err,type="b",xlab="Shrinkage values",ylab="Training MSE")
d.Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.
set.seed(1)
test.err = rep(NA,length(lambdas))
for (i in 1:length(lambdas)){
boost.hitters = gbm(Salary~.,data=Hitters.train,distribution="gaussian",n.trees=1000,shrinkage=lambdas[i])
yhat = predict(boost.hitters,Hitters.test,n.trees=1000)
test.err[i] = mean((yhat-Hitters.test$Salary)^2)
}
plot(lambdas,test.err,type="b",xlab="Shrinkage values",ylab="Test MSE")
min(test.err)
## [1] 0.2540265
lambdas[which.min(test.err)]
## [1] 0.07943282
The minimum test MSE is about 0.25, and it is obtained for \(\lambda \approx 0.079\).
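Because the shrinkage grid spans many orders of magnitude, the curve is easier to read on a log scale (an optional sketch using the objects defined above):
plot(pows, test.err, type = "b", xlab = "log10(Shrinkage)", ylab = "Test MSE")
abline(v = log10(lambdas[which.min(test.err)]), lty = 2)   # mark the best shrinkage value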
e.Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6.
# Linear regression
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-1
fit1 = lm(Salary~.,data=Hitters.train)
pred1 = predict(fit1,Hitters.test)
mean((pred1-Hitters.test$Salary)^2)
## [1] 0.4917959
# Ridge regression
x = model.matrix(Salary~.,data=Hitters.train)
x.test = model.matrix(Salary~.,data=Hitters.test)
y = Hitters.train$Salary
fit2 = glmnet(x,y,alpha=0)
pred2 = predict(fit2,s=0.01,newx=x.test)
mean((pred2-Hitters.test$Salary)^2)
## [1] 0.4570283
The test MSE for boosting (about 0.25) is lower than the test MSE for both linear regression (about 0.49) and ridge regression (about 0.46).
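The ridge penalty above is fixed arbitrarily at s = 0.01; a more principled choice is to select it by cross-validation. The following is only a sketch using the objects defined above, and the resulting test MSE may differ somewhat from the value reported:
set.seed(1)
cv.ridge = cv.glmnet(x, y, alpha = 0)                      # 10-fold CV over a grid of penalties
pred2.cv = predict(fit2, s = cv.ridge$lambda.min, newx = x.test)
mean((pred2.cv - Hitters.test$Salary)^2)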
f.Which variables appear to be the most important predictors in the boosted model ?
library(gbm)
boost.hitters = gbm(Salary~.,data=Hitters.train,distribution='gaussian',n.trees=1000,shrinkage=lambdas[which.min(test.err)])
summary(boost.hitters)
## var rel.inf
## CAtBat CAtBat 20.8404970
## CRBI CRBI 12.3158959
## Walks Walks 7.4186037
## PutOuts PutOuts 7.1958539
## Years Years 6.3104535
## CWalks CWalks 6.0221656
## CHmRun CHmRun 5.7759763
## CHits CHits 4.8914360
## AtBat AtBat 4.2187460
## RBI RBI 4.0812410
## Hits Hits 4.0117255
## Assists Assists 3.8786634
## HmRun HmRun 3.6386178
## CRuns CRuns 3.3230296
## Errors Errors 2.6369128
## Runs Runs 2.2048386
## Division Division 0.5347342
## NewLeague NewLeague 0.4943540
## League League 0.2062551
“CAtBat” is by far the most important variable.
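To see how the fitted model uses this variable, we can draw its partial dependence plot (a sketch using the boost.hitters object fitted above; plot.gbm averages out the other predictors):
plot(boost.hitters, i.var = "CAtBat")   # partial dependence of log Salary on CAtBat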
g.Now apply bagging to the training set. What is the test set MSE for this approach ?
set.seed(1)
bag.hitters = randomForest(Salary~.,data=Hitters.train,mtry=19,ntree=500)
yhat.bag = predict(bag.hitters,newdata=Hitters.test)
mean((yhat.bag-Hitters.test$Salary)^2)
## [1] 0.2299324
The test MSE for bagging is 0.23, which is slightly lower than the test MSE for boosting.
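The bagged model can also report which variables drive its splits (a small sketch using the bag.hitters object above; with the default settings, importance() returns the total decrease in node impurity):
importance(bag.hitters)
varImpPlot(bag.hitters)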
Q11
This question uses the “Caravan” data set.
a.Create a training set consisting of the first 1000 observations, and a test set consisting of the remaining observations.
set.seed(1)
train = 1:1000
Caravan.train = Caravan[train,]
Caravan.test = Caravan[-train,]
b.Fit a boosting model to the training set with “Purchase” as the response and the other variables as predictors. Use 1000 trees, and a shrinkage value of 0.01. Which predictors appear to be most important ?
set.seed(1)
boost.caravan = gbm(Purchase~.,data=Caravan.train,distribution="gaussian",n.trees=1000,shrinkage=0.01)
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
## variable 50: PVRAAUT has no variation.
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
## variable 71: AVRAAUT has no variation.
summary(boost.caravan)
## var rel.inf
## PPERSAUT PPERSAUT 13.51824557
## MKOOPKLA MKOOPKLA 10.24062778
## MOPLHOOG MOPLHOOG 7.32689780
## MBERMIDD MBERMIDD 6.35820558
## PBRAND PBRAND 4.98826360
## ABRAND ABRAND 4.54504653
## MGODGE MGODGE 4.26496875
## MINK3045 MINK3045 4.13253907
## PWAPART PWAPART 3.15612877
## MOSTYPE MOSTYPE 2.79099985
## MAUT1 MAUT1 2.76929763
## MAUT2 MAUT2 1.99879666
## MSKA MSKA 1.94618539
## MBERARBG MBERARBG 1.89917331
## PBYSTAND PBYSTAND 1.88591514
## MINKGEM MINKGEM 1.87131472
## MGODOV MGODOV 1.81673309
## MGODPR MGODPR 1.80814745
## MFWEKIND MFWEKIND 1.67884570
## MSKC MSKC 1.65075962
## MBERHOOG MBERHOOG 1.53559951
## MSKB1 MSKB1 1.43339514
## MZFONDS MZFONDS 1.37139649
## MOPLMIDD MOPLMIDD 1.10617074
## MRELGE MRELGE 1.09039794
## MINK7512 MINK7512 1.08772012
## MGODRK MGODRK 1.03126657
## MINK4575 MINK4575 1.02492795
## MHHUUR MHHUUR 0.86853689
## MRELOV MRELOV 0.80356854
## MFGEKIND MFGEKIND 0.80335689
## MZPART MZPART 0.69824614
## MBERARBO MBERARBO 0.60909852
## MHKOOP MHKOOP 0.57696613
## APERSAUT APERSAUT 0.56707821
## MGEMOMV MGEMOMV 0.55589456
## MAUT0 MAUT0 0.54748481
## PMOTSCO PMOTSCO 0.43362597
## MSKB2 MSKB2 0.43075446
## MSKD MSKD 0.42751490
## MINK123M MINK123M 0.40920707
## MINKM30 MINKM30 0.36996576
## MOSHOOFD MOSHOOFD 0.33336325
## MBERBOER MBERBOER 0.28967068
## MFALLEEN MFALLEEN 0.28877552
## MGEMLEEF MGEMLEEF 0.20084195
## MOPLLAAG MOPLLAAG 0.15750616
## MBERZELF MBERZELF 0.11203381
## PLEVEN PLEVEN 0.11030994
## MRELSA MRELSA 0.04500507
## MAANTHUI MAANTHUI 0.03322830
## PWABEDR PWABEDR 0.00000000
## PWALAND PWALAND 0.00000000
## PBESAUT PBESAUT 0.00000000
## PVRAAUT PVRAAUT 0.00000000
## PAANHANG PAANHANG 0.00000000
## PTRACTOR PTRACTOR 0.00000000
## PWERKT PWERKT 0.00000000
## PBROM PBROM 0.00000000
## PPERSONG PPERSONG 0.00000000
## PGEZONG PGEZONG 0.00000000
## PWAOREG PWAOREG 0.00000000
## PZEILPL PZEILPL 0.00000000
## PPLEZIER PPLEZIER 0.00000000
## PFIETS PFIETS 0.00000000
## PINBOED PINBOED 0.00000000
## AWAPART AWAPART 0.00000000
## AWABEDR AWABEDR 0.00000000
## AWALAND AWALAND 0.00000000
## ABESAUT ABESAUT 0.00000000
## AMOTSCO AMOTSCO 0.00000000
## AVRAAUT AVRAAUT 0.00000000
## AAANHANG AAANHANG 0.00000000
## ATRACTOR ATRACTOR 0.00000000
## AWERKT AWERKT 0.00000000
## ABROM ABROM 0.00000000
## ALEVEN ALEVEN 0.00000000
## APERSONG APERSONG 0.00000000
## AGEZONG AGEZONG 0.00000000
## AWAOREG AWAOREG 0.00000000
## AZEILPL AZEILPL 0.00000000
## APLEZIER APLEZIER 0.00000000
## AFIETS AFIETS 0.00000000
## AINBOED AINBOED 0.00000000
## ABYSTAND ABYSTAND 0.00000000
“PPERSAUT” and “MKOOPKLA” are the two most important predictors.
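Rather than scanning the full table, the top of the relative-influence ranking can be printed on its own (a small sketch; plotit = FALSE suppresses the bar plot that summary() would otherwise draw):
head(summary(boost.caravan, plotit = FALSE), 10)   # ten most influential predictors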
Q12
Apply boosting, bagging, and random forests to a data set of your choice. Be sure to fit models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods like linear or logistic regression ? Which of these approaches yields the best performance ?
We will use the “Weekly” data set from the “ISLR” package to predict the “Direction” variable.
library(ISLR)
set.seed(1)
train = sample(nrow(Weekly),nrow(Weekly)/2)
Weekly$Direction = ifelse(Weekly$Direction=="Up",1,0)
Weekly.train = Weekly[train,]
Weekly.test = Weekly[-train,]
We begin with logistic regression.
logit.fit = glm(Direction~.-Year-Today,data=Weekly.train,family='binomial')
logit.probs = predict(logit.fit,newdata=Weekly.test,type='response')
logit.pred = ifelse(logit.probs>0.5,1,0)
table(Weekly.test$Direction,logit.pred)
## logit.pred
## 0 1
## 0 82 144
## 1 100 219
(82+219)/(82+144+100+219)
## [1] 0.5522936
Logistic regression attains a test accuracy of about 0.55 (a classification error of about 0.45). We continue with boosting.
boost.fit = gbm(Direction~.-Year-Today,data=Weekly.train,distribution='bernoulli',n.trees=5000)
boost.probs = predict(boost.fit,Weekly.test,n.trees=5000)
boost.pred = ifelse(boost.probs>0.5,1,0)
table(Weekly.test$Direction,boost.pred)
## boost.pred
## 0 1
## 0 126 100
## 1 180 139
(126+139)/(126+100+180+139)
## [1] 0.4862385
Boosting attains a test accuracy of about 0.49 (a classification error of about 0.51). We continue with bagging.
bag.fit = randomForest(Direction~.-Year-Today,data=Weekly.train,mtry=6)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
bag.probs = predict(bag.fit,newdata=Weekly.test)
bag.pred = ifelse(bag.probs>0.5,1,0)
table(Weekly.test$Direction,bag.pred)
## bag.pred
## 0 1
## 0 101 125
## 1 125 194
(101+194)/(101+125+125+194)
## [1] 0.5412844
Bagging attains a test accuracy of about 0.54. We end with random forests.
rf.fit <- randomForest(Direction ~ . - Year - Today, data = Weekly.train, mtry = 2)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
rf.probs <- predict(rf.fit, newdata = Weekly.test)
rf.pred <- ifelse(rf.probs > 0.5, 1, 0)
table(Weekly.test$Direction, rf.pred)
## rf.pred
## 0 1
## 0 103 123
## 1 125 194
(103+194)/(103+123+125+194)
## [1] 0.5449541
The random forest attains a test accuracy of about 0.54.
We may conclude that logistic regression yields the highest test accuracy, i.e. the lowest classification error, of the four approaches on this split.
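The four test accuracies can also be collected and compared directly from the prediction vectors above (a quick summary sketch; the exact values depend on the random seed and on the fitting choices made above):
acc <- c(logistic = mean(logit.pred == Weekly.test$Direction),
         boosting = mean(boost.pred == Weekly.test$Direction),
         bagging  = mean(bag.pred  == Weekly.test$Direction),
         rf       = mean(rf.pred   == Weekly.test$Direction))
sort(acc, decreasing = TRUE)   # higher accuracy = lower classification error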