Q1. Draw an example (of your own invention) of a partition of two-dimensional feature space that could result from recursive binary splitting. Your example should contain at least six regions. Draw a decision tree corresponding to this partition. Be sure to label all aspects of your figures, including the regions \(R_1,R_2,...\), the cutpoints \(t_1,t_2,...\), and so forth.
par(xpd = NA)
plot(NA, NA, type = "n", xlim = c(0,100), ylim = c(0,100), xlab = "X", ylab = "Y")
# t1: x = 40; (40, 0) (40, 100)
lines(x = c(40,40), y = c(0,100))
text(x = 40, y = 108, labels = c("t1"), col = "red")
# t2: y = 75; (0, 75) (40, 75)
lines(x = c(0,40), y = c(75,75))
text(x = -8, y = 75, labels = c("t2"), col = "red")
# t3: x = 75; (75,0) (75, 100)
lines(x = c(75,75), y = c(0,100))
text(x = 75, y = 108, labels = c("t3"), col = "red")
# t4: x = 20; (20,0) (20, 75)
lines(x = c(20,20), y = c(0,75))
text(x = 20, y = 80, labels = c("t4"), col = "red")
# t5: y=25; (75,25) (100,25)
lines(x = c(75,100), y = c(25,25))
text(x = 70, y = 25, labels = c("t5"), col = "red")
text(x = (40+75)/2, y = 50, labels = c("R1"))
text(x = 20, y = (100+75)/2, labels = c("R2"))
text(x = (75+100)/2, y = (100+25)/2, labels = c("R3"))
text(x = (75+100)/2, y = 25/2, labels = c("R4"))
text(x = 30, y = 75/2, labels = c("R5"))
text(x = 10, y = 75/2, labels = c("R6"))
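The decision tree corresponding to this partition can also be sketched in base R. The drawing below is only a schematic (the node coordinates are arbitrary layout choices, not part of the partition above); the left branch of each split is taken when its condition is true.
par(xpd = NA, mar = rep(1, 4))
plot(NA, NA, type = "n", xlim = c(0, 100), ylim = c(0, 100), axes = FALSE, xlab = "", ylab = "")
# root: split on t1
text(50, 95, "X < 40 (t1)")
segments(c(50, 50), c(92, 92), c(25, 75), c(80, 80))
# second level: t2 on the left branch, t3 on the right branch
text(25, 77, "Y < 75 (t2)")
text(75, 77, "X < 75 (t3)")
segments(c(25, 25), c(74, 74), c(12, 38), c(60, 60))
segments(c(75, 75), c(74, 74), c(62, 88), c(60, 60))
# third level: t4 under t2, t5 under t3, plus the leaves R2 and R1
text(12, 57, "X < 20 (t4)")
text(38, 57, "R2")
text(62, 57, "R1")
text(88, 57, "Y < 25 (t5)")
segments(c(12, 12), c(54, 54), c(5, 20), c(40, 40))
segments(c(88, 88), c(54, 54), c(80, 95), c(40, 40))
# remaining leaves
text(c(5, 20, 80, 95), rep(37, 4), c("R6", "R5", "R4", "R3"))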
Q2. It is mentioned in Section 8.2.3 that boosting using depth-one trees (or stumps) leads to an additive model: that is, a model of the form \[f(X) = \sum_{j = 1}^p f_j(X_j).\] Explain why this is the case.
We begin by detailing the boosting algorithm (Algorithm 8.2) step by step. First we set \(\hat{f}(x) = 0\) and \(r_i = y_i\ \forall i\). At step \(b\), a depth-one tree (stump) is fit to the training data \((X, r)\); because a stump makes a single split, it is a function of a single predictor, \[\hat{f}^b(x) = c_bI(x_{j_b} < t_b) + c_b',\] where \(j_b\) is the predictor chosen at step \(b\). We then update \(\hat{f}(x) \leftarrow \hat{f}(x) + \lambda\hat{f}^b(x)\) and \(r_i \leftarrow r_i - \lambda\hat{f}^b(x_i)\ \forall i\). After \(B\) steps the boosted model is \(\hat{f}(x) = \sum_{b=1}^B \lambda\hat{f}^b(x)\), a sum of terms each of which involves only one predictor. Collecting the stumps that split on the same predictor, \(f_j(X_j) = \sum_{b:\,j_b = j}\lambda\hat{f}^b(X_j)\), we obtain \[\hat{f}(X) = \sum_{j = 1}^p f_j(X_j),\] which is an additive model.
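As a small numerical illustration (a sketch only; it uses the gbm package, which also appears later in this document, and the toy objects x1, x2, dat and boost.stumps are introduced here), boosting with stumps corresponds to setting interaction.depth = 1, and the resulting per-predictor functions \(f_j\) can be inspected with partial dependence plots.
library(gbm)
set.seed(1)
# toy data with a purely additive signal in x1 and x2
n <- 500
x1 <- runif(n)
x2 <- runif(n)
y <- sin(2 * pi * x1) + 2 * (x2 > 0.5) + rnorm(n, sd = 0.1)
dat <- data.frame(y, x1, x2)
# depth-one trees: every tree in the ensemble involves a single predictor
boost.stumps <- gbm(y ~ x1 + x2, data = dat, distribution = "gaussian", n.trees = 2000, interaction.depth = 1, shrinkage = 0.01)
# the fitted contributions of x1 and x2
plot(boost.stumps, i.var = "x1", n.trees = 2000)
plot(boost.stumps, i.var = "x2", n.trees = 2000)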
Q3. Consider the Gini index, classification error, and cross-entropy in a simple setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The \(x\)-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the \(y\)-axis should display the value of the Gini index, classification error, and entropy.
p <- seq(0, 1, 0.01)
gini.index <- 2 * p * (1 - p)
class.error <- 1 - pmax(p, 1 - p)
# the cross-entropy is NaN at p = 0 and p = 1; those two points are simply dropped from the plot
cross.entropy <- -(p * log(p) + (1 - p) * log(1 - p))
matplot(p, cbind(gini.index, class.error, cross.entropy), type = "l", lty = 1, col = c("red", "green", "blue"), xlab = expression(hat(p)[m1]), ylab = "Value")
legend("topleft", legend = c("Gini index", "Classification error", "Cross-entropy"), col = c("red", "green", "blue"), lty = 1)
Q4. This question relates to the plots in Figure 8.12.
If \(X_1 \ge 1\) then 5, else if \(X_2 \ge 1\) then 15, else if \(X_1 < 0\) then 3, else if \(X_2 <0\) then 10, else 0.
par(xpd = NA)
plot(NA, NA, type = "n", xlim = c(-2, 2), ylim = c(-3, 3), xlab = "X1", ylab = "X2")
# X2 < 1
lines(x = c(-2, 2), y = c(1, 1))
# X1 < 1 with X2 < 1
lines(x = c(1, 1), y = c(-3, 1))
text(x = (-2 + 1)/2, y = -1, labels = c(-1.8))
text(x = 1.5, y = -1, labels = c(0.63))
# X2 < 2 with X2 >= 1
lines(x = c(-2, 2), y = c(2, 2))
text(x = 0, y = 2.5, labels = c(2.49))
# X1 < 0 with X2<2 and X2>=1
lines(x = c(0, 0), y = c(1, 2))
text(x = -1, y = 1.5, labels = c(-1.06))
text(x = 1, y = 1.5, labels = c(0.21))
Q5. Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of \(X\), produce 10 estimates of \(P(\text{Class is Red}|X)\): \[0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75.\] There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in this chapter. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches?
With the majority vote approach, we classify \(X\) as Red as it is the most commonly occurring class among the 10 predictions (6 for Red vs 4 for Green). With the average probability approach, we classify \(X\) as Green as the average of the 10 probabilities is 0.45.
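Both rules are easy to verify directly in R (probs below is just the vector of the ten estimates):
probs <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
# majority vote: how many of the 10 trees predict Red, i.e. have P(Red | X) > 0.5?
sum(probs > 0.5)   # 6 of 10, so the majority vote is Red
# average probability
mean(probs)        # 0.45 < 0.5, so the classification is Green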
Q6. Provide a detailed explanation of the algorithm that is used to fit a regression tree.
First we use recursive binary splitting to grow a large tree on the training data. This is a top-down, greedy approach: at each step we find the single best split of the data, i.e. the predictor and cutpoint whose split leads to the greatest reduction in RSS. The process is applied recursively to each of the resulting regions until every terminal node contains fewer than some minimum number of observations.
Then we apply cost complexity pruning to the large tree formed in step 1 in order to obtain a sequence of best subtrees as a function of a tuning parameter \(\alpha\). Each value of \(\alpha\) corresponds to the subtree that minimizes \[\sum_{m=1}^{|T|}\sum_{i:\,x_i\in R_m}(y_i - \hat y_{R_m})^2 + \alpha |T|.\] Here \(|T|\) is the number of terminal nodes of the tree. When \(\alpha = 0\) we have the original tree, and as \(\alpha\) increases we obtain increasingly pruned versions of it.
Next, we use K-fold cross-validation to choose \(\alpha\): for each fold, we repeat steps 1 and 2 on all but the held-out fold and evaluate the MSE on the held-out fold as a function of \(\alpha\). We then choose the \(\alpha\) that minimizes the average error.
Finally, given the \(\alpha\) chosen in the previous step, we return the subtree from step 2, fit on the entire data set, corresponding to that value of \(\alpha\).
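As a minimal sketch of these steps in R (assuming the tree package and the Boston data from MASS, both of which are also used later in this document; the object names big.tree, cv.out, best.size and final.tree are introduced here):
library(MASS)   # Boston data
library(tree)   # tree(), cv.tree(), prune.tree()
# Step 1: grow a large tree by recursive binary splitting
big.tree <- tree(medv ~ ., data = Boston)
# Steps 2 and 3: cost complexity pruning, with alpha chosen by K-fold cross-validation
cv.out <- cv.tree(big.tree)
best.size <- cv.out$size[which.min(cv.out$dev)]
# Step 4: return the subtree of the chosen size, fit on the full data set
final.tree <- prune.tree(big.tree, best = best.size)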
Q7. In the lab, we applied random forests to the “Boston” data using “mtry = 6” and using “ntree = 25” and “ntree = 500”. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for “mtry” and “ntree”. Describe the results obtained.
library(MASS)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
Boston.train <- Boston[train, -14]
Boston.test <- Boston[-train, -14]
Y.train <- Boston[train, 14]
Y.test <- Boston[-train, 14]
rf.boston1 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, mtry = ncol(Boston) - 1, ntree = 500)
rf.boston2 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, mtry = (ncol(Boston) - 1) / 2, ntree = 500)
rf.boston3 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, mtry = sqrt(ncol(Boston) - 1), ntree = 500)
plot(1:500, rf.boston1$test$mse, col = "green", type = "l", xlab = "Number of Trees", ylab = "Test MSE", ylim = c(10, 19))
lines(1:500, rf.boston2$test$mse, col = "red", type = "l")
lines(1:500, rf.boston3$test$mse, col = "blue", type = "l")
legend("topright", c("m = p", "m = p/2", "m = sqrt(p)"), col = c("green", "red", "blue"), cex = 1, lty = 1)
We may see that the Test MSE is very high for a single tree and decreases as the number of trees increases. Also, the Test MSE when using all predictors (m = p, i.e. bagging) is higher than when using half of the predictors or the square root of the number of predictors.
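A fuller grid over “mtry” can be examined with a short loop (an additional sketch reusing the training and test sets above; mtry.grid and mse.by.mtry are introduced here):
mtry.grid <- 1:(ncol(Boston) - 1)
mse.by.mtry <- sapply(mtry.grid, function(m) {
    rf <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, mtry = m, ntree = 500)
    tail(rf$test$mse, 1)  # test MSE once all 500 trees are used
})
plot(mtry.grid, mse.by.mtry, type = "b", xlab = "mtry", ylab = "Test MSE (ntree = 500)")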
Q8. In the lab, a classification tree was applied to the “Carseats” data set after converting “Sales” into a qualitative response variable. Now we will seek to predict “Sales” using regression trees and related approaches, treating the response as a quantitative variable.
library(ISLR)
set.seed(1)
train <- sample(1:nrow(Carseats), nrow(Carseats) / 2)
Carseats.train <- Carseats[train, ]
Carseats.test <- Carseats[-train, ]
library(tree)
tree.carseats <- tree(Sales ~ ., data = Carseats.train)
summary(tree.carseats)
##
## Regression tree:
## tree(formula = Sales ~ ., data = Carseats.train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Age" "Advertising" "Income"
## [6] "CompPrice"
## Number of terminal nodes: 18
## Residual mean deviance: 2.36 = 429.5 / 182
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.2570 -1.0360 0.1024 0.0000 0.9301 3.9130
plot(tree.carseats)
text(tree.carseats, pretty = 0)
yhat <- predict(tree.carseats, newdata = Carseats.test)
mean((yhat - Carseats.test$Sales)^2)
## [1] 4.148897
We may conclude that the Test MSE is about 4.15.
cv.carseats <- cv.tree(tree.carseats)
plot(cv.carseats$size, cv.carseats$dev, type = "b")
tree.min <- which.min(cv.carseats$dev)
points(cv.carseats$size[tree.min], cv.carseats$dev[tree.min], col = "red", cex = 2, pch = 20)
In this case, the tree of size 8 is selected by cross-validation. We now prune the tree to obtain the 8-node tree.
prune.carseats <- prune.tree(tree.carseats, best = 8)
plot(prune.carseats)
text(prune.carseats, pretty = 0)
yhat <- predict(prune.carseats, newdata = Carseats.test)
mean((yhat - Carseats.test$Sales)^2)
## [1] 5.09085
We may see that pruning the tree increases the Test MSE to 5.1.
bag.carseats <- randomForest(Sales ~ ., data = Carseats.train, mtry = 10, ntree = 500, importance = TRUE)
yhat.bag <- predict(bag.carseats, newdata = Carseats.test)
mean((yhat.bag - Carseats.test$Sales)^2)
## [1] 2.604369
We may see that bagging decreases the Test MSE to 2.6.
importance(bag.carseats)
## %IncMSE IncNodePurity
## CompPrice 14.4124562 133.731797
## Income 6.5147532 74.346961
## Advertising 15.7607104 117.822651
## Population 0.6031237 60.227867
## Price 57.8206926 514.802084
## ShelveLoc 43.0486065 319.117972
## Age 19.8789659 192.880596
## Education 2.9319161 39.490093
## Urban -3.1300102 8.695529
## US 7.6298722 15.723975
We may conclude that “Price” and “ShelveLoc” are the two most important variables.
rf.carseats <- randomForest(Sales ~ ., data = Carseats.train, mtry = 3, ntree = 500, importance = TRUE)
yhat.rf <- predict(rf.carseats, newdata = Carseats.test)
mean((yhat.rf - Carseats.test$Sales)^2)
## [1] 3.296078
In this case, with \(m = \sqrt{p}\), we have a Test MSE of 3.3.
importance(rf.carseats)
## %IncMSE IncNodePurity
## CompPrice 7.5233429 127.36625
## Income 4.3612064 119.19152
## Advertising 12.5799388 138.13567
## Population -0.2974474 100.28836
## Price 37.1612032 383.12126
## ShelveLoc 30.3751253 246.19930
## Age 16.0261885 197.44865
## Education 1.7855151 63.87939
## Urban -1.3928225 16.01173
## US 5.6393475 32.85850
We may conclude that, in this case also, “Price” and “ShelveLoc” are the two most important variables.
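The effect of \(m\) on the test error can likewise be examined over the full range of “mtry” values (a small additional sketch; carseats.mse is introduced here):
mtry.grid <- 1:10
carseats.mse <- sapply(mtry.grid, function(m) {
    rf <- randomForest(Sales ~ ., data = Carseats.train, mtry = m, ntree = 500)
    mean((predict(rf, newdata = Carseats.test) - Carseats.test$Sales)^2)
})
plot(mtry.grid, carseats.mse, type = "b", xlab = "mtry", ylab = "Test MSE")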
Q9. This problem involves the “OJ” data set which is part of the “ISLR” package.
set.seed(1)
train <- sample(1:nrow(OJ), 800)
OJ.train <- OJ[train, ]
OJ.test <- OJ[-train, ]
tree.oj <- tree(Purchase ~ ., data = OJ.train)
summary(tree.oj)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH" "ListPriceDiff"
## Number of terminal nodes: 8
## Residual mean deviance: 0.7305 = 578.6 / 792
## Misclassification error rate: 0.165 = 132 / 800
The fitted tree has 8 terminal nodes and a training error rate of 0.165.
tree.oj
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1064.00 CH ( 0.61750 0.38250 )
## 2) LoyalCH < 0.508643 350 409.30 MM ( 0.27143 0.72857 )
## 4) LoyalCH < 0.264232 166 122.10 MM ( 0.12048 0.87952 )
## 8) LoyalCH < 0.0356415 57 10.07 MM ( 0.01754 0.98246 ) *
## 9) LoyalCH > 0.0356415 109 100.90 MM ( 0.17431 0.82569 ) *
## 5) LoyalCH > 0.264232 184 248.80 MM ( 0.40761 0.59239 )
## 10) PriceDiff < 0.195 83 91.66 MM ( 0.24096 0.75904 )
## 20) SpecialCH < 0.5 70 60.89 MM ( 0.15714 0.84286 ) *
## 21) SpecialCH > 0.5 13 16.05 CH ( 0.69231 0.30769 ) *
## 11) PriceDiff > 0.195 101 139.20 CH ( 0.54455 0.45545 ) *
## 3) LoyalCH > 0.508643 450 318.10 CH ( 0.88667 0.11333 )
## 6) LoyalCH < 0.764572 172 188.90 CH ( 0.76163 0.23837 )
## 12) ListPriceDiff < 0.235 70 95.61 CH ( 0.57143 0.42857 ) *
## 13) ListPriceDiff > 0.235 102 69.76 CH ( 0.89216 0.10784 ) *
## 7) LoyalCH > 0.764572 278 86.14 CH ( 0.96403 0.03597 ) *
We pick the node labelled 8, which is a terminal node because of the asterisk. The split criterion is LoyalCH < 0.035, the number of observations in that branch is 57 with a deviance of 10.07 and an overall prediction for the branch of MM. Less than 2% of the observations in that branch take the value of CH, and the remaining 98% take the value of MM.
plot(tree.oj)
text(tree.oj, pretty = 0)
We may see that the most important indicator of “Purchase” appears to be “LoyalCH”, since the first branch differentiates the intensity of customer brand loyalty to CH. In fact, the top three nodes contain “LoyalCH”.
tree.pred <- predict(tree.oj, OJ.test, type = "class")
table(tree.pred, OJ.test$Purchase)
##
## tree.pred CH MM
## CH 147 49
## MM 12 62
1 - (147 + 62) / 270
## [1] 0.2259259
We may conclude that the test error rate is about 22%.
cv.oj <- cv.tree(tree.oj, FUN = prune.misclass)
cv.oj
## $size
## [1] 8 5 2 1
##
## $dev
## [1] 156 156 156 306
##
## $k
## [1] -Inf 0.000000 4.666667 160.000000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(cv.oj$size, cv.oj$dev, type = "b", xlab = "Tree size", ylab = "Deviance")
We may see that the 2-node tree is the smallest tree with the lowest classification error rate.
prune.oj <- prune.misclass(tree.oj, best = 2)
plot(prune.oj)
text(prune.oj, pretty = 0)
summary(tree.oj)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH" "ListPriceDiff"
## Number of terminal nodes: 8
## Residual mean deviance: 0.7305 = 578.6 / 792
## Misclassification error rate: 0.165 = 132 / 800
summary(prune.oj)
##
## Classification tree:
## snip.tree(tree = tree.oj, nodes = c(3L, 2L))
## Variables actually used in tree construction:
## [1] "LoyalCH"
## Number of terminal nodes: 2
## Residual mean deviance: 0.9115 = 727.4 / 798
## Misclassification error rate: 0.1825 = 146 / 800
The misclassification error rate is slightly higher for the pruned tree (0.1825 vs 0.165).
prune.pred <- predict(prune.oj, OJ.test, type = "class")
table(prune.pred, OJ.test$Purchase)
##
## prune.pred CH MM
## CH 119 30
## MM 40 81
1 - (119 + 81) / 270
## [1] 0.2592593
In this case, the pruning process increased the test error rate to about 26%, but it produced a much more interpretable tree.
Q10. We now use boosting to predict “Salary” in the “Hitters” data set.
Hitters <- na.omit(Hitters)
Hitters$Salary <- log(Hitters$Salary)
train <- 1:200
Hitters.train <- Hitters[train, ]
Hitters.test <- Hitters[-train, ]
library(gbm)
set.seed(1)
pows <- seq(-10, -0.2, by = 0.1)
lambdas <- 10^pows
train.err <- rep(NA, length(lambdas))
for (i in 1:length(lambdas)) {
    boost.hitters <- gbm(Salary ~ ., data = Hitters.train, distribution = "gaussian", n.trees = 1000, shrinkage = lambdas[i])
    pred.train <- predict(boost.hitters, Hitters.train, n.trees = 1000)
    train.err[i] <- mean((pred.train - Hitters.train$Salary)^2)
}
plot(lambdas, train.err, type = "b", xlab = "Shrinkage values", ylab = "Training MSE")
set.seed(1)
test.err <- rep(NA, length(lambdas))
for (i in 1:length(lambdas)) {
    boost.hitters <- gbm(Salary ~ ., data = Hitters.train, distribution = "gaussian", n.trees = 1000, shrinkage = lambdas[i])
    yhat <- predict(boost.hitters, Hitters.test, n.trees = 1000)
    test.err[i] <- mean((yhat - Hitters.test$Salary)^2)
}
plot(lambdas, test.err, type = "b", xlab = "Shrinkage values", ylab = "Test MSE")
min(test.err)
## [1] 0.2540265
lambdas[which.min(test.err)]
## [1] 0.07943282
The minimum test MSE is 0.25, and is obtained for \(\lambda = 0.079\).
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 1.9-8
fit1 <- lm(Salary ~ ., data = Hitters.train)
pred1 <- predict(fit1, Hitters.test)
mean((pred1 - Hitters.test$Salary)^2)
## [1] 0.4917959
x <- model.matrix(Salary ~ ., data = Hitters.train)
x.test <- model.matrix(Salary ~ ., data = Hitters.test)
y <- Hitters.train$Salary
fit2 <- glmnet(x, y, alpha = 0)
pred2 <- predict(fit2, s = 0.01, newx = x.test)
mean((pred2 - Hitters.test$Salary)^2)
## [1] 0.4570283
The test MSE for boosting is lower than for linear regression and ridge regression.
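The ridge fit above uses an arbitrary penalty (s = 0.01); if we instead wanted to choose the penalty by cross-validation, a sketch with cv.glmnet would be (cv.ridge and pred.cv are introduced here):
set.seed(1)
cv.ridge <- cv.glmnet(x, y, alpha = 0)
pred.cv <- predict(cv.ridge, s = cv.ridge$lambda.min, newx = x.test)
mean((pred.cv - Hitters.test$Salary)^2)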
boost.hitters <- gbm(Salary ~ ., data = Hitters.train, distribution = "gaussian", n.trees = 1000, shrinkage = lambdas[which.min(test.err)])
summary(boost.hitters)
## var rel.inf
## CAtBat CAtBat 23.6278225
## Walks Walks 7.4858538
## PutOuts PutOuts 7.4438741
## CWalks CWalks 7.3437919
## CRBI CRBI 6.7581127
## Years Years 6.0435181
## CRuns CRuns 5.8389531
## Assists Assists 5.2910978
## CHmRun CHmRun 4.9321654
## Hits Hits 4.4338073
## CHits CHits 4.4008947
## RBI RBI 4.2852503
## HmRun HmRun 3.2708823
## AtBat AtBat 3.1569920
## Errors Errors 2.6144780
## Runs Runs 1.8095817
## Division Division 0.6302155
## NewLeague NewLeague 0.5079414
## League League 0.1247675
We may see that “CAtBat” is by far the most important variable.
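The marginal effect of this variable on log(Salary) can be inspected with a partial dependence plot (a quick sketch using the boost.hitters model fitted above):
plot(boost.hitters, i.var = "CAtBat", n.trees = 1000)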
set.seed(1)
bag.hitters <- randomForest(Salary ~ ., data = Hitters.train, mtry = 19, ntree = 500)
yhat.bag <- predict(bag.hitters, newdata = Hitters.test)
mean((yhat.bag - Hitters.test$Salary)^2)
## [1] 0.2313593
The test MSE for bagging is 0.23, which is slightly lower than the test MSE for boosting.
Q11. This question uses the “Caravan” data set.
set.seed(1)
train <- 1:1000
Caravan$Purchase <- ifelse(Caravan$Purchase == "Yes", 1, 0)
Caravan.train <- Caravan[train, ]
Caravan.test <- Caravan[-train, ]
set.seed(1)
boost.caravan <- gbm(Purchase ~ ., data = Caravan.train, distribution = "gaussian", n.trees = 1000, shrinkage = 0.01)
## Warning in gbm.fit(x, y, offset = offset, distribution = distribution, w =
## w, : variable 50: PVRAAUT has no variation.
## Warning in gbm.fit(x, y, offset = offset, distribution = distribution, w =
## w, : variable 71: AVRAAUT has no variation.
summary(boost.caravan)
## var rel.inf
## PPERSAUT PPERSAUT 13.51824557
## MKOOPKLA MKOOPKLA 10.24062778
## MOPLHOOG MOPLHOOG 7.32689780
## MBERMIDD MBERMIDD 6.35820558
## PBRAND PBRAND 4.98826360
## ABRAND ABRAND 4.54504653
## MGODGE MGODGE 4.26496875
## MINK3045 MINK3045 4.13253907
## PWAPART PWAPART 3.15612877
## MAUT1 MAUT1 2.76929763
## MOSTYPE MOSTYPE 2.56937935
## MAUT2 MAUT2 1.99879666
## MSKA MSKA 1.94618539
## MBERARBG MBERARBG 1.89917331
## PBYSTAND PBYSTAND 1.88591514
## MINKGEM MINKGEM 1.87131472
## MGODOV MGODOV 1.81673309
## MGODPR MGODPR 1.80814745
## MFWEKIND MFWEKIND 1.67884570
## MSKC MSKC 1.65075962
## MBERHOOG MBERHOOG 1.53559951
## MSKB1 MSKB1 1.43339514
## MOPLMIDD MOPLMIDD 1.10617074
## MHHUUR MHHUUR 1.09608784
## MRELGE MRELGE 1.09039794
## MINK7512 MINK7512 1.08772012
## MZFONDS MZFONDS 1.08427551
## MGODRK MGODRK 1.03126657
## MINK4575 MINK4575 1.02492795
## MZPART MZPART 0.98536712
## MRELOV MRELOV 0.80356854
## MFGEKIND MFGEKIND 0.80335689
## MBERARBO MBERARBO 0.60909852
## APERSAUT APERSAUT 0.56707821
## MGEMOMV MGEMOMV 0.55589456
## MOSHOOFD MOSHOOFD 0.55498375
## MAUT0 MAUT0 0.54748481
## PMOTSCO PMOTSCO 0.43362597
## MSKB2 MSKB2 0.43075446
## MSKD MSKD 0.42751490
## MINK123M MINK123M 0.40920707
## MINKM30 MINKM30 0.36996576
## MHKOOP MHKOOP 0.34941518
## MBERBOER MBERBOER 0.28967068
## MFALLEEN MFALLEEN 0.28877552
## MGEMLEEF MGEMLEEF 0.20084195
## MOPLLAAG MOPLLAAG 0.15750616
## MBERZELF MBERZELF 0.11203381
## PLEVEN PLEVEN 0.11030994
## MRELSA MRELSA 0.04500507
## MAANTHUI MAANTHUI 0.03322830
## PWABEDR PWABEDR 0.00000000
## PWALAND PWALAND 0.00000000
## PBESAUT PBESAUT 0.00000000
## PVRAAUT PVRAAUT 0.00000000
## PAANHANG PAANHANG 0.00000000
## PTRACTOR PTRACTOR 0.00000000
## PWERKT PWERKT 0.00000000
## PBROM PBROM 0.00000000
## PPERSONG PPERSONG 0.00000000
## PGEZONG PGEZONG 0.00000000
## PWAOREG PWAOREG 0.00000000
## PZEILPL PZEILPL 0.00000000
## PPLEZIER PPLEZIER 0.00000000
## PFIETS PFIETS 0.00000000
## PINBOED PINBOED 0.00000000
## AWAPART AWAPART 0.00000000
## AWABEDR AWABEDR 0.00000000
## AWALAND AWALAND 0.00000000
## ABESAUT ABESAUT 0.00000000
## AMOTSCO AMOTSCO 0.00000000
## AVRAAUT AVRAAUT 0.00000000
## AAANHANG AAANHANG 0.00000000
## ATRACTOR ATRACTOR 0.00000000
## AWERKT AWERKT 0.00000000
## ABROM ABROM 0.00000000
## ALEVEN ALEVEN 0.00000000
## APERSONG APERSONG 0.00000000
## AGEZONG AGEZONG 0.00000000
## AWAOREG AWAOREG 0.00000000
## AZEILPL AZEILPL 0.00000000
## APLEZIER APLEZIER 0.00000000
## AFIETS AFIETS 0.00000000
## AINBOED AINBOED 0.00000000
## ABYSTAND ABYSTAND 0.00000000
The variables “PPERSAUT” and “MKOOPKLA” are the two most important variables.
probs.test <- predict(boost.caravan, Caravan.test, n.trees = 1000, type = "response")
pred.test <- ifelse(probs.test > 0.2, 1, 0)
table(Caravan.test$Purchase, pred.test)
## pred.test
## 0 1
## 0 4493 40
## 1 278 11
For boosting, the fraction of people predicted to make a purchase who in fact make one is 11 / (40 + 11) ≈ 0.2157.
logit.caravan <- glm(Purchase ~ ., data = Caravan.train, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
probs.test2 <- predict(logit.caravan, Caravan.test, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
pred.test2 <- ifelse(probs.test2 > 0.2, 1, 0)
table(Caravan.test$Purchase, pred.test2)
For logistic regression, the fraction of people predicted to make a purchase who in fact make one is lower than for boosting, so boosting does a better job of identifying likely purchasers here.
Q12. Apply boosting, bagging, and random forests to a data set of your choice. Be sure to fit models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods like linear or logistic regression? Which of these approaches yields the best performance?
We will use the “Weekly” data set from the “ISLR” package to predict the “Direction” variable.
set.seed(1)
train <- sample(nrow(Weekly), nrow(Weekly) / 2)
Weekly$Direction <- ifelse(Weekly$Direction == "Up", 1, 0)
Weekly.train <- Weekly[train, ]
Weekly.test <- Weekly[-train, ]
We begin with logistic regression.
logit.fit <- glm(Direction ~ . - Year - Today, data = Weekly.train, family = "binomial")
logit.probs <- predict(logit.fit, newdata = Weekly.test, type = "response")
logit.pred <- ifelse(logit.probs > 0.5, 1, 0)
table(Weekly.test$Direction, logit.pred)
## logit.pred
## 0 1
## 0 11 244
## 1 8 282
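The classification errors quoted in this question are the off-diagonal counts of each confusion table divided by the 545 test observations; a small helper (err, introduced here only as a check) computes them directly:
err <- function(truth, pred) mean(truth != pred)
err(Weekly.test$Direction, logit.pred)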
We have a classification error of 0.4623853. We continue with boosting.
boost.fit <- gbm(Direction ~ . - Year - Today, data = Weekly.train, distribution = "bernoulli", n.trees = 5000)
boost.probs <- predict(boost.fit, newdata = Weekly.test, n.trees = 5000)
boost.pred <- ifelse(boost.probs > 0.5, 1, 0)
table(Weekly.test$Direction, boost.pred)
## boost.pred
## 0 1
## 0 166 89
## 1 181 109
We have a classification error of 0.4954128. We continue with bagging.
bag.fit <- randomForest(Direction ~ . - Year - Today, data = Weekly.train, mtry = 6)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
bag.probs <- predict(bag.fit, newdata = Weekly.test)
bag.pred <- ifelse(bag.probs > 0.5, 1, 0)
table(Weekly.test$Direction, bag.pred)
## bag.pred
## 0 1
## 0 85 170
## 1 71 219
We have a classification error of (170 + 71) / 545 ≈ 0.4422. We end with random forests.
rf.fit <- randomForest(Direction ~ . - Year - Today, data = Weekly.train, mtry = 2)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
rf.probs <- predict(rf.fit, newdata = Weekly.test)
rf.pred <- ifelse(rf.probs > 0.5, 1, 0)
table(Weekly.test$Direction, rf.pred)
## rf.pred
## 0 1
## 0 69 186
## 1 62 228
We have a classification error of 0.4550459.
We may conclude that bagging gave the lowest classification error on this test set, followed by random forests, logistic regression, and boosting.