8. Tree-Based Methods

 

Conceptual

 

1. Draw an example (of your own invention) of a partition of two-dimensional feature space that could result from recursive binary splitting. Your example should contain at least six regions. Draw a decision tree corresponding to this partition. Be sure to label all aspects of your figures, including the regions R1,R2,…, the cutpoints t1,t2,…, and so forth.
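
There is no single correct answer here; the sketch below draws one possible partition with six regions in R. The cutpoints (t1 = 4 on X1; t2 = 5, t3 = 3, t4 = 7 on X2; t5 = 7 on X1) are arbitrary inventions of my own, and the corresponding decision tree simply mirrors the nested splits described in the comments.

# One possible partition of [0, 10] x [0, 10] with six regions (own invention)
plot(NA, xlim = c(0, 10), ylim = c(0, 10), xlab = "X1", ylab = "X2")
abline(v = 4)                      # t1: X1 < 4
segments(0, 5, 4, 5)               # t2: X2 < 5  (within X1 < 4)
segments(4, 3, 10, 3)              # t3: X2 < 3  (within X1 >= 4)
segments(4, 7, 10, 7)              # t4: X2 < 7  (within X1 >= 4 and X2 >= 3)
segments(7, 7, 7, 10)              # t5: X1 < 7  (within X1 >= 4 and X2 >= 7)
text(c(2, 2, 7, 7, 5.5, 8.5), c(2.5, 7.5, 1.5, 5, 8.5, 8.5),
     labels = paste0("R", 1:6))    # label the six regions R1, ..., R6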

 

2. It is mentioned in Section 8.2.3 that boosting using depth-one trees (or stumps) leads to an additive model: that is, a model of the form

\[f(X) = \sum_{j = 1}^p f_j(X_j)\]

Explain why this is the case. You can begin with (8.12) in Algorithm 8.2.

  • If d = 1, every term in \(\widehat{f}(x) = \sum_{b=1}^{B} \lambda \widehat{f}^b(x)\) is a stump, i.e. a function of the single predictor \(X_{j(b)}\) that the b-th tree splits on. Grouping the stumps by that predictor gives \(f_j(X_j) = \lambda \sum_{b:\, j(b) = j} \widehat{f}^b(X_j)\), so the fitted model is a sum of functions of individual predictors and hence additive (a quick check is sketched below).
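
A quick empirical check, sketched under the assumption that the gbm package and the Hitters data from ISLR are at hand: with interaction.depth = 1 every fitted tree is a single split on one predictor, which is exactly what makes the boosted fit additive.

library(gbm)
library(ISLR)
set.seed(1)
hit = na.omit(Hitters)
# interaction.depth = 1 forces every boosted tree to be a stump
stump.boost = gbm(log(Salary) ~ ., data = hit, distribution = "gaussian",
                  n.trees = 100, interaction.depth = 1, shrinkage = 0.1)
# each tree has exactly one SplitVar, so each term depends on a single predictor
pretty.gbm.tree(stump.boost, i.tree = 1)
pretty.gbm.tree(stump.boost, i.tree = 2)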

 

3. Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.

p = seq(0, 1, 0.01)
gini = 2 * p * (1 - p)
# cross-entropy; 0 * log(0) evaluates to NaN in R, so set it to 0 at the endpoints
entropy = -(p * log(p) + (1 - p) * log(1 - p))
entropy[is.na(entropy)] = 0
class.err = 1 - pmax(p, 1 - p)
matplot(p, cbind(gini, entropy, class.err), type = "l", lty = 1,
        col = c("red", "green", "blue"), xlab = "p_m1", ylab = "Value")
legend("topright", legend = c("Gini index", "Entropy", "Classification error"),
       lty = 1, col = c("red", "green", "blue"))

4. This question relates to the plots in Figure 8.12.

  (a) Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of Figure 8.12. The numbers inside the boxes indicate the mean of Y within each region.

 

  (b) Create a diagram similar to the left-hand panel of Figure 8.12, using the tree illustrated in the right-hand panel of the same figure. You should divide up the predictor space into the correct regions, and indicate the mean for each region.

 

5. Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of X, produce 10 estimates of P(Class is Red|X) : 0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75. There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in this chapter. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches?

  • Majority vote: 6 of the 10 estimates exceed 0.5 (red) versus 4 below 0.5 (green), so the final classification is red.
  • Average probability: P(Class is Red|X) = 4.5/10 = 0.45 < 0.5, so the final classification is green (see the check below).
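
A short check of both rules in R (the vector holds the ten probabilities given in the question):

probs = c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
sum(probs > 0.5)   # 6 of the 10 trees vote red -> majority vote: red
mean(probs)        # 0.45 < 0.5               -> average probability: green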

 

6. Provide a detailed explanation of the algorithm that is used to fit a regression tree.

    1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
    2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of \(\alpha\).
    3. Use K-fold cross-validation to choose \(\alpha\). That is, divide the training observations into K folds. For each k = 1, …, K:
       a. Repeat Steps 1 and 2 on all but the kth fold of the training data.
       b. Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of \(\alpha\).
       Average the results for each value of \(\alpha\), and pick \(\alpha\) to minimize the average error.
    4. Return the subtree from Step 2 that corresponds to the chosen value of \(\alpha\) (see the sketch below).
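
A minimal sketch of these steps in R, assuming the tree package and the Boston data from MASS purely as an example; cv.tree() performs the cost-complexity pruning and the K-fold cross-validation internally.

library(tree)
library(MASS)
set.seed(1)
train = sample(1:nrow(Boston), nrow(Boston) / 2)
# Step 1: grow a large tree (a small mindev keeps splitting until nodes are small)
big.tree = tree(medv ~ ., data = Boston, subset = train,
                control = tree.control(nobs = length(train), mindev = 0.001))
# Steps 2 and 3: cost-complexity pruning evaluated by 10-fold cross-validation
cv.out = cv.tree(big.tree, K = 10)
best.size = cv.out$size[which.min(cv.out$dev)]
# Step 4: return the subtree corresponding to the chosen size (equivalently, alpha)
pruned.tree = prune.tree(big.tree, best = best.size)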

 

Applied

 

7. In the lab, we applied random forests to the Boston data using mtry=6 and using ntree=25 and ntree=500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained.

library(MASS)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
set.seed(1101)
train = sample(1:nrow(Boston), nrow(Boston)/2)
X.train <- Boston[train, -14]
X.test <- Boston[-train, -14]
Y.train <- Boston[train, 14]
Y.test <- Boston[-train, 14]

p = dim(Boston)[2] - 1
p_2 = p/2
p.sqrt = sqrt(p)

forest.p = randomForest(X.train, Y.train, xtest=X.test, ytest=Y.test, mtry=p, ntree=500)
forest.p2 = randomForest(X.train, Y.train, xtest=X.test, ytest=Y.test, mtry=p_2, ntree=500)
forest.sqrt.p = randomForest(X.train, Y.train, xtest=X.test, ytest=Y.test, mtry=p.sqrt, ntree=500)

plot(1:500, forest.p$test$mse, col="green", type="l", xlab="Number of Trees", ylab="Test MSE", ylim=c(8, 19))
lines(1:500, forest.p2$test$mse, col="red", type="l")
lines(1:500, forest.sqrt.p$test$mse, col="blue", type="l")
legend("topright", c("m=p", "m=p/2", "m=sqrt(p)"), col=c("green", "red", "blue"), cex=1, lty=1)

  • The test MSE decreases rapidly as more trees are grown and flattens out after roughly 100 trees.
  • The model using all predictors (m = p, i.e. bagging) has a noticeably higher test MSE than the other two.
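
For an even more comprehensive picture, one could sweep every value of mtry from 1 to p, as sketched below; this reuses the objects defined above and takes a few minutes to run.

test.mse = matrix(NA, nrow = 500, ncol = p)
for (m in 1:p) {
  fit = randomForest(X.train, Y.train, xtest = X.test, ytest = Y.test,
                     mtry = m, ntree = 500)
  test.mse[, m] = fit$test$mse   # test MSE after 1, ..., 500 trees
}
matplot(1:500, test.mse, type = "l", lty = 1, col = rainbow(p),
        xlab = "Number of Trees", ylab = "Test MSE")
legend("topright", legend = paste0("m=", 1:p), col = rainbow(p), lty = 1, cex = 0.7)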

 

9. This problem involves the OJ data set which is part of the ISLR package.

  (a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
library(ISLR)
train <- sample(1:nrow(OJ), 800)
OJ.train = OJ[train, ]
OJ.test = OJ[-train, ]
  (b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
library(tree)
tree.oj = tree(Purchase~., data=OJ.train)
summary(tree.oj)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ.train)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "STORE"     "PriceDiff"
## Number of terminal nodes:  7 
## Residual mean deviance:  0.727 = 576.5 / 793 
## Misclassification error rate: 0.165 = 132 / 800
  • Only three variables (LoyalCH, STORE, PriceDiff) were used to construct the tree.
  • The tree has 7 terminal nodes.
  • The training error rate is 0.165.
  (c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
tree.oj
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1068.00 CH ( 0.61250 0.38750 )  
##    2) LoyalCH < 0.508643 356  420.90 MM ( 0.27809 0.72191 )  
##      4) LoyalCH < 0.279374 170  123.20 MM ( 0.11765 0.88235 )  
##        8) STORE < 1.5 50   57.31 MM ( 0.26000 0.74000 ) *
##        9) STORE > 1.5 120   53.37 MM ( 0.05833 0.94167 ) *
##      5) LoyalCH > 0.279374 186  253.60 MM ( 0.42473 0.57527 )  
##       10) PriceDiff < 0.05 80   80.06 MM ( 0.20000 0.80000 ) *
##       11) PriceDiff > 0.05 106  143.20 CH ( 0.59434 0.40566 ) *
##    3) LoyalCH > 0.508643 444  324.70 CH ( 0.88063 0.11937 )  
##      6) LoyalCH < 0.705699 142  166.90 CH ( 0.72535 0.27465 )  
##       12) PriceDiff < 0.265 81  111.70 CH ( 0.54321 0.45679 ) *
##       13) PriceDiff > 0.265 61   17.60 CH ( 0.96721 0.03279 ) *
##      7) LoyalCH > 0.705699 302  113.30 CH ( 0.95364 0.04636 ) *
  • Node 8 is a terminal node, as indicated by the asterisk.
  • Split criterion: STORE < 1.5.
  • Number of observations in this node: 50.
  • Deviance: 57.31.
  • 74% of the observations in this node buy Minute Maid (MM), so the node predicts MM.
  (d) Create a plot of the tree, and interpret the results.
plot(tree.oj)
text(tree.oj)

  • LoyalCH is the most important predictor: the top splits of the tree are all on LoyalCH.
  • Two of the splits do not change the predicted class; they serve only to increase node purity.
  (e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
tree.pred = predict(tree.oj, OJ.test, type = "class")
table(tree.pred, OJ.test$Purchase)
##          
## tree.pred  CH  MM
##        CH 149  46
##        MM  14  61
1-(149+61)/270
## [1] 0.2222222
  (f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.
cv.oj <- cv.tree(tree.oj, FUN = prune.misclass)
cv.oj
## $size
## [1] 7 4 2 1
## 
## $dev
## [1] 151 151 159 310
## 
## $k
## [1] -Inf    0   10  158
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
  (g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(cv.oj$size, cv.oj$dev, type = "b", xlab = "Tree size", ylab = "CV classification errors")

  (h) Which tree size corresponds to the lowest cross-validated classification error rate?
  • Tree sizes of 4 and 7 give the same, lowest cross-validation error (151 misclassifications); the smallest such tree has 4 terminal nodes.
  (i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
prune.oj = prune.misclass(tree.oj, best=4)
plot(prune.oj)
text(prune.oj, pretty = 0)

  (j) Compare the training error rates between the pruned and unpruned trees. Which is higher?
summary(tree.oj)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ.train)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "STORE"     "PriceDiff"
## Number of terminal nodes:  7 
## Residual mean deviance:  0.727 = 576.5 / 793 
## Misclassification error rate: 0.165 = 132 / 800
summary(prune.oj)
## 
## Classification tree:
## snip.tree(tree = tree.oj, nodes = 3:4)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff"
## Number of terminal nodes:  4 
## Residual mean deviance:  0.8431 = 671.1 / 796 
## Misclassification error rate: 0.165 = 132 / 800
  • The training misclassification error rate is 0.165 for both the pruned and unpruned trees, although the pruned tree has a higher residual mean deviance (0.843 vs. 0.727).
  (k) Compare the test error rates between the pruned and unpruned trees. Which is higher?
prune.pred <- predict(prune.oj, OJ.test, type = "class")
table(prune.pred, OJ.test$Purchase)
##           
## prune.pred  CH  MM
##         CH 149  46
##         MM  14  61
1-(149+61)/270
## [1] 0.2222222
  • The test error rates are identical (0.222): the subtrees removed by pruning had leaves that all predicted the same class as their parent node, so the test predictions do not change.

 

10. We now use boosting to predict Salary in the Hitters data set.

  (a) Remove the observations for whom the salary information is unknown, and then log transform the salaries.
library(ISLR)
Hitters = Hitters[-which(is.na(Hitters$Salary)), ]
Hitters$Salary = log(Hitters$Salary)
  (b) Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.
train = 1:200
Hitters.train = Hitters[train, ]
Hitters.test = Hitters[-train, ]
  (c) Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter \(\lambda\). Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis.
library(gbm)
## Loaded gbm 2.1.5
set.seed(42)
pows = seq(-10, -0.2, by=0.1)
lambdas = 10 ^ pows
train.errors = rep(NA, length(lambdas))
test.errors = rep(NA, length(lambdas))
for(i in 1:length(lambdas)){
  boost.hitters = gbm(Salary~., data=Hitters.train, distribution="gaussian", n.trees=1000, shrinkage=lambdas[i]) 
  train.pred = predict(boost.hitters, Hitters.train, n.trees=1000)
  test.pred = predict(boost.hitters, Hitters.test, n.trees=1000)
  train.errors[i] = mean((Hitters.train$Salary - train.pred)^2)
  test.errors[i] = mean((Hitters.test$Salary - test.pred)^2)
}

plot(lambdas, train.errors, type="b", xlab="Shrinkage", ylab="Train MSE", col="red", pch=20)

  (d) Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.
plot(lambdas, test.errors, type="b", xlab="Shrinkage", ylab="Test MSE", col="red", pch=20)

min(test.errors)
## [1] 0.2583995
lambdas[which.min(test.errors)]
## [1] 0.06309573
  (e) Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 3.0-2
set.seed(42)
lm.fit = lm(Salary~., data=Hitters.train)
lm.pred = predict(lm.fit, Hitters.test)
mean((Hitters.test$Salary - lm.pred)^2)
## [1] 0.4917959
x = model.matrix(Salary~., data=Hitters.train)
y = Hitters.train$Salary
x.test = model.matrix(Salary~., data=Hitters.test)
lasso.fit = glmnet(x, y, alpha=1)
lasso.pred = predict(lasso.fit, s=0.01, newx=x.test)
mean((Hitters.test$Salary - lasso.pred)^2)
## [1] 0.4700537
  • Boosting gives a much lower test MSE (0.258) than linear regression (0.492) or the lasso (0.470).
  (f) Which variables appear to be the most important predictors in the boosted model?
boost.best = gbm(Salary~., data=Hitters.train, distribution="gaussian", n.trees=1000, shrinkage=lambdas[which.min(test.errors)])
summary(boost.best)

##                 var    rel.inf
## CAtBat       CAtBat 14.8197925
## CRBI           CRBI 12.1186597
## CWalks       CWalks  9.7998044
## CHits         CHits  8.6560652
## PutOuts     PutOuts  7.2138695
## Walks         Walks  6.8368300
## Years         Years  6.4412453
## CHmRun       CHmRun  5.4742375
## Assists     Assists  4.3071056
## Hits           Hits  3.7625714
## Runs           Runs  3.6334822
## RBI             RBI  3.5479456
## HmRun         HmRun  3.3512231
## CRuns         CRuns  3.2227444
## AtBat         AtBat  3.0281593
## Errors       Errors  2.3590083
## Division   Division  0.6475954
## NewLeague NewLeague  0.6438269
## League       League  0.1358334
  • CAtBat, CRBI, and CWalks are the most important variables.
  (g) Now apply bagging to the training set. What is the test set MSE for this approach?
library(randomForest)
set.seed(42)
rf.hitters = randomForest(Salary~., data=Hitters.train, ntree=500, mtry=19)  # mtry = 19 = number of predictors, i.e. bagging
rf.pred = predict(rf.hitters, Hitters.test)
mean((Hitters.test$Salary - rf.pred)^2)
## [1] 0.2265061
  • Bagging gives a slightly lower test MSE (0.227) than boosting (0.258).

 

11. This question uses the Caravan data set.

  (a) Create a training set consisting of the first 1,000 observations, and a test set consisting of the remaining observations.
library(ISLR)
train = 1:1000
Caravan$Purchase = ifelse(Caravan$Purchase == "Yes", 1, 0)
Caravan.train = Caravan[train, ]
Caravan.test = Caravan[-train, ]
  (b) Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important?
library(gbm)
set.seed(42)
boost.caravan = gbm(Purchase~., data=Caravan.train, n.trees = 1000, shrinkage = 0.01, distribution = "bernoulli")
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
## variable 50: PVRAAUT has no variation.
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
## variable 71: AVRAAUT has no variation.
summary(boost.caravan)

##               var     rel.inf
## PPERSAUT PPERSAUT 15.24304059
## MKOOPKLA MKOOPKLA 10.22049754
## MOPLHOOG MOPLHOOG  7.58473391
## MBERMIDD MBERMIDD  5.98365038
## PBRAND     PBRAND  4.55749118
## ABRAND     ABRAND  4.07601736
## MINK3045 MINK3045  4.03149141
## MGODGE     MGODGE  3.50618597
## MOSTYPE   MOSTYPE  2.82332650
## MAUT2       MAUT2  2.61711991
## MSKC         MSKC  2.21439111
## MAUT1       MAUT1  2.13764619
## MBERARBG MBERARBG  2.01645301
## MSKA         MSKA  1.98039700
## MGODPR     MGODPR  1.94608284
## PWAPART   PWAPART  1.90065796
## MGODOV     MGODOV  1.81046263
## MRELGE     MRELGE  1.65327955
## MINK7512 MINK7512  1.54952273
## MBERHOOG MBERHOOG  1.54562792
## MINKGEM   MINKGEM  1.51129086
## PBYSTAND PBYSTAND  1.46885445
## MSKB1       MSKB1  1.46349386
## MFWEKIND MFWEKIND  1.24890126
## MINKM30   MINKM30  1.01877067
## APERSAUT APERSAUT  1.00947264
## MSKD         MSKD  0.94342296
## MOSHOOFD MOSHOOFD  0.92805596
## MFGEKIND MFGEKIND  0.92012209
## MAUT0       MAUT0  0.91661495
## MGODRK     MGODRK  0.90097295
## MOPLMIDD MOPLMIDD  0.80067001
## MRELOV     MRELOV  0.79866885
## MHHUUR     MHHUUR  0.66251044
## MBERBOER MBERBOER  0.61490907
## MBERARBO MBERARBO  0.59493791
## PMOTSCO   PMOTSCO  0.59140712
## MSKB2       MSKB2  0.49738895
## PLEVEN     PLEVEN  0.47656908
## MINK4575 MINK4575  0.44932585
## MZPART     MZPART  0.43764686
## MHKOOP     MHKOOP  0.41454005
## MGEMOMV   MGEMOMV  0.41336506
## MBERZELF MBERZELF  0.38071586
## MZFONDS   MZFONDS  0.31613260
## MINK123M MINK123M  0.30173680
## MGEMLEEF MGEMLEEF  0.26253060
## MOPLLAAG MOPLLAAG  0.16165917
## MFALLEEN MFALLEEN  0.09723737
## MAANTHUI MAANTHUI  0.00000000
## MRELSA     MRELSA  0.00000000
## PWABEDR   PWABEDR  0.00000000
## PWALAND   PWALAND  0.00000000
## PBESAUT   PBESAUT  0.00000000
## PVRAAUT   PVRAAUT  0.00000000
## PAANHANG PAANHANG  0.00000000
## PTRACTOR PTRACTOR  0.00000000
## PWERKT     PWERKT  0.00000000
## PBROM       PBROM  0.00000000
## PPERSONG PPERSONG  0.00000000
## PGEZONG   PGEZONG  0.00000000
## PWAOREG   PWAOREG  0.00000000
## PZEILPL   PZEILPL  0.00000000
## PPLEZIER PPLEZIER  0.00000000
## PFIETS     PFIETS  0.00000000
## PINBOED   PINBOED  0.00000000
## AWAPART   AWAPART  0.00000000
## AWABEDR   AWABEDR  0.00000000
## AWALAND   AWALAND  0.00000000
## ABESAUT   ABESAUT  0.00000000
## AMOTSCO   AMOTSCO  0.00000000
## AVRAAUT   AVRAAUT  0.00000000
## AAANHANG AAANHANG  0.00000000
## ATRACTOR ATRACTOR  0.00000000
## AWERKT     AWERKT  0.00000000
## ABROM       ABROM  0.00000000
## ALEVEN     ALEVEN  0.00000000
## APERSONG APERSONG  0.00000000
## AGEZONG   AGEZONG  0.00000000
## AWAOREG   AWAOREG  0.00000000
## AZEILPL   AZEILPL  0.00000000
## APLEZIER APLEZIER  0.00000000
## AFIETS     AFIETS  0.00000000
## AINBOED   AINBOED  0.00000000
## ABYSTAND ABYSTAND  0.00000000
  • PPERSAUT and MKOOPKLA are the most important variables.
  (c) Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated probability of purchase is greater than 20 %. Form a confusion matrix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtained from applying KNN or logistic regression to this data set?
boost.prob = predict(boost.caravan, Caravan.test, n.trees = 1000, type = "response")
boost.pred = ifelse(boost.prob > 0.2, 1, 0)
table(Caravan.test$Purchase, boost.pred)
##    boost.pred
##        0    1
##   0 4415  118
##   1  257   32
32/(118+32)
## [1] 0.2133333

About 21% (32 of 150) of the people predicted to make a purchase actually make one.

lm.caravan = glm(Purchase~., data=Caravan.train, family=binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
lm.prob = predict(lm.caravan, Caravan.test, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
lm.pred = ifelse(lm.prob > 0.2, 1, 0)
table(Caravan.test$Purchase, lm.pred)
##    lm.pred
##        0    1
##   0 4183  350
##   1  231   58
58/(350 + 58)
## [1] 0.1421569

About 14% (58 of 408) of the people predicted to make a purchase by logistic regression actually make one, so boosting does noticeably better by this measure. A KNN comparison is sketched below.
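
For the KNN comparison, a sketch along the lines of the Chapter 4 lab (predictors standardized; k = 5 is an arbitrary choice here):

library(class)
standardized.X = scale(Caravan[, -86])   # column 86 is Purchase
set.seed(42)
knn.pred = knn(standardized.X[train, ], standardized.X[-train, ],
               Caravan.train$Purchase, k = 5)
table(Caravan.test$Purchase, knn.pred)
# fraction of predicted purchasers who actually purchase
sum(knn.pred == 1 & Caravan.test$Purchase == 1) / sum(knn.pred == 1)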

 

12. Apply boosting, bagging, and random forests to a data set of your choice. Be sure to fit models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods like linear or logistic regression? Which of these approaches yields the best performance?

We will use the candy data set from FiveThirtyEight.

library(gbm)
library(randomForest)
candy_df <- read.csv(file="candy-data.csv", header=TRUE, sep=",")
set.seed(1)
train =  sample(nrow(candy_df), 2/3 * nrow(candy_df))
candy.train = candy_df[train, -1]
candy.test = candy_df[-train, -1]
Boosting
pows = seq(-10, -0.2, by=0.1)
lambdas = 10 ^ pows
train.errors = rep(NA, length(lambdas))
test.errors = rep(NA, length(lambdas))
for(i in 1:length(lambdas)){
  # note: 5,000 trees are grown here, but only the first 1,000 are used for prediction below
  boost.candy = gbm(winpercent~., data=candy.train, distribution="gaussian", n.trees=5000, shrinkage=lambdas[i])
  train.pred = predict(boost.candy, candy.train, n.trees=1000)
  test.pred = predict(boost.candy, candy.test, n.trees=1000)
  train.errors[i] = mean((candy.train$winpercent - train.pred)^2)
  test.errors[i] = mean((candy.test$winpercent - test.pred)^2)
}
plot(lambdas, test.errors, type="b", xlab="Shrinkage", ylab="Test MSE", col="red", pch=20, ylim = c(0, 190))

min(test.errors)
## [1] 86.46589
Bagging
bag.candy = randomForest(winpercent~., data=candy.train, mtry=11)  # mtry = 11 = all predictors, i.e. bagging
bag.pred = predict(bag.candy, candy.test)
mean((candy.test$winpercent - bag.pred)^2)
## [1] 106.7529
Random Forest
oob.err=double(11)
test.err=double(11)
for(mtry in 1:11){
  rf.candy = randomForest(winpercent ~ ., data = candy.train, mtry = mtry, ntree = 500, 
      importance = T)
  oob.err[mtry]=rf.candy$mse[500]
  rf.pred = predict(rf.candy, candy.test)
  test.err[mtry] = mean((candy.test$winpercent - rf.pred)^2)
  cat(mtry," ")
}
## 1  2  3  4  5  6  7  8  9  10  11
matplot(1:mtry, test.err, pch = 19, col = "red", type = "b", xlab = "mtry", ylab = "Test MSE")

importance(rf.candy)
##                      %IncMSE IncNodePurity
## chocolate        22.87750164    4129.02298
## fruity            3.11873475     222.88931
## caramel           5.22605566     364.68501
## peanutyalmondy    5.15124872     524.00309
## nougat           -4.45493734      69.11863
## crispedricewafer  0.07805355     214.07156
## hard             -1.75614048      59.66623
## bar               6.55586400     299.69649
## pluribus         -0.42883209     244.70892
## sugarpercent     -5.25807255    2089.29218
## pricepercent      9.71411694    2545.07728
test.err[which.min(test.err)]
## [1] 104.1517
Linear Regression & Lasso
library(glmnet)
set.seed(1)
lm.fit = lm(winpercent~. -sugarpercent -pricepercent -pluribus, data=candy.train)
lm.pred = predict(lm.fit, candy.test)
summary(lm.fit)
## 
## Call:
## lm(formula = winpercent ~ . - sugarpercent - pricepercent - pluribus, 
##     data = candy.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.257  -4.572   1.335   6.188  22.197 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        33.507      4.880   6.867  1.3e-08 ***
## chocolate          22.472      5.446   4.126 0.000149 ***
## fruity             11.334      5.201   2.179 0.034377 *  
## caramel             7.360      4.262   1.727 0.090776 .  
## peanutyalmondy     11.397      4.057   2.809 0.007219 ** 
## nougat              2.044      5.817   0.351 0.726817    
## crispedricewafer    6.055      5.715   1.060 0.294781    
## hard               -1.667      4.553  -0.366 0.716004    
## bar                -3.851      4.803  -0.802 0.426778    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.73 on 47 degrees of freedom
## Multiple R-squared:  0.5467, Adjusted R-squared:  0.4696 
## F-statistic: 7.087 on 8 and 47 DF,  p-value: 4.197e-06
mean((candy.test$winpercent - lm.pred)^2)
## [1] 133.2194
x = model.matrix(winpercent~., data=candy.train)
y = candy.train$winpercent
x.test = model.matrix(winpercent~., data=candy.test)
lasso.fit = glmnet(x, y, alpha=1)
lasso.pred = predict(lasso.fit, s=0.01, newx=x.test)
mean((candy.test$winpercent - lasso.pred)^2)
## [1] 123.7198

  • Boosting yields the lowest test MSE (about 86.5), followed by random forests (104.2 at the best mtry), bagging (106.8), the lasso (123.7), and linear regression (133.2), so the tree ensembles, and boosting in particular, perform best on this data set.