(b) Fit a regression tree to the training set. Plot the tree, and
interpret the results. What test MSE do you obtain?
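The code below reuses the training index train and the test set Carseats.test from part (a), which are not shown in this section. A minimal sketch of that setup (the seed and the 50/50 split are assumptions, not the original code) would be:

library(ISLR2)        # Carseats data
library(tree)         # tree(), cv.tree(), prune.tree()
library(randomForest) # randomForest(), importance(), varImpPlot()
library(BART)         # gbart()

set.seed(1)
train <- sample(1:nrow(Carseats), nrow(Carseats) / 2)
Carseats.test <- Carseats[-train, ]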
tree.carseats <- tree(Sales ~ ., Carseats, subset = train)
plot(tree.carseats)
text(tree.carseats, pretty = 0)

The important predictors of Sales appear to be shelving location and price: the better the shelving location and the lower the price, the higher the sales. The first split separates cheaper car seats (Price < 129.5) from more expensive ones, and the next split separates good shelving locations from bad and medium ones.
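A compact text summary of the fitted tree (number of terminal nodes, variables actually used, residual mean deviance) can also be obtained with summary(); the exact output depends on the split:

summary(tree.carseats)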
pred <- predict(tree.carseats, newdata = Carseats.test)
test_mse <- mean((Carseats.test$Sales - pred)^2)
print(paste("Test MSE:", round(test_mse, 3)))
## [1] "Test MSE: 4.472"
(c) Use cross-validation in order to determine the optimal level of
tree complexity. Does pruning the tree improve the test MSE?
cv.carseats <- cv.tree(tree.carseats)
plot(cv.carseats$size, cv.carseats$dev, type = "b")

In this case, cross-validation selects the most complex tree, although a much smaller tree (size of about 3) could also be justified by the elbow rule.
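The tree size minimizing the cross-validated deviance can also be read off programmatically from the cv.tree() output above:

# size with the lowest cross-validated deviance
cv.carseats$size[which.min(cv.carseats$dev)]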
prune1.carseats <- prune.tree(tree.carseats, best = 3)
prune2.carseats <- prune.tree(tree.carseats, best = 14)
plot(prune1.carseats)
text(prune1.carseats, pretty = 0)

plot(prune2.carseats)
text(prune2.carseats, pretty = 0)

yhat1 <- predict(prune1.carseats, newdata = Carseats[-train, ])
yhat2 <- predict(prune2.carseats, newdata = Carseats[-train, ])
carseats.test <- Carseats[-train, "Sales"]
mean((yhat1 - carseats.test)^2)
## [1] 6.555128
mean((yhat2 - carseats.test)^2)
## [1] 4.471569
Pruning the tree did not improve the test MSE: the heavily pruned tree (3 terminal nodes) gives 6.56, while the 14-node tree matches the unpruned test MSE of about 4.47.
(d) Use the bagging approach in order to analyze this data. What
test MSE do you obtain? Use the importance() function to
determine which variables are most important.
set.seed(1)
bag.carseats <- randomForest(Sales ~ ., data = Carseats,
subset = train, mtry = 10, importance = TRUE) # mtry = 10 = p, i.e. bagging
yhat.bag <- predict(bag.carseats, newdata = Carseats.test)
plot(yhat.bag, carseats.test)
abline(0, 1)

# MSE
mean((yhat.bag - carseats.test)^2)
## [1] 2.555523
# Important variables
importance(bag.carseats)
## %IncMSE IncNodePurity
## CompPrice 25.64599599 213.096278
## Income 5.67740403 76.385083
## Advertising 12.22093633 104.986108
## Population 0.59715138 61.294727
## Price 56.50749020 535.237246
## ShelveLoc 41.05022443 283.698935
## Age 9.11485795 118.103039
## Education -0.05758758 39.442533
## Urban 0.79459812 8.828001
## US 1.45384950 7.611851
Price and ShelveLoc are the most important variables, with CompPrice also relatively important.
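The same importance measures can be displayed graphically with varImpPlot() from the randomForest package, using the bag.carseats fit above:

varImpPlot(bag.carseats)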
(e) Use random forests to analyze this data. What test MSE do you
obtain? Use the importance() function to determine which
variables are most important. Describe the effect of m, the number of
variables considered at each split, on the error rate obtained.
set.seed(1)
rf.carseats <- randomForest(Sales ~ ., data = Carseats,
subset = train, mtry = 3, importance = TRUE) # m = p^(1/2)
yhat.rf <- predict(rf.carseats, newdata = Carseats.test)
# MSE
mean((yhat.rf - carseats.test)^2)
## [1] 3.20635
# Important variables
importance(rf.carseats)
## %IncMSE IncNodePurity
## CompPrice 11.34376552 158.83906
## Income 6.40545807 121.23469
## Advertising 10.24191488 123.49672
## Population 0.09842459 98.17871
## Price 36.27025901 403.38433
## ShelveLoc 31.07819285 241.99669
## Age 6.77865583 145.64539
## Education 1.51515779 65.62168
## Urban 1.14914879 14.60700
## US 2.71845074 14.78064
The most important variables are still Price and ShelveLoc. As m increases, the test error decreases here: the random forest with m = 3 has a test MSE of 3.21, compared with 2.56 for bagging (m = 10).
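To examine the effect of m more systematically, one could refit the forest for each candidate value of m and record the test MSE. This is a sketch that reuses the train, Carseats.test, and carseats.test objects defined above; the exact numbers depend on the seed:

set.seed(1)
mse.by.m <- sapply(1:10, function(m) {
  fit <- randomForest(Sales ~ ., data = Carseats, subset = train, mtry = m)
  mean((predict(fit, newdata = Carseats.test) - carseats.test)^2)
})
plot(1:10, mse.by.m, type = "b",
     xlab = "m (variables tried at each split)", ylab = "Test MSE")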
(f) Now analyze the data using BART, and report your results.
x <- Carseats[, 2:11]    # predictors (all columns except Sales, column 1)
y <- Carseats[, "Sales"] # response
xtrain <- x[train, ]
ytrain <- y[train]
xtest <- x[-train, ]
ytest <- y[-train]
set.seed(1)
bartfit <- gbart(xtrain, ytrain, x.test = xtest)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 200, 14, 200
## y1,yn: 0.154000, -2.406000
## x1,x[n*p]: 140.000000, 1.000000
## xp1,xp[np*p]: 111.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 67 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.287616,3,0.212816,7.346
## *****sigma: 1.045244
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,14,0
## *****printevery: 100
##
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 2s
## trcnt,tecnt: 1000,1000
yhat.bart <- bartfit$yhat.test.mean
mean((ytest - yhat.bart)^2)
## [1] 1.372413
Using BART, the test MSE (1.37) is much lower than with the regression tree (4.47), the random forest (3.21), or bagging (2.56), roughly half the error of the best of those models.
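BART also records how many times each variable appears across the sampled trees, which gives a rough variable-importance ranking. Using the bartfit object above:

ord <- order(bartfit$varcount.mean, decreasing = TRUE)
bartfit$varcount.mean[ord]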