(b) Fit a regression tree to the training set. Plot the tree, and
interpret the results. What test MSE do you obtain?
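The code below reuses the training index train and the test set Carseats.test from part (a), which are not shown in this section. A minimal sketch of that setup (the seed and the 50/50 split are assumptions, not the original code) would be:

library(ISLR2)        # Carseats data
library(tree)         # tree(), cv.tree(), prune.tree()
library(randomForest) # randomForest(), importance(), varImpPlot()
library(BART)         # gbart()

set.seed(1)
train <- sample(1:nrow(Carseats), nrow(Carseats) / 2)
Carseats.test <- Carseats[-train, ]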
tree.carseats <- tree(Sales ~ ., Carseats, subset = train)
plot(tree.carseats)
text(tree.carseats, pretty = 0)

The important predictors of Sales appear to be shelving location and price: the better the shelving location and the lower the price, the higher the sales. The first split separates cheaper car seats (Price < 129.5) from more expensive ones, and the next split separates good shelving locations from bad and medium ones.
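A compact text summary of the fitted tree (number of terminal nodes, variables actually used, residual mean deviance) can also be obtained with summary(); the exact output depends on the split:

summary(tree.carseats)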
pred <- predict(tree.carseats, newdata = Carseats.test)
test_mse <- mean((Carseats.test$Sales - pred)^2)
print(paste("Test MSE:", round(test_mse, 3)))
## [1] "Test MSE: 4.472"
(c) Use cross-validation in order to determine the optimal level of
tree complexity. Does pruning the tree improve the test MSE?
cv.carseats <- cv.tree(tree.carseats)
plot(cv.carseats$size, cv.carseats$dev, type = "b")

In this case, cross-validation selects the most complex tree, although a much smaller tree (size of about 3) could also be justified by the elbow rule.
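The tree size minimizing the cross-validated deviance can also be read off programmatically from the cv.tree() output above:

# size with the lowest cross-validated deviance
cv.carseats$size[which.min(cv.carseats$dev)]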
prune1.carseats <- prune.tree(tree.carseats, best = 3)
prune2.carseats <- prune.tree(tree.carseats, best = 14)
plot(prune1.carseats)
text(prune1.carseats, pretty = 0)

plot(prune2.carseats)
text(prune2.carseats, pretty = 0)

yhat1 <- predict(prune1.carseats, newdata = Carseats[-train, ])
yhat2 <- predict(prune2.carseats, newdata = Carseats[-train, ])
carseats.test <- Carseats[-train, "Sales"]
mean((yhat1 - carseats.test)^2)
## [1] 6.555128
mean((yhat2 - carseats.test)^2)
## [1] 4.471569
Pruning the tree did not improve the test MSE: the heavily pruned tree (3 terminal nodes) gives 6.56, while the 14-node tree matches the unpruned test MSE of about 4.47.
(d) Use the bagging approach in order to analyze this data. What
test MSE do you obtain? Use the importance() function to
determine which variables are most important.
set.seed(1)
bag.carseats <- randomForest(Sales ~ ., data = Carseats,
subset = train, mtry = 10, importance = TRUE) # mtry = 10 = p, i.e. bagging
yhat.bag <- predict(bag.carseats, newdata = Carseats.test)
plot(yhat.bag, carseats.test)
abline(0, 1)

# MSE
mean((yhat.bag - carseats.test)^2)
## [1] 2.555523
# Important variables
importance(bag.carseats)
## %IncMSE IncNodePurity
## CompPrice 25.64599599 213.096278
## Income 5.67740403 76.385083
## Advertising 12.22093633 104.986108
## Population 0.59715138 61.294727
## Price 56.50749020 535.237246
## ShelveLoc 41.05022443 283.698935
## Age 9.11485795 118.103039
## Education -0.05758758 39.442533
## Urban 0.79459812 8.828001
## US 1.45384950 7.611851
Price and ShelveLoc are the most important variables, with CompPrice also relatively important.
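The same importance measures can be displayed graphically with varImpPlot() from the randomForest package, using the bag.carseats fit above:

varImpPlot(bag.carseats)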
(e) Use random forests to analyze this data. What test MSE do you
obtain? Use the importance() function to determine which
variables are most important. Describe the effect of m, the number of
variables considered at each split, on the error rate obtained.
set.seed(1)
rf.carseats <- randomForest(Sales ~ ., data = Carseats,
subset = train, mtry = 3, importance = TRUE) # m = p^(1/2)
yhat.rf <- predict(rf.carseats, newdata = Carseats.test)
# MSE
mean((yhat.rf - carseats.test)^2)
## [1] 3.20635
# Important variables
importance(rf.carseats)
## %IncMSE IncNodePurity
## CompPrice 11.34376552 158.83906
## Income 6.40545807 121.23469
## Advertising 10.24191488 123.49672
## Population 0.09842459 98.17871
## Price 36.27025901 403.38433
## ShelveLoc 31.07819285 241.99669
## Age 6.77865583 145.64539
## Education 1.51515779 65.62168
## Urban 1.14914879 14.60700
## US 2.71845074 14.78064
The most important variables are still Price and ShelveLoc. As m increases, the test error decreases here: the random forest with m = 3 has a test MSE of 3.21, compared with 2.56 for bagging (m = 10).
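To examine the effect of m more systematically, one could refit the forest for each candidate value of m and record the test MSE. This is a sketch that reuses the train, Carseats.test, and carseats.test objects defined above; the exact numbers depend on the seed:

set.seed(1)
mse.by.m <- sapply(1:10, function(m) {
  fit <- randomForest(Sales ~ ., data = Carseats, subset = train, mtry = m)
  mean((predict(fit, newdata = Carseats.test) - carseats.test)^2)
})
plot(1:10, mse.by.m, type = "b",
     xlab = "m (variables tried at each split)", ylab = "Test MSE")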
(f) Now analyze the data using BART, and report your results.
x <- Carseats[, 2:11]    # predictors (all columns except Sales, column 1)
y <- Carseats[, "Sales"] # response
xtrain <- x[train, ]
ytrain <- y[train]
xtest <- x[-train, ]
ytest <- y[-train]
set.seed(1)
bartfit <- gbart(xtrain, ytrain, x.test = xtest)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 200, 14, 200
## y1,yn: 0.154000, -2.406000
## x1,x[n*p]: 140.000000, 1.000000
## xp1,xp[np*p]: 111.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 67 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.287616,3,0.212816,7.346
## *****sigma: 1.045244
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,14,0
## *****printevery: 100
##
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 2s
## trcnt,tecnt: 1000,1000
yhat.bart <- bartfit$yhat.test.mean
mean((ytest - yhat.bart)^2)
## [1] 1.372413
Using BART, the test MSE (1.37) is much lower than with the regression tree (4.47), the random forest (3.21), or bagging (2.56), roughly half the error of the best of those models.
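BART also records how many times each variable appears across the sampled trees, which gives a rough variable-importance ranking. Using the bartfit object above:

ord <- order(bartfit$varcount.mean, decreasing = TRUE)
bartfit$varcount.mean[ord]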