8.1. Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,  importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1

Did the random forest model significantly use the uninformative predictors

The random forest model did not significantly use the uninformative predictors. The importance scores for the uninformative predictors are all 0. (V6 – V10)? (b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)

## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change?

model2 <- randomForest(y ~ ., data = simulated,  importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2

The importance score for V1 did not change.

What happens when you add another predictor that is also highly correlated with V1?

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)

## [1] 0.9408631

model3 <- randomForest(y ~ ., data = simulated,  importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3

When we add another predictor that is highly correlated with V1, the importance score for the other predictors decrease.

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

model4 <- cforest(y ~ ., data = simulated[, c(1:11)])
cfImp1 <- varimp(model4, conditional = FALSE)
cfImp1

##         V1         V2         V3         V4         V5         V6         V7 
##  8.3013758  6.3849498  0.2021611  7.3074416  2.1097038 -0.1612203  0.1021704 
##         V8         V9        V10 
## -0.1382315 -0.1127212 -0.1345254

The importances show the same pattern as the traditional random forest model.

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

#Cubist Model

set.seed(100)
cubist <- train(y ~ ., data = simulated[, c(1:11)], method = "cubist")
rfImp6 <- varImp(cubist$finalModel, scale = FALSE)

rfImp6

This model shows the same pattern as the traditional random forest model.

#Boosted Trees

library(gbm)

## Warning: package 'gbm' was built under R version 4.3.3

## Loaded gbm 2.2.2

## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3

gbm_tuning_grid <- expand.grid(
  interaction.depth = seq(1, 7, by = 2),
  n.trees = seq(100, 1000, by = 50),
  shrinkage = c(0.01, 0.1),
  n.minobsinnode = 10
)

set.seed(592)

gbm_model <- train(
  y ~ ., 
  data = simulated[, 1:11],
  method = "gbm",
  tuneGrid = gbm_tuning_grid,
  verbose = FALSE
)

rfImp5 <- varImp(gbm_model$finalModel, scale = FALSE)

rfImp5

They show the same pattern as the traditional random forest model.

8.2. Use a simulation to show tree bias with different granularities.

set.seed(592)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y" 


rpart_tree <- rpart(y ~ ., data = simulated)
rpart_importance <- rpart_tree$variable.importance  # Extract variable importance
rpart_importance

##        V4        V2        V1        V5        V6        V3        V9        V8 
## 1612.7016 1149.6641 1107.7297  486.8192  423.1049  379.0755  243.3718  223.2010 
##       V10        V7 
##  146.0495  124.6534

plot(as.party(rpart_tree), gp = gpar(fontsize = 7))

In this case, the tree bias is shown with different granularities. The variable importance is shown in the plot.

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

TThe bagging fraction determines how much training data is used per iteration, and the learning rate controls how quickly the model learns. A smaller learning rate and bagging fraction lead to slower, more accurate learning, reducing overfitting. A higher learning rate and bagging fraction can cause overfitting by focusing too much on a few predictors.

Which model do you think would be more predictive of other samples?

The model on the left generalizes better with more iterations, reducing the weight of each predictor and improving accuracy on other samples.So I think that the model on the left would be more predictive of other samples.

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Interaction depth refers to the number of splits or maximum nodes in a tree. As interaction depth increases, the importance of predictors grows, allowing smaller but important predictors to have a greater impact. This makes the slope steeper or increases it.

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

set.seed(100)

data(ChemicalManufacturingProcess)

# imputation
miss <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(miss, ChemicalManufacturingProcess)

# filtering low frequencies
Chemical <- Chemical[, -nearZeroVar(Chemical)]

set.seed(624)

# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)

# train 
train_x <- Chemical[index, -1]
train_y <- Chemical[index, 1]

# test
test_x <- Chemical[-index, -1]
test_y <- Chemical[-index, 1]

Which tree-based regression model gives the optimal resampling and test set performance?

Random Forest Model

set.seed(592)

rfTune <- randomForest(train_x, train_y, 
                        importance = TRUE,
                        ntree = 1000)

rfpred <- predict(rfTune, test_x)

postResample(rfpred, test_y)

##      RMSE  Rsquared       MAE 
## 1.2685304 0.7169606 0.8994923

#Bagged tree Model

set.seed(592)

baggedTree <- ipredbagg(train_y, train_x)
 
baggedPred <- predict(baggedTree, test_x)

postResample(baggedPred, test_y)

##      RMSE  Rsquared       MAE 
## 1.3201044 0.6712407 0.9782215

Boosted Tree Model

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)

set.seed(592)

gbmTune <- train(train_x, train_y,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

gbmPred <- predict(gbmTune, test_x)

postResample(gbmPred, test_y)

##      RMSE  Rsquared       MAE 
## 1.1779112 0.7254734 0.8715403

Cubist Model

set.seed(592)

cubistTuned <- train(train_x, train_y, 
                     method = "cubist")

cubistPred <- predict(cubistTuned, test_x)

postResample(cubistPred, test_y)

##      RMSE  Rsquared       MAE 
## 0.9804618 0.7964499 0.7363520

CART Model

set.seed(592)

cartTune <- train(train_x, train_y,
                  method = "rpart",
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

cartPred <- predict(cartTune, test_x)

postResample(cartPred, test_y)

##      RMSE  Rsquared       MAE 
## 1.5225133 0.5105273 1.1917615

The lowest RMSE is the Cubist model

rbind(rf = postResample(rfpred, test_y),
      bagged = postResample(baggedPred, test_y),
      gbm = postResample(gbmPred, test_y),
      cubist = postResample(cubistPred, test_y),
      cart = postResample(cartPred, test_y))

##             RMSE  Rsquared       MAE
## rf     1.2685304 0.7169606 0.8994923
## bagged 1.3201044 0.6712407 0.9782215
## gbm    1.1779112 0.7254734 0.8715403
## cubist 0.9804618 0.7964499 0.7363520
## cart   1.5225133 0.5105273 1.1917615

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

cubistImp <- varImp(cubistTuned$finalModel, scale = FALSE)
cubistImp

plot(varImp(cubistTuned), top = 10)

The manufacturing process variables dominate the list of important predictors.

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

rpartTree <- rpart(Yield ~ ., data = Chemical[index, ])

plot(as.party(rpartTree), gp = gpar(fontsize = 7))

Data HW 7

Mikhail Broomes

2024-10-30

Random Forest Model

Boosted Tree Model

Cubist Model

CART Model