Exercise 8.1

Recreate the simulated data from Exercise 7.2:

In the hidden code chunk below we create the simulated data.

# Check out packages
library(mlbench)
library(randomForest)
library(caret)
library(knitr)
library(party)
library(gbm)
library(Cubist)
library(rules)
library(rpart)
library(ggplot2)
library(AppliedPredictiveModeling)
library(rpart.plot)
library(rattle)
#library(partykit)

# Create simulated data
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"


Part a

Fit a random forest model to all of the predictors, then estimate the variable importance scores. Did the random forest model significantly use the uninformative predictors (V6 – V10)?

The variable importance scores are displayed below. The random forest model did not significantly use the uninformative predictors, V6 through V10.

# Fit a random forest model
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
kable(rfImp1, align="c")
Overall
V1 8.6274328
V2 6.2743724
V3 0.7230546
V4 7.5025858
V5 2.1357565
V6 0.1239500
V7 0.0292789
V8 -0.1172432
V9 -0.1034480
V10 0.0431256
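
As a quick visual check, randomForest's built-in importance plot (a minimal sketch; it shows both %IncMSE and IncNodePurity) displays the same separation between V1 through V5 and the uninformative V6 through V10:

# Sketch: built-in importance plot for the same random forest fit
varImpPlot(model1)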


Part b

Now add an additional predictor that is highly correlated with one of the informative predictors. Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

Most of the predictors kept roughly the same level of importance; however, V1's importance decreased by about 24%, and the new predictor duplicate1, which is highly correlated with V1, received substantial importance in the model. Adding a second predictor that is also highly correlated with V1 would be expected to dilute V1's importance even further; a sketch of this follows the output below.

# Add additional predictor correlated with V1
simulated2 <- simulated
set.seed(300)
simulated2$duplicate1 <- simulated2$V1 + rnorm(200) * .1

# Confirm they are highly correlated: 0.9528954
#cor(simulated2$duplicate1, simulated2$V1)

# Fit a random forest model
model2 <- randomForest(y ~ ., data = simulated2, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
kable(rfImp2, align="c")
Overall
V1 8.8573953
V2 6.4406713
V3 0.8226900
V4 7.7206880
V5 2.2082585
V6 0.0889597
V7 0.0649301
V8 -0.0247189
V9 -0.1349200
V10 -0.1739487
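
To address the second part of the question, here is a minimal sketch (not evaluated here; the seed is arbitrary) of adding another predictor that is also highly correlated with V1. We would expect V1's importance to be diluted further, split across V1, duplicate1, and duplicate2:

# Sketch: add a second predictor highly correlated with V1 and refit
simulated3 <- simulated2
set.seed(400)
simulated3$duplicate2 <- simulated3$V1 + rnorm(200) * .1

model2b <- randomForest(y ~ ., data = simulated3, importance = TRUE, ntree = 1000)
kable(varImp(model2b, scale = FALSE), align = "c")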


Part c

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

cforest builds a random forest in which splits are selected with statistical hypothesis tests rather than by maximizing an impurity or error-reduction measure (such as the Gini index for classification or variance reduction for regression). Because of this, it tends to produce more balanced trees that are less sensitive to irrelevant splits.

We see a similar pattern of importance values as before, except that the irrelevant predictors are selected less often for splits and V3 now has no significant importance in the model.

Note that the default for varimp (called here through caret's varImp wrapper) is to calculate each variable's importance without taking its correlation with the other predictors into account. Setting conditional equal to TRUE takes longer to compute, but it gives the relative importance of the variables without the inflation that correlated, redundant predictors can introduce.

# Fit a random forest model using conditional inference trees
set.seed(300)
model3 <- cforest(y ~ ., data = simulated)
rfImp3 <- varImp(model3, conditional=TRUE)
kable(rfImp3, align="c")
Overall
V1 6.1485270
V2 5.3241662
V3 0.0306096
V4 6.3559137
V5 1.2414177
V6 0.0005028
V7 0.0127525
V8 -0.0067162
V9 0.0062369
V10 -0.0309395
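
For comparison, here is a minimal sketch of the traditional (unconditional) measure on the same cforest fit, which we would expect to look more like the randomForest importances from Part a:

# Sketch: traditional (unconditional) importance for the same cforest fit
rfImp3_uncond <- varImp(model3, conditional = FALSE)
kable(rfImp3_uncond, align = "c")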


Part d

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

The same pattern occurs for both boosted trees and Cubist, except that the Cubist model identifies V3 as having no importance rather than low importance.


Boosted trees

Here we show the variable importance for the boosted trees model. All five informative predictors receive non-zero importance, although V3's contribution is small enough to be on the verge of insignificance.

# Fit a boosted trees model
set.seed(300)
model4 <- gbm(y ~ ., data = simulated, distribution = "gaussian")
rfImp4 <- summary(model4, plotit=FALSE)
rownames(rfImp4) <- NULL
kable(rfImp4, align="c")
var rel.inf
V4 30.5478789
V1 26.5715289
V2 23.6176964
V5 11.9753864
V3 6.9728920
V9 0.1605657
V7 0.1540518
V6 0.0000000
V8 0.0000000
V10 0.0000000


Cubist

Here we show the variable importance for the Cubist model. Four of the five informative predictors (V1, V2, V4, and V5) are used by the model, while V3 receives no importance.

The Cubist model fits a linear model at each leaf; using the alternative code commented out below, you can inspect the coefficients of the rules' combined linear model to get a sense of variable importance that way.

# Fit a cubist model
set.seed(300)
model5 <- cubist(simulated[, -11], simulated$y)
rfImp5 <- varImp(model5)
kable(rfImp5, align="c")
Overall
V1 50
V2 50
V4 50
V5 50
V3 0
V6 0
V7 0
V8 0
V9 0
V10 0
# Show the relative coefficients across the rules combined as a single linear model
#rule_df <- tidy(model5)
#rule_df$estimate[[1]]
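
Alternatively, calling summary() on the Cubist fit prints each rule together with the linear model fitted within it, which is another way to see which predictors the model actually uses (the output can be lengthy, so it is worth running interactively):

# Sketch: print the Cubist rules and the linear model fitted within each rule
summary(model5)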



Exercise 8.2

Use a simulation to show tree bias with different granularities.

Here, what we expect to see is that models fit with low granularity (a low maximum depth or a small number of leaves) underfit, giving high bias and low variance, while models fit with high granularity (a high maximum depth or a large number of leaves) overfit, giving low bias and high variance. Let's see how that plays out when we actually attempt it.

Fitting trees with different granularities

Here we show that as maxdepth increases from low to high granularity, the training mean squared error goes down, reflecting a reduction in bias as granularity increases. In practice we would want the lowest maxdepth that achieves the full reduction in bias without increasing variance.

If we were to use k-fold cross-validation, we would expect the cross-validated mean squared error to stop improving and eventually increase at high granularities, reflecting an increase in variance as maxdepth goes up. We'll leave that as a future exercise, though a sketch of how it could be done follows the plot below.

# Generate a simple simulated dataset
set.seed(300)
n <- 200
x <- runif(n, -10, 10)
y <- sin(x) + rnorm(n, sd = 0.5)
data <- data.frame(x = x, y = y)

# Function to fit decision trees with varying granularity and measure bias
results <- data.frame()
for (depth in 1:10) {
  model <- rpart(y ~ x, data = data, control = rpart.control(maxdepth = depth))
  predictions <- predict(model, data)
  mse <- mean((predictions - data$y)^2)
  results <- rbind(results, data.frame(Granularity = depth, MSE = mse))
}

# Plotting the results
ggplot(results, aes(x = Granularity, y = MSE)) +
  geom_line() +
  geom_point() +
  labs(title = "Tree Bias vs. Granularity",
       x = "Tree Granularity (Max Depth)",
       y = "Mean Squared Error (Bias)")




Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:


Part a

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

The model on the right shows a “winner-takes-all” effect. The high bagging fraction (0.9) means that 90% of the data is used in each boosting iteration, so there is little randomness from one iteration to the next; each tree sees nearly the same data and keeps reinforcing the same strong predictors, magnifying the importance of a few of them.

Also, since the model on the right has a high learning rate (0.9), it “learns quickly,” making larger updates at each iteration and converging rapidly on a few dominant predictors early in the boosting process.
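
As a rough sketch of how these two extremes could be reproduced on the solubility data (solTrainXtrans and solTrainY come with the AppliedPredictiveModeling package; the other gbm settings are left at their defaults here and are not the book's exact tuning):

# Sketch: fit the two extreme gbm models on the solubility data and compare importance
data(solubility)  # provides solTrainXtrans and solTrainY
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
gbm_low  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

head(summary(gbm_low,  plotit = FALSE), 10)   # expect importance spread across many predictors
head(summary(gbm_high, plotit = FALSE), 10)   # expect importance concentrated in a few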


Part b

Which model do you think would be more predictive of other samples?

The model on the left, with a bagging fraction and learning rate of 0.1, is more likely to generalize to other samples. Because each iteration uses only 10% of the data, the individual trees are more diverse and the ensemble is less likely to overfit; and because the learning rate is 0.1, the model builds up gradually and importance is spread across more predictors, which in turn should help it perform better on new data.

Remember though, these are just two extremes, and we’d have to tune the model to get the ideal bagging fraction and learning rate.


Part c

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

The interaction depth corresponds to the tree depth: in gbm it controls how many splits each tree is allowed to make.

I believe increasing the interaction depth would make the slope of predictor importance less steep, because deeper trees force a larger variety of predictors to be selected across the ensemble, spreading the importance metric across a wider array of predictors.

However, I'm not sure whether this flattening effect of increased tree depth would win out over the “winner-takes-all” effect of the high bagging fraction and high learning rate, which concentrates the importance among a few predictors.
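
One way to check, reusing the solTrain sketch from Part a (again only a sketch with illustrative, untuned settings), would be to refit the high bagging fraction and high learning rate model with a deeper interaction depth and see whether the importance profile flattens:

# Sketch: same high bagging fraction and learning rate, but deeper trees
set.seed(100)
gbm_high_deep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                     n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                     interaction.depth = 10)
head(summary(gbm_high_deep, plotit = FALSE), 10)  # does the importance spread out more?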




Exercise 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:


Set up data

Here, in the hidden code chunk below, we load, impute and split the data the same as before.

# Load the data
data("ChemicalManufacturingProcess")

# Impute missing values using kNN
imputeModel <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute"))
CMP <- predict(imputeModel, ChemicalManufacturingProcess)

# Split data
set.seed(522)
trainingRows <- createDataPartition(CMP[,1], p = .80, list=FALSE)
trainPredictors <- CMP[trainingRows,-1]
trainTarget <- CMP[trainingRows,1]
testPredictors <- CMP[-trainingRows,-1]
testTarget <- CMP[-trainingRows,1]


Random Forest

With random forest, we arrive at a cross-validated RMSE of 0.6166 and R-squared of 0.6607 on the training data, and an RMSE of 0.5484 and R-squared of 0.7927 on the test data.

Note, mtry is the number of predictors randomly available for splitting at each node. It can be as high as the total number of predictors we have. Reducing the number of predictors available for splitting at each node helps de-correlate the trees in the forest which should help the model generalize to new data.

# Train the random forest model
tuneGrid <- expand.grid(mtry = seq(1, 10, by = 1))
set.seed(175175328)
model6 <- train(x = trainPredictors,
                y = trainTarget,
                method = "rf",
                preProc = c("center", "scale"),
                tuneGrid = tuneGrid,
                trControl = trainControl(method = "cv", number = 10 ))
model6
## Random Forest 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 131, 130, 130, 130, 129, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##    1    0.7286591  0.5858724  0.5832262
##    2    0.6732224  0.6286727  0.5353257
##    3    0.6500265  0.6422395  0.5175240
##    4    0.6393521  0.6478105  0.5071868
##    5    0.6354561  0.6407565  0.5053535
##    6    0.6324021  0.6502140  0.5008222
##    7    0.6264827  0.6485984  0.4944392
##    8    0.6266584  0.6518903  0.4952487
##    9    0.6166086  0.6606518  0.4876063
##   10    0.6208711  0.6519127  0.4895918
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 9.
# Apply our random forest model to the test data
model6_pred <- predict(model6, newdata = testPredictors)
postResample(pred = model6_pred, obs = testTarget)
##      RMSE  Rsquared       MAE 
## 0.5483681 0.7926986 0.4398753
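
A quick way to see the effect of mtry is caret's plot method for train objects, which shows the resampled RMSE across the tuning grid:

# Sketch: resampled RMSE across the mtry grid
plot(model6)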


Random Forest with conditional inference

With a random forest built from conditional inference trees, we arrive at a cross-validated RMSE of 0.6834 and R-squared of 0.5839 on the training data, and an RMSE of 0.6133 and R-squared of 0.7130 on the test data.

I'm curious whether there are theoretical reasons why switching to conditional inference trees would make a random forest perform worse here.

# Train the random forest with conditional inference trees model 
tuneGrid <- expand.grid(mtry = seq(1, 10, by = 1))
set.seed(175175328)
model7 <- train(x = trainPredictors,
                y = trainTarget,
                method = "cforest",
                preProc = c("center", "scale"),
                tuneGrid = tuneGrid,
                trControl = trainControl(method = "cv", number = 10 ))
model7
## Conditional Inference Random Forest 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 131, 130, 130, 130, 129, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##    1    0.8896558  0.4748586  0.7237117
##    2    0.8055261  0.5204348  0.6493272
##    3    0.7563850  0.5424041  0.6057099
##    4    0.7277064  0.5551902  0.5795782
##    5    0.7140324  0.5599167  0.5671698
##    6    0.7025270  0.5668312  0.5625513
##    7    0.6938436  0.5770839  0.5530291
##    8    0.6923971  0.5706127  0.5516456
##    9    0.6833793  0.5838995  0.5450433
##   10    0.6854868  0.5733028  0.5472057
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 9.
# Apply our conditional inference random forest model to the test data
model7_pred <- predict(model7, newdata = testPredictors)
postResample(pred = model7_pred, obs = testTarget)
##      RMSE  Rsquared       MAE 
## 0.6133351 0.7129650 0.4766234


Boosted trees

With boosted trees, we arrive at a cross-validated RMSE of 0.5817 and R-squared of 0.6660 on the training data, and an RMSE of 0.4691 and R-squared of 0.7802 on the test data.

Boosted trees are very sensitive to the hyperparameters selected. We shrank the tuning grid to one combination to reduce the amount of output.

# Train a boosted trees model 
tuneGrid <- expand.grid(interaction.depth = c(5),
                        n.trees = c(500),
                        shrinkage = c(0.1),
                        n.minobsinnode = c(10))
set.seed(175175328)
model8 <- train(x = trainPredictors,
                y = trainTarget,
                method = "gbm",
                preProc = c("center", "scale"),
                tuneGrid = tuneGrid,
                trControl = trainControl(method = "cv", number = 10 ))
# In its own code block to hide the lengthy output from training the model
model8
## Stochastic Gradient Boosting 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 131, 130, 130, 130, 129, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.5817299  0.6659662  0.4652698
## 
## Tuning parameter 'n.trees' was held constant at a value of 500
## Tuning parameter 'interaction.depth' was held constant at a value of 5
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
# Apply our boosted trees model to the test data
model8_pred <- predict(model8, newdata = testPredictors)
postResample(pred = model8_pred, obs = testTarget)
##      RMSE  Rsquared       MAE 
## 0.4690847 0.7802397 0.3579802
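
For reference, here is a broader grid one might explore with a larger compute and output budget (a sketch only; the values are illustrative, not tuned):

# Sketch: a broader boosted-trees tuning grid (160 candidate combinations)
tuneGridWide <- expand.grid(interaction.depth = c(1, 3, 5, 7),
                            n.trees = seq(100, 1000, by = 100),
                            shrinkage = c(0.01, 0.1),
                            n.minobsinnode = c(5, 10))
nrow(tuneGridWide)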


Cubist

With this Cubist model, we arrive at a cross-validated RMSE of 0.5278 and R-squared of 0.7183 on the training data, and an RMSE of 0.5241 and R-squared of 0.7253 on the test data.

# Train a cubist model 
tuneGrid <- expand.grid(committees = c(1, 10, 50),
                        neighbors = c(0, 1, 5))
set.seed(175175328)
model9 <- train(x = trainPredictors,
                y = trainTarget,
                method = "cubist",
                preProc = c("center", "scale"),
                tuneGrid = tuneGrid,
                trControl = trainControl(method = "cv", number = 10 ))
model9
## Cubist 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 131, 130, 130, 130, 129, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE       Rsquared   MAE      
##    1          0          0.6311769  0.6045652  0.4983715
##    1          1          0.5740341  0.6856529  0.4202134
##    1          5          0.5290273  0.7180859  0.4225843
##   10          0          0.6439002  0.5811025  0.5067713
##   10          1          0.5916300  0.6793025  0.4163314
##   10          5          0.5869912  0.6606555  0.4495494
##   50          0          0.6157128  0.6047530  0.4804664
##   50          1          0.5624922  0.6877002  0.3942901
##   50          5          0.5544594  0.6798224  0.4221757
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 1 and neighbors = 5.
# Apply our cubist model to the test data
model9_pred <- predict(model9, newdata = testPredictors)
postResample(pred = model9_pred, obs = testTarget)
##      RMSE  Rsquared       MAE 
## 0.5240975 0.7252825 0.4267336


Part a

Which tree-based regression model gives the optimal resampling and test set performance?

Judged by test set R-squared, the random forest model performed best (although the boosted trees model had the lowest test RMSE and MAE), while the Cubist model had the best resampling performance.

Perhaps some aspect of the Cubist model allowed it to capture noise or fold-specific patterns during cross-validation, while the random forest generalized better to the test set.

Model              Test RMSE   Test R-squared   CV RMSE   CV R-squared
Random Forest      0.5484      0.7927           0.6166    0.6607
RF w/ Cond. Inf.   0.6133      0.7130           0.6834    0.5839
Boosted Trees      0.4691      0.7802           0.5817    0.6660
Cubist             0.5241      0.7253           0.5278    0.7183
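
As a sanity check, here is a minimal sketch of how this table can be assembled from the fits above, using caret's getTrainPerf for the resampled metrics and the postResample results for the test set:

# Sketch: build the comparison table from the caret fits and test-set predictions
perf <- function(model, pred) {
  c(Test = postResample(pred, testTarget)[1:2],
    CV = unlist(getTrainPerf(model)[c("TrainRMSE", "TrainRsquared")]))
}
rbind(`Random Forest`    = perf(model6, model6_pred),
      `RF w/ Cond. Inf.` = perf(model7, model7_pred),
      `Boosted Trees`    = perf(model8, model8_pred),
      `Cubist`           = perf(model9, model9_pred))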


Part b

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

The top ten most important predictors are the same as those from the optimal nonlinear model, just in a different order. Eight of the ten match the top ten from the optimal linear model. Manufacturing process variables dominate the list, seven to three, and hold the top four spots.

# Print the top ten most important predictors
rfVarImp <- varImp(model6)
top10_rf <- rfVarImp$importance |>
  as.data.frame() |>
  dplyr::arrange(desc(Overall)) |>
  head(10)
print(top10_rf)
##                          Overall
## ManufacturingProcess32 100.00000
## ManufacturingProcess13  57.49393
## ManufacturingProcess31  43.96549
## ManufacturingProcess09  37.88973
## BiologicalMaterial03    36.14293
## BiologicalMaterial06    33.44704
## ManufacturingProcess17  32.63360
## BiologicalMaterial12    32.34145
## ManufacturingProcess06  31.75614
## ManufacturingProcess36  22.63993


Part c

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Here we have a simplified rendering of the decision tree with only the first two levels of splits displayed. The top number in each oval is the average yield at that node (on the standardized scale used here), and the percentage is the share of the training data that reaches that point along that path in the tree. Splits higher up in the tree have the largest impact, and splits further down refine the resulting prediction.

This shows ManufacturingProcess32 as having the largest predictive impact, with the split at 0.21 (on the standardized scale). If ManufacturingProcess32 is above 0.21, we expect significantly higher yield; if, in addition, ManufacturingProcess06 is above 0.36, we see the highest yields. If ManufacturingProcess32 is below its split and BiologicalMaterial11 is below -0.37, we see the lowest yields.

What's not clear to me is whether we could simply set ManufacturingProcess32 above 0.21 for every batch, and whether doing so would be cost effective. With so many manufacturing processes that may need to be set differently depending on the combination of biological materials we start with, we can describe the conditions associated with the highest yield, but the current workflow may already be close to optimal for the materials available to us.

Using the partykit package we can display the decision tree as follows:

Fitted party:
[1] root
| [2] ManufacturingProcess32 < 0.2139
| | [3] BiologicalMaterial11 < -0.36738: -0.889 (n = 40, err = 13.5)
| | [4] BiologicalMaterial11 >= -0.36738: -0.151 (n = 43, err = 22.3)
| [5] ManufacturingProcess32 >= 0.2139
| | [6] ManufacturingProcess06 < 0.36415: 0.286 (n = 35, err = 19.6)
| | [7] ManufacturingProcess06 >= 0.36415: 1.240 (n = 26, err = 12.8)

# Train a single tree model 
tuneGrid <- expand.grid(cp = seq(0.01, 0.1, by = 0.01))
set.seed(175175328)
model10 <- train(x = trainPredictors,
                y = trainTarget,
                method = "rpart",
                preProc = c("center", "scale"),
                tuneGrid = tuneGrid,
                trControl = trainControl(method = "cv", number = 10 ))
fancyRpartPlot(model10$finalModel)

#Alternative plot that sometimes shows more branching
#rpart.plot(model10$finalModel, type = 2, fallen.leaves = TRUE, cex = 0.5)

# Code used to produce the party tree display shown above (requires the partykit package)
#party_tree <- as.party(model10$finalModel)
#party_tree



References

These exercises come from ‘Applied Predictive Modeling’ by Max Kuhn and Kjell Johnson.

For more information about Cubist models, specifically how to extract variable importance and average relative variable coefficients across the rules, I referred to: https://cran.r-project.org/web/packages/Cubist/vignettes/cubist.html (see Exercise 8.1 Part d)