Recreate the simulated data from Exercise 7.2:
In the hidden code chunk below we create the simulated data.
# Check out packages
library(mlbench)
library(randomForest)
library(caret)
library(knitr)
library(party)
library(gbm)
library(Cubist)
library(rules)
library(rpart)
library(ggplot2)
library(AppliedPredictiveModeling)
library(rpart.plot)
library(rattle)
#library(partykit)
# Create simulated data
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores. Did the random forest model significantly use the uninformative predictors (V6 – V10)?
The variable importance scores are displayed below. The random forest model did not significantly use the uninformative predictors, `V6` through `V10`.
# Fit a random forest model
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
kable(rfImp1, align="c")
|     |  Overall   |
|:----|:----------:|
| V1  | 8.6274328  |
| V2  | 6.2743724  |
| V3  | 0.7230546  |
| V4  | 7.5025858  |
| V5  | 2.1357565  |
| V6  | 0.1239500  |
| V7  | 0.0292789  |
| V8  | -0.1172432 |
| V9  | -0.1034480 |
| V10 | 0.0431256  |
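For a quick visual check, randomForest also provides a built-in importance plot. A minimal sketch, assuming `model1` from the chunk above is still in scope:
# Plot the permutation (%IncMSE) and node-purity importances for model1;
# the uninformative predictors V6-V10 should sit near zero in both panels
varImpPlot(model1, main = "Random forest variable importance")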
Now add an additional predictor that is highly correlated with one of the informative predictors. Fit another random forest model to these data. Did the importance score for `V1` change? What happens when you add another predictor that is also highly correlated with `V1`?
Most of the predictors kept roughly the same level of importance; however, `V1` decreased in importance by about 24%, and the new predictor highly correlated with `V1`, `duplicate1`, took on significant importance in the model.
# Add additional predictor correlated with V1
simulated2 <- simulated
set.seed(300)
simulated2$duplicate1 <- simulated2$V1 + rnorm(200) * .1
# Confirm they are highly correlated: 0.9528954
#cor(simulated2$duplicate1, simulated2$V1)
# Fit a random forest model on the data that includes the correlated predictor
model2 <- randomForest(y ~ ., data = simulated2, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
kable(rfImp2, align="c")
|     |  Overall   |
|:----|:----------:|
| V1  | 8.8573953  |
| V2  | 6.4406713  |
| V3  | 0.8226900  |
| V4  | 7.7206880  |
| V5  | 2.2082585  |
| V6  | 0.0889597  |
| V7  | 0.0649301  |
| V8  | -0.0247189 |
| V9  | -0.1349200 |
| V10 | -0.1739487 |
Use the `cforest` function in the party package to fit a random forest model using conditional inference trees. The party package function `varimp` can calculate predictor importance. The `conditional` argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
`cforest` builds a random forest in which splits are selected with statistical hypothesis tests rather than by maximizing an impurity measure such as the Gini index. Because of this it tends to produce more balanced trees that are less sensitive to irrelevant splits.
We see similar importance values as before, except that the irrelevant predictors are selected less often for splits and `V3` now has no significant importance in the model.
Note that the default for `varimp` is to calculate each variable's importance without taking its correlation with the other predictors into account. If we set `conditional` to `TRUE`, the calculation takes longer, but we see the relative importance of the variables without the inflation that redundancy between correlated predictors can introduce.
# Fit a random forest model using conditional inference trees
set.seed(300)
model3 <- cforest(y ~ ., data = simulated)
rfImp3 <- varImp(model3, conditional=TRUE)
kable(rfImp3, align="c")
|     |  Overall   |
|:----|:----------:|
| V1  | 6.1485270  |
| V2  | 5.3241662  |
| V3  | 0.0306096  |
| V4  | 6.3559137  |
| V5  | 1.2414177  |
| V6  | 0.0005028  |
| V7  | 0.0127525  |
| V8  | -0.0067162 |
| V9  | 0.0062369  |
| V10 | -0.0309395 |
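To see the effect of the `conditional` argument directly, here is a small sketch (assuming `model3` above is still in scope) that computes both versions with `party::varimp` and displays them side by side; the conditional run is noticeably slower:
# Compare unconditional vs. conditional permutation importance for the cforest fit
set.seed(300)
imp_uncond <- party::varimp(model3, conditional = FALSE)
imp_cond <- party::varimp(model3, conditional = TRUE)
kable(round(cbind(Unconditional = imp_uncond, Conditional = imp_cond), 4), align = "c")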
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
The same pattern occurs for both boosted trees and Cubist, except that the Cubist model identifies `V3` as having no importance rather than low importance.
Here we show the variable importance of the boosted trees model. All five of the relevant variables are significant, except that `V3` is on the verge of insignificance.
# Fit a boosted trees model
set.seed(300)
model4 <- gbm(y ~ ., data = simulated, distribution = "gaussian")
rfImp4 <- summary(model4, plotit=FALSE)
rownames(rfImp4) <- NULL
kable(rfImp4, align="c")
| var |  rel.inf   |
|:---:|:----------:|
| V4  | 30.5478789 |
| V1  | 26.5715289 |
| V2  | 23.6176964 |
| V5  | 11.9753864 |
| V3  | 6.9728920  |
| V9  | 0.1605657  |
| V7  | 0.1540518  |
| V6  | 0.0000000  |
| V8  | 0.0000000  |
| V10 | 0.0000000  |
Here we show the importance of the variables in the Cubist model. Four of the five relevant predictors are used by the model, while `V3` is not.
The Cubist model fits a linear model at each leaf; using the alternative code commented out below, you can inspect the coefficients of the combined linear model to get another view of variable importance.
# Fit a cubist model
set.seed(300)
model5 <- cubist(simulated[, -11], simulated$y)
rfImp5 <- varImp(model5)
kable(rfImp5, align="c")
|     | Overall |
|:----|:-------:|
| V1  |   50    |
| V2  |   50    |
| V4  |   50    |
| V5  |   50    |
| V3  |    0    |
| V6  |    0    |
| V7  |    0    |
| V8  |    0    |
| V9  |    0    |
| V10 |    0    |
# Show the relative coefficients across the rules combined as a single linear model
#rule_df <- tidy(model5)
#rule_df$estimate[[1]]
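To see where those all-or-nothing scores come from, one option (a sketch, assuming `model5` is in scope) is to print the rules themselves; Cubist's `varImp` is based on how often each predictor appears in the rule conditions and in the rule-level linear models, and `summary` shows both along with the attribute usage statistics:
# Print the rule conditions, per-rule linear models, and attribute usage statistics
summary(model5)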
Use a simulation to show tree bias with different granularities.
What we expect to see is that models fit with low granularity (a low maximum depth or a small number of leaves) underfit, giving high bias and low variance, while models fit with high granularity (a high maximum depth or a large number of leaves) overfit, giving low bias and high variance. Let’s see how that plays out when we actually attempt it.
Here we show that as `maxdepth` varies from low to high granularity, the mean squared error goes down, showing a reduction in bias as the granularity increases. In practice we would want the lowest `maxdepth` that achieves the full possible reduction in bias without increasing variance.
If we were to use k-fold cross-validation, we would expect the mean squared error to increase as granularity increases, showing an increase in variance as `maxdepth` goes up. We leave this as a future exercise (a sketch follows the plotting code below).
# Generate a simple simulated dataset
set.seed(300)
n <- 200
x <- runif(n, -10, 10)
y <- sin(x) + rnorm(n, sd = 0.5)
data <- data.frame(x = x, y = y)
# Fit decision trees with varying granularity and record the in-sample MSE
results <- data.frame()
for (depth in 1:10) {
  model <- rpart(y ~ x, data = data, control = rpart.control(maxdepth = depth))
  predictions <- predict(model, data)
  mse <- mean((predictions - data$y)^2)
  results <- rbind(results, data.frame(Granularity = depth, MSE = mse))
}
# Plotting the results
ggplot(results, aes(x = Granularity, y = MSE)) +
  geom_line() +
  geom_point() +
  labs(title = "Tree Bias vs. Granularity",
       x = "Tree Granularity (Max Depth)",
       y = "Mean Squared Error (Bias)")
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The model on the right has a “winner takes all” effect: the high bagging fraction (0.9) means that 90% of the data is used in each boosting iteration, so there is less randomness in each sample and the model is allowed to become biased toward the overall dataset, magnifying the importance of fewer predictors.
Also, since the model on the right has a high learning rate (0.9), it “learns quickly,” making larger updates in each iteration and converging more rapidly on a few dominant predictors early in the boosting process.
Which model do you think would be more predictive of other samples?
The model on the left, with a bagging fraction and learning rate of 0.1, is more likely to generalize successfully to other samples: because each iteration sees only 10% of the data, it is less likely to be overfit, and because the learning rate is 0.1, importance is more likely to be spread across more predictors and the trees built will be more balanced, which in turn should help the model perform better on new data.
Remember, though, that these are just two extremes; we would have to tune the model to find the ideal bagging fraction and learning rate.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
The interaction depth is the same as the tree depth, that is, the number of times a tree branches.
I believe increasing the interaction depth would make the slope of predictor importance less steep, because having more branches in each tree forces a larger variety of predictors to be selected across the trees, spreading contributions to the importance metric across a wider array of predictors.
However, I’m not sure whether this flattening effect of deeper trees would win out over the “winner takes all” effect of a high bagging fraction combined with a high learning rate, which concentrates importance among a few predictors.
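To make this concrete, here is a hedged sketch that refits boosted models at the two parameter extremes from Fig. 8.24 on the solubility data (which ships with AppliedPredictiveModeling as `solTrainXtrans`/`solTrainY`); the other settings are illustrative choices, not a reproduction of the figure:
# Fit gbm at the two extremes and compare how concentrated the importances are
data(solubility)  # provides solTrainXtrans and solTrainY
solTrain <- data.frame(solTrainXtrans, Solubility = solTrainY)
set.seed(100)
gbmLow <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
              n.trees = 100, interaction.depth = 3,
              shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbmHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, interaction.depth = 3,
               shrinkage = 0.9, bag.fraction = 0.9)
# The 0.9/0.9 fit should pile its relative influence onto a handful of
# predictors, while the 0.1/0.1 settings spread it across more of them
head(summary(gbmLow, plotit = FALSE), 10)
head(summary(gbmHigh, plotit = FALSE), 10)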
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Here, in the hidden code chunk below, we load, impute and split the data the same as before.
# Load the data
data("ChemicalManufacturingProcess")
# Impute missing values using kNN
imputeModel <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute"))
CMP <- predict(imputeModel, ChemicalManufacturingProcess)
# Split data
set.seed(522)
trainingRows <- createDataPartition(CMP[,1], p = .80, list=FALSE)
trainPredictors <- CMP[trainingRows,-1]
trainTarget <- CMP[trainingRows,1]
testPredictors <- CMP[-trainingRows,-1]
testTarget <- CMP[-trainingRows,1]
With Random Forest, we arrive at an RMSE of 0.6166 and an R-squared of 0.6607 on the training (resampling) data and 0.5484 and 0.7927 on the test data.
Note that `mtry` is the number of predictors randomly made available for splitting at each node. It can be as high as the total number of predictors we have. Reducing the number of predictors available at each split helps de-correlate the trees in the forest, which should help the model generalize to new data.
# Train the random forest model
tuneGrid <- expand.grid(mtry = seq(1, 10, by = 1))
set.seed(175175328)
model6 <- train(x = trainPredictors,
                y = trainTarget,
                method = "rf",
                preProc = c("center", "scale"),
                tuneGrid = tuneGrid,
                trControl = trainControl(method = "cv", number = 10))
model6
## Random Forest
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 131, 130, 130, 130, 129, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 0.7286591 0.5858724 0.5832262
## 2 0.6732224 0.6286727 0.5353257
## 3 0.6500265 0.6422395 0.5175240
## 4 0.6393521 0.6478105 0.5071868
## 5 0.6354561 0.6407565 0.5053535
## 6 0.6324021 0.6502140 0.5008222
## 7 0.6264827 0.6485984 0.4944392
## 8 0.6266584 0.6518903 0.4952487
## 9 0.6166086 0.6606518 0.4876063
## 10 0.6208711 0.6519127 0.4895918
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 9.
# Apply our random forest model to the test data
model6_pred <- predict(model6, newdata = testPredictors)
postResample(pred = model6_pred, obs = testTarget)
## RMSE Rsquared MAE
## 0.5483681 0.7926986 0.4398753
With Random Forest using conditional inference trees, we arrive at an RMSE of 0.6834 and an R-squared of 0.5839 on the training (resampling) data and 0.6133 and 0.7130 on the test data.
I’m curious whether there are theoretical generalities that could explain why adding conditional inference made the random forest model perform worse here.
# Train the random forest with conditional inference trees model
tuneGrid <- expand.grid(mtry = seq(1, 10, by = 1))
set.seed(175175328)
model7 <- train(x = trainPredictors,
                y = trainTarget,
                method = "cforest",
                preProc = c("center", "scale"),
                tuneGrid = tuneGrid,
                trControl = trainControl(method = "cv", number = 10))
model7
## Conditional Inference Random Forest
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 131, 130, 130, 130, 129, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 0.8896558 0.4748586 0.7237117
## 2 0.8055261 0.5204348 0.6493272
## 3 0.7563850 0.5424041 0.6057099
## 4 0.7277064 0.5551902 0.5795782
## 5 0.7140324 0.5599167 0.5671698
## 6 0.7025270 0.5668312 0.5625513
## 7 0.6938436 0.5770839 0.5530291
## 8 0.6923971 0.5706127 0.5516456
## 9 0.6833793 0.5838995 0.5450433
## 10 0.6854868 0.5733028 0.5472057
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 9.
# Apply our conditional inference random forest model to the test data
model7_pred <- predict(model7, newdata = testPredictors)
postResample(pred = model7_pred, obs = testTarget)
## RMSE Rsquared MAE
## 0.6133351 0.7129650 0.4766234
With Boosted Trees, we arrive at an RMSE of 0.5817 and an R-squared of 0.6660 on the training (resampling) data and 0.4691 and 0.7802 on the test data.
Boosted trees are very sensitive to the selected hyperparameters. We shrank the tuning grid to one combination to reduce the amount of output; a broader grid is sketched after the results below.
# Train a boosted trees model
tuneGrid <- expand.grid(interaction.depth = c(5),
                        n.trees = c(500),
                        shrinkage = c(0.1),
                        n.minobsinnode = c(10))
set.seed(175175328)
model8 <- train(x = trainPredictors,
                y = trainTarget,
                method = "gbm",
                preProc = c("center", "scale"),
                tuneGrid = tuneGrid,
                trControl = trainControl(method = "cv", number = 10))
# In its own code chunk to hide the lengthy output from training the model
model8
## Stochastic Gradient Boosting
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 131, 130, 130, 130, 129, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.5817299 0.6659662 0.4652698
##
## Tuning parameter 'n.trees' was held constant at a value of 500
##
## Tuning parameter 'interaction.depth' was held constant at a value of 5
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
# Apply our boosted trees model to the test data
model8_pred <- predict(model8, newdata = testPredictors)
postResample(pred = model8_pred, obs = testTarget)
## RMSE Rsquared MAE
## 0.4690847 0.7802397 0.3579802
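As noted above, the grid was held to a single combination to keep the output short; a broader (and much slower) search could use a hypothetical grid like the one below and let cross-validation pick the winner:
# Illustrative wider gbm grid; the values are examples, not tuned results
wideGrid <- expand.grid(interaction.depth = c(1, 3, 5, 7),
                        n.trees = c(100, 500, 1000),
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = c(5, 10))
# model8_wide <- train(x = trainPredictors, y = trainTarget, method = "gbm",
#                      preProc = c("center", "scale"), tuneGrid = wideGrid,
#                      trControl = trainControl(method = "cv", number = 10),
#                      verbose = FALSE)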
With this Cubist model, we arrive at an RMSE of 0.5278 and an R-squared of 0.7183 on the training (resampling) data and 0.5241 and 0.7253 on the test data.
# Train a cubist model
tuneGrid <- expand.grid(committees = c(1, 10, 50),
                        neighbors = c(0, 1, 5))
set.seed(175175328)
model9 <- train(x = trainPredictors,
                y = trainTarget,
                method = "cubist",
                preProc = c("center", "scale"),
                tuneGrid = tuneGrid,
                trControl = trainControl(method = "cv", number = 10))
model9
## Cubist
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 131, 130, 130, 130, 129, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 0.6311769 0.6045652 0.4983715
## 1 1 0.5740341 0.6856529 0.4202134
## 1 5 0.5290273 0.7180859 0.4225843
## 10 0 0.6439002 0.5811025 0.5067713
## 10 1 0.5916300 0.6793025 0.4163314
## 10 5 0.5869912 0.6606555 0.4495494
## 50 0 0.6157128 0.6047530 0.4804664
## 50 1 0.5624922 0.6877002 0.3942901
## 50 5 0.5544594 0.6798224 0.4221757
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 1 and neighbors = 5.
# Apply our cubist model to the test data
model9_pred <- predict(model9, newdata = testPredictors)
postResample(pred = model9_pred, obs = testTarget)
## RMSE Rsquared MAE
## 0.5240975 0.7252825 0.4267336
Which tree-based regression model gives the optimal resampling and test set performance?
The random forest model had the best test-set R-squared (though the boosted trees model had the lowest test-set RMSE), while the Cubist model had the best resampling performance.
Perhaps some aspect of the Cubist model allowed it to capture noise or specifics of the cross-validation folds, while the random forest generalized better to the held-out test set.
| Model            | Test RMSE | Test R-squared | Train RMSE | Train R-squared |
|:-----------------|:---------:|:--------------:|:----------:|:---------------:|
| Random Forest    |  0.5484   |     0.7927     |   0.6166   |     0.6607      |
| RF w/ Cond. Inf. |  0.6133   |     0.7130     |   0.6834   |     0.5839      |
| Boosted Trees    |  0.4691   |     0.7802     |   0.5817   |     0.6660      |
| Cubist           |  0.5241   |     0.7253     |   0.5278   |     0.7183      |
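The table above was assembled by hand; as a sketch, the test-set columns can also be computed programmatically from the four fitted models (assuming they are all still in scope):
# Collect test-set RMSE, R-squared, and MAE for each tree-based model
models <- list("Random Forest" = model6, "RF w/ Cond. Inf." = model7,
               "Boosted Trees" = model8, "Cubist" = model9)
testPerf <- t(sapply(models, function(m)
  postResample(predict(m, newdata = testPredictors), testTarget)))
kable(round(testPerf, 4), align = "c")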
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
The top ten most important predictors are the same as those from the optimal nonlinear model, just in a different order. Eight of the ten match the top ten from the optimal linear model. Manufacturing process variables dominate the list, seven to three, and hold the top four spots.
# Print the top ten most important predictors
rfVarImp <- varImp(model6)
top10_rf <- rfVarImp$importance |>
  as.data.frame() |>
  dplyr::arrange(desc(Overall)) |>
  head(10)
print(top10_rf)
## Overall
## ManufacturingProcess32 100.00000
## ManufacturingProcess13 57.49393
## ManufacturingProcess31 43.96549
## ManufacturingProcess09 37.88973
## BiologicalMaterial03 36.14293
## BiologicalMaterial06 33.44704
## ManufacturingProcess17 32.63360
## BiologicalMaterial12 32.34145
## ManufacturingProcess06 31.75614
## ManufacturingProcess36 22.63993
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
Here we have a simplified rendering of the decision tree with only the first two levels of branching displayed. The top number in each oval is the average yield for that node, and the percentage is the share of data points that reach that node along that path in the tree. Splits higher up in the decision tree have the largest impact, while splits further down refine the resulting prediction.
This shows manufacturing process 32 as having the largest predictive impact, with its split at 0.21. If manufacturing process 32 is higher than 0.21, we expect significantly higher yield. If it is higher and manufacturing process 06 is also higher than 0.36, we see the highest yields. If manufacturing process 32 is lower than its split and biological material 11 is lower than -0.37, we see the lowest yields.
What’s not clear to me is whether we could set manufacturing process 32 above 0.2 for all runs, and whether that would be cost effective. It may be that the many manufacturing processes have to be set differently depending on which combination of biological materials you start with, so while we can describe the ideal way to maximize yield, what we currently have may already be the optimal workflow for the biological materials available to us.
Using the partykit package we can display the decision tree as follows:
Fitted party:
[1] root
|   [2] ManufacturingProcess32 < 0.2139
|   |   [3] BiologicalMaterial11 < -0.36738: -0.889 (n = 40, err = 13.5)
|   |   [4] BiologicalMaterial11 >= -0.36738: -0.151 (n = 43, err = 22.3)
|   [5] ManufacturingProcess32 >= 0.2139
|   |   [6] ManufacturingProcess06 < 0.36415: 0.286 (n = 35, err = 19.6)
|   |   [7] ManufacturingProcess06 >= 0.36415: 1.240 (n = 26, err = 12.8)
# Train a single tree model
tuneGrid <- expand.grid(cp = seq(0.01, 0.1, by = 0.01))
set.seed(175175328)
model10 <- train(x = trainPredictors,
                 y = trainTarget,
                 method = "rpart",
                 preProc = c("center", "scale"),
                 tuneGrid = tuneGrid,
                 trControl = trainControl(method = "cv", number = 10))
fancyRpartPlot(model10$finalModel)
#Alternative plot that sometimes shows more branching
#rpart.plot(model10$finalModel, type = 2, fallen.leaves = TRUE, cex = 0.5)
# Display party tree results
#party_tree <- as.party(model10$finalModel)
#party_tree
These exercises come from ‘Applied Predictive Modeling’ by Max Kuhn and Kjell Johnson.
For more information about Cubist models, specifically how to extract variable importance and average relative variable coefficients across the rules, I referred to: https://cran.r-project.org/web/packages/Cubist/vignettes/cubist.html (see Exercise 8.1 Part d)