Some of these models take a long time to train, so we’ll use
doParallel to speed things up.
library(doParallel)
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
library(caret)
model1 <- randomForest(
y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000
)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
Did the random forest model significantly use the uninformative
predictors (V6 - V10)?
From the importance scores above, we can see that V6 - V10 did not
contribute meaningfully to the model, given their low
“Overall” scores.
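As a quick sanity check (not part of the original exercise), we could refit the forest on V1 - V5 only and compare the final out-of-bag MSE; if V6 - V10 carry no signal, the error should be essentially unchanged. A minimal sketch:
# Sanity check: refit on V1-V5 only and compare out-of-bag MSE.
model1_reduced <- randomForest(
  y ~ .,
  data = simulated[, c("V1", "V2", "V3", "V4", "V5", "y")],
  ntree = 1000
)
c(
  full = tail(model1$mse, 1),
  reduced = tail(model1_reduced$mse, 1)
)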
Use the cforest function in the party
package to fit a random forest model using conditional inference trees.
The party package function varimp can
calculate predictor importance. The conditional argument of
that function toggles between the traditional importance measure and the
modified version described in Strobl et al. (2007). Do these importances
show the same pattern as the traditional random forest model?
library(party)
simulated <- simulated[, c(1:11)]
model4 <- cforest(
y ~ .,
data = simulated
)
varimp(model4, conditional = TRUE)
## V1 V2 V3 V4 V5
## 5.699856e+00 5.190114e+00 6.219633e-03 6.741766e+00 1.182547e+00
## V6 V7 V8 V9 V10
## -1.104739e-02 1.491985e-05 -8.383608e-03 -7.355064e-03 -2.342891e-02
Yes, the pattern is similar to the first model: V1, V2, V4, and V5 are
much more important than the rest, while the conditional importances of
V6 - V10 remain near zero.
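To see how much the conditional adjustment changes the ranking, a small sketch (my own addition) can compute the traditional and conditional measures side by side from the model4 object above; expect the numbers to drift slightly between runs, since varimp relies on random permutations.
# Recompute both importance measures for the cforest fit.
cforest_imp <- data.frame(
  traditional = varimp(model4, conditional = FALSE),
  conditional = varimp(model4, conditional = TRUE)
)
cforest_imp[order(-cforest_imp$conditional), ]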
Boosted Trees
library(gbm)
global_tr_control <- trainControl(method = "cv", allowParallel = TRUE)
boost_tree_grid <- expand.grid(
  .interaction.depth = seq(1, 7, by = 1),
  .n.trees = seq(100, 1000, by = 50),
  .shrinkage = seq(.01, .5, by = .05),
  .n.minobsinnode = c(5:15)
)
boost_fit <- train(
simulated[, c(1:10)],
simulated$y,
method = "gbm",
tuneGrid = boost_tree_grid,
verbose = FALSE,
trControl = global_tr_control
)
varImp(boost_fit, scale = FALSE)
## gbm variable importance
##
## Overall
## V4 9260.0
## V1 8779.1
## V2 7226.2
## V5 3688.0
## V3 2943.8
## V6 416.8
## V7 375.2
## V10 224.0
## V8 142.4
## V9 141.4
Using boosted trees, we see that the same variables are the most important.
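Because the tuning grid above is large, it is worth checking which combination cross-validation actually selected; a short sketch using the boost_fit object:
# Winning tuning parameters and the cross-validation profile.
boost_fit$bestTune
plot(boost_fit)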
Cubist
library(Cubist)
cubist_fit <- cubist(
simulated[, c(1:10)],
simulated$y
)
varImp(cubist_fit, scale = FALSE)
Using the Cubist model, we again see that the same variables are the most important, though here the importances of the remaining variables are set to 0.
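The Cubist fit above uses a single committee and no instance-based adjustment. As a sketch (the grid values are my own illustrative choices, not from the text), the committees and neighbors parameters could also be tuned through caret:
# Illustrative Cubist tuning; committees and neighbors are the two parameters
# caret exposes for method = "cubist".
cubist_grid <- expand.grid(
  committees = c(1, 10, 50),
  neighbors = c(0, 5, 9)
)
cubist_tuned <- train(
  simulated[, c(1:10)],
  simulated$y,
  method = "cubist",
  tuneGrid = cubist_grid,
  trControl = global_tr_control
)
varImp(cubist_tuned, scale = FALSE)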
Use a simulation to show tree bias with different granularities.
We’ll create a set of predictors with different levels of granularity: c1 takes 10 distinct values, c2 takes 100, c3 takes 1,000, and c4, the most granular, takes 10,000. Each contributes equally to the response y.
library(rpart)
library(partykit)
set.seed(19940211)
c1 <- sample(1:10 / 10, 2000, replace = TRUE)
c2 <- sample(1:100 / 100, 2000, replace = TRUE)
c3 <- sample(1:1000 / 1000, 2000, replace = TRUE)
c4 <- sample(1:10000 / 10000, 2000, replace = TRUE)
y <- c1 + c2 + c3 + c4 + rnorm(2000, mean = 0, sd = 1.5)
train_tree_data <- data.frame(c1, c2, c3, c4, y)
sim_tree_fit <- rpart(
y ~ .,
data = train_tree_data
)
plot(as.party(sim_tree_fit))
varImp(sim_tree_fit, scale = FALSE)
The simulated predictors each contribute equally to y, and the
varImp() output shows that the variables end up with
relatively similar importance in this fit.
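To make any granularity bias easier to see, a follow-up sketch (my own addition, reusing the same data-generating code) can repeat the simulation and tally which predictor rpart picks for the root split; a bias toward finer-grained predictors would show up as c3 and c4 being chosen disproportionately often.
# Repeat the simulation and record the variable used at the root split.
root_splits <- replicate(200, {
  c1 <- sample(1:10 / 10, 2000, replace = TRUE)
  c2 <- sample(1:100 / 100, 2000, replace = TRUE)
  c3 <- sample(1:1000 / 1000, 2000, replace = TRUE)
  c4 <- sample(1:10000 / 10000, 2000, replace = TRUE)
  y <- c1 + c2 + c3 + c4 + rnorm(2000, mean = 0, sd = 1.5)
  fit <- rpart(y ~ ., data = data.frame(c1, c2, c3, c4, y))
  as.character(fit$frame$var[1])
})
table(root_splits)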
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
The model on the right (bagging fraction = 0.9 and shrinkage = 0.9) concentrates its importance on the first few predictors because the high learning rate causes the model to converge quickly on a few of the strongest predictors. Additionally, because of the high bagging fraction, each tree is fit to a very large subsample, so the samples are not very diverse.
I believe the model on the left will be more predictive because its parameters make it better at generalizing across diverse samples.
Increasing the interaction depth would effectively increase the maximum number of nodes in each tree. This would increase the slope.
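To reproduce the qualitative pattern in Figure 8.24, a sketch along these lines could fit gbm directly on the solubility data at the two extreme settings and compare relative influence; n.trees and interaction.depth below are arbitrary illustrative values of mine, not the book's.
# Sketch: relative influence under the two extreme settings from Figure 8.24.
library(AppliedPredictiveModeling)
data(solubility)
sol_train <- cbind(solTrainXtrans, Solubility = solTrainY)
gbm_small <- gbm(
  Solubility ~ .,
  data = sol_train,
  distribution = "gaussian",
  n.trees = 500,
  interaction.depth = 3,
  shrinkage = 0.1,
  bag.fraction = 0.1
)
gbm_large <- gbm(
  Solubility ~ .,
  data = sol_train,
  distribution = "gaussian",
  n.trees = 500,
  interaction.depth = 3,
  shrinkage = 0.9,
  bag.fraction = 0.9
)
# The 0.9/0.9 fit should concentrate its relative influence on fewer predictors.
head(summary(gbm_small, plotit = FALSE), 5)
head(summary(gbm_large, plotit = FALSE), 5)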
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
sum(is.na(ChemicalManufacturingProcess[, -c(1)]))
## [1] 106
# Remove predictors with near-zero variance
ChemicalManufacturingProcess <- ChemicalManufacturingProcess[, -nearZeroVar(ChemicalManufacturingProcess)]
# Filling NAs using KNN
knn_impute <- preProcess(
ChemicalManufacturingProcess[, -c(1)],
method = "knnImpute"
)
cmp_independent <- predict(
knn_impute,
ChemicalManufacturingProcess[, -c(1)]
)
cmp_dependent <- ChemicalManufacturingProcess[, c(1), drop = FALSE]
sum(is.na(cmp_independent[, -c(1)]))
## [1] 0
set.seed(19940211)
# Partition the data into a sample of 80% of the full dataset
cmp_train_rows <- createDataPartition(
cmp_independent$BiologicalMaterial01,
p = 0.8,
list = FALSE
)
# Use the sample to create a training dataset
cmp_train_ind <- cmp_independent[cmp_train_rows, ]
cmp_train_dep <- cmp_dependent$Yield[cmp_train_rows]
# Use the remaining rows to create a testing dataset
cmp_test_ind <- cmp_independent[-cmp_train_rows, ]
cmp_test_dep <- cmp_dependent$Yield[-cmp_train_rows]
Single Tree
cmp_stree_fit <- train(
cmp_train_ind,
cmp_train_dep,
method = "rpart2",
tuneLength = 10,
trControl = global_tr_control
)
cmp_stree_pred <- predict(
cmp_stree_fit,
cmp_test_ind
)
postResample(
cmp_stree_pred,
cmp_test_dep
)
## RMSE Rsquared MAE
## 1.4827649 0.5008663 1.1742636
Bagged Trees
library(ipred)
cmp_bag_fit <- ipredbagg(
cmp_train_dep,
cmp_train_ind
)
cmp_bag_pred <- predict(
cmp_bag_fit,
cmp_test_ind
)
postResample(
cmp_bag_pred,
cmp_test_dep
)
## RMSE Rsquared MAE
## 1.2790524 0.6830667 0.9635714
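The bagged fit above uses ipredbagg's defaults with no resampling; for an estimate on the same cross-validation folds as the other caret fits, a sketch could instead go through train() with method = "treebag":
# Bagged trees through caret so the resampling scheme matches the other models.
cmp_treebag_fit <- train(
  cmp_train_ind,
  cmp_train_dep,
  method = "treebag",
  trControl = global_tr_control
)
postResample(
  predict(cmp_treebag_fit, cmp_test_ind),
  cmp_test_dep
)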
Random Forest
library(randomForest)
cmp_randforest_fit <- randomForest(
cmp_train_ind,
cmp_train_dep,
importance = TRUE,
ntree = 1000
)
cmp_randforest_pred <- predict(
cmp_randforest_fit,
cmp_test_ind
)
postResample(
cmp_randforest_pred,
cmp_test_dep
)
## RMSE Rsquared MAE
## 1.3094566 0.6734821 0.9844426
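The random forest above keeps the default mtry; a brief sketch of tuning it by cross-validation (tuneLength = 5 is an arbitrary choice of mine):
# Tune mtry; the default used above is floor(ncol(x) / 3) for regression.
cmp_rf_tuned <- train(
  cmp_train_ind,
  cmp_train_dep,
  method = "rf",
  tuneLength = 5,
  importance = TRUE,
  trControl = global_tr_control
)
cmp_rf_tuned$bestTune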
Boosted Trees
cmp_boost_fit <- train(
  cmp_train_ind,
  cmp_train_dep,
  method = "gbm",
  tuneGrid = boost_tree_grid,
  verbose = FALSE,
  trControl = global_tr_control
)
cmp_boost_pred <- predict(
  cmp_boost_fit,
  cmp_test_ind
)
postResample(
  cmp_boost_pred,
  cmp_test_dep
)
## RMSE Rsquared MAE
## 1.3748034 0.5648721 1.0478236
Cubist
cmp_cubist_fit <- cubist(
cmp_train_ind,
cmp_train_dep
)
cmp_cubist_pred <- predict(
  cmp_cubist_fit,
  cmp_test_ind
)
postResample(
  cmp_cubist_pred,
  cmp_test_dep
)
Our bagged tree provided the best results, with an \(R^2\) of 0.683 and the lowest RMSE and MAE (1.28 and 0.96, respectively). The random forest was a close second, while the single tree and boosted trees had higher errors and noticeably lower \(R^2\) values.
library(dplyr)
varImp(cmp_bag_fit) |>
arrange(desc(Overall)) |>
head(10)
In the bagged tree model, biological material and manufacturing process predictors each make up roughly half of the top 10. By contrast, in the earlier exercises the manufacturing process predictors dominated the PLS and KNN models that we judged best.
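For comparison (my own addition), the same top-10 view can be pulled from the random forest fit to check whether it agrees with the bagged tree on the biological-versus-process split:
# Top 10 predictors by importance in the random forest fit.
varImp(cmp_randforest_fit, scale = FALSE) |>
  arrange(desc(Overall)) |>
  head(10)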
library(rpart.plot)
rpart.plot(
cmp_stree_fit$finalModel,
type = 4
)
stopCluster(cluster)
registerDoSEQ()