# Packages used throughout (caret pulls in gbm and Cubist as needed)
library(mlbench)        # mlbench.friedman1() simulated data
library(randomForest)   # traditional random forest
library(party)          # cforest() and varimp() for conditional inference forests
library(caret)          # train(), trainControl(), varImp()
library(rpart)          # single regression trees
library(tidyverse)      # data wrangling and ggplot2
library(kableExtra)     # kbl() / kable_styling() tables
set.seed(200)
simulated <- mlbench.friedman1(200,sd=1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(y ~ ., data = simulated,
importance = T, ntree = 100)
rfImp <- varImp(model1, scale = F)
lv <- rownames(rfImp)
rfImp %>% rownames_to_column(var="Variable") %>%
mutate(Variable = factor(Variable, levels=lv, ordered=T)) %>%
ggplot(aes(x=Overall, y=factor(Variable))) + geom_col() +
labs(title="Variable Importance", subtitle="Random Forest Model",
y="Variable", x="Overall Importance")
The random forest model essentially ignored the uninformative predictors (V6 through V10). However, it ranked the informative predictor V3 as less important than I expected.
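The later plots and discussion refer to a model "w/ 2 duplicates", i.e. two extra predictors that are highly correlated with V1, but the chunk that creates them does not appear above. A minimal, assumed reconstruction (following the exercise in the text; the names duplicate1/duplicate2, the noise level, and the seed are all assumptions):
# Assumed reconstruction: two predictors highly correlated with V1
set.seed(201)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)  # roughly 0.94
# Keep the plotting order vector in sync with the new columns
lv <- c(lv, "duplicate1", "duplicate2")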
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
# Model with conditional inference trees
model.cond <- cforest(y ~ ., data=simulated)
# Get variable importance
condImp <- varimp(model.cond, conditional=T)
condImp <- as.data.frame(condImp)
names(condImp) <- "Overall"
condImp %>% rownames_to_column(var="Variable") %>%
mutate(Variable = factor(Variable, levels=lv, ordered=T)) %>%
ggplot(aes(x=Overall, y=factor(Variable))) + geom_col() +
labs(title="Conditional Variable Importance",
subtitle="Random Forest Model (conditional) w/ 2 duplicates",
y="Variable", x="Conditional Importance")
Yes, the conditional inference trees show a similar pattern in variable importance.
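For reference, the traditional (unconditional) permutation importance can be pulled from the same cforest fit by flipping the conditional flag, which makes the side-by-side comparison the question asks about explicit:
# Unconditional importance from the same conditional inference forest
uncondImp <- varimp(model.cond, conditional = FALSE)
sort(uncondImp, decreasing = TRUE)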
set.seed(56431)
# Set our range of tuning parameters
param.gbm <- expand.grid(interaction.depth = seq(1,7,by=2),
n.trees = seq(100,1000,by=50),
shrinkage = seq(0.01,0.1,by=0.01),
n.minobsinnode = 10)
# Training control
ctrl.gbm <- trainControl(method = "cv", number = 10)
# Split out dependent and independent variables
dv <- simulated %>% select(y)
iv <- simulated %>% select(-y)
# Tune the model
model.gbm <- train(x=iv, y=dv$y, method="gbm",
trControl = ctrl.gbm,
tuneGrid = param.gbm,
verbose=F)
# Get the best model based on tuning
model.gbm$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 497 200 5 0.07 10
rfImp <- varImp(model.gbm$finalModel, scale = F)
lv <- rownames(rfImp)
rfImp %>% rownames_to_column(var="Variable") %>%
mutate(Variable = factor(Variable, levels=lv, ordered=T)) %>%
ggplot(aes(x=Overall, y=factor(Variable))) + geom_col() +
labs(title="Variable Importance", subtitle="GBM Model w/ 2 duplicates",
y="Variable", x="Overall Importance")
With the GBM model, we see a similar pattern as before: the duplicate variables have non-zero importance in the final model, and the importance of V1 appears to be reduced as a result.
Interestingly, the V3 predictor is more important under this model than it was under the previous models.
set.seed(56431)
# Set our range of tuning parameters
param.cubist <- expand.grid(committees = seq(1,10,by=1),
neighbors = seq(1,9,by=2))
# Training control
ctrl.cubist <- trainControl(method = "cv", number = 10)
# Tune the Cubist model
model.cubist <- train(x=iv, y=dv$y, method="cubist",
trControl = ctrl.cubist,
tuneGrid = param.cubist,
verbose=F)
model.cubist$bestTune
## committees neighbors
## 44 9 7
rfImp <- varImp(model.cubist$finalModel, scale = F)
lv <- rownames(rfImp)
rfImp %>% rownames_to_column(var="Variable") %>%
mutate(Variable = factor(Variable, levels=lv, ordered=T)) %>%
ggplot(aes(x=Overall, y=factor(Variable))) + geom_col() +
labs(title="Variable Importance", subtitle="Cubist Model w/ 2 duplicates",
y="Variable", x="Overall Importance")
The Cubist model, impressively, ignored the duplicate predictors and most of the uninformative predictors!
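Cubist's own summary also reports how often each predictor is used in rule conditions and in the terminal linear models, which backs up this observation directly:
# Attribute usage from the final Cubist model (conditions vs. model usage)
summary(model.cubist$finalModel)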
According to the text\(^1\) (page 182), “…trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors”.
We’ll test this by training a large number of trees and seeing which variable is chosen from simulated data.
set.seed(05251)
varsUsed <- tibble()
for(i in 1:50){
# Generate data
# One variable X1, is associated with the outcome, Y,
# while X2 is not associated at all.
sim <- tibble(X1 = rep(c(1,2), times=50))
sim$X2 <- rnorm(100, mean = 20, sd = 5)
sim$Y <- sim$X1 + rnorm(100,mean = 0, sd = 1)
# Fit a regression tree
mod <- rpart(Y ~ ., data = sim)
# Most important variable
v <- varImp(mod) %>% slice_max(order_by = Overall) %>% row.names()
varsUsed <- rbind(varsUsed, v)
}
table(varsUsed)
## varsUsed
## X1 X2
## 4 46
Even though variable X2 had no direct impact on the outcome Y, the trees found it to be the most important variable over 90% of the time (46 of 50 fits).
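As an aside, the conditional inference trees used earlier (via cforest) choose split variables with permutation tests, which Strobl et al. (2007) argue avoids exactly this kind of selection bias. A minimal sketch on the last simulated data set from the loop above (assuming the party package is still loaded):
# A single conditional inference tree on the same kind of data;
# the printed tree shows which variable is chosen for the root split
ctree_mod <- party::ctree(Y ~ X1 + X2, data = as.data.frame(sim))
ctree_mod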
The learning rate acts as a regularization (shrinkage) parameter that counters the boosting algorithm's greedy habit of choosing the optimal learner at each stage. In the right-hand plot the learning rate is high (0.9), so almost no shrinkage is applied and the same variables tend to dominate the ensemble as the stages build upon one another.
Similarly, the larger bagging fraction uses more of the data at each stage, so the stages see very similar samples and a subsequent stage more often chooses the same variables a prior stage did.
The left-hand model is likely to do better on data it has not seen. I suspect that having many dissimilar weak learners contribute, rather than a set of trees that all select the same variables, gives the final model more flexibility on out-of-sample observations.
I would expect the importance slopes to flatten (i.e., fewer variables would dominate) as the interaction depth is increased, because deeper trees allow more variables to be used in each learner.
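These points can be sanity-checked on the Friedman data simulated earlier by fitting gbm directly at two extremes and comparing how concentrated the relative influence is. A rough sketch (the 0.1/0.9 values mirror the figure in the text; every other setting here is an assumption):
# Compare importance concentration under weak vs. strong shrinkage/bagging
library(gbm)
set.seed(100)
fit_left  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 500, interaction.depth = 3,
                 shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5)
fit_right <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 500, interaction.depth = 3,
                 shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5)
# Share of total relative influence captured by the top three predictors;
# a larger share means importance is concentrated in fewer variables
top_share <- function(fit, k = 3) {
  ri <- summary(fit, plotit = FALSE)$rel.inf
  sum(sort(ri, decreasing = TRUE)[1:k]) / sum(ri)
}
c(left = top_share(fit_left), right = top_share(fit_right))
# Repeating this with interaction.depth = 1 vs. 7 would test the depth guess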
# Load the chemical manufacturing data
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
# Load the tidymodels package
library(tidymodels)
# Create a train/test split
set.seed(1108)
chem_split <- initial_split(ChemicalManufacturingProcess, prop=0.75)
chem_train <- training(chem_split)
chem_test <- testing(chem_split)
# Create our generic recipe with KNN imputation
chem_recipe <- chem_train %>% recipe(Yield ~ .) %>%
step_knnimpute(all_predictors()) %>% prep()
# Get 10 folds for CV model tuning
folds <- vfold_cv(chem_train, v=10)
We’ll start with a basic Random Forest model.
# Our model specification. We'll tune every param.
rf <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("randomForest") %>%
set_mode("regression")
# A grid of tuning parameters (mtry is held at 57, i.e. all predictors)
rf_tune <- grid_regular(mtry(c(57,57)),
trees(),
min_n(),
levels = 10)
# Create the workflow
rf_wf <- workflow() %>% add_model(rf) %>% add_recipe(chem_recipe)
# Tune the model (takes a while!)
rf.model <- rf_wf %>% tune_grid(resamples = folds, grid = rf_tune)
# Get best model from the tuning results
rf.model %>% show_best("rmse")
## # A tibble: 5 x 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 57 889 2 rmse standard 0.971 10 0.0923 Model005
## 2 57 223 2 rmse standard 0.977 10 0.0931 Model002
## 3 57 667 2 rmse standard 0.977 10 0.0915 Model004
## 4 57 1333 2 rmse standard 0.980 10 0.0919 Model007
## 5 57 1555 2 rmse standard 0.980 10 0.0917 Model008
best_rf <- rf.model %>% select_best(metric = "rmse")
The best random forest model from the tuning uses 889 trees with a min_n of 2 (mtry was held at 57). The estimated cross-validated RMSE for that model is 0.971.
# Take the best model and get test metrics
rf_wf <- rf_wf %>% finalize_workflow(best_rf)
rf_final <- rf_wf %>% last_fit(chem_split)
modelEval <- tibble(model = "Random Forest") %>%
bind_cols(rf_final %>% collect_metrics)
rf_final %>% collect_metrics %>% kbl() %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| .metric | .estimator | .estimate |
|---------|------------|-----------|
| rmse    | standard   | 1.5562577 |
| rsq     | standard   | 0.4812533 |
Our test results are rather poor, as shown in the above table. We’ll try a different model and see how it performs.
# Boosted tree model
boost <- boost_tree(mode = "regression",
mtry = tune(), trees = tune(),
min_n = 10, tree_depth = 8,
learn_rate = tune(),
loss_reduction = tune(),
sample_size = .60,
stop_iter = 3) %>% set_engine("xgboost")
# A grid of tuning parameters (had to cut this down for the sake of my CPU)
boost_tune <- grid_regular(finalize(mtry(),select(chem_train,-Yield)),
trees(),
learn_rate(), loss_reduction(),
levels = 5)
# Create the workflow
boost_wf <- workflow() %>% add_model(boost) %>% add_recipe(chem_recipe)
# Tune the model
boost.model <- boost_wf %>% tune_grid(resamples = folds, grid = boost_tune)
# Get best model from the tuning results
boost.model %>% show_best("rmse")
## # A tibble: 5 x 10
## mtry trees learn_rate loss_reduction .metric .estimator mean n std_err
## <int> <int> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl>
## 1 43 500 0.1 0.0422 rmse standard 0.971 10 0.0920
## 2 43 1000 0.1 0.0422 rmse standard 0.971 10 0.0920
## 3 43 1500 0.1 0.0422 rmse standard 0.971 10 0.0920
## 4 43 2000 0.1 0.0422 rmse standard 0.971 10 0.0918
## 5 29 500 0.1 0.0000562 rmse standard 0.980 10 0.0721
## # … with 1 more variable: .config <chr>
best_boost <- boost.model %>% select_best(metric = "rmse")
The tuning selected a model with an mtry of 43, 500 trees, a learning rate of 0.1, and a loss reduction of \(4.2 \times 10^{-2}\).
# Take the best model and get test metrics
boost_wf <- boost_wf %>% finalize_workflow(best_boost)
boost_final <- boost_wf %>% last_fit(chem_split)
temp <- tibble(model = "XGBoost") %>%
bind_cols(boost_final %>% collect_metrics)
modelEval <- rbind(modelEval, temp)
boost_final %>% collect_metrics() %>% kbl() %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| .metric | .estimator | .estimate |
|---------|------------|-----------|
| rmse    | standard   | 1.2495115 |
| rsq     | standard   | 0.6563924 |
The results of the boosting model were better, but not great.
Finally, we will try the Cubist model.
# Set our range of tuning parameters
param.cubist <- expand.grid(committees = seq(1,10,by=1),
neighbors = seq(1,9,by=2))
# Training control
ctrl.cubist <- trainControl(method = "cv", number = 10)
# Split our data
dv <- chem_train %>% select(Yield)
iv <- chem_train %>% select(-Yield)
# Cubist model
model.cubist <- train(x=iv, y=dv$Yield, method="cubist",
trControl = ctrl.cubist,
tuneGrid = param.cubist,
verbose=F)
# Find the CV tuned model
model.cubist$bestTune
## committees neighbors
## 22 5 3
Cross-validation selected the best model as one with 5 committees and 3 neighbors. We’ll try that on the test data set and see how it compares to the previous two models.
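One caveat before predicting: predict() for a Cubist object defaults to neighbors = 0, so calling it on finalModel directly (as the next chunk does) does not apply the tuned instance-based correction. A sketch of passing the CV-selected value explicitly, for comparison:
# Apply the tuned neighbor adjustment explicitly when predicting
# straight from the underlying Cubist fit
pred.cubist.nn <- predict(model.cubist$finalModel,
                          newdata = chem_test %>% select(-Yield),
                          neighbors = model.cubist$bestTune$neighbors)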
# Make predictions with the final model
pred.cubist <- predict(model.cubist$finalModel, newdata = chem_test)
cubist.metrics <- tibble(predicted = pred.cubist) %>%
bind_cols(actual = chem_test$Yield) %>%
rsq(truth=actual, estimate=predicted)
cubist.metrics <- tibble(predicted = pred.cubist) %>%
bind_cols(actual = chem_test$Yield) %>%
rmse(truth=actual, estimate=predicted) %>%
rbind(cubist.metrics)
modelEval <- tibble(model = "Cubist") %>% bind_cols(cubist.metrics) %>%
rbind(modelEval)
cubist.metrics %>% kbl() %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| .metric | .estimator | .estimate |
|---------|------------|-----------|
| rmse    | standard   | 1.3567007 |
| rsq     | standard   | 0.5936248 |
The Cubist model performed better than the random forest but slightly worse than XGBoost. Let's compare all three:
modelEval
## # A tibble: 6 x 4
## model .metric .estimator .estimate
## <chr> <chr> <chr> <dbl>
## 1 Cubist rmse standard 1.36
## 2 Cubist rsq standard 0.594
## 3 Random Forest rmse standard 1.56
## 4 Random Forest rsq standard 0.481
## 5 XGBoost rmse standard 1.25
## 6 XGBoost rsq standard 0.656
The best model of the three is the XGBoost model.
library(vip)
boost_final %>%
pluck(".workflow", 1) %>%
pull_workflow_fit() %>%
vip(num_features = 10)
With the XGBoost model, 6 of the top 10 most important variables were process variables. Biological variables, however, accounted for 2 of the top 3, so they were well represented.
As with the linear and nonlinear models, the most important variable is ManufacturingProcess32, and ManufacturingProcess09 and ManufacturingProcess13 were also in the top 10 here, as they were in the linear and nonlinear models.
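To make that tally explicit, vip's vi() can return the importance scores as a table instead of a plot (a sketch, assuming vi() keeps its usual Variable/Importance columns for this fit):
# Count process vs. biological predictors among the ten most important
boost_final %>%
  pluck(".workflow", 1) %>%
  pull_workflow_fit() %>%
  vi() %>%
  slice_max(Importance, n = 10) %>%
  mutate(type = ifelse(grepl("^Manufacturing", Variable), "Process", "Biological")) %>%
  count(type)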
# Single tree model
singletree <- decision_tree(mode = "regression",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()) %>%
set_engine("rpart")
# A grid of tuning parameters
single_tune <- grid_regular(cost_complexity(),
tree_depth(),
min_n(),
levels = 5)
# Create the workflow
single_wf <- workflow() %>% add_model(singletree) %>% add_recipe(chem_recipe)
# Tune the model
single.model <- single_wf %>% tune_grid(resamples = folds, grid = single_tune)
# Take the best model
best_single <- single.model %>% select_best(metric = "rmse")
single_wf <- single_wf %>% finalize_workflow(best_single) %>%
fit(chem_train)
single_tree <- single_wf %>% pull_workflow_fit()
rpart.plot::rpart.plot(single_tree$fit, roundint = F)
The very basic tree model used the same ManufacturingProcess32 variable for its first split, indicating that it too found that variable to be important. For the second split it used BiologicalMaterial11, which was in the top 10 of the XGBoost model above, though lower on the list.
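A quick way to confirm which predictors the pruned tree actually splits on is to tabulate the rpart frame inside the fitted engine object; a small sketch:
# Variables used for splits in the final single tree
single_tree$fit$frame %>%
  filter(var != "<leaf>") %>%
  count(var, sort = TRUE)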
\(^1\) - Kuhn, Max, and Kjell Johnson. Applied Predictive Modeling. Springer, 2016.