library(tidyverse)
library(magrittr)
library(vip)
library(patchwork)
library(flextable)
setwd('C:\\Users\\seanc\\OneDrive\\Desktop\\624\\HW9')
This report focuses on the training and evaluation of tree-based regression models (e.g., rule-based, boosted, and random forest models) and responds to questions 8.1, 8.2, and 8.7 from Kuhn and Johnson's Applied Predictive Modeling.
Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Note: a general rule of thumb for selecting mtry (the number of randomly sampled predictors per split) in regression is p/3. Similarly, a commonly suggested minimum for ntree (the number of bootstrapped trees) is ~1,000.
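As a quick sketch (not part of the original analysis), the p/3 heuristic can be computed directly from the simulated data; the object name mtry.suggested is illustrative.

# rule-of-thumb mtry for regression: roughly one third of the predictors
p <- ncol(simulated) - 1            # exclude the response column y
mtry.suggested <- max(floor(p / 3), 1)
mtry.suggested                      # 10 predictors -> 3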
library(randomForest)
# fit random forest model to simulation data
rf1 <- randomForest(y ~ ., data = simulated,
                    importance = TRUE,
                    ntree = 1000,
                    mtry = 4)
# plot variable importance
rf1 %>%
  vip(aesthetics = list(alpha = 0.8, fill = "midnightblue")) +
  labs(title = 'Figure 1. Variable Importance for Simulation Data: Random Forest') +
  theme_light()
#rf1.vi <- varImp(rf1, scale = FALSE) --> if using caret
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No. V6-V10 show low overall importance (based on the permutation measure, %IncMSE) in the random forest model.
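As a sketch of how to confirm this numerically rather than from the plot alone, the permutation importance stored in the fitted rf1 object can be printed directly (standard randomForest accessor; the rounding is illustrative).

# permutation importance (%IncMSE) for each predictor, largest first;
# the uninformative V6-V10 should sit near the bottom with values close to zero
round(sort(importance(rf1, type = 1)[, 1], decreasing = TRUE), 2)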
set.seed(201)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9358765
Fit another random forest model to these data. Did the importance score for V1 change?
Both the relative ordering and the actual importance values change. In addition, V8 is no longer ranked among the top 10 variables.
Collinearity impacts variable importance in two ways: 1) changing the variable importance ranks, and 2) lowering the relative importance of non-collinear variables. The presence of collinear variables can, therefore, have an effect on feature selection.
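As an illustrative aside (not part of the original workflow), highly correlated predictors can be flagged before modeling with caret::findCorrelation; the 0.9 cutoff below is an assumption.

# flag predictors with pairwise |r| above 0.9 as candidates for removal
pred.cor <- cor(select(simulated, -y))
caret::findCorrelation(pred.cor, cutoff = 0.9, names = TRUE)  # flags one of the V1/duplicate1 pair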
# fit random forest model to simulation data
rf2 <- randomForest(y ~ ., data = simulated,
                    importance = TRUE,
                    ntree = 1000,
                    mtry = 4)
# plot variable importance
rf2 %>%
  vip(aesthetics = list(alpha = 0.8, fill = "midnightblue")) +
  labs(title = 'Figure 2. Variable Importance: Random Forest - Simulated Collinearity') +
  theme_light()
What happens when you add another predictor that is also highly correlated with V1?
The relative rankings of importance for V1-V10 remain the same as in the previous model, although V8 and V9 no longer appear in the top 10.
# simulate a second collinear variable
set.seed(203)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9454026
# fit random forest model to simulation data
rf3 <- randomForest(y ~ ., data = simulated,
                    importance = TRUE,
                    ntree = 1000,
                    mtry = 4)
# plot variable importance
rf3 %>%
  vip(aesthetics = list(alpha = 0.8, fill = "midnightblue")) +
  labs(title = 'Figure 3. Variable Importance: Random Forest - Simulated Multi-Collinearity') +
  theme_light()
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

No, the party::varimp() scores do not include the duplicates (the simulated collinear variables) and show a different ranking order than the traditional random forest model (the top-ranked variables are V4, V2, and V1). It's worth noting that the method reported by Strobl et al. adjusts for correlation between predictors, i.e., it represents the effect of a variable in both main effects and interactions.
library(party)
rf4 <- cforest(y ~ ., data = simulated, controls = cforest_control(ntree = 1000, mtry = 4))
rf4.vi <- party::varimp(rf4) %>%
  enframe(name = "Variable", value = "VIScore") %>%  # named vector -> two-column tibble
  mutate(Variable = factor(Variable)) %>%
  arrange(desc(VIScore))
rf4.vicond <- party::varimp(rf4, conditional = TRUE) %>%
  enframe(name = "Variable", value = "VIScore") %>%
  mutate(Variable = factor(Variable)) %>%
  arrange(desc(VIScore))
p4 <- ggplot(rf4.vi, aes(x = reorder(Variable, VIScore), y = VIScore)) +
  geom_col(fill = 'midnightblue') +
  coord_flip() +
  labs(title = 'Figure 4. Variable Importance: cforest Model', x = 'Variable') +
  theme_classic()
p5 <- ggplot(rf4.vicond, aes(x = reorder(Variable, VIScore), y = VIScore)) +
  geom_col(fill = 'midnightblue') +
  coord_flip() +
  labs(title = 'Figure 5. Variable Importance: cforest Model - Conditional', x = 'Variable') +
  theme_classic()
p4 / p5
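An optional numeric check (a sketch using the tibbles created above): placing the unconditional and conditional scores side by side makes the ranking differences in Figures 4 and 5 easier to compare.

# join the two importance tables for a direct, side-by-side comparison
full_join(
  rename(rf4.vi, Unconditional = VIScore),
  rename(rf4.vicond, Conditional = VIScore),
  by = "Variable"
) %>%
  arrange(desc(Unconditional))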
The three highest ranked variables (V4, V2, V1) retain their relative ranks across the three models. However, the relative order changes for the variables in the mid-to-low ranks. Adding a collinear predictor had less effect when using Cubist.
library(gbm)
library(Cubist)
# boosted tree model
gbm1 <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 1000)
vip.gbm <- relative.influence(gbm1, n.trees = 1000, scale. = FALSE, sort. = FALSE) %>%
  enframe(name = "Variable", value = "VIScore") %>%  # named vector -> two-column tibble
  mutate(Variable = factor(Variable)) %>%
  arrange(desc(VIScore))
p6 <- ggplot(vip.gbm, aes(x = reorder(Variable, VIScore), y = VIScore)) +
  geom_col(fill = 'midnightblue') +
  coord_flip() +
  labs(title = 'Figure 6. Variable Importance: Boosted Model', x = 'Variable') +
  theme_classic()
# cubist tree model
features <- subset(simulated, select = -c(y))
target <- subset(simulated, select = y)[, 1]

cub1 <- cubist(x = features, y = target)
vip.cub <- caret::varImp(cub1) %>%          # data frame with rownames and an 'Overall' column
  rownames_to_column("Variable") %>%
  rename(VIScore = Overall) %>%
  arrange(desc(VIScore))
p7 <- ggplot(vip.cub, aes(x = reorder(Variable, VIScore), y = VIScore)) +
  geom_col(fill = 'midnightblue') +
  coord_flip() +
  labs(title = 'Figure 7. Variable Importance: Cubist Model', x = 'Variable') +
  theme_classic()
p6 / p7
Use a simulation to show tree bias with different granularities.
Decision trees make almost no a priori assumptions about the target variable that might introduce bias into model predictions. However, they are sensitive to splitting decisions, such that minor changes to covariate values can result in altered tree structures. Additionally, continuous covariates may be preferentially selected for splits relative to categorical variables; this owes to the 'granularity' of continuous variables, which offers a much greater range of candidate split points. Finally, while shallow, less complex trees reduce the risk of over-fitting, they can introduce more bias into the model.
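The granularity point can be illustrated with a minimal sketch (separate from the covariate-perturbation simulation that follows, and using assumed toy variables x.coarse and x.fine): when the response is pure noise, a single-split tree still picks the high-granularity predictor far more often, simply because it offers more candidate split points.

library(rpart)

set.seed(300)
root.var <- replicate(200, {
  dat <- data.frame(
    x.coarse = factor(sample(letters[1:2], 100, replace = TRUE)),  # 2 levels -> 1 candidate split
    x.fine   = runif(100),                                         # ~100 distinct values -> many candidate splits
    y        = rnorm(100)                                          # response unrelated to either predictor
  )
  fit <- rpart(y ~ x.coarse + x.fine, data = dat,
               maxdepth = 1, cp = 0, minbucket = 5)                # force a single split
  as.character(fit$frame$var[1])                                   # variable chosen at the root
})
table(root.var)  # the high-granularity x.fine is chosen for most trees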
For simulation purposes, we will explore how changes in covariate values may introduce bias into a decision tree model, evaluating bias from the model RMSE scores on the full dataset. In this case, we would expect the error to increase following minor alterations to covariates, particularly for variables that rank high in predictive importance. The same approach can be applied to training data where there is a training/test split.
# randomly introduce changes to values in several covariates
#need to create test and train sets......
set.seed(211)
v.sim1 <- mlbench.friedman1(200, sd = 1) %>%
  as.data.frame() %>%
  rename(V1 = x.1, V4 = x.4) %>%
  select(c(V1, V4))

v.sim2 <- mlbench.friedman1(200, sd = 1) %>%
  as.data.frame() %>%
  rename(V1 = x.1, V4 = x.4, V2 = x.2) %>%
  select(c(V1, V4, V2))

tmp1 <- simulated %>%
  select(!c(V1, V4))

tmp2 <- simulated %>%
  select(!c(V1, V4, V2))

sim.bias1 <- cbind(v.sim1, tmp1)
sim.bias2 <- cbind(v.sim2, tmp2)

ques <- identical(sim.bias1, sim.bias2)
# model
rf.2var <- randomForest(y ~ ., data = sim.bias1,
                        importance = TRUE,
                        ntree = 1000,
                        mtry = 4)

rf.3var <- randomForest(y ~ ., data = sim.bias2,
                        importance = TRUE,
                        ntree = 1000,
                        mtry = 4)
# collect rmse for OOB MSE - adapted from: shorturl.at/gnBYZ
rf1.rmse <- sqrt(tail(rf1$mse, 1))  # the "Mean of squared residuals" is the out-of-bag MSE estimate; take the square root for RMSE
rf.2var.rmse <- sqrt(tail(rf.2var$mse, 1))
rf.3var.rmse <- sqrt(tail(rf.3var$mse, 1))
# Construct table of RMSE output for rf1, rf.2var, rf.3var
RMSE <- c(rf1.rmse, rf.2var.rmse, rf.3var.rmse) %>%  # collect RMSE scores
  as.data.frame()

Models <- c('Original', 'V1.V4 Replaced', 'V1.V4.V2 Replaced')

data.frame(Models, RMSE) %>%
  rename(RMSE = '.') %>%
  flextable() %>%
  set_caption('Table 1. Evaluating Bias Due to Changing Covariate Values in Three Random Forest Models')
| Models            | RMSE     |
|-------------------|----------|
| Original          | 2.455256 |
| V1.V4 Replaced    | 3.552074 |
| V1.V4.V2 Replaced | 4.075332 |
In this example, the random forest RMSE for the original simulated data (rf1) is compared with the RMSE of models (rf.2var, rf.3var) in which the values of V1 & V4, and of V1, V4, & V2, have been replaced (via additional simulation runs of the mlbench.friedman1 function). The RMSE scores shown in Table 1 increase as more of the original variable values are altered, indicating a corresponding increase in model bias.
It is interesting to note that moderate changes to the values of lower-ranked variables had relatively little effect on model RMSE (not shown here).
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance.
Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9.
The right-hand model (learning rate and bagging fraction both 0.9) concentrates importance on the first few predictors because its high learning rate lets the earliest trees, built on the strongest predictors, dominate the fit, so fewer explanatory variables are identified as important. This pattern also holds as the bagging fraction (the share of the data considered by each tree) increases.
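As a rough, hedged sketch of this effect (the solubility data are not loaded in this report, so a larger Friedman-1 sample, sim.big, stands in; all settings below are illustrative):

library(gbm)
library(mlbench)

set.seed(230)
sim.big <- as.data.frame(mlbench.friedman1(500, sd = 1))

# two extreme settings, as in Fig. 8.24: learning rate and bagging fraction both 0.1 vs both 0.9
gbm.low  <- gbm(y ~ ., data = sim.big, distribution = "gaussian", n.trees = 1000,
                shrinkage = 0.1, bag.fraction = 0.1)
gbm.high <- gbm(y ~ ., data = sim.big, distribution = "gaussian", n.trees = 1000,
                shrinkage = 0.9, bag.fraction = 0.9)

# relative influence: the 0.9/0.9 model should concentrate importance on fewer predictors
summary(gbm.low,  plotit = FALSE)
summary(gbm.high, plotit = FALSE)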
I would expect the slower learner that uses a smaller fraction of the data at each iteration (the left-hand model) to generalize better to out-of-bag and/or test samples.
It follows that as interaction depth increases, a greater number of variables will be identified as important. Spreading the importance over more predictors reduces the skew in the importance values, which should flatten (decrease) the slope of predictor importance for both models.
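A similar sketch (reusing the illustrative sim.big sample above) compares a shallow and a deep boosted model to see how interaction depth spreads relative influence across predictors:

set.seed(231)
gbm.shallow <- gbm(y ~ ., data = sim.big, distribution = "gaussian",
                   n.trees = 1000, interaction.depth = 1)
gbm.deep    <- gbm(y ~ ., data = sim.big, distribution = "gaussian",
                   n.trees = 1000, interaction.depth = 10)

ri.shallow <- summary(gbm.shallow, plotit = FALSE)
ri.deep    <- summary(gbm.deep, plotit = FALSE)

# deeper trees let more predictors enter splits, so more variables carry non-negligible influence
sum(ri.shallow$rel.inf > 1)
sum(ri.deep$rel.inf > 1)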
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process.
Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
We will build, tune, and fit three models for comparative purposes: Random Forest, XGBoost, and Cubist. Note that ensemble methods help mitigate the effects of collinearity on model predictions and variable importance.
First we start with a Random Forest model.
library(tidymodels)
library(AppliedPredictiveModeling) # dataset
#load data
data(ChemicalManufacturingProcess)
chem <- ChemicalManufacturingProcess  # dim = 176 x 58
# create test and train sets
set.seed(212)
# create random data split -- data will not include variable for computer programs
chem.split <- initial_split(chem)
# create train and test sets from split
chem.train <- training(chem.split)  # 132 x 58
# create resamples for model tuning
set.seed(213)
chem_folds <- vfold_cv(chem.train, v = 10)
# Build model recipe
chem.recipe <- recipe(Yield ~ ., data = chem.train) %>%
  step_impute_knn(all_predictors()) %>%        # impute missing values
  step_normalize(all_predictors()) %>%         # center and scale
  step_nzv(all_predictors()) %>%               # remove near-zero variance predictors
  step_corr(all_predictors(), threshold = 0.9) # remove highly correlated predictors
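As an optional sanity check (a sketch; not required for the workflows below), the recipe can be prepped and baked to confirm how many predictors survive the filtering steps:

# estimate the preprocessing on chem.train and inspect the processed data
chem.recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  dim()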
# specify random forest model
chem.rf <- rand_forest(
  mtry = tune(),
  trees = tune(),
  min_n = tune()) %>%
  set_mode("regression") %>%
  set_engine("ranger")
# build random forest workflow
chem.rf.wf <- workflow() %>%
  add_recipe(chem.recipe) %>%
  add_model(chem.rf)
# select metrics for model evaluation
chem.metrics <- metric_set(rmse, rsq, mae, mape)
#tune random forest model
set.seed(215)
rf.tune <- tune_grid(chem.rf.wf,
                     resamples = chem_folds,
                     metrics = chem.metrics,
                     grid = 11)
best.train.rf <- show_best(rf.tune, metric = "rmse") %>%
  arrange(desc(mean)) %>%
  filter(mean == min(mean)) %>%
  select(.metric, mean)
autoplot(rf.tune, metric='rmse')+labs(title='Figure 8. Tuning Results: Random Forest')
# build final fit
chem.rf.tuned <- chem.rf.wf %>%
  finalize_workflow(select_best(rf.tune))

chem.rf.final <- last_fit(chem.rf.tuned, chem.split)  # final fit on train and test data
# collect metrics for test data
random.metric <- collect_metrics(chem.rf.final)
# plot test predictions
collect_predictions(chem.rf.final) %>%
ggplot(aes(Yield, .pred)) +
geom_point(alpha = 0.5, color = "darkred") +
coord_fixed()+
labs(title='Figure 8. Predicted vs. Observed Yield for Chemical Processes', subtitle='Random Forest', x= 'Observed Yield', y='Predicted Yield')+
theme_light()
# Plot variable importance - For a ranger model, update the engine with importance = "permutation" in order to compute feature importance.
var_spec <- chem.rf %>%
  finalize_model(select_best(rf.tune)) %>%
  set_engine("ranger", importance = "permutation")
workflow() %>%
add_recipe(chem.recipe) %>%
add_model(var_spec) %>%
fit(chem.train) %>%
pull_workflow_fit() %>%
vip(aesthetics = list(alpha = 0.8, fill = "midnightblue"))+
labs(title='Figure 9. Variable Importance Chemical Manufacturing: Random Forest')+
theme_light()
Second, we will construct an XGBoost Model with Parameter Optimization.
# specify boosted tree
set.seed(216)
chem.xg <- boost_tree(
  mtry = tune(),
  trees = 1000,
  min_n = tune(),
  tree_depth = tune(),
  learn_rate = tune()) %>%
  set_mode('regression') %>%
  set_engine('xgboost')
# create workflow for boosted tree
chem.xg.wf <- workflow() %>%
  add_recipe(chem.recipe) %>%
  add_model(chem.xg)
# tune xgboost parameters
set.seed(218)
xg.tune <- tune_grid(chem.xg.wf,
                     resamples = chem_folds,
                     metrics = chem.metrics,
                     grid = 11)
best.train.xg <- show_best(xg.tune, metric = "rmse") %>%
  arrange(desc(mean)) %>%
  filter(mean == min(mean)) %>%
  select(.metric, mean)
autoplot(xg.tune, metric='rmse')+labs(title='Figure 10. Tuning Results: XGboost')
# update xg model with selected parameters
chem.xg.tuned <- chem.xg.wf %>%
  finalize_workflow(select_best(xg.tune))

# fit xg on train and evaluate on test
chem.xg.final <- last_fit(chem.xg.tuned, chem.split)  # final fit on train and test data
# collect metrics for xg test data
boost.metric <- collect_metrics(chem.xg.final)
# plot test predictions
collect_predictions(chem.xg.final) %>%
ggplot(aes(Yield, .pred)) +
geom_point(alpha = 0.5, color = "darkred") +
coord_fixed()+
labs(title='Figure 11. Predicted vs. Observed Yield for Chemical Processes', subtitle='XGboost', x= 'Observed Yield', y='Predicted Yield')+
theme_light()
# Plot variable importance - for an xgboost model, finalize the model with the best parameters prior to plotting.
xg.update <- chem.xg %>%
  finalize_model(select_best(xg.tune))
workflow() %>%
add_recipe(chem.recipe) %>%
add_model(xg.update) %>%
fit(chem.train) %>%
pull_workflow_fit() %>%
vip(aesthetics = list(alpha = 0.8, fill = "midnightblue"))+
labs(title='Figure 12. Variable Importance Chemical Manufacturing: XGBoost')+
theme_classic()
Finally, we will construct a Cubist Model with Parameter Optimization.
set.seed(220)
library(rules) # necessary to run grid for cubist
chem.cubist <- cubist_rules(
  committees = tune(),
  neighbors = tune(),
  max_rules = tune()) %>%
  set_mode('regression') %>%
  set_engine('Cubist')
# create workflow for cubist model
chem.cubist.wf <- workflow() %>%
  add_recipe(chem.recipe) %>%
  add_model(chem.cubist)
# tune cubist parameters
set.seed(221)
cubist.tune <- tune_grid(chem.cubist.wf,
                         resamples = chem_folds,
                         metrics = chem.metrics,
                         grid = 11)
best.train.cubist <- show_best(cubist.tune, metric = "rmse") %>%
  arrange(desc(mean)) %>%
  filter(mean == min(mean)) %>%
  select(.metric, mean)
autoplot(cubist.tune, metric = "rmse")+labs(title='Figure 13. Tuning Results: Cubist')
# update cubist model with selected parameters
chem.cubist.tuned <- chem.cubist.wf %>%
  finalize_workflow(select_best(cubist.tune))

# fit cubist on train and evaluate on test
chem.cubist.final <- last_fit(chem.cubist.tuned, chem.split)  # final fit on train and test data
# collect metrics for cubist test data
cubist.metric <- collect_metrics(chem.cubist.final)
# plot test predictions
collect_predictions(chem.cubist.final) %>%
ggplot(aes(Yield, .pred)) +
geom_point(alpha = 0.5, color = "darkred") +
coord_fixed()+
labs(title='Figure 14. Predicted vs. Observed Yield for Chemical Processes', subtitle='Cubist', x= 'Observed Yield', y='Predicted Yield')+
theme_light()
# Plot variable importance - for a cubist model, finalize the model with the best parameters prior to plotting.
cubist.update <- chem.cubist %>%
  finalize_model(select_best(cubist.tune))
workflow() %>%
add_recipe(chem.recipe) %>%
add_model(cubist.update) %>%
fit(chem.train) %>%
pull_workflow_fit() %>%
vip(aesthetics = list(alpha = 0.8, fill = "midnightblue"))+
labs(title='Figure 15. Variable Importance Chemical Manufacturing: Cubist')+
theme_classic()
Now we compare the fit for each model (RMSE) to the resampling and test data.
# create col values for a model column
nms <- c('Random Forest', 'XGBoost', 'Cubist')
# bind model metrics
train.compare <- rbind(best.train.rf, best.train.xg, best.train.cubist)

train.fit <- cbind(nms, train.compare)  # include the model column
train.fit %>%
  rename(c(Models = nms, Metric = .metric, 'Resampling Estimate' = mean)) %>%
  pivot_wider(names_from = Metric, values_from = 'Resampling Estimate') %>%
  flextable() %>%
  set_caption('Table 2: Comparison of Model Resampling Fit for Chemical Processes Data')
| Models        | rmse      |
|---------------|-----------|
| Random Forest | 1.1097928 |
| XGBoost       | 0.9581343 |
| Cubist        | 0.9061319 |
# create test metrics table for model comparison
fit.compare <- rbind(random.metric, boost.metric, cubist.metric)

compare.fit <- cbind(nms, fit.compare)  # include the model column
compare.fit %>%
  select(nms, .metric, .estimate) %>%
  rename(c(Models = nms, Metric = .metric, 'Test Estimate' = .estimate)) %>%
  pivot_wider(names_from = Metric, values_from = 'Test Estimate') %>%
  arrange(desc(rmse)) %>%
  flextable() %>%
  set_caption('Table 3: Comparison of Model Test Fit for Chemical Processes Data')
| Models        | rmse     | rsq       |
|---------------|----------|-----------|
| Cubist        | 1.279096 | 0.6003883 |
| Random Forest | 1.248055 | 0.5149594 |
| XGBoost       | 1.222412 | 0.5067007 |
The Cubist model had the best resampling performance of the three models (Table 2, RMSE = 0.9061319). The XGBoost model returned the best test set performance (Table 3, RMSE = 1.222412). The extent to which data preprocessing decisions (e.g., automated removal of collinear predictors, imputation method, treatment of near-zero variance predictors) influence these outcomes is not considered in this discussion.
ManufacturingProcess32 was the most important variable in the optimal model (i.e., Cubist).
Manufacturing processes dominate the list.
The top 10 important predictors for the XGBoost, optimal linear, and nonlinear models are as follows:
Tree Regression (XGBoost): M32, M13, M31, B03, M09, B12, M17, B09, B05, M39
Linear Model (Elastic Net): M32, M09, B06, M06, B03, M17, M36, M13
Nonlinear Model: M24, 328, M28, M39, M29, M36, M42, M30, M43, M22
It is important to note that the variable importance scores for the nonlinear model are flawed (due to issues in an earlier varImp computation) and do not lend themselves to comparison here.
While the top variable (M32) is the same in the tree-based and linear models, the remaining variables and their relative ranks differ.
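For reference, a top-10 table like the XGBoost list above can be pulled programmatically from the finalized fit; this is a sketch that assumes the installed tune and vip versions export extract_fit_parsnip() and vi():

# top 10 predictors by importance from the finalized XGBoost fit
chem.xg.final %>%
  extract_fit_parsnip() %>%
  vi() %>%
  slice_head(n = 10)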
Yes, the tree diagram (Figure 16) shows the splitting rule at each node (and the associated magnitude of yield) as well as the hierarchy of nodes/splits. For example, the topmost node, ManufacturingProcess32, includes 100% of the observations, with a predicted mean of 40. M32 also produces higher yields at higher values (>= 160). Given the latter, the next most important predictor is M06, whereas for values of M32 < 160 the next most important predictor is M31.
library(rpart)
library(rattle)
library(RColorBrewer)
# run model and plot results
rpart.tree <- rpart(Yield ~ ., data = chem.train, maxdepth = 3)
fancyRpartPlot(rpart.tree, caption='Figure 16. Tree diagram for optimal single tree')
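To complement the diagram, the same split rules can be printed as text; this sketch assumes the rpart.plot package (a dependency of rattle) is available and reuses the fitted rpart.tree object above.

# one row per terminal node: predicted Yield plus the rules that lead to it
library(rpart.plot)
rpart.rules(rpart.tree, roundint = FALSE)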