library(tidyverse)
library(magrittr)
library(vip)
library(patchwork)
library(flextable)

setwd('C:\\Users\\seanc\\OneDrive\\Desktop\\624\\HW9')

1 Background

This report focuses on the training and evaluation of tree-based regression models (e.g., rule-based, boosted, random forest) and responds to questions 8.1, 8.2, 8.3, and 8.7 from Kuhn and Johnson's Applied Predictive Modeling.

2 Exercise 8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
  1. Fit a random forest model to all of the predictors, then estimate the variable importance scores:

Note: A general rule for selecting mtry (# of random predictors) is p/3. Similarly, a minimum suggested ntrees (# bootstrap samples) is ~ 1000.
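A quick check of this rule of thumb for the simulated data (a sketch only; the models below use mtry = 4, close to this value):

p <- ncol(simulated) - 1   # number of predictors, excluding y
floor(p / 3)               # suggested mtry for regression: floor(10/3) = 3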

library(randomForest)


# fit random forest model to simulation data

rf1 <- randomForest(y ~ ., data = simulated,
                      importance = TRUE,
                      ntree = 1000, 
                      mtry = 4)


# plot variable importance

rf1%>%vip(aesthetics = list(alpha = 0.8, fill = "midnightblue"))+
  labs(title='Figure 1. Variable Importance for Simulation Data: Random Forest')+
  theme_light()

#rf1.vi <- varImp(rf1, scale = FALSE) --> if using caret

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No, V6-V10 show low overall importance (based on the permutation %IncMSE measure) in the random forest model.
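The numeric scores behind Figure 1 can also be inspected directly; a minimal check using randomForest::importance(), where the %IncMSE column holds the permutation-based scores:

importance(rf1)[, "%IncMSE"] %>%   # permutation importance for each predictor
  sort(decreasing = TRUE)          # V6-V10 should fall near the bottom of this ranking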

  2. Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
set.seed(201)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9358765

Fit another random forest model to these data. Did the importance score for V1 change?

Yes. The relative order of the variable importance scores changes, as do the actual importance values. In addition, V8 is no longer ranked among the top 10 variables.

Collinearity affects variable importance in two ways: (1) it changes the importance rankings, and (2) it dilutes the importance of the correlated predictors, since their shared information splits the score between them. The presence of collinear variables can, therefore, affect feature selection.

# fit random forest model to simulation data

rf2 <- randomForest(y ~ ., data = simulated,
                      importance = TRUE,
                      ntree = 1000, 
                      mtry = 4)

# plot variable importance

rf2%>%vip(aesthetics = list(alpha = 0.8, fill = "midnightblue"))+
  labs(title='Figure 2. Variable Importance: Random Forest-Simulated Collinearity')+
  theme_light()
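To quantify the change rather than read it from the plots, the permutation importance of V1 before and after adding duplicate1 can be pulled from the two fits (a small check using randomForest::importance()):

importance(rf1)["V1", "%IncMSE"]                   # V1 before the duplicate was added
importance(rf2)[c("V1", "duplicate1"), "%IncMSE"]  # V1's importance is now shared with duplicate1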

What happens when you add another predictor that is also highly correlated with V1?

The relative rankings of importance for V1-V10 remain the same as in the previous model, although V8 and V9 no longer appear in the top 10.

# simulate a second collinear variable

set.seed(203)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9454026
# fit random forest model to simulation data

rf3 <- randomForest(y ~ ., data = simulated,
                      importance = TRUE,
                      ntree = 1000, 
                      mtry = 4)

# plot variable importance

rf3%>%vip(aesthetics = list(alpha = 0.8, fill = "midnightblue"))+
  labs(title='Figure 3. Variable Importance: Random Forest - Simulated Multi-Collinearity')+
  theme_light()

  3. Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

No. The party::varimp() scores show a different ranking order than the traditional random forest model (top four = V4, V2, V1, !2), and the duplicates (simulated collinear variables) no longer stand out. It is worth noting that the conditional method reported by Strobl et al. adjusts for correlation between predictors, so a variable's score reflects its own contribution rather than the effects it shares with correlated covariates.

library(party)

rf4<-cforest(y ~ ., data = simulated, controls =cforest_control(ntree = 1000, mtry = 4))

rf4.vi <- party::varimp(rf4) %>%
  tibble::enframe(name = "Variable", value = "VIScore") %>%  # keep the variable names attached to the scores
  arrange(desc(VIScore))

rf4.vicond <- party::varimp(rf4, conditional = TRUE) %>%
  tibble::enframe(name = "Variable", value = "VIScore") %>%
  arrange(desc(VIScore))

p4 <- ggplot(rf4.vi, aes(x = reorder(Variable, VIScore), y = VIScore)) +
  geom_col(fill = 'midnightblue') +
  coord_flip() +
  labs(title = 'Figure 4. Variable Importance: cforest Model', x = 'Variable') +
  theme_classic()

p5 <- ggplot(rf4.vicond, aes(x = reorder(Variable, VIScore), y = VIScore)) +  # conditional importance scores
  geom_col(fill = 'midnightblue') +
  coord_flip() +
  labs(title = 'Figure 5. Variable Importance: cforest Model - Conditional', x = 'Variable') +
  theme_classic()

p4/p5

  4. Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

The three highest-ranked variables (V4, V2, V1) retain their relative ranks across the three models. However, the relative order changes among the mid- and lower-ranked variables. Adding a collinear predictor had less effect when using Cubist.

library(gbm)
library(Cubist)



# boosted tree model

 gbm1 <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees=1000)
 
vip.gbm <- relative.influence(gbm1, n.trees = 1000, scale. = FALSE, sort. = FALSE) %>%
  tibble::enframe(name = "Variable", value = "VIScore") %>%  # keep variable names attached to the scores
  arrange(desc(VIScore))

p6 <- ggplot(vip.gbm, aes(x = reorder(Variable, VIScore), y = VIScore)) +
  geom_col(fill = 'midnightblue') +
  coord_flip() +
  labs(title = 'Figure 6. Variable Importance: Boosted Model', x = 'Variable') +
  theme_classic()
 
 
# cubist tree model
 
features = subset(simulated, select=-c(y))
target = subset(simulated, select=y)[,1]
 
cub1<-cubist(x = features, y = target)

vip.cub <- caret::varImp(cub1) %>%
  rownames_to_column("Variable") %>%   # varImp returns a data frame with variables as rownames
  rename(VIScore = Overall) %>%
  arrange(desc(VIScore))

p7 <- ggplot(vip.cub, aes(x = reorder(Variable, VIScore), y = VIScore)) +
  geom_col(fill = 'midnightblue') +
  coord_flip() +
  labs(title = 'Figure 7. Variable Importance: Cubist Model', x = 'Variable') +
  theme_classic()

p6/p7

3 Exercise 8.2

Use a simulation to show tree bias with different granularities.

Decision trees make few a priori assumptions about the target variable that might bias model predictions. However, they are sensitive to splitting decisions, such that minor changes to covariate values can alter the tree structure. Additionally, continuous covariates may be preferentially selected for splits relative to categorical variables; this owes to the 'granularity' of continuous variables, which offers a much greater range of candidate split points. Finally, while shallow, less complex trees reduce the risk of over-fitting, they can introduce more bias into the model.
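The granularity point can be demonstrated directly. In the sketch below (an aside, separate from the bias check that follows), a single-split regression tree repeatedly chooses between a fine-grained continuous predictor and a coarse binary predictor when neither is related to the response; the fine-grained predictor is expected to win most of the root splits simply because it offers more candidate split points.

library(rpart)

set.seed(300)

# count how often the root split uses the fine-grained predictor
# when neither predictor carries any signal
first_split <- replicate(200, {
  dat <- data.frame(
    x_fine   = rnorm(100),                        # many distinct values
    x_coarse = sample(0:1, 100, replace = TRUE),  # only two distinct values
    y        = rnorm(100)                         # pure-noise response
  )
  fit <- rpart(y ~ x_fine + x_coarse, data = dat,
               control = rpart.control(maxdepth = 1, cp = -1))
  as.character(fit$frame$var[1])                  # variable used at the root node
})

table(first_split)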

For simulation purposes, we explore how changes in covariate values may introduce bias into a decision tree model, evaluating bias through the model RMSE scores on the full dataset. In this case, we would expect the error to increase following minor alterations to covariates, particularly for variables that rank high in predictive importance. The same approach can be applied to training data where there is a training/test split.

# randomly introduce changes to values in several covariates

#need to create test and train sets......

set.seed(211)

v.sim1<-mlbench.friedman1(200, sd = 1)%>%
  as.data.frame()%>%
  rename(c(V1 = x.1, V4 = x.4))%>%
  select(c(V1, V4))

v.sim2<-mlbench.friedman1(200, sd = 1)%>%
  as.data.frame()%>%
  rename(c(V1 = x.1, V4 = x.4, V2 = x.2))%>%
  select(c(V1, V4, V2))

tmp1<-simulated%>%
  select(!c(V1, V4))

tmp2<-simulated%>%
  select(!c(V1, V4, V2))

sim.bias1 <- cbind(v.sim1, tmp1)
sim.bias2 <- cbind(v.sim2, tmp2)

ques<-identical(sim.bias1, sim.bias2)

# model

rf.2var <- randomForest(y ~ ., data = sim.bias1,
                      importance = TRUE,
                      ntree = 1000, 
                      mtry = 4)

rf.3var <- randomForest(y ~ ., data = sim.bias2,
                      importance = TRUE,
                      ntree = 1000, 
                      mtry = 4)

# collect rmse for OOB MSE - adapted from: shorturl.at/gnBYZ

rf1.rmse <- sqrt(tail(rf1$mse, 1)) # the reported 'Mean of squared residuals' is the out-of-bag MSE estimate; take the square root for RMSE
rf.2var.rmse <- sqrt(tail(rf.2var$mse, 1))
rf.3var.rmse <- sqrt(tail(rf.3var$mse, 1))

# Construct table of RMSE output for rf1, rf.2var, rf.3var


RMSE <- c(rf1.rmse, rf.2var.rmse, rf.3var.rmse) # collect rmse scores

Models <- c('Original', 'V1.V4 Replaced', 'V1.V4.V2 Replaced')

data.frame(Models, RMSE)%>%
  flextable()%>%
  set_caption('Table 1. Evaluating Bias Due to Changing Covariate Values in Three Random Forest Models')

In this example, the random forest RMSE score for the original simulated data (rf1) is compared with those of two models (rf.2var, rf.3var) in which the values for V1 & V4, and for V1, V4, & V2, have been replaced via additional simulation runs of the mlbench.friedman1 function. The RMSE scores shown in Table 1 increase as the original variable values are altered, indicating a related increase in model bias.

It’s interesting to note that moderate changes to the values of lower ranked variables had relatively little effect on model rmse scores (not shown here).
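A quick way to check that claim (a sketch only; as noted above, the results are not shown in this report) is to refresh a single low-importance predictor such as V8 and recompute the OOB RMSE:

set.seed(222)

# replace only V8 (a low-ranked predictor) with a fresh simulated draw
sim.bias3 <- simulated
sim.bias3$V8 <- mlbench.friedman1(200, sd = 1)$x[, 8]

rf.lowvar <- randomForest(y ~ ., data = sim.bias3,
                          importance = TRUE,
                          ntree = 1000,
                          mtry = 4)

sqrt(tail(rf.lowvar$mse, 1))  # OOB RMSE; compare against rf1.rmse in Table 1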

4 Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance.

Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9.

[Figure 8.24 from Kuhn and Johnson: variable importance for boosted trees with bagging fraction and learning rate both set to 0.1 (left) and 0.9 (right).]

  1. Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

This owes to the fact that the right-hand model uses a high learning rate: each tree's contribution is large, so the model commits quickly to a few strong predictors and spreads little importance elsewhere. The effect is reinforced by the high bagging fraction, since each tree sees nearly the same data and therefore tends to select the same few predictors.
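The pattern can be reproduced in miniature on the simulated data from Exercise 8.1 (a sketch only; the book's Figure 8.24 uses the solubility data). In gbm, shrinkage corresponds to the learning rate and bag.fraction to the bagging fraction; n.minobsinnode is lowered here because bag.fraction = 0.1 leaves only 20 rows per tree:

set.seed(230)
gbm.low  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1,
                n.minobsinnode = 5)

set.seed(230)
gbm.high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9,
                n.minobsinnode = 5)

# relative influence should be spread across more predictors for the 0.1/0.1 fit
# and concentrated on a few predictors for the 0.9/0.9 fit
summary(gbm.low,  plotit = FALSE)
summary(gbm.high, plotit = FALSE)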

  2. Which model do you think would be more predictive of other samples?

I would expect that a slower learner using a smaller fraction of the data at each iteration (the left-hand model) would be more predictive of out-of-bag and/or test samples.

  3. How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

It follows that as interaction depth increases, a greater number of variables will be identified as important. This might introduce greater skew in the importance values across the variable rankings, which should result in an increase in the slope of predictor importance.

5 Exercise 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process.

Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

We will build, tune, and fit three models for comparative purposes: Random Forest, XGBoost, and Cubist. Note that ensemble methods help mitigate the effects of collinearity on model predictions and variable importance.

First we start with a Random Forest model.

library(tidymodels)
library(AppliedPredictiveModeling) # dataset


#load data

data(ChemicalManufacturingProcess)

chem<- ChemicalManufacturingProcess # dim = 176, 58 

# create test and train sets

set.seed(212)

# create random data split -- data will not include variable for computer programs

chem.split <- initial_split(chem)

# create train and test sets from split

chem.train <- training(chem.split) # 132 x 58

# create resamples for model tuning

set.seed(213)

chem_folds<-
  vfold_cv(chem.train, v=10) 

# Build model recipe

chem.recipe <- 
  recipe(Yield ~., data=chem.train)%>% 
  step_impute_knn(all_predictors()) %>% # impute missing values
  step_normalize(all_predictors())%>% # center and scale 
  step_nzv(all_predictors())%>% # remove near zero variance predictors (default thresholds)
  step_corr(all_predictors(), threshold = 0.9) # drop predictors with pairwise correlation above 0.9
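Before fitting, it can be worth confirming what the filtering steps actually remove; a quick check (not required for the workflows below) preps the recipe on the training data and reports the dimensions of the processed data:

chem.recipe %>%
  prep() %>%                 # estimate the recipe steps on chem.train
  bake(new_data = NULL) %>%  # return the processed training set
  dim()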
  
  
# specify random forest model

chem.rf <-rand_forest(
    mtry = tune(),
    trees = tune(),
    min_n = tune()) %>%
    set_mode("regression")%>%
    set_engine("ranger")
    
# build random forest workflow

chem.rf.wf <- workflow() %>% 
  add_recipe(chem.recipe) %>% 
  add_model(chem.rf)

# select metrics for model evaluation

chem.metrics <- metric_set(rmse, rsq, mae, mape)
    
#tune random forest model

set.seed(215)

rf.tune <-
  tune_grid(chem.rf.wf,
    resamples = chem_folds,
    metrics = chem.metrics,
    grid = 11)
    
best.train.rf <- show_best(rf.tune, metric = "rmse", n = 1) %>%
  select(.metric, mean)

autoplot(rf.tune, metric='rmse')+labs(title='Figure 8. Tuning Results: Random Forest')

# build final fit

chem.rf.tuned <- chem.rf.wf  %>%
  finalize_workflow(select_best(rf.tune))

chem.rf.final <- last_fit(chem.rf.tuned,chem.split) # final fit on train and test data

# collect metrics for test data

random.metric<-collect_metrics(chem.rf.final, metric='rmse')


# plot test predictions

collect_predictions(chem.rf.final) %>%
  ggplot(aes(Yield, .pred)) +
  geom_point(alpha = 0.5, color = "darkred") +
  coord_fixed()+
  labs(title='Figure 9. Predicted vs. Observed Yield for Chemical Processes', subtitle='Random Forest', x= 'Observed Yield', y='Predicted Yield')+
  theme_light()

# Plot variable importance - For a ranger model,  update the engine with importance = "permutation" in order to compute feature importance. 

var_spec <- chem.rf %>%
  finalize_model(select_best(rf.tune)) %>%
  set_engine("ranger", importance = "permutation")

workflow() %>%
  add_recipe(chem.recipe) %>%
  add_model(var_spec) %>%
  fit(chem.train) %>%
  pull_workflow_fit() %>%
  vip(aesthetics = list(alpha = 0.8, fill = "midnightblue"))+
  labs(title='Figure 10. Variable Importance Chemical Manufacturing: Random Forest')+
  theme_light()

Second, we will construct an XGBoost Model with Parameter Optimization.

# specify boosted tree

set.seed(216)

chem.xg<-boost_tree(
    mtry = tune(),
    trees = 1000,
    min_n = tune(),
    tree_depth = tune(),
    learn_rate = tune())%>%
    set_mode('regression')%>%
    set_engine('xgboost')

# create workflow for boosted tree

chem.xg.wf <- workflow() %>% 
  add_recipe(chem.recipe) %>% 
  add_model(chem.xg)

# tune xgboost parameters

set.seed(218)

xg.tune <-
  tune_grid(chem.xg.wf,
    resamples = chem_folds,
    metrics = chem.metrics,
    grid = 11)
    
best.train.xg <- show_best(xg.tune, metric = "rmse", n = 1) %>%
  select(.metric, mean)

autoplot(xg.tune, metric='rmse')+labs(title='Figure 11. Tuning Results: XGBoost')

# update xg model with selected parameters

chem.xg.tuned <- chem.xg.wf%>%
  finalize_workflow(select_best(xg.tune))

# fit XGBoost on the training data and evaluate on the test set

chem.xg.final <- last_fit(chem.xg.tuned, chem.split) # final fit on train and test data

# collect metrics for xg test data

boost.metric<-collect_metrics(chem.xg.final, metric='rmse')


# plot test predictions

collect_predictions(chem.xg.final) %>%
  ggplot(aes(Yield, .pred)) +
  geom_point(alpha = 0.5, color = "darkred") +
  coord_fixed()+
  labs(title='Figure 12. Predicted vs. Observed Yield for Chemical Processes', subtitle='XGBoost', x= 'Observed Yield', y='Predicted Yield')+
  theme_light()

# Plot variable importance - for an XGBoost model, finalize the model specification with the best tuning parameters prior to refitting and plotting.

xg.update <- chem.xg %>%
  finalize_model(select_best(xg.tune))

workflow() %>%
  add_recipe(chem.recipe) %>%
  add_model(xg.update) %>%
  fit(chem.train) %>%
  pull_workflow_fit() %>%
  vip(aesthetics = list(alpha = 0.8, fill = "midnightblue"))+
  labs(title='Figure 13. Variable Importance Chemical Manufacturing: XGBoost')+
  theme_classic()

Finally, we will construct a Cubist Model with Parameter Optimization.

set.seed(220)

library(rules) # necessary to run grid for cubist

chem.cubist<-cubist_rules(
  committees = tune(),
  neighbors = tune(),
  max_rules = tune())%>%
    set_mode('regression')%>%
    set_engine('Cubist')


# create workflow for Cubist

chem.cubist.wf <- workflow() %>% 
  add_recipe(chem.recipe) %>% 
  add_model(chem.cubist)

# tune Cubist parameters

set.seed(221)

cubist.tune <-
  tune_grid(chem.cubist.wf,
    resamples = chem_folds,
    metrics = chem.metrics,
    grid = 11)
    
best.train.cubist <- show_best(cubist.tune, metric = "rmse", n = 1) %>%
  select(.metric, mean)

autoplot(cubist.tune, metric = "rmse")+labs(title='Figure 14. Tuning Results: Cubist')

# update Cubist model with selected parameters

chem.cubist.tuned <- chem.cubist.wf%>%
  finalize_workflow(select_best(cubist.tune))

# fit Cubist on the training data and evaluate on the test set

chem.cubist.final <- last_fit(chem.cubist.tuned, chem.split) # final fit on train and test data

# collect metrics for Cubist test data

cubist.metric<-collect_metrics(chem.cubist.final, metric='rmse')


# plot test predictions

collect_predictions(chem.cubist.final) %>%
  ggplot(aes(Yield, .pred)) +
  geom_point(alpha = 0.5, color = "darkred") +
  coord_fixed()+
  labs(title='Figure 15. Predicted vs. Observed Yield for Chemical Processes', subtitle='Cubist', x= 'Observed Yield', y='Predicted Yield')+
  theme_light()

# Plot variable importance - for a Cubist model, finalize the model specification with the best tuning parameters prior to refitting and plotting.


cubist.update <- chem.cubist %>%
  finalize_model(select_best(cubist.tune))

workflow() %>%
  add_recipe(chem.recipe) %>%
  add_model(cubist.update) %>%
  fit(chem.train) %>%
  pull_workflow_fit() %>%
  vip(aesthetics = list(alpha = 0.8, fill = "midnightblue"))+
  labs(title='Figure 16. Variable Importance Chemical Manufacturing: Cubist')+
  theme_classic()

Now we compare the fit for each model (RMSE) to the resampling and test data.

# create col values for a model column

nms<-c('Random Forest', 'XGBoost', 'Cubist')

# bind model metrics

train.compare<-rbind(best.train.rf, best.train.xg, best.train.cubist)

train.fit<-cbind(nms, train.compare) # include the model column

train.fit%>%
  rename(c(Models=nms, Metric = .metric, 'Resampling Estimate' = mean))%>%
  pivot_wider(names_from = Metric, values_from = 'Resampling Estimate')%>%
  flextable()%>%
  set_caption('Table 2: Comparison of Model Resampling Fit for Chemical Processes Data')
# create test metrics table for model comparison


fit.compare<-rbind(random.metric, boost.metric, cubist.metric)

compare.fit<-cbind(nms, fit.compare) # include the model column

compare.fit%>%
  select(nms, .metric, .estimate)%>%
  rename(c(Models=nms, Metric = .metric, 'Test Estimate' = .estimate))%>%
  pivot_wider(names_from = Metric, values_from = 'Test Estimate')%>%
  arrange(desc(rmse))%>%
  flextable()%>%
  set_caption('Table 3: Comparison of Model Test Fit for Chemical Processes Data')
  1. Which tree-based regression model gives the optimal resampling and test set performance?

The Cubist model had the optimal resampling performance of the three models (Table 2, RMSE = 0.9312177). The XGBoost model returned the optimal test set performance (Table 3, RMSE = 0.9742022). The extent to which data preprocessing decisions (e.g., automated removal of collinear predictors, imputation method, treatment of near-zero-variance predictors) influence these outcomes is not considered in this discussion.

  2. Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

ManufacturingProcess32 was the most important variable in the optimal model (i.e., Cubist).

Manufacturing processes dominate the list.

The top 10 important predictors for the XGBoost, optimal linear, and nonlinear models are as follows:

  • Tree Regression (XGBoost): M32, M13, M31, B03, M09, B12, M17, B09, B05, M39

  • Linear Model (Elastic Net): M32, M09, B06, M06, B03, M17, M36, M13

  • Non-Linear Model (Elastic Net): M24, 328, M28, M39, M29, M36, M42, M30, M43, M22

It’s important to note that the variable importance scores for the non-linear model are flawed (due to issues in previous varimp computation) and do not lend themselves to comparison here.

While the top important variable is the same in the tree and linear models (M32), the remaining variables and their relative ranks differ.
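For reference, a top-10 list like those above can be pulled straight from a fitted workflow; a sketch for the XGBoost fit, assuming vip::vi() accepts the parsnip fit just as the earlier vip() calls do:

chem.xg.final %>%
  extract_workflow() %>%       # final fitted workflow from last_fit()
  extract_fit_parsnip() %>%    # underlying parsnip/xgboost fit
  vip::vi() %>%                # variable importance table
  slice_head(n = 10)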

  3. Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Yes. The tree diagram (Figure 17) shows the splitting rule and predicted yield for each node, as well as the hierarchy of nodes/splits. For example, the topmost node splits on ManufacturingProcess32, includes 100% of the observations, and has a predicted mean yield of 40. Higher values of M32 (>= 160) produce higher yields; for that branch, the next most important predictor is M06, whereas for values of M32 < 160, the next split is on M31.

library(rpart)
library(rattle)
library(RColorBrewer)

# run model and plot results

rpart.tree<- rpart(Yield ~., chem.train, maxdepth=3)

fancyRpartPlot(rpart.tree, caption='Figure 17. Tree diagram for optimal single tree')