8.1 - Recreate the simulated data from Exercise 7.2:

set.seed(200)
simulated <- mlbench.friedman1(200,sd=1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model1 <- randomForest(y ~ ., data = simulated,
                       importance = T, ntree = 100)

rfImp <- varImp(model1, scale = F)
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
lv <- rownames(rfImp)

rfImp %>% rownames_to_column(var="Variable") %>%
  mutate(Variable = factor(Variable, levels=lv, ordered=T)) %>%
  ggplot(aes(x=Overall, y=factor(Variable))) + geom_col() +
  labs(title="Variable Importance", subtitle="Random Forest Model",
       y="Variable", x="Overall Importance")

The random forest model did not really use the uninformative predictors. However, it judged the informative predictor V3 to be less important than I expected.



(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9483112
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
# Fit another random forest model
model2 <- randomForest(y ~ ., data = simulated,
                       importance = T, ntree = 100)

rfImp <- varImp(model2, scale = F)

lv <- rownames(rfImp)

rfImp %>% rownames_to_column(var="Variable") %>%
  mutate(Variable = factor(Variable, levels=lv, ordered=T)) %>%
  ggplot(aes(x=Overall, y=factor(Variable))) + geom_col() +
  labs(title="Variable Importance", subtitle="Random Forest Model w/ 1 duplicate",
       y="Variable", x="Overall Importance")

The importance of V1 decreased with the addition of the highly correlated predictor. This is probably because some of the trees in the random forest chose the duplicate predictor over V1, which is not unusual given its high degree of correlation.

Now we add another duplicate and refit:

# Add another duplicate variable
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1

# Fit another random forest model
model3 <- randomForest(y ~ ., data = simulated,
                       importance = T, ntree = 100)

rfImp <- varImp(model3, scale = F)

lv <- rownames(rfImp)

rfImp %>% rownames_to_column(var="Variable") %>%
  mutate(Variable = factor(Variable, levels=lv, ordered=T)) %>%
  ggplot(aes(x=Overall, y=factor(Variable))) + geom_col() +
  labs(title="Variable Importance", subtitle="Random Forest Model w/ 2 duplicates",
       y="Variable", x="Overall Importance")

Adding another duplicate predictor that is well correlated with V1 reduces the importance of V1 even further. It also reduces the importance of the first duplicate variable. This is likely because of the similarity of the three variables: some trees choose V1, while others choose one of the duplicates. This sort of “waters down” the influence of each predictor.
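One way to see this sharing directly is randomForest's varUsed() helper, which counts how often each predictor is chosen for a split across the forest. A minimal sketch, assuming model3 and rfImp from above are still in memory:

# Split counts per predictor; varUsed() returns them in the same predictor
# order as the importance matrix, so we can pair them with rownames(rfImp)
splitCounts <- randomForest::varUsed(model3, count = TRUE)

tibble(Variable = rownames(rfImp), TimesUsed = splitCounts) %>%
  arrange(desc(TimesUsed))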



(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
# Model with conditional inference trees
model.cond <- cforest(y ~ ., data=simulated)

# Get variable importance
condImp <- varimp(model.cond, conditional=T)
condImp <- as.data.frame(condImp)
names(condImp) <- "Overall"
condImp %>% rownames_to_column(var="Variable") %>%
  mutate(Variable = factor(Variable, levels=lv, ordered=T)) %>%
  ggplot(aes(x=Overall, y=factor(Variable))) + geom_col() +
  labs(title="Conditional Variable Importance",
       subtitle="Random Forest Model (conditional) w/ 2 duplicates",
       y="Variable", x="Conditional Importance")

Yes, the conditional inference trees show a similar pattern in variable importance.
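To see the effect of the conditional argument directly, we can compute both measures from the same cforest fit and compare them side by side. A minimal sketch (conditional = TRUE is noticeably slower and repeats some of the work above):

# Traditional (unconditional) vs. conditional importance from the same fit
trad.imp <- varimp(model.cond, conditional = FALSE)
cond.imp <- varimp(model.cond, conditional = TRUE)

data.frame(Traditional = round(trad.imp, 3),
           Conditional = round(cond.imp[names(trad.imp)], 3))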



(d) Repeat the process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?


GBM (Gradient Boosting Machine)
set.seed(56431)

# Set our range of tuning parameters
param.gbm <- expand.grid(interaction.depth = seq(1,7,by=2),
                         n.trees = seq(100,1000,by=50),
                         shrinkage = seq(0.01,0.1,by=0.01),
                         n.minobsinnode = 10)

# Training control
ctrl.gbm <- trainControl(method="cv", number=10)

# Split out dependent and independent variables
dv <- simulated %>% select(y)
iv <- simulated %>% select(-y)


# Tune the model
model.gbm <- train(x=iv, y=dv$y, method="gbm",
                   trControl = ctrl.gbm,
                   tuneGrid = param.gbm,
                   verbose=F)

# Get the best model based on tuning
model.gbm$bestTune
##     n.trees interaction.depth shrinkage n.minobsinnode
## 497     200                 5      0.07             10
rfImp <- varImp(model.gbm$finalModel, scale = F)

lv <- rownames(rfImp)

rfImp %>% rownames_to_column(var="Variable") %>%
  mutate(Variable = factor(Variable, levels=lv, ordered=T)) %>%
  ggplot(aes(x=Overall, y=factor(Variable))) + geom_col() +
  labs(title="Variable Importance", subtitle="GBM Model w/ 2 duplicates",
       y="Variable", x="Overall Importance")

With the GBM model, we see a similar pattern as before with the duplicate variables: they have non-zero importance in the final model. The importance of V1 also appears to be reduced as a result.

Interestingly, the V3 predictor is more important under this model than it was in the previous models.


Cubist
set.seed(56431)

# Set our range of tuning parameters
param.cubist <- expand.grid(committees = seq(1,10,by=1),
                            neighbors = seq(1,9,by=2))

# Training control
ctrl.cubist <- trainControl(method="cv", number=10)

# Tune the Cubist model
model.cubist <- train(x=iv, y=dv$y, method="cubist",
                   trControl = ctrl.cubist,
                   tuneGrid = param.cubist,
                   verbose=F)

model.cubist$bestTune
##    committees neighbors
## 44          9         7
rfImp <- varImp(model.cubist$finalModel, scale = F)

lv <- rownames(rfImp)

rfImp %>% rownames_to_column(var="Variable") %>%
  mutate(Variable = factor(Variable, levels=lv, ordered=T)) %>%
  ggplot(aes(x=Overall, y=factor(Variable))) + geom_col() +
  labs(title="Variable Importance", subtitle="Cubist Model w/ 2 duplicates",
       y="Variable", x="Overall Importance")

The Cubist model, impressively, ignored the duplicate predictors and most of the uninformative predictors!
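One quick check, using the Cubist fit from above, is the model's own summary output, which reports attribute usage for each predictor appearing in the rules:

# The "Attribute usage" section should show duplicate1/duplicate2 and most of
# V6-V10 appearing rarely or not at all
summary(model.cubist$finalModel)

# Equivalently, list any predictors that received zero importance from caret
rfImp %>% rownames_to_column(var = "Variable") %>% filter(Overall == 0)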



8.2 - Use a simulation to show tree bias with different granularities.

According to the text\(^1\) (page 182), “…trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors”.

We’ll test this by training a large number of trees and seeing which variable is chosen from simulated data.

set.seed(05251)

varsUsed <- tibble()

for(i in 1:50){
  # Generate data
  # One variable, X1, is associated with the outcome Y, while X2 is not
  # associated at all. X1 has only two distinct values (low granularity),
  # while X2 is continuous with many distinct values.

  sim <- tibble(X1 = rep(c(1,2), times=50))
  sim$X2 <- rnorm(100, mean = 20, sd = 5)
  sim$Y <- sim$X1 + rnorm(100,mean = 0, sd = 1)
  
  # Fit a regression tree with the rpart defaults
  mod <- rpart(Y ~ ., data = sim)
  
  # Most important variable
  v <- varImp(mod) %>% slice_max(order_by = Overall) %>% row.names()

  varsUsed <- rbind(varsUsed, v)
}

table(varsUsed)
## varsUsed
## X1 X2 
##  4 46

Even though X2 had no direct impact on the outcome Y, the trees selected it as the most important variable over 90% of the time (46 of 50 runs), illustrating the selection bias toward the higher-granularity predictor.



8.3 - In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:


(a) Why does the model on the right focus its importance on just the few predictors, whereas the model on the left spreads importance across more predictors?

The learning rate acts as a shrinkage (regularization) parameter that counters the boosting algorithm’s greedy tendency to pick the optimal learner at each stage. In the right-hand plot the learning rate is high (0.9), so each stage’s contribution is barely shrunk, and the same dominant predictors compound as the stages build on one another.

Similarly, the larger bagging fraction uses more of the data at each stage, so the stages see very similar samples, making a subsequent stage more likely to choose the same variables a prior stage did.
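As an illustration, here is a minimal sketch that refits the two extremes on the solubility data, assuming the solubility objects from the AppliedPredictiveModeling package and the gbm package are available (the settings are chosen for speed and are not the exact ones behind Fig. 8.24):

library(AppliedPredictiveModeling)
library(gbm)

data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

# Helper to fit a boosted model with a given bagging fraction and learning rate
fit.gbm <- function(bag.frac, learn.rate) {
  gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
      n.trees = 100, interaction.depth = 1,
      shrinkage = learn.rate, bag.fraction = bag.frac)
}

gbm.low  <- fit.gbm(0.1, 0.1)  # left-hand plot: importance spread across predictors
gbm.high <- fit.gbm(0.9, 0.9)  # right-hand plot: importance concentrated in a few

# Relative influence (top 10) for each extreme
head(summary(gbm.low, plotit = FALSE), 10)
head(summary(gbm.high, plotit = FALSE), 10)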



(b) Which model do you think would be more predictive of other samples?

The left-hand model is likely to do better on data it has not seen. I would suspect that having many dissimilar weak learners contribute, rather than fewer similar trees (in terms of variable selection), gives the final model more flexibility with out-of-sample observations.



(c) How would increasing interaction depth affect the slope of the predictor importance for either model in Fig. 8.24?

I would expect the importance slope to flatten (i.e., fewer variables would dominate) as the interaction depth is increased, because deeper trees allow more variables to be used at each boosting stage, spreading the importance across a larger set of predictors.
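Continuing the sketch from part (a), and assuming solTrain is still in memory, we could refit the right-hand extreme with a much deeper interaction depth and check whether the relative-influence profile flattens:

# Same settings as gbm.high above, but with much deeper trees
gbm.deep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, interaction.depth = 10,
                shrinkage = 0.9, bag.fraction = 0.9)

# Deeper trees give more predictors non-zero influence, so the drop-off
# across the ranked predictors should be less steep than for gbm.high
head(summary(gbm.deep, plotit = FALSE), 20)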



8.7 - Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:


# Load the chemical manufacturing data
library(AppliedPredictiveModeling)

data("ChemicalManufacturingProcess")
# Load the tidymodels package
library(tidymodels)
# Create a train/test split
set.seed(1108)

chem_split <- initial_split(ChemicalManufacturingProcess, prop=0.75)
chem_train <- training(chem_split)
chem_test <- testing(chem_split)

# Create our generic recipe with KNN imputation
chem_recipe <- chem_train %>% recipe(Yield ~ .) %>%
  step_knnimpute(all_predictors()) %>% prep()

# Get 10 folds for CV model tuning
folds <- vfold_cv(chem_train, v=10)
(a) Which tree-based regression model gives the optimal resampling and test set performance?
Random Forest

We’ll start with a basic Random Forest model.

# Our model specification. We'll tune every param.
rf <- rand_forest(mtry = tune(),
                  trees = tune(),
                  min_n = tune()) %>%
  set_engine("randomForest") %>%
  set_mode("regression")

# A grid of tuning parameters (mtry is fixed at 57, the full predictor set, so this is effectively bagging)
rf_tune <- grid_regular(mtry(c(57,57)),
                        trees(),
                        min_n(),
                        levels = 10)

# Create the workflow
rf_wf <- workflow() %>% add_model(rf) %>% add_recipe(chem_recipe)

# Tune the model (takes a while!)
rf.model <- rf_wf %>% tune_grid(resamples = folds, grid = rf_tune)
# Get best model from the tuning results
rf.model %>% show_best("rmse")
## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config 
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>   
## 1    57   889     2 rmse    standard   0.971    10  0.0923 Model005
## 2    57   223     2 rmse    standard   0.977    10  0.0931 Model002
## 3    57   667     2 rmse    standard   0.977    10  0.0915 Model004
## 4    57  1333     2 rmse    standard   0.980    10  0.0919 Model007
## 5    57  1555     2 rmse    standard   0.980    10  0.0917 Model008
best_rf <- rf.model %>% select_best(metric = "rmse")

The best Random Forest model from the tuning has 889 trees and a min_n of 2. The estimated cross-validated RMSE for that model is 0.971.

# Take the best model and get test metrics
rf_wf <- rf_wf %>% finalize_workflow(best_rf)

rf_final <- rf_wf %>% last_fit(chem_split)

modelEval <- tibble(model = "Random Forest") %>%
  bind_cols(rf_final %>% collect_metrics)

rf_final %>% collect_metrics %>% kbl() %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
.metric   .estimator   .estimate
rmse      standard     1.5562577
rsq       standard     0.4812533

Our test results are rather poor, as shown in the above table. We’ll try a different model and see how it performs.

Boosted Tree
# Boosted tree model
boost <- boost_tree(mode = "regression",
                    mtry = tune(), trees = tune(),
                    min_n = 10, tree_depth = 8,
                    learn_rate = tune(),
                    loss_reduction = tune(),
                    sample_size = .60,
                    stop_iter = 3) %>% set_engine("xgboost")

# A grid of tuning parameters (had to cut this down for the sake of my CPU)
boost_tune <- grid_regular(finalize(mtry(),select(chem_train,-Yield)),
                           trees(),
                           learn_rate(), loss_reduction(),
                           levels = 5)

# Create the workflow
boost_wf <- workflow() %>% add_model(boost) %>% add_recipe(chem_recipe)

# Tune the model
boost.model <- boost_wf %>% tune_grid(resamples = folds, grid = boost_tune)
# Get best model from the tuning results
boost.model %>% show_best("rmse")
## # A tibble: 5 x 10
##    mtry trees learn_rate loss_reduction .metric .estimator  mean     n std_err
##   <int> <int>      <dbl>          <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    43   500        0.1      0.0422    rmse    standard   0.971    10  0.0920
## 2    43  1000        0.1      0.0422    rmse    standard   0.971    10  0.0920
## 3    43  1500        0.1      0.0422    rmse    standard   0.971    10  0.0920
## 4    43  2000        0.1      0.0422    rmse    standard   0.971    10  0.0918
## 5    29   500        0.1      0.0000562 rmse    standard   0.980    10  0.0721
## # … with 1 more variable: .config <chr>
best_boost <- boost.model %>% select_best(metric = "rmse")

The tuning returned a model with 500 trees, a learning rate of 0.1, and a loss-reduction of \(4.2 \times 10^{-2}\).

# Take the best model and get test metrics
boost_wf <- boost_wf %>% finalize_workflow(best_boost)

boost_final <- boost_wf %>% last_fit(chem_split)

temp <- tibble(model = "XGBoost") %>%
  bind_cols(boost_final %>% collect_metrics)

modelEval <- rbind(modelEval, temp)

boost_final %>% collect_metrics() %>% kbl() %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
.metric   .estimator   .estimate
rmse      standard     1.2495115
rsq       standard     0.6563924

The results of the boosting model were better, but not great.

Cubist

Finally, we will try the Cubist model.

# Set our range of tuning parameters
param.cubist <- expand.grid(committees = seq(1,10,by=1),
                            neighbors = seq(1,9,by=2))

# Training control
ctrl.cubist <- trainControl(method="cv", number=10)

# Split our data
dv <- chem_train %>% select(Yield)

iv <- chem_train %>% select(-Yield)

# Cubist model
model.cubist <- train(x=iv, y=dv$Yield, method="cubist",
                   trControl = ctrl.cubist,
                   tuneGrid = param.cubist,
                   verbose=F)

# Find the CV tuned model
model.cubist$bestTune
##    committees neighbors
## 22          5         3

Cross-validation selected the best model as one with 5 committees and 3 neighbors. We’ll try that on the test data set and see how it compares to the previous two models.

# Make predictions with the final model
pred.cubist <- predict(model.cubist$finalModel, newdata = chem_test)

cubist.metrics <- tibble(predicted = pred.cubist) %>%
  bind_cols(actual = chem_test$Yield) %>%
  rsq(truth=actual, estimate=predicted)

cubist.metrics <- tibble(predicted = pred.cubist) %>%
  bind_cols(actual = chem_test$Yield) %>%
  rmse(truth=actual, estimate=predicted) %>%
  rbind(cubist.metrics)

modelEval <- tibble(model = "Cubist") %>% bind_cols(cubist.metrics) %>%
  rbind(modelEval)

cubist.metrics %>% kbl() %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
.metric   .estimator   .estimate
rmse      standard     1.3567007
rsq       standard     0.5936248

The Cubist algorithm performed better than Random Forest, but slightly worse than XGBoost. Let’s compare them together:

modelEval
## # A tibble: 6 x 4
##   model         .metric .estimator .estimate
##   <chr>         <chr>   <chr>          <dbl>
## 1 Cubist        rmse    standard       1.36 
## 2 Cubist        rsq     standard       0.594
## 3 Random Forest rmse    standard       1.56 
## 4 Random Forest rsq     standard       0.481
## 5 XGBoost       rmse    standard       1.25 
## 6 XGBoost       rsq     standard       0.656

The best model of the three is the XGBoost model.



(b) Which predictors are the most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
library(vip)
boost_final %>% 
  pluck(".workflow", 1) %>%   
  pull_workflow_fit() %>% 
  vip(num_features = 10)

With the XGBoost model, 6 of the top 10 most important variables were process variables. Biological variables, however, were 2 of the top 3, so they were well represented.

As with the linear and nonlinear models, the most important variable is ManufacturingProcess32. ManufacturingProcess09 and ManufacturingProcess13 also appear in the top 10 here, as they did in the linear and nonlinear models.
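To line these rankings up against the earlier models numerically, we can pull the importance table itself (rather than the plot) from the fitted workflow. A minimal sketch using vip::vi():

# Top 10 predictors by importance from the final XGBoost fit
boost_final %>%
  pluck(".workflow", 1) %>%
  pull_workflow_fit() %>%
  vi() %>%
  slice_max(Importance, n = 10)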



(c) Plot the optimal single tree with the distribution of yield in the terminals. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
# Single tree model
singletree <- decision_tree(mode = "regression",
                    cost_complexity = tune(),
                    tree_depth = tune(),
                    min_n = tune()) %>%
  set_engine("rpart")

# A grid of tuning parameters 
single_tune <- grid_regular(cost_complexity(),
                           tree_depth(),
                           min_n(),
                           levels = 5)

# Create the workflow
single_wf <- workflow() %>% add_model(singletree) %>% add_recipe(chem_recipe)

# Tune the model
single.model <- single_wf %>% tune_grid(resamples = folds, grid = single_tune)

# Take the best model
best_single <- single.model %>% select_best(metric = "rmse")

single_wf <- single_wf %>% finalize_workflow(best_single) %>%
  fit(chem_train)
single_tree <- single_wf %>% pull_workflow_fit()

rpart.plot::rpart.plot(single_tree$fit, roundint = F)

The very basic tree model used the same ManufacturingProcess32 variable for the first split, indicating that it too found that variable to be important. For the second split it used BiologicalMaterial11, which was in the top 10 of the XGBoost model above, though lower on the list.
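The question also asks about the distribution of yield in the terminal nodes. A minimal sketch of one way to see that, assuming the rpart fit's where slot lines up with the chem_train rows (it should here, since the recipe imputes all predictors and Yield has no missing values):

# Yield distribution within each terminal node of the tuned single tree
tibble(Yield = chem_train$Yield,
       node = factor(single_tree$fit$where)) %>%
  ggplot(aes(x = node, y = Yield)) +
  geom_boxplot() +
  labs(title = "Yield by Terminal Node", x = "Terminal node", y = "Yield")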


Sources

\(^1\) - Kuhn, Max, and Kjell Johnson. Applied Predictive Modeling. Springer, 2016.