8.1

Recreate the simulated data from Exercise 7.2

# Packages used throughout: mlbench (simulated data), randomForest, caret,
# party, rpart, tidyverse, knitr + kableExtra (tables)
library(mlbench); library(randomForest); library(caret); library(party)
library(rpart); library(tidyverse); library(knitr); library(kableExtra)

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

A

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

rf_model_1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rf_imp_1 <- varImp(rf_model_1, scale = FALSE)
rf_imp_1  %>% 
  as.data.frame() %>% 
  rownames_to_column() %>% 
  rename(Variable = rowname) %>%
  arrange(desc(Overall)) %>%
  kable() %>%
  kable_styling()
Variable Overall
V1 8.7322354
V4 7.6151188
V2 6.4153694
V5 2.0235246
V3 0.7635918
V6 0.1651112
V7 -0.0059617
V10 -0.0749448
V9 -0.0952927
V8 -0.1663626

Did the random forest model significantly use the uninformative predictors (V6-V10)?

No. The importance scores for V6-V10 are all very close to zero, indicating that the random forest made essentially no use of the uninformative predictors.
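
As a rough check, each permutation importance can be compared with the standard error that randomForest stores alongside it; predictors whose scores sit within a couple of standard errors of zero are effectively noise. A quick sketch using the fit above:

# Ratio of permutation importance to its permutation SD;
# V6-V10 should land near zero while V1-V5 stand well clear of it.
rf_model_1$importance[, "%IncMSE"] / rf_model_1$importanceSD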

B

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

rf_model_2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rf_imp_2 <- varImp(rf_model_2, scale = FALSE)
rf_imp_2 %>% 
  as.data.frame() %>% 
  rownames_to_column() %>% 
  rename(Variable = rowname) %>%
  arrange(desc(Overall)) %>%
  kable() %>%
  kable_styling()
Variable Overall
V4 7.0475224
V2 6.0689606
V1 5.6911997
duplicate1 4.2833158
V5 1.8723844
V3 0.6297022
V6 0.1356906
V10 0.0289481
V9 0.0084044
V7 -0.0134564
V8 -0.0437056

The importance of V1 decreased after the highly correlated predictor was added. Because the forest can split on either V1 or duplicate1 almost interchangeably, the importance formerly credited to V1 alone is now shared between the two, lowering both scores; adding further predictors correlated with V1 dilutes its score even more.
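
The second half of the question can be checked with the same recipe; a minimal sketch (duplicate2 is a hypothetical name, removed again afterwards so the later fits match the output shown):

# Add a second predictor highly correlated with V1 and refit;
# V1's importance should be diluted further across the three copies.
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
rf_model_2b <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
varImp(rf_model_2b, scale = FALSE)
simulated$duplicate2 <- NULL  # drop it so subsequent models match the text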

C

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

rf_model_3 <- cforest(y ~ ., data = simulated)
rf_imp_3 <- varImp(rf_model_3, conditional = TRUE)
rf_imp_3 %>% 
  as.data.frame() %>% 
  rownames_to_column() %>% 
  rename(Variable = rowname) %>%
  arrange(desc(Overall)) %>%
  kable() %>%
  kable_styling()
Variable Overall
V4 6.0471707
V2 4.8021627
duplicate1 1.9703660
V1 1.8986240
V5 0.9850544
V3 0.0229993
V9 0.0004516
V10 -0.0074653
V7 -0.0104328
V8 -0.0104863
V6 -0.0119652
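
Broadly, yes: V4 and V2 remain at the top, the uninformative predictors (V6-V10) stay near zero, and V1's importance is again shared with duplicate1. The conditional measure does, however, penalize the correlated pair more heavily, pulling V1 and duplicate1 further down relative to V4 and V2 than in the traditional random forest.

For comparison, the traditional (unconditional) measure for the same forest can be obtained by toggling the conditional argument; a minimal sketch (output omitted):

# Unconditional permutation importance via party's own varimp,
# the function the exercise text refers to.
party::varimp(rf_model_3, conditional = FALSE)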

D

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted Tree

# Training it without the duplicate predictor
gbm_grid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
                        .n.trees = seq(100, 1000, by = 100),
                        .shrinkage = c(0.01, 0.1),
                        .n.minobsinnode = 10)
gbm_model_1 <- train(y ~ ., data = select(simulated, -duplicate1), method="gbm", tuneGrid = gbm_grid, verbose = FALSE)

gbm_imp_1 <- varImp(gbm_model_1) 

gbm_imp_1$importance %>% 
  as.data.frame() %>% 
  rownames_to_column() %>% 
  rename(Variable = rowname) %>%
  arrange(desc(Overall)) %>%
  kable() %>%
  kable_styling()
Variable Overall
V4 100.0000000
V1 94.3492687
V2 85.3573190
V5 37.3645989
V3 31.2757450
V6 3.2412743
V7 0.9391656
V9 0.5040653
V10 0.1698201
V8 0.0000000

Without the duplicate predictor, the boosted tree behaves much like the random forest: the informative predictors V1-V5 dominate, while the uninformative predictors V6-V10 receive importance scores at or near zero.

# With duplicate predictor
gbm_model_2 <- train(y ~ ., data = simulated, method="gbm", tuneGrid = gbm_grid, verbose = FALSE)

gbm_imp_2 <- varImp(gbm_model_2) 

gbm_imp_2$importance %>% 
  as.data.frame() %>% 
  rownames_to_column() %>% 
  rename(Variable = rowname) %>%
  arrange(desc(Overall)) %>%
  kable() %>%
  kable_styling()
Variable Overall
V4 100.000000
V2 81.847724
V1 54.988209
V5 43.118057
duplicate1 39.179131
V3 33.518028
V6 2.929334
V8 2.052144
V7 2.045898
V10 1.305226
V9 0.000000

Once the duplicate is included, duplicate1 becomes an important variable and the importance of V1 drops, the same dilution pattern observed with the random forest.
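
The same numbers can be read straight off the underlying gbm fit: summary() on the final model returns the relative influences that varImp() rescales to the 0-100 range. A sketch (output omitted):

# Relative influence from the gbm object itself (unscaled varImp)
summary(gbm_model_2$finalModel, plotit = FALSE)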

Cubist

# Without the duplicate predictor
cubist_model_1 <- train(y ~ ., data = select(simulated, -duplicate1), method="cubist")

cubist_imp_1 <- varImp(cubist_model_1) 

cubist_imp_1$importance %>% 
  as.data.frame() %>% 
  rownames_to_column() %>% 
  rename(Variable = rowname) %>%
  arrange(desc(Overall)) %>%
  kable() %>%
  kable_styling()
Variable Overall
V1 100.00000
V2 75.69444
V4 68.05556
V3 58.33333
V5 55.55556
V6 15.27778
V7 0.00000
V8 0.00000
V9 0.00000
V10 0.00000

Without the duplicate predictor, the Cubist model assigns zero importance to V7-V10 and only modest importance to V6, so the informative predictors dominate, led by V1.

# With the duplicate predictor
cubist_model_2 <- train(y ~ ., data = simulated, method="cubist")

cubist_imp_2 <- varImp(cubist_model_2) 

cubist_imp_2$importance %>% 
  as.data.frame() %>% 
  rownames_to_column() %>% 
  rename(Variable = rowname) %>%
  arrange(desc(Overall)) %>%
  kable() %>%
  kable_styling()
Variable Overall
V2 100.00000
V1 89.51613
V4 80.64516
V3 67.74194
duplicate1 59.67742
V5 50.00000
V6 25.00000
V7 0.00000
V8 0.00000
V9 0.00000
V10 0.00000

With the duplicate included, the Cubist model exhibits the same dilution behavior as the random forest (V1's importance falls as duplicate1 gains), though the effect is less pronounced.
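
Cubist's importance is built from two usage statistics, the percentage of rule conditions that involve a predictor and the percentage of linear models that include it, and varImp() averages the two. The raw table behind it can be inspected directly; a sketch, assuming the caret fit above:

# Raw Cubist usage: % of rule conditions and % of model terms per variable
cubist_model_2$finalModel$usage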

8.2

Use a simulation to show tree bias with different granularities.

The data are simulated from a non-linear function plus Gaussian noise; regression trees are then fit at increasing maximum depths (the granularity), and the MSE is examined on both the training and test sets.

set.seed(42)

n_sample <- 350

nonlinear_function <- function(x){
  sin(1.25 * x) + 2 * cos(.25*x)
}

x <- runif(n_sample, 1, 25)
f_of_x <- nonlinear_function(x)
noise <- rnorm(n_sample, 0, 2)
y <- f_of_x + noise


df <- data.frame(y=y, x=x)
in_train <- createDataPartition(df$y, p = .8, list = FALSE, times = 1)
train_df <- df[in_train,]
test_df <- df[-in_train,]


results <- data.frame(Granularity = c(NA), MSE = c(NA), data = c(NA)) %>% na.omit()

get_mse <- function(model, data){
  y_hat <- predict(model, data)
  mse <- mean((y_hat - data$y)^2)
  return(mse)
}

for(depth in 1:10){
  rtree_model <- rpart(y ~ x, data = train_df, control=rpart.control(maxdepth=depth))
  results <- rbind(results, data.frame(Granularity = depth, MSE = get_mse(rtree_model, train_df), data = "Training"))
  results <- rbind(results, data.frame(Granularity = depth, MSE = get_mse(rtree_model, test_df), data = "Test"))
}

ggplot(results, aes(Granularity, MSE, color = data, group = data)) +
  geom_line() +
  geom_point() +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "bottom", legend.title = element_blank())

As the granularity (tree depth) increases, the MSE on the training set decreases monotonically, while the MSE on the test set initially declines and then rises again as the deeper trees begin to overfit the noise.
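
The depth at which the test curve bottoms out can also be pulled from the results table directly; a small sketch:

# Tree depth with the lowest test-set MSE
results %>%
  filter(data == "Test") %>%
  slice_min(MSE, n = 1)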

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect the magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

A

Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

With both the learning rate and the bagging fraction at 0.9, each tree is grown on nearly all of the training data and each tree's contribution is only lightly shrunk, so the same few dominant predictors are chosen again and again and importance concentrates on them. With both at 0.1, each tree sees a different small bagged sample and contributes only a small step toward the fit, so many more predictors get the opportunity to enter the ensemble and importance spreads across them.

B

Which model do you think would be more predictive of other samples?

The model on the left (both parameters at 0.1) should be more predictive of other samples: the small learning rate and bagging fraction act as regularization, whereas the model on the right (0.9) is far more likely to overfit the training data.

C

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth lets each tree involve more predictors, which spreads importance across a larger set of variables and therefore flattens (decreases) the slope of the predictor importance plot for both models.
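
This could be verified empirically on the solubility data with two quick fits at different interaction depths; a sketch, assuming the AppliedPredictiveModeling data and the gbm package (the settings here are illustrative, not the ones behind Fig. 8.24):

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # provides solTrainXtrans and solTrainY

# Fit the same boosting model at two interaction depths and compare
# how steeply the sorted relative influences fall off.
for (d in c(1, 10)) {
  fit <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                 n.trees = 100, interaction.depth = d, shrinkage = 0.1,
                 verbose = FALSE)
  print(head(summary(fit, plotit = FALSE), 10))
}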

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

A

Which tree-based regression model gives the optimal resampling and test set performance?

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

# Reproducible
set.seed(42)

knn_model <- preProcess(ChemicalManufacturingProcess, "knnImpute")
df <- predict(knn_model, ChemicalManufacturingProcess)

# Drop near-zero-variance predictors
df <- df %>%
  select(-any_of(nearZeroVar(., names = TRUE)))

in_train <- createDataPartition(df$Yield, times = 1, p = 0.8, list = FALSE)
train_df <- df[in_train, ]
test_df <- df[-in_train, ]

pls_model <- train(
  Yield ~ ., data = train_df, method = "pls",
  center = TRUE,
  scale = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 25
)
pls_predictions <- predict(pls_model, test_df)
pls_in_sample <- pls_model$results[pls_model$results$ncomp == pls_model$bestTune$ncomp,]
results <- data.frame(t(postResample(pred = pls_predictions, obs = test_df$Yield))) %>%
  mutate("In Sample RMSE" = pls_in_sample$RMSE,
         "In Sample Rsquared" = pls_in_sample$Rsquared,
         "In Sample MAE" = pls_in_sample$MAE,
         "Model"= "PLS")
pls_model
## Partial Least Squares 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 128, 129, 130, 129, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.8824790  0.3779221  0.6711462
##    2     1.1458456  0.4219806  0.7086431
##    3     0.7363066  0.5244517  0.5688553
##    4     0.8235294  0.5298005  0.5933120
##    5     0.9670735  0.4846010  0.6371199
##    6     0.9959036  0.4776684  0.6427478
##    7     0.9119517  0.4986338  0.6200233
##    8     0.9068621  0.5012144  0.6293371
##    9     0.8517370  0.5220166  0.6163795
##   10     0.8919356  0.5062912  0.6332243
##   11     0.9173758  0.4934557  0.6463164
##   12     0.9064125  0.4791526  0.6485663
##   13     0.9255289  0.4542181  0.6620193
##   14     1.0239913  0.4358371  0.6944056
##   15     1.0754710  0.4365214  0.7077991
##   16     1.1110579  0.4269065  0.7135684
##   17     1.1492855  0.4210485  0.7222868
##   18     1.1940639  0.4132534  0.7396357
##   19     1.2271867  0.4079005  0.7494818
##   20     1.2077102  0.4022859  0.7470327
##   21     1.2082648  0.4026711  0.7452969
##   22     1.2669285  0.3987044  0.7634170
##   23     1.3663033  0.3970188  0.7957514
##   24     1.4531634  0.3898475  0.8243034
##   25     1.5624265  0.3820102  0.8612935
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
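
The tree-based candidates can be trained with the same resampling setup and their test-set metrics appended to the results table for comparison. A sketch for one of them (rf_chem is a hypothetical name; boosted trees, method = "gbm", and Cubist, method = "cubist", follow the same pattern):

# Random forest with the same 10-fold CV as the PLS baseline
rf_chem <- train(
  Yield ~ ., data = train_df, method = "rf",
  importance = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
)
rf_chem_predictions <- predict(rf_chem, test_df)
postResample(pred = rf_chem_predictions, obs = test_df$Yield)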