Required Libraries

library(tidyverse)
library(kableExtra)
library(janitor)
library(mlbench)
library(caret) ## KNN
library(latex2exp)
library(AppliedPredictiveModeling)
library(ggcorrplot)
library(randomForest)
library(party)
library(gbm)
library(Cubist)
library(rpart)
library(partykit)
library(ipred)
library(rpart.plot)

8.1

Recreate the simulated data from Exercise 7.2:

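For completeness, the simulation from Exercise 7.2 can be recreated with the Friedman benchmark in mlbench, following the code given in the text (ten uniform predictors, of which only V1–V5 drive the response; the table below previews the first rows, and exact values depend on the seed used):

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"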
Simulated Data
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 y
0.5338 0.6478 0.8508 0.1816 0.9290 0.3618 0.8267 0.4214 0.5911 0.5886 18.4640
0.5838 0.4382 0.6727 0.6692 0.1638 0.4531 0.6490 0.8446 0.9282 0.7584 16.0984
0.5896 0.5879 0.4097 0.3381 0.8941 0.0268 0.1786 0.3496 0.0176 0.4441 17.7616
0.6910 0.2260 0.0334 0.0669 0.6374 0.5250 0.5134 0.7970 0.6899 0.4451 13.7873
0.6673 0.8189 0.7168 0.8032 0.0831 0.2234 0.6645 0.9039 0.3970 0.5501 18.4298
0.8393 0.3863 0.6462 0.8611 0.6304 0.4370 0.3360 0.6489 0.5312 0.9066 20.8582
* Values have been rounded to 4 sig. fig.
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No. The importance scores for the uninformative predictors (V6–V10) are all near zero, and several are slightly negative, so the random forest did not make meaningful use of them.

set.seed(200)

model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

kbl(rfImp1,
    caption = "Variable Importance",
    digits=4) |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote(general_title = "",
           symbol = c("Simulated Data: Random Forest"))
Variable Importance
Overall
V1 8.7259
V2 6.6302
V3 0.7313
V4 7.8096
V5 2.0500
V6 0.1495
V7 0.1353
V8 -0.0763
V9 -0.0201
V10 -0.0113
* Simulated Data: Random Forest
(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
set.seed(200)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1

paste0("duplicate1 Corr.: ", round(cor(simulated$duplicate1, simulated$V1), 4))
## [1] "duplicate1 Corr.: 0.9497"

Fit another random forest model to these data. Did the importance score for V1 change?

The importance of V1 dropped from 8.7259 to 6.2236, with duplicate1 absorbing a sizable share (3.8355). V4 also overtook V1 as the most important predictor overall.

set.seed(200)
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

kbl(rfImp2,
    caption = "Variable Importance",
    digits=4) |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote(general_title = "",
           "Added highly correlated variable",
           symbol = c("Simulated Data: Random Forest"))
Variable Importance
Overall
V1 6.2236
V2 6.1384
V3 0.6836
V4 6.9340
V5 2.2132
V6 0.1077
V7 0.0298
V8 0.0033
V9 -0.0597
V10 0.0147
duplicate1 3.8355
Added highly correlated variable
* Simulated Data: Random Forest

What happens when you add another predictor that is also highly correlated with V1?

V1 becomes even less important (4.4092), since its importance is now split across V1, duplicate1, and duplicate2, while the other informative predictors move up the list relative to it.

set.seed(123)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1

paste0("duplicate2 Corr.: ", round(cor(simulated$duplicate1, simulated$V1), 4))
## [1] "duplicate2 Corr.: 0.9497"
set.seed(200)
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)

kbl(rfImp3,
    caption = "Variable Importance",
    digits=4) |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote(general_title = "",
           "Added second highly correlated variable",
           symbol = c("Simulated Data: Random Forest"))
Variable Importance
Overall
V1 4.4092
V2 6.4247
V3 0.4393
V4 7.3494
V5 2.1337
V6 0.1793
V7 0.1346
V8 -0.0397
V9 0.0052
V10 -0.0135
duplicate1 2.5598
duplicate2 3.5651
Added second highly correlated variable
* Simulated Data: Random Forest
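As a quick check on why this happens, the two copies are highly correlated with each other as well as with V1, so the forest's splits, and therefore the permutation importance, are shared among all three:

round(cor(simulated[, c("V1", "duplicate1", "duplicate2")]), 4)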
(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

Fitting cforest to V1–V10 only (the duplicated predictors excluded), the conditional importance puts V4 slightly ahead of V1, but the overall pattern matches the traditional random forest: the informative predictors V1, V2, V4, and V5 dominate, while the uninformative predictors (V6–V10) receive importance near or below zero.

set.seed(200)

model4 <- cforest(y ~ ., data = simulated[, c(1:11)])
rfImp4 <- varimp(model4, conditional = TRUE)

kbl(rfImp4,
    caption = "Variable Importance",
    digits=4) |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote(general_title = "",
           "",
           symbol = c("Simulated Data: Conditional Inference Trees"))
Variable Importance
x
V1 6.0910
V2 5.0941
V3 0.1277
V4 6.2070
V5 1.4594
V6 -0.1818
V7 -0.1429
V8 -0.4577
V9 -0.2489
V10 -0.2298
* Simulated Data: Conditional Inference Trees
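For reference, the traditional (unconditional) permutation importance can be pulled from the same cforest fit; conditional = FALSE is the default in varimp, so this is only a comparison sketch and was not run above:

rfImp4_uncond <- varimp(model4, conditional = FALSE)
sort(rfImp4_uncond, decreasing = TRUE)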
(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Yes, broadly. The boosted model (again fit on V1–V10 only) ranks V1–V5 as the most important variables, with V4 and V1 as the top two, so the top spot again switches relative to the first random forest. The Cubist model shows a similar pattern, but with V1 clearly on top.

Boosted

set.seed(200)

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)

gbmTune <- train(y ~ ., data = simulated[, c(1:11)],
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

rfImp5 <- varImp(gbmTune$finalModel, scale = FALSE)


kbl(rfImp5,
    caption = "Variable Importance",
    digits=4) |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote(general_title = "",
           "",
           symbol = c("Simulated Data: Boosted Trees"))
Variable Importance
Overall
V1 4687.5152
V2 3742.1961
V3 1469.6142
V4 4716.7730
V5 1966.6999
V6 186.1673
V7 170.1968
V8 141.7410
V9 169.9807
V10 100.8460
* Simulated Data: Boosted Trees

Cubist

set.seed(200)
cubistTuned <- train(y ~ ., data = simulated[, c(1:11)], method = "cubist")
rfImp6 <- varImp(cubistTuned$finalModel, scale = FALSE)

kbl(rfImp6,
    caption = "Variable Importance",
    digits=4) |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote(general_title = "",
           "",
           symbol = c("Simulated Data: Cubist Trees"))
Variable Importance
Overall
V1 72.0
V3 42.0
V2 54.5
V4 49.0
V5 40.0
V6 11.0
V7 0.0
V8 0.0
V9 0.0
V10 0.0
* Simulated Data: Cubist Trees
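As an optional check not run above, the Cubist fit could also be repeated on the full data frame, including duplicate1 and duplicate2, to see whether V1's importance is diluted the way it was for the random forest (cubistDup is just an illustrative name):

set.seed(200)
cubistDup <- train(y ~ ., data = simulated, method = "cubist")
varImp(cubistDup$finalModel, scale = FALSE)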

8.2

Use a simulation to show tree bias with different granularities.

All four predictors contribute equally to y, but they differ in granularity: a takes 10 distinct values, b 100, c 1,000, and d 10,000. The fitted tree splits first on d, the most granular predictor; because CART's exhaustive search favors variables with more distinct values (more candidate split points), the root split is biased toward d even though it is no more informative than the others. Further down the tree, b appears most often and ends up with the highest importance, followed by c, which shows how granularity rather than true signal can drive both split selection and the importance ranking.

set.seed(200)

a <- sample(x = 1:10 / 10, size = 300, replace = TRUE)
b <- sample(x = 1:100 / 100, size = 300, replace = TRUE)
c <- sample(x = 1:1000 / 1000, size = 300, replace = TRUE)
d <- sample(x = 1:10000 / 10000, size = 300, replace = TRUE)

y <- a + b + c + d

simulate_data <- data.frame(a, b, c, d, y) 

rpart_tree <- rpart(y ~ ., data = simulate_data)

plot(as.party(rpart_tree))

rfImp7 <- varImp(rpart_tree)

kbl(rfImp7,
    caption = "Variable Importance",
    digits=4) |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote(general_title = "",
           "",
           symbol = c("Simulated Data: Tree Bias"))
Variable Importance
Overall
a 2.3565
b 4.2698
c 3.4416
d 1.7506
* Simulated Data: Tree Bias
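For comparison, a conditional inference tree could be fit to the same data as a sketch; ctree chooses split variables with association tests rather than an exhaustive search, so it should be less drawn to the high-granularity predictor d:

ctree_tree <- ctree(y ~ ., data = simulate_data)
plot(ctree_tree)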

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9

(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

In the right-hand model, both the learning rate and the bagging fraction are 0.9. The high learning rate means each tree's contribution is added with little shrinkage, so the model commits quickly to the strongest predictors, and the high bagging fraction means every tree sees nearly the same data, so the trees keep selecting those same predictors. Importance therefore concentrates on the first few variables. The left-hand model (both parameters at 0.1) takes many small steps on varied subsamples, which lets a wider set of predictors enter the trees and spreads the importance across more of them.
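As a sketch of how the two extremes in Fig. 8.24 could be reproduced, assuming the solubility data (solTrainXtrans, solTrainY) from AppliedPredictiveModeling and illustrative settings (n.trees = 100) rather than the book's exact tuning:

data(solubility)

gbm_low  <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 1,
                    shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)
gbm_high <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 1,
                    shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)

summary(gbm_low,  plotit = FALSE)[1:10, ]   # importance spread across many predictors
summary(gbm_high, plotit = FALSE)[1:10, ]   # importance concentrated in the top few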

(b) Which model do you think would be more predictive of other samples?

The left-hand model would likely be more predictive of new samples. Its small learning rate and bagging fraction act as regularization, spreading importance across many predictors rather than relying heavily on a few, which reduces the risk of overfitting relative to the right-hand model.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth produces deeper trees, which gives the less important predictors more opportunities to be used in splits. Importance would therefore be spread across more variables, and the slope of the predictor-importance curve would flatten (decrease less steeply) for both models; a quick sketch follows.
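Continuing the sketch above, one could deepen the trees and inspect whether the importance profile flattens (hypothetical settings, for illustration only):

gbm_deep <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 10,
                    shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)
summary(gbm_deep, plotit = FALSE)[1:10, ]   # compare the drop-off to gbm_high above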

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models.

Data Preparation

data(ChemicalManufacturingProcess)

chem <- ChemicalManufacturingProcess
kbl(chem[1:5, 1:10],
    caption = "Chemical Manufacturing Process") |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote(general_title = "Dimensions: ",
          TeX(paste0(nrow(chem), " x ", ncol(chem))))
Chemical Manufacturing Process
Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
38.00 6.25 49.58 56.97 12.74 19.51 43.73 100 16.66 11.44
42.44 8.01 60.97 67.48 14.65 19.36 53.14 100 19.04 12.55
42.03 8.01 60.97 67.48 14.65 19.36 53.14 100 19.04 12.55
41.42 8.01 60.97 67.48 14.65 19.36 53.14 100 19.04 12.55
42.49 7.47 63.33 72.25 14.02 17.91 54.66 100 18.22 12.80
Dimensions:
176 x 58

We can see which variables have missing values below. Afterwards, we can use a KNN imputation method to replace these NA values and confirm that we have zero missing values.

missing <- 
  chem %>% 
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(., 
            cols = everything()) |>
  filter(value != 0) |>
  rename(variable = name, 
         total = value) |>
  mutate(pct = round(total/nrow(chem), 4)) |>
  arrange(desc(pct), variable) |>
  mutate(variable = fct_reorder(variable, pct))

kbl(missing,
  caption = "Missing Values") |>
  kable_classic(full_width = F, html_font = "Cambria") %>%
  kable_styling(latex_options = "HOLD_position")
Missing Values
variable total pct
ManufacturingProcess03 15 0.0852
ManufacturingProcess11 10 0.0568
ManufacturingProcess10 9 0.0511
ManufacturingProcess25 5 0.0284
ManufacturingProcess26 5 0.0284
ManufacturingProcess27 5 0.0284
ManufacturingProcess28 5 0.0284
ManufacturingProcess29 5 0.0284
ManufacturingProcess30 5 0.0284
ManufacturingProcess31 5 0.0284
ManufacturingProcess33 5 0.0284
ManufacturingProcess34 5 0.0284
ManufacturingProcess35 5 0.0284
ManufacturingProcess36 5 0.0284
ManufacturingProcess02 3 0.0170
ManufacturingProcess06 2 0.0114
ManufacturingProcess01 1 0.0057
ManufacturingProcess04 1 0.0057
ManufacturingProcess05 1 0.0057
ManufacturingProcess07 1 0.0057
ManufacturingProcess08 1 0.0057
ManufacturingProcess12 1 0.0057
ManufacturingProcess14 1 0.0057
ManufacturingProcess22 1 0.0057
ManufacturingProcess23 1 0.0057
ManufacturingProcess24 1 0.0057
ManufacturingProcess40 1 0.0057
ManufacturingProcess41 1 0.0057
missing |>
  ggplot(aes(x=variable, y=pct)) +
  geom_bar(stat='identity', fill="#4E79A7") +
  coord_flip() +
  theme_bw() +
  labs(x = "") +
  geom_text(aes(label = paste0(round(pct*100, 2), "%")), colour = "white", hjust = 1, size=3, fontface = "bold") +
  scale_y_continuous(labels = scales::percent)

set.seed(42)

knn_impute <- preProcess(chem, "knnImpute", k=5)
impute_chem <- predict(knn_impute, chem)

impute_missing <-
  impute_chem %>% 
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(., 
            cols = everything()) |>
  filter(value != 0) %>%
  nrow(.)

print(paste0("Missing Values: ", impute_missing >0))
## [1] "Missing Values: FALSE"

We filter out near-zero-variance predictors using the nearZeroVar function, and then we can build our tree-based models.

nzv_predictors <- nearZeroVar(impute_chem |> select(-Yield),
                              names = TRUE, saveMetrics = FALSE)

filtered_chem <-
  as_tibble(impute_chem) |>
  select(-all_of(nzv_predictors)) |>
  clean_names()

kbl(filtered_chem[1:5, 1:10],
    caption = "Filtered Chemical Manufacturing Process") |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote(general_title = "Dimensions: ",
          TeX(paste0(nrow(filtered_chem), " x ", ncol(filtered_chem))))
Filtered Chemical Manufacturing Process
yield biological_material01 biological_material02 biological_material03 biological_material04 biological_material05 biological_material06 biological_material08 biological_material09 biological_material10
-1.1792673 -0.2261036 -1.514098 -2.683036 0.2201765 0.4941942 -1.382888 -1.233131 -3.3962895 1.1005296
1.2263678 2.2391498 1.308996 -0.056235 1.2964386 0.4128555 1.129077 2.282619 -0.7227225 1.1005296
1.0042258 2.2391498 1.308996 -0.056235 1.2964386 0.4128555 1.129077 2.282619 -0.7227225 1.1005296
0.6737219 2.2391498 1.308996 -0.056235 1.2964386 0.4128555 1.129077 2.282619 -0.7227225 1.1005296
1.2534583 1.4827653 1.893939 1.135948 0.9414412 -0.3734185 1.534835 1.071310 -0.1205678 0.4162193
Dimensions:
176 x 57
set.seed(42)

train_index <- createDataPartition(filtered_chem$yield, p = 0.80, list = FALSE)

train_df <- filtered_chem[train_index, ]
train_x <- train_df |>
    select(-yield) %>%
  as.data.frame()
train_y <- train_df$yield
train_y <- as.numeric(train_y) 

test_df <- filtered_chem[-train_index, ]
test_x <- test_df |>
    select(-yield) %>%
  as.data.frame()
test_y <- test_df$yield
test_y <- as.numeric(test_y)
(a) Which tree-based regression model gives the optimal resampling and test set performance?

The Cubist model gives the optimal performance, with the lowest test-set RMSE (0.5215) and the highest \(R^2\) (0.7966) in the comparison table below; the resampling results for the caret-tuned fits are checked just after that table.

Random Forest

set.seed(42)

rfModel <- randomForest(train_x, train_y, 
                        importance = TRUE,
                        ntree = 1000)


rfPred <- predict(rfModel, test_x)

rf <- postResample(rfPred, test_y)

Bagged

set.seed(42)

baggedTree <- ipredbagg(train_y, train_x)
 
baggedPred <- predict(baggedTree, test_x)

bag <- postResample(baggedPred, test_y)

Boosted

set.seed(42)

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)

gbmTune <- train(train_x, train_y,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

gbmPred <- predict(gbmTune, test_x)

boost <- postResample(gbmPred, test_y)

Cubist

set.seed(42)

cubistTuned <- train(train_x, train_y, 
                     method = "cubist")

cubistPred <- predict(cubistTuned, test_x)

cube <- postResample(cubistPred, test_y)
append_tables <- function(train_table, model_name){
  
  train_table <-
    train_table |>
    select(RMSE, Rsquared) |>
    mutate(Model = model_name) |>
    relocate(Model)
  
  
    return(train_table)
}

rf_table <- append_tables(as_tibble(as.list(rf)), "Random Forest")
bag_table <- append_tables(as_tibble(as.list(bag)), "Bagged")
boost_table <- append_tables(as_tibble(as.list(boost)), "Boosted")
cube_table <- append_tables(as_tibble(as.list(cube)), "Cubist")
model_table <- 
  bind_rows(rf_table, bag_table, boost_table, cube_table)

kbl(model_table,
    caption = "Model Comparisons",
    digits=4) |>
  kable_classic(full_width = F, html_font = "Cambria")
Model Comparisons
Model RMSE Rsquared
Random Forest 0.6284 0.7528
Bagged 0.6529 0.7019
Boosted 0.5978 0.7445
Cubist 0.5215 0.7966
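For the two caret-tuned fits, the resampling performance can also be pulled to complement the test-set metrics above (the random forest and bagged trees were fit outside of train(), so they carry no resampling results):

getTrainPerf(gbmTune)
getTrainPerf(cubistTuned)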
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

manufacturing_process32 and manufacturing_process17 are tied at the top of the Cubist importance list, and the process variables again dominate: nine of the top 10 predictors are manufacturing processes, with biological_material03 the only biological variable. Compared with the optimal linear and nonlinear models, manufacturing_process32 was the leading variable in all three, although the rankings of the remaining process variables differ across models.

cube_top_pred <- 
  varImp(cubistTuned)$importance |>
  arrange(desc(Overall))

kbl(head(cube_top_pred, 10),
    caption = "Top 10 Predictors") |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote("Cubist Model")
Top 10 Predictors
Overall
manufacturing_process17 100.00000
manufacturing_process32 100.00000
manufacturing_process09 68.05556
manufacturing_process33 43.05556
manufacturing_process01 26.38889
manufacturing_process29 25.00000
manufacturing_process25 22.22222
biological_material03 20.83333
manufacturing_process21 18.05556
manufacturing_process45 16.66667
Note:
Cubist Model
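A quick tally of the top 10 by predictor type (this assumes, as the table above does, that the predictor names are stored as row names; top10 is just an illustrative helper):

top10 <- head(cube_top_pred, 10)
table(ifelse(grepl("^manufacturing", rownames(top10)), "process", "biological"))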
cube_top_pred |>
  arrange(desc(Overall)) |>
  head(10) %>%
  ggplot(aes(y=reorder(row.names(.), Overall), x=Overall, label=round(Overall,2))) +
  geom_bar(stat='identity', fill="#F28E2B") +
  geom_text(position = position_stack(vjust = 0.5),
            size=3,
            fontface = "bold",
            colour="black") +
  theme_minimal() +
  labs(y="", title = "Cubist Predictors by Model Importance") +
  theme(panel.border = element_blank(), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        axis.line = element_line(colour = "black"))

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

The tree splits on manufacturing_process32 at the root, followed by manufacturing_process13 and manufacturing_process17 at the next level. biological_material03 enters at the third level, so it does help determine yield even though process variables dominate the upper splits; this suggests giving some attention to that biological material in addition to the process settings. A partykit view of the same tree (shown below the rpart plot) displays the distribution of yield within each terminal node as boxplots.

chem_tree <- rpart(yield ~ ., data = train_df)

rpart.plot(chem_tree)
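Since the question asks for the distribution of yield in the terminal nodes, the same tree can also be viewed through partykit; plotting the as.party() conversion shows a boxplot of yield within each terminal node:

plot(as.party(chem_tree))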