Required Libraries

library(tidyverse)
library(kableExtra)
library(janitor)
library(mlbench)
library(caret) ## KNN
library(latex2exp)
library(AppliedPredictiveModeling)
library(ggcorrplot)
library(randomForest)
library(party)
library(gbm)
library(Cubist)
library(rpart)
library(partykit)
library(ipred)
library(rpart.plot)

8.1

Recreate the simulated data from Exercise 7.2:

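For completeness, the simulation from Exercise 7.2 can be recreated with the Friedman benchmark in mlbench, following the code given in the text (ten uniform predictors, of which only V1–V5 drive the response; the table below previews the first rows, and exact values depend on the seed used):

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"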
Simulated Data
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 y
0.5338 0.6478 0.8508 0.1816 0.9290 0.3618 0.8267 0.4214 0.5911 0.5886 18.4640
0.5838 0.4382 0.6727 0.6692 0.1638 0.4531 0.6490 0.8446 0.9282 0.7584 16.0984
0.5896 0.5879 0.4097 0.3381 0.8941 0.0268 0.1786 0.3496 0.0176 0.4441 17.7616
0.6910 0.2260 0.0334 0.0669 0.6374 0.5250 0.5134 0.7970 0.6899 0.4451 13.7873
0.6673 0.8189 0.7168 0.8032 0.0831 0.2234 0.6645 0.9039 0.3970 0.5501 18.4298
0.8393 0.3863 0.6462 0.8611 0.6304 0.4370 0.3360 0.6489 0.5312 0.9066 20.8582
* Values have been rounded to 4 sig. fig.
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No. The importance scores for the uninformative predictors (V6–V10) are all near zero, and several are slightly negative, so the random forest did not make meaningful use of them.

set.seed(200)

model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

kbl(rfImp1,
    caption = "Variable Importance",
    digits=4) |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote(general_title = "",
           symbol = c("Simulated Data: Random Forest"))
Variable Importance
Overall
V1 8.7259
V2 6.6302
V3 0.7313
V4 7.8096
V5 2.0500
V6 0.1495
V7 0.1353
V8 -0.0763
V9 -0.0201
V10 -0.0113
* Simulated Data: Random Forest
(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
set.seed(200)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1

paste0("duplicate1 Corr.: ", round(cor(simulated$duplicate1, simulated$V1), 4))
## [1] "duplicate1 Corr.: 0.9497"

Fit another random forest model to these data. Did the importance score for V1 change?

The importance of V1 dropped from 8.7259 to 6.2236, with duplicate1 absorbing a sizable share (3.8355). V4 also overtook V1 as the most important predictor overall.

set.seed(200)
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

kbl(rfImp2,
    caption = "Variable Importance",
    digits=4) |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote(general_title = "",
           "Added highly correlated variable",
           symbol = c("Simulated Data: Random Forest"))
Variable Importance
Overall
V1 6.2236
V2 6.1384
V3 0.6836
V4 6.9340
V5 2.2132
V6 0.1077
V7 0.0298
V8 0.0033
V9 -0.0597
V10 0.0147
duplicate1 3.8355
Added highly correlated variable
* Simulated Data: Random Forest

What happens when you add another predictor that is also highly correlated with V1?

V1 becomes even less important (4.4092), since its importance is now split across V1, duplicate1, and duplicate2, while the other informative predictors move up the list relative to it.

set.seed(123)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1

paste0("duplicate2 Corr.: ", round(cor(simulated$duplicate1, simulated$V1), 4))
## [1] "duplicate2 Corr.: 0.9497"
set.seed(200)
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)

kbl(rfImp3,
    caption = "Variable Importance",
    digits=4) |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote(general_title = "",
           "Added second highly correlated variable",
           symbol = c("Simulated Data: Random Forest"))
Variable Importance
Overall
V1 4.4092
V2 6.4247
V3 0.4393
V4 7.3494
V5 2.1337
V6 0.1793
V7 0.1346
V8 -0.0397
V9 0.0052
V10 -0.0135
duplicate1 2.5598
duplicate2 3.5651
Added second highly correlated variable
* Simulated Data: Random Forest
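As a quick check on why this happens, the two copies are highly correlated with each other as well as with V1, so the forest's splits, and therefore the permutation importance, are shared among all three:

round(cor(simulated[, c("V1", "duplicate1", "duplicate2")]), 4)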
(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

Fitting cforest to V1–V10 only (the duplicated predictors excluded), the conditional importance puts V4 slightly ahead of V1, but the overall pattern matches the traditional random forest: the informative predictors V1, V2, V4, and V5 dominate, while the uninformative predictors (V6–V10) receive importance near or below zero.

set.seed(200)

model4 <- cforest(y ~ ., data = simulated[, c(1:11)])
rfImp4 <- varimp(model4, conditional = TRUE)

kbl(rfImp4,
    caption = "Variable Importance",
    digits=4) |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote(general_title = "",
           "",
           symbol = c("Simulated Data: Conditional Inference Trees"))
Variable Importance
x
V1 6.0910
V2 5.0941
V3 0.1277
V4 6.2070
V5 1.4594
V6 -0.1818
V7 -0.1429
V8 -0.4577
V9 -0.2489
V10 -0.2298
* Simulated Data: Conditional Inference Trees
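For reference, the traditional (unconditional) permutation importance can be pulled from the same cforest fit; conditional = FALSE is the default in varimp, so this is only a comparison sketch and was not run above:

rfImp4_uncond <- varimp(model4, conditional = FALSE)
sort(rfImp4_uncond, decreasing = TRUE)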
(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Yes, broadly. The boosted model (again fit on V1–V10 only) ranks V1–V5 as the most important variables, with V4 and V1 as the top two, so the top spot again switches relative to the first random forest. The Cubist model shows a similar pattern, but with V1 clearly on top.

Boosted

set.seed(200)

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)

gbmTune <- train(y ~ ., data = simulated[, c(1:11)],
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

rfImp5 <- varImp(gbmTune$finalModel, scale = FALSE)


kbl(rfImp5,
    caption = "Variable Importance",
    digits=4) |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote(general_title = "",
           "",
           symbol = c("Simulated Data: Boosted Trees"))
Variable Importance
Overall
V1 4687.5152
V2 3742.1961
V3 1469.6142
V4 4716.7730
V5 1966.6999
V6 186.1673
V7 170.1968
V8 141.7410
V9 169.9807
V10 100.8460
* Simulated Data: Boosted Trees

Cubist

set.seed(200)
cubistTuned <- train(y ~ ., data = simulated[, c(1:11)], method = "cubist")
rfImp6 <- varImp(cubistTuned$finalModel, scale = FALSE)

kbl(rfImp6,
    caption = "Variable Importance",
    digits=4) |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote(general_title = "",
           "",
           symbol = c("Simulated Data: Cubist Trees"))
Variable Importance
Overall
V1 72.0
V3 42.0
V2 54.5
V4 49.0
V5 40.0
V6 11.0
V7 0.0
V8 0.0
V9 0.0
V10 0.0
* Simulated Data: Cubist Trees
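As an optional check not run above, the Cubist fit could also be repeated on the full data frame, including duplicate1 and duplicate2, to see whether V1's importance is diluted the way it was for the random forest (cubistDup is just an illustrative name):

set.seed(200)
cubistDup <- train(y ~ ., data = simulated, method = "cubist")
varImp(cubistDup$finalModel, scale = FALSE)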

8.2

Use a simulation to show tree bias with different granularities.

All four predictors contribute equally to y, but they differ in granularity: a takes 10 distinct values, b 100, c 1,000, and d 10,000. The fitted tree splits first on d, the most granular predictor; because CART's exhaustive search favors variables with more distinct values (more candidate split points), the root split is biased toward d even though it is no more informative than the others. Further down the tree, b appears most often and ends up with the highest importance, followed by c, which shows how granularity rather than true signal can drive both split selection and the importance ranking.

set.seed(200)

a <- sample(x = 1:10 / 10, size = 300, replace = TRUE)
b <- sample(x = 1:100 / 100, size = 300, replace = TRUE)
c <- sample(x = 1:1000 / 1000, size = 300, replace = TRUE)
d <- sample(x = 1:10000 / 10000, size = 300, replace = TRUE)

y <- a + b + c + d

simulate_data <- data.frame(a, b, c, d, y) 

rpart_tree <- rpart(y ~ ., data = simulate_data)

plot(as.party(rpart_tree))

rfImp7 <- varImp(rpart_tree)

kbl(rfImp7,
    caption = "Variable Importance",
    digits=4) |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote(general_title = "",
           "",
           symbol = c("Simulated Data: Tree Bias"))
Variable Importance
Overall
a 2.3565
b 4.2698
c 3.4416
d 1.7506
* Simulated Data: Tree Bias
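For comparison, a conditional inference tree could be fit to the same data as a sketch; ctree chooses split variables with association tests rather than an exhaustive search, so it should be less drawn to the high-granularity predictor d:

ctree_tree <- ctree(y ~ ., data = simulate_data)
plot(ctree_tree)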

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9

(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

In the right-hand model, both the learning rate and the bagging fraction are 0.9. The high learning rate means each tree's contribution is added with little shrinkage, so the model commits quickly to the strongest predictors, and the high bagging fraction means every tree sees nearly the same data, so the trees keep selecting those same predictors. Importance therefore concentrates on the first few variables. The left-hand model (both parameters at 0.1) takes many small steps on varied subsamples, which lets a wider set of predictors enter the trees and spreads the importance across more of them.
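As a sketch of how the two extremes in Fig. 8.24 could be reproduced, assuming the solubility data (solTrainXtrans, solTrainY) from AppliedPredictiveModeling and illustrative settings (n.trees = 100) rather than the book's exact tuning:

data(solubility)

gbm_low  <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 1,
                    shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)
gbm_high <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 1,
                    shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)

summary(gbm_low,  plotit = FALSE)[1:10, ]   # importance spread across many predictors
summary(gbm_high, plotit = FALSE)[1:10, ]   # importance concentrated in the top few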

(b) Which model do you think would be more predictive of other samples?

The left-hand model would likely be more predictive of new samples. Its small learning rate and bagging fraction act as regularization, spreading importance across many predictors rather than relying heavily on a few, which reduces the risk of overfitting relative to the right-hand model.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth produces deeper trees, which gives the less important predictors more opportunities to be used in splits. Importance would therefore be spread across more variables, and the slope of the predictor-importance curve would flatten (decrease less steeply) for both models; a quick sketch follows.
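Continuing the sketch above, one could deepen the trees and inspect whether the importance profile flattens (hypothetical settings, for illustration only):

gbm_deep <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 10,
                    shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)
summary(gbm_deep, plotit = FALSE)[1:10, ]   # compare the drop-off to gbm_high above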

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models.

Data Preparation

data(ChemicalManufacturingProcess)

chem <- ChemicalManufacturingProcess
kbl(chem[1:5, 1:10],
    caption = "Chemical Manufacturing Process") |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote(general_title = "Dimensions: ",
          TeX(paste0(nrow(chem), " x ", ncol(chem))))
Chemical Manufacturing Process
Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
38.00 6.25 49.58 56.97 12.74 19.51 43.73 100 16.66 11.44
42.44 8.01 60.97 67.48 14.65 19.36 53.14 100 19.04 12.55
42.03 8.01 60.97 67.48 14.65 19.36 53.14 100 19.04 12.55
41.42 8.01 60.97 67.48 14.65 19.36 53.14 100 19.04 12.55
42.49 7.47 63.33 72.25 14.02 17.91 54.66 100 18.22 12.80
Dimensions:
176 x 58

We can see which variables have missing values below. Afterwards, we can use a KNN imputation method to replace these NA values and confirm that we have zero missing values.

missing <- 
  chem %>% 
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(., 
            cols = everything()) |>
  filter(value != 0) |>
  rename(variable = name, 
         total = value) |>
  mutate(pct = round(total/nrow(chem), 4)) |>
  arrange(desc(pct), variable) |>
  mutate(variable = fct_reorder(variable, pct))

kbl(missing,
  caption = "Missing Values") |>
  kable_classic(full_width = F, html_font = "Cambria") %>%
  kable_styling(latex_options = "HOLD_position")
Missing Values
variable total pct
ManufacturingProcess03 15 0.0852
ManufacturingProcess11 10 0.0568
ManufacturingProcess10 9 0.0511
ManufacturingProcess25 5 0.0284
ManufacturingProcess26 5 0.0284
ManufacturingProcess27 5 0.0284
ManufacturingProcess28 5 0.0284
ManufacturingProcess29 5 0.0284
ManufacturingProcess30 5 0.0284
ManufacturingProcess31 5 0.0284
ManufacturingProcess33 5 0.0284
ManufacturingProcess34 5 0.0284
ManufacturingProcess35 5 0.0284
ManufacturingProcess36 5 0.0284
ManufacturingProcess02 3 0.0170
ManufacturingProcess06 2 0.0114
ManufacturingProcess01 1 0.0057
ManufacturingProcess04 1 0.0057
ManufacturingProcess05 1 0.0057
ManufacturingProcess07 1 0.0057
ManufacturingProcess08 1 0.0057
ManufacturingProcess12 1 0.0057
ManufacturingProcess14 1 0.0057
ManufacturingProcess22 1 0.0057
ManufacturingProcess23 1 0.0057
ManufacturingProcess24 1 0.0057
ManufacturingProcess40 1 0.0057
ManufacturingProcess41 1 0.0057
missing |>
  ggplot(aes(x=variable, y=pct)) +
  geom_bar(stat='identity', fill="#4E79A7") +
  coord_flip() +
  theme_bw() +
  labs(x = "") +
  geom_text(aes(label = paste0(round(pct*100, 2), "%")), colour = "white", hjust = 1, size=3, fontface = "bold") +
  scale_y_continuous(labels = scales::percent)

set.seed(42)

knn_impute <- preProcess(chem, "knnImpute", k=5)
impute_chem <- predict(knn_impute, chem)

impute_missing <-
  impute_chem %>% 
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(., 
            cols = everything()) |>
  filter(value != 0) %>%
  nrow(.)

print(paste0("Missing Values: ", impute_missing >0))
## [1] "Missing Values: FALSE"

We filter out near-zero-variance predictors using the nearZeroVar function, and then we can build our tree-based models.

nzv_predictors <- nearZeroVar(impute_chem |> select(-Yield),
                              names = TRUE, saveMetrics = FALSE)

filtered_chem <-
  as_tibble(impute_chem) |>
  select(-all_of(nzv_predictors)) |>
  clean_names()

kbl(filtered_chem[1:5, 1:10],
    caption = "Filtered Chemical Manufacturing Process") |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote(general_title = "Dimensions: ",
          TeX(paste0(nrow(filtered_chem), " x ", ncol(filtered_chem))))
Filtered Chemical Manufacturing Process
yield biological_material01 biological_material02 biological_material03 biological_material04 biological_material05 biological_material06 biological_material08 biological_material09 biological_material10
-1.1792673 -0.2261036 -1.514098 -2.683036 0.2201765 0.4941942 -1.382888 -1.233131 -3.3962895 1.1005296
1.2263678 2.2391498 1.308996 -0.056235 1.2964386 0.4128555 1.129077 2.282619 -0.7227225 1.1005296
1.0042258 2.2391498 1.308996 -0.056235 1.2964386 0.4128555 1.129077 2.282619 -0.7227225 1.1005296
0.6737219 2.2391498 1.308996 -0.056235 1.2964386 0.4128555 1.129077 2.282619 -0.7227225 1.1005296
1.2534583 1.4827653 1.893939 1.135948 0.9414412 -0.3734185 1.534835 1.071310 -0.1205678 0.4162193
Dimensions:
176 x 57
set.seed(42)

train_index <- createDataPartition(filtered_chem$yield, p = 0.80, list = FALSE)

train_df <- filtered_chem[train_index, ]
train_x <- train_df |>
    select(-yield) %>%
  as.data.frame()
train_y <- train_df$yield
train_y <- as.numeric(train_y) 

test_df <- filtered_chem[-train_index, ]
test_x <- test_df |>
    select(-yield) %>%
  as.data.frame()
test_y <- test_df$yield
test_y <- as.numeric(test_y)
(a) Which tree-based regression model gives the optimal resampling and test set performance?

The Cubist model gives the optimal performance, with the lowest test-set RMSE (0.5215) and the highest \(R^2\) (0.7966) in the comparison table below; the resampling results for the caret-tuned fits are checked just after that table.

Random Forest

set.seed(42)

rfModel <- randomForest(train_x, train_y, 
                        importance = TRUE,
                        ntree = 1000)


rfPred <- predict(rfModel, test_x)

rf <- postResample(rfPred, test_y)

Bagged

set.seed(42)

baggedTree <- ipredbagg(train_y, train_x)
 
baggedPred <- predict(baggedTree, test_x)

bag <- postResample(baggedPred, test_y)

Boosted

set.seed(42)

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)

gbmTune <- train(train_x, train_y,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

gbmPred <- predict(gbmTune, test_x)

boost <- postResample(gbmPred, test_y)

Cubist

set.seed(42)

cubistTuned <- train(train_x, train_y, 
                     method = "cubist")

cubistPred <- predict(cubistTuned, test_x)

cube <- postResample(cubistPred, test_y)
append_tables <- function(train_table, model_name){
  
  train_table <-
    train_table |>
    select(RMSE, Rsquared) |>
    mutate(Model = model_name) |>
    relocate(Model)
  
  
    return(train_table)
}

rf_table <- append_tables(as_tibble(as.list(rf)), "Random Forest")
bag_table <- append_tables(as_tibble(as.list(bag)), "Bagged")
boost_table <- append_tables(as_tibble(as.list(boost)), "Boosted")
cube_table <- append_tables(as_tibble(as.list(cube)), "Cubist")
model_table <- 
  bind_rows(rf_table, bag_table, boost_table, cube_table)

kbl(model_table,
    caption = "Model Comparisons",
    digits=4) |>
  kable_classic(full_width = F, html_font = "Cambria")
Model Comparisons
Model RMSE Rsquared
Random Forest 0.6284 0.7528
Bagged 0.6529 0.7019
Boosted 0.5978 0.7445
Cubist 0.5215 0.7966
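For the two caret-tuned fits, the resampling performance can also be pulled to complement the test-set metrics above (the random forest and bagged trees were fit outside of train(), so they carry no resampling results):

getTrainPerf(gbmTune)
getTrainPerf(cubistTuned)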
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

manufacturing_process32 and manufacturing_process17 are tied at the top of the Cubist importance list, and the process variables again dominate: nine of the top 10 predictors are manufacturing processes, with biological_material03 the only biological variable. Compared with the optimal linear and nonlinear models, manufacturing_process32 was the leading variable in all three, although the rankings of the remaining process variables differ across models.

cube_top_pred <- 
  varImp(cubistTuned)$importance |>
  arrange(desc(Overall))

kbl(head(cube_top_pred, 10),
    caption = "Top 10 Predictors") |>
  kable_classic(full_width = F, html_font = "Cambria") |>
  footnote("Cubist Model")
Top 10 Predictors
Overall
manufacturing_process17 100.00000
manufacturing_process32 100.00000
manufacturing_process09 68.05556
manufacturing_process33 43.05556
manufacturing_process01 26.38889
manufacturing_process29 25.00000
manufacturing_process25 22.22222
biological_material03 20.83333
manufacturing_process21 18.05556
manufacturing_process45 16.66667
Note:
Cubist Model
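A quick tally of the top 10 by predictor type (this assumes, as the table above does, that the predictor names are stored as row names; top10 is just an illustrative helper):

top10 <- head(cube_top_pred, 10)
table(ifelse(grepl("^manufacturing", rownames(top10)), "process", "biological"))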
cube_top_pred |>
  arrange(desc(Overall)) |>
  head(10) %>%
  ggplot(aes(y=reorder(row.names(.), Overall), x=Overall, label=round(Overall,2))) +
  geom_bar(stat='identity', fill="#F28E2B") +
  geom_text(position = position_stack(vjust = 0.5),
            size=3,
            fontface = "bold",
            colour="black") +
  theme_minimal() +
  labs(y="", title = "Cubist Predictors by Model Importance") +
  theme(panel.border = element_blank(), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        axis.line = element_line(colour = "black"))

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

The tree splits on manufacturing_process32 at the root, followed by manufacturing_process13 and manufacturing_process17 at the next level. biological_material03 enters at the third level, so it does help determine yield even though process variables dominate the upper splits; this suggests giving some attention to that biological material in addition to the process settings. A partykit view of the same tree (shown below the rpart plot) displays the distribution of yield within each terminal node as boxplots.

chem_tree <- rpart(yield ~ ., data = train_df)

rpart.plot(chem_tree)
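Since the question asks for the distribution of yield in the terminal nodes, the same tree can also be viewed through partykit; plotting the as.party() conversion shows a boxplot of yield within each terminal node:

plot(as.party(chem_tree))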