library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rpart)
Recreate the simulated data from Exercise 7.2:
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.3.3
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
set.seed(201)
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1 %>% arrange(desc(Overall))
## Overall
## V1 8.765986597
## V4 7.896367545
## V2 6.529195380
## V5 2.229576452
## V3 0.786661957
## V6 0.032819437
## V10 -0.007697345
## V7 -0.017614112
## V9 -0.056301236
## V8 -0.099819262
The importance scores for variables V6 through V10 were very low, indicating that these variables contributed little to the model; the slightly negative values (V7 through V10) suggest those predictors, if anything, degraded performance.
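Later chunks drop duplicate1 and duplicate2 from simulated, which suggests the step from Exercise 8.1(b), adding predictors highly correlated with V1, was run but not echoed here. A minimal sketch of that step, assuming a noise scale that gives roughly a 0.95 correlation with V1; the seed is arbitrary:
set.seed(202)
# Add two near-copies of V1; the later gbm and Cubist chunks assume these columns exist
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)  # should be close to 0.95
Refitting randomForest after adding these columns would show V1's importance being diluted across the correlated copies, which is the point of that part of the exercise.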
library(party)
## Warning: package 'party' was built under R version 4.3.3
## Warning: package 'strucchange' was built under R version 4.3.3
## Warning: package 'sandwich' was built under R version 4.3.3
set.seed(205)
cforest_model <- cforest(formula = y ~ V1+V2+V3+V4+V5+V6+V7+V8+V9+V10, data = simulated,
# Adjust parameters ntree and mtry to values used in the random forest model
controls = cforest_control(ntree = 1000, mtry = 3))
cforest_model
##
## Random Forest using Conditional Inference Trees
##
## Number of trees: 1000
##
## Response: y
## Inputs: V1, V2, V3, V4, V5, V6, V7, V8, V9, V10
## Number of observations: 200
varImp(cforest_model) %>% arrange(desc(Overall))
## Overall
## V1 6.764646178
## V4 6.393648430
## V2 5.334192335
## V5 1.836656095
## V3 0.123622518
## V7 0.033265543
## V6 0.005995986
## V9 -0.008123966
## V10 -0.033533213
## V8 -0.035588481
rfImp1 %>% arrange(desc(Overall))
## Overall
## V1 8.765986597
## V4 7.896367545
## V2 6.529195380
## V5 2.229576452
## V3 0.786661957
## V6 0.032819437
## V10 -0.007697345
## V7 -0.017614112
## V9 -0.056301236
## V8 -0.099819262
Do these importances show the same pattern as the traditional random forest model?
The same pattern occurs with the conditional inference random forest: the top five predictors remain in the same order, though with lower scores. The order of the bottom five changes slightly, but they still show low or negative importance.
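Note that caret's varImp() on a cforest object reports the unconditional permutation importance; party's own varimp() also offers a conditional version that adjusts for correlated predictors. A quick sketch of both calls (the conditional = TRUE run is noticeably slower):
# Unconditional vs. conditional permutation importance for the cforest model
party::varimp(cforest_model) %>% sort(decreasing = TRUE)
party::varimp(cforest_model, conditional = TRUE) %>% sort(decreasing = TRUE)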
library(gbm)
## Warning: package 'gbm' was built under R version 4.3.3
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# Direct gbm() call (not used; the model is tuned through caret below)
# gbmModel <- gbm(y ~ V1+V2+V3+V4+V5+V6+V7+V8+V9+V10, data = simulated,
#                 distribution = "gaussian", n.trees = 1000, shrinkage = 0.01)
set.seed(206)
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode=10)
gbmTune <- train(simulated %>% select(-c(y, duplicate1, duplicate2)),
simulated$y,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
## The gbm() function produces copious amounts
## of output, so pass in the verbose option
## to avoid printing a lot to the screen.
# Extract the best model
gbmTune$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 101 350 3 0.1 10
# View important feature scores
varImp(gbmTune)
## gbm variable importance
##
## Overall
## V4 100.000
## V1 87.864
## V2 76.054
## V5 39.680
## V3 26.446
## V7 3.181
## V6 2.493
## V8 1.452
## V9 1.123
## V10 0.000
The same pattern occurs when fitting a boosted tree model: the top five and bottom five most important predictors remain the same, though in a slightly different order.
set.seed(207)
cubistGrid <- expand.grid(committees = c(1, 10, 50, 100),
neighbors = c(0, 1, 5, 9))
cubistTuned <- train(simulated %>% select(-c(y, duplicate1, duplicate2)),
simulated$y,
method = "cubist",
tuneGrid = cubistGrid,
verbose = FALSE)
ggplot(cubistTuned)
# Extract the best model
cubistTuned$bestTune
## committees neighbors
## 16 100 9
# View important feature scores
varImp(cubistTuned)
## cubist variable importance
##
## Overall
## V1 100.00
## V2 81.82
## V4 67.13
## V3 65.73
## V5 46.15
## V6 18.18
## V9 0.00
## V7 0.00
## V10 0.00
## V8 0.00
The same pattern holds for the rule-based Cubist model as for the boosted tree: the top five and bottom five predictors remain the same, with slight differences in order and magnitude.
Use a simulation to show tree bias with different granularities.
To show tree bias, six variables are created with varying levels of granularity (numbers of distinct values). The response Y is created by adding two of the variables (X3 and X6) plus random noise drawn from 10 to 15 in 0.001 increments.
set.seed(100)
X1 <- sample(seq(5, 10, 1), 200, replace = TRUE) # 6 distinct values
X2 <- sample(seq(400, 450, 5), 200, replace = TRUE) # 11 distinct values
X3 <- sample(seq(15, 16, 0.01), 200, replace = TRUE) # 101 distinct values
X4 <- sample(seq(1,3, 0.01), 200, replace = TRUE) # 201 distinct values
X5 <- sample(seq(0, 0.5, 0.001), 200, replace = TRUE) # 501 distinct values
X6 <- sample(seq(0.5, 2, 0.001), 200, replace = TRUE) # 1,501 distinct values
Y <- X6 + X3 + sample(seq(10, 15, 0.001), 200, replace = TRUE) # Y is X3 + X6 plus random noise between 10 and 15
# Create data frame
simulated2 <- data.frame(X1,X2,X3,
X4,X5,X6,Y)
model_sim <- rpart(Y ~., data=simulated2)
# Extract variable importance using varImp
varImp(model_sim) %>% arrange(desc(Overall))
## Overall
## X5 1.0316934
## X6 0.9930010
## X3 0.6764543
## X4 0.5835573
## X2 0.5796439
## X1 0.5675177
There appears to be selection bias: X5 and X6, the two variables with the most distinct values (the finest granularity), are selected as the two most important variables, even though X5 has no relationship with Y while the informative but coarser X3 is ranked below it.
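One way to check that this is bias rather than a fluke of a single draw is to repeat the simulation and see how often the uninformative, fine-grained X5 outranks the informative, coarser X3. A sketch (the replication count and seed are arbitrary):
# Proportion of simulations in which rpart ranks the noise variable X5
# above the informative variable X3
set.seed(101)
x5_beats_x3 <- replicate(100, {
  dat <- data.frame(
    X1 = sample(seq(5, 10, 1), 200, replace = TRUE),
    X2 = sample(seq(400, 450, 5), 200, replace = TRUE),
    X3 = sample(seq(15, 16, 0.01), 200, replace = TRUE),
    X4 = sample(seq(1, 3, 0.01), 200, replace = TRUE),
    X5 = sample(seq(0, 0.5, 0.001), 200, replace = TRUE),
    X6 = sample(seq(0.5, 2, 0.001), 200, replace = TRUE)
  )
  dat$Y <- dat$X3 + dat$X6 + sample(seq(10, 15, 0.001), 200, replace = TRUE)
  imp <- varImp(rpart(Y ~ ., data = dat))
  isTRUE(imp["X5", "Overall"] > imp["X3", "Overall"])
})
mean(x5_beats_x3)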
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data.
The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
The bagging fraction affects how much of the training data each tree sees, while the learning rate controls how much weight each tree's prediction gets in the final ensemble. The model on the right concentrates its importance on the first few features because the learning rate is much higher, giving more weight to the early trees in which those top features are used and thus making them appear more important.
Based on the learning rate and bagging fraction values, the model on the left would be more successful at predicting other samples. The lower values are less likely to produce an overfit model; they are generally preferred because they make the ensemble less sensitive to the specific characteristics of any one tree, allowing it to generalize well.
Increasing the interaction depth (tree depth) would potentially capture more complex relationships between predictors. As more predictors can be used within each tree, a larger number of predictors gain exposure and weight, increasing their importance values. As the importance of seemingly less important predictors increases, the slope of the predictor importance profile would flatten.
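A hedged sketch of how the Figure 8.24 comparison could be reproduced on the solubility data shipped with AppliedPredictiveModeling, fitting gbm at the two extreme settings; n.trees = 1000 and interaction.depth = 7 are assumptions, not the book's exact values:
# Refit gbm on the solubility data with bagging fraction and learning rate
# both at 0.1 versus both at 0.9, then compare the relative influence profiles
library(AppliedPredictiveModeling)
data(solubility)  # provides solTrainXtrans and solTrainY

gbm_low  <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 1000, interaction.depth = 7,
                    shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)
gbm_high <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 1000, interaction.depth = 7,
                    shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)

head(summary(gbm_low,  plotit = FALSE), 10)   # flatter importance profile expected
head(summary(gbm_high, plotit = FALSE), 10)   # importance concentrated in a few predictors
Raising interaction.depth in the same sketch is one way to check the claim above that deeper trees flatten the importance slope.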
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
# Load libraries
library(tidymodels)
library(AppliedPredictiveModeling)
# Load data
data("ChemicalManufacturingProcess")
chemical_df <- ChemicalManufacturingProcess
#head(chemical_df)
# Find number of missing values in each predictor
chemical_df %>% summarise_all(~ sum(is.na(.))) %>% pivot_longer(cols = colnames(chemical_df), names_to = "Variable", values_to = "Counts") %>%
filter(Counts > 0)
## # A tibble: 28 × 2
## Variable Counts
## <chr> <int>
## 1 ManufacturingProcess01 1
## 2 ManufacturingProcess02 3
## 3 ManufacturingProcess03 15
## 4 ManufacturingProcess04 1
## 5 ManufacturingProcess05 1
## 6 ManufacturingProcess06 2
## 7 ManufacturingProcess07 1
## 8 ManufacturingProcess08 1
## 9 ManufacturingProcess10 9
## 10 ManufacturingProcess11 10
## # ℹ 18 more rows
# Impute the median of each predictor for its missing values
for(i in 1:ncol(chemical_df)) {
chemical_df[ , i][is.na(chemical_df[ , i])] <- median(chemical_df[ , i], na.rm=TRUE)
}
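An equivalent and more idiomatic route, since caret is already attached, is to let preProcess() do the median imputation; shown only as an alternative sketch to the loop above:
# Alternative: caret's preProcess handles the median imputation in one step
median_imputer <- preProcess(ChemicalManufacturingProcess, method = "medianImpute")
chemical_df_alt <- predict(median_imputer, ChemicalManufacturingProcess)
sum(is.na(chemical_df_alt))  # expect 0 remaining missing values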
# Identify near-zero variance predictors
nzv_labels <- nearZeroVar(chemical_df, names = TRUE)
print(paste0("The predictor/s with near zero variance are: ", nzv_labels))
## [1] "The predictor/s with near zero variance are: BiologicalMaterial07"
# Filter out the near-zero variance predictors
filtered_chemical_df <- as_tibble(chemical_df) %>% select(-all_of(nzv_labels))
set.seed(300)
# Split data into train and test sets
chemical_split <- initial_split(filtered_chemical_df,
prop = 0.7)
# Create the training data
chemical_train <- chemical_split %>%
training()
chemXtrain <- data.frame(chemical_train %>% select(-Yield))
chemYtrain <- chemical_train %>% select(Yield)
# Create the test data
chemical_test <- chemical_split %>%
testing()
chemXtest <- data.frame(chemical_test %>% select(-Yield))
chemYtest <- chemical_test %>% select(Yield)
ctrl <- trainControl(method = "cv", number = 10)
# getModelInfo("rpart2")
set.seed(301)
rpartTune <- train(chemXtrain, chemYtrain$Yield,
method = "rpart2",
tuneLength = 10,
trControl = trainControl(method = "cv"),
metric = "Rsquared")
## note: only 9 possible values of the max tree depth from the initial fit.
## Truncating the grid to 9 .
rpartTune
## CART
##
## 123 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 111, 111, 110, 110, 111, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 1.503475 0.3688235 1.162670
## 2 1.500293 0.3871477 1.169326
## 3 1.520698 0.4047003 1.200330
## 4 1.463409 0.4703839 1.131641
## 6 1.436952 0.5185660 1.109244
## 7 1.418451 0.5362022 1.084360
## 8 1.442204 0.5313016 1.094717
## 9 1.453427 0.5238343 1.101385
## 10 1.453427 0.5238343 1.101385
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 7.
rpartPred <- predict(rpartTune, newdata = chemXtest)
## The function 'postResample' can be used to get performance values
rpart_metrics <- postResample(pred = rpartPred, obs = chemYtest$Yield)
rpart_metrics
## RMSE Rsquared MAE
## 1.4292893 0.4249483 1.1197365
# getModelInfo("rf")
set.seed(302)
rfTune <- train(chemXtrain, chemYtrain$Yield,
method = "rf",
tuneLength = 10,
trControl = trainControl(method = "cv"),
metric = "Rsquared")
rfTune
## Random Forest
##
## 123 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 111, 110, 111, 111, 110, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.277038 0.6261434 1.0265381
## 8 1.170290 0.6577711 0.9361999
## 14 1.141997 0.6662932 0.9092988
## 20 1.141338 0.6555855 0.9121757
## 26 1.144411 0.6498069 0.9105170
## 32 1.135027 0.6503234 0.8976756
## 38 1.148485 0.6387736 0.9087014
## 44 1.146302 0.6366574 0.9047472
## 50 1.146526 0.6364851 0.9055020
## 56 1.147323 0.6337794 0.9042186
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 14.
rfPred <- predict(rfTune, newdata = chemXtest)
## The function 'postResample' can be used to get performance values
rf_metrics <- postResample(pred = rfPred, obs = chemYtest$Yield)
rf_metrics
## RMSE Rsquared MAE
## 1.0158680 0.6932786 0.8100351
set.seed(303)
bagTune <- train(chemXtrain, chemYtrain$Yield,
method = "treebag",
tuneLength = 10,
trControl = trainControl(method = "cv"),
metric = "Rsquared")
bagTune
## Bagged CART
##
## 123 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 111, 110, 110, 111, 111, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.258478 0.573961 1.006399
bagPred <- predict(bagTune, newdata = chemXtest)
## The function 'postResample' can be used to get performance values
bag_metrics <- postResample(pred = bagPred, obs = chemYtest$Yield)
bag_metrics
## RMSE Rsquared MAE
## 1.0043339 0.6843670 0.8267982
# getModelInfo("gbm")
set.seed(304)
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode=10)
gbmTune2 <- train(chemXtrain,
chemYtrain$Yield,
method = "gbm",
tuneGrid = gbmGrid,
trControl = trainControl(method = "cv"),
metric = "Rsquared",
verbose = FALSE)
gbmTune2$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 55 900 5 0.01 10
gbmPred <- predict(gbmTune2, newdata = chemXtest)
## The function 'postResample' can be used to get performance values
gbm_metrics <- postResample(pred = gbmPred, obs = chemYtest$Yield)
gbm_metrics
## RMSE Rsquared MAE
## 1.0146358 0.6816695 0.7605147
# getModelInfo("cubist")
set.seed(305)
cubistGrid <- expand.grid(committees = c(1, 10, 50, 100),
neighbors = c(0, 1, 5, 9))
cubistTuned <- train(chemXtrain,
chemYtrain$Yield,
method = "cubist",
trControl = trainControl(method = "cv"),
tuneGrid = cubistGrid,
metric = "Rsquared",
verbose = FALSE)
cubistTuned
## Cubist
##
## 123 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 110, 111, 109, 111, 111, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.422159 0.5071353 1.0684736
## 1 1 1.341539 0.5739216 0.9294287
## 1 5 1.344637 0.5745633 0.9928622
## 1 9 1.374678 0.5365619 1.0177208
## 10 0 1.279789 0.5862690 0.9885075
## 10 1 1.196680 0.6599332 0.8524738
## 10 5 1.198060 0.6516036 0.8843722
## 10 9 1.239305 0.6103248 0.9412864
## 50 0 1.132927 0.6330376 0.9225310
## 50 1 1.037712 0.7008581 0.7616969
## 50 5 1.043635 0.7051735 0.8120854
## 50 9 1.083676 0.6642215 0.8621034
## 100 0 1.141282 0.6345595 0.9294671
## 100 1 1.048064 0.7033339 0.7813476
## 100 5 1.058763 0.6976084 0.8252739
## 100 9 1.102042 0.6567972 0.8790249
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were committees = 50 and neighbors = 5.
cubistPred <- predict(cubistTuned, newdata = chemXtest)
## The function 'postResample' can be used to get performance values
cubist_metrics <- postResample(pred = cubistPred, obs = chemYtest$Yield)
cubist_metrics
## RMSE Rsquared MAE
## 0.9198151 0.7376965 0.6614308
a. Which tree-based regression model gives the optimal re-sampling and test set performance?
models <- c("Single Tree", "Random Forest", "Bagged Tree", "Boosted Tree", "Cubist")
values <- c(rpart_metrics["Rsquared"], rf_metrics["Rsquared"], bag_metrics["Rsquared"], gbm_metrics["Rsquared"], cubist_metrics["Rsquared"])
tibble(Models = models, R2_Test = values) %>% arrange(desc(R2_Test))
## # A tibble: 5 × 2
## Models R2_Test
## <chr> <dbl>
## 1 Cubist 0.738
## 2 Random Forest 0.693
## 3 Bagged Tree 0.684
## 4 Boosted Tree 0.682
## 5 Single Tree 0.425
Five tree-based methods were trained on the chemical manufacturing training set: a single regression tree, random forest, bagged trees, stochastic gradient boosting, and Cubist. The Cubist model gave the best re-sampling and test-set performance. Its final tuning values were committees = 50 and neighbors = 5, with a cross-validated R-squared of 0.705. On the test set the R-squared was 0.738, meaning the Cubist model explained about 73.8% of the variability in the target variable Yield.
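Part (a) also asks about re-sampling performance. Since every model was tuned with 10-fold cross-validation through caret, the resampling distributions can be compared side by side with resamples(); note this is only a rough comparison because the folds were generated with different seeds rather than shared via trainControl(index = ...):
# Compare cross-validated performance across the five fitted models
resamp <- resamples(list(SingleTree   = rpartTune,
                         RandomForest = rfTune,
                         BaggedTree   = bagTune,
                         BoostedTree  = gbmTune2,
                         Cubist       = cubistTuned))
summary(resamp)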
b1. Which predictors are most important in the optimal tree-based regression model?
varImp(cubistTuned)
## cubist variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess17 68.18
## ManufacturingProcess13 60.23
## BiologicalMaterial03 51.14
## BiologicalMaterial06 42.05
## ManufacturingProcess04 40.91
## BiologicalMaterial12 40.91
## BiologicalMaterial02 39.77
## ManufacturingProcess11 30.68
## ManufacturingProcess33 23.86
## ManufacturingProcess29 22.73
## ManufacturingProcess09 20.45
## ManufacturingProcess39 18.18
## ManufacturingProcess37 18.18
## ManufacturingProcess19 17.05
## ManufacturingProcess02 15.91
## ManufacturingProcess27 13.64
## ManufacturingProcess20 13.64
## ManufacturingProcess16 13.64
## ManufacturingProcess26 13.64
b2. Do either the biological or process variables dominate the list?
Out of the 20 most important predictors, 16 were manufacturing process variables and 4 were biological material measurements, so the process variables dominate the list.
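The split can also be tallied programmatically from the importance scores rather than by hand; a small sketch:
# Count predictor types among the 20 most important Cubist predictors
varImp(cubistTuned)$importance %>%
  rownames_to_column("Variable") %>%
  slice_max(Overall, n = 20, with_ties = FALSE) %>%
  mutate(Type = if_else(str_detect(Variable, "^Manufacturing"),
                        "Process", "Biological")) %>%
  count(Type)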
b3. How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
The optimal linear model was built using an elastic net and the optimal nonlinear model using a support vector machine. Both show a similar pattern to the Cubist model, with the majority of the most important predictors being manufacturing processes. Several predictors appear in the top 10 of all three models; the overlap is computed explicitly after the two lists below.
Elastic net model
ManufacturingProcess09
ManufacturingProcess32
ManufacturingProcess06
BiologicalMaterial02
BiologicalMaterial03
ManufacturingProcess31
ManufacturingProcess17
ManufacturingProcess13
ManufacturingProcess36
SVM
ManufacturingProcess32
ManufacturingProcess13
ManufacturingProcess17
BiologicalMaterial06
ManufacturingProcess09
BiologicalMaterial03
BiologicalMaterial12
ManufacturingProcess06
BiologicalMaterial02
ManufacturingProcess36
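Since the original highlighting of the shared predictors does not survive in plain text, the overlap can be computed directly from the lists above (the vectors below are transcribed from this write-up, so treat them as a sketch):
# Top predictors transcribed from the three models above
cubist_top10 <- c("ManufacturingProcess32", "ManufacturingProcess17",
                  "ManufacturingProcess13", "BiologicalMaterial03",
                  "BiologicalMaterial06",   "ManufacturingProcess04",
                  "BiologicalMaterial12",   "BiologicalMaterial02",
                  "ManufacturingProcess11", "ManufacturingProcess33")
enet_top <- c("ManufacturingProcess09", "ManufacturingProcess32",
              "ManufacturingProcess06", "BiologicalMaterial02",
              "BiologicalMaterial03",   "ManufacturingProcess31",
              "ManufacturingProcess17", "ManufacturingProcess13",
              "ManufacturingProcess36")
svm_top10 <- c("ManufacturingProcess32", "ManufacturingProcess13",
               "ManufacturingProcess17", "BiologicalMaterial06",
               "ManufacturingProcess09", "BiologicalMaterial03",
               "BiologicalMaterial12",   "ManufacturingProcess06",
               "BiologicalMaterial02",   "ManufacturingProcess36")
# Predictors shared by all three top lists
Reduce(intersect, list(cubist_top10, enet_top, svm_top10))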
c. Plot the optimal single tree with the distribution of yield in the terminal nodes.
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
rpart.plot(rpartTune$finalModel, fallen.leaves = FALSE, main = "Decision Tree for Chemical Data Set")
c2. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
Viewing the decision tree gives additional information about the thresholds at which the biological material and manufacturing process variables positively or negatively affect Yield. For example, the paths that produce the highest yields occur when ManufacturingProcess32 is over 160. At the next split, which evaluates ManufacturingProcess33, if that value is not greater than or equal to 33, the model predicts a Yield of 43, the highest of any leaf node. If a model performs well and we are confident in its strength, these cutoffs could help determine when to adjust certain variables as they rise above or fall below the thresholds that optimize Yield.
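The split thresholds discussed above can also be read off as explicit rules rather than from the plot; rpart.plot provides rpart.rules() for this (a short sketch):
# Print the fitted tree as one rule per terminal node, with the predicted Yield
# shown at the start of each row
rpart.plot::rpart.rules(rpartTune$finalModel, roundint = FALSE)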