Recreate the simulated data from Exercise 7.2:
# From textbook
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.4.3
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.3
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
model1 <- randomForest(y ~., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
print(rfImp1)
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
These predictors do not seem to have impacted the model significantly. While the minimum value for informative predictors was 0.76 (V3), none of the uninformative predictors had a value greater than 0.165.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
print(rfImp2)
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
The importance score for V1 decreased in model 2.
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
print(rfImp3)
## Overall
## V1 4.91687329
## V2 6.52816504
## V3 0.58711552
## V4 7.04870917
## V5 2.03115561
## V6 0.14213148
## V7 0.10991985
## V8 -0.08405687
## V9 -0.01075028
## V10 0.09230576
## duplicate1 3.80068234
## duplicate2 1.87721959
The inclusion of another predictor also highly correlated with V1 decreased the importance score in model 3.
library(party)
## Warning: package 'party' was built under R version 4.4.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 4.4.3
## Loading required package: modeltools
## Warning: package 'modeltools' was built under R version 4.4.3
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.4.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.4.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.4.3
ctrl <- cforest_control(mtry = ncol(simulated) - 1)
cforest <- cforest(y ~ ., data = simulated, controls = ctrl)
cfImp <- varimp(cforest)
cfImp
## V1 V2 V3 V4 V5 V6
## 3.885759655 7.668577261 0.022784363 10.187144892 1.944941733 0.001131697
## V7 V8 V9 V10 duplicate1 duplicate2
## 0.046943026 -0.049653633 0.018987655 -0.013785512 5.639384800 0.147354677
Yes, these importances show the same pattern as the traditional random forest model. V1-5 have greater impact than the uninformative predictors V6-10. The highlight correlated variables have a higher importance score than the uninformative variables.
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
btree1 <- train(x = simulated[,-11], y = simulated$y, method = "bstTree")
btree1_vi <- varImp(btree1, scale = FALSE)
print(btree1_vi)
## loess r-squared variable importance
##
## Overall
## V4 0.289801
## duplicate1 0.277400
## V1 0.277043
## V2 0.260337
## duplicate2 0.238930
## V5 0.134314
## V3 0.090934
## V9 0.023943
## V10 0.021639
## V8 0.015209
## V6 0.008499
## V7 0.005977
cubist1 <- train(x = simulated[,-11], y = simulated$y, method = "cubist",
tuneLength = 10)
cubist1_vi <- varImp(cubist1, scale = FALSE)
print(cubist1_vi)
## cubist variable importance
##
## Overall
## V2 69.5
## V1 54.0
## V4 50.0
## V5 38.0
## V3 32.0
## duplicate2 25.0
## duplicate1 25.0
## V6 10.0
## V9 0.0
## V8 0.0
## V7 0.0
## V10 0.0
The variable importance pattern in the Cubist model seems to be different.The 5 informative predictors are the most important like in the previous models but some of the uninformative variables and duplicates had no importance to the model.
Use a simulation to show tree bias with different granularities.
library(rpart)
library(partykit)
## Warning: package 'partykit' was built under R version 4.4.3
## Loading required package: libcoin
## Warning: package 'libcoin' was built under R version 4.4.3
##
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
##
## cforest, ctree, ctree_control, edge_simple, mob, mob_control,
## node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
## node_terminal, varimp
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.3
x1 <- sample(1:10 /10, 150, replace = TRUE)
x2 <- sample(1:100 /100, 150, replace = TRUE)
x3 <- sample(1:1000 /1000, 150, replace = TRUE)
x4 <- sample(1:10000 /10000, 150, replace = TRUE)
y1 <- x1 + x2 + x3 + rnorm(150, mean=0, sd=5)
simData <- data.frame(x1, x2, x3, x4, y1)
tree_model_cont <- rpart(y1 ~ ., data = simData, method = "anova")
rpart.plot(tree_model_cont, type = 2, extra = 100, fallen.leaves = TRUE)
The predictor with the highest number of distinct values is X2 and gets split first.
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
The model on the right has both parameters set to 0.9. This means that the majority of the training set is used to build each tree (high bagging fraction) and each tree is of high-importance (high learning rate). This results in the initial predictors having more influence on the model.
The model on the left uses only a small fraction of the training set to build each tree and each tree is of low importance which decreases how much impact the learners have on the model. This leads to a more balanced variable importance profile.
I think the model on the left would more predictive of other samples. Since the model on the right has a large learning rate, there is a higher risk of overfitting.
Increasing interaction depth would result in more tree splits. This would increase complexity and cause more variables to have greater impact on either model. The slope of the predictor importance would decrease and the variable importance profile would be less steep.
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
# From textbook
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.4.3
data(ChemicalManufacturingProcess)
# Impute missing values via knn
library(impute)
chemicals <- impute.knn(as.matrix(ChemicalManufacturingProcess), rng.seed = 123)
chemicals <- as.data.frame(chemicals$data)
# Test/train split
training_chemicals <- createDataPartition(chemicals$Yield, p = 0.8, list = FALSE)
train_predictors_chemicals <- chemicals[training_chemicals,-1]
train_yield <- chemicals[training_chemicals, 1]
test_predictors_chemicals <- chemicals[-training_chemicals,-1]
test_yield <- chemicals[-training_chemicals, 1]
# Random forest model
chemicals_rf <- train(x = train_predictors_chemicals,
y = train_yield,
method = "rf",
preProc = c("center", "scale"),
tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
# Boosted tree
chemicals_boost <- train(x = train_predictors_chemicals,
y = train_yield,
method = "gbm",
trControl = trainControl(method = "cv", number = 5),
tuneLength = 10,
verbose = FALSE)
# cubist model
chem_cubist <- train(x = train_predictors_chemicals,
y = train_yield,
method = "cubist",
preProc = c("center", "scale"),
tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
rf_predict <- predict(chemicals_rf, test_predictors_chemicals)
yield_rf <- data.frame(obs = test_yield, pred = rf_predict)
print(defaultSummary(yield_rf))
## RMSE Rsquared MAE
## 1.2500682 0.5569329 0.9597679
boost_predict <- predict(chemicals_boost, test_predictors_chemicals)
yield_boosted <- data.frame(obs = test_yield, pred = boost_predict)
print(defaultSummary(yield_boosted))
## RMSE Rsquared MAE
## 1.1745569 0.5929193 0.9122479
cubist_predict <- predict(chem_cubist, test_predictors_chemicals)
yield_cubist <- data.frame(obs = test_yield, pred = cubist_predict)
print(defaultSummary(yield_cubist))
## RMSE Rsquared MAE
## 0.9977879 0.7017476 0.8410910
The random cubist model performed the best with the highest rsquared score and the lowest RMSE score.
varImp(chem_cubist, scale = FALSE)
## cubist variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 53.5
## ManufacturingProcess17 33.5
## ManufacturingProcess09 20.5
## ManufacturingProcess33 18.5
## BiologicalMaterial02 18.0
## BiologicalMaterial06 18.0
## ManufacturingProcess04 13.5
## BiologicalMaterial05 12.0
## ManufacturingProcess28 11.5
## ManufacturingProcess13 11.0
## ManufacturingProcess45 10.5
## BiologicalMaterial08 10.5
## BiologicalMaterial11 10.0
## ManufacturingProcess15 7.5
## ManufacturingProcess11 6.0
## ManufacturingProcess37 6.0
## ManufacturingProcess27 6.0
## BiologicalMaterial03 6.0
## BiologicalMaterial12 6.0
## BiologicalMaterial01 5.0
Manufacturing processes dominate the top 10 list (7/10). Like in the previous assignments, it should be noted that there are more manufacturing processes overall compared to biological material so it is not necessarily dominating by percentage.
The top manufacturing processes were 32,17,09, 33, 04, 28, and 13. The top biological material were 02,06, and 05.
In the linear model used in hw 7, the top manufacturing processes were 32,17,13, 09, 36, 06, and 33. The top biological material were 06,02, and 03.
In the SVM model used in hw 8, the top manufacturing processes were 32,36,13,17,09,31 and biological material 06,03,02,12.
The variables that appear in all 3 were manufacturing processs 32, 17, 09, 13 and biological material 02 and 06. In all tree models, MP32 was the top variable.
single_tree <- train(x = train_predictors_chemicals,
y = train_yield,
method = "rpart",
preProc = c("center", "scale"),
tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
rpart.plot(single_tree$finalModel)
This tree branches off from manufacturing process 32 which makes sense since it was the highest ranked variable. It is interesting that biological material 11 was included as a branch since this was not a top 10 variable in any of the models tested.