Fit a random forest model to all of the predictors, then estimate the variable importance scores:
set.seed(123)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##              Overall
## V1 8.630761154
## V2 6.378205752
## V3 0.737853335
## V4 7.545618226
## V5 2.169481327
## V6 0.097829221
## V7 0.100908417
## V8 -0.194478838
## V9 -0.093326598
## V10 -0.006066088
Did the random forest model significantly use the uninformative predictors (V6–V10)?
No. The random forest model did not make meaningful use of the uninformative predictors: the importance scores for V6–V10 are all close to zero (and several are negative), while the informative predictors V1, V2, and V4 score far higher.
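As a quick check (a minimal sketch, assuming rfImp1 from above with predictor names as row names; the object names informative/uninformative are new), the scores can be split into the two groups and summarized:
# Summarize importance for the informative (V1-V5) and uninformative (V6-V10) groups
informative   <- rfImp1[paste0("V", 1:5),  "Overall"]
uninformative <- rfImp1[paste0("V", 6:10), "Overall"]
summary(informative)
summary(uninformative)  # values hover around zero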
model4 <- cforest(y ~ ., data = simulated[, c(1:11)])
rfImp4 <- varimp(model4, conditional = TRUE)
rfImp4
##           V1           V2           V3           V4           V5           V6
## 5.378093181 5.364475885 0.027914682 6.628766556 1.304961634 -0.010635023
## V7 V8 V9 V10
## 0.011137367 -0.025004721 -0.009870118 -0.023312382
The conditional variable importance from this model shows results similar to the traditional random forest model above: V1, V2, and V4 are the most important variables, and the uninformative predictors remain near zero.
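For comparison (a minimal sketch, assuming model4 from above; rfImp4_uncond is a new object name), the unconditional permutation importance for the same cforest fit can be computed and sorted alongside the conditional scores:
# Traditional (unconditional) permutation importance for the cforest model
rfImp4_uncond <- varimp(model4, conditional = FALSE)
sort(rfImp4_uncond, decreasing = TRUE)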
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(123)
gbmTune <- train(y ~ ., data = simulated[, c(1:11)],
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)
rfImp5 <- varImp(gbmTune, scale = FALSE)
rfImp5
## gbm variable importance
##
## Overall
## V4 45297.7
## V1 39668.4
## V2 34199.2
## V5 18115.5
## V3 11953.5
## V7 1607.1
## V6 1461.4
## V9 1007.2
## V8 898.8
## V10 670.3
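The tuning parameter values that caret selected are not printed above; they can be pulled from the fitted object (a minimal sketch, assuming gbmTune from above):
# Tuning parameter combination chosen by resampling
gbmTune$bestTune
# Resampling profile across the full tuning grid
plot(gbmTune)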
set.seed(123)
cubistTuned <- train(y ~ ., data = simulated[, c(1:11)], method = "cubist")
rfImp6 <- varImp(cubistTuned, scale = FALSE)
rfImp6
## cubist variable importance
##
## Overall
## V1 72.0
## V2 54.5
## V4 49.0
## V3 42.0
## V5 40.0
## V6 11.0
## V8 0.0
## V10 0.0
## V7 0.0
## V9 0.0
The boosted tree and Cubist models produce similar rankings: V1, V2, and V4 are again the most important predictors, while the uninformative predictors receive little or no importance.
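To make the cross-model comparison easier to read (a minimal sketch, assuming rfImp1, rfImp5, and rfImp6 from above; predictorNames and importanceTable are new object names), the three sets of importance scores can be collected into a single table:
# Combine the Overall importance from the random forest, gbm, and Cubist models
predictorNames <- rownames(rfImp1)
importanceTable <- data.frame(
  randomForest = rfImp1$Overall,
  gbm          = rfImp5$importance[predictorNames, "Overall"],
  cubist       = rfImp6$importance[predictorNames, "Overall"],
  row.names    = predictorNames
)
importanceTable[order(-importanceTable$randomForest), ]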
set.seed(123)
v1 <- runif(500, 1, 250)
v2 <- runif(500, 1, 10)
v3 <- runif(500, 1, 500)
y <- v1 + v3
df <- data.frame(v1, v2, v3, y)
head(df)
##          v1       v2        v3        y
## 1 72.60680 4.182455 137.53774 210.1445
## 2 197.28798 4.297973 297.33960 494.6276
## 3 102.83525 3.583901 80.93222 183.7675
## 4 220.87133 1.719756 426.86169 647.7330
## 5 235.17635 4.289088 424.02184 659.1982
## 6 12.34357 2.602124 239.46552 251.8091
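The model that produced the importance scores printed below is not shown in this excerpt; a plausible reconstruction (a minimal sketch, assuming a random forest with permutation importance as in the earlier parts; rfSim is a hypothetical object name) is:
set.seed(123)
# Hypothetical reconstruction of the fit behind the importance scores below
rfSim <- randomForest(y ~ ., data = df, importance = TRUE, ntree = 1000)
varImp(rfSim, scale = FALSE)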
## Overall
## v1 45.694763
## v2 -2.104113
## v3 83.356770
The left plot has both tuning parameters set to 0.1. A low bagging fraction means each tree is trained on only 10% of the training data, which introduces more randomness and spreads the importance across a larger number of predictors. The right plot has both tuning parameters set to 0.9. A high bagging fraction means each tree is trained on 90% of the training data, so there is less randomness and the model concentrates its importance on a small number of predictors; the larger learning rate contributes to this as well.
I believe the model on the right would be more predictive of other samples: the larger learning rate lets the model focus on the most predictive variables, which should help when predicting new samples.
If the interaction depth were increased, I believe the importance would be spread across more predictors, since deeper trees can involve more variables in each boosting iteration; this would flatten the slope of the variable importance plots for both models.
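The plots under discussion are not reproduced here; a minimal sketch of how the two settings could be compared on the simulated data (using the gbm package with the 0.1/0.1 and 0.9/0.9 parameter pairs; the other settings and the object names gbmLow/gbmHigh are assumptions) is:
library(gbm)

set.seed(123)
# Left-plot settings: small learning rate and small bagging fraction
gbmLow <- gbm(y ~ ., data = simulated[, 1:11], distribution = "gaussian",
              n.trees = 1000, interaction.depth = 2,
              shrinkage = 0.1, bag.fraction = 0.1)

set.seed(123)
# Right-plot settings: large learning rate and large bagging fraction
gbmHigh <- gbm(y ~ ., data = simulated[, 1:11], distribution = "gaussian",
               n.trees = 1000, interaction.depth = 2,
               shrinkage = 0.9, bag.fraction = 0.9)

# Compare how the relative influence is spread across predictors
summary(gbmLow, plotit = FALSE)
summary(gbmHigh, plotit = FALSE)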
Refer to the earlier exercises describing the chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
missing <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(missing, ChemicalManufacturingProcess)
sum(is.na(Chemical))
## [1] 0
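The data-splitting step is not shown in this excerpt; the objects index, train_x, train_y, test_x, and test_y used below could be created with something like the following (a minimal sketch; the 80/20 split and the assumption that Yield is the first column are hypothetical):
set.seed(123)
# Hypothetical split: 'index' marks the training rows
index   <- createDataPartition(Chemical$Yield, p = 0.8, list = FALSE)
train_x <- Chemical[index, -1]
train_y <- Chemical$Yield[index]
test_x  <- Chemical[-index, -1]
test_y  <- Chemical$Yield[-index]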
set.seed(123)
rfModel <- randomForest(train_x, train_y,
                        importance = TRUE,
                        ntree = 1000)
rfPred <- predict(rfModel, test_x)
postResample(rfPred, test_y)
##      RMSE  Rsquared       MAE
## 1.2500628 0.5652829 0.9556864
set.seed(123)
baggedTree <- ipredbagg(train_y, train_x)
baggedPred <- predict(baggedTree, test_x)
postResample(baggedPred, test_y)
##      RMSE  Rsquared       MAE
## 1.3331638 0.4719081 0.9674425
set.seed(123)
cubistTuned <- train(train_x, train_y,
                     method = "cubist")
cubistPred <- predict(cubistTuned, test_x)
postResample(cubistPred, test_y)
##      RMSE  Rsquared       MAE
## 1.0585622 0.6721253 0.9127560
rbind(bagged = postResample(baggedPred, test_y),
      randomForest = postResample(rfPred, test_y),
      cubist = postResample(cubistPred, test_y))
##                  RMSE  Rsquared       MAE
## bagged 1.333164 0.4719081 0.9674425
## randomForest 1.250063 0.5652829 0.9556864
## cubist 1.058562 0.6721253 0.9127560
The Cubist model has the lowest RMSE (and the highest R-squared), so it gives the best resampling and test-set performance of the three tree-based models.
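The resampling performance of the tuned Cubist model can also be checked directly from the caret object (a minimal sketch, assuming cubistTuned from above):
# Cross-validated performance at the selected tuning parameters
getTrainPerf(cubistTuned)
cubistTuned$bestTune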
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
varImp(cubistTuned)
## cubist variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess17 100.00
## ManufacturingProcess32 95.83
## BiologicalMaterial06 44.79
## ManufacturingProcess13 41.67
## BiologicalMaterial12 36.46
## BiologicalMaterial02 30.21
## ManufacturingProcess09 26.04
## ManufacturingProcess04 21.88
## ManufacturingProcess45 20.83
## ManufacturingProcess37 20.83
## ManufacturingProcess33 17.71
## BiologicalMaterial08 16.67
## ManufacturingProcess39 14.58
## BiologicalMaterial05 13.54
## ManufacturingProcess27 12.50
## ManufacturingProcess11 11.46
## BiologicalMaterial03 11.46
## BiologicalMaterial11 10.42
## ManufacturingProcess15 10.42
## BiologicalMaterial04 10.42
ManufacturingProcess17 is the top contributor, followed closely by ManufacturingProcess32. Of the top 10 variables, only 3 are biological; the majority are manufacturing process variables, so the process variables dominate the list. These results are similar to those from the optimal linear and nonlinear models.
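The split between biological and process variables in the top 10 can be verified programmatically (a minimal sketch, assuming the varImp object from the tuned Cubist model; cubistImp and top10 are new object names):
cubistImp <- varImp(cubistTuned)$importance
# Ten highest-ranked predictors, counted by type
top10 <- rownames(cubistImp)[order(-cubistImp$Overall)][1:10]
table(ifelse(grepl("^Biological", top10), "Biological", "Process"))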
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(rpart)
rpartTree <- rpart(Yield ~ ., data = Chemical[index, ])
rpart.plot::rpart.plot(rpartTree)
This plot makes it easy to see which predictors drive the splits, the cutoff values used at each split, and the predicted yield in each terminal node, which clarifies how the top process and biological predictors relate to yield.
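Because the question asks about the distribution of yield in the terminal nodes, another option (a minimal sketch, assuming the partykit package is installed) is to convert the rpart tree and use partykit's default plot, which draws a boxplot of Yield in each terminal node:
library(partykit)
# Terminal nodes of the converted tree show boxplots of Yield
plot(as.party(rpartTree))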