Exercises from Kuhn and Johnson, *Applied Predictive Modeling*, Chapter 8: Regression Trees and Rule-Based Models. All R code is displayed at the end for clarity.
Recreate the simulated data from Exercise 7.2.
set.seed(200)
simulated = mlbench.friedman1( 200, sd = 1 )
simulated = cbind( simulated$x, simulated$y)
simulated = as.data.frame( simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
We apply the code suggested by the textbook.
model1 <- randomForest( y ~ . , data = simulated, importance = TRUE , ntree = 1000 )
rfImp1 = varImp( model1, scale = FALSE )
The variable importance plot below shows that variables V6-V10 have minimal importance in predicting y.
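The plot is built from the rfImp1 scores; a minimal sketch mirroring the appendix code:
df = tibble( varName = rownames(rfImp1), score = rfImp1$Overall )
df$varName = factor( df$varName, levels = df$varName[ order( df$score ) ] )   # order bars by score
df %>% ggplot( aes( x = varName, y = score ) ) +
  geom_bar( stat = "identity" ) +
  labs( title = "Variable Importance Plot - Model 1" )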
Now add an additional predictor that is highly correlated with one of the informative predictors. The correlation of the duplicate1 variable with V1 is 94.6%.
set.seed(2010)
simulated$duplicate1 = simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1 )
## [1] 0.9457663
The plot below shows that the duplicate1 variable is regarded as significant by the variable importance score.
When we add a second duplicate2 variable, V1 and its two correlates are all regarded as significant by the variable importance score as shown below.
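For reference, duplicate2 is constructed in the same way as duplicate1 (this code also appears in the appendix):
simulated$duplicate2 = simulated$V1 + rnorm(200) * 0.1   # V1 plus a small amount of noise
cor(simulated$duplicate2, simulated$V1 )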
## [1] 0.9429767
Plotting all three variable importance plots together, we conclude that:

- Each correlate of V1 is treated as significant, but less so than the original variable.
- The sum of the variable importances of V1 and its two correlates in Model 3 appears to add up to the variable importance of V1 in Model 1 (a quick check is sketched below). Thus, variable importance is not created from thin air when new variables are added; it is spread among the correlated predictors.
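A minimal sketch of that check, assuming rfImp3 holds the Model 3 importances computed in the same way as rfImp1 (see the appendix):
# Importance of V1 alone in Model 1
rfImp1["V1", "Overall"]
# Combined importance of V1 and its two correlates in Model 3
sum( rfImp3[ c("V1", "duplicate1", "duplicate2"), "Overall" ] )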
If we run the conditional random forest model using the same number of trees, the result is similar.
set.seed( 10203)
cond_forest1 = cforest( y ~ ., data = simulated[, 1:11], control = cforest_unbiased(ntree = 1000) )
cond_forest2 = cforest( y ~ ., data = simulated[, 1:12], control = cforest_unbiased(ntree = 1000) )
cond_forest3 = cforest( y ~ ., data = simulated[, 1:13], control = cforest_unbiased(ntree = 1000) )
For conditional inference trees as base learners in a random forest, the variable importance plots show the same pattern and order: V1 and its correlates spread the variable importance among themselves, while the relative importance of the other informative predictors remains the same.
set.seed( 10203)
cond_impVar1 = varimp( cond_forest1, conditional = TRUE )
cond_impVar2 = varimp( cond_forest2, conditional = TRUE )
cond_impVar3 = varimp( cond_forest3, conditional = TRUE )
cf_combined = tibble( varName = c(labels(cond_impVar1), labels(cond_impVar2), labels( cond_impVar3)) ,
score = c( cond_impVar1, cond_impVar2, cond_impVar3) ,
Model = c( rep("Model 1", length(cond_impVar1)),
rep("Model 2", length(cond_impVar2)),
rep("Model 3", length(cond_impVar3))
)
)
cdf3 = tibble( varName = labels(cond_impVar3), score = cond_impVar3 )
# Sort the variable name for the graph to show in score size order based on the third and biggest model.
cf_combined$varName = factor( cf_combined$varName, levels = cdf3$varName[ order( cdf3$score, decreasing = TRUE ) ] )
The variable importance plot using the conditional permutation importance method of Strobl shows some change in the rank of the informative predictors:
- V5 is more important than duplicate2 under the conditional permutation importance measure.
- V1 loses importance relative to the other informative predictors V4 and V2.

We next consider the same dataset and variable importance problem using gradient boosted and Cubist trees. The gbm library and variable importance measures are run with 2000 trees and default learning rates.
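The three boosted models are fit with the gbm library; a condensed sketch of the code that also appears in the appendix:
set.seed(10238)
gbmModel1 = gbm( y ~ . , data = simulated[, 1:11], distribution = "gaussian", n.trees = 2000 )
gbm_varimp1 = varImp( gbmModel1, numTrees = 2000 )
# Models 2 and 3 are fit the same way on simulated[, 1:12] and simulated[, 1:13],
# which add duplicate1 and then duplicate2 to the predictor set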
In the plot below, we consider the same 2 correlates of V1 and show the variable importance plots of the 3 models sorted by the largest model’s variable importance scores.
We observe results for gradient boosted tree variable importance similar to those for random forests:

- The correlates of V1 are less important; for GBM, the duplicates are much less useful than V1.
- V1 importance declines as additional correlates are introduced.

Lastly, we use the Cubist tree approach to fit models to the same 3 versions of the dataset. Here we configure the models to use 15 committees.
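The Cubist fits take the predictor columns and the response separately; a condensed sketch of the appendix code:
set.seed(10738)
cubistModel1 = cubist( x = simulated[, 1:10], y = simulated[, 11], committees = 15 )
cubist_varimp1 = caret::varImp( cubistModel1 )
# Models 2 and 3 use x = simulated[, c(1:10, 12)] and x = simulated[, c(1:10, 12, 13)],
# adding duplicate1 and then duplicate2 to the predictors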
The resulting variable importance plots for the Cubist models show that:

- V1-V5 are important. However, in the first model, the uninformative predictor V6 also shows some importance.
- V8 and V6 show some importance in model 3.
- The duplicate2 predictor shows no importance in model 3.
- V1 retains its importance even with the addition of correlates.

Use a simulation to show tree bias with different granularities.
We consider a random variable \(y = \sin(x) + \epsilon\) over a short range of
\[x \in \{ 0, \pi/6, \pi/3, \pi/2 \}\]
where \(\epsilon\) is uniform white noise.
By altering the variance of the noise \(\epsilon\), thereby creating low- and high-volatility versions of \(y\), \(y_{low}\) and \(y_{hi}\), we can study how noise affects the variable importance of uninformative predictors.
We define an uninformative predictor \(w\) to be an uncorrelated normal variable with parameters \(N( \mu(y), 5 )\). We then try to predict \(y_{low}\) in terms of \(x\) and \(w\) using a CART tree model, repeat the process with \(y_{hi}\) in place of \(y_{low}\), and compare the resulting variable importances.
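The simulated data are constructed as follows (this is the appendix code, with comments added); the noise scale of 2.0 versus 1.0 is what distinguishes \(y_{hi}\) from \(y_{low}\):
set.seed( 102303)
x = pi/6 * seq(0, 3)                               # the four x values: 0, pi/6, pi/3, pi/2
rep_x = rep(x, 50)                                 # 200 observations in total
func_x     = sin(rep_x) + 2.0 * runif(200)         # high-noise response y_hi
func_x_low = sin(rep_x) + 1.0 * runif(200)         # low-noise response y_low
white_noise = rnorm(200, mean = mean(sin(rep_x)), sd = 5)   # uninformative predictor w
df  = tibble( y_hi  = func_x,     x = rep_x, w = white_noise )
df2 = tibble( y_low = func_x_low, x = rep_x, w = white_noise )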
Next, we build a tree model for each dataset. The \(y_{hi}\) data frame shows that the uninformative predictor \(w\) is more important.
rpart = rpart(y_hi ~ . , data = df)
varImp(rpart)
## Overall
## w 0.7388358
## x 0.3167823
In contrast, the \(y_{low}\) data frame shows that the informative predictor \(x\) is more important, although \(w\) still has a small impact.
rpart2 = rpart(y_low ~ . , data = df2)
varImp(rpart2)
## Overall
## w 0.05436432
## x 0.84365298
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient.
Although the optimal values of these parameters should be obtained through the tuning process - it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data.
The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9.
The model on the right focuses on the first few predictors because it is a greedier algorithm: with a large bagging fraction and learning rate it learns from nearly all of the data at each iteration. Thus, if multiple correlated predictors explain the response, the strongest predictor will be preferred in the tree model (when the depth is shallow), and the incremental predictive power of the other predictors will be omitted. A large bagging fraction also means that each tree sees nearly the same sample, so predictors that matter only for infrequent observations are effectively ignored during model building.
Model performance should be better with a small bagging fraction and a lower learning rate. Thus, the model on the left (10% learning rate and bagging fraction) should perform better, provided that a sufficient number of trees is allowed, say n.trees = 1000.
To answer this question, we fit a gbm model to the solubility data with 2000 trees and tree depth of 1. Then we vary the bagging fraction and learning rate from 0.1 to 0.9. Afterwards, we compare their variable importance plots to those obtained when changing the tree depth to 7.
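A condensed sketch of one of these fits (the full set of four fits appears in the appendix); only bag.fraction, shrinkage, and interaction.depth change across the variants:
data(solubility)    # from the AppliedPredictiveModeling package
set.seed(1029)
gbmModel10 = gbm.fit( x = solTrainXtrans, y = solTrainY,
                      distribution = "gaussian",
                      n.trees = 2000,
                      interaction.depth = 1,     # changed to 7 for the depth comparison
                      bag.fraction = 0.1,        # changed to 0.9 for the greedy variant
                      shrinkage = 0.1,           # changed to 0.9 for the greedy variant
                      verbose = FALSE )
vImpModel10 = varImp( gbmModel10, numTrees = 2000 )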
The baseline models are plotted below. We are able to replicate the pattern in Figure 8.24, but the exact order of predictor importance changes. For example, the textbook produces a top 3 ranking of MolWeight, Hydrophilic Factor, and NumAromaticBonds, whereas we produce a top 3 ranking of MolWeight, NumCarbon, and SurfaceArea1. These differences do not appear to be significant; they are likely explained by differences in training parameters or the random seed.
The plots below compare the impact of increasing tree depth from 1 to 7 for different learning and bagging settings. In the left plot the bagging fraction and learning rate are small (10%); in the right plot they are large (90%).
The left plot shows that increasing tree depth allows more variables to be used per tree, so variable importance increases for more of the informative predictors. We conclude that a deeper tree flattens the variable importance curve.
The right plot shows that increasing tree depth reorders the importance of variables. But we cannot conclude that it flattens the variable importance curve based on this plot.
Instead, for the 90% learning rate and bagging fraction, we plot the variable importance for depth 1 and depth 7 separately, sorting each plot by its own variable importance scores and using the same y-scale in both plots.
The top two variables in the right plot (depth 7) have lower scores; their importance is reassigned to the variables ranked 4-6 and to the binary predictors.
We conclude that increasing depth from 1 to 7 does flatten the curve both when bagging fraction and learning rate are high (90%) and when those rates are low (10%).
Let's tune and train a CART tree model, a random forest of CART trees, a stochastic gradient boosted tree, and a Cubist tree. We first repeat the data imputation and splitting logic from Exercises 6.3 and 7.5.
For ease of replication, we reproduce the entire code block and the preceding text in the subsection below entitled Preprocessing the data from Assignment 8.
To re-use the same data splitting, data imputation and pre-processing steps as before in Exercise 6.3, we replicate and run the code below and briefly recall the steps taken to obtain the prepared datasets. However, we omit the detailed analysis and diagnostic plots used to justify those steps.
- Rename the lengthy variable names:
  - predictors beginning with BiologicalMaterial are shortened to B
  - predictors beginning with ManufacturingProcess are shortened to M
- Split the data into a 75/25 train/test split based on stratified random sampling, using the response Yield to define 5 quantiles.
- Drop near-zero-variance predictors: B07 is dropped from both the train and test data sets.
- Remove outlier zero values for predictors where all other values are positive, replacing the zeros with the median of the population.
- Center and scale the predictors.
- Use knn imputation to fill in missing values in the data set.
data("ChemicalManufacturingProcess")
chemX = ChemicalManufacturingProcess[2:58]
chemY = ChemicalManufacturingProcess[1]
chemX %>%
rename_with(~ str_replace(., "ManufacturingProcess", "M"), starts_with("Manufacturing") ) %>%
rename_with(~ str_replace(., "BiologicalMaterial", "B"), starts_with("Biological") ) -> chemX3set.seed(19372)
train_indices = createDataPartition(chemY$Yield , times = 1, list = FALSE, p = 0.75 )chemX_train_bv %>% mutate(
across(c(M18, M20, M16, M26, M25, M27, M31,
M29, M43, M30, M01, M42, M44, M45,
M39, M38 ) , # Run a function across the columns listed in across()
~if_else( condition = ( .x == 0 ) , # anonymous function applied to each column in across() .x refers to current column argument
true = median( . ) , # we use median without removal of NA
false = as.numeric(.) ) ) # keep the existing value if it is nonzero
) -> chemX_train_nz
# Build the caret function to preprocess the Chemical data and impute missing values.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(chemX_train_nz, method = c("center", "scale", "knnImpute"))
# Becomes the source data for the model building
chemX_train_pp = predict( preProcFunc, chemX_train_nz )
# Becomes the final version of test data for prediction
chemX_test_pp = predict( preProcFunc, chemX_test_nz )
Using the rpart library and its training methods, we consider the best model using RMSE as the metric and the best selectionFunction.
set.seed(10393)
library(rpart)
cartGrid = expand.grid( maxdepth = seq( 1, 15, by = 1 ) )
#cartGrid = data.frame( maxdepth = seq(1, 15, by = 1), .model = TRUE )
#cartControl <- trainControl(method = "cv", number = 10, selectionFunction = "best")
cartControl <- trainControl(method = "boot", number = 30, selectionFunction = "best")
(rpartTune = train( x= chemX_train_pp,
y = chemY_train,
method = "rpart2" ,
metric = "RMSE" ,
tuneGrid = cartGrid ,
trControl = cartControl ) )
## CART
##
## 132 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (30 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 1.531248 0.2883396 1.220869
## 2 1.540336 0.3066603 1.219017
## 3 1.545803 0.3158142 1.222760
## 4 1.531157 0.3327614 1.204190
## 5 1.521892 0.3447464 1.192122
## 6 1.517168 0.3517898 1.190026
## 7 1.516941 0.3572683 1.188099
## 8 1.525483 0.3569762 1.196125
## 9 1.540100 0.3493954 1.207663
## 10 1.547503 0.3460554 1.213469
## 11 1.546986 0.3463922 1.213032
## 12 1.547563 0.3461687 1.211958
## 13 1.547563 0.3461687 1.211958
## 14 1.547563 0.3461687 1.211958
## 15 1.547563 0.3461687 1.211958
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 7.
We train a random forest with the rf method, again using the best selectionFunction on RMSE.
set.seed(1929)
randomForestControl <- trainControl(method = "boot", number = 30, selectionFunction = "best")
(randomForestTune = caret::train( x = chemX_train_pp,
y = chemY_train,
method = "rf",
tuneGrid = expand.grid( mtry = seq( 2, 30, by = 2) ) ,
metric = "RMSE" ,
trControl = randomForestControl ,
ntree = 1000 ) )
## Random Forest
##
## 132 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (30 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.331580 0.5744982 1.0704786
## 4 1.280646 0.5915410 1.0243153
## 6 1.257237 0.5996109 1.0058057
## 8 1.247368 0.6004864 0.9932623
## 10 1.242392 0.6006760 0.9880305
## 12 1.235027 0.6014757 0.9794181
## 14 1.233911 0.5994134 0.9773687
## 16 1.229192 0.6001594 0.9723560
## 18 1.233912 0.5943630 0.9752514
## 20 1.230791 0.5954681 0.9711008
## 22 1.231127 0.5923629 0.9699665
## 24 1.233065 0.5899367 0.9715160
## 26 1.234727 0.5872798 0.9717896
## 28 1.232454 0.5869972 0.9683762
## 30 1.233838 0.5850364 0.9693608
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 16.
We run the stochastic gradient boosted tree using the gbm library with the best selectionFunction on RMSE.
set.seed( 1027)
gbmControl <- trainControl(method = "boot", number = 30, selectionFunction = "best")
(gbmTune = caret::train( x = chemX_train_pp,
y = chemY_train,
method = "gbm",
tuneGrid = expand.grid( n.trees = c(300, 500),
shrinkage = c( 0.005, 0.02, 0.1 , 0.15 ) ,
interaction.depth = c( 1, 5, 7) ,
n.minobsinnode = 6
) ,
verbose = FALSE,
metric = "RMSE" ,
trControl = gbmControl ) )
## Stochastic Gradient Boosting
##
## 132 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (30 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.005 1 300 1.415166 0.4720228 1.1299013
## 0.005 1 500 1.330934 0.4975995 1.0525428
## 0.005 5 300 1.298819 0.5259602 1.0259547
## 0.005 5 500 1.240000 0.5420020 0.9752577
## 0.005 7 300 1.292105 0.5274668 1.0199990
## 0.005 7 500 1.231573 0.5460870 0.9674665
## 0.020 1 300 1.250333 0.5286831 0.9800080
## 0.020 1 500 1.233148 0.5383720 0.9647773
## 0.020 5 300 1.196448 0.5624542 0.9386733
## 0.020 5 500 1.184833 0.5708035 0.9290373
## 0.020 7 300 1.197474 0.5619141 0.9380596
## 0.020 7 500 1.185182 0.5702181 0.9280550
## 0.100 1 300 1.237243 0.5399428 0.9689625
## 0.100 1 500 1.241025 0.5393010 0.9722572
## 0.100 5 300 1.197560 0.5634529 0.9439971
## 0.100 5 500 1.197152 0.5639139 0.9438898
## 0.100 7 300 1.199562 0.5612095 0.9374184
## 0.100 7 500 1.199081 0.5616638 0.9372710
## 0.150 1 300 1.263423 0.5224939 0.9855267
## 0.150 1 500 1.267867 0.5222623 0.9899760
## 0.150 5 300 1.203555 0.5563481 0.9554399
## 0.150 5 500 1.203216 0.5566581 0.9551805
## 0.150 7 300 1.230460 0.5429069 0.9687746
## 0.150 7 500 1.229839 0.5434100 0.9683315
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 6
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 500, interaction.depth =
## 5, shrinkage = 0.02 and n.minobsinnode = 6.
Now we train a Cubist tree using the best selectionFunction on RMSE.
set.seed( 1027)
cubistControl <- trainControl(method = "boot", number = 30, selectionFunction = "best")
(cubistTune = caret::train( x = chemX_train_pp,
y = chemY_train,
method = "cubist",
tuneGrid = expand.grid( committees = c( 1, 10, 20, 50 ) ,
neighbors = c( 0, 1, 5, 8)
) ,
verbose = FALSE,
metric = "RMSE" ,
trControl = cubistControl ) )
## Cubist
##
## 132 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (30 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.672066 0.3667729 1.2738195
## 1 1 1.640418 0.3993203 1.2220316
## 1 5 1.632140 0.3906536 1.2385862
## 1 8 1.639775 0.3857359 1.2487970
## 10 0 1.280367 0.5140740 1.0068784
## 10 1 1.221138 0.5555331 0.9461957
## 10 5 1.242895 0.5402182 0.9709419
## 10 8 1.251116 0.5349056 0.9803797
## 20 0 1.217417 0.5528367 0.9628783
## 20 1 1.147448 0.6005078 0.8840467
## 20 5 1.178367 0.5795569 0.9216508
## 20 8 1.186654 0.5738764 0.9316537
## 50 0 1.177035 0.5758809 0.9347189
## 50 1 1.106673 0.6223372 0.8561449
## 50 5 1.138743 0.6018281 0.8931441
## 50 8 1.147362 0.5962340 0.9040053
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 50 and neighbors = 1.
Based on the above results, we summarize each model's training and test RMSE and \(R^2\) in the table below. \(\color{red}{\text{We also include the results from Assignment 7 (the Partial Least Squares Model)}}\) \(\color{red}{\text{and Assignment 8 (MARS).}}\)
# Make a list of model results collect them from caret for both training and test data and compare them jointly in a kable()
model_list = list( rpartTune, randomForestTune, gbmTune, cubistTune)
model_stats = data.frame()
for( modelObj in model_list )
{
pred_data <- predict( modelObj, newdata=chemX_test_pp )
output <- data.frame( modelName = modelObj$method ,
trainRMSE = getTrainPerf(modelObj)[1,1] ,
trainR2 = getTrainPerf(modelObj)[1,2] ,
testRMSE = caret::RMSE( pred_data, chemY_test) ,
testR2 = caret::R2( pred_data, chemY_test) ,
modelType = "Tree/Rule"
)
model_stats <- rbind(model_stats, output)
}
Model Results Comparison
| modelName | trainRMSE | trainR2 | testRMSE | testR2 | modelType |
|---|---|---|---|---|---|
| cubist | 1.11 | 0.62 | 0.94 | 0.75 | Tree/Rule |
| gbm | 1.18 | 0.57 | 1.03 | 0.71 | Tree/Rule |
| rf | 1.23 | 0.60 | 1.15 | 0.65 | Tree/Rule |
| Mars | 1.22 | 0.58 | 1.20 | 0.60 | Non-Linear Regression |
| pls | 1.20 | 0.57 | 1.23 | 0.59 | Linear Regression |
| rpart2 | 1.52 | 0.36 | 1.56 | 0.38 | Tree/Rule |
We conclude that the best model is cubist. The order of model preference from best to worst is Cubist, stochastic gradient boosted trees, random forests, MARS, partial least squares and CART trees.
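The Cubist variable importance plot discussed next is built from the tuned model; a lightly adapted sketch of the appendix code (tibble::rownames_to_column replaces the deprecated add_rownames):
cubist_varimp = varImp( cubistTune, scale = TRUE )
cubist_varimp$importance %>%
  tibble::rownames_to_column( var = "Var" ) %>%   # predictor names as a column
  arrange( desc(Overall) ) %>%
  top_n( n = 20, wt = Overall ) %>%
  ggplot( aes( x = reorder(Var, Overall), y = Overall ) ) +
  geom_bar( stat = "identity" ) + coord_flip() +
  labs( title = "Cubist Top 20 Variables", x = "Variable", y = "Importance" )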
The top 3 predictors are manufacturing variables: M32, M17, and M13. Seven of the top 10 predictors are manufacturing variables. This is promising because it suggests that improvements to operational processes will increase yield. The top 3 biological predictors are B03 at rank 4, B12 at rank 7, and B02 at rank 10.
While the exact fraction of biological predictors in the top 10 will vary with the model building process, the conclusion that biological predictors are less important has now been validated across 3 types of models: Linear regression, non-linear regression and tree-based models.
When we compare top 10 variables from pls, MARS and Cubist, we see some commonalities and differences.
Predictor Importance Across 3 Model Types
| rank | PLS | MARS | Cubist |
|---|---|---|---|
| 1 | M32 | M32 | M32 |
| 2 | M13 | M09 | M17 |
| 3 | M36 |  | M13 |
| 4 | M09 |  | B03 |
| 5 | M17 |  | M29 |
| 6 | B02 |  | M39 |
| 7 | M33 |  | B12 |
| 8 | M06 |  | M33 |
| 9 | B06 |  | M09 |
| 10 | B03 |  | B02 |
- M32 is the top predictor across all models. This should be a robust finding, and efforts to improve it should be the top recommendation.
- M13 is a top 3 predictor in two of the 3 model categories and should be improved.
- M17 is a top 5 and a top 2 predictor in two of the 3 model categories and should be improved.

We plot the optimal CART tree from 8.7a using the rpart.plot and partykit packages, which produce related but complementary views of the optimal tree (maxdepth 7). We gain several insights from inspecting both diagrams below.
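The plotting code, condensed from the appendix:
library(rpart.plot)
library(partykit)
rpart.plot( rpartTune$finalModel )            # compact CART diagram with fitted values
rpart_pt = as.party( rpartTune$finalModel )   # convert to a partykit tree for a node-level view
plot( rpart_pt, gp = gpar(fontsize = 7 , col = "blue"),
      inner_panel = node_inner( rpart_pt, fill = c("lightgreen", "tan" ) ) )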
M32 is a crucial predictor to manage in the manufacturing process. If we can make M32 go down the right branch at node 1 (i.e., exceed its threshold of 0.195), then the worst case (at node 14) exceeds a yield of 39. This alone boosts the conditional mean yield to 42, a net increase of 3 over the alternative branch.
B11 is a crucial predictor both for avoiding extremely low yields and for achieving extremely high yields. It is relevant at nodes 2 and 17. At node 2, we want to avoid the worst-case yields (nodes 4 and 5) of 37 and 39. At node 17, we want a median yield above 43. This means that the biological ingredient associated with B11 needs to be of high quality. We note that the role of B11 is somewhat inconsistent with the model findings of PLS, MARS, and Cubist.
We conclude that, based on the rpart model, these are the key recommendations to business managers for improving yields or investigating the underlying costs and benefits.
We summarize all the R code used in this project in this appendix for ease of reading.
knitr::opts_chunk$set(echo = FALSE)
library(knitr)
library(tidyverse)
library(kableExtra)
library(cowplot)
library(skimr)
library(caret)
library(randomForest)
library(party)
library(gbm)
library(Cubist)
library(rpart)
library(AppliedPredictiveModeling)
library(mlbench) # data for Friedman non-linear dataset
.code-bg {
background-color: lightgray;
}
set.seed(200)
simulated = mlbench.friedman1( 200, sd = 1 )
simulated = cbind( simulated$x, simulated$y)
simulated = as.data.frame( simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest( y ~ . , data = simulated, importance = TRUE , ntree = 1000 )
rfImp1 = varImp( model1, scale = FALSE )
df = tibble( varName = rownames(rfImp1), score = rfImp1$Overall)
df$varName = factor( df$varName, levels = df$varName[ order( df$score)])
df %>% ggplot(aes( x = varName, y = score ) ) + geom_bar(stat = "identity") + labs(title = "Variable Importance Plot - Model 1")
set.seed(2010)
simulated$duplicate1 = simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1 )
model2 <- randomForest( y ~ . , data = simulated, importance = TRUE , ntree = 1000 )
rfImp2 = varImp( model2, scale = FALSE )
df2 = tibble( varName = rownames(rfImp2), score = rfImp2$Overall)
df2$varName = factor( df2$varName, levels = df2$varName[ order( df2$score)])
df2 %>% ggplot(aes( x = varName, y = score ) ) + geom_bar(stat = "identity") + labs(title = "Variable Importance Plot - Model 2")
simulated$duplicate2 = simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate2, simulated$V1 )
model3 <- randomForest( y ~ . , data = simulated, importance = TRUE , ntree = 1000 )
rfImp3 = varImp( model3, scale = FALSE )
df3 = tibble( varName = rownames(rfImp3), score = rfImp3$Overall)
df3$varName = factor( df3$varName, levels = df3$varName[ order( df3$score)])
df3 %>% ggplot(aes( x = varName, y = score ) ) + geom_bar(stat = "identity") + labs(title = "Variable Importance Plot - Model 3")
df_combined = rbind( df, df2, df3)
df_combined$Model = c( rep("Model1", nrow(df)), rep("Model2", nrow(df2)), rep("Model3", nrow(df3)))
df_combined %>% ggplot(aes( x = varName, y = score ) ) +
geom_bar(stat = "identity") +
facet_wrap(~Model, ncol = 1) +
labs(title = "Variable Importance Plot with Extra Correlated Predictors")
set.seed( 10203)
cond_forest1 = cforest( y ~ ., data = simulated[, 1:11], control = cforest_unbiased(ntree = 1000) )
cond_forest2 = cforest( y ~ ., data = simulated[, 1:12], control = cforest_unbiased(ntree = 1000) )
cond_forest3 = cforest( y ~ ., data = simulated[, 1:13], control = cforest_unbiased(ntree = 1000) )
cond_impVar1 = varimp( cond_forest1)
cond_impVar2 = varimp( cond_forest2)
cond_impVar3 = varimp( cond_forest3)
cf_combined = tibble( varName = c(labels(cond_impVar1), labels(cond_impVar2), labels( cond_impVar3)) ,
score = c( cond_impVar1, cond_impVar2, cond_impVar3) ,
Model = c( rep("Model 1", length(cond_impVar1)),
rep("Model 2", length(cond_impVar2)),
rep("Model 3", length(cond_impVar3))
)
)
cdf3 = tibble( varName = labels(cond_impVar3), score = cond_impVar3 )
# Sort the variable name for the graph to show in score size order based on the third and biggest model.
cf_combined$varName = factor( cf_combined$varName, levels = cdf3$varName[ order( cdf3$score, decreasing = TRUE ) ] )
cf_combined %>% ggplot(aes( x = varName, y = score ) ) +
geom_bar(stat = "identity") +
facet_wrap(~Model, ncol = 1) +
labs(title = "Variable Importance Plot (Conditional Random Forest) w Correlates")
set.seed( 10203)
cond_impVar1 = varimp( cond_forest1, conditional = TRUE )
cond_impVar2 = varimp( cond_forest2, conditional = TRUE )
cond_impVar3 = varimp( cond_forest3, conditional = TRUE )
cf_combined = tibble( varName = c(labels(cond_impVar1), labels(cond_impVar2), labels( cond_impVar3)) ,
score = c( cond_impVar1, cond_impVar2, cond_impVar3) ,
Model = c( rep("Model 1", length(cond_impVar1)),
rep("Model 2", length(cond_impVar2)),
rep("Model 3", length(cond_impVar3))
)
)
cdf3 = tibble( varName = labels(cond_impVar3), score = cond_impVar3 )
# Sort the variable name for the graph to show in score size order based on the third and biggest model.
cf_combined$varName = factor( cf_combined$varName, levels = cdf3$varName[ order( cdf3$score, decreasing = TRUE ) ] )
cf_combined %>% ggplot(aes( x = varName, y = score ) ) +
geom_bar(stat = "identity") +
facet_wrap(~Model, ncol = 1) +
labs(title = "Variable Importance Plot (Conditional Random Forest) STROBL metric")
set.seed(10238)
gbmModel1 = gbm( y ~ . , data = simulated[, 1:11], distribution = "gaussian" , n.trees = 2000)
gbm_varimp1 = varImp(gbmModel1, numTrees = 2000 )
gbmModel2 = gbm( y ~ . , data = simulated[, 1:12], distribution = "gaussian" , n.trees = 2000)
gbm_varimp2 = varImp(gbmModel2, numTrees = 2000 )
gbmModel3 = gbm( y ~ . , data = simulated[, 1:13], distribution = "gaussian" , n.trees = 2000)
gbm_varimp3 = varImp(gbmModel3, numTrees = 2000 )
gbmdf3 = tibble( varName = labels(gbm_varimp3)[[1]] , score = gbm_varimp3$Overall )
gbm_combined = tibble( varName = c(labels(gbm_varimp1)[[1]],
labels(gbm_varimp2)[[1]],
labels(gbm_varimp3)[[1]] ) ,
score = c( gbm_varimp1$Overall, gbm_varimp2$Overall, gbm_varimp3$Overall) ,
Model = c( rep("Model 1", length(gbm_varimp1$Overall)),
rep("Model 2", length(gbm_varimp2$Overall)),
rep("Model 3", length(gbm_varimp3$Overall)) ) )
gbm_combined$varName = factor( gbm_combined$varName, levels = gbmdf3$varName[ order( gbmdf3$score, decreasing = TRUE ) ] )
gbm_combined %>% ggplot(aes( x = varName, y = score ) ) +
geom_bar(stat = "identity") +
facet_wrap(~Model, ncol = 1) +
labs(title = "Variable Importance Plot (Gradient Boosted Trees) Correlated Predictors")
set.seed(10738)
cubistModel1 = cubist( x=simulated[, 1:10], y = simulated[,11] , committees = 15 )
cubist_varimp1 = caret::varImp(cubistModel1 )
cubistModel2 = cubist( x=simulated[, c(1:10,12)], y = simulated[,11], committees = 15 )
cubist_varimp2 = caret::varImp(cubistModel2 )
cubistModel3 = cubist( x=simulated[, c(1:10,12,13)], y = simulated[,11] , committees = 15 )
cubist_varimp3 = caret::varImp(cubistModel3 )
cubistdf3 = tibble( varName = labels(cubist_varimp3)[[1]] , score = cubist_varimp3$Overall )
cubist_combined = tibble( varName = c(labels(cubist_varimp1)[[1]],
labels(cubist_varimp2)[[1]],
labels(cubist_varimp3)[[1]] ) ,
score = c( cubist_varimp1$Overall, cubist_varimp2$Overall, cubist_varimp3$Overall) ,
Model = c( rep("Model 1", length(cubist_varimp1$Overall)),
rep("Model 2", length(cubist_varimp2$Overall)),
rep("Model 3", length(cubist_varimp3$Overall)) ) )
cubist_combined$varName = factor( cubist_combined$varName, levels = cubistdf3$varName[ order( cubistdf3$score, decreasing = TRUE ) ] )
cubist_combined %>% ggplot(aes( x = varName, y = score ) ) +
geom_bar(stat = "identity") +
facet_wrap(~Model, ncol = 1) +
labs(title = "Variable Importance Plot (Cubist) Correlated Predictors")
set.seed( 102303)
x = pi/6 * seq(0,3)
rep_x = rep(x, 50 )
func_x = sin(rep_x) + 2.0 * runif(200)
func_x_low = sin(rep_x) + 1.0 * runif(200)
white_noise = rnorm(200, mean = mean(sin(rep_x)), sd = 5 )
df = tibble( y_hi = func_x, x = rep_x, w = white_noise )
df2 = tibble( y_low = func_x_low, x = rep_x, w = white_noise )
pairs(df)
pairs(df2)
rpart = rpart(y_hi ~ . , data = df)
varImp(rpart)
rpart2 = rpart(y_low ~ . , data = df2)
varImp(rpart2)
# response 8.3c
# ----------------------------
data(solubility)
set.seed(1029)
gbmModel10 = gbm.fit( x=solTrainXtrans, y = solTrainY,
distribution = "gaussian" ,
n.trees = 2000,
interaction.depth = 1,
bag.fraction = 0.1,
shrinkage = 0.1 ,
verbose = FALSE )
vImpModel10 = varImp(gbmModel10, numTrees = 2000 )
set.seed(1029)
gbmModel90 = gbm.fit( x=solTrainXtrans, y = solTrainY,
distribution = "gaussian" ,
n.trees = 2000,
interaction.depth = 1,
bag.fraction = 0.9,
shrinkage = 0.9 ,
verbose = FALSE )
vImpModel90 = varImp(gbmModel90, numTrees = 2000 )
vImpModel10 %>% arrange(desc(Overall)) %>% top_n(n=30) %>%
add_rownames(var = "Var") %>%
ggplot(aes(x=reorder(Var, Overall) ,y=Overall)) +
geom_bar(stat="identity") + coord_flip() +
labs(title="Learning .1 ,Bag Frac .1",
x = "Variable",
y = "Importance") -> p10
vImpModel90 %>% arrange(desc(Overall)) %>% top_n(n=30) %>%
add_rownames(var = "Var") %>%
ggplot(aes(x=reorder(Var, Overall) ,y=Overall)) +
geom_bar(stat="identity") + coord_flip() +
labs(title="Learning .9 ,Bag Frac .9",
x = "Variable",
y = "Importance") -> p90
plot_grid(p10, p90)
data(solubility)
set.seed(1029)
gbmModel10d7 = gbm.fit( x=solTrainXtrans, y = solTrainY,
distribution = "gaussian" ,
n.trees = 2000,
interaction.depth = 7,
bag.fraction = 0.1,
shrinkage = 0.1 ,
verbose = FALSE )
vImpModel10d7 = varImp(gbmModel10d7, numTrees = 2000 )
set.seed(1029)
gbmModel90d7 = gbm.fit( x=solTrainXtrans, y = solTrainY,
distribution = "gaussian" ,
n.trees = 2000,
interaction.depth = 7,
bag.fraction = 0.9,
shrinkage = 0.9 ,
verbose = FALSE )
vImpModel90d7 = varImp(gbmModel90d7, numTrees = 2000 )
vImpModel10 %>% add_rownames(var = "Var") -> vim10
vImpModel90 %>% add_rownames(var = "Var") -> vim90
vImpModel90d7 %>% add_rownames( var= "Var") -> vim90d7
vImpModel10d7 %>% add_rownames( var= "Var") -> vim10d7
vim10 %>% inner_join(vim10d7, by = "Var") %>% rename(Depth7 = Overall.y, Depth1 = Overall.x) -> vi10
vim90 %>% inner_join(vim90d7, by = "Var") %>% rename(Depth7 = Overall.y, Depth1 = Overall.x) -> vi90
vi10$Var = factor( vi10$Var, levels = vi10$Var[ order( vi10$Depth1, decreasing = FALSE ) ] )
vi90$Var = factor( vi90$Var, levels = vi90$Var[ order( vi90$Depth1, decreasing = FALSE ) ] )
vi10 %>%
arrange(desc(Depth1)) %>%
top_n(n=25) %>%
pivot_longer(!Var, names_to = "Depth", values_to = "Score") %>%
ggplot(aes(x=Var, y = Score , fill = Depth )) +
geom_bar(stat= "identity", alpha = 0.8 , position = "dodge") +
coord_flip() +
theme(legend.position = c(.8, 0.2)) +
labs(title = "Increased tree depth flattens slope",
subtitle = "Bagging 10%, Learning 10%",
y = "Importance") -> p10d7
vi90 %>%
arrange(desc(Depth1)) %>%
top_n(n=25) %>%
pivot_longer(!Var, names_to = "Depth", values_to = "Score") %>%
ggplot(aes(x=Var, y = Score , fill = Depth )) +
geom_bar(stat= "identity", alpha = 0.8 , position = "dodge") +
coord_flip() +
theme(legend.position = c(.8, 0.2)) +
labs(title = "Increased tree depth flattens slope",
subtitle = "Bagging 90%, Learning 90%",
y = "Importance") -> p90d7
plot_grid(p10d7, p90d7)
vi90$Var = factor( vi90$Var, levels = vi90$Var[ order( vi90$Depth7, decreasing = FALSE ) ] )
vi90 %>%
arrange(desc(Depth7)) %>%
top_n(n=25) %>%
ggplot(aes(x=Var, y = Depth7 )) +
geom_bar(stat= "identity", alpha = 0.8 , position = "dodge") +
coord_flip() +
scale_y_continuous(limits = c(0, 1600)) +
labs(title = "Depth 7",
subtitle = "Bagging 90%, Learning 90%",
y = "Importance") -> p90d7x
vi90$Var = factor( vi90$Var, levels = vi90$Var[ order( vi90$Depth1, decreasing = FALSE ) ] )
vi90 %>%
arrange(desc(Depth1)) %>%
top_n(n=25) %>%
ggplot(aes(x=Var, y = Depth1 )) +
geom_bar(stat= "identity", alpha = 0.8 , position = "dodge") +
coord_flip() +
scale_y_continuous(limits = c(0, 1600)) +
labs(title = "Depth 1",
subtitle = "Bagging 90%, Learning 90%",
y = "Importance") -> p90d1x
plot_grid(p90d1x, p90d7x)
data("ChemicalManufacturingProcess")
chemX = ChemicalManufacturingProcess[2:58]
chemY = ChemicalManufacturingProcess[1]
chemX %>%
rename_with(~ str_replace(., "ManufacturingProcess", "M"), starts_with("Manufacturing") ) %>%
rename_with(~ str_replace(., "BiologicalMaterial", "B"), starts_with("Biological") ) -> chemX3
set.seed(19372)
train_indices = createDataPartition(chemY$Yield , times = 1, list = FALSE, p = 0.75 )
chemX_test = chemX3[-train_indices , ]
chemX_train = chemX3[ train_indices , ]
chemY_test = chemY[-train_indices , ]
chemY_train = chemY[ train_indices , ]
# Save the non-zero-variance predictors as a "bv" (big variance set of predictors)
chemX_train_bv = chemX_train[, -nearZeroVar(chemX_train) ]
# Save the non-zero-variance predictors as a "bv" (big variance set of predictors)
chemX_test_bv = chemX_test[, -nearZeroVar(chemX_test) ]
chemX_train_bv %>% mutate(
across(c(M18, M20, M16, M26, M25, M27, M31,
M29, M43, M30, M01, M42, M44, M45,
M39, M38 ) , # Run a function across the columns listed in across()
~if_else( condition = ( .x == 0 ) , # anonymous function applied to each column in across() .x refers to current column argument
true = median( . ) , # we use median without removal of NA
false = as.numeric(.) ) ) # keep the existing value if it is nonzero
) -> chemX_train_nz
# Replace the outlier zeros with the median value for the variables below for test
chemX_test_bv %>% mutate(
across(c(M39, M42, M44, M45, M01 ) ,
~if_else( condition = ( .x == 0 ) ,
true = median( . ) ,
false = as.numeric(.) ) ) ) -> chemX_test_nz
# Build the caret function to preprocess the Chemical data and impute missing values.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(chemX_train_nz, method = c("center", "scale", "knnImpute"))
# Becomes the source data for the model building
chemX_train_pp = predict( preProcFunc, chemX_train_nz )
# Becomes the final version of test data for prediction
chemX_test_pp = predict( preProcFunc, chemX_test_nz )
predictors = subset(ChemicalManufacturingProcess, select = -Yield)
yield = subset( ChemicalManufacturingProcess, select = "Yield")
set.seed(517)
trainingRows = createDataPartition( yield$Yield, p = 0.7, list = FALSE)
trainPredictors = predictors[trainingRows,]
trainYield = yield[trainingRows,]
testPredictors = predictors[-trainingRows, ]
testYield = yield[-trainingRows,]
pp <- preProcess(trainPredictors, method = c("BoxCox", "center", "scale", "knnImpute"))
ppTrainPredictors = predict(pp, trainPredictors)
set.seed(10393)
library(rpart)
cartGrid = expand.grid( maxdepth = seq( 1, 15, by = 1 ) )
#cartGrid = data.frame( maxdepth = seq(1, 15, by = 1), .model = TRUE )
#cartControl <- trainControl(method = "cv", number = 10, selectionFunction = "best")
cartControl <- trainControl(method = "boot", number = 30, selectionFunction = "best")
(rpartTune = train( x= chemX_train_pp,
y = chemY_train,
method = "rpart2" ,
metric = "RMSE" ,
tuneGrid = cartGrid ,
trControl = cartControl ) )
set.seed(1929)
randomForestControl <- trainControl(method = "boot", number = 30, selectionFunction = "best")
(randomForestTune = caret::train( x = chemX_train_pp,
y = chemY_train,
method = "rf",
tuneGrid = expand.grid( mtry = seq( 2, 30, by = 2) ) ,
metric = "RMSE" ,
trControl = randomForestControl ,
ntree = 1000 ) )
set.seed( 1027)
gbmControl <- trainControl(method = "boot", number = 30, selectionFunction = "best")
(gbmTune = caret::train( x = chemX_train_pp,
y = chemY_train,
method = "gbm",
tuneGrid = expand.grid( n.trees = c(300, 500),
shrinkage = c( 0.005, 0.02, 0.1 , 0.15 ) ,
interaction.depth = c( 1, 5, 7) ,
n.minobsinnode = 6
) ,
verbose = FALSE,
metric = "RMSE" ,
trControl = gbmControl ) )
set.seed( 1027)
cubistControl <- trainControl(method = "boot", number = 30, selectionFunction = "best")
(cubistTune = caret::train( x = chemX_train_pp,
y = chemY_train,
method = "cubist",
tuneGrid = expand.grid( committees = c( 1, 10, 20, 50 ) ,
neighbors = c( 0, 1, 5, 8)
) ,
verbose = FALSE,
metric = "RMSE" ,
trControl = cubistControl ) )
# Make a list of model results collect them from caret for both training and test data and compare them jointly in a kable()
model_list = list( rpartTune, randomForestTune, gbmTune, cubistTune)
model_stats = data.frame()
for( modelObj in model_list )
{
pred_data <- predict( modelObj, newdata=chemX_test_pp )
output <- data.frame( modelName = modelObj$method ,
trainRMSE = getTrainPerf(modelObj)[1,1] ,
trainR2 = getTrainPerf(modelObj)[1,2] ,
testRMSE = caret::RMSE( pred_data, chemY_test) ,
testR2 = caret::R2( pred_data, chemY_test) ,
modelType = "Tree/Rule"
)
model_stats <- rbind(model_stats, output)
}
model_stats %>%
add_row(modelName = "pls", trainRMSE = 1.203 , trainR2 = .566 , testRMSE = 1.22817 , testR2 = .5872 ,
modelType = "Linear Regression") %>%
add_row(modelName = "Mars", trainRMSE = 1.22 , trainR2 = .580 , testRMSE = 1.20 , testR2 = .600 ,
modelType = "Non-Linear Regression") ->model_stats2
model_stats2 %>% as_tibble() %>%
arrange( testRMSE) %>%
kable( digits = 2, caption = "Model Results Comparison") %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
cubist_varimp = varImp(cubistTune, scale = TRUE)
cubist_varimp$importance %>% arrange(desc(Overall)) %>% top_n(n=20) %>%
add_rownames(var = "Var") %>%
ggplot(aes(x=reorder(Var, Overall) ,y=Overall)) +
geom_bar(stat="identity") + coord_flip() +
labs(title="Cubist Top 20 Variables",
x = "Variable",
y = "Importance")
predictor_rankings = tibble(
rank = c(1:10) ,
PLS = c( "M32", "M13", "M36" , "M09", "M17", "B02" , "M33", "M06" , "B06" , "B03" ) ,
MARS = c("M32", "M09", rep("", 8)) ,
Cubist = c("M32", "M17", "M13", "B03", "M29", "M39", "B12", "M33", "M09", "B02") )
predictor_rankings %>% kable(caption = "Predictor Importance Across 3 Model Types") %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left" )
library(rpart.plot)
library(partykit)
rpart.plot(rpartTune$finalModel)
rpart_pt = as.party(rpartTune$finalModel)
plot(rpart_pt, gp = gpar(fontsize = 7 , col = "blue"),
inner_panel = node_inner( rpart_pt, fill = c("lightgreen", "tan" ) ) )