library(tidyverse)
library(imputeTS)
library(lubridate)
library(caret)
library(mlbench)
library(randomForest)
library(party)
library(partykit)
library(rJava)
library(RWeka)
library(rpart)
library(gbm)
library(xgboost)
library(Cubist)
library(rpart.plot)
library(AppliedPredictiveModeling)
Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.
Recreate the simulated data from Exercise 7.2:
Start R and use these commands to load the data
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
The random forest model does not significantly use predictors V6-V10, as shown by their near-zero (or negative) overall importance scores.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
The importance score for V1 did decrease. As more predictors that are highly correlated with V1 are included, the importance of V1 decreases further.
model2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
## Overall
## V1 4.91687329
## V2 6.52816504
## V3 0.58711552
## V4 7.04870917
## V5 2.03115561
## V6 0.14213148
## V7 0.10991985
## V8 -0.08405687
## V9 -0.01075028
## V10 0.09230576
## duplicate1 3.80068234
## duplicate2 1.87721959
Comparing with the model created by the cforest function, the importance values are not identical but show a similar pattern in relative magnitude.
cforestmodel <- cforest(y ~ ., data = simulated)
varimp(cforestmodel)
## V1 V2 V3 V4 V5 V6
## 5.42152778 5.79075433 -0.01425384 5.96294311 1.87652947 -0.06822257
## V7 V8 V9 V10 duplicate1 duplicate2
## 0.07696942 -0.17067028 0.02155362 -0.13376134 5.27935126 2.92224516
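For comparison, party::varimp also offers a conditional permutation scheme (conditional = TRUE), which is intended to reduce the inflated importance of correlated predictors such as duplicate1 and duplicate2. A minimal sketch (not run above):
# conditional permutation importance; slower, but adjusts for predictors that
# are correlated with one another, such as V1, duplicate1, and duplicate2
varimp(cforestmodel, conditional = TRUE)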
The same relative pattern occurs across these models as well: V6-V10 are consistently low in variable importance. There are differences between the models, however. The rpart model, for example, assigns V3 a relatively low importance, while the Cubist model gives V6 a noticeably higher importance (about 14 in the run shown) than V7-V10, which all fall below 5. With the highly correlated copies of V1 still in the data, V4 is the top-ranked predictor for the rpart, gbm, and xgboost models, while the Cubist model ranks V2 first. The models also divide importance among the correlated copies differently: the xgboost model gives duplicate1 an importance of roughly 59, V1 roughly 40, and duplicate2 only about 1, whereas the Cubist model keeps V1 high (about 78) and assigns the two duplicates equal, lower scores (about 36 each).
rpartmodel <- rpart(y ~ ., data = simulated)
varImp(rpartmodel)
## Overall
## duplicate1 1.6165532
## duplicate2 0.3521296
## V1 1.5149230
## V10 0.5254491
## V2 2.0690787
## V3 0.5805123
## V4 2.9099218
## V5 2.3438469
## V6 0.3027516
## V7 0.5113927
## V9 0.2054104
## V8 0.0000000
gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 100)
gbmImp <- summary(gbmModel, n.trees = 100)  # renamed from varImp to avoid masking caret::varImp
print(gbmImp)
## var rel.inf
## V4 V4 29.0834679
## V2 V2 23.2667156
## duplicate1 duplicate1 15.6028982
## V1 V1 12.9259358
## V5 V5 9.9657720
## V3 V3 8.6571358
## V9 V9 0.3081622
## V6 V6 0.1899125
## V7 V7 0.0000000
## V8 V8 0.0000000
## V10 V10 0.0000000
## duplicate2 duplicate2 0.0000000
#xgbTree throws a deprecation warning from the package source code;
#suppressing this printed output
invisible(capture.output({
xgbModel <- train(y ~ ., data = simulated, method = "xgbTree")
}))
varImp(xgbModel)
## xgbTree variable importance
##
## Overall
## V4 100.00000
## V2 76.40826
## duplicate1 59.28970
## V1 40.06618
## V5 31.60700
## V3 30.13565
## V6 1.38945
## duplicate2 1.03944
## V10 0.51353
## V7 0.06086
## V9 0.05165
## V8 0.00000
cubistModel <- train(y ~ ., data = simulated, method = "cubist")
varImp(cubistModel)
## cubist variable importance
##
## Overall
## V2 100.000
## V1 77.698
## V4 71.942
## V5 54.676
## V3 46.043
## duplicate1 35.971
## duplicate2 35.971
## V6 14.388
## V8 4.317
## V10 0.000
## V9 0.000
## V7 0.000
Use a simulation to show tree bias with different granularities.
From Applied Predictive Modeling, the text states that individual trees tend to suffer from selection bias, where "predictors with a higher number of distinct values are favored over more granular predictors". The phrasing is a bit difficult to interpret, since "granular" could be read as meaning either coarse or fine-grained; my reading is that a predictor with many distinct values may be favored at a split even over a coarser predictor that is more strongly related to the outcome.

To simulate this, I create a dataset of 200 outcome samples y by randomly sampling integers from 1 to 10. I then create a column, duplicatey, which is highly correlated with the outcome y but rounded to the nearest ten, so it has very few distinct values. I also create a column, distinct1, which is uncorrelated with y but contains many distinct randomly generated values. Fitting rpart() and viewing the variable importance shows that duplicatey is more important, yet the model still splits on distinct1 despite it being uncorrelated with y.

I then run the same simulation without rounding duplicatey. The resulting model does not split on distinct1 at all, and the importance of duplicatey increases by more than the importance of distinct1. This second model also has a higher correlation between y and duplicatey, along with a larger number of distinct values in duplicatey.

A final comparison is a third model in which the uncorrelated variable distinct1 is also rounded. This reduces its number of distinct values and, as expected, also reduces its importance in the fitted rpart() model.
set.seed(200)
simulatedData <- data.frame()
for(i in (1:200)){
simulatedData <- rbind(sample(1:10,1, replace = TRUE),simulatedData)
}
colname <- colnames(simulatedData)
simulatedData <- simulatedData %>% rename(y = colname)
simulatedData$duplicatey <- round(simulatedData$y + rnorm(200),-1)
cor(simulatedData$duplicatey, simulatedData$y)
## [1] 0.8185743
simulatedData$distinct1 <- rnorm(200)
cor(simulatedData$distinct1, simulatedData$y)
## [1] -0.04767912
rpartModelsim <- rpart(y ~ ., data = simulatedData)
varImp(rpartModelsim)
## Overall
## distinct1 0.07705602
## duplicatey 0.67006384
rpart.plot(rpartModelsim)
simulatedData <- data.frame()
for(i in (1:200)){
simulatedData <- rbind(sample(1:10,1, replace = TRUE),simulatedData)
}
colname <- colnames(simulatedData)
simulatedData <- simulatedData %>% rename(y = colname)
simulatedData$duplicatey <- simulatedData$y + rnorm(200)
cor(simulatedData$duplicatey, simulatedData$y)
## [1] 0.928973
simulatedData$distinct1 <- rnorm(200)
cor(simulatedData$distinct1, simulatedData$y)
## [1] 0.05930572
rpartModelsim <- rpart(y ~ ., data = simulatedData)
varImp(rpartModelsim)
## Overall
## distinct1 0.143018
## duplicatey 2.024346
rpart.plot(rpartModelsim)
simulatedData <- data.frame()
for(i in (1:200)){
simulatedData <- rbind(sample(1:10,1, replace = TRUE),simulatedData)
}
colname <- colnames(simulatedData)
simulatedData <- simulatedData %>% rename(y = colname)
simulatedData$duplicatey <- simulatedData$y + rnorm(200)
cor(simulatedData$duplicatey, simulatedData$y)
## [1] 0.9467756
simulatedData$distinct1 <- round(rnorm(200))
cor(simulatedData$distinct1, simulatedData$y)
## [1] -0.0233493
rpartModelsim <- rpart(y ~ ., data = simulatedData)
varImp(rpartModelsim)
## Overall
## distinct1 0.122513
## duplicatey 2.543658
rpart.plot(rpartModelsim)
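To make the granularity comparison concrete, one can count the distinct values of each column in the last simulated data set (a small check, assuming simulatedData from the third simulation is still in memory):
# distinct values per column: y has at most 10, the rounded distinct1 has only a
# handful, and the unrounded duplicatey has essentially one value per row
sapply(simulatedData, dplyr::n_distinct)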
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Gradient boosting is an iterative process, and the learning rate defines the fraction of each tree's prediction that is added to the previous iteration's predicted value. One purpose of the learning rate, also known as shrinkage, is to keep the algorithm from repeatedly relying on the same predictors at each iteration. A learning rate of 0.9 adds a large fraction (90%) of each tree's prediction to the running model, which offers less protection against selecting the same predictors over and over. The bagging fraction is the fraction of the training data used to build each tree; a bagging fraction of 0.9 means 90% of the training data is used at each iteration. Since each iteration then sees most of the training data, the same dominant predictors are likely to be chosen each time. This is why the right-hand plot (both parameters at 0.9) concentrates importance on a few predictors, while the left-hand plot (both at 0.1) spreads importance across more of them.
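To see this effect outside of the solubility data, a rough sketch on the simulated data from 8.1 (assuming simulated is still in memory; the settings are chosen only for illustration) compares the two extremes; the relative influence reported by summary() should be much more concentrated in the 0.9/0.9 fit:
# low learning rate and low bagging fraction
gbmLow <- gbm(y ~ ., data = simulated, distribution = "gaussian",
              n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
              n.minobsinnode = 5)  # smaller node size so trees can still split on a 10% bag
# high learning rate and high bagging fraction
gbmHigh <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
summary(gbmLow, plotit = FALSE)   # relative influence spread across more predictors
summary(gbmHigh, plotit = FALSE)  # relative influence concentrated on a few predictors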
Since the learning rate (shrinkage) was introduced to help prevent overfitting, I would expect the model with the 0.1 learning rate to be more predictive of other samples. The 0.1 bagging fraction should also guard against overfitting, since only a fraction of the training data is used at each step, which may help the overall model predict samples that do not closely resemble the training data as a whole.
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing interaction depth, i.e. tree depth, would likely flatten (decrease) the slope of predictor importance for either model. A shallow depth means fewer splits, and with fewer splits the tree splits only on the most dominant variables, which depresses the importance of everything else; deeper trees make more splits per iteration, so additional predictors get used and importance is spread more evenly across them.
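A quick sketch (again on the simulated data from 8.1, with arbitrary settings) that changes only interaction.depth can illustrate this; the deeper trees should distribute relative influence over more predictors, flattening the importance profile:
gbmStump <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 100, interaction.depth = 1)
gbmDeep <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 100, interaction.depth = 7)
summary(gbmStump, plotit = FALSE)  # fewer splits per tree: steeper importance profile
summary(gbmDeep, plotit = FALSE)   # more splits per tree: importance spread more evenly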
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Load the Data:
data(ChemicalManufacturingProcess)
set.seed(63)
KNN imputation was used to fill in these missing values:
set.seed(63)
ChemicalManufacturingProcess_preProc <-
preProcess(ChemicalManufacturingProcess,
method = "knnImpute")
transformed_ChemMan <-
predict(ChemicalManufacturingProcess_preProc,
newdata = ChemicalManufacturingProcess)
df <- as.data.frame(transformed_ChemMan$Yield) %>%
rename(Yield = `transformed_ChemMan$Yield`)
Split the data into a training and a test set, pre-process the data:
set.seed(63)
smp_size <- floor(0.80 * nrow(ChemicalManufacturingProcess))
trainingDataindex <-
sample(seq_len(nrow(df)),
size = smp_size)
trainY <- df[trainingDataindex,]
testY <- df[-trainingDataindex,]
trainX <-
transformed_ChemMan[trainingDataindex,] %>%
select(-Yield,-BiologicalMaterial07)
testX <-
transformed_ChemMan[-trainingDataindex,] %>%
select(-Yield,-BiologicalMaterial07)
Train several tree-based models:
set.seed(63)
rpartTune <- train(x = trainX,
y = trainY,
method = "rpart2",
tuneLength = 10,
trControl = trainControl(method = "cv"))
rpartTune
## CART
##
## 140 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 126, 126, 126, 126, 127, 125, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 0.8246060 0.3990210 0.6385586
## 2 0.8585395 0.3294589 0.6737085
## 3 0.8478948 0.3358086 0.6692808
## 4 0.8429965 0.3509759 0.6484788
## 5 0.7995522 0.4085217 0.6155064
## 6 0.7963984 0.4224709 0.6041594
## 7 0.7969954 0.4292998 0.5960959
## 8 0.8053248 0.4181225 0.5992211
## 9 0.8031675 0.4215811 0.6101907
## 10 0.8028446 0.4226631 0.6068626
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 6.
# note: randomForest() takes `ntree` (not `ntrees`); the misspelled argument is
# silently ignored, so the default of 500 trees is grown, as the output below shows
rfModel <- randomForest(trainX, trainY,
                        importance = TRUE,
                        mtry = 3,
                        ntrees = 1000)
rfModel
##
## Call:
## randomForest(x = trainX, y = trainY, mtry = 3, importance = TRUE, ntrees = 1000)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.4607105
## % Var explained: 55.76
gbmTune <- train(trainX, trainY,
method = "gbm",
tuneLength = 10,
trControl = trainControl(method = "cv"),
verbose = FALSE)
gbmTune
## Stochastic Gradient Boosting
##
## 140 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 125, 126, 126, 127, 128, 126, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 0.6619548 0.6347425 0.5207302
## 1 100 0.6553629 0.6453852 0.5216436
## 1 150 0.6560975 0.6508306 0.5313237
## 1 200 0.6440480 0.6594225 0.5244150
## 1 250 0.6528182 0.6568716 0.5305933
## 1 300 0.6506265 0.6573907 0.5286091
## 1 350 0.6481291 0.6615171 0.5266143
## 1 400 0.6516427 0.6570397 0.5279734
## 1 450 0.6568860 0.6540144 0.5349976
## 1 500 0.6523359 0.6611039 0.5319127
## 2 50 0.6464292 0.6481027 0.5116357
## 2 100 0.6426869 0.6478847 0.5085874
## 2 150 0.6468147 0.6488001 0.5172338
## 2 200 0.6549768 0.6451463 0.5238394
## 2 250 0.6519943 0.6470021 0.5184311
## 2 300 0.6531169 0.6470905 0.5186146
## 2 350 0.6506190 0.6488732 0.5176903
## 2 400 0.6488769 0.6499406 0.5178891
## 2 450 0.6470029 0.6526123 0.5150915
## 2 500 0.6470525 0.6523934 0.5156345
## 3 50 0.6381306 0.6432709 0.5035347
## 3 100 0.6251244 0.6551793 0.4984719
## 3 150 0.6106679 0.6694123 0.4852842
## 3 200 0.6090971 0.6715847 0.4832230
## 3 250 0.6088921 0.6701510 0.4819457
## 3 300 0.6083596 0.6702040 0.4818891
## 3 350 0.6089424 0.6703035 0.4815626
## 3 400 0.6082414 0.6712286 0.4817171
## 3 450 0.6070799 0.6723708 0.4814129
## 3 500 0.6064531 0.6725437 0.4811732
## 4 50 0.6214456 0.6760762 0.4914931
## 4 100 0.6241328 0.6686813 0.4983421
## 4 150 0.6203860 0.6767502 0.4971338
## 4 200 0.6198565 0.6783092 0.4925379
## 4 250 0.6163597 0.6802872 0.4888787
## 4 300 0.6145497 0.6828694 0.4873391
## 4 350 0.6137432 0.6834633 0.4862273
## 4 400 0.6138293 0.6829844 0.4865868
## 4 450 0.6134375 0.6834863 0.4859357
## 4 500 0.6130335 0.6840680 0.4854928
## 5 50 0.6116931 0.6811965 0.4968391
## 5 100 0.5979165 0.6877146 0.4907408
## 5 150 0.5972286 0.6928620 0.4869924
## 5 200 0.5940537 0.6973794 0.4848480
## 5 250 0.5948442 0.6953157 0.4846855
## 5 300 0.5943929 0.6941034 0.4845107
## 5 350 0.5933019 0.6947074 0.4824759
## 5 400 0.5934477 0.6948879 0.4834339
## 5 450 0.5927216 0.6952168 0.4831185
## 5 500 0.5926664 0.6952811 0.4830603
## 6 50 0.6174232 0.6728348 0.4913589
## 6 100 0.6119428 0.6840052 0.4920060
## 6 150 0.6048565 0.6905264 0.4895501
## 6 200 0.6032738 0.6934546 0.4907144
## 6 250 0.6005979 0.6960954 0.4881325
## 6 300 0.5988501 0.6986023 0.4873794
## 6 350 0.5978103 0.6995812 0.4868447
## 6 400 0.5980040 0.6998468 0.4872227
## 6 450 0.5975675 0.7001093 0.4865538
## 6 500 0.5967565 0.7011452 0.4862442
## 7 50 0.6416382 0.6577286 0.5216524
## 7 100 0.6264580 0.6817679 0.5123127
## 7 150 0.6277958 0.6784691 0.5140732
## 7 200 0.6302152 0.6756910 0.5160497
## 7 250 0.6295272 0.6772938 0.5135817
## 7 300 0.6284829 0.6794963 0.5123717
## 7 350 0.6290959 0.6793974 0.5124335
## 7 400 0.6294118 0.6790561 0.5122755
## 7 450 0.6290627 0.6794954 0.5121803
## 7 500 0.6291371 0.6797166 0.5125577
## 8 50 0.6847926 0.6168126 0.5448827
## 8 100 0.6758213 0.6227684 0.5232168
## 8 150 0.6710163 0.6285341 0.5202417
## 8 200 0.6682207 0.6335803 0.5204282
## 8 250 0.6665798 0.6362466 0.5219160
## 8 300 0.6658941 0.6369302 0.5226856
## 8 350 0.6658490 0.6371099 0.5239016
## 8 400 0.6661013 0.6374771 0.5245621
## 8 450 0.6657955 0.6376807 0.5246843
## 8 500 0.6663121 0.6377156 0.5250454
## 9 50 0.6311070 0.6628740 0.5054270
## 9 100 0.6210939 0.6779088 0.4958137
## 9 150 0.6204869 0.6771817 0.4969351
## 9 200 0.6169604 0.6812653 0.4939517
## 9 250 0.6121686 0.6855438 0.4892885
## 9 300 0.6123702 0.6848747 0.4900230
## 9 350 0.6118291 0.6865609 0.4909422
## 9 400 0.6111679 0.6869801 0.4910799
## 9 450 0.6100228 0.6881993 0.4905098
## 9 500 0.6102115 0.6880007 0.4908984
## 10 50 0.6460078 0.6526971 0.5061813
## 10 100 0.6161771 0.6890507 0.4863725
## 10 150 0.6143909 0.6915696 0.4825131
## 10 200 0.6044788 0.6998403 0.4798898
## 10 250 0.6004079 0.7028617 0.4774439
## 10 300 0.5990039 0.7048271 0.4765001
## 10 350 0.5959254 0.7080689 0.4739420
## 10 400 0.5945544 0.7096971 0.4730968
## 10 450 0.5934899 0.7105101 0.4723497
## 10 500 0.5932616 0.7105859 0.4727235
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 500, interaction.depth =
## 5, shrinkage = 0.1 and n.minobsinnode = 10.
cubistTuned <- train(trainX,trainY,
method = "cubist",
tuneLength = 10,
trControl = trainControl(method = "cv"))
cubistTuned
## Cubist
##
## 140 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 127, 128, 124, 126, 127, 124, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 0.6862313 0.5693033 0.5569406
## 1 5 0.6059314 0.6686309 0.4822351
## 1 9 0.6355792 0.6336424 0.5019384
## 10 0 0.6393939 0.6210108 0.4979464
## 10 5 0.5783206 0.6952543 0.4433245
## 10 9 0.6075965 0.6641991 0.4581421
## 20 0 0.6187116 0.6478953 0.4842196
## 20 5 0.5576969 0.7201651 0.4286916
## 20 9 0.5845153 0.6942945 0.4428382
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
The Cubist model gives the best resampling and test set performance, with an R-squared above 0.7 both in cross-validation (0.72) and on the held-out test set (0.71):
set.seed(63)
rpartPred <- predict(rpartTune, newdata = testX)
postResample(pred = rpartPred, obs = testY)
## RMSE Rsquared MAE
## 0.7582571 0.4553510 0.6158285
rfPred <- predict(rfModel, newdata = testX)
postResample(pred = rfPred, obs = testY)
## RMSE Rsquared MAE
## 0.6128878 0.6099963 0.4976785
gbmPred <- predict(gbmTune, newdata = testX)
postResample(pred = gbmPred, obs = testY)
## RMSE Rsquared MAE
## 0.5395483 0.6381464 0.4225340
cubistPred <- predict(cubistTuned, newdata = testX)
postResample(pred = cubistPred, obs = testY)
## RMSE Rsquared MAE
## 0.4823869 0.7134893 0.3954121
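For a side-by-side comparison, the test-set metrics from the four models can be gathered into one table (a convenience sketch using the prediction objects above):
# collect test-set performance for the four tree-based models into one matrix
testResults <- rbind(CART = postResample(rpartPred, testY),
                     RandomForest = postResample(rfPred, testY),
                     GBM = postResample(gbmPred, testY),
                     Cubist = postResample(cubistPred, testY))
testResults[order(testResults[, "RMSE"]), ]  # Cubist has the lowest RMSE and highest R-squared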
Manufacturing process variables dominate the list of important predictors for the Cubist model. Compared to the optimal linear and nonlinear models, ManufacturingProcess32 remains the top predictor. As in the linear model, ManufacturingProcess09 is the second most important predictor, and as in the nonlinear model, ManufacturingProcess17 is third. Across the three models, six predictors overlap in the top-ten lists: ManufacturingProcess32, ManufacturingProcess09, ManufacturingProcess17, BiologicalMaterial03, BiologicalMaterial06, and BiologicalMaterial02. The Cubist model also places much more weight on its top two variables than the other models do; their importance declines steadily through the top ten predictors, whereas the Cubist model shows a steep drop-off after the leading predictors.
varImp(cubistTuned)
## cubist variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess09 71.910
## ManufacturingProcess17 53.933
## BiologicalMaterial12 34.831
## ManufacturingProcess33 33.708
## ManufacturingProcess04 32.584
## BiologicalMaterial03 31.461
## BiologicalMaterial06 23.596
## BiologicalMaterial08 19.101
## BiologicalMaterial02 17.978
## ManufacturingProcess13 16.854
## ManufacturingProcess10 15.730
## ManufacturingProcess29 15.730
## ManufacturingProcess25 14.607
## ManufacturingProcess39 13.483
## ManufacturingProcess31 11.236
## ManufacturingProcess18 11.236
## ManufacturingProcess20 11.236
## ManufacturingProcess14 10.112
## BiologicalMaterial01 8.989
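The steep drop-off after the top two predictors is easiest to see graphically; a one-line sketch using caret's plot method for varImp objects (not run above):
# dotplot of the ten most important predictors in the tuned Cubist model
plot(varImp(cubistTuned), top = 10)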
Per the modeling above, the optimal single-tree model was found at maxdepth = 6. Graphing the decision tree, we can see that the first split is on ManufacturingProcess32, which has consistently been the most important predictor across all of our optimal models. Since the overall goal is to increase yield, viewing the yield distribution in the terminal nodes helps identify potential simplifications to the process. As all negative pathways are unfavorable, further investigation can focus on ManufacturingProcess18 and ManufacturingProcess09, as they provide the only splits that lead to both negative and positive outcomes. In other words, the split on BiologicalMaterial05 is unimportant because the branch where BiologicalMaterial11 < -0.39 already leads only to negative yields. Since only the manufacturing processes can be altered, attention can focus on ManufacturingProcess18, where values < 0.02 tend toward positive yields per the model. Similarly, ManufacturingProcess09 has a split at -0.47, where values greater than -0.47 all lead to terminal nodes with positive yields.
rpartTune$finalModel
## n= 140
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 140 145.8064000 0.03415424
## 2) ManufacturingProcess32< 0.191596 82 46.2908900 -0.50055150
## 4) BiologicalMaterial11< -0.3896263 37 12.5158600 -0.90821600
## 8) BiologicalMaterial05>=0.4725039 9 3.6129330 -1.48809900 *
## 9) BiologicalMaterial05< 0.4725039 28 4.9037900 -0.72182510
## 18) BiologicalMaterial05< 0.01429589 19 2.3237940 -0.88811590 *
## 19) BiologicalMaterial05>=0.01429589 9 0.9454191 -0.37076680 *
## 5) BiologicalMaterial11>=-0.3896263 45 22.5701100 -0.16536070
## 10) ManufacturingProcess18>=0.01991463 32 8.2799280 -0.44088230
## 20) ManufacturingProcess27>=0.001387709 24 5.8982750 -0.57831000 *
## 21) ManufacturingProcess27< 0.001387709 8 0.5685590 -0.02859893 *
## 11) ManufacturingProcess18< 0.01991463 13 5.8814530 0.51284610 *
## 3) ManufacturingProcess32>=0.191596 58 42.9249700 0.79011760
## 6) ManufacturingProcess09< -0.4721252 11 5.1234900 -0.13948520 *
## 7) ManufacturingProcess09>=-0.4721252 47 26.0709500 1.00768400
## 14) BiologicalMaterial03< 1.102207 36 14.6134900 0.79713410
## 28) ManufacturingProcess24>=-0.05764121 15 2.9820890 0.38547910 *
## 29) ManufacturingProcess24< -0.05764121 21 7.2738610 1.09117300
## 58) ManufacturingProcess21>=-0.2387217 13 3.8428740 0.85210230 *
## 59) ManufacturingProcess21< -0.2387217 8 1.4805730 1.47966400 *
## 15) BiologicalMaterial03>=1.102207 11 4.6384960 1.69675700 *
rpart.plot(rpartTune$finalModel)
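The terminal nodes above can also be summarized as rules; assuming the installed version of rpart.plot provides rpart.rules(), a short sketch prints each path with its fitted (centered and scaled) yield:
# one row per terminal node: fitted yield followed by the split conditions on the path;
# roundint = FALSE because the original training data is not attached to the caret fit
rpart.plot::rpart.rules(rpartTune$finalModel, roundint = FALSE)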