Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson
8.1. Recreate the simulated data from Exercise 7.2:
library(mlbench)
library(randomForest)
library(caret)
library(data.table)
library(party)
library(tidyverse)
library(rpart)
library(rpart.plot)
library(AppliedPredictiveModeling)
library(missForest)
library(Cubist)
library(partykit)set.seed(200)
simu <- mlbench.friedman1(200, sd=1)
simu <- cbind(simu$x, simu$y)
simu <- as.data.frame(simu)
colnames(simu)[ncol(simu)] <- "y"- Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model_rf<- randomForest(y ~ ., data = simu,
importance = TRUE,
ntree = 1000)
model_rf##
## Call:
## randomForest(formula = y ~ ., data = simu, importance = TRUE, ntree = 1000)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 6.754258
## % Var explained: 72.3
Below you will find the importance feature, and the overall score along with the variables
rf_imp <- varImp(model_rf, scale = FALSE)
rf_imp## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
From using the VarImpPlot feature you can view the variables in the plot, this will ensure an understanding of their importance/ranking
varImpPlot(model_rf) Here you will visualize V6 to V10 which do not appear important
- Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simu$dup1 <- simu$V1 + rnorm(200) * .1
cor(simu$dup1, simu$V1)## [1] 0.9460206
above you can see the correlation of V1 is 94%
model_rf2 <- randomForest(y ~ ., data = simu,
importance = TRUE,
ntree = 1000)
model_rf2##
## Call:
## randomForest(formula = y ~ ., data = simu, importance = TRUE, ntree = 1000)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 6.922537
## % Var explained: 71.61
rf2_Imp <- varImp(model_rf2, scale = FALSE)
rf2_Imp## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## dup1 4.28331581
varImpPlot(model_rf2) Here we can see V1 importance going down significantly with a predictor (duplicate 1 variable ) added.
- Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
simu <- subset(simu, select = -dup1)
model_3 <- cforest(y ~ ., data = simu)
feature_imp <- varimp(model_3)
feature_imp## V1 V2 V3 V4 V5 V6
## 7.926955696 5.860766989 0.208635817 7.208444729 2.077012973 -0.032252040
## V7 V8 V9 V10
## 0.008656239 -0.133778236 -0.072364217 -0.016155247
From visualizing, cforest has V1-V5 as the important variables, and V6-V10 are low ranking features.
- Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
model_cub <- cubist(simu[,-11], simu$y)
model_cub##
## Call:
## cubist.default(x = simu[, -11], y = simu$y)
##
## Number of samples: 200
## Number of predictors: 10
##
## Number of committees: 1
## Number of rules: 1
summary(model_cub)##
## Call:
## cubist.default(x = simu[, -11], y = simu$y)
##
##
## Cubist [Release 2.07 GPL Edition] Tue May 04 23:07:11 2021
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 200 cases (11 attributes) from undefined.data
##
## Model:
##
## Rule 1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.944664]
##
## outcome = 0.183529 + 8.9 V4 + 7.9 V1 + 7.1 V2 + 5.3 V5
##
##
## Evaluation on training data (200 cases):
##
## Average |error| 2.224012
## Relative |error| 0.55
## Correlation coefficient 0.84
##
##
## Attribute usage:
## Conds Model
##
## 100% V1
## 100% V2
## 100% V4
## 100% V5
##
##
## Time: 0.0 secs
We can see that Cubist will only uses highly important variables. Which are V1, V2, V4 and V5.
8.2 Use a simulation to show tree bias with different granularities.
tree_sim <- twoClassSim(200,
noiseVars = 10,
corrVar = 4,
corrValue = 0.8,
mislabel= 0)
fit <- rpart(Linear01 ~ ., tree_sim)
rpart.plot(fit)8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
- Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
From looking at the model on the right it focuses its importance on the first few predictors, then the model on the left which spreads its importance across multiple predictors. This is because as the training of the model goes up, the number of focuses predictors goes down.Therefore, the model on the right uses a higher bagging fraction, so this causes the data to have less variation between samples. Finally, the model also has a high shrinkage parameter, this mean the model can learn quicker vs. one has smaller shrinkage parameter.
- Which model do you think would be more predictive of other samples?
I would think the model on the left will perform better. For the reason that the left model uses more predictors, which might be more representative of the data.
- How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
From Increasing the interaction depth this could affect the predictors importance by impacting how the model learns. Therefore, the higher the interaction depth, the faster the learner works. This could have a negative impact on the predictors.
8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
- Which tree-based regression model gives the optimal resampling and test set performance?
data(ChemicalManufacturingProcess)
df <- ChemicalManufacturingProcess
df_1 <- missForest(df)## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
df_2 <- df_1$ximp
head(df_2)## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00 6.25 49.58 56.97
## 2 42.44 8.01 60.97 67.48
## 3 42.03 8.01 60.97 67.48
## 4 41.42 8.01 60.97 67.48
## 5 42.49 7.47 63.33 72.25
## 6 43.57 6.12 58.36 65.31
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 12.74 19.51 43.73
## 2 14.65 19.36 53.14
## 3 14.65 19.36 53.14
## 4 14.65 19.36 53.14
## 5 14.02 17.91 54.66
## 6 15.17 21.79 51.23
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1 100 16.66 11.44
## 2 100 19.04 12.55
## 3 100 19.04 12.55
## 4 100 19.04 12.55
## 5 100 18.22 12.80
## 6 100 18.30 12.13
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1 3.46 138.09 18.83
## 2 3.46 153.67 21.05
## 3 3.46 153.67 21.05
## 4 3.46 153.67 21.05
## 5 3.05 147.61 21.05
## 6 3.78 151.88 20.76
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1 10.937 11.217 1.5341
## 2 0.000 0.000 1.5454
## 3 0.000 0.000 1.5457
## 4 0.000 0.000 1.5442
## 5 10.700 0.000 1.5409
## 6 12.000 0.000 1.5425
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1 929.82 998.286 205.919
## 2 917.00 1032.200 210.000
## 3 912.00 1003.600 207.100
## 4 911.00 1014.600 213.300
## 5 918.00 1027.500 205.700
## 6 924.00 1016.800 208.900
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1 177.53 177.73 43.00
## 2 177.00 178.00 46.57
## 3 178.00 178.00 45.07
## 4 177.00 177.00 44.92
## 5 178.00 178.00 44.96
## 6 178.00 178.00 45.32
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1 9.098 9.404 45.49
## 2 9.491 9.887 0.00
## 3 9.216 9.565 0.00
## 4 9.238 9.620 0.00
## 5 8.912 9.268 0.00
## 6 9.182 9.912 0.00
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1 35.5 4898 6108
## 2 34.0 4869 6095
## 3 34.8 4878 6087
## 4 34.8 4897 6102
## 5 34.6 4992 6233
## 6 34.0 4985 6222
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1 4682 35.5 4865
## 2 4617 34.0 4867
## 3 4617 34.8 4877
## 4 4635 34.8 4872
## 5 4733 33.9 4886
## 6 4786 33.4 4862
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1 6049 4665 0.0
## 2 6097 4621 0.0
## 3 6078 4621 0.0
## 4 6073 4611 0.0
## 5 6102 4659 -0.7
## 6 6115 4696 -0.6
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1 5.4 3.59 12.33
## 2 3.0 0.00 3.00
## 3 4.0 1.00 4.00
## 4 5.0 2.00 5.00
## 5 8.0 4.00 18.00
## 6 9.0 1.00 1.00
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1 4873 6074 4685
## 2 4869 6107 4630
## 3 4897 6116 4637
## 4 4892 6111 4630
## 5 4930 6151 4684
## 6 4871 6128 4687
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1 10.7 21.0 9.9
## 2 11.2 21.4 9.9
## 3 11.1 21.3 9.4
## 4 11.1 21.3 9.4
## 5 11.3 21.6 9.0
## 6 11.4 21.7 10.1
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1 69.1 156 66
## 2 68.7 169 66
## 3 69.3 173 66
## 4 69.3 171 68
## 5 69.4 171 70
## 6 68.2 173 70
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1 2.4 486 0.019
## 2 2.6 508 0.019
## 3 2.6 509 0.018
## 4 2.5 496 0.018
## 5 2.5 468 0.017
## 6 2.5 490 0.018
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1 0.5 3 7.2
## 2 2.0 2 7.2
## 3 0.7 2 7.2
## 4 1.2 2 7.2
## 5 0.2 2 7.3
## 6 0.4 2 7.2
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1 0.014 0.018 11.6
## 2 0.100 0.150 11.1
## 3 0.000 0.000 12.0
## 4 0.000 0.000 10.6
## 5 0.000 0.000 11.0
## 6 0.000 0.000 11.5
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1 3.0 1.8 2.4
## 2 0.9 1.9 2.2
## 3 1.0 1.8 2.3
## 4 1.1 1.8 2.1
## 5 1.1 1.7 2.1
## 6 2.2 1.8 2.0
data <- df_2[, 2:58]
target <- df_2[,1]
train <- createDataPartition(target, p=0.75)
train_pred <- data[train$Resample1,]
train_tar <- target[train$Resample]
test_pred <- data[-train$Resample1,]
test_target <- target[-train$Resample1]
control <- trainControl(method = "cv", number=10)Cubist
cube_model = train(train_pred, train_tar, method="cubist", metric =
"Rsquared", tuneLength=10, trainControl=control)
predict_cube = predict(cube_model, test_pred)
postResample(predict_cube, obs = test_target)## RMSE Rsquared MAE
## 1.0742298 0.7184455 0.8087956
Random Forest
rf_model <- train(train_pred, train_tar,method = "rf",tuneLength =
10,trControl = control, metric ="Rsquared", importance=TRUE )
predict_rf = predict(rf_model, test_pred)
postResample(predict_rf, obs = test_target)## RMSE Rsquared MAE
## 1.2334240 0.6479936 0.9771199
Single Tree
rt_model <- train(train_pred, train_tar,method ="rpart2", metric
="Rsquared", tuneLength=9, trControl=control)
predict_rt = predict(rt_model, test_pred)
postResample(predict_rt, obs = test_target)## RMSE Rsquared MAE
## 1.7911437 0.2064795 1.3719619
I believe the cubist model has the best RMSE.
- Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
plot(varImp(cube_model), top=10)From visualizing the graph above ManufacturingProcess32 is ranked the highest. I believe the cubist model shows us the top predictors are very similar with the results from the optimal linear and nonlinear models.
- Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
plot(as.party(rt_model$finalModel)) The ManufacturingProcess32 has a higher impact on the yield, this is because it’s related to the previous models. The biological process does not appear to have as much of an impact with the response yield.