Recreate the simulated data from Exercise 7.2:
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
set.seed(10)
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
# Print out the variable importance scores.
rfImp1
## Overall
## V1 8.90420548
## V2 6.70898998
## V3 0.86791755
## V4 7.62604729
## V5 2.23267325
## V6 0.06923985
## V7 0.06089436
## V8 -0.10874321
## V9 -0.07789474
## V10 -0.06803808
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
Answer:
From the variable importance table above, the model relied mainly on variables V1 through V5. The uninformative predictors V6 through V10 were not used to any meaningful degree, as is apparent from their near-zero (or slightly negative) importance scores.
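The conditional-forest and Cubist results below include duplicate1 and duplicate2, so two predictors highly correlated with V1 (the next parts of the original exercise) have evidently already been added to simulated. A minimal sketch of that step is shown below; the exact random values used originally are unknown, so re-running it will not exactly reproduce the importance scores that follow.
# Assumed step: add two predictors that are highly correlated with V1, creating
# the duplicate1 and duplicate2 columns referenced in the output below.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
# Confirm the copies are highly correlated with V1.
cor(simulated$duplicate1, simulated$V1)
cor(simulated$duplicate2, simulated$V1)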
Traditional Importance Measure
set.seed(10)
# Fit a random forest model using conditional inference trees.
cforest_model <- cforest(y ~ ., simulated)
# Print out the variable importance scores without conditional argument.
cforestImp3 <- varimp(cforest_model)
cforestImp3
## V1 V2 V3 V4 V5 V6
## 5.799253750 5.298015397 0.258691754 6.330561629 1.823288605 0.007444318
## V7 V8 V9 V10 duplicate1 duplicate2
## 0.208115563 -0.116038987 0.035354230 -0.074702917 4.430253665 4.557321703
Conditional Importance Measure
# Print out the variable importance scores using conditional argument.
cforestImp3Conditional <- varimp(cforest_model, conditional = TRUE)
cforestImp3Conditional
## V1 V2 V3 V4 V5 V6
## 2.59979292 4.27632052 0.03614943 5.61307920 1.18052467 -0.28216376
## V7 V8 V9 V10 duplicate1 duplicate2
## -0.10972358 -0.15616185 -0.16504915 -0.16416097 1.73207732 1.83901858
Answer:
Compared to the traditional (unconditional) measure, the conditional importance measure reduces the scores of nearly all of the predictors, with the largest drops among the important predictors occurring for V1 and its correlated copies (duplicate1 and duplicate2). This is expected, since the conditional permutation scheme adjusts for correlation between predictors.
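For a side-by-side view of the two measures, the vectors created above can be combined into a single table:
# Combine the traditional and conditional importance scores for comparison.
round(cbind(Traditional = cforestImp3, Conditional = cforestImp3Conditional), 3)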
Generalized Boosted Model
set.seed(10)
# Fit a Generalized Boosted model.
grid = expand.grid(.interaction.depth = seq(1, 7, by = 2),
.n.trees = seq(100, 1000, by = 50),
.shrinkage = c(0.01, 0.1),
.n.minobsinnode = 5)
gbmModel = train(y ~ ., data = simulated, method = 'gbm', tuneGrid = grid, verbose = FALSE)
# Print out the variable importance scores.
gbmImp <- varImp(gbmModel)
gbmImp
## gbm variable importance
##
## Overall
## V4 100.0000
## V2 76.9717
## duplicate2 53.5461
## V5 40.5607
## V3 35.6464
## V1 28.2897
## duplicate1 22.7830
## V6 4.3984
## V7 2.3665
## V9 1.0731
## V8 0.3006
## V10 0.0000
Answer:
A similar pattern appears for the GBM model: the informative predictors (V1 through V5, along with the two correlated duplicates) carry nearly all of the importance, with V4 again the most influential, while the uninformative predictors V6 through V10 contribute little.
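To confirm which grid point produced these importance scores, the tuning-parameter combination that caret selected can be printed:
# Best tuning-parameter combination chosen from the grid above.
gbmModel$bestTune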
Cubist Model
set.seed(10)
# Fit a Cubist model using the cubist() function.
cubistModel <- cubist(simulated[,-11], simulated[, 11])
# Print out the variable importance scores.
cubistImp<-varImp(cubistModel)
cubistImp
## Overall
## V1 50
## V2 50
## V4 50
## V5 50
## V6 50
## V3 0
## V7 0
## V8 0
## V9 0
## V10 0
## duplicate1 0
## duplicate2 0
Answer:
The same pattern does not occur for the Cubist model. Variables V1, V2, V4, V5, and V6 each receive an importance of 50, while V3, V7 through V10, and the two duplicate predictors receive 0.
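The source of these usage scores can be inspected directly, and, as a hedged side check, refitting with more committees (a standard argument of cubist()) shows whether the flat 50/0 pattern persists:
# Print the rule conditions and linear model terms behind the scores above.
summary(cubistModel)
# Sketch: refit with more committees and recompute the importance scores.
cubistModel100 <- cubist(simulated[, -11], simulated[, 11], committees = 100)
varImp(cubistModel100)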
Use a simulation to show tree bias with different granularities.
set.seed(10)
# Simulate the data using the twoClassSim() function.
simulation <- twoClassSim(200, noiseVars = 6, corrVar = 4, corrValue = 0.8) %>%
mutate(TwoFactor1 = as.factor(round(TwoFactor1, 0)),
TwoFactor2 = as.factor(round(TwoFactor2, 0)))
# Create the tree and plot the data.
simulationTree <- rpart(Linear01 ~ ., simulation)
fancyRpartPlot(simulationTree, caption = NULL)
# Print out the variable importance scores.
varImp(simulationTree)
## Overall
## Class 0.12899309
## Corr1 0.56877057
## Corr2 0.80484357
## Corr3 0.45148582
## Corr4 0.31354494
## Linear02 0.09092615
## Linear04 0.57816796
## Linear05 0.30368897
## Linear06 0.34327812
## Linear07 0.21178385
## Linear08 0.21668061
## Linear09 0.08907005
## Linear10 0.64657871
## Noise1 0.15824703
## Noise2 0.25301676
## Noise3 0.42263110
## Noise4 0.64929630
## Noise5 0.85549633
## Noise6 0.24896270
## Nonlinear1 0.57689821
## Nonlinear2 0.32892122
## Nonlinear3 0.66215725
## TwoFactor1 0.46778884
## TwoFactor2 0.67971422
## Linear03 0.00000000
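The tree above is fit to a response (Linear01) that has no built-in relationship with most of the other predictors, yet many of them, including the continuous Noise variables, still receive sizeable importance. A more direct, minimal sketch of granularity bias is given below; it is hypothetical (not part of the simulation above) and uses a pure-noise response with one continuous and one two-level predictor, so the only difference between them is the number of candidate split points.
# Sketch: with a pure-noise response, a single-split tree picks the
# high-granularity (continuous) predictor far more often than the two-level
# factor, illustrating selection bias toward finer-grained predictors.
set.seed(123)
firstSplit <- replicate(500, {
  x_fine <- rnorm(100)                                         # many distinct values
  x_coarse <- factor(sample(c("a", "b"), 100, replace = TRUE)) # only two levels
  y <- rnorm(100)                                              # unrelated to both
  fit <- rpart(y ~ x_fine + x_coarse,
               control = rpart.control(cp = 0, minsplit = 2, maxdepth = 1))
  as.character(fit$frame$var[1])                               # root split variable
})
table(firstSplit)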
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Answer:
The model on the right uses a higher learning rate and a higher bagging fraction, which means that a larger percentage of the training data is used to build each tree and a larger portion of each tree's predictions is added to the previous iteration's predicted value. As a result, the right-hand model concentrates its importance on just a few predictors, while the left-hand model spreads importance across more of them. According to the text, "The importance profile for boosting has a much steeper importance slope than the one for random forests. This is due to the fact that the trees from boosting are dependent on each other and hence will have correlated structures as the method follows by the gradient. Therefore many of the same predictors will be selected across the trees, increasing their contribution to the importance metric".
Answer:
Smaller learning rates and bagging fractions generally lead to lower variance on new samples, so I believe the model on the left would be more predictive of other samples.
Answer:
Increasing the interaction depth would spread importance across more predictors, so the slope of the predictor-importance plot would become flatter (less steep) for both models.
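Although Fig. 8.24 itself is not reproduced here, the effects described in the answers above can be sketched directly with the gbm package. This is a minimal sketch under the assumption that the solubility data referenced in the exercise (solTrainXtrans and solTrainY from the AppliedPredictiveModeling package) is available; the parameter values mirror the two panels, and a third, deeper fit is added to check the interaction-depth claim.
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
solTrainData <- cbind(solTrainXtrans, y = solTrainY)
# Left-hand panel settings: bagging fraction and learning rate both 0.1.
gbmLow <- gbm(y ~ ., data = solTrainData, distribution = "gaussian",
              n.trees = 100, interaction.depth = 1,
              shrinkage = 0.1, bag.fraction = 0.1)
# Right-hand panel settings: both set to 0.9.
gbmHigh <- gbm(y ~ ., data = solTrainData, distribution = "gaussian",
               n.trees = 100, interaction.depth = 1,
               shrinkage = 0.9, bag.fraction = 0.9)
# Same as gbmHigh but with a much deeper interaction depth, to see whether
# the relative influence spreads across more predictors.
gbmDeep <- gbm(y ~ ., data = solTrainData, distribution = "gaussian",
               n.trees = 100, interaction.depth = 10,
               shrinkage = 0.9, bag.fraction = 0.9)
# summary() returns each predictor's relative influence, sorted.
head(summary(gbmLow, plotit = FALSE), 10)
head(summary(gbmHigh, plotit = FALSE), 10)
head(summary(gbmDeep, plotit = FALSE), 10)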
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
set.seed(10)
data(ChemicalManufacturingProcess)
# Impute the missing values using KNN.
cmpImputed <- preProcess(ChemicalManufacturingProcess, 'knnImpute')
# Predict after imputation.
chemicalMPData <- predict(cmpImputed, ChemicalManufacturingProcess)
# Split the data into training (80%) and test (20%) sets.
trainingData <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.8, list = FALSE)
# Note: chemicalMPData still contains the (imputed and standardized) Yield column,
# so Yield is carried along in the predictor matrices below.
xTrainData <- chemicalMPData[trainingData, ]
yTrainData <- ChemicalManufacturingProcess$Yield[trainingData]
# Hold-out test set.
xTestData <- chemicalMPData[-trainingData, ]
yTestData <- ChemicalManufacturingProcess$Yield[-trainingData]
Single Tree Model
set.seed(10)
# Define the R Part Single Tree model.
rPartSingleTree <- train(x = xTrainData,
y = yTrainData,
method = 'rpart',
tuneLength = 10,
trControl = trainControl(method = 'cv'))
# Run predict() and postResample() on the model.
rPartSingleTreePred <- predict(rPartSingleTree, newdata = xTestData)
rPartSingleTreePerformance <- postResample(pred = rPartSingleTreePred, obs = yTestData)
rPartSingleTreePerformance
## RMSE Rsquared MAE
## 0.2897274 0.9749795 0.2309521
Random Forest Model
set.seed(10)
# Define the Random Forest model.
randomForest <- train(x = xTrainData,
y = yTrainData,
method = 'rf',
tuneLength = 10,
importance = TRUE,
trControl = trainControl(method = 'cv'))
# Run predict() and postResample() on the model.
randomForestPred <- predict(randomForest, newdata = xTestData)
randomForestPerformance <- postResample(pred = randomForestPred, obs = yTestData)
randomForestPerformance
## RMSE Rsquared MAE
## 0.06814128 0.99860581 0.05070449
GBM Model
set.seed(10)
# Define the GBM model.
grid <- expand.grid(interaction.depth = seq(1, 6, by = 1),
n.trees = c(25, 50, 100, 200),
shrinkage = c(0.01, 0.05, 0.1, 0.2),
n.minobsinnode = c(5, 10, 15))
gbmModel <- train(x = xTrainData,
y = yTrainData,
method = 'gbm',
tuneGrid = grid,
trControl = trainControl(method = 'cv'),
verbose = FALSE)
# Run predict() and postResample() on the model.
gbmModelPred <- predict(gbmModel, newdata = xTestData)
gbmModelPerformance <- postResample(pred = gbmModelPred, obs = yTestData)
gbmModelPerformance
## RMSE Rsquared MAE
## 0.08243419 0.99776327 0.06678381
rbind('Single Tree Model' = rPartSingleTreePerformance,
'Random Forest Model' = randomForestPerformance,
'GBM Model' = gbmModelPerformance) %>%
kable() %>% kable_styling()
| | RMSE | Rsquared | MAE |
|---|---|---|---|
| Single Tree Model | 0.2897274 | 0.9749795 | 0.2309521 |
| Random Forest Model | 0.0681413 | 0.9986058 | 0.0507045 |
| GBM Model | 0.0824342 | 0.9977633 | 0.0667838 |
Answer:
Based on the lowest RMSE and the highest R-squared on the hold-out test set, the Random Forest model gives the best performance of the three tree-based models.
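As a complement to the test-set comparison above, the cross-validated resampling results of the three caret fits could also be compared directly; this is a sketch, and a strictly fair comparison would require the three models to share the same resampling indices.
# Collect and summarize the 10-fold cross-validation results of each model.
resampleResults <- resamples(list(SingleTree = rPartSingleTree,
                                  RandomForest = randomForest,
                                  GBM = gbmModel))
summary(resampleResults)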
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list?
varImp(randomForest)
## rf variable importance
##
## only 20 most important variables shown (out of 58)
##
## Overall
## Yield 100.000
## ManufacturingProcess33 5.433
## BiologicalMaterial03 5.425
## BiologicalMaterial11 5.354
## ManufacturingProcess30 4.390
## ManufacturingProcess38 4.386
## ManufacturingProcess05 4.363
## ManufacturingProcess12 4.156
## ManufacturingProcess40 4.093
## ManufacturingProcess02 4.092
## ManufacturingProcess11 3.983
## BiologicalMaterial09 3.978
## ManufacturingProcess14 3.953
## ManufacturingProcess19 3.828
## BiologicalMaterial10 3.685
## ManufacturingProcess27 3.659
## ManufacturingProcess44 3.645
## ManufacturingProcess06 3.603
## ManufacturingProcess39 3.598
## ManufacturingProcess09 3.592
vip(randomForest, aesthetics = list(color = 'red', fill='green'))
Answer:
The most important predictors in the optimal tree-based regression model (the Random Forest) are shown above. Yield itself tops the list only because the preprocessed data frame used to build the predictor matrices still contains the Yield column; setting that aside, the manufacturing process variables dominate the list.
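A quick tally (a sketch, excluding the Yield column that is carried along in the predictor matrix) makes the split between process and biological variables explicit:
# Count process vs. biological variables among the top 20 RF predictors.
rfImp <- varImp(randomForest)$importance
rfImp <- rfImp[rownames(rfImp) != "Yield", , drop = FALSE]
rfImpTop <- rfImp[order(-rfImp$Overall), , drop = FALSE][1:20, , drop = FALSE]
table(ifelse(grepl("^Manufacturing", rownames(rfImpTop)), "Process", "Biological"))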
How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
# Define and tune the SVM model.
svmModel <- train(x = xTrainData,
y = yTrainData,
method = 'svmRadial',
preProc = c('center', 'scale'),
tuneLength = 14,
trControl = trainControl(method = 'cv'))
varImp(svmModel)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 58)
##
## Overall
## Yield 100.00
## ManufacturingProcess32 39.43
## ManufacturingProcess13 36.09
## BiologicalMaterial06 31.25
## ManufacturingProcess17 30.28
## ManufacturingProcess36 28.51
## ManufacturingProcess09 27.98
## BiologicalMaterial03 26.77
## BiologicalMaterial12 26.22
## ManufacturingProcess06 24.76
## BiologicalMaterial02 23.26
## BiologicalMaterial11 22.08
## ManufacturingProcess31 20.56
## ManufacturingProcess33 19.15
## BiologicalMaterial08 18.34
## BiologicalMaterial04 17.97
## ManufacturingProcess11 16.75
## BiologicalMaterial01 15.42
## ManufacturingProcess29 14.38
## ManufacturingProcess12 13.94
vip(svmModel, aesthetics = list(color = 'red', fill='purple'))
Answer:
In the earlier homework on linear and nonlinear models, the SVM model was found to be optimal. As with the Random Forest model, the manufacturing process variables dominate its importance list (again setting aside the Yield column that is carried along in the predictors), although the two top-ten lists overlap only partially; BiologicalMaterial03 is one of the few predictors that appears in both.
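The overlap can also be checked programmatically; the sketch below pulls the top ten predictors from each caret importance table (Yield excluded, since it is carried along in the predictor matrix) and lists the shared names.
# Compare the top-10 predictor lists of the Random Forest and SVM models.
rfImpDf <- varImp(randomForest)$importance
svmImpDf <- varImp(svmModel)$importance
rfTop10 <- setdiff(rownames(rfImpDf)[order(-rfImpDf$Overall)], "Yield")[1:10]
svmTop10 <- setdiff(rownames(svmImpDf)[order(-svmImpDf$Overall)], "Yield")[1:10]
intersect(rfTop10, svmTop10)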
rpartTree <- rpart(Yield ~., data = xTrainData)
fancyRpartPlot(rpartTree, caption = 'Distribution of Yield')
Answer:
The tree diagram above does provide additional knowledge about the biological and process predictors and their relationship with yield. Splits nearer the root of the tree involve the most influential predictors and partition the largest groups of samples, and following each branch down to a terminal node shows which combinations of predictor values are associated with higher or lower predicted yield. This helps identify which variables, and which ranges of those variables, are most useful for improving yield.
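As a small follow-up sketch (assuming the rpart.plot package is available), the same tree can be printed as explicit rules, which makes the predictor ranges associated with each terminal node easier to read than the plot alone.
# Print each terminal node's predicted yield (on the preprocessed scale)
# together with the chain of split conditions that leads to it.
rpart.plot::rpart.rules(rpartTree)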