Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.
8.1. Recreate the simulated data:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
library(caret)
model1 <- randomForest(
  y ~ .,
  data = simulated,
  importance = TRUE,
  ntree = 1000
)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
Overall
V1 8.732235404
V2 6.415369387
V3 0.763591825
V4 7.615118809
V5 2.023524577
V6 0.165111172
V7 -0.005961659
V8 -0.166362581
V9 -0.095292651
V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6 - V10)?
From the importance scores above, we can see that the random forest did not make meaningful use of V6-V10: their Overall scores are all near zero, and several are negative.
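As a quick check (a small sketch using the rfImp1 object computed above), we can sort the importance scores so that the uninformative predictors appear at the bottom:
# Sort the random forest importance scores from highest to lowest
rfImp1[order(rfImp1$Overall, decreasing = TRUE), , drop = FALSE]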
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
[1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(
y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000
)
varImp(model2, scale = FALSE)
Overall
V1 5.69119973
V2 6.06896061
V3 0.62970218
V4 7.04752238
V5 1.87238438
V6 0.13569065
V7 -0.01345645
V8 -0.04370565
V9 0.00840438
V10 0.02894814
duplicate1 4.28331581
Comparing the results of the first and second models, we can see that duplicate1 receives a substantial importance score while the importance of V1 decreased: the two highly correlated predictors now share the credit that V1 previously received alone.
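As a rough check (a small sketch using the two fits above), the combined importance of V1 and duplicate1 in the second model is roughly comparable to the importance V1 had on its own in the first model:
# Compare V1's original importance to the V1 + duplicate1 total in model2
rfImp2 <- varImp(model2, scale = FALSE)
rfImp1["V1", "Overall"]
rfImp2["V1", "Overall"] + rfImp2["duplicate1", "Overall"]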
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
model3 <- randomForest(
y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000
)
varImp(model3, scale = FALSE)
Overall
V1 4.91687329
V2 6.52816504
V3 0.58711552
V4 7.04870917
V5 2.03115561
V6 0.14213148
V7 0.10991985
V8 -0.08405687
V9 -0.01075028
V10 0.09230576
duplicate1 3.80068234
duplicate2 1.87721959
When a second duplicate is added, the importance of V1 decreases again, as does the importance of duplicate1: the shared signal is now split three ways.
(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. Do these importances show the same pattern as the traditional random forest model?
library(party)
# Drop the two duplicated predictors, keeping V1-V10 and y
simulated <- simulated[, c(1:11)]
model4 <- cforest(
y ~ .,
data = simulated
)
varimp(model4, conditional = TRUE)
V1 V2 V3 V4 V5
5.699856e+00 5.190114e+00 6.219633e-03 6.741766e+00 1.182547e+00
V6 V7 V8 V9 V10
-1.104739e-02 1.491985e-05 -8.383608e-03 -7.355064e-03 -2.342891e-02
Yes, the conditional importances show a similar pattern to the traditional random forest: V1, V2, V4, and V5 are much more important than the remaining predictors.
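For comparison (a sketch, not run above), leaving conditional = FALSE returns the traditional, unconditional permutation importance, which does not adjust for correlations among predictors:
# Traditional (unconditional) permutation importance for the same cforest fit
varimp(model4, conditional = FALSE)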
(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Boosted Trees:
library(gbm)
global_tr_control <- trainControl(method = "cv", allowParallel = TRUE)
boost_tree_grid <- expand.grid(
.interaction.depth = seq(1, 7, by = 1),
.n.trees = seq(100, 1000, by = 50),
.shrinkage = seq(.01, .5, by = .05),
.n.minobsinnode = c(5:15)
)
boost_fit <- train(
simulated[, c(1:10)],
simulated$y,
method = "gbm",
tuneGrid = boost_tree_grid,
verbose = FALSE,
trControl = global_tr_control
)
varImp(boost_fit, scale = FALSE)
gbm variable importance
Overall
V4 9260.0
V1 8779.1
V2 7226.2
V5 3688.0
V3 2943.8
V6 416.8
V7 375.2
V10 224.0
V8 142.4
V9 141.4
Using boosted trees, we see that the same variables (V1-V5) are the most important, led by V4 and V1.
Cubist
library(Cubist)
cubist_fit <- cubist(
simulated[, c(1:10)],
simulated$y
)
varImp(cubist_fit, scale = FALSE)
Overall
V1 50
V2 50
V4 50
V5 50
V3 0
V6 0
V7 0
V8 0
V9 0
V10 0
Again, using the Cubist model, we can see that the same variables are the most important; here, though, the predictors that the model never uses are given an importance of 0.
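Cubist's importance reflects how often each predictor appears in rule conditions and in the rules' linear models, so unused predictors score exactly 0. A quick way to see this (a small sketch using the fitted object) is to inspect the usage table:
# Percentage of rule conditions and linear models in which each predictor appears
cubist_fit$usage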
8.2. Use a simulation to show tree bias with different granularities.
We’ll create a series of predictors with different levels of granularity, where:
c1 takes values on a 0.1 grid,
c2 takes values on a 0.01 grid,
and so on down to c4, the most granular predictor (a 0.0001 grid).
set.seed(19940211)
c1 <- sample(1:10 / 10, 2000, replace = TRUE)
c2 <- sample(1:100 / 100, 2000, replace = TRUE)
c3 <- sample(1:1000 / 1000, 2000, replace = TRUE)
c4 <- sample(1:10000 / 10000, 2000, replace = TRUE)
y <- c1 + c2 + c3 + c4 + rnorm(2000, mean = 0, sd = 1.5)
train_tree_data <- data.frame(c1, c2, c3, c4, y)
library(rpart)
library(partykit)
sim_tree_fit <- rpart(
y ~ .,
data = train_tree_data
)
plot(as.party(sim_tree_fit))
varImp(sim_tree_fit)
Overall
c1 0.06361753
c2 0.17198751
c3 0.11809556
c4 0.13957752
The simulation is noisy and the importance scores are all of the same order of magnitude, but the coarsest predictor (c1) still receives the lowest score while the more granular predictors (c2-c4) score two to nearly three times higher, even though each contributes equally to y. This illustrates the tree's selection bias toward predictors with more distinct values.
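Another way to see the bias (a small sketch using the fitted rpart object) is to count how often each predictor is chosen as a split variable; the coarser predictors tend to be chosen less often:
# Tally the split variables used in the tree ("<leaf>" marks terminal nodes)
table(sim_tree_fit$frame$var)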
8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The model on the right (bagging fraction = 0.9 and shrinkage = 0.9) concentrates its importance on the first few predictors because the high learning rate lets each tree make large corrections, so the ensemble commits quickly to the strongest predictors. In addition, the high bagging fraction means each tree is built on nearly the same large subset of the data, so the trees are not very diverse and keep splitting on the same variables.
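To reproduce the flavor of Fig. 8.24 (a sketch, assuming the solubility data from AppliedPredictiveModeling and plain gbm fits rather than the book's exact tuning), we could fit the two extreme settings and compare their relative-influence rankings:
library(gbm)
library(AppliedPredictiveModeling)
data(solubility) # provides solTrainXtrans and solTrainY
set.seed(100)
gbm_left <- gbm.fit(
  solTrainXtrans, solTrainY,
  distribution = "gaussian",
  n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
  verbose = FALSE
)
set.seed(100)
gbm_right <- gbm.fit(
  solTrainXtrans, solTrainY,
  distribution = "gaussian",
  n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
  verbose = FALSE
)
# Relative influence; the 0.9/0.9 fit should concentrate on fewer predictors
head(summary(gbm_left, plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)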
(b) Which model do you think would be more predictive of other samples?
I believe that the model on the left will be more predictive of other samples: its low learning rate and low bagging fraction spread the model's reliance across more predictors and more diverse bootstrap samples, which should make it better at generalizing.
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth would functionally increase the maximum number of nodes in each tree, so more predictors get used in the splits. This spreads importance across more variables and would therefore flatten the slope of the predictor importance profile for both models.
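One way to check this (again a sketch with plain gbm on the solubility data, an assumption since Fig. 8.24 was produced with caret tuning) is to fit the same settings at two interaction depths and count how many predictors carry more than a trivial share of the relative influence:
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)
fit_depth <- function(depth) {
  set.seed(100)
  gbm.fit(
    solTrainXtrans, solTrainY,
    distribution = "gaussian",
    n.trees = 100, shrinkage = 0.1, bag.fraction = 0.5,
    interaction.depth = depth,
    verbose = FALSE
  )
}
ri_shallow <- summary(fit_depth(1), plotit = FALSE)
ri_deep <- summary(fit_depth(10), plotit = FALSE)
# Count predictors with more than 1% relative influence at each depth
c(shallow = sum(ri_shallow$rel.inf > 1), deep = sum(ri_deep$rel.inf > 1))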
8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
sum(is.na(ChemicalManufacturingProcess[, -c(1)]))
[1] 106
ChemicalManufacturingProcess <- ChemicalManufacturingProcess[, -nearZeroVar(ChemicalManufacturingProcess)]
# Filling NAs using KNN
knn_impute <- preProcess(
ChemicalManufacturingProcess[, -c(1)],
method = "knnImpute"
)
cmp_independent <- predict(
knn_impute,
ChemicalManufacturingProcess[, -c(1)]
)
# Keep the response as a numeric vector so it can be subset by row indices below
cmp_dependent <- ChemicalManufacturingProcess$Yield
sum(is.na(cmp_independent))
[1] 0
set.seed(19940211)
# Partition the data into a sample of 80% of the full dataset
cmp_train_rows <- createDataPartition(
cmp_independent$BiologicalMaterial01,
p = 0.8,
list = FALSE
)
# Use the sample to create a training dataset
cmp_train_ind <- cmp_independent[cmp_train_rows, ]
cmp_train_dep <- cmp_dependent[cmp_train_rows]
# Use the sample to create a testing dataset
cmp_test_ind <- cmp_independent[-cmp_train_rows, ]
cmp_test_dep <- cmp_dependent[-cmp_train_rows]
(a) Which tree-based regression model gives the optimal resampling and test set performance?
Single Tree
cmp_stree_fit <- train(
cmp_train_ind,
cmp_train_dep,
method = "rpart2",
tuneLength = 10,
trControl = global_tr_control
)
cmp_stree_pred <- predict(
cmp_stree_fit,
cmp_test_ind
)
postResample(
cmp_stree_pred,
cmp_test_dep
)
RMSE Rsquared MAE
1.4827649 0.5008663 1.1742636
Bagged Trees
library(ipred)
cmp_bag_fit <- ipredbagg(
cmp_train_dep,
cmp_train_ind
)
cmp_bag_pred <- predict(
cmp_bag_fit,
cmp_test_ind
)
postResample(
cmp_bag_pred,
cmp_test_dep
)
RMSE Rsquared MAE
1.2790524 0.6830667 0.9635714
Random Forest
library(randomForest)
cmp_randforest_fit <- randomForest(
cmp_train_ind,
cmp_train_dep,
importance = TRUE,
ntree = 1000
)
cmp_randforest_pred <- predict(
cmp_randforest_fit,
cmp_test_ind
)
postResample(
cmp_randforest_pred,
cmp_test_dep
)
RMSE Rsquared MAE
1.3094566 0.6734821 0.9844426
Boosted Trees
cmp_boost_fit <- train(
  cmp_train_ind,
  cmp_train_dep,
  method = "gbm",
  tuneGrid = boost_tree_grid,
  verbose = FALSE,
  trControl = global_tr_control
)
boosted_pred <- predict(
  cmp_boost_fit,
  cmp_test_ind
)
postResample(
boosted_pred,
cmp_test_dep
)
RMSE Rsquared MAE
1.3748034 0.5648721 1.0478236
Cubist
cmp_cubist_fit <- cubist(
cmp_train_ind,
cmp_train_dep
)
cmp_cubist_pred <- predict(
  cmp_cubist_fit,
  cmp_test_ind
)
postResample(
cmp_cubist_pred,
cmp_test_dep
)
Our bagged tree provided the best results, with an R2 of 0.683 and the lowest RMSE and MAE (1.28 and 0.96, respectively). The random forest was a close second (R2 of 0.673), while the single tree and boosted trees had higher RMSE and MAE and noticeably lower R2 values.
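To see the test-set metrics side by side (a small sketch that reuses the prediction objects created above):
# Collect the test-set performance of each tree-based model into one table
rbind(
  SingleTree = postResample(cmp_stree_pred, cmp_test_dep),
  BaggedTree = postResample(cmp_bag_pred, cmp_test_dep),
  RandomForest = postResample(cmp_randforest_pred, cmp_test_dep),
  BoostedTrees = postResample(boosted_pred, cmp_test_dep),
  Cubist = postResample(cmp_cubist_pred, cmp_test_dep)
)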
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
Overall
BiologicalMaterial03 0.9034354
ManufacturingProcess32 0.8108736
ManufacturingProcess17 0.7102317
BiologicalMaterial11 0.7071977
BiologicalMaterial12 0.7003613
ManufacturingProcess31 0.6280937
ManufacturingProcess09 0.6146938
ManufacturingProcess06 0.5380744
BiologicalMaterial05 0.5254365
BiologicalMaterial09 0.4977530
In the bagged tree model, the biological material and manufacturing process predictors each make up about half of the top 10, so neither group dominates. In contrast, the manufacturing process variables dominated the top-10 lists of the optimal linear (PLS) and nonlinear (KNN) models from the earlier exercises.
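For a quick tally (a sketch that assumes the bagged-tree importance scores above are stored in a data frame named cmp_bag_imp with an Overall column; that name is hypothetical, mirroring the naming used elsewhere in this document):
# Count biological vs. process predictors among the 10 highest importance scores
top10 <- head(cmp_bag_imp[order(cmp_bag_imp$Overall, decreasing = TRUE), , drop = FALSE], 10)
table(ifelse(grepl("^Biological", rownames(top10)), "Biological", "Process"))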
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?