library(mlbench)
set.seed(200)
library(caret)
library(AppliedPredictiveModeling)
library(randomForest)
library(dplyr)
library(party)
library(gbm)
library(rpart)
library(rpart.plot)
library(ggplot2)
Recreate the simulated data from Exercise 7.2:
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
a) Fit a random forest model to all of the predictors, then estimate the variable importance scores: Did the random forest model significantly use the uninformative predictors (V6 – V10)?
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Variable importance falls off steeply after the informative predictors. The importance scores for the uninformative predictors (V6–V10) are close to zero, while the informative predictors carry substantial scores, so the random forest did not make significant use of V6–V10.
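For a quick visual confirmation of this drop-off, we could also use randomForest's built-in importance plot (a small optional check using the model fit above):
varImpPlot(model1, type = 1)  # type = 1 shows the permutation (%IncMSE) importance only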
b) Now add an additional predictor that is highly correlated with one of the informative predictors. Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
simulated_plus1predictor <- simulated
simulated_plus1predictor$duplicate1 <- simulated_plus1predictor$V1 + rnorm(200) * 0.1
cor(simulated_plus1predictor$duplicate1, simulated_plus1predictor$V1)
## [1] 0.9460206
model_rf_plus1 <- randomForest(
y ~ ., data = simulated_plus1predictor,
importance = TRUE,
ntree = 1000
)
rfImp_plus1 <- varImp(model_rf_plus1, scale = FALSE)
rfImp_plus1_sorted <- rfImp_plus1 %>% arrange(desc(Overall))
rfImp_plus1_sorted
## Overall
## V4 7.04752238
## V2 6.06896061
## V1 5.69119973
## duplicate1 4.28331581
## V5 1.87238438
## V3 0.62970218
## V6 0.13569065
## V10 0.02894814
## V9 0.00840438
## V7 -0.01345645
## V8 -0.04370565
Adding a highly correlated predictor dropped the importance of V1 from 8.73 to 5.69, and V4 now takes the top spot. The more correlated copies we add, the more V1's importance is split across the set, so each individual score keeps dropping.
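A quick side-by-side of V1's raw importance before and after adding duplicate1, using the importance frames already computed above:
data.frame(before_duplicate = rfImp1["V1", "Overall"],
           after_duplicate  = rfImp_plus1["V1", "Overall"],
           row.names = "V1")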
simulated_plus2predictors <- simulated_plus1predictor
simulated_plus2predictors$duplicate2 <- simulated_plus2predictors$V1 + rnorm(200) * 0.01
cor(simulated_plus2predictors$duplicate2, simulated_plus2predictors$V1)
## [1] 0.999376
model_rf_plus2 <- randomForest(
y ~ ., data = simulated_plus2predictors,
importance = TRUE,
ntree = 1000
)
rfImp_plus2 <- varImp(model_rf_plus2, scale = FALSE)
rfImp_plus2_sorted <- rfImp_plus2 %>% arrange(desc(Overall))
rfImp_plus2_sorted
## Overall
## V4 7.09158234
## V2 6.78792845
## V1 4.18035786
## duplicate2 4.06559574
## duplicate1 2.95441997
## V5 1.99621666
## V3 0.64037145
## V6 0.15136139
## V10 0.06924451
## V7 0.02166139
## V9 -0.05078724
## V8 -0.08876124
V1's importance drops further, to 4.18, and is now diluted across duplicate1 and duplicate2. Random forest importance scores are clearly sensitive to multicollinearity among predictors.
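The degree of collinearity involved is easy to verify directly:
round(cor(simulated_plus2predictors[, c("V1", "duplicate1", "duplicate2")]), 3)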
c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
model_cforest <- cforest(
y ~ ., data = simulated_plus2predictors,
controls = cforest_unbiased(ntree = 1000)
)
cforest_imp <- varimp(model_cforest)
cforest_imp_conditional <- varimp(model_cforest, conditional = TRUE)
Traditional (unconditional) importance:
sort(cforest_imp, decreasing = TRUE)
## V4 V2 duplicate1 V1 duplicate2 V5
## 7.037023814 5.842185785 3.953648578 3.813500996 2.423641256 1.568077773
## V3 V7 V6 V9 V8 V10
## 0.023476911 0.013065210 0.002165501 -0.014321460 -0.022004251 -0.040804439
Conditional importance:
sort(cforest_imp_conditional, decreasing = TRUE)
## V4 V2 duplicate1 V1 V5 duplicate2
## 5.587500324 4.573214880 1.357034540 1.335761172 1.097710024 1.056913796
## V6 V7 V9 V3 V8 V10
## 0.025747278 0.009038659 0.004086155 0.001745259 0.001492007 -0.015866157
The overall pattern is similar, with V4, V2, and V1 near the top and V6–V10 near zero, but the handling of the correlated predictors differs. In the random forest, both duplicates show relatively high importance, with duplicate2 higher than duplicate1. With the traditional cforest importance, duplicate1 is more important than duplicate2, and both are lower than in the random forest. With conditional = TRUE, the importance of both duplicates drops substantially, reflecting the method's adjustment for correlation between predictors.
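A side-by-side view makes the contrast between the two cforest measures easier to read (just a reshaping of the two vectors computed above):
imp_compare <- data.frame(traditional = cforest_imp,
                          conditional = cforest_imp_conditional[names(cforest_imp)])
imp_compare[order(imp_compare$traditional, decreasing = TRUE), ]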
d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Boosted
model_boosted1 <- train(
y ~ ., data = simulated,
method = "gbm",
verbose = FALSE,
trControl = trainControl(method = "cv")
)
boosted_imp1 <- varImp(model_boosted1)$importance
boosted_imp1 <- boosted_imp1[order(boosted_imp1$Overall, decreasing = TRUE), , drop = FALSE]
print(boosted_imp1)
## Overall
## V4 100.0000000
## V1 87.8605104
## V2 72.4122436
## V5 46.5770853
## V3 26.6947227
## V6 1.2998739
## V7 1.2198369
## V9 0.4203124
## V8 0.3256926
## V10 0.0000000
model_boosted2 <- train(
y ~ ., data = simulated_plus1predictor,
method = "gbm",
verbose = FALSE,
trControl = trainControl(method = "cv")
)
boosted_imp2 <- varImp(model_boosted2)$importance
boosted_imp2 <- boosted_imp2[order(boosted_imp2$Overall, decreasing = TRUE), , drop = FALSE]
print(boosted_imp2)
## Overall
## V4 100.00000000
## V2 72.95352598
## duplicate1 65.02855908
## V1 32.68979890
## V5 32.37670927
## V3 23.35722711
## V9 5.40107326
## V7 1.82152089
## V6 0.74658255
## V8 0.01413698
## V10 0.00000000
model_boosted3 <- train(
y ~ ., data = simulated_plus2predictors,
method = "gbm",
verbose = FALSE,
trControl = trainControl(method = "cv")
)
boosted_imp3 <- varImp(model_boosted3)$importance
boosted_imp3 <- boosted_imp3[order(boosted_imp3$Overall, decreasing = TRUE), , drop = FALSE]
print(boosted_imp3)
## Overall
## V4 100.00000000
## V2 73.89107909
## V1 59.89924903
## V5 37.35815544
## V3 28.17178089
## duplicate1 21.00899753
## duplicate2 11.50615917
## V9 1.50397970
## V7 1.38884765
## V6 0.77833787
## V8 0.01272447
## V10 0.00000000
As with the other models, adding highly correlated predictors reduces the importance of V1 and pushes its duplicates up the rankings. Even so, V1 holds up relatively well here: it drops only one place, to just below V2, once the correlated duplicates are added.
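To see the dilution numerically, we can track V1's importance across the three boosted fits (note that varImp rescales scores to 0–100 within each model, so these are relative rather than raw values):
c(no_duplicates  = boosted_imp1["V1", "Overall"],
  one_duplicate  = boosted_imp2["V1", "Overall"],
  two_duplicates = boosted_imp3["V1", "Overall"])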
model_cubist1 <- train(
y ~ ., data = simulated,
method = "cubist",
trControl = trainControl(method = "cv")
)
cubist_imp1 <- varImp(model_cubist1)$importance
cubist_imp1 <- cubist_imp1[order(cubist_imp1$Overall, decreasing = TRUE), , drop = FALSE]
print(cubist_imp1)
## Overall
## V1 100.00000
## V2 77.30496
## V4 70.92199
## V5 67.37589
## V3 58.15603
## V6 19.85816
## V7 0.00000
## V8 0.00000
## V9 0.00000
## V10 0.00000
model_cubist2 <- train(
y ~ ., data = simulated_plus1predictor,
method = "cubist",
trControl = trainControl(method = "cv")
)
cubist_imp2 <- varImp(model_cubist2)$importance
cubist_imp2 <- cubist_imp2[order(cubist_imp2$Overall, decreasing = TRUE), , drop = FALSE]
print(cubist_imp2)
## Overall
## V2 100.00000
## V1 89.51613
## V4 80.64516
## V3 67.74194
## duplicate1 59.67742
## V5 50.00000
## V6 25.00000
## V7 0.00000
## V8 0.00000
## V9 0.00000
## V10 0.00000
model_cubist3 <- train(
y ~ ., data = simulated_plus2predictors,
method = "cubist",
trControl = trainControl(method = "cv")
)
cubist_imp3 <- varImp(model_cubist3)$importance
cubist_imp3 <- cubist_imp3[order(cubist_imp3$Overall, decreasing = TRUE), , drop = FALSE]
print(cubist_imp3)
## Overall
## V2 100.0000000
## V1 79.7101449
## V4 77.5362319
## duplicate2 70.2898551
## V3 58.6956522
## V5 47.8260870
## duplicate1 34.7826087
## V6 17.3913043
## V8 7.2463768
## V10 0.7246377
## V7 0.0000000
## V9 0.0000000
Boosted trees show the largest drop in V1's importance once multiple duplicates are added, more than random forest or cforest, while the top predictor V4 stays stable; in Cubist, by contrast, V1 sometimes leads. The duplicates displace V1 less dramatically in Cubist, and several low-importance predictors stay near zero rather than picking up spread-out importance. Boosting also redistributes importance fairly evenly between the two duplicates, whereas Cubist gives one duplicate a much higher score than the other.
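The different treatment of the two duplicates is also visible if we pull just those rows out of the two scaled importance frames above:
vars <- c("V1", "duplicate1", "duplicate2")
data.frame(boosted = boosted_imp3[vars, "Overall"],
           cubist  = cubist_imp3[vars, "Overall"],
           row.names = vars)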
Use a simulation to show tree bias with different granularities.
Let's first simulate a data set with five predictors that vary in granularity: categorical predictors with different numbers of levels and continuous predictors with different variances.
n <- 200
v1 <- factor(sample(c("A", "B"), n, replace = TRUE))
v2 <- factor(sample(letters[1:3], n, replace = TRUE))
v3 <- factor(sample(letters[1:5], n, replace = TRUE))
v4 <- rnorm(n, mean = 5, sd = 0.5)
v5 <- rnorm(n, mean = 5, sd = 2)
y <- rnorm(n, mean = as.numeric(v1) + as.numeric(v2) + as.numeric(v3) + v4 + v5, sd = 1)
sim_gran_cont <- data.frame(y, v1, v2, v3, v4, v5)
We can then fit a regression tree, plot it, and compare its first split and variable importance with each predictor's granularity:
tree_model_cont <- rpart(y ~ ., data = sim_gran_cont, method = "anova")
rpart.plot(tree_model_cont, type = 2, extra = 101, fallen.leaves = TRUE)
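For a programmatic check of the root split and rpart's own importance scores (using the tree object just fit):
as.character(tree_model_cont$frame$var[1])   # predictor used for the first split
tree_model_cont$variable.importance          # rpart's variable importance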
We see that the predictor with the largest number of possible distinct values, V5, gets chosen for the first split. We can estimate how often this occurs by repeating the simulation:
n_sim <- 100
first_split <- character(n_sim)
for (i in 1:n_sim) {
  v1 <- factor(sample(c("A", "B"), n, replace = TRUE))
  v2 <- factor(sample(letters[1:3], n, replace = TRUE))
  v3 <- factor(sample(letters[1:5], n, replace = TRUE))
  v4 <- rnorm(n, mean = 5, sd = 0.5)
  v5 <- rnorm(n, mean = 5, sd = 2)
  y <- rnorm(n, mean = as.numeric(v1) + as.numeric(v2) + as.numeric(v3) + v4 + v5, sd = 1)
  sim_data <- data.frame(y, v1, v2, v3, v4, v5)
  tree_model <- rpart(y ~ ., data = sim_data, method = "anova")
  first_split[i] <- as.character(tree_model$frame$var[1])
}
first_split_table <- table(first_split)
first_split_table
## first_split
## v3 v5
## 2 98
Across 100 simulations, V5 is chosen for the first split almost every time. Being a continuous variable with a large variance, V5 offers many candidate split points and therefore far more opportunities to reduce the RSS at the root, so it consistently wins out over predictors with fewer distinct values. This illustrates the selection bias of CART-style trees toward higher-granularity predictors.
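The granularity difference driving this bias is easy to quantify by counting the distinct values of each predictor in the simulated data:
sapply(sim_gran_cont[, -1], function(x) length(unique(x)))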
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The high learning rate and bagging fraction for the model on the right cause each boosting iteration to make large updates using nearly the entire data set (90 percent). The model therefore latches onto the strongest predictors quickly and fits most of the residual variation with them, leaving little for the weaker predictors to explain. On the left, the smaller values mean each tree is trained on a small subsample and captures only a small portion of the residual structure, so importance spreads across more predictors as many of them contribute incrementally over a large number of small updates.
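As a rough sketch of this effect (not the exact configuration behind Fig. 8.24, and with simplified tree settings), we could fit the two extremes directly on the solubility data that ships with AppliedPredictiveModeling; gbm_left and gbm_right below are illustrative names:
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
gbm_left  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbm_right <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbm_left,  plotit = FALSE), 10)   # importance spread across many predictors
head(summary(gbm_right, plotit = FALSE), 10)   # importance concentrated in a few predictors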
(b)
Which model do you think would be more predictive of other samples?
The model on the left would probably be more predictive of other samples. The smaller learning rate and smaller subsamples force it to learn the structure slowly and more generally: it does not hone in on the strongest predictors right away, and it lets a wider range of variables contribute through incremental updates, which usually yields a model that generalizes better. The model on the right is much more aggressive; while it may fit the training data faster, it is more likely to overfit and perform worse on new samples.
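Continuing the hypothetical sketch above, one way to check this expectation is to compare the two fits on the held-out solubility test set (the outcome is not guaranteed, but the left-hand settings would typically fare better after enough trees):
pred_left  <- predict(gbm_left,  solTestXtrans, n.trees = 100)
pred_right <- predict(gbm_right, solTestXtrans, n.trees = 100)
RMSE(pred_left,  solTestY)
RMSE(pred_right, solTestY)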
(c)
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth lets each tree capture more complexity, since more splits and nodes are allowed. Deeper trees bring additional predictors into play below the first split, so importance gets spread across more variables and the weaker predictors contribute more. As a result, the slope of the predictor importance profile in Fig. 8.24 would flatten for both models, because the fall-off from the top predictors becomes much less drastic.
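Again as a sketch built on the hypothetical objects above, refitting the aggressive settings with a deeper interaction depth and comparing its importance profile to gbm_right's would show the flattening directly:
gbm_deep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                interaction.depth = 10)
head(summary(gbm_deep, plotit = FALSE), 10)   # compare against summary(gbm_right) above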
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
data("ChemicalManufacturingProcess")
trans <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
chemicalData <- predict(trans, ChemicalManufacturingProcess)
trainIndex <- createDataPartition(chemicalData$Yield, p = 0.8, list = FALSE)
chemTrain <- chemicalData[trainIndex, ]
chemTest <- chemicalData[-trainIndex, ]
trainX <- chemTrain[, -which(names(chemTrain) == "Yield")]
trainY <- chemTrain$Yield
testX <- chemTest[, -which(names(chemTest) == "Yield")]
testY <- chemTest$Yield
(a)
Which tree-based regression model gives the optimal resampling and test set performance?
We can try all of the tree-based models used throughout this assignment:
treeModel_chem <- rpart(Yield ~ ., data = chemTrain, method = "anova")
rfModel_chem <- randomForest(Yield ~ ., data = chemTrain, importance = TRUE, ntree = 1000)
cforestModel_chem <- cforest(Yield ~ ., data = chemTrain, controls = cforest_unbiased(ntree=1000))
boostedModel_chem <- train(Yield ~ ., data = chemTrain, method = "gbm", verbose = FALSE, trControl = trainControl(method = "cv"))
cubistModel_chem <- train(Yield ~ ., data = chemTrain, method = "cubist", trControl = trainControl(method = "cv"))
evaluate <- function(model, testX, testY, type = c("caret", "base", "cforest")) {
  # Computes test-set RMSE and R-squared; cforest objects need OOB = FALSE and
  # type = "response" to return plain numeric predictions on new data.
  type <- match.arg(type)
  preds <- switch(type,
                  caret   = predict(model, newdata = testX),
                  base    = predict(model, newdata = testX),
                  cforest = as.numeric(predict(model, newdata = testX,
                                               OOB = FALSE, type = "response")))
  data.frame(RMSE = RMSE(preds, testY), Rsquared = R2(preds, testY))
}
perf_list <- list(
RegressionTree = evaluate(treeModel_chem, testX, testY, "base"),
RandomForest = evaluate(rfModel_chem, testX, testY, "base"),
CForest = evaluate(cforestModel_chem, testX, testY, "cforest"),
Boosted = evaluate(boostedModel_chem, testX, testY, "caret"),
Cubist = evaluate(cubistModel_chem, testX, testY, "caret")
)
perf_table <- do.call(rbind, perf_list)
perf_table
## RMSE Rsquared
## RegressionTree 0.7755147 0.4964955
## RandomForest 0.6177840 0.6342400
## CForest 0.6917245 0.5699876
## Boosted 0.5904234 0.6509749
## Cubist 0.5016777 0.7575507
Cubist has the best test-set performance, with the lowest RMSE (0.50) and the highest R-squared (0.76), followed by boosted trees and random forest; the single regression tree performs the worst.
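The question also asks about resampling performance. For the two models tuned with caret we can compare their cross-validation results directly (a sketch using the train objects above; the other three models were fit outside caret, so they are not included):
resamps <- resamples(list(Boosted = boostedModel_chem, Cubist = cubistModel_chem))
summary(resamps)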
(b)
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
cubist_imp <- varImp(cubistModel_chem)$importance
cubist_imp_sorted <- cubist_imp[order(cubist_imp$Overall, decreasing = TRUE), , drop = FALSE]
print(head(cubist_imp_sorted,10))
## Overall
## ManufacturingProcess32 100.00000
## ManufacturingProcess29 65.04854
## ManufacturingProcess13 46.60194
## BiologicalMaterial02 40.77670
## ManufacturingProcess27 36.89320
## ManufacturingProcess04 33.98058
## BiologicalMaterial03 31.06796
## ManufacturingProcess33 27.18447
## ManufacturingProcess17 24.27184
## ManufacturingProcess31 23.30097
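As a quick check of the biological vs. process split in the Cubist top 10 (using the sorted importance frame above):
top10 <- head(rownames(cubist_imp_sorted), 10)
table(ifelse(grepl("^Biological", top10), "Biological", "Process"))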
Once again the process variables dominate the list, 8 to 2. Now we can compare this with our optimal linear and nonlinear models from the prior homework assignments.
Our best-performing linear model (lasso) had the following variable importance order:
Our best-performing nonlinear model (SVM) had the following order:
The Cubist model emphasizes process variables even more strongly than the previous models, going 8:2 compared with the linear lasso's 6:4 and the nonlinear SVM's 7:3. It also introduces several process predictors into the top ten that were not top-ranked in the lasso or SVM models, displacing biological predictors. ManufacturingProcess32 and ManufacturingProcess13 retain top-five spots across all three models, showing their strong influence on yield.
(c)
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
treeModel_chem <- rpart(Yield ~ ., data = chemTrain, method = "anova")
rpart.plot(treeModel_chem, type = 3, extra = 101, fallen.leaves = TRUE)
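The rpart.plot call above shows node means and counts but not the full yield distribution in each terminal node. One way to get the distribution view the question asks for is to convert the rpart fit with the partykit package (not loaded above, so this is a sketch assuming it is installed):
library(partykit)               # used only for as.party(); may mask some party functions
plot(as.party(treeModel_chem))  # terminal nodes are drawn with boxplots of Yield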
We see the first split on ManufacturingProcess32, as expected. The tree then shows how yield changes in subsequent splits on BiologicalMaterial11 and ManufacturingProcess13. Beyond the second level, a biological variable appears in only a small portion of the tree, while the remaining splits on both sides are dominated by manufacturing process variables. This visualization helps us understand how individual predictors affect yield, how broad their influence is, and which side of the yield distribution they push observations toward.