Recreate the simulated data from Exercise 7.2
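The packages below are assumed to be loaded (presumably in an earlier setup chunk); they are listed here to keep the section self-contained:
library(mlbench)        # mlbench.friedman1
library(randomForest)   # randomForest
library(caret)          # train, varImp, createDataPartition, preProcess
library(party)          # cforest
library(rpart)          # rpart, rpart.control
library(dplyr)          # select, rename, arrange
library(tibble)         # rownames_to_column
library(ggplot2)        # plotting
library(knitr)          # kable
library(kableExtra)     # kable_styling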
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

Fit a random forest model to all the predictors, then estimate the variable importance scores:
rf_model_1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rf_imp_1 <- varImp(rf_model_1, scale = FALSE)
rf_imp_1 %>%
as.data.frame() %>%
rownames_to_column() %>%
rename(Variable = rowname) %>%
arrange(desc(Overall)) %>%
kable() %>%
kable_styling()

| Variable | Overall |
|---|---|
| V1 | 8.7322354 |
| V4 | 7.6151188 |
| V2 | 6.4153694 |
| V5 | 2.0235246 |
| V3 | 0.7635918 |
| V6 | 0.1651112 |
| V7 | -0.0059617 |
| V10 | -0.0749448 |
| V9 | -0.0952927 |
| V8 | -0.1663626 |
Did the random forest model significantly use the uninformative predictors (V6-V10)?
No. The importance scores for V6-V10 are all very close to zero (several slightly negative), indicating the random forest did not meaningfully use the uninformative predictors.
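As a quick check (a minimal sketch, reusing rf_model_1 from above), the permutation importances can be scaled by their standard errors, which randomForest stores when importance = TRUE; scores within a standard error or two of zero support treating V6-V10 as unused:
# Mean decrease in accuracy divided by its standard error (scale = TRUE)
imp_scaled <- importance(rf_model_1, type = 1, scale = TRUE)
imp_scaled[order(imp_scaled[, 1], decreasing = TRUE), , drop = FALSE]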
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
rf_model_2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rf_imp_2 <- varImp(rf_model_2, scale = FALSE)
rf_imp_2 %>%
as.data.frame() %>%
rownames_to_column() %>%
rename(Variable = rowname) %>%
arrange(desc(Overall)) %>%
kable() %>%
kable_styling()

| Variable | Overall |
|---|---|
| V4 | 7.0475224 |
| V2 | 6.0689606 |
| V1 | 5.6911997 |
| duplicate1 | 4.2833158 |
| V5 | 1.8723844 |
| V3 | 0.6297022 |
| V6 | 0.1356906 |
| V10 | 0.0289481 |
| V9 | 0.0084044 |
| V7 | -0.0134564 |
| V8 | -0.0437056 |
The importance of V1 decreased after adding the correlated predictor. Because V1 and duplicate1 carry nearly the same information, the forest splits on them interchangeably, so the importance formerly credited to V1 alone is now diluted across both variables; adding yet another predictor highly correlated with V1 would dilute it further (see the sketch below).
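A minimal sketch of that follow-up, run on a copy of the data so the models fit later in this section are unaffected (sim2, duplicate2, and rf_model_2b are illustrative names, not from the original):
sim2 <- simulated
sim2$duplicate2 <- sim2$V1 + rnorm(200) * .1   # second predictor correlated with V1
rf_model_2b <- randomForest(y ~ ., data = sim2, importance = TRUE, ntree = 1000)
varImp(rf_model_2b, scale = FALSE)             # expect V1's score to shrink further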
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
rf_model_3 <- cforest(y ~ ., data = simulated)
rf_imp_3 <- varImp(rf_model_3, conditional = TRUE)
rf_imp_3 %>%
as.data.frame() %>%
rownames_to_column() %>%
rename(Variable = rowname) %>%
arrange(desc(Overall)) %>%
kable() %>%
kable_styling()

| Variable | Overall |
|---|---|
| V4 | 6.0471707 |
| V2 | 4.8021627 |
| duplicate1 | 1.9703660 |
| V1 | 1.8986240 |
| V5 | 0.9850544 |
| V3 | 0.0229993 |
| V9 | 0.0004516 |
| V10 | -0.0074653 |
| V7 | -0.0104328 |
| V8 | -0.0104863 |
| V6 | -0.0119652 |
The conditional importances show a broadly similar pattern (V4 and V2 on top, V6-V10 near zero), but the scores for V1 and duplicate1 shrink markedly relative to the traditional measure, because the conditional permutation scheme adjusts for correlation between predictors.
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
# Training it without the duplicate predictor
gbm_grid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
.n.trees = seq(100, 1000, by = 100),
.shrinkage = c(0.01, 0.1),
.n.minobsinnode = 10)
gbm_model_1 <- train(y ~ ., data = select(simulated, -duplicate1), method="gbm", tuneGrid = gbm_grid, verbose = FALSE)
gbm_imp_1 <- varImp(gbm_model_1)
gbm_imp_1$importance %>%
as.data.frame() %>%
rownames_to_column() %>%
rename(Variable = rowname) %>%
arrange(desc(Overall)) %>%
kable() %>%
kable_styling()

| Variable | Overall |
|---|---|
| V4 | 100.0000000 |
| V1 | 94.3492687 |
| V2 | 85.3573190 |
| V5 | 37.3645989 |
| V3 | 31.2757450 |
| V6 | 3.2412743 |
| V7 | 0.9391656 |
| V9 | 0.5040653 |
| V10 | 0.1698201 |
| V8 | 0.0000000 |
The boosted tree model behaves like the random forest: the uninformative predictors (V6-V10) receive importance scores near zero.
# With duplicate predictor
gbm_model_2 <- train(y ~ ., data = simulated, method="gbm", tuneGrid = gbm_grid, verbose = FALSE)
gbm_imp_2 <- varImp(gbm_model_2)
gbm_imp_2$importance %>%
as.data.frame() %>%
rownames_to_column() %>%
rename(Variable = rowname) %>%
arrange(desc(Overall)) %>%
kable() %>%
kable_styling()

| Variable | Overall |
|---|---|
| V4 | 100.000000 |
| V2 | 81.847724 |
| V1 | 54.988209 |
| V5 | 43.118057 |
| duplicate1 | 39.179131 |
| V3 | 33.518028 |
| V6 | 2.929334 |
| V8 | 2.052144 |
| V7 | 2.045898 |
| V10 | 1.305226 |
| V9 | 0.000000 |
With duplicate1 included, it takes on substantial importance and the importance of V1 drops, the same dilution seen with the random forest.
# Without the duplicate predictor
cubist_model_1 <- train(y ~ ., data = select(simulated, -duplicate1), method="cubist")
cubist_imp_1 <- varImp(cubist_model_1)
cubist_imp_1$importance %>%
as.data.frame() %>%
rownames_to_column() %>%
rename(Variable = rowname) %>%
arrange(desc(Overall)) %>%
kable() %>%
kable_styling()

| Variable | Overall |
|---|---|
| V1 | 100.00000 |
| V2 | 75.69444 |
| V4 | 68.05556 |
| V3 | 58.33333 |
| V5 | 55.55556 |
| V6 | 15.27778 |
| V7 | 0.00000 |
| V8 | 0.00000 |
| V9 | 0.00000 |
| V10 | 0.00000 |
The Cubist model assigns zero importance to V7-V10, so it effectively ignores the uninformative predictors (only V6 receives a small score).
cubist_model_2 <- train(y ~ ., data = simulated, method="cubist")
cubist_imp_2 <- varImp(cubist_model_2)
cubist_imp_2$importance %>%
as.data.frame() %>%
rownames_to_column() %>%
rename(Variable = rowname) %>%
arrange(desc(Overall)) %>%
kable() %>%
kable_styling()

| Variable | Overall |
|---|---|
| V2 | 100.00000 |
| V1 | 89.51613 |
| V4 | 80.64516 |
| V3 | 67.74194 |
| duplicate1 | 59.67742 |
| V5 | 50.00000 |
| V6 | 25.00000 |
| V7 | 0.00000 |
| V8 | 0.00000 |
| V9 | 0.00000 |
| V10 | 0.00000 |
With duplicate1 added, the Cubist model exhibits behavior similar to the random forest: duplicate1 absorbs part of V1's importance, though the dilution is less pronounced.
Use a simulation to show tree bias with different granularities.
A dataset is simulated from a non-linear function plus Gaussian noise; regression trees of increasing depth (granularity) are then fit, and the MSE on the training and test sets is examined.
set.seed(42)
n_sample <- 350
nonlinear_function <- function(x){
sin(1.25 * x) + 2 * cos(.25*x)
}
x <- runif(n_sample, 1, 25)
f_of_x <- nonlinear_function(x)
noise <- rnorm(n_sample, 0, 2)
y <- f_of_x + noise
df <- data.frame(y=y, x=x)
in_train <- createDataPartition(df$y, p = .8, list = FALSE, times = 1)
train_df <- df[in_train,]
test_df <- df[-in_train,]
results <- data.frame(Granularity = c(NA), MSE = c(NA), data = c(NA)) %>% na.omit()
get_mse <- function(model, data){
y_hat <- predict(model, data)
mse <- mean((y_hat - data$y)^2)
return(mse)
}
for(depth in 1:10){
rtree_model <- rpart(y ~ x, data = train_df, control=rpart.control(maxdepth=depth))
results <- rbind(results, data.frame(Granularity = depth, MSE = get_mse(rtree_model, train_df), data = "Training"))
results <- rbind(results, data.frame(Granularity = depth, MSE = get_mse(rtree_model, test_df), data = "Test"))
}
ggplot(results, aes(Granularity, MSE, color = data, group = data)) +
geom_line() +
geom_point() +
scale_color_brewer(palette = "Set1") +
theme(legend.position = "bottom", legend.title = element_blank())

As the granularity (tree depth) increases, the MSE on the training set decreases monotonically. The test-set MSE initially declines, then rises again as the deeper trees begin to overfit.
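The exercise's "granularity" can also be read as the number of distinct values a predictor takes. Under that reading, a short sketch (all names illustrative) demonstrates the classic selection bias: with a pure-noise response and two equally uninformative predictors, rpart's exhaustive split search favors the predictor offering more candidate split points.
set.seed(42)
first_split <- replicate(500, {
  noise_df <- data.frame(
    y        = rnorm(100),                         # pure-noise response
    x_coarse = sample(1:2, 100, replace = TRUE),   # 2 distinct values
    x_fine   = rnorm(100)                          # ~100 distinct values
  )
  fit <- rpart(y ~ ., data = noise_df,
               control = rpart.control(maxdepth = 1, cp = -1))   # cp = -1 forces a split
  rownames(fit$splits)[1]                          # variable chosen at the root
})
table(first_split)   # x_fine is chosen far more often despite carrying no signal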
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
With both parameters at 0.1 (left), each tree sees a different small subsample and contributes only a small update, so many predictors get opportunities to be used and importance spreads across them. With both at 0.9 (right), the large learning rate makes each tree's contribution dominant, so the model quickly commits to the strongest predictors and concentrates importance on just a few variables.
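A minimal sketch of reproducing the two extremes on the solubility data (fit_gbm is a hypothetical helper; n.trees = 100 and the gbm defaults are assumptions, not the book's exact settings):
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)   # provides solTrainXtrans and solTrainY
fit_gbm <- function(bag_frac, shrink, depth = 1) {
  gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
          n.trees = 100, interaction.depth = depth,
          shrinkage = shrink, bag.fraction = bag_frac, verbose = FALSE)
}
gbm_low  <- fit_gbm(0.1, 0.1)   # left-hand panel: both parameters 0.1
gbm_high <- fit_gbm(0.9, 0.9)   # right-hand panel: both parameters 0.9
head(summary(gbm_low,  plotit = FALSE), 10)   # influence spread over many predictors
head(summary(gbm_high, plotit = FALSE), 10)   # influence concentrated in a few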
Which model do you think would be more predictive of other samples?
The model on the left should be more predictive of new samples: its importance is spread across many predictors, suggesting it captures more of the underlying structure, whereas the model on the right is more likely to overfit the training data.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth lets each tree use more predictors, so importance spreads across more variables and the slope of the predictor-importance plot flattens for both models.
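To see this, the same hypothetical fit_gbm helper from above can be refit with a larger interaction depth:
gbm_low_deep  <- fit_gbm(0.1, 0.1, depth = 10)
gbm_high_deep <- fit_gbm(0.9, 0.9, depth = 10)
head(summary(gbm_low_deep,  plotit = FALSE), 10)   # flatter importance profile
head(summary(gbm_high_deep, plotit = FALSE), 10)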
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Which tree-based regression model gives the optimal resampling and test set performance?
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Reproducible
set.seed(42)
knn_model <- preProcess(ChemicalManufacturingProcess, "knnImpute")
df <- predict(knn_model, ChemicalManufacturingProcess)
df <- df %>%
select_at(vars(-one_of(nearZeroVar(., names = TRUE))))
in_train <- createDataPartition(df$Yield, times = 1, p = 0.8, list = FALSE)
train_df <- df[in_train, ]
test_df <- df[-in_train, ]
pls_model <- train(
Yield ~ ., data = train_df, method = "pls",
center = TRUE,
scale = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 25
)
pls_predictions <- predict(pls_model, test_df)
pls_in_sample <- pls_model$results[pls_model$results$ncomp == pls_model$bestTune$ncomp,]
results <- data.frame(t(postResample(pred = pls_predictions, obs = test_df$Yield))) %>%
mutate("In Sample RMSE" = pls_in_sample$RMSE,
"In Sample Rsquared" = pls_in_sample$Rsquared,
"In Sample MAE" = pls_in_sample$MAE,
"Model"= "PLS")
pls_model
## Partial Least Squares
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 128, 129, 130, 129, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.8824790 0.3779221 0.6711462
## 2 1.1458456 0.4219806 0.7086431
## 3 0.7363066 0.5244517 0.5688553
## 4 0.8235294 0.5298005 0.5933120
## 5 0.9670735 0.4846010 0.6371199
## 6 0.9959036 0.4776684 0.6427478
## 7 0.9119517 0.4986338 0.6200233
## 8 0.9068621 0.5012144 0.6293371
## 9 0.8517370 0.5220166 0.6163795
## 10 0.8919356 0.5062912 0.6332243
## 11 0.9173758 0.4934557 0.6463164
## 12 0.9064125 0.4791526 0.6485663
## 13 0.9255289 0.4542181 0.6620193
## 14 1.0239913 0.4358371 0.6944056
## 15 1.0754710 0.4365214 0.7077991
## 16 1.1110579 0.4269065 0.7135684
## 17 1.1492855 0.4210485 0.7222868
## 18 1.1940639 0.4132534 0.7396357
## 19 1.2271867 0.4079005 0.7494818
## 20 1.2077102 0.4022859 0.7470327
## 21 1.2082648 0.4026711 0.7452969
## 22 1.2669285 0.3987044 0.7634170
## 23 1.3663033 0.3970188 0.7957514
## 24 1.4531634 0.3898475 0.8243034
## 25 1.5624265 0.3820102 0.8612935
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
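The output above establishes the PLS baseline; the tree-based models the exercise asks for can be trained through the same pipeline. A minimal sketch with a random forest (rf_chem_model and rf_chem_predictions are illustrative names), evaluated the same way as the PLS model:
rf_chem_model <- train(
  Yield ~ ., data = train_df, method = "rf",
  trControl = trainControl("cv", number = 10),
  importance = TRUE,
  tuneLength = 10
)
rf_chem_predictions <- predict(rf_chem_model, test_df)
postResample(pred = rf_chem_predictions, obs = test_df$Yield)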