Assignment
Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.
Libraries
library(party)
library(partykit)
library(dplyr)
library(gbm)
library(ipred)
library(rpart)
library(rpart.plot)
Exercises
8.1
Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Part A
Fit a random forest model to all of the predictors, then estimate the variable importance scores. Did the random forest model significantly use the uninformative predictors (V6 – V10)?
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
No, the random forest model above did not assign meaningful importance to the uninformative predictors V6–V10.
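As a quick check (a minimal sketch, assuming the rfImp1 object computed above), we can pull out just the rows for the uninformative predictors:
# importance scores for the uninformative predictors only
rfImp1[paste0("V", 6:10), , drop = FALSE]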
Part B
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216
Fit another random forest on this data. Did the importance score for V1 change? What happens if we add another correlated feature?
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
Yes, the importance score for V1 was reduced after including an additional, highly correlated feature. Let’s see what happens when we add yet another.
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
As expected, adding another correlated feature brings down the importance of V1 further.
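A small sketch (assuming the rfImp1, rfImp2, and rfImp3 objects above) that tracks how V1’s importance shrinks as correlated copies are added:
# compare V1's importance across the three random forest fits
data.frame(fit = c("no duplicates", "one duplicate", "two duplicates"),
           V1_importance = c(rfImp1["V1", "Overall"],
                             rfImp2["V1", "Overall"],
                             rfImp3["V1", "Overall"]))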
Part C
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same patterns as the traditional random forest model?
model4 <- cforest(y ~ ., data = simulated %>%
select(-c('duplicate1','duplicate2')))
rfImp4 <- varimp(model4, conditional = TRUE)
rfImp4
## V1 V2 V3 V4 V5 V6
## 6.29693233 5.31985191 0.08627462 5.91248630 1.45798478 -0.16535778
## V7 V8 V9 V10
## -0.12086303 -0.50833336 -0.28273187 -0.16748484
Without the duplicated predictors, the conditional inference forest shows the same overall pattern as the traditional random forest: V1, V2, and V4 dominate, V5 contributes less, and the uninformative predictors V6–V10 sit near zero. Recall that the first traditional model relied heavily on V1, and that adding the correlated features dropped its score to around 5.
# trying again, but with duplicated features
model5 <- cforest(y ~ ., data = simulated)
rfImp5 <- varimp(model5, conditional = TRUE)
rfImp5
## V1 V2 V3 V4 V5 V6
## 2.72969800 4.97706847 0.01936393 5.28607283 1.44927112 -0.14961263
## V7 V8 V9 V10 duplicate1 duplicate2
## -0.16061885 -0.32586285 -0.21777891 -0.12420236 1.08914433 1.26690049
Interestingly, once the highly correlated features are included, the conditional inference forest penalizes them much more strongly than the traditional random forest does: V1’s conditional importance falls to about 2.7, and the two duplicates each receive only modest importance.
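To see this side by side (a sketch reusing model5 and rfImp5 from above), we can put the traditional and conditional permutation importances from the same cforest fit next to each other:
# conditional = FALSE (the default) gives the traditional permutation importance;
# conditional = TRUE is the Strobl et al. (2007) adjustment already stored in rfImp5
cbind(traditional = varimp(model5, conditional = FALSE),
      conditional = rfImp5)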
Part D
Repeat this process with different tree models, such as boosted trees and Cubist.
# Boosted Trees
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode=10)
set.seed(100)
gbmTune <- train(y ~ ., data = simulated,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
rfImp6 <- varImp(gbmTune$finalModel, scale = FALSE)
rfImp6
With boosted trees the scale of the feature importance values is completely different, but the relative pattern is similar: the importance that V1 alone carried before is now diluted across V1, duplicate1, and duplicate2.
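To make the scales easier to compare (a sketch, assuming the rfImp6 object above), the gbm importances can be rescaled to percentages of the total:
# relative (%) importance, sorted from largest to smallest
rfImp6_rel <- 100 * rfImp6 / sum(rfImp6$Overall)
rfImp6_rel[order(-rfImp6_rel$Overall), , drop = FALSE]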
# Cubist
set.seed(100)
cubistTuned <- train(y ~ ., data = simulated, method = "cubist")
rfImp7 <- varImp(cubistTuned$finalModel, scale = FALSE)
rfImp7
Interestingly, the Cubist model does not penalize the correlated features much at all: V1 still has the highest feature importance, and duplicate1 also receives some importance.
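A quick sketch (assuming the rfImp7 object above) isolating V1 and its correlated copies from the Cubist importances:
# Cubist importance for V1 and its two correlated copies
rfImp7[c("V1", "duplicate1", "duplicate2"), , drop = FALSE]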
8.2
Use a simulation to show tree bias with different granularities.
According to the text, tree-based models tend to overvalue features with a higher number of distinct values. We can test this with a loop that builds many random forest models. Each simulated dataset has one feature with only 2 distinct values and another with 10 distinct values, and the target variable is generated from the 2-level feature only. If the bias exists, the models should still rank the feature with 10 distinct values as the more important one.
important_features <- tibble()
for (i in 1:100) {
  # one feature with 2 distinct values, one with 10 distinct values
  sim <- tibble(X2 = sample(1:2, 500, replace = TRUE))
  sim$X10 <- sample(1:10, 500, replace = TRUE)
  # the target is generated from X2 only; X10 is pure noise
  sim$Y <- sim$X2 * rnorm(500, mean = 0, sd = 1)
  model <- randomForest(Y ~ ., data = sim)
  # record the most important variable for this fit
  important_feature <- varImp(model) %>% slice_max(order_by = Overall) %>% row.names()
  important_features <- rbind(important_features, important_feature)
}
table(important_features)
## important_features
## X10 X2
## 98 2
Almost every time, the feature with 10 distinct values came out as the most important, even though the target variable was generated only from the feature with two distinct values.
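The source of the bias is the number of candidate split points each predictor offers. A minimal sketch on a single simulated draw:
# X10 offers many more candidate split points than X2, so it has more chances
# to produce a spurious improvement at a split
sim <- tibble(X2 = sample(1:2, 500, replace = TRUE),
              X10 = sample(1:10, 500, replace = TRUE))
sapply(sim, function(x) length(unique(x)) - 1)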
8.3
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Part A
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
Because the model on the right uses a much more aggressive learning rate and a much more inclusive bagging fraction (a sketch of this contrast follows below).
Learning rate: at 0.9 the contribution of each tree is barely shrunk, so the boosting process commits quickly to the few predictors that reduce error the most and keeps building on them.
Bagging fraction: because the bagging fraction is large, the samples used to grow each tree are nearly identical, which reduces the variation in which predictors get selected across iterations.
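A minimal sketch of this contrast, assuming the solubility data from AppliedPredictiveModeling (the data behind Figure 8.24); it does not reproduce the exact tuning in the figure, only the two extreme settings:
library(AppliedPredictiveModeling)
data(solubility)  # provides solTrainXtrans and solTrainY
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
# left-hand settings: small learning rate, small bagging fraction
set.seed(100)
gbm_left <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
# right-hand settings: large learning rate, large bagging fraction
set.seed(100)
gbm_right <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
# importance is spread across many predictors on the left, concentrated on the right
head(summary(gbm_left, plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)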
Part B
Which model would you think is more predictive of other samples?
I would expect the model with the smaller learning rate and smaller bagging fraction (the model on the left) to perform better on new samples. Placing so much importance on just a few features of the training data is a sign of over-fitting.
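One way to check this intuition (a sketch reusing solTrain from the Part A sketch above) is to compare cross-validated error for the two extreme settings:
# 5-fold CV error for each extreme; lower is better
set.seed(100)
cv_left <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1, cv.folds = 5)
set.seed(100)
cv_right <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9, cv.folds = 5)
c(left = min(cv_left$cv.error), right = min(cv_right$cv.error))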
Part C
How would increasing interaction depth affect the slope of the predictor importance for either model in Fig. 8.24?
Increasing the interaction depth allows more splits and nodes per tree, so more features can enter each tree and interact with each other. For both models in Fig. 8.24, we would expect the predictors currently at (or near) zero importance to pick up some importance, so the importance profile would flatten.
In other words, the slope would decrease (fewer variables would dominate) as the interaction depth increases, because deeper trees give the model room to use more variables.
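A sketch (reusing solTrain and the gbm_right fit from the Part A sketch above) that refits the aggressive model with a deeper interaction depth and measures how concentrated the importances are:
set.seed(100)
gbm_deep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                interaction.depth = 10)
imp_shallow <- summary(gbm_right, plotit = FALSE)  # depth-1 fit from the Part A sketch
imp_deep <- summary(gbm_deep, plotit = FALSE)
# share of total importance captured by the top 5 predictors at each depth
c(depth_1 = sum(head(imp_shallow$rel.inf, 5)) / sum(imp_shallow$rel.inf),
  depth_10 = sum(head(imp_deep$rel.inf, 5)) / sum(imp_deep$rel.inf))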
8.7
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Preparing Data
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")# impute missing valuues
knn_impute_preproc <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute"))
imputed <- predict(knn_impute_preproc, ChemicalManufacturingProcess)# Train test split
set.seed(42)
index <- createDataPartition(imputed$Yield, p=0.7, list=FALSE)
train <- imputed[index, ]
test <- imputed[-index, ]
# Scaling
train_scale <- preProcess(train, method = c("center", "scale"))
# (strictly, applying train_scale to the test set would avoid estimating scaling parameters from test data)
test_scale <- preProcess(test, method = c("center", "scale"))
## Warning in preProcess.default(test, method = c("center", "scale")): These
## variables have zero variances: BiologicalMaterial07
train_prep <- predict(train_scale, train)
test_prep <- predict(test_scale, test)
Testing Models
train_y <- train_prep$Yield
train_x <- train_prep %>% select(-c("Yield"))
test_y <- test_prep$Yield
test_x <- test_prep %>% select(-c("Yield"))
Single Tree (CART)
set.seed(100)
cartTune <- train(train_x, train_y,
method = "rpart",
tuneLength = 10,
trControl = trainControl(method = "cv"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
cartPred <- predict(cartTune, test_x)
postResample(cartPred, test_y)
## RMSE Rsquared MAE
## 0.8260632 0.3574677 0.6248479
Boosted Trees
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
set.seed(100)
gbmTune <- train(train_x, train_y,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
gbmPred <- predict(gbmTune, test_x)
postResample(gbmPred, test_y)
## RMSE Rsquared MAE
## 0.6060616 0.6476292 0.4803647
Bagged Trees
set.seed(100)
baggedTree <- ipredbagg(train_y, train_x)
baggedPred <- predict(baggedTree, test_x)
postResample(baggedPred, test_y)
## RMSE Rsquared MAE
## 0.6296861 0.6325148 0.5043197
Part A
Which tree-based model gives the best test performance?
Of the three models tested (a single CART tree, boosted trees, and bagged trees), the boosted trees model performs best on the test set, with the lowest RMSE and highest R-squared.
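For a side-by-side view of the test-set metrics (a small sketch using the prediction objects computed above):
# test-set metrics for the three tree-based models
rbind(CART = postResample(cartPred, test_y),
      Boosted = postResample(gbmPred, test_y),
      Bagged = postResample(baggedPred, test_y))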
Part B
Which predictors are most important in the optimal tree-based model? Do the biological or the process predictors dominate the list? How do the top 10 important predictors compare to the top 10 predictors for the linear models?
varImp(gbmTune, scale = FALSE)
## gbm variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 1047.53
## ManufacturingProcess09 343.85
## BiologicalMaterial12 259.74
## ManufacturingProcess13 237.51
## BiologicalMaterial03 210.57
## ManufacturingProcess17 173.17
## ManufacturingProcess31 164.43
## ManufacturingProcess06 148.38
## ManufacturingProcess05 115.43
## ManufacturingProcess20 112.89
## ManufacturingProcess15 103.82
## ManufacturingProcess28 102.40
## BiologicalMaterial09 96.61
## BiologicalMaterial05 89.47
## BiologicalMaterial08 88.89
## ManufacturingProcess39 85.41
## BiologicalMaterial06 81.59
## ManufacturingProcess04 75.26
## ManufacturingProcess01 75.09
## ManufacturingProcess11 73.99
The manufacturing process predictors dominate the boosted trees importance rankings. This is in contrast to the linear models we evaluated previously, which showed a much more even split between process and biological predictors.
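For a direct comparison, here is a sketch of how one might refit a linear model on the same split and look at its top 10 predictors. It assumes a PLS fit (via the pls package) as a stand-in for the linear model from Exercise 6.3, and the plsTune object name is hypothetical:
# hypothetical linear-model comparison: PLS on the same training data
set.seed(100)
plsTune <- train(train_x, train_y,
                 method = "pls",
                 tuneLength = 20,
                 trControl = trainControl(method = "cv"))
# top-10 predictors for the linear model vs the boosted trees model
plot(varImp(plsTune, scale = FALSE), top = 10)
plot(varImp(gbmTune, scale = FALSE), top = 10)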
Part C
Plot the optimal single tree with the distribution of yield in the terminals. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
single_tree <- rpart(Yield ~ ., data = train_prep)
plot(as.party(single_tree), ip_args = list(abbreviate = 4), gp = gpar(fontsize = 7))
It is difficult to draw firm conclusions about the relationship between the process or biological predictors and yield from this view alone. It does show that yield is typically higher when BiologicalMaterial12 is above about -0.66 and BiologicalMaterial06 is above about 0.66 (on the centered and scaled scale).
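To complement the plot (a sketch using the single_tree fit above), the terminal-node yield distributions can also be summarized numerically from rpart's node assignments:
# mean and spread of (scaled) yield in each terminal node of the single tree
train_prep %>%
  mutate(node = factor(single_tree$where)) %>%
  group_by(node) %>%
  summarise(n = n(), mean_yield = mean(Yield), sd_yield = sd(Yield))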