Assignment

Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.

Libraries

library(party)
## Warning: package 'party' was built under R version 4.1.2
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.1.2
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.1.2
library(partykit)
## Warning: package 'partykit' was built under R version 4.1.2
## Loading required package: libcoin
## 
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gbm)
## Loaded gbm 2.1.8
library(ipred)
library(rpart)
## Warning: package 'rpart' was built under R version 4.1.2
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.1.2

Exercises

8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)

simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

Part A

Fit a random forest model to all of the predictors, then estimate the variable importance scores. Did the random forest model significantly use the uninformative predictors (V6 – V10)?

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.1.2
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)

rfImp1 <- varImp(model1, scale = FALSE)

rfImp1

No, the random forest model above did not assign meaningful importance to the uninformative predictors V6-V10; their scores are far below those of the informative predictors.
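To make that easier to read, a quick sketch (assuming `rfImp1` is the one-column data frame with an `Overall` column that `caret::varImp` returns for a `randomForest` fit) sorts the scores so V6-V10 can be compared directly against V1-V5:

# Sort the importance scores, keeping the predictor names as row names
rfImp1[order(-rfImp1$Overall), , drop = FALSE]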

Part B

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216

Fit another random forest on this data. Did the importance score for V1 change? What happens if we add another correlated feature?

model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)

rfImp2 <- varImp(model2, scale = FALSE)

rfImp2

Yes, the importance score for V1 was reduced after including an additional, highly correlated feature. Let’s see what happens when we add yet another.

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)

rfImp3 <- varImp(model3, scale = FALSE)

rfImp3

As expected, adding another correlated feature brings the importance of V1 down further.
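As a quick side-by-side check (a sketch that reuses the three importance objects created above and assumes each has the `Overall` column produced by `varImp`), V1's score can be pulled from each fit:

# Compare V1's importance before and after adding the correlated predictors
data.frame(no_duplicates  = rfImp1["V1", "Overall"],
           one_duplicate  = rfImp2["V1", "Overall"],
           two_duplicates = rfImp3["V1", "Overall"],
           row.names = "V1")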

Part C

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same patterns as the traditional random forest model?

model4 <- cforest(y ~ ., data = simulated %>%
  select(-c('duplicate1','duplicate2')))

rfImp4 <- varimp(model4, conditional = TRUE)

rfImp4
##          V1          V2          V3          V4          V5          V6 
##  6.29693233  5.31985191  0.08627462  5.91248630  1.45798478 -0.16535778 
##          V7          V8          V9         V10 
## -0.12086303 -0.50833336 -0.28273187 -0.16748484

Without the duplicated predictors, the conditional-inference forest shows the same pattern as the traditional random forest: V1, V2, and V4 carry most of the importance, while V6-V10 contribute essentially nothing. In the first traditional fit, V1 was the most heavily used predictor, and adding the correlated copies pulled its score down to around 5.

# trying again, but with duplicated features

model5 <- cforest(y ~ ., data = simulated)

rfImp5 <- varimp(model5, conditional = TRUE)

rfImp5
##          V1          V2          V3          V4          V5          V6 
##  2.72969800  4.97706847  0.01936393  5.28607283  1.44927112 -0.14961263 
##          V7          V8          V9         V10  duplicate1  duplicate2 
## -0.16061885 -0.32586285 -0.21777891 -0.12420236  1.08914433  1.26690049

Interestingly, when highly correlated predictors are present, the conditional importance measure penalizes them much more strongly than the traditional random forest does: V1 drops to about 2.7, with the two duplicates splitting the remaining credit.
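One way to see that this is driven by the conditional permutation scheme (a sketch; it reuses `model5` from above) is to compute the unconditional importance for the same cforest fit and compare it with the conditional scores printed above:

# Traditional (unconditional) permutation importance on the same cforest fit;
# the gap for V1/duplicate1/duplicate2 shows how much the conditional scheme
# discounts correlated predictors.
varimp(model5, conditional = FALSE)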

Part D

Repeat this process with different tree models, such as boosted and cubist.

# Boosted Trees

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode=10)
set.seed(100)

gbmTune <- train(y ~ ., data = simulated,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

rfImp6 <- varImp(gbmTune$finalModel, scale = FALSE)

rfImp6

With boosted trees, the scale of the importance values is completely different, but the relative pattern is similar: V1, duplicate1, and duplicate2 all dilute one another's importance.

# Cubist

set.seed(100)
cubistTuned <- train(y ~ ., data = simulated, method = "cubist")
rfImp7 <- varImp(cubistTuned$finalModel, scale = FALSE)

rfImp7

Interestingly, the cubist model does not punish the correlated features much at all. In fact, V1 has the highest feature importance, and duplicate1 actually has some importance as well.
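Putting the three importance tables side by side makes the contrast easier to see (a sketch; it assumes `rfImp3`, `rfImp6`, and `rfImp7` each expose an `Overall` column with the predictor names as row names):

# Importance of V1 and its two near-copies under each model fit above
vars <- c("V1", "duplicate1", "duplicate2")
data.frame(randomForest = rfImp3[vars, "Overall"],
           gbm          = rfImp6[vars, "Overall"],
           cubist       = rfImp7[vars, "Overall"],
           row.names    = vars)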

8.2

Use a simulation to show tree bias with different granularities.

According to the text, tree-based models are biased toward predictors with more distinct values. We can test this by looping over many simulated datasets and fitting a random forest to each. Each dataset has one predictor with only 2 distinct values and another with 10 distinct values, and the target is constructed from the 2-level predictor alone. If the bias exists, the models should still tend to rank the 10-level predictor as more important.

important_features <- tibble()

for(i in 1:100){
  sim <- tibble(X2 = sample(1:2, 500, replace=TRUE))
  sim$X10 <- sample(1:10, 500, replace=TRUE)
  sim$Y <- sim$X2 * rnorm(500, mean = 0, sd = 1)  # Y depends only on the 2-level predictor
  
  model <- randomForest(Y ~ ., data = sim)
  
  # Most important variable
  important_feature <- varImp(model) %>% slice_max(order_by = Overall) %>% row.names()
  
  important_features <- rbind(important_features, important_feature)
  
  
}

table(important_features)
## important_features
## X10  X2 
##  98   2

Almost every time, the predictor with 10 distinct values was ranked most important, even though the target was constructed only from the 2-level predictor.
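The same bias shows up even with a single CART tree. The sketch below mirrors the loop above and simply tallies which predictor rpart chooses for its root split over repeated simulations (cp = 0 forces a split so the question is which variable gets chosen, not whether to split at all):

# Which predictor does rpart pick for its first (root) split?
set.seed(100)
first_splits <- replicate(100, {
  sim <- data.frame(X2  = sample(1:2, 500, replace = TRUE),
                    X10 = sample(1:10, 500, replace = TRUE))
  sim$Y <- sim$X2 * rnorm(500)
  fit <- rpart(Y ~ ., data = sim, control = rpart.control(cp = 0))
  as.character(fit$frame$var[1])  # root split variable; "<leaf>" if no split
})
table(first_splits)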

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

Part A

Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

Because the model on the right uses both a much more aggressive learning rate and a much larger bagging fraction; a rough illustration follows the bullets below.

  • Learning rate: at 0.9, each tree's contribution is barely shrunk, so the model is effectively unregularized. Each boosting iteration chases the current residuals aggressively, which concentrates the splits (and therefore the importance) on a few dominant predictors.

  • Bagging fraction: at 0.9, the samples used to grow each tree are nearly identical, so there is little tree-to-tree variation in which predictors get selected, and the same few predictors keep accumulating importance.
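A rough illustration of both effects (a sketch only; it reuses the Friedman simulation from 8.1 as a stand-in for the solubility data, so the exact numbers will differ from Fig. 8.24):

# Fit gbm with the two extreme settings and compare how concentrated the
# relative influence is across predictors.
set.seed(100)
fit_left  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 500, interaction.depth = 3,
                 shrinkage = 0.1, bag.fraction = 0.1)
fit_right <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 500, interaction.depth = 3,
                 shrinkage = 0.9, bag.fraction = 0.9)
summary(fit_left,  plotit = FALSE)  # influence spread over many predictors
summary(fit_right, plotit = FALSE)  # influence concentrated on the top few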

Part B

Which model would you think is more predictive of other samples?

I would assume that the model with the smaller learning rate and smaller bagging fraction would perform better on new samples (the model on the left). Placing too much importance on just a few features in the training data seems to be an indicator of over-fitting.

Part C

How would increasing interaction depth affect the slope of the predictor importance for either model in Fig. 8.24?

Increasing the interaction depth allows more splits per tree, and more splits means more predictors can enter the model and interact with each other. In both plots we would expect the predictors currently at (or near) zero importance to pick up some importance, so the importance profile would become flatter and less skewed toward the top few predictors.

In other words, the slope of the predictor-importance curve should decrease (fewer variables would dominate) as the interaction depth increases, because deeper trees spread the splits, and therefore the importance, across more variables.
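A quick way to check this intuition (again just a sketch on the Friedman simulation rather than the solubility data):

# Count how many predictors receive any relative influence at a shallow versus
# a deep interaction depth; deeper trees pull more predictors into the model.
set.seed(100)
fit_shallow <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                   n.trees = 500, shrinkage = 0.1, interaction.depth = 1)
fit_deep    <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                   n.trees = 500, shrinkage = 0.1, interaction.depth = 10)
sum(summary(fit_shallow, plotit = FALSE)$rel.inf > 0)
sum(summary(fit_deep,    plotit = FALSE)$rel.inf > 0)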

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

Preparing Data

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
# Impute missing values with kNN

knn_impute_preproc <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute"))

imputed <- predict(knn_impute_preproc, ChemicalManufacturingProcess)
# Train test split
set.seed(42)

index <- createDataPartition(imputed$Yield, p=0.7, list=FALSE) 
train <- imputed[index, ]
test <- imputed[-index, ]
# Scaling

train_scale <- preProcess(train, method = c("center", "scale"))
test_scale <- preProcess(test, method = c("center", "scale"))
## Warning in preProcess.default(test, method = c("center", "scale")): These
## variables have zero variances: BiologicalMaterial07
train_prep <- predict(train_scale, train)
test_prep <- predict(test_scale, test)

Testing Models

train_y <- train_prep$Yield
train_x <- train_prep %>% select(-c("Yield"))

test_y <- test_prep$Yield
test_x <- test_prep %>% select(-c("Yield"))

Single Tree (CART)

set.seed(100)

cartTune <- train(train_x, train_y,
                  method = "rpart",
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
cartPred <- predict(cartTune, test_x)

postResample(cartPred, test_y)
##      RMSE  Rsquared       MAE 
## 0.8260632 0.3574677 0.6248479

Boosted Trees

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(100)

gbmTune <- train(train_x, train_y,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

gbmPred <- predict(gbmTune, test_x)

postResample(gbmPred, test_y)
##      RMSE  Rsquared       MAE 
## 0.6060616 0.6476292 0.4803647

Bagged Trees

set.seed(100)

baggedTree <- ipredbagg(train_y, train_x)
 
baggedPred <- predict(baggedTree, test_x)

postResample(baggedPred, test_y)
##      RMSE  Rsquared       MAE 
## 0.6296861 0.6325148 0.5043197

Part A

Which tree-based model gives the best test performance?

From the above, we can see that of the three models tested (a single CART tree, boosted trees, and bagged trees), the boosted trees model performed best on the test set, with the lowest RMSE (0.606) and the highest Rsquared (0.648).

Part B

Which predictors are most important in the optimal tree-based model? Do either the biological or process predictors dominate the list? How do the top 10 important predictors compare to the top 10 predictors for the linear models?

varImp(gbmTune, scale = FALSE)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 1047.53
## ManufacturingProcess09  343.85
## BiologicalMaterial12    259.74
## ManufacturingProcess13  237.51
## BiologicalMaterial03    210.57
## ManufacturingProcess17  173.17
## ManufacturingProcess31  164.43
## ManufacturingProcess06  148.38
## ManufacturingProcess05  115.43
## ManufacturingProcess20  112.89
## ManufacturingProcess15  103.82
## ManufacturingProcess28  102.40
## BiologicalMaterial09     96.61
## BiologicalMaterial05     89.47
## BiologicalMaterial08     88.89
## ManufacturingProcess39   85.41
## BiologicalMaterial06     81.59
## ManufacturingProcess04   75.26
## ManufacturingProcess01   75.09
## ManufacturingProcess11   73.99

The manufacturing process predictors dominate the boosted-tree importance rankings. This contrasts with the linear models we evaluated previously, which showed a much more even split between process and biological predictors.
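A quick tally confirms the split among the top 10 (a sketch; it pulls the importance table out of the caret object, which stores it under `$importance`):

# Count manufacturing versus biological predictors among the ten most important
imp   <- varImp(gbmTune, scale = FALSE)$importance
top10 <- head(imp[order(-imp$Overall), , drop = FALSE], 10)
table(ifelse(grepl("^Manufacturing", rownames(top10)), "Process", "Biological"))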

Part C

Plot the optimal single tree with the distribution of yield in the terminals. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

single_tree <- rpart(Yield ~ ., data = train_prep)
plot(as.party(single_tree), ip_args = list(abbreviate = 4), gp = gpar(fontsize = 7))

It is difficult to ascertain much from this view about how the process and biological predictors relate to yield beyond what the importance rankings already showed. The tree does suggest that yield is typically higher when BiologicalMaterial12 is above roughly -0.66 and BiologicalMaterial06 is above roughly 0.66 (both on the centered and scaled scale).
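One way to back the plot with numbers (a sketch; it uses the `where` component that rpart stores for the training observations) is to summarise yield within each terminal node:

# Mean (centered/scaled) yield and node size for each terminal node of the tree
data.frame(node  = factor(single_tree$where),
           yield = train_prep$Yield) %>%
  group_by(node) %>%
  summarise(mean_yield = mean(yield), n = n())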