DATA624: Homework 9

library(tidyverse)
library(party)
library(mlbench)
library(randomForest)
library(caret)
library(gbm)
library(rpart)
library(AppliedPredictiveModeling)
library(Cubist)
library(rpart.plot)

Objective

Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.

8.1

Recreate the simulated data from Exercise 7.2:

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

Part A

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

kableExtra::kable(rfImp1)
          Overall
V1      8.7322354
V2      6.4153694
V3      0.7635918
V4      7.6151188
V5      2.0235246
V6      0.1651112
V7     -0.0059617
V8     -0.1663626
V9     -0.0952927
V10    -0.0749448

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No. The importance scores for V6 through V10 are all near zero (some are slightly negative), so the random forest did not make meaningful use of the uninformative predictors.
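
For reference, the scores for V6 through V10 can be pulled directly from the importance object above (a minimal check reusing rfImp1):

# Importance of the uninformative predictors: all hover around zero,
# in contrast to the informative predictors V1-V5.
rfImp1[paste0("V", 6:10), , drop = FALSE]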

Part B

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

set.seed(123)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9504983

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

kableExtra::kable(rfImp2)
              Overall
V1          5.1832605
V2          6.3186107
V3          0.6972065
V4          6.9525399
V5          1.9866611
V6          0.2338298
V7         -0.0067128
V8         -0.0837440
V9         -0.0209328
V10        -0.0696097
duplicate1  4.3054797

set.seed(123)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9504983
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)

kableExtra::kable(rfImp3)
              Overall
V1          4.1374642
V2          6.4273743
V3          0.5204054
V4          7.2625004
V5          1.9475777
V6          0.2383798
V7          0.0275598
V8         -0.0475072
V9         -0.0186150
V10        -0.0231890
duplicate1  3.2627027
duplicate2  2.8812822

The importance score for V1 dropped once duplicate1 was added, and adding a second predictor highly correlated with V1 (duplicate2) reduced it further, because the forest splits the credit for V1's information across the correlated predictors.
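
To make the dilution explicit, a small sketch that collects V1's importance from the three fits above (assuming rfImp1, rfImp2, and rfImp3 are still in the workspace):

# V1's importance shrinks each time another predictor correlated with it is
# added, because the forest splits the credit among the correlated copies.
data.frame(
  predictors = c("original", "+ duplicate1", "+ duplicate1 + duplicate2"),
  V1.importance = c(rfImp1["V1", "Overall"],
                    rfImp2["V1", "Overall"],
                    rfImp3["V1", "Overall"])
)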

Part C

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varImp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

model4 <- cforest(y~., data = simulated)
varImp(model4, conditional = FALSE) 
##                 Overall
## V1          4.604711928
## V2          5.754812650
## V3          0.063192997
## V4          7.278888213
## V5          1.851751818
## V6          0.044066429
## V7          0.053314917
## V8          0.004184564
## V9         -0.017524552
## V10        -0.024138736
## duplicate1  3.427396047
## duplicate2  1.928271229
varImp(model4, conditional = TRUE)
##                 Overall
## V1          1.695390053
## V2          4.610713807
## V3          0.047851832
## V4          5.926964145
## V5          1.341996707
## V6          0.023093544
## V7          0.010071784
## V8         -0.007191415
## V9          0.006679198
## V10        -0.011566382
## duplicate1  1.339780765
## duplicate2  0.743524464

The variable importance scores differ between the conditional (conditional = TRUE) and unconditional measures: the conditional measure shrinks the scores of V1 and its correlated duplicates. The overall pattern, however, matches the traditional random forest in that the uninformative predictors V6 through V10 contribute little, if anything, to the model.
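
A side-by-side table makes the two measures easier to compare (a sketch reusing model4; both calls recompute permutation importance, so the values will differ slightly from the output above):

# Unconditional vs. conditional importance from the same cforest fit.
uncond <- varImp(model4, conditional = FALSE)
cond   <- varImp(model4, conditional = TRUE)   # slower: conditional permutations
cf.imp <- data.frame(unconditional = uncond$Overall,
                     conditional   = cond$Overall,
                     row.names     = rownames(uncond))
cf.imp[order(-cf.imp$unconditional), ]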

Part D

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted

The pattern for the uninformative predictors continued, with most receiving an importance (relative influence) of roughly zero. A notable difference is that duplicate2 received no weight at all in the variable importance.

set.seed(123)

model5 <- gbm(y~., data = simulated, distribution = "gaussian")

summary.gbm(model5)

##                   var   rel.inf
## V4                 V4 29.767165
## V2                 V2 22.645801
## V1                 V1 15.792289
## duplicate1 duplicate1 12.558058
## V5                 V5 11.557021
## V3                 V3  7.254762
## V6                 V6  0.424903
## V7                 V7  0.000000
## V8                 V8  0.000000
## V9                 V9  0.000000
## V10               V10  0.000000
## duplicate2 duplicate2  0.000000

Cubist

The Cubist model produced a variable importance ranking similar to the gbm model's: the uninformative predictors were assigned little or no importance, and duplicate2 was again given a value of 0.

model6 <- cubist(x = simulated[,-11], y = simulated$y, committees = 100)

varImp(model6)
##            Overall
## V1            50.0
## V3            46.0
## V2            58.0
## V4            47.5
## V5            30.5
## duplicate1    25.5
## V8             7.5
## V6             3.5
## V7             0.0
## V9             0.0
## V10            0.0
## duplicate2     0.0

8.2

Use a simulation to show tree bias with different granularities.

small <- sample(0:10, 1000, replace = TRUE)     # 11 distinct values
medium <- sample(0:100, 1000, replace = TRUE)   # 101 distinct values
large <- sample(0:1000, 1000, replace = TRUE)   # 1,001 distinct values

y <- small + medium + large + rnorm(1000)       # one noise term per observation
var(large)
## [1] 87808.23
sim.data <- data.frame(y, small, medium, large)

sim.tree <- rpart(y~., data = sim.data)
sim.tree.imp <- varImp(sim.tree, scale = FALSE)

kableExtra::kable(sim.tree.imp)
          Overall
large   4.2620776
medium  0.3133978
small   0.0219418

The decision tree split almost exclusively on the large variable, which takes on the most distinct values and therefore offers the most candidate split points. The medium variable was used somewhat, whereas the small variable, with the fewest distinct values (and the smallest variance), was hardly used for splits. This illustrates the selection bias of single trees toward predictors with finer granularity.
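
Because large also contributes the most to y here, a complementary sketch (my own setup, not from the text) isolates granularity: all three predictors below are pure noise with respect to the response, differing only in how many distinct values they take, yet the tree's importance still tends to concentrate on the finest-grained one.

set.seed(321)
n <- 1000
noise.data <- data.frame(
  y        = rnorm(n),                          # response unrelated to any predictor
  noise2   = sample(0:1,   n, replace = TRUE),  # 2 distinct values
  noise10  = sample(0:9,   n, replace = TRUE),  # 10 distinct values
  noise500 = sample(0:499, n, replace = TRUE)   # 500 distinct values
)

# cp = 0 forces a few splits so the choice of splitting variable is visible.
noise.tree <- rpart(y ~ ., data = noise.data,
                    control = rpart.control(cp = 0, maxdepth = 3))
varImp(noise.tree, scale = FALSE)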

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

Part A

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

The difference comes from how the two tuning parameters are set. The bagging fraction determines how much of the training data is sampled to build each tree, and the learning rate determines how large a step each new tree contributes toward the minimum of the loss. With both parameters at 0.1 (left plot), the model takes small steps and each tree sees only a small, different subsample, so information is drawn from many trees and the importance spreads across more predictors. With both at 0.9 (right plot), the model takes large steps on nearly the full data set, so the first few trees dominate the fit and the importance concentrates on a handful of predictors.
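
As a rough illustration (not the book's exact code, and without the other tuning the authors did), the two profiles in Fig. 8.24 can be approximated on the solubility data from AppliedPredictiveModeling by fitting gbm with both parameters at 0.1 and again at 0.9:

data(solubility)   # provides solTrainXtrans and solTrainY, among other objects

set.seed(100)
gbm.low  <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
                    verbose = FALSE)
set.seed(100)
gbm.high <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                    verbose = FALSE)

# Relative influence: the 0.9 / 0.9 fit should concentrate its influence on a
# few predictors, while the 0.1 / 0.1 fit spreads it across many more.
head(summary(gbm.low,  plotit = FALSE), 10)
head(summary(gbm.high, plotit = FALSE), 10)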

Part B

Which model do you think would be more predictive of other samples?

I would expect the model on the left to be more predictive of other samples, since it is less likely to have overfit the training set.

Part C

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth lets each tree split on more predictors, so the reduction in error is spread across more variables. That would lower (flatten) the slope of the predictor importance plot for either model, since more predictors end up carrying some importance.
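
A quick, hedged check of that claim on the same solubility data (the interaction.depth values here are arbitrary illustration choices):

data(solubility)

set.seed(100)
gbm.shallow <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                       n.trees = 100, shrinkage = 0.1, interaction.depth = 1,
                       verbose = FALSE)
set.seed(100)
gbm.deep    <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                       n.trees = 100, shrinkage = 0.1, interaction.depth = 10,
                       verbose = FALSE)

# Deeper trees can involve more predictors per iteration, so more predictors
# should end up with non-zero relative influence (a flatter importance curve).
sum(summary(gbm.shallow, plotit = FALSE)$rel.inf > 0)
sum(summary(gbm.deep,    plotit = FALSE)$rel.inf > 0)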

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

data("ChemicalManufacturingProcess")

imputed.knn <- preProcess(ChemicalManufacturingProcess,
           method = "knnImpute",
           k = sqrt(nrow(ChemicalManufacturingProcess))
           )

imputed.data <- predict(imputed.knn, ChemicalManufacturingProcess)

near_zero <- nearZeroVar(imputed.data)

imputed.data <- imputed.data[, -near_zero]

set.seed(115)
train_test_split <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.7, list = FALSE)

train.data <- imputed.data[train_test_split,]
test.data <- imputed.data[-train_test_split,]

Part A

Which tree-based regression model gives the optimal resampling and test set performance?

Single

set.seed(123)
single <- train(Yield ~ .,
                  data = train.data,
                  method = "rpart",
                  preProc = c("center", "scale"),
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
pred1 <- predict(single, newdata = test.data)

rpart.metrics <- postResample(pred = pred1, obs = test.data$Yield)

rpart.metrics
##      RMSE  Rsquared       MAE 
## 0.8037771 0.3632390 0.6024717

Bagged

set.seed(123)
bagged <- train(Yield ~ .,
                  data = train.data,
                  method = "treebag",
                  preProc = c("center", "scale"),
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))


pred2 <- predict(bagged, newdata = test.data)

bagged.metrics <- postResample(pred = pred2, obs = test.data$Yield)

bagged.metrics
##      RMSE  Rsquared       MAE 
## 0.7385620 0.3566364 0.5416379

Boosted

set.seed(123)
gb <- gbm(Yield~., data = train.data, distribution = "gaussian")


pred3 <- predict(gb, newdata = test.data)
## Using 100 trees...
boosted.metrics <- postResample(pred = pred3, obs = test.data$Yield)

boosted.metrics
##      RMSE  Rsquared       MAE 
## 0.7108198 0.3972903 0.5259861

Random Forest

set.seed(123)
rf <- train(Yield ~ .,
                  data = train.data,
                  method = "rf",
                  preProc = c("center", "scale"),
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))


pred4 <- predict(rf, newdata = test.data)

rf.metrics <- postResample(pred = pred4, obs = test.data$Yield)

rf.metrics
##      RMSE  Rsquared       MAE 
## 0.6381682 0.4644105 0.4773300

Cubist

cubist.model <- train(Yield ~ .,
                  data = train.data,
                  method = "cubist",
                  preProc = c("center", "scale"),
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))


pred5 <- predict(cubist.model, newdata = test.data)

cubist.metrics <- postResample(pred = pred5, obs = test.data$Yield)

cubist.metrics
##      RMSE  Rsquared       MAE 
## 0.5267067 0.6506320 0.3761447

Comparison

The Cubist model gives the best performance across all metrics on the test set, as shown in the table below.

kableExtra::kable(rbind(rpart.metrics,
                        bagged.metrics,
                        boosted.metrics,
                        rf.metrics,
                        cubist.metrics))
                     RMSE  Rsquared       MAE
rpart.metrics   0.8037771 0.3632390 0.6024717
bagged.metrics  0.7385620 0.3566364 0.5416379
boosted.metrics 0.7108198 0.3972903 0.5259861
rf.metrics      0.6381682 0.4644105 0.4773300
cubist.metrics  0.5267067 0.6506320 0.3761447
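
Since the question also asks about resampling performance, the cross-validated results for the caret-trained models can be compared with caret::resamples (a sketch: the gbm fit above was not run through train, so it is left out, and ideally all models would share identical resampling indices):

# Cross-validated RMSE / R-squared / MAE for the caret-trained models,
# to check whether the Cubist model also ranks best under resampling.
cv.results <- resamples(list(rpart   = single,
                             treebag = bagged,
                             rf      = rf,
                             cubist  = cubist.model))
summary(cv.results)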

Part B

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

The manufacturing process variables dominate the list: in the Cubist model's output below, 7 of the top 10 predictors are process variables. This differs from the previous linear and nonlinear models, whose top 10 were split roughly 50/50 between biological material and manufacturing process predictors.

varImp(cubist.model) 
## cubist variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess17 100.000
## ManufacturingProcess32  86.598
## ManufacturingProcess39  57.732
## ManufacturingProcess09  52.577
## ManufacturingProcess29  34.021
## BiologicalMaterial02    29.897
## ManufacturingProcess13  27.835
## ManufacturingProcess31  21.649
## BiologicalMaterial03    20.619
## BiologicalMaterial06    19.588
## BiologicalMaterial12    14.433
## ManufacturingProcess19  13.402
## ManufacturingProcess01  13.402
## BiologicalMaterial09    12.371
## ManufacturingProcess15  11.340
## ManufacturingProcess33  11.340
## ManufacturingProcess04  10.309
## BiologicalMaterial01     9.278
## BiologicalMaterial04     8.247
## ManufacturingProcess06   7.216
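
A quick tally of the top 10 (a small sketch working from the varImp object above):

# Count process vs. biological predictors among the Cubist model's top 10.
cubist.imp <- varImp(cubist.model)$importance
top10 <- rownames(cubist.imp)[order(cubist.imp$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl("^Manufacturing", top10), "Process", "Biological"))

In the output above, the count comes to 7 process and 3 biological predictors.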

Part C

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

The single tree plot provides useful information about the data. The most decisive split occurs at the root node on ManufacturingProcess32. Lower yields are associated with the branches involving BiologicalMaterial12, ManufacturingProcess18, BiologicalMaterial04, and BiologicalMaterial11, while higher yields are associated with ManufacturingProcess31, BiologicalMaterial05, and ManufacturingProcess17. A sketch of the yield distribution within the terminal nodes follows the tree plot.

rpart.model <-  rpart(Yield~., data = train.data)

rpart.plot(rpart.model)
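
To view the distribution of yield within the terminal nodes, one option is a boxplot of the (centered and scaled) yield grouped by the terminal node each training row falls in (a sketch using rpart.model$where):

# rpart.model$where gives the terminal node (frame row) for each training row.
terminal.node <- factor(rpart.model$where)
boxplot(train.data$Yield ~ terminal.node,
        xlab = "Terminal node of the single tree",
        ylab = "Yield (centered and scaled)",
        main = "Distribution of yield by terminal node")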