Homework 9: Trees and Rules

Instructions

Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.

Packages

library(AppliedPredictiveModeling)
library(dplyr)
library(forecast)
library(ggplot2)
library(tidyr)
library(mice)
library(corrplot)
library(MASS)
library(earth)
library(RANN)

8.1

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

Variance Importance:

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, 
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
order(-rfImp1)
##  [1]  1  4  2  5  3  6 10  7  9  8
rfImp1
##         Overall
## V1   8.84289608
## V2   6.74508245
## V3   0.67830653
## V4   7.75934674
## V5   2.23628276
## V6   0.11429887
## V7   0.03724747
## V8  -0.05349642
## V9  -0.04495617
## V10  0.03863205

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

Based on the rfImp1, uninformative predictors V6 – V10 were not significantly used in the random forest model. Predictors (in order) 1, 4, 2, 5, 3 were significantly used in the random forest model.

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

#copy of original data
simulated2 <- simulated
simulated2$duplicate1 <- simulated2$V1 + rnorm(200) * .1
cor(simulated2$duplicate1, simulated2$V1)
## [1] 0.9396216

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

Variance Importance:

model2 <- randomForest(y ~ ., data = simulated2, importance = TRUE, 
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
order(-rfImp2)
##  [1]  4  2  1 11  5  3  6  7 10  9  8
rfImp2
##                 Overall
## V1          6.008319352
## V2          6.308908170
## V3          0.571604465
## V4          7.187015958
## V5          2.131040245
## V6          0.211304611
## V7          0.025100355
## V8         -0.116980037
## V9         -0.003679481
## V10         0.024878337
## duplicate1  3.618101735

Adding a highly correlated shifted the importance score for all predictors. Predictor V1 was demoted from being the most important to third most important along side it’s highly correlated duplicate at fourth place.

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

Variance Importance Orders:

library(party)
model3 <- cforest(y ~., data = simulated)
order(-varimp(model3, conditional = FALSE))
##  [1]  1  4  2  5  7  8  3  9  6 10
order(-varimp(model3, conditional = TRUE))
##  [1]  4  1  2  5  6  3  7  9  8 10

No, the varimp importance for the cforest model of the simulated data are not in the same order as the importance of the randomForest model.

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Cubist

Variable Importance:

library(Cubist)
simulated_x <- subset(simulated, select = -c(y))
cubist_model <- cubist(x = simulated_x, y = simulated$y, committees = 100)
order(-varImp(cubist_model))
##  [1]  1  3  4  2  5  6  7  8  9 10

Boosted Trees

Variable Importance:

library(gbm)
set.seed(624)
gbm_model <- gbm(y ~., data = simulated, distribution = "gaussian")
summary.gbm(gbm_model)

##     var    rel.inf
## V4   V4 30.5402762
## V1   V1 26.9937619
## V2   V2 22.6797905
## V5   V5 11.6078356
## V3   V3  7.4358856
## V9   V9  0.4541585
## V6   V6  0.2882918
## V7   V7  0.0000000
## V8   V8  0.0000000
## V10 V10  0.0000000

It looks like the Cubist model has V1 as the most important variable, and the Bagged Trees has V4 has the most important predictor. The uninformative predictors V6 – V10 were not significantly used in both models.

8.2

Use a simulation to show tree bias with different granularities.

Variance of each predictor:

set.seed(624)
#samples for predictors
low <- sample(0:50, 500, replace = T)
medium <- sample(0:500, 500, replace = T)
high <- sample(0:5000, 500, replace = T)
#response
y <- low + medium + high + rnorm(250)
#check variance of predictors
var(low)
## [1] 221.9536
var(medium)
## [1] 21752.82
var(high)
## [1] 2016913

Variable importance:

sim_data <- data.frame(low, medium, high, y)
diff_gran_model <- randomForest(y ~., data = sim_data, importance = TRUE, ntree = 1000)
varImp(diff_gran_model, scale=FALSE)
##           Overall
## low     -19996.77
## medium   15806.52
## high   3605673.96

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

#load data
data(ChemicalManufacturingProcess)
set.seed(624)

imputed_data <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute","nzv", "corr"))
full_data <- predict(imputed_data, ChemicalManufacturingProcess)

index_chem <- createDataPartition(full_data$Yield , p=.8, list=F)
train_chem <-  full_data[index_chem,] 
test_chem <- full_data[-index_chem,]
train_predictors <- train_chem[-c(1)]
test_predictors <-  test_chem[-c(1)]

(a) Which tree-based regression model gives the optimal re sampling and test set performance?

Random Forest

set.seed(624)
#fit the model
rf_model <- randomForest(train_predictors, train_chem$Yield, importance = TRUE, ntrees = 1000)
rf_model
## 
## Call:
##  randomForest(x = train_predictors, y = train_chem$Yield, importance = TRUE,      ntrees = 1000) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 15
## 
##           Mean of squared residuals: 0.3711346
##                     % Var explained: 59.85
rfPred <- predict(rf_model, newdata = test_predictors)
postResample(pred = rfPred, obs = test_chem$Yield)
##      RMSE  Rsquared       MAE 
## 0.6715320 0.7323224 0.4828902

Boosted Trees

set.seed(624)
gbm_model <- gbm.fit(train_predictors, train_chem$Yield, distribution = "gaussian")
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.9239             nan     0.0010    0.0004
##      2        0.9231             nan     0.0010    0.0006
##      3        0.9224             nan     0.0010    0.0007
##      4        0.9217             nan     0.0010    0.0006
##      5        0.9211             nan     0.0010    0.0006
##      6        0.9203             nan     0.0010    0.0006
##      7        0.9196             nan     0.0010    0.0007
##      8        0.9190             nan     0.0010    0.0006
##      9        0.9183             nan     0.0010    0.0007
##     10        0.9175             nan     0.0010    0.0005
##     20        0.9111             nan     0.0010    0.0007
##     40        0.8980             nan     0.0010    0.0006
##     60        0.8865             nan     0.0010    0.0004
##     80        0.8743             nan     0.0010    0.0005
##    100        0.8628             nan     0.0010    0.0002
gbm_model
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## There were 46 predictors of which 7 had non-zero influence.
gbmPred <- predict(gbm_model, newdata = test_predictors)
postResample(pred = gbmPred, obs = test_chem$Yield)
##      RMSE  Rsquared       MAE 
## 1.1097582 0.5627084 0.8634997

Cubist

set.seed(624)
cube_model <- cubist(train_predictors, train_chem$Yield)
cube_model
## 
## Call:
## cubist.default(x = train_predictors, y = train_chem$Yield)
## 
## Number of samples: 144 
## Number of predictors: 46 
## 
## Number of committees: 1 
## Number of rules: 2
cubePred <- predict(cube_model, newdata = test_predictors)
postResample(pred = cubePred, obs = test_chem$Yield)
##      RMSE  Rsquared       MAE 
## 0.6724398 0.6903309 0.5159054

The Random Forest model provides optimal metrics, with the lowest RMSE of 0.67, and an \(R^2\) value of 0.73.

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Top 10:
ManufacturingProcess32, BiologicalMaterial06, BiologicalMaterial03, ManufacturingProcess13, ManufacturingProcess36, BiologicalMaterial11, ManufacturingProcess09, BiologicalMaterial08, ManufacturingProcess28, ManufacturingProcess11

head(varImp(rf_model),10)
##                          Overall
## BiologicalMaterial01    2.166860
## BiologicalMaterial03   11.767253
## BiologicalMaterial05    3.656294
## BiologicalMaterial06   12.943335
## BiologicalMaterial08    8.546350
## BiologicalMaterial09    7.019902
## BiologicalMaterial10    2.524260
## BiologicalMaterial11    9.396847
## ManufacturingProcess01  6.146426
## ManufacturingProcess02  4.739992

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(rpart)
library(rpart.plot)

rpart_tree <- rpart(Yield ~., data = train_chem)

rpart.plot(rpart_tree)

Yes, this view of the data provides additional knowledge.

Using the plot we can see the root node is the ManufacturingProcess32 predictor and is then split between two Biological Material predictors. Seven of the ten terminal nodes are predicted by Manufacturing Process variables and remaining three by Biological Material.