Unit 8 Exercises

8.1. Recreate the simulated data from Exercise 7.2:

library(mlbench)        # mlbench.friedman1
library(randomForest)   # randomForest, importance
library(caret)          # varImp, train, preProcess
library(knitr)          # kable
library(xtable)         # xtable
set.seed(100)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

Note: random forest variable importance will vary according to random seeding.

set.seed(100)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)    
rfImp1
##         Overall
## V1   5.73660298
## V2   3.92494012
## V3   1.45523257
## V4  10.79182838
## V5   1.44422311
## V6  -0.08051592
## V7  -0.06736586
## V8  -0.13788112
## V9  -0.03100011
## V10 -0.06996542

Did the random forest model significantly use the uninformative predictors (V6 - V10)?

No. The importance scores for the uninformative predictors V6-V10 are all near zero (most are slightly negative), supporting the hypothesis that they are no more than noise.
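
As a rough check, the permutation importance can be rescaled by its estimated standard error, so values near or below zero behave like noise:

# Permutation importance (%IncMSE) scaled by its standard error; near-zero or
# negative values for V6-V10 are consistent with pure noise.
round(randomForest::importance(model1, type = 1, scale = TRUE), 2)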

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9441238

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

set.seed(100)
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp2 <- varImp(model2, scale = FALSE)
vnames <- c('V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'duplicate1')

names(rfImp1) <- "Original"
rfImp1$Variable <- factor(rownames(rfImp1), levels = vnames)

names(rfImp2) <- "Extra"
rfImp2$Variable <- factor(rownames(rfImp2), levels = vnames)

rfImps <- merge(rfImp1, rfImp2, all = TRUE)
rownames(rfImps) <- rfImps$Variable
rfImps$Variable <- NULL

kable(xtable(round(rfImps,2)))
Variable      Original   Extra
V1                5.74    4.12
V2                3.92    3.98
V3                1.46    1.30
V4               10.79   10.31
V5                1.44    1.25
V6               -0.08   -0.06
V7               -0.07    0.03
V8               -0.14   -0.13
V9               -0.03   -0.08
V10              -0.07   -0.07
duplicate1          NA    3.02

With a slightly perturbed copy of V1 added, duplicate1 shows strong importance (fourth overall) while V1's importance drops substantially: the two highly correlated predictors split the importance between them.
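
As a rough check on this splitting effect, the merged table above lets us compare the importance carried jointly by V1 and its near-copy with V1's original score:

# Importance now carried jointly by V1 and duplicate1, versus V1 alone before
sum(rfImps[c("V1", "duplicate1"), "Extra"])
rfImps["V1", "Original"]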

set.seed(100)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)  
rfImp3
##                 Overall
## V1          3.069710755
## V2          4.071916984
## V3          1.234706296
## V4         10.577595820
## V5          1.326638831
## V6         -0.104978608
## V7         -0.001583938
## V8         -0.083543494
## V9         -0.002506525
## V10        -0.087554083
## duplicate1  2.168019224
## duplicate2  2.140195676

A second correlated copy further diminishes V1's importance; the score once attributed to V1 alone is now spread across V1, duplicate1, and duplicate2, with the two copies receiving nearly identical scores and V1 remaining slightly ahead.

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

cforest() is an implementation of the random forest and bagging ensemble algorithms that uses conditional inference trees as base learners. Besides the standard permutation importance, a conditional version is available that adjusts for correlations between predictor variables. If conditional = TRUE, the importance of each variable is computed by permuting within a grid defined by the covariates that are associated (with 1 - p-value greater than the threshold) with the variable of interest. The resulting variable importance score is conditional in the sense of beta coefficients in regression models, but represents the effect of a variable in both main effects and interactions. See Strobl et al. (2008) for details.

set.seed(100)
cforest1 <- party::cforest(y ~ ., data = simulated[, c(1:11)],
                    controls = party::cforest_control(ntree = 1000))
set.seed(100)
cforest2 <- party::cforest(y ~ ., data = simulated[, c(1:12)],
                    controls = party::cforest_control(ntree = 1000))
cfImps1 <- party::varimp(cforest1) #conditional = FALSE is default
cfImps2 <- party::varimp(cforest2)
cfImps3 <- party::varimp(cforest1, conditional = TRUE)
cfImps4 <- party::varimp(cforest2, conditional = TRUE)

Under the conditional importance measure, V5's importance diminishes greatly, with V4, V1, and V2 showing the greatest contribution.

cfImps1 <- data.frame(Original = cfImps1,
                      Variable = factor(names(cfImps1), levels = vnames))
cfImps2 <- data.frame(Extra = cfImps2,
                      Variable = factor(names(cfImps2), levels = vnames))
cfImps3 <- data.frame(CondInf = cfImps3,
                      Variable = factor(names(cfImps3), levels = vnames))
cfImps4 <- data.frame("CondInf Extra" = cfImps4,
                      Variable = factor(names(cfImps4), levels = vnames))

cfImps <- merge(cfImps1, cfImps2, all = TRUE)
cfImps <- merge(cfImps, cfImps3, all = TRUE)
cfImps <- merge(cfImps, cfImps4, all = TRUE)
rownames(cfImps) <- cfImps$Variable
cfImps$Variable <- factor(cfImps$Variable, levels = vnames)
cfImps <- cfImps[order(cfImps$Variable),]
cfImps$Variable <- NULL

kable(xtable(round(cfImps,2)))
Variable      Original   Extra   CondInf   CondInf Extra
V1                6.30    4.27      1.86            0.75
V2                3.75    3.71      1.30            1.34
V3                0.46    0.39      0.11            0.09
V4               12.43   12.25      3.84            3.91
V5                1.49    1.19      0.48            0.51
V6               -0.09   -0.05      0.00            0.00
V7               -0.02    0.01     -0.01            0.00
V8               -0.05   -0.07     -0.02            0.00
V9               -0.03    0.01     -0.01            0.01
V10              -0.05   -0.07     -0.05           -0.04
duplicate1          NA    2.71        NA            0.69
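
The threshold argument of party::varimp controls which covariates enter the conditioning grid (those whose association with the variable of interest has 1 - p-value above the threshold); raising it restricts the conditioning to only the most strongly associated predictors, such as duplicate1 for V1. For example:

# Conditional importance with a stricter association threshold (default is 0.2)
set.seed(100)
cfImps5 <- party::varimp(cforest2, conditional = TRUE, threshold = 0.9)
round(sort(cfImps5, decreasing = TRUE), 2)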

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

The bagged tree models fail to capture that V6-V10 are noise: the noise variables receive non-trivial weight, and the scores are far more even across the ten predictors (V3, for example, scores nearly as high as V1). When the duplicate predictor is introduced, it ranks high in importance, in effect drawing 'contribution' away from each of the other variables.

library(ipred)   # bagging; note the argument is nbagg, not nbag
set.seed(100)
bagFit1 <- bagging(y ~ ., data = simulated[, c(1:11)], nbagg = 50)
set.seed(100)
bagFit2 <- bagging(y ~ ., data = simulated[, c(1:12)], nbagg = 50)

bagImp1 <- varImp(bagFit1)
names(bagImp1) <- "Original"
bagImp1$Variable <- factor(rownames(bagImp1), levels = vnames)

bagImp2 <- varImp(bagFit2)
names(bagImp2) <- "Extra"
bagImp2$Variable <- factor(rownames(bagImp2), levels = vnames)

bagImps <- merge(bagImp1, bagImp2, all = TRUE)
rownames(bagImps) <- bagImps$Variable
bagImps$Variable <- NULL

kable(xtable(round(bagImps,2)))
Variable      Original   Extra
V1                2.34    2.19
V2                2.14    2.06
V3                1.91    1.83
V4                2.06    1.93
V5                2.08    1.85
V6                0.73    0.61
V7                0.85    0.80
V8                0.64    0.57
V9                0.80    0.67
V10               0.89    0.76
duplicate1          NA    2.07

The Cubist model (with 100 committees) fares well, showing a strong separation between the importance of V1-V5 and that of the V6-V10 noise variables. The duplicate variable falls midway between those two groups, with minimal impact on the importance of the others.

library(Cubist)   # cubist
set.seed(100)
cbFit1 <- cubist(x = simulated[, c(1:10)], 
                 y = simulated$y, 
                 committees = 100)
cbImp1 <- varImp(cbFit1)
names(cbImp1) <- "Original"
cbImp1$Variable <- factor(rownames(cbImp1), levels = vnames)

set.seed(100)
cbFit2 <- cubist(x = simulated[, c(1:10, 12)], 
                 y = simulated$y, 
                 committees = 100)
cbImp2 <- varImp(cbFit2)
names(cbImp2) <- "Extra"
cbImp2$Variable <- factor(rownames(cbImp2), levels = vnames)

cbImp <- merge(cbImp1, cbImp2, all = TRUE)
rownames(cbImp) <- cbImp$Variable
cbImp$Variable <- NULL

kable(xtable(round(cbImp,2),
             caption = "Variable importance scores for part (d) simulation using Cubist.",
             label = "T:varImpSimulation5"))
Variable      Original   Extra
V1                65.5    64.0
V2                56.5    56.5
V3                38.5    42.0
V4                53.5    54.0
V5                28.0    27.5
V6                 0.0     0.0
V7                 0.5     1.0
V8                 1.5     1.0
V9                 5.0     6.0
V10                1.0     2.5
duplicate1          NA    12.5

8.2. Use a simulation to show tree bias with different granularities.

One issue with CART-style regression trees (and ensembles built from them) is a bias in variable selection, specifically a preference for continuous over categorical predictors. Predictors that are more granular, i.e., that offer more potential split points, have a greater chance of being used near the top of a tree to partition the data, even if the predictor has little to no relationship with the response.

We test this hypothesis with an experiment involving three variables: a binary category X1 (randomly assigned 'Female'/'Male'), a response Y that depends directly on that category plus normal noise, and a 'granular' numeric predictor X2 drawn from a normal distribution (mean 0, sd 4) that is unrelated to the response.

We repeatedly fit a depth-one rpart() tree and record which predictor is used for the first split. The outcome depends heavily on the noise standard deviation used to generate the response: with a high value (sd = 4 here), the tree selects the granular noise variable X2 more often than the truly informative binary X1, demonstrating the bias.

It is worth noting that conditional inference trees (Hothorn et al. 2006) are one technique for overcoming this bias. These trees are fit by performing a hypothesis test at each candidate split (across an exhaustive list of candidates) and selecting splits based on the resulting multiplicity-adjusted p-values, so a predictor's number of potential split points does not inflate its chance of selection; a short sketch after the frequency table below illustrates this.

library(rpart)   # rpart, rpart.control
predictors.all <- data.frame(Predictor = character())

for (i in 1:200 ) {
  # Use a different seed for each simulated data set
  set.seed(i)
  gender.no <- sample(2, 200, replace=TRUE)
  X1 <- factor(ifelse(gender.no == 1, 'Female', 'Male'))  # informative binary predictor
  Y <- round(gender.no + rnorm(200, mean=0, sd=4))        # response = category + noise (sd = 4)
  
  set.seed(500+i)
  X2 <- rnorm(200, mean=0, sd=4)                          # granular numeric predictor, unrelated to Y
  
  simData <- data.frame(Y=Y,X1=X1,X2=X2)
  
  # Restrict depth to 1 so that only the first split variable is recorded
  rpartFit <- rpart(Y~X1+X2, data=simData, control=rpart.control(maxdepth=1))
  predictor <- data.frame(Predictor=rownames(rpartFit$splits)[1])
  predictors.all <- rbind(predictors.all, predictor)
}
treeBiasTable <- table(predictors.all$Predictor)
kable(
        xtable(treeBiasTable,
        caption = "Frequency of predictor selection for tree bias simulation.",
        label = "T:treeBiasTable"),
        include.colnames=FALSE
)
Predictor   Frequency
X2                112
X1                 83
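
As noted above, a conditional inference tree selects its split variable with a multiplicity-adjusted hypothesis test rather than an exhaustive impurity search, so granularity alone does not favor a predictor. A minimal sketch on one simulated data set of the same form (printing the fitted tree shows which variable, if any, is split on, along with its test statistic):

# One simulated data set in the same form as the loop above
set.seed(1)
gender.no <- sample(2, 200, replace = TRUE)
ctreeData <- data.frame(Y  = round(gender.no + rnorm(200, mean = 0, sd = 4)),
                        X1 = factor(ifelse(gender.no == 1, "Female", "Male")),
                        X2 = rnorm(200, mean = 0, sd = 4))
# The printed tree reports the selected split variable (if any)
party::ctree(Y ~ X1 + X2, data = ctreeData)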

8.3. Stochastic Gradient Boosting

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The model on the right has a higher bagging fraction and a higher learning rate.

The bagging fraction is the fraction of the training data sampled to build each tree. As this fraction approaches 1.0, the samples used for successive trees become more and more alike, so the same dominant predictors are selected over and over. The larger the bagging fraction, and hence the weaker the stochastic element of the method, the fewer predictors are identified as important.

As the learning rate increases toward 1, the model becomes greedier, adding most of each stage's optimal weak learner to the ensemble. As greediness increases, the model is more likely to concentrate on a small set of predictors related to the response: since the fraction of the current stage's predictions added to the previous iteration's predictions is high, there is no effective 'reset' that would bring a larger mix of potentially important predictors into play.
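
To see this behavior directly, one can fit the gbm package to the solubility data with the two extreme parameter pairs (the tree depth and number of trees are arbitrary choices here, not necessarily the book's settings) and compare the relative influence profiles; we would expect the 0.9/0.9 fit to concentrate its influence on fewer predictors.

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)   # provides solTrainXtrans and solTrainY
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
gbmLow  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, interaction.depth = 7,
               shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbmHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, interaction.depth = 7,
               shrinkage = 0.9, bag.fraction = 0.9)

# Relative influence of the ten highest-ranked predictors under each setting
head(summary(gbmLow,  plotit = FALSE), 10)
head(summary(gbmHigh, plotit = FALSE), 10)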

(b) Which model do you think would be more predictive of other samples?

The model on the left, with a low bagging fraction and a low learning rate, is likely to perform better on new samples than the model on the right. Stochastic gradient boosting gains its predictive power through the aggregation of many weak learners, and the small learning rate and bagging fraction act as a form of regularization that guards against overfitting.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Interaction depth also affects the variable importance profile. As tree depth increases, each tree can involve more predictors, so importance is likely to be spread over more of them; in Fig. 8.24 this would flatten the slope of the importance profile and lengthen the horizontal bars for the lower-ranked predictors.
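
Building on the sketch above (same assumed data and arbitrary settings), one rough check is to refit the low-shrinkage model with stumps and count how many predictors receive any relative influence at all:

set.seed(100)
gbmStump <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, interaction.depth = 1,
                shrinkage = 0.1, bag.fraction = 0.1)

# Number of predictors with nonzero relative influence: depth-1 vs depth-7 trees
sum(summary(gbmStump, plotit = FALSE)$rel.inf > 0)
sum(summary(gbmLow,   plotit = FALSE)$rel.inf > 0)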

8.7. Chemical Manufacturing Process

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

Data Loading

library(AppliedPredictiveModeling)   # ChemicalManufacturingProcess, bookTheme
data(ChemicalManufacturingProcess)

predictors <- subset(ChemicalManufacturingProcess,select= -Yield)
yield <- subset(ChemicalManufacturingProcess,select="Yield")

Data Splitting

#70/30 Training/Test Split
set.seed(100)
trainingRows <- createDataPartition(yield$Yield, 
                                    p = 0.7, 
                                    list = FALSE)

trainPredictors <- predictors[trainingRows,]
trainYield <- yield[trainingRows,]

testPredictors <- predictors[-trainingRows,]
testYield <- yield[-trainingRows,]

Data Pre-Processing

#Pre-process trainPredictors and apply to trainPredictors and testPredictors
pp <- caret::preProcess(trainPredictors,method=c("BoxCox","center","scale","medianImpute"))
ppTrainPredictors <- predict(pp,trainPredictors)
ppTestPredictors <- predict(pp,testPredictors)

#Identify and remove NZV
nzvpp <- nearZeroVar(ppTrainPredictors)
ppTrainPredictors <- ppTrainPredictors[-nzvpp]
ppTestPredictors <- ppTestPredictors[-nzvpp]

#Identify and remove highly correlated predictors
predcorr = cor(ppTrainPredictors)
highCorrpp <- findCorrelation(predcorr)
ppTrainPredictors <- ppTrainPredictors[, -highCorrpp]
ppTestPredictors <- ppTestPredictors[, -highCorrpp]

#Set-up trainControl
set.seed(517)
ctrl <- trainControl(method = "boot", number = 25)

(a) Which tree-based regression model gives the optimal resampling and test set performance?

We fit the data to four models:

  1. Recursive Partitioning to construct decision trees
set.seed(300)
rpartGrid <- expand.grid(maxdepth= seq(1,10,by=1))
rpartChemTune <- train(x = ppTrainPredictors, y = trainYield,
                       method = "rpart2",
                       metric = "Rsquared",
                       tuneGrid = rpartGrid,
                       trControl = ctrl)
  2. Random Forest
set.seed(300)

rfGrid <- expand.grid(mtry=seq(2,38,by=3))

rfChemTune <- train(x = ppTrainPredictors, y = trainYield,
                    method = "rf",
                    tuneGrid = rfGrid,
                    metric = "Rsquared",
                    importance = TRUE,
                    trControl = ctrl)
  3. Generalized Boosted Regression
set.seed(300)
gbmGrid <- expand.grid(interaction.depth=seq(1,6,by=1),
                       n.trees=c(25,50,100,200),
                       shrinkage=c(0.01,0.05,0.1,0.2),
                       n.minobsinnode=5)

gbmChemTune <- train(x = ppTrainPredictors, y = trainYield,
                     method = "gbm",
                     metric = "Rsquared",
                     verbose = FALSE,
                     tuneGrid = gbmGrid,
                     trControl = ctrl)
  4. Cubist Rule-based Model
set.seed(300)
cubistGrid <- expand.grid(committees = c(1, 5, 10, 20, 50, 100), 
                          neighbors = c(0, 1, 3, 5, 7))

cubistChemTune <- train(x = ppTrainPredictors, y = trainYield,
                        method = "cubist", 
                        verbose = FALSE,
                        metric = "Rsquared",
                        tuneGrid = cubistGrid,
                        trControl = ctrl)
round(rpartChemTune$results$Rsquared[best(rpartChemTune$results, "Rsquared", maximize = TRUE)],2)
## [1] 0.39
round(rfChemTune$results$Rsquared[best(rfChemTune$results, "Rsquared", maximize = TRUE)],2)
## [1] 0.56
round(gbmChemTune$results$Rsquared[best(gbmChemTune$results, "Rsquared", maximize = TRUE)],2)
## [1] 0.56
round(cubistChemTune$results$Rsquared[best(cubistChemTune$results, "Rsquared", maximize = TRUE)],2)
## [1] 0.6
cubistChemTune$results$neighbors[best(cubistChemTune$results, "Rsquared", maximize = TRUE)]
## [1] 1
cubistChemTune$results$committees[best(cubistChemTune$results, "Rsquared", maximize = TRUE)]
## [1] 100

We observe that the Cubist model (100 committees, 1 neighbor) gives the best resampled goodness of fit as measured by R-squared (0.60), with the random forest and GBM models next at 0.56.
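
The exercise also asks about test-set performance. Since the same seed was set before each call to train, the four models share bootstrap resamples and can be compared directly with resamples(); held-out performance comes from predicting the test set created earlier:

# Resampling distributions side by side
chemModels <- list(rpart = rpartChemTune, rf = rfChemTune,
                   gbm = gbmChemTune, cubist = cubistChemTune)
summary(resamples(chemModels))

# Held-out test set performance (RMSE, R-squared, MAE) for each model
testPerf <- sapply(chemModels,
                   function(fit) postResample(predict(fit, ppTestPredictors),
                                              testYield))
round(testPerf, 2)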

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

plot(cubistChemTune,metric="Rsquared")

cubistChemImp <- varImp(cubistChemTune, scale = FALSE)
bookTheme()
plot(cubistChemImp, top=15, scales = list(y = list(cex = 0.8)))

According to the Cubist model, the ten most important predictors, in order, are: ManufacturingProcess32, ManufacturingProcess17, ManufacturingProcess13, BiologicalMaterial06, ManufacturingProcess09, ManufacturingProcess33, BiologicalMaterial03, ManufacturingProcess28, BiologicalMaterial11, and ManufacturingProcess06. Manufacturing process variables dominate the list, accounting for seven of the top ten.

This list shows significant overlap with the top predictors from the MARS model in the previous chapter's exercise.

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Here we plot the decision tree from the recursive partitioning (rpart) model. As expected, it splits the data at the top on the most important variable, ManufacturingProcess32: higher values of process 32 are associated with larger yields, and lower values with smaller yields. However, the tree reveals a caveat: a low value of process 32 may be counteracted by a correspondingly low value of ManufacturingProcess13.

library(partykit)   # as.party converts the rpart fit for plotting
plot(as.party(rpartChemTune$finalModel), gp = gpar(fontsize = 11))