Exercise 8.1

Recreate the simulated data from Exercise 7.2

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores. Did the random forest model significantly use the uninformative predictors (V6-V10)?

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Based on the variable importance scores, the random forest did not make significant use of the uninformative predictors V6-V10: their importance values are essentially zero (some slightly negative), while the informative predictors V1-V5 receive clearly larger scores.

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

Adding a predictor that is highly correlated with an informative predictor such as V1 decreases V1's overall importance: the importance is effectively split between the two correlated predictors (V1 drops from about 8.7 to 5.7, while duplicate1 receives about 4.3). Adding yet another predictor correlated with V1 dilutes its importance further; a quick check with a hypothetical second duplicate is added after the importance table below.

model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581
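
To answer the second part of the question directly, a second predictor correlated with V1 can be added and the forest refit. This is a minimal sketch: the names duplicate2, sim2, and model2b are introduced here for illustration, a copy of the data is used so later parts of the exercise are unaffected, and the exact scores will depend on the random seed.

# Hypothetical second correlated predictor, added on a copy of the data so the
# objects used in parts (c) and (d) are left unchanged.
sim2 <- simulated
sim2$duplicate2 <- sim2$V1 + rnorm(200) * .1

model2b <- randomForest(y ~ ., data = sim2,
                        importance = TRUE,
                        ntree = 1000)
rfImp2b <- varImp(model2b, scale = FALSE)
rfImp2b
# Expectation: V1's importance drops further, with the total now shared among
# V1, duplicate1, and duplicate2.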

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

The conditional importance scores show a broadly similar pattern to the traditional random forest, but they are not identical. The values are generally lower, and the correlated pair V1/duplicate1 is penalized more heavily by the conditional measure. The treatment of the uninformative predictors is the same, however: V6-V10 receive importance near zero, so the cforest model does not use them significantly either. The traditional (unconditional) measure is computed for comparison after the code below.

library(party)

cforest.fit <- cforest(y ~ ., data=simulated)
varimp(cforest.fit, conditional = TRUE)
##            V1            V2            V3            V4            V5 
##  1.8986239733  4.8021626697  0.0229993405  6.0471706526  0.9850544288 
##            V6            V7            V8            V9           V10 
## -0.0119652434 -0.0104327775 -0.0104862941  0.0004516316 -0.0074652543 
##    duplicate1 
##  1.9703660210
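
For a direct comparison with the traditional measure, the same fitted cforest can be scored with conditional = FALSE (the default in party::varimp); the conditional version is slower but accounts for the correlation between V1 and duplicate1.

# Traditional (unconditional) permutation importance on the same cforest fit,
# for comparison with the conditional scores above.
varimp(cforest.fit, conditional = FALSE)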

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Yes, a similar pattern occurs: both the boosted tree and the Cubist model assign most of their importance to V1-V5 (and to the correlated duplicate1) and little or none to V6-V10 (a sorted view of the Cubist importances is added after its output below).

# boosted
library(gbm)
gbm.fit <- gbm(y ~ ., data=simulated, distribution='gaussian')
summary(gbm.fit)

##                   var     rel.inf
## V4                 V4 29.51540812
## V2                 V2 22.84996360
## V1                 V1 16.89960674
## duplicate1 duplicate1 11.68667800
## V5                 V5 11.08585033
## V3                 V3  7.86610014
## V6                 V6  0.09639307
## V7                 V7  0.00000000
## V8                 V8  0.00000000
## V9                 V9  0.00000000
## V10               V10  0.00000000
# cubist (simulated[, -11] drops the response column y)
library(Cubist)
cubist.fit <- cubist(x=simulated[,-11], y=simulated$y, committees = 100)
varImp(cubist.fit)
##            Overall
## V3            43.5
## V1            52.5
## V2            59.5
## duplicate1    27.5
## V4            46.0
## V8             4.0
## V5            27.0
## V6            10.0
## V10            1.0
## V7             0.0
## V9             0.0
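
Since varImp() on a Cubist model returns the predictors in no particular order, sorting makes the pattern easier to see; the object name cubist.imp is just a convenience introduced here.

cubist.imp <- varImp(cubist.fit)
cubist.imp[order(-cubist.imp$Overall), , drop = FALSE]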

Exercise 8.2

Use a simulation to show tree bias with different granularities.

The simulation below confirms the selection bias: according to the variable importance scores, the tree gives the uninformative but granular predictors, the continuous X4 and the noisy X5, importance comparable to or higher than the truly informative X1 and X2, even though only X1 and X2 are related to the response. (A short repeated-simulation check of this bias is added after the tree plot below.)

library(ggcorrplot)
library(ggplot2)
library(rpart)
library(rpart.plot)
set.seed(10)

# informative variable, low variance, binary
X1 <- rep(2:3, each = 2, times = 100)

# informative variable, higher variance
X2 <- rep(11:15, each = 2, times = 40)

# variable with missing values
X3 <- rep(6:9, each = 2, times = 50); i <- X3%%2==0; X3[i]<-NA

# continuous, less granular
X4 <- sample((200:700)/11, 400, replace = F)

# random noise with many distinct values (length 100, recycled to 400 rows by cbind below)
X5 <- rnorm(100, mean = 0, sd = 3)

# outcome driven only by X1 and X2 (the length-10 noise vector is recycled across all 400 rows)
Y <- X1 + X2 + rnorm(10, mean = 10, sd = 4)

df <- as.data.frame(cbind(X1, X2, X3, X4, X5, Y))

# use pairwise-complete observations so the NAs in X3 do not blank out its correlations
corr <- round(cor(df, use = "pairwise.complete.obs"), 2)
ggcorrplot(corr, lab = TRUE)

tree.fit <- rpart(Y ~ ., data = df)

varImp(tree.fit)
##      Overall
## X1 0.6624162
## X2 1.5759974
## X4 0.7934827
## X5 1.5049578
## X3 0.0000000
rpart.plot(tree.fit)
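
As an extra check on the bias claim (not part of the original output), one can repeatedly simulate a response that is unrelated to either of two noise predictors, one binary and one continuous, and count which variable rpart chooses at the root; an unbiased procedure would pick each about half the time. The helper sim_once() and its column names are introduced here for illustration, using the rpart package loaded above.

# Count the root-split variable over repeated pure-noise simulations.
sim_once <- function(n = 200) {
  d <- data.frame(
    x_binary   = sample(0:1, n, replace = TRUE),  # 2 distinct values
    x_granular = rnorm(n),                        # ~n distinct values
    y          = rnorm(n)                         # unrelated to both predictors
  )
  fit <- rpart(y ~ ., data = d,
               control = rpart.control(maxdepth = 1, cp = -1, minsplit = 2))
  as.character(fit$frame$var[1])                  # variable used at the root split
}
set.seed(10)
table(replicate(200, sim_once()))
# Expectation: the granular predictor is chosen far more often than the binary one.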

Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

The right-hand model uses a large learning rate and a large bagging fraction. With a learning rate of 0.9, each tree's contribution is barely shrunk, so the ensemble commits quickly to the strongest predictors; with a bagging fraction of 0.9, every tree sees nearly the same data, so the trees keep selecting those same predictors. Importance therefore concentrates on the first few predictors. The left-hand model, with both parameters at 0.1, learns slowly from small, varied subsamples, so many different predictors are used across the ensemble and importance is spread more evenly. (A small illustration of this effect on the simulated data is sketched below.)
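
A hedged sketch of this effect, using the simulated Friedman data from Exercise 8.1 (not the solubility data behind Fig. 8.24) and the gbm package loaded earlier: fit gbm with both parameters at 0.1 and at 0.9 and compare how concentrated the relative influence is. The lowered n.minobsinnode is an assumption made here only because bag.fraction = 0.1 leaves about 20 rows per tree, and the exact numbers will vary with the seed.

set.seed(200)
gbm.low  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 1000, interaction.depth = 1,
                shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5)
gbm.high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 1000, interaction.depth = 1,
                shrinkage = 0.9, bag.fraction = 0.9)
summary(gbm.low,  plotit = FALSE)  # influence expected to be spread over more predictors
summary(gbm.high, plotit = FALSE)  # influence expected to concentrate on a few predictors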

(b) Which model do you think would be more predictive of other samples?

The model on the left, with the smaller bagging fraction and learning rate, would likely be more predictive of new samples: it averages over many diverse, weakly contributing trees and draws on a broader set of predictors, which reduces variance and overfitting and should generalize better to unseen data.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth lets each tree use more predictors, so importance is spread over a larger set of variables in both models. The importance profile therefore flattens, i.e., the slope of predictor importance decreases for both the left-hand and right-hand models.

Exercise 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)   # ChemicalManufacturingProcess data
library(DMwR)                        # knnImputation()
library(dplyr)                       # %>% and select()
library(ipred)
library(randomForest)

set.seed(111)
# Data load, imputation, and split

data(ChemicalManufacturingProcess)

df <- ChemicalManufacturingProcess

zero.var <- nearZeroVar(df)
df <- df[,-zero.var]

df.chem <- knnImputation(df)

X <- as.data.frame(df.chem %>% select(-Yield))
Y <- df.chem %>% select(Yield)

train_index <- sample(1:nrow(df.chem), nrow(df.chem)*.75)

x.train <- X[train_index,]
y.train <- df.chem$Yield[train_index]

x.test <- X[-train_index,]
y.test <- df.chem$Yield[-train_index]

# Models

# single tree
tree.mod <- rpart(y.train ~ ., data = x.train)

# randomforest tree
rf.mod <- randomForest(y.train ~ ., data = x.train)

# bagging
bag.mod <- bagging(y.train ~ ., data = x.train)

# boosted
gbm.mod <- gbm(y.train ~ ., data=x.train, distribution='gaussian')

# cubist
cubist.mod <- cubist(x=x.train, y=y.train, committees = 100)


# Predictions and Performance

tree.mod.pred <- predict(tree.mod, newdata = x.test)
rf.mod.pred <- predict(rf.mod, newdata = x.test)
bag.mod.pred <- predict(bag.mod, newdata = x.test)
gbm.mod.pred <- predict(gbm.mod, newdata = x.test, n.trees=100)
cubist.mod.pred <- predict(cubist.mod, newdata = x.test)

single.RMSE = RMSE(tree.mod.pred, y.test)
rf.RMSE = RMSE(rf.mod.pred, y.test)
bag.RMSE = RMSE(bag.mod.pred, y.test)
gbm.RMSE = RMSE(gbm.mod.pred, y.test)
cubist.RMSE = RMSE(cubist.mod.pred, y.test)

rbind(single.RMSE, rf.RMSE, bag.RMSE, gbm.RMSE, cubist.RMSE)
##                  [,1]
## single.RMSE 1.3912913
## rf.RMSE     1.0314752
## bag.RMSE    1.0865266
## gbm.RMSE    1.0474965
## cubist.RMSE 0.9235986

(a) Which tree-based regression model gives the optimal resampling and test set performance? Among the tree- and rule-based models, Cubist gives the best performance on the chemical manufacturing test set (RMSE of about 0.92, versus roughly 1.03-1.39 for the other models). The comparison above uses test-set RMSE only; a resampling sketch is added below.
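
A resampling estimate for the Cubist model could be obtained with caret, for example with 10-fold cross-validation on the training split. This is a sketch under assumptions made here (the fold count, the seed, and caret's default Cubist tuning grid); the resulting RMSE will differ somewhat from the test-set value.

set.seed(111)
cubist.cv <- train(x = x.train, y = y.train,
                   method = "cubist",
                   trControl = trainControl(method = "cv", number = 10))
getTrainPerf(cubist.cv)  # cross-validated RMSE / R-squared at the selected tuning values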

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

As with the optimal linear and nonlinear models, the manufacturing process predictors dominate the importance list for the tree/rule-based model. ManufacturingProcess32, ManufacturingProcess13, and ManufacturingProcess17 are consistently among the top 10 across the model types (a sorted top-10 view is added after the full table below).

varImp(cubist.mod)
##                        Overall
## ManufacturingProcess32    41.0
## ManufacturingProcess13    28.5
## ManufacturingProcess31     9.0
## ManufacturingProcess17    16.5
## BiologicalMaterial12      13.5
## ManufacturingProcess39     9.5
## BiologicalMaterial06      18.0
## BiologicalMaterial02      18.5
## ManufacturingProcess21     3.5
## BiologicalMaterial09       6.0
## ManufacturingProcess09    15.0
## ManufacturingProcess28     9.5
## BiologicalMaterial03      13.0
## ManufacturingProcess34     4.5
## BiologicalMaterial10       4.5
## BiologicalMaterial04       3.0
## BiologicalMaterial05       6.0
## ManufacturingProcess20     5.0
## BiologicalMaterial01       3.5
## ManufacturingProcess05     3.0
## ManufacturingProcess04    13.0
## ManufacturingProcess18     2.5
## ManufacturingProcess19     2.5
## ManufacturingProcess01     2.0
## ManufacturingProcess29    16.5
## ManufacturingProcess33    11.0
## ManufacturingProcess11     5.5
## ManufacturingProcess25     5.5
## ManufacturingProcess15     5.0
## ManufacturingProcess27     5.0
## ManufacturingProcess30     4.5
## ManufacturingProcess26     4.5
## BiologicalMaterial08       4.5
## ManufacturingProcess16     4.0
## ManufacturingProcess10     4.0
## ManufacturingProcess42     3.5
## ManufacturingProcess24     3.5
## ManufacturingProcess07     2.5
## ManufacturingProcess45     2.5
## BiologicalMaterial11       2.5
## ManufacturingProcess02     2.5
## ManufacturingProcess35     2.5
## ManufacturingProcess14     2.5
## ManufacturingProcess37     1.0
## ManufacturingProcess38     0.5
## ManufacturingProcess44     0.5
## ManufacturingProcess03     0.0
## ManufacturingProcess06     0.0
## ManufacturingProcess08     0.0
## ManufacturingProcess12     0.0
## ManufacturingProcess22     0.0
## ManufacturingProcess23     0.0
## ManufacturingProcess36     0.0
## ManufacturingProcess40     0.0
## ManufacturingProcess41     0.0
## ManufacturingProcess43     0.0
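
Because the table above is not sorted, the top 10 are easier to read after ordering; the object name chem.imp is just a convenience introduced here.

chem.imp <- varImp(cubist.mod)
head(chem.imp[order(-chem.imp$Overall), , drop = FALSE], 10)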

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

The plot confirms the dominance of the manufacturing process predictors, but it also shows that some biological predictors, while not dominant overall, appear in the upper splits, suggesting they influence how samples are routed to the terminal nodes. (The split listing added after the plot can be used to check this.)

library(partykit)
tree.mod2 <- as.party(tree.mod)
plot(tree.mod2)
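
As an optional check added here, the single tree's splits and rpart's built-in importance scores can also be printed to back up the reading of the plot.

tree.mod                        # text listing of the splits
tree.mod$variable.importance    # rpart's own importance scores for the single tree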