8.1
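
The code in this section assumes the following packages are attached; they may already be loaded in an earlier setup chunk, so this list is only a reference for where each function used below comes from.

library(mlbench)                    # mlbench.friedman1
library(randomForest)               # randomForest
library(caret)                      # varImp, train, createDataPartition, postResample
library(party)                      # cforest, cforest_control, varimp
library(gbm)                        # gbm, gbm.perf
library(Cubist)                     # cubist
library(rpart)                      # rpart
library(rpart.plot)                 # rpart.plot
library(ipred)                      # bagging
library(AppliedPredictiveModeling)  # solubility, ChemicalManufacturingProcess
library(VIM)                        # kNN imputation
library(dplyr)                      # select, %>%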

Recreate the simulated data from Exercise 7.2:

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

a.

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

Did the random forest model significantly use the uninformative predictors (V6–V10)?

rfImp1

From the table, we can see that the variable importance scores for V6 through V10 are very small. The model did not make significant use of the uninformative predictors (V6–V10).

b.

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

set.seed(100)
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2

After adding a predictor that is highly correlated with V1, the importance score for V1 decreases, but the combined importance of V1 and the newly added predictor increases.

simulated2 <- simulated
simulated2$duplicate2 <- simulated$V1 + rnorm(200) * .01
cor(simulated2$duplicate2, simulated2$V1)
## [1] 0.9994584
model3 <- randomForest(y ~ ., data = simulated2, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3

I then added another predictor that is also highly correlated with V1. The importance score for V1 continues to decline, while the combined importance of the three correlated predictors increases drastically.
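
One way to make this dilution effect concrete is to sum the scores of V1 and its correlated copies across the three fits. This is a quick sketch; it assumes the single Overall column that varImp returns for regression random forests.

# Importance attributed to V1 and its correlated copies in each fit
rfImp1["V1", "Overall"]
sum(rfImp2[c("V1", "duplicate1"), "Overall"])
sum(rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall"])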

c.

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

set.seed(100)
bagCtrl <- cforest_control(mtry = ncol(simulated) - 1)
bagSimulated <- cforest(y ~ ., data = simulated, controls = bagCtrl)
bagSimulated
## 
##   Random Forest using Conditional Inference Trees
## 
## Number of trees:  500 
## 
## Response:  y 
## Inputs:  V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, duplicate1 
## Number of observations:  200
varimp(bagSimulated)
##           V1           V2           V3           V4           V5 
##  3.840518590  7.445780832  0.047691944 10.319701342  2.236977179 
##           V6           V7           V8           V9          V10 
## -0.013885087  0.063017675 -0.040109462 -0.007222512  0.026651870 
##   duplicate1 
##  5.458303058

Compared to the traditional random forest model, the cforest importance scores show the same overall pattern, but the individual values change somewhat. For example, V1’s importance score decreases from 5.727115518 to 4.06414449.
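
For completeness, the conditional measure from Strobl et al. (2007) that the question refers to can be requested explicitly. This call is only a sketch (it was not run here), and the conditional permutation scheme is considerably slower.

# Conditional permutation importance, which adjusts for correlated predictors
varimp(bagSimulated, conditional = TRUE)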

d.

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

# boosted trees
gbmSimulated <- gbm(y ~ ., data = simulated, distribution = "gaussian")
gbmSimulated
## gbm(formula = y ~ ., distribution = "gaussian", data = simulated)
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## There were 11 predictors of which 8 had non-zero influence.
varImp(gbmSimulated, numTrees = 100)
# Cubist
cubistSimulated <- cubist(simulated[, -11], simulated$y, data = simulated, committees = 100)
cubistSimulated
## 
## Call:
## cubist.default(x = simulated[, -11], y = simulated$y, committees =
##  100, data = simulated)
## 
## Number of samples: 200 
## Number of predictors: 11 
## 
## Number of committees: 100 
## Number of rules per committee: 1, 1, 4, 1, 4, 1, 4, 1, 6, 1, 4, 2, 3, 2, 3, 2, 3, 2, 3, 2 ...
varImp(cubistSimulated)

With boosted trees and Cubist, the pattern is not the same. In the boosted tree model, V4 is the most important variable, and its importance score is greater than the combined score of V1 and duplicate1 (which is highly correlated with V1). In the Cubist model, V2 becomes the most important variable.
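
As a rough check of how the boosted model splits credit between V1 and duplicate1, the two scores can be summed directly. This is a sketch; it assumes the Overall column that caret's varImp returns for gbm objects.

gbmImp <- varImp(gbmSimulated, numTrees = 100)
# Share of total influence credited to V1 plus its near-duplicate
sum(gbmImp[c("V1", "duplicate1"), "Overall"]) / sum(gbmImp$Overall)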

8.2

Use a simulation to show tree bias with different granularities.

X1 <- rep(1:2, each = 100)
X2 <- rnorm(200, mean = 0, sd = 2)
X3 <- rnorm(200, mean = 2, sd = 7)
X4 <- rep(3:4, each = 100)
y <- X1 * X4 + rnorm(200, mean = 0, sd = 5)
df <- data.frame(cbind(X1, X2, X3, X4))
head(df)
Simulation <- rpart(y ~ ., data = df)
rpart.plot(Simulation)

varImp(Simulation)

Trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors. In this simulation, X1 and X4 determine the value of y, yet they receive lower importance scores because X2 and X3 contain many more distinct values.
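
The difference in granularity is easy to verify directly (a quick check, not part of the output above):

# Distinct values per predictor: X1 and X4 have only two levels each,
# while X2 and X3 are effectively continuous (about 200 distinct values)
sapply(df, function(x) length(unique(x)))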

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

a.


Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?


The bagging fraction is the fraction of the training set observations randomly selected to propose the next tree in the expansion. A value of 0.9 means 90% of the training samples are used at each iteration. When this fraction is high, the trees see nearly the same data set at each iteration, so they will likely split similarly.

The learning rate parameter in gradient boosting shrinks the contribution of each new base model; it controls the fraction of each tree's predictions that is added to the ensemble. A higher learning rate increases the dependence/correlation structure across iterations, so more of the same predictors are selected among the trees.

The combination of these effects is why the right-hand plot, with the higher bagging fraction and learning rate, concentrates importance on just the first few predictors and shows a steeper variable importance profile.
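
A sketch of how the two panels of Fig. 8.24 could be reproduced on the solubility data with gbm.fit is shown below. It was not run here, and the exact settings used for the figure may differ, so it only illustrates the bag.fraction and shrinkage arguments.

data(solubility)  # from AppliedPredictiveModeling, as in part (c) below
# Left-hand panel: both the bagging fraction and the learning rate at 0.1
fitLeft <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                   n.trees = 100, bag.fraction = 0.1, shrinkage = 0.1,
                   verbose = FALSE)
# Right-hand panel: both parameters at 0.9
fitRight <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, bag.fraction = 0.9, shrinkage = 0.9,
                    verbose = FALSE)
# Relative influence: spread out for fitLeft, concentrated for fitRight
head(summary(fitLeft, plotit = FALSE), 10)
head(summary(fitRight, plotit = FALSE), 10)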

b.


Which model do you think would be more predictive of other samples?


The learning rate reduces the size of each incremental step and thus shrinks the contribution of each successive iteration, which is an effective way to control overfitting. I would prefer the model with the smaller (slower) learning rate and expect it to have a lower error rate on other samples.
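
This preference could be checked with resampling, for example by tuning only the shrinkage parameter over the two extremes via caret. The sketch below was not run, and the bagging fraction is left at its gbm default because it is not part of caret's tuning grid for this model.

data(solubility)
cmpGrid <- expand.grid(n.trees = 100, interaction.depth = 1,
                       shrinkage = c(0.1, 0.9), n.minobsinnode = 10)
set.seed(100)
cmpFit <- train(x = solTrainXtrans, y = solTrainY, method = "gbm",
                tuneGrid = cmpGrid, verbose = FALSE,
                trControl = trainControl(method = "cv", number = 10))
# Compare cross-validated error at the two learning rates
cmpFit$results[, c("shrinkage", "RMSE", "Rsquared")]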

c.

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

data(solubility)

depthGrid1 <- expand.grid(n.trees = 100, interaction.depth = 1, shrinkage = 0.1, n.minobsinnode = 10)
depth1 <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = depthGrid1, verbose = FALSE)

depthGrid2 <- expand.grid(n.trees = 100, interaction.depth = 6, shrinkage = 0.1, n.minobsinnode = 10)
depth2 <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = depthGrid2, verbose = FALSE)

varImp(depth1)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 228)
## 
##                    Overall
## NumCarbon         100.0000
## MolWeight          65.7843
## SurfaceArea2       36.8174
## NumAromaticBonds   19.6755
## NumChlorine        19.0671
## SurfaceArea1       14.0323
## HydrophilicFactor  12.3080
## FP044               6.8117
## FP172               5.2298
## NumNonHAtoms        2.7371
## FP059               2.6971
## NumMultBonds        2.6190
## FP112               2.4667
## NumRotBonds         2.1721
## FP147               2.0676
## FP135               1.7993
## FP206               1.0069
## NumOxygen           1.0040
## NumHydrogen         0.9203
## FP124               0.7820
plot(varImp(depth1))

varImp(depth2)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 228)
## 
##                   Overall
## NumCarbon         100.000
## MolWeight          70.607
## SurfaceArea1       31.125
## SurfaceArea2       28.867
## HydrophilicFactor  19.189
## NumNonHAtoms        8.388
## NumAromaticBonds    7.735
## NumHalogen          6.476
## NumMultBonds        5.918
## FP044               4.499
## NumChlorine         3.366
## FP172               3.109
## NumHydrogen         2.735
## FP161               2.436
## NumRotBonds         2.366
## NumOxygen           2.270
## FP075               1.907
## FP059               1.798
## FP063               1.772
## NumBonds            1.546
plot(varImp(depth2))

Interaction depth (the maximum number of splits per tree) controls how many splits each tree performs. Taking a learning rate of 0.1 as an example, when the interaction depth is increased from 1 to 6, importance is spread over more predictors and the slope of the importance profile becomes less steep.

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

data(ChemicalManufacturingProcess)
chemical <- data.frame(ChemicalManufacturingProcess)
chemical <- kNN(chemical) %>%
  select(Yield:ManufacturingProcess45)

indx <- createDataPartition(chemical$Yield, p = 0.8, list = FALSE)
chemical_train <- chemical[indx, ]
chemical_test <- chemical[-indx, ]
chemical_train

a.

Which tree-based regression model gives the optimal resampling and test set performance?

Single Tree:

set.seed(200)
treeModel <- rpart(Yield ~ ., data = chemical_train)
treeModel
## n= 144 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 144 450.441400 40.16215  
##    2) ManufacturingProcess32< 158.5 81 118.560800 39.14173  
##      4) BiologicalMaterial11< 145.565 45  45.217840 38.59111  
##        8) ManufacturingProcess39>=7.15 33  22.485820 38.22515  
##         16) ManufacturingProcess09< 44.975 15   4.950040 37.69800 *
##         17) ManufacturingProcess09>=44.975 18   9.893844 38.66444 *
##        9) ManufacturingProcess39< 7.15 12   6.158625 39.59750 *
##      5) BiologicalMaterial11>=145.565 36  42.646000 39.83000  
##       10) ManufacturingProcess11< 9.95 25  16.257860 39.37880 *
##       11) ManufacturingProcess11>=9.95 11   9.731473 40.85545 *
##    3) ManufacturingProcess32>=158.5 63 139.097700 41.47413  
##      6) ManufacturingProcess13>=33.45 45  59.576080 40.96267  
##       12) ManufacturingProcess29< 20.8 33  34.219650 40.55273  
##         24) BiologicalMaterial05>=16.865 23  15.216690 40.18304  
##           48) ManufacturingProcess21< -0.35 8   2.104350 39.54250 *
##           49) ManufacturingProcess21>=-0.35 15   8.079373 40.52467 *
##         25) BiologicalMaterial05< 16.865 10   8.630010 41.40300 *
##       13) ManufacturingProcess29>=20.8 12   4.560200 42.09000 *
##      7) ManufacturingProcess13< 33.45 18  38.320960 42.75278 *
treePre <- predict(treeModel, chemical_test)
treeResult <- postResample(treePre, chemical_test$Yield)
treeResult
##      RMSE  Rsquared       MAE 
## 2.0888204 0.1850943 1.5768327

Bagged Tree:

set.seed(200)
baggedModel <- bagging(Yield ~ ., data = chemical_train)
baggedModel
## 
## Bagging regression trees with 25 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Yield ~ ., data = chemical_train)
baggedPre <- predict(baggedModel, chemical_test)
baggedResult <- postResample(baggedPre, chemical_test$Yield)
baggedResult
##      RMSE  Rsquared       MAE 
## 1.6126005 0.4373037 1.2674998

Random Forest

set.seed(200)
rfModel <- randomForest(Yield ~ ., data = chemical_train)
rfModel
## 
## Call:
##  randomForest(formula = Yield ~ ., data = chemical_train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 19
## 
##           Mean of squared residuals: 1.123794
##                     % Var explained: 64.07
rfPre <- predict(rfModel, chemical_test)
rfResult <- postResample(rfPre, chemical_test$Yield)
rfResult
##      RMSE  Rsquared       MAE 
## 1.6497909 0.4070138 1.2681745

Gradient Boosting

set.seed(200)
gbmModel <- gbm(Yield ~ ., data = chemical_train, distribution = "gaussian", cv.folds = 5)
gbmModel
## gbm(formula = Yield ~ ., distribution = "gaussian", data = chemical_train, 
##     cv.folds = 5)
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## The best cross-validation iteration was 85.
## There were 57 predictors of which 32 had non-zero influence.
# optimum number of trees
gbm.perf(gbmModel, method = "cv")

## [1] 85
gbmPre <- predict(gbmModel, chemical_test)  # uses the CV-selected number of trees
## Using 85 trees...
gbmResult <- postResample(gbmPre, chemical_test$Yield)
gbmResult
##      RMSE  Rsquared       MAE 
## 1.6526674 0.4161624 1.3105641

Cubist

set.seed(200)
cubistModel <- cubist(chemical_train[, 2:58], chemical_train$Yield, committees = 100)
cubistModel
## 
## Call:
## cubist.default(x = chemical_train[, 2:58], y = chemical_train$Yield,
##  committees = 100)
## 
## Number of samples: 144 
## Number of predictors: 57 
## 
## Number of committees: 100 
## Number of rules per committee: 1, 6, 1, 7, 1, 6, 3, 1, 1, 1, 1, 1, 8, 2, 8, 2, 2, 4, 2, 4 ...
cubistPre <- predict(cubistModel, chemical_test)
cubistResult <- postResample(cubistPre, chemical_test$Yield)
cubistResult
##      RMSE  Rsquared       MAE 
## 1.2456660 0.6645186 0.9761990

Comparing the models:

result <- data.frame(rbind(treeResult, baggedResult, rfResult, gbmResult, cubistResult))
rownames(result) <- c("Decision Tree", "Bagged Tree", "Random Forest", "Gradient Boosting", "Cubist")
result

Cubist has the highest R-squared (0.6645186) and the lowest RMSE (1.245666), so the Cubist model gives the best test set performance of the models compared.
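
The comparison above is based on the single test split; a resampling-based comparison of the two strongest models could be sketched with caret's resamples function (assumptions: 10-fold cross-validation and default tuning grids; not run here).

cvCtrl <- trainControl(method = "cv", number = 10)
set.seed(200)
cubistCV <- train(Yield ~ ., data = chemical_train, method = "cubist", trControl = cvCtrl)
set.seed(200)
rfCV <- train(Yield ~ ., data = chemical_train, method = "rf", trControl = cvCtrl)
# Same seed before each fit so both models see identical CV folds
summary(resamples(list(Cubist = cubistCV, RandomForest = rfCV)))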

b.

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

The optimal tree-based regression model is the Cubist model, so I will use it to examine variable importance.

varImp(cubistModel)

I use a PLS model (linear model) to compare with the Cubist model:

# PLS model
ctrl <- trainControl(method = "cv", number = 10)
set.seed(100)
chemicalTune <- train(x = chemical_train[, 2:58],
                      y = chemical_train$Yield,
                      method = 'pls',
                      metric = 'Rsquared',
                      tuneLength = 20,
                      trControl = ctrl,
                      preProcess = c('center', 'scale'))

varImp(chemicalTune)
## pls variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess36   75.02
## ManufacturingProcess13   68.87
## ManufacturingProcess33   67.89
## ManufacturingProcess09   64.10
## BiologicalMaterial02     57.29
## BiologicalMaterial08     56.63
## BiologicalMaterial06     56.16
## ManufacturingProcess06   55.92
## ManufacturingProcess12   52.47
## ManufacturingProcess17   51.91
## ManufacturingProcess04   50.68
## BiologicalMaterial03     50.21
## BiologicalMaterial11     49.61
## BiologicalMaterial12     49.43
## BiologicalMaterial01     49.27
## BiologicalMaterial04     47.62
## ManufacturingProcess28   45.49
## ManufacturingProcess11   38.60
## ManufacturingProcess02   38.31

For the Cubist model, the most important predictor is ManufacturingProcess17. Process variables still dominate the top 5 of the list, and their total importance remains higher than that of the biological variables. At the same time, if we count the top 10 of the list, there are more biological variables than process variables.

Compared to the PLS (linear) model, the number and weight of process variables in the Cubist model decrease, and biological variables gradually take their place.
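
The overlap between the two top-10 lists can be checked directly. This is a sketch; it assumes the Overall column that varImp returns for both model types.

cubistImp <- varImp(cubistModel)              # Cubist object: plain data frame
plsImp <- varImp(chemicalTune)$importance     # train object: importance stored in $importance
cubistTop10 <- rownames(cubistImp)[order(-cubistImp$Overall)][1:10]
plsTop10 <- rownames(plsImp)[order(-plsImp$Overall)][1:10]
intersect(cubistTop10, plsTop10)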

c.

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

rpart.plot(treeModel, nn=TRUE)

I plot the single decision tree model from part (a). ManufacturingProcess32 provides the first split and is the most important variable; the branch with ManufacturingProcess32 >= 158.5 contains the higher-yield samples. We can also see that the manufacturing process variables play a much larger role in determining yield than the biological variables.
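
To show the distribution of yield in the terminal nodes explicitly, the rpart fit can be converted with partykit, whose plot method draws a boxplot of the response in each terminal node. This is a sketch and assumes the partykit package is installed; it was not used elsewhere in this write-up.

library(partykit)
# as.party converts the rpart fit; plot() then shows the yield distribution
# (boxplots) in each terminal node
plot(as.party(treeModel))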