getwd()
## [1] "T:/00-624 HH Predictive Analytics/HMWK9 Tree HH/HH HW9Tree"
library(ggplot2)
library(mice)
## 
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(ipred)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
library(partykit)
## Loading required package: libcoin
## 
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp
library(rpart)
library(vip)
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi

8.1

Kuhn and Johnson 8.1) Recreate the simulated data from Exercise 7.2

library(mlbench)
library(caret)
## Loading required package: lattice
set.seed(200)

simulated <- mlbench.friedman1(200, sd=1)
simulated <- cbind(simulated$x,simulated$y)
simulated <- as.data.frame(simulated)

colnames(simulated)[ncol(simulated)] <- "y"
head(simulated)
##          V1        V2         V3         V4         V5         V6        V7
## 1 0.5337724 0.6478064 0.85078526 0.18159957 0.92903976 0.36179060 0.8266609
## 2 0.5837650 0.4381528 0.67272659 0.66924914 0.16379784 0.45305931 0.6489601
## 3 0.5895783 0.5879065 0.40967108 0.33812728 0.89409334 0.02681911 0.1785614
## 4 0.6910399 0.2259548 0.03335447 0.06691274 0.63744519 0.52500637 0.5133614
## 5 0.6673315 0.8188985 0.71676079 0.80324287 0.08306864 0.22344157 0.6644906
## 6 0.8392937 0.3862983 0.64618857 0.86105431 0.63038947 0.43703891 0.3360117
##          V8         V9       V10        y
## 1 0.4214081 0.59111440 0.5886216 18.46398
## 2 0.8446239 0.92819306 0.7584008 16.09836
## 3 0.3495908 0.01759542 0.4441185 17.76165
## 4 0.7970260 0.68986918 0.4450716 13.78730
## 5 0.9038919 0.39696995 0.5500808 18.42984
## 6 0.6489177 0.53116033 0.9066182 20.85817

As before, we used the mlbench package to simulate the data from Exercise 7.2.

8.1A

  1. Fit a random forest model to all of the predictors, then estimate the variable importance scores.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(200)
model1 <- randomForest(y ~ ., data=simulated,importance=TRUE,ntree=1000)

rfImp1 <- varImp(model1, scale=FALSE)
rfImp1
##          Overall
## V1   8.605365900
## V2   6.831259165
## V3   0.741534943
## V4   7.883384091
## V5   2.244750293
## V6   0.136054182
## V7   0.055950944
## V8  -0.068195812
## V9   0.003196175
## V10 -0.054705900
vip(model1, color = 'red', fill='lightBlue') + 
  ggtitle('Model1, Random Forest')
## Warning in vip.default(model1, color = "red", fill = "lightBlue"): Arguments
## `width`, `alpha`, `color`, `fill`, `size`, and `shape` have all been deprecated
## in favor of the new `mapping` and `aesthetics` arguments. They will be removed
## in version 0.3.0.

The random forest model ranks V1, V4, V2, V5, and V3 as the most important variables. It assigns essentially no importance to V6 through V10, which are the uninformative predictors in the simulated data.

8.1B

  1. Now add an additional predictor that is highly correlated with one of the informative predictors. Fit another random forest model to these data.
set.seed(200)
simulated$duplicate1 <- simulated$V1 + rnorm(200)*.1+0.9
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9497025

We now add another predictor, duplicate1, which has a correlation of about 0.95 with V1.

model2 <- randomForest(y ~ ., data=simulated,importance=TRUE,ntree=1000)
rfImp2 <- varImp(model2, scale=FALSE)
rfImp2
##                Overall
## V1          6.00709780
## V2          6.05937899
## V3          0.58465293
## V4          6.86363294
## V5          2.19939891
## V6          0.10898039
## V7          0.06104207
## V8         -0.04059204
## V9          0.06123662
## V10         0.09999339
## duplicate1  4.43323167
library(gridExtra)   # provides grid.arrange()
grid.arrange(vip(model1, color = 'red', fill='lightBlue') + 
  ggtitle('Model1 Random Forest Var Imp'),
  vip(model2, color = 'green', fill='Blue') + 
  ggtitle('Model2 Random Forest Var Imp'), ncol = 2)
## Warning in vip.default(model1, color = "red", fill = "lightBlue"): Arguments
## `width`, `alpha`, `color`, `fill`, `size`, and `shape` have all been deprecated
## in favor of the new `mapping` and `aesthetics` arguments. They will be removed
## in version 0.3.0.
## Warning in vip.default(model2, color = "green", fill = "Blue"): Arguments
## `width`, `alpha`, `color`, `fill`, `size`, and `shape` have all been deprecated
## in favor of the new `mapping` and `aesthetics` arguments. They will be removed
## in version 0.3.0.

A: When we rerun the random forest model with this additional highly correlated predictor, the importance ranking is rearranged: the importance previously concentrated in V1 is now shared between V1 and duplicate1, so each of the two contributes noticeably less than V1 did on its own, while together they account for roughly the same signal. The top of the importance ranking is now V4, V2, V1, duplicate1.

Q: What happens when you add another predictor that is also highly correlated with V1?

set.seed(200)
simulated$duplicate2 <- simulated$V1 + rnorm(200)*.1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9497025
model3 <- randomForest(y ~ ., data=simulated, importance=TRUE, ntree=1000) 
rfImp3 <- varImp(model3, scale=FALSE) 
rfImp3
##                Overall
## V1          4.84857115
## V2          6.50242654
## V3          0.51808831
## V4          7.21616819
## V5          2.11631632
## V6          0.22066363
## V7         -0.02066815
## V8         -0.06964769
## V9         -0.02908959
## V10        -0.03105942
## duplicate1  2.84168163
## duplicate2  2.94318033
grid.arrange(vip(model1, color = 'red', fill = 'lightBlue') + ggtitle('Model1 Random Forest Var Imp'), 
             vip(model2, color = 'green', fill = 'Blue') + ggtitle('Model2 Random Forest Var Imp'),
             vip(model3, color = 'white', fill = 'green') + ggtitle('Model3 Random Forest Var Imp'), 
             ncol = 3)
## Warning in vip.default(model1, color = "red", fill = "lightBlue"): Arguments
## `width`, `alpha`, `color`, `fill`, `size`, and `shape` have all been deprecated
## in favor of the new `mapping` and `aesthetics` arguments. They will be removed
## in version 0.3.0.
## Warning in vip.default(model2, color = "green", fill = "Blue"): Arguments
## `width`, `alpha`, `color`, `fill`, `size`, and `shape` have all been deprecated
## in favor of the new `mapping` and `aesthetics` arguments. They will be removed
## in version 0.3.0.
## Warning in vip.default(model3, color = "white", fill = "green"): Arguments
## `width`, `alpha`, `color`, `fill`, `size`, and `shape` have all been deprecated
## in favor of the new `mapping` and `aesthetics` arguments. They will be removed
## in version 0.3.0.

Answer: When we add a third predictor, duplicate2, which is essentially another copy of V1, and rerun the random forest, the model again spreads the credit: V1, duplicate1, and duplicate2 are effectively the same variable, so the importance that originally belonged to V1 is now split across the three of them. The new order of importance is V4, V2, V1, duplicate2, duplicate1, followed by V5.
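
As a rough check of this sharing effect, we can compare the importance V1 received on its own (model1) with the combined importance of V1 and its duplicates (models 2 and 3). Permutation importances are noisy, so the totals are only approximately comparable; this is just a quick look at the numbers already computed above.

# importance of V1 alone (model1) vs. the combined importance of the
# correlated group in model2 and model3 (totals are only roughly comparable)
c(model1_V1_alone = rfImp1["V1", "Overall"],
  model2_V1_group = sum(rfImp2[c("V1", "duplicate1"), "Overall"]),
  model3_V1_group = sum(rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall"]))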

8.1C

  1. Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
set.seed(200)
simulated_removedups <- subset(simulated, select=-c(duplicate1,duplicate2))

model4 <- cforest(y ~., data=simulated_removedups)
rfImp4 <- varimp(model4)
sort(rfImp4)
##          V8         V10          V9          V6          V7          V3 
## -0.30008835 -0.24685120 -0.04506062  0.07895189  0.21504772  0.23458171 
##          V5          V2          V4          V1 
##  2.13338717  6.21307352  7.55930190  8.04166864
vip(model4, color = 'red', fill='yellow') + 
  ggtitle('Model4, cFOREST')
## Warning in vip.default(model4, color = "red", fill = "yellow"): Arguments
## `width`, `alpha`, `color`, `fill`, `size`, and `shape` have all been deprecated
## in favor of the new `mapping` and `aesthetics` arguments. They will be removed
## in version 0.3.0.

grid.arrange(vip(model1, color = 'red', fill = 'lightBlue') + ggtitle('Model1 RF'), 
             vip(model2, color = 'green', fill = 'Blue') + ggtitle('Model2 RF'),  
             vip(model4, color = 'red', fill = 'yellow') + ggtitle('Model4, cForest'), ncol = 3)
## Warning in vip.default(model1, color = "red", fill = "lightBlue"): Arguments
## `width`, `alpha`, `color`, `fill`, `size`, and `shape` have all been deprecated
## in favor of the new `mapping` and `aesthetics` arguments. They will be removed
## in version 0.3.0.
## Warning in vip.default(model2, color = "green", fill = "Blue"): Arguments
## `width`, `alpha`, `color`, `fill`, `size`, and `shape` have all been deprecated
## in favor of the new `mapping` and `aesthetics` arguments. They will be removed
## in version 0.3.0.
## Warning in vip.default(model4, color = "red", fill = "yellow"): Arguments
## `width`, `alpha`, `color`, `fill`, `size`, and `shape` have all been deprecated
## in favor of the new `mapping` and `aesthetics` arguments. They will be removed
## in version 0.3.0.

Answer: The cforest function is applied first without the conditional argument, i.e., using the default, traditional importance measure. In this case cforest ranks the informative predictors in essentially the same order as model1, the original random forest: V1, V4, V2, V5, V3.
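
To make that comparison concrete, here is a small side-by-side look at the two rankings, built from the importance objects already computed above (the exact ordering of the near-zero predictors will vary from run to run):

# order the ten predictors by importance under each model
data.frame(randomForest = rownames(rfImp1)[order(rfImp1$Overall, decreasing = TRUE)],
           cforest      = names(sort(rfImp4, decreasing = TRUE)))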

model4b <- cforest(y ~ ., data = simulated, ntree = 100)  # fit on the full data set, which still contains duplicate1 and duplicate2 (not simulated_removedups)

# Conditional variable importance
cfImp4c <- varimp(model4b, conditional = TRUE)

# Un-conditional variable importance
cfImp4u <- varimp(model4b, conditional = FALSE)

old.par <- par(mfrow = c(1, 2))

barplot(sort(cfImp4c), horiz = TRUE, main = 'Conditional, CRF', col = 'white')
barplot(sort(cfImp4u), horiz = TRUE, main = 'Un-Conditional, CRF', col = 'grey')
par(old.par)   # restore the plotting layout

Answer: Next we refit cforest on the data that still contains the duplicate predictors and compute varimp both ways. With conditional = FALSE, the ranking looks like the original random forest's: the model does not adjust for the fact that duplicate1 and duplicate2 are highly correlated with V1, and it treats them as if they were unrelated predictors.

With conditional = TRUE, the conditional permutation scheme of Strobl et al. (2007) accounts for that correlation: V1's importance is reduced because two other variables carry essentially the same information. So when highly correlated predictors are present, the conditional argument gives a more trustworthy picture of each variable's unique contribution (it changes the importance measure, not the fitted model's predictive performance).
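
One compact way to see the discounting is to line the two importance vectors up side by side (a small comparison using the cfImp4u and cfImp4c objects computed above):

# sort by the unconditional importance and show both measures together
common <- intersect(names(cfImp4u), names(cfImp4c))
imp <- data.frame(unconditional = cfImp4u[common],
                  conditional   = cfImp4c[common])
round(imp[order(-imp$unconditional), ], 3)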

8.1D

  1. Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
set.seed(200)
library(gbm)
## Loaded gbm 2.1.8
gbmGrid <- expand.grid(interaction.depth = seq(1, 6, by = 2),
                       n.trees = seq(100, 500, by = 10),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = c(5, 10, 20))
gbmTune <- train(y ~ .,
                 data = simulated,   # use the data set that still contains the duplicate predictors
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

rfImp5 <- varImp(gbmTune)
rfImp5
## gbm variable importance
## 
##            Overall
## V4         100.000
## V2          81.674
## V1          60.642
## V5          41.524
## V3          37.052
## duplicate1  36.838
## V6           6.480
## V7           3.158
## V10          2.879
## V9           1.926
## V8           1.799
## duplicate2   0.000
vip(gbmTune, color = 'green', fill='purple') + 
  ggtitle('Model5 gbm')
## Warning in vip.default(gbmTune, color = "green", fill = "purple"): Arguments
## `width`, `alpha`, `color`, `fill`, `size`, and `shape` have all been deprecated
## in favor of the new `mapping` and `aesthetics` arguments. They will be removed
## in version 0.3.0.

In the GBM model the predictors rank V4, V2, V1, V5, V3, with duplicate1 close behind V3 and again drawing part of V1's importance. This is the same general pattern we saw with the random forest fit to the data containing the duplicate predictor.

set.seed(200)
library(Cubist)
cubistTuned <- train(y ~ .,
                     data = simulated,   # again the data set that contains the duplicate predictors
                     method = "cubist")

rfImp6 <- varImp(cubistTuned)
rfImp6
## cubist variable importance
## 
##            Overall
## V1          100.00
## V2           79.17
## V4           68.06
## V3           58.33
## V5           52.08
## V6           25.69
## V7            0.00
## V9            0.00
## duplicate1    0.00
## duplicate2    0.00
## V10           0.00
## V8            0.00
vip(cubistTuned, color = 'green', fill='orange') + 
  ggtitle('Model6 Cubist')
## Warning in vip.default(cubistTuned, color = "green", fill = "orange"): Arguments
## `width`, `alpha`, `color`, `fill`, `size`, and `shape` have all been deprecated
## in favor of the new `mapping` and `aesthetics` arguments. They will be removed
## in version 0.3.0.

In the Cubist model the predictors rank V1, V2, V4, V3, V5, a slightly different ordering from the other models. Interestingly, V1 ranks first while both duplicate predictors receive zero importance, which makes me want to look further into the unconditional versus conditional importance settings below.

model6c <- cubist(x = simulated_removedups[, setdiff(names(simulated_removedups), 'y')],
                  y = simulated_removedups[, 'y'])

# Conditional variable importance
cfImp6c <- varImp(model6c, conditional = TRUE)

# Un-conditional variable importance
cfImp6u <- varImp(model6c, conditional = FALSE)

old.par <- par(mfrow = c(1, 2))
barplot(t(cfImp6c), horiz = TRUE, main = 'Conditional, Cubist', col = 'white')
barplot(t(cfImp6u), horiz = TRUE, main = 'Un-Conditional, Cubist', col = 'grey')
par(old.par)   # restore the plotting layout

Answer: We then fit a Cubist model and, as before, requested variable importance with conditional = TRUE and conditional = FALSE. Despite that, Cubist returns the same result either way: the conditional argument is not part of its importance calculation, so it is ignored, and this model does not adjust for highly correlated predictors.
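
A quick sanity check of that claim, using the two importance objects just computed:

# the conditional argument is not used by varImp for a Cubist model,
# so the two results should be identical
all.equal(cfImp6c, cfImp6u)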

SUMMARY 8.1

Summary: when highly correlated predictors are present, the importance credit is shared among them. With two nearly identical copies of V1 the credit is split between the pair, and with three copies it is spread across all three, so no single copy looks as important as V1 did on its own, while the group combined accounts for roughly the same signal. The conditional cforest importance handles this situation best, since it explicitly adjusts for the correlation among predictors and therefore gives the most reliable ranking of variable importance among the approaches tried here.

8.2

8.2A

  1. Use a simulation to show tree bias with different granularities.
set.seed(103)
# two-level (low-granularity) predictor that carries the (weak) true signal
X_categorical <- rep(1:2, each = 100)
Y <- X_categorical + rnorm(200, mean = 0, sd = 4)
# continuous (high-granularity) predictor that is unrelated to Y
X_continuous <- rnorm(200, mean = 0, sd = 2)
simData <- data.frame(Y = Y, X_categorical = X_categorical, X_continuous = X_continuous)

8.2B

  1. Plot frequency of predictor selection for tree bias simulation.
boxplot(
  Y ~ X_categorical,
  main = "Tree Bias, Dichotomized Categorical Predictor",
  xlab = "X_categorical", ylab = "Y")

plot(
  X_continuous, Y,
  main = "Tree Bias, Continuous Predictor",
  xlab = "X_continuous", ylab = "Y", pch = 19)

Answer: We used two extreme cases to demonstrate granularity. The scatter plot shows the continuous predictor against the outcome Y, while the boxplot shows the dichotomized predictor, which splits the observations into just two groups. By construction, Y depends (weakly) on the two-level categorical predictor and the continuous predictor is pure noise, yet the continuous predictor offers up to 199 candidate split points while the categorical one offers only a single split. Because a tree searches over every candidate split, it is biased toward selecting the high-granularity continuous predictor even when it is uninformative. So when a model mixes categorical variables with few levels and fine-grained continuous variables, we have to keep in mind that the variables the tree selects may not truly reflect the underlying relationships.
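
To actually tally how often each predictor is chosen, here is a minimal sketch of a selection-frequency simulation (my own extension of the setup above, using rpart, which was loaded at the top; maxdepth = 1 and cp = 0 force a single split in each fit):

set.seed(103)
first_split <- replicate(200, {
  X_categorical <- rep(1:2, each = 100)                        # low granularity, carries the weak signal
  X_continuous  <- rnorm(200, mean = 0, sd = 2)                # high granularity, pure noise
  Y             <- X_categorical + rnorm(200, mean = 0, sd = 4)
  fit <- rpart(Y ~ X_categorical + X_continuous,
               data = data.frame(Y, X_categorical, X_continuous),
               control = rpart.control(maxdepth = 1, cp = 0))
  as.character(fit$frame$var[1])                               # variable used at the root split
})
table(first_split)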

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient.

Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

Fig 8.24.

8.3A)

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

Answer: The right-hand model concentrates its importance on just a few predictors because both its bagging fraction and its learning rate are set to 0.9, compared with 0.1 for both parameters in the left-hand model.

The learning rate (shrinkage) controls how large a step the model takes toward the gradient at each iteration, and the bagging fraction controls how much of the training data each tree sees. With both parameters set high, the model commits quickly to the strongest predictors and the same few variables dominate tree after tree, so importance is concentrated on them, as in the right-hand plot. With both set low, each tree contributes only a small correction and sees a different subsample, so many more predictors get a chance to enter the model and the importance is spread more evenly, as in the left-hand plot.
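
To see this effect on the solubility data, here is a small sketch in the spirit of Fig. 8.24 (my own illustration using the gbm package directly, not the book's code; the exact numbers will differ from the figure):

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)

# combine predictors and response into one data frame for gbm's formula interface
solTrainData <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(200)
gbm_low  <- gbm(Solubility ~ ., data = solTrainData, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(200)
gbm_high <- gbm(Solubility ~ ., data = solTrainData, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

# relative influence of the top predictors under each setting
head(summary(gbm_low,  plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)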

8.3B)

Which model do you think would be more predictive of other samples?

Answer: In terms of being more predictive of new samples, I would say the left-hand model does the better job; the right-hand model ignores many subtle differences among the predictors.

However, with that being said, it also depends on the sample sizes of the training and test sets. When the sample size is relatively small, the left-hand settings may produce a model that is more complex than the data can support and will not give us the results we need. In that case the right-hand settings will still yield some results; although less precise, they at least point to the dominant predictors rather than spreading weak evidence across many variables.

The right-hand settings, although less accurate, separate the important predictors more sharply. So early in the tuning process, or near the top of the tree, the right-hand configuration is more useful; as the tuning becomes more refined, the left-hand configuration gives a more accurate and detailed picture of the overall variable importance. It is a balance between accuracy and tuning effort.

Also, please note that the right-hand model takes much longer to generate than the left-hand one.
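
As a rough empirical check of 8.3B (my own sketch, not required by the exercise), one could compare the cross-validated error of the two extreme settings on the solubility data; the 0.1/0.1 settings would be expected to reach the lower CV error.

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)

# combine predictors and response into one data frame for gbm's formula interface
solTrainData <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(200)
cv_low  <- gbm(Solubility ~ ., data = solTrainData, distribution = "gaussian",
               n.trees = 500, shrinkage = 0.1, bag.fraction = 0.1, cv.folds = 5)
set.seed(200)
cv_high <- gbm(Solubility ~ ., data = solTrainData, distribution = "gaussian",
               n.trees = 500, shrinkage = 0.9, bag.fraction = 0.9, cv.folds = 5)

# smallest cross-validated squared-error loss reached by each setting
c(low_0.1  = min(cv_low$cv.error),
  high_0.9 = min(cv_high$cv.error))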

8.3C)

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Answer: I expect that as the interaction depth increases, each tree can involve more predictors, so importance spreads across more variables and the slope of the predictor-importance profile becomes less steep for both models. This works in the same spirit as the bagging fraction and learning rate discussed above. Below is an example using the solubility data to compare GBM predictor importance at interaction depth 2 versus depth 10.

library(AppliedPredictiveModeling)
data(solubility)
library(gbm)
library(kableExtra)
# we are using the solubility data as an example to demonstrate the predictor importance

g1 <- expand.grid(n.trees=100, interaction.depth=2, shrinkage=0.5, n.minobsinnode=10)
g2 <- expand.grid(n.trees=100, interaction.depth=10, shrinkage=0.5,n.minobsinnode=10)

model.gbm1 <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = g1, verbose = FALSE)
model.gbm2 <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = g2, verbose = FALSE)

var.imp <- cbind(varImp(model.gbm1)[[1]],varImp(model.gbm2)[[1]])

colnames(var.imp) <- c("Depth2", "Depth10")

kable(var.imp[order(-var.imp$Depth10), ][1:15, ])  # top 15 predictors, ordered descending by the Depth10 importance
Predictor              Depth2        Depth10
NumCarbon           100.0000000    100.000000
MolWeight            64.4192159     53.506418
SurfaceArea2         11.5004931     50.368383
SurfaceArea1         57.7285345     29.757434
NumAromaticBonds      2.8845372     23.007572
HydrophilicFactor    20.7339636      9.616236
NumNonHAtoms          1.0182743      8.248071
NumMultBonds          2.9758465      5.501253
FP081                 0.4693215      4.336724
NumAtoms              1.5129207      3.431398
FP072                 0.0000000      2.980783
FP009                 0.2959903      2.956194
NumHydrogen           3.2084460      2.925829
NumRotBonds           0.7485593      2.875424
FP135                 2.2597325      2.865310