Recreate the simulated data from Exercise 7.2:
library(mlbench)        # mlbench.friedman1()
library(randomForest)
library(caret)          # varImp()
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.83890885
## V2 6.49023056
## V3 0.67583163
## V4 7.58822553
## V5 2.27426009
## V6 0.17436781
## V7 0.15136583
## V8 -0.03078937
## V9 -0.02989832
## V10 -0.08529218
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No. The random forest ranked V1, V4, and V2 as most important, followed by V5 and V3, while the uninformative predictors V6-V10 received importance scores near zero.
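The comparison table further below also references rfImp2 and a duplicated predictor; those come from the exercise's next step, which adds a predictor highly correlated with V1 and refits the random forest. That code is not shown above, so here is a minimal sketch of it, using the duplicate1 construction from the exercise text:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1   # highly correlated with V1
cor(simulated$duplicate1, simulated$V1)
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)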
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
# controls <- cforest_control(ntree = 1000)
library(party)          # cforest(), varimp()
cf <- cforest(y ~ ., data = simulated[, 1:11])   # without the duplicated predictor
cfImp1 <- varimp(cf)
ccfImp1 <- varimp(cf, conditional = TRUE)
cf <- cforest(y ~ ., data = simulated)           # with the duplicated predictor
cfImp2 <- varimp(cf)
ccfImp2 <- varimp(cf, conditional = TRUE)
df1 <- data.frame(rfImp1)
df2 <- data.frame(rfImp2)
df3 <- data.frame(cfImp1)
df4 <- data.frame(cfImp2)
df5 <- data.frame(ccfImp1)
df6 <- data.frame(ccfImp2)
res <- cbind(df1$Overall, df2$Overall,
             df3$cfImp1, df4$cfImp2,
             df5$ccfImp1, df6$ccfImp2)
colnames(res) <- c('rfImp1','rfImp2','cfImp1','cfImp2','ccfImp1','ccfImp2')
rownames(res) <- c('v1','v2','v3','v4','v5','v6','v7','v8','v9','v10','duplicated')
res <- data.frame(res)
res['duplicated','rfImp1'] <- 'n/a'
res['duplicated','cfImp1'] <- 'n/a'
res['duplicated','ccfImp1'] <- 'n/a'
res
## rfImp1 rfImp2 cfImp1 cfImp2
## v1 8.83890884687881 6.29780744 8.24398027002184 6.997928079
## v2 6.49023055671384 6.08038134 6.25754168767664 5.721403952
## v3 0.675831632065687 0.58410718 0.238955295737507 0.176865757
## v4 7.58822552962725 6.93924427 7.11821066817076 6.679439805
## v5 2.27426008558579 2.03104094 2.08822910203085 2.164754408
## v6 0.174367811427058 0.07947642 -0.0339077123981816 0.002175835
## v7 0.151365830084575 -0.02566414 0.132461148719195 0.158646126
## v8 -0.0307893674609138 -0.11007435 -0.203131687868431 -0.072300812
## v9 -0.0298983168753266 -0.08839463 -0.042280789046421 -0.025558718
## v10 -0.0852921822465674 -0.00715093 -0.233782680168299 -0.100624738
## duplicated n/a 3.56411581 n/a 4.358041616
## ccfImp1 ccfImp2
## v1 6.16756829289277 3.71232062
## v2 5.3518538178137 4.81247129
## v3 0.153015077857874 0.02150080
## v4 6.1535726781082 5.80857084
## v5 1.45791688301616 1.48899900
## v6 -0.263060836697065 -0.26393405
## v7 -0.240956852289023 -0.03673253
## v8 -0.384942286104933 -0.33199344
## v9 -0.244790640262602 -0.24349702
## v10 -0.27229862015385 -0.27427204
## duplicated n/a 1.82777831
Yes, the cforest importances show a similar pattern, and adding the duplicated variable again results in a drop in V1's importance (for both the traditional and the conditional measures), since the two correlated predictors share the credit.
library(ipred)          # bagging()
set.seed(200)
bag1 <- bagging(y ~ ., data = simulated[, 1:11], nbagg = 50)   # without duplicate1
bag2 <- bagging(y ~ ., data = simulated, nbagg = 50)           # with duplicate1
(bag1Imp <- varImp(bag1))
## Overall
## V1 1.8013841
## V10 0.7482432
## V2 2.2479822
## V3 1.1596239
## V4 2.6335381
## V5 2.5472188
## V6 0.9449452
## V7 0.9698789
## V8 0.6338984
## V9 0.7085276
(bag2Imp <- varImp(bag2))
## Overall
## duplicate1 1.5253939
## V1 1.7771897
## V10 0.6422615
## V2 1.9752538
## V3 1.1272768
## V4 2.6417971
## V5 2.3159016
## V6 0.7537269
## V7 0.9205175
## V8 0.4518462
## V9 0.4889416
library(gbm)
boost1 <- gbm(y ~ ., data = simulated[, 1:11], n.trees = 100, distribution = "gaussian")
boost2 <- gbm(y ~ ., data = simulated, n.trees = 100, distribution = "gaussian")
(boost1Imp <- varImp(boost1))
## Error in 1:n.trees: argument of length 0
(boost2Imp <- varImp(boost2))
## Error in 1:n.trees: argument of length 0
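The varImp calls above error out in this setup. As a workaround (a sketch assuming only the boost1 and boost2 fits above), gbm's own relative-influence summary provides comparable importance information:
summary(boost1, n.trees = 100, plotit = FALSE)   # relative influence, without duplicate1
summary(boost2, n.trees = 100, plotit = FALSE)   # relative influence, with duplicate1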
library(Cubist)
cubist1 <- cubist(x = simulated[, 1:10], y = simulated$y, committees = 100)
# Note: 'simulated' here still contains the response column y (and duplicate1),
# so y is passed in as a "predictor" and dominates the importances shown below.
cubist2 <- cubist(x = simulated, y = simulated$y, committees = 100)
(cubist1Imp <- varImp(cubist1))
## Overall
## V1 71.5
## V3 47.0
## V2 58.5
## V4 48.0
## V5 33.0
## V6 13.0
## V7 0.0
## V8 0.0
## V9 0.0
## V10 0.0
(cubist2Imp <- varImp(cubist2))
## Overall
## y 50
## V1 0
## V2 0
## V3 0
## V4 0
## V5 0
## V6 0
## V7 0
## V8 0
## V9 0
## V10 0
## duplicate1 0
Use a simulation to show tree bias with different granularities.
Per Kuhn and Johnson, trees suffer from selection bias: predictors with a higher number of distinct values (and therefore more candidate split points) are favored over predictors with only a few distinct values, even when the coarser predictor is the informative one. The simulation below creates two variables: x1 takes only 3 distinct values and is related to the target (the target differs from it only by random noise), while x2 is continuous random data that is unrelated to the target but far more granular, offering many more potential split points. Although x1 carries the signal, the more granular x2 is given more importance.
set.seed(200)
x1 <- rep(1:3, each=100) # 3 categories, 100 records for each
e1 <- rnorm(300,mean=0, sd=3) # random error to be added to target variable
y <- x1+e1 # closely related to x1
x2 <- rnorm(300, mean=0, sd=2) # continuous, many distinct values, drawn independently of e1 so it has no relation to y
sim <- data.frame(y=y, x1=x1, x2=x2)
library(rpart)
fit <- rpart(y ~ ., data = sim)
varImp(fit)
## Overall
## x1 1.485791
## x2 3.812263
Even though x1 is the variable related to the target, the more finely grained x2 receives the higher importance score: its many candidate split points give it more opportunities to produce a split that looks good, illustrating the selection bias described above.
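A hedged extension of this simulation (my own sketch, not part of the original exercise): when neither predictor is related to the response, an unbiased procedure should choose each variable for the root split about equally often, whereas rpart should pick the more granular x2 far more often.
set.seed(300)
first_split <- replicate(200, {
  dat <- data.frame(y  = rnorm(300),            # pure-noise response
                    x1 = rep(1:3, each = 100),  # coarse: only 2 candidate split points
                    x2 = rnorm(300))            # granular: ~299 candidate split points
  fit <- rpart(y ~ ., data = dat,
               control = rpart.control(maxdepth = 1, cp = 0))  # force a single split
  as.character(fit$frame$var[1])                # variable used at the root
})
table(first_split)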
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Per Kuhn and Johnson, boosting employs a greedy strategy, choosing the optimal weak learner at each iteration, and one way to constrain that greediness is the learning rate. Small values of the learning parameter (< 0.01) typically work best, but the smaller the rate, the more iterations, and therefore the more computation time, are required to reach a good model. Stochastic gradient boosting adds a bagging-like step: at each iteration only a random fraction of the training data is used to fit that iteration's tree, and a smaller fraction means less data per tree and less computation. With both the learning rate and the bagging fraction set to 0.9, each tree sees nearly all of the data and contributes strongly to the fit, so the greedy selection keeps returning to the same few dominant predictors and importance is concentrated on them. With both set to 0.1, more randomness enters the fit and each tree contributes only a little, so importance is spread across many more predictors.
Per Ridgeway, smaller learning rates tend to perform better. Selecting a smaller fraction of the data at each iteration likewise reduces the model's dependence on a few dominant variables, so the 0.1/0.1 model would be expected to generalize better to new samples.
Trees in boosting are not independent: each one depends on the previous trees, their depth is limited by the interaction depth, and they contribute unequally to the final model. Increasing the interaction depth makes each weak learner behave more like a full single tree, so different variables can be selected across splits and iterations. Because variable importance is the reduction in squared error attributed to each predictor, summed over all trees, deeper trees give more predictors a chance to contribute to that reduction; importance is therefore spread across more predictors, flattening the slope of the importance plot instead of concentrating weight on the few predictors that dominate shallow trees.
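For reference, a sketch only (not the book's code) of how the two extremes in Figure 8.24 map onto gbm arguments, assuming the solubility data (solTrainXtrans, solTrainY) from the AppliedPredictiveModeling package:
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
gbm_low  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
# Relative influence: expect gbm_high to concentrate on a few predictors and
# gbm_low to spread influence more evenly; increasing interaction.depth would
# flatten these profiles further.
head(summary(gbm_low,  plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)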
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
library(AppliedPredictiveModeling)   # ChemicalManufacturingProcess
library(caret)                       # preProcess, train, varImp, ...
data(ChemicalManufacturingProcess)
features <- subset(ChemicalManufacturingProcess, select = -Yield)
yield <- subset(ChemicalManufacturingProcess, select = Yield)
prep <- preProcess(features, method = c('scale', 'center', 'knnImpute'))
prep_features <- predict(prep, features)
# Train and test
set.seed(1)
split <- createDataPartition(yield$Yield,p=0.75,list=FALSE)
x_train <- prep_features[split,]
y_train <- yield[split,]
x_test <- prep_features[-split,]
y_test <- yield[-split,]
# Additional preprocessing
## remove near zero variance predictors that carry no information
pred_to_remove <- nearZeroVar(features)
x_train <- x_train[-pred_to_remove]
x_test <- x_test[-pred_to_remove]
## Remove highly correlated features
corThresh <- 0.9
tooHigh <- findCorrelation(cor(x_train),corThresh)
x_train <- x_train[,-tooHigh]
x_test <- x_test[,-tooHigh]
set.seed(1)
ctrl <- trainControl(method='cv',number=10)
cart_model <- train(x = x_train, y = y_train,
                    method = 'rpart2',
                    metric = 'Rsquared',
                    trControl = ctrl)
cart_model
## CART
##
## 132 samples
## 46 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 119, 118, 119, 118, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 1.370534 0.4384761 1.099251
## 2 1.325843 0.4672979 1.076872
## 3 1.362245 0.4667858 1.066312
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 2.
plot(cart_model)
set.seed(1)
predictions <- predict(cart_model,x_test)
postResample(obs = y_test,pred = predictions)
## RMSE Rsquared MAE
## 1.6472041 0.3466194 1.2919750
# rf_model <- randomForest(x=x_train,y=y_train,
# importance = TRUE,
# ntree = 1000)
rf_grid <- expand.grid(mtry = c(1, 3, 5, 7, 10))
rf_model <- train(x = x_train, y = y_train,
                  method = 'rf',
                  metric = 'Rsquared',
                  importance = TRUE,
                  tuneGrid = rf_grid,
                  trControl = ctrl)
rf_model
## Random Forest
##
## 132 samples
## 46 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 119, 118, 119, 118, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 1.313275 0.6320500 1.0759019
## 3 1.144113 0.6929042 0.9334935
## 5 1.125750 0.6840603 0.9130703
## 7 1.096581 0.6998949 0.8820174
## 10 1.080599 0.7022848 0.8748689
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 10.
plot(rf_model)
set.seed(1)
predictions <- predict(rf_model,x_test)
postResample(obs = y_test,pred = predictions)
## RMSE Rsquared MAE
## 1.3387788 0.6217914 1.0128653
library(gbm)
set.seed(1)
gbmGrid <- expand.grid(interaction.depth = seq(1, 5, by = 1),
                       n.trees = c(10, 20, 50, 100),
                       shrinkage = c(0.01, 0.1, 0.5),
                       n.minobsinnode = 10)
gbm_model <- train(x = x_train, y = y_train,
                   method = 'gbm',
                   metric = 'Rsquared',
                   verbose = FALSE,
                   tuneGrid = gbmGrid,
                   trControl = ctrl)
gbm_model
## Stochastic Gradient Boosting
##
## 132 samples
## 46 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 119, 118, 119, 118, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.01 1 10 1.723398 0.4983709 1.4013663
## 0.01 1 20 1.671157 0.4968049 1.3552942
## 0.01 1 50 1.548643 0.5255066 1.2532318
## 0.01 1 100 1.400241 0.5692484 1.1282749
## 0.01 2 10 1.716561 0.5148675 1.3957618
## 0.01 2 20 1.655387 0.5197523 1.3436512
## 0.01 2 50 1.496011 0.5815619 1.2117334
## 0.01 2 100 1.333279 0.6118879 1.0788588
## 0.01 3 10 1.705582 0.5340190 1.3845724
## 0.01 3 20 1.635189 0.5595889 1.3250517
## 0.01 3 50 1.477134 0.5922749 1.1976969
## 0.01 3 100 1.308911 0.6228345 1.0626236
## 0.01 4 10 1.698382 0.5814462 1.3801726
## 0.01 4 20 1.630054 0.5718858 1.3233621
## 0.01 4 50 1.465891 0.5952168 1.1883195
## 0.01 4 100 1.309565 0.6147817 1.0637739
## 0.01 5 10 1.708137 0.5035807 1.3883022
## 0.01 5 20 1.639681 0.5492787 1.3295971
## 0.01 5 50 1.473248 0.5938854 1.1913155
## 0.01 5 100 1.314841 0.6126844 1.0630924
## 0.10 1 10 1.426153 0.5243323 1.1566885
## 0.10 1 20 1.261989 0.5966798 1.0225136
## 0.10 1 50 1.127329 0.6554557 0.9029270
## 0.10 1 100 1.064618 0.6840756 0.8437091
## 0.10 2 10 1.358825 0.5607452 1.0935824
## 0.10 2 20 1.221661 0.5933675 0.9695901
## 0.10 2 50 1.120007 0.6403005 0.8725922
## 0.10 2 100 1.085203 0.6573020 0.8460159
## 0.10 3 10 1.322121 0.5427148 1.0690831
## 0.10 3 20 1.208183 0.5861849 0.9686540
## 0.10 3 50 1.111969 0.6535825 0.8947736
## 0.10 3 100 1.059770 0.6823334 0.8622593
## 0.10 4 10 1.330432 0.5722127 1.0901729
## 0.10 4 20 1.221179 0.5954643 0.9847109
## 0.10 4 50 1.116395 0.6547404 0.8751232
## 0.10 4 100 1.067728 0.6859512 0.8476542
## 0.10 5 10 1.307958 0.6060999 1.0544101
## 0.10 5 20 1.177054 0.6214995 0.9509815
## 0.10 5 50 1.067697 0.6809763 0.8483713
## 0.10 5 100 1.024568 0.7003654 0.8269520
## 0.50 1 10 1.255636 0.5464585 1.0209971
## 0.50 1 20 1.255687 0.5488650 0.9879220
## 0.50 1 50 1.237069 0.5670153 0.9698515
## 0.50 1 100 1.312456 0.5304760 1.0298776
## 0.50 2 10 1.213052 0.5467724 0.9518224
## 0.50 2 20 1.286085 0.5125172 1.0208341
## 0.50 2 50 1.374955 0.4604876 1.0888086
## 0.50 2 100 1.328536 0.4991827 1.0538630
## 0.50 3 10 1.292069 0.5277342 1.0269943
## 0.50 3 20 1.270056 0.5547592 1.0142672
## 0.50 3 50 1.245095 0.6051454 0.9506360
## 0.50 3 100 1.232185 0.6126716 0.9562459
## 0.50 4 10 1.207302 0.5641461 0.9609271
## 0.50 4 20 1.185843 0.5871817 0.9315298
## 0.50 4 50 1.220040 0.5788122 0.9858341
## 0.50 4 100 1.216712 0.5715531 0.9656392
## 0.50 5 10 1.183431 0.5964186 0.9494210
## 0.50 5 20 1.149551 0.5884861 0.9614074
## 0.50 5 50 1.195672 0.5599751 0.9817420
## 0.50 5 100 1.170901 0.5764055 0.9560094
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100, interaction.depth =
## 5, shrinkage = 0.1 and n.minobsinnode = 10.
plot(gbm_model)
set.seed(1)
predictions <- predict(gbm_model,x_test)
postResample(obs = y_test,pred = predictions)
## RMSE Rsquared MAE
## 1.2930695 0.5994154 0.9755731
set.seed(1)
cubGrid <- expand.grid(committees = c(1, 3, 5, 7, 10),
                       neighbors = c(1, 3, 5))
cub_model <- train(x = x_train, y = y_train,
                   method = 'cubist',
                   metric = 'Rsquared',
                   tuneGrid = cubGrid,
                   trControl = ctrl,
                   verbose = FALSE)
cub_model
## Cubist
##
## 132 samples
## 46 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 119, 118, 119, 118, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 1 1.175820 0.6380818 0.8675950
## 1 3 1.187623 0.6000023 0.9152934
## 1 5 1.220173 0.5830305 0.9667796
## 3 1 1.129601 0.6389493 0.8634972
## 3 3 1.117596 0.6328648 0.8709977
## 3 5 1.160379 0.6098827 0.9225344
## 5 1 1.079827 0.6731328 0.8199351
## 5 3 1.058706 0.6748209 0.8441410
## 5 5 1.108418 0.6528887 0.9082350
## 7 1 1.082860 0.6667072 0.8214243
## 7 3 1.038260 0.6842429 0.8249929
## 7 5 1.094210 0.6578758 0.8899572
## 10 1 1.042537 0.6865912 0.7929366
## 10 3 1.002615 0.7022529 0.7934951
## 10 5 1.049291 0.6813160 0.8472975
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were committees = 10 and neighbors = 3.
plot(cub_model)
set.seed(1)
predictions <- predict(cub_model,x_test)
postResample(obs = y_test,pred = predictions)
## RMSE Rsquared MAE
## 0.9869493 0.7798385 0.6792144
The following R-squared values were achieved (cross-validated train / held-out test) for each model:
rf     0.7022848 / 0.6217914
gbm    0.7003654 / 0.5994154
cubist 0.7022529 / 0.7798385
Although all three models produced approximately the same R-squared on the training resamples, cubist achieved the highest R-squared and the lowest RMSE on the test set, outperforming all of the other models, including the previously developed linear and non-linear ones.
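The train/test R-squared pairs above can also be pulled programmatically; a sketch assuming the fitted rf_model, gbm_model, and cub_model objects and the test split created earlier:
models <- list(rf = rf_model, gbm = gbm_model, cubist = cub_model)
sapply(models, function(m) {
  c(train = getTrainPerf(m)$TrainRsquared,                                   # resampled R-squared of the selected model
    test  = unname(postResample(pred = predict(m, x_test), obs = y_test)["Rsquared"]))
})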
plot(varImp(cub_model, scale = FALSE), top=10,scales = list(y = list(cex = 0.8)))
All models identify a similar set of variables as important, though in a different order; in all cases MP32, MP09, MP17, BM12, and BM03 (MP = ManufacturingProcess, BM = BiologicalMaterial) appear most frequently in the top 5.
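The corresponding top-10 importance plots for the other models can be produced the same way (a sketch assuming the fitted model objects above):
plot(varImp(cart_model, scale = FALSE), top = 10, scales = list(y = list(cex = 0.8)))
plot(varImp(rf_model,   scale = FALSE), top = 10, scales = list(y = list(cex = 0.8)))
plot(varImp(gbm_model,  scale = FALSE), top = 10, scales = list(y = list(cex = 0.8)))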
library(partykit)   # as.party() for plotting the rpart tree
plot(as.party(cart_model$finalModel), gp = gpar(fontsize = 8))
In the optimal CART model, MP32 sits at the top as the most important variable, which is consistent with what we have seen from the other models. Lower values of MP32 tend to be associated with lower yield, while the biological material predictors only start to contribute further down the tree, on the lower-value branch.