Recreate the simulated data from Exercise 7.2:
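The code below assumes the following packages are attached; this is a minimal setup sketch, and the exact list may differ from the original session.
library(mlbench)                    # mlbench.friedman1() simulated data
library(randomForest)               # randomForest()
library(caret)                      # varImp(), train(), createDataPartition()
library(party)                      # cforest(), varimp()
library(gbm)                        # gbm()
library(Cubist)                     # cubist()
library(rpart)                      # rpart()
library(partykit)                   # as.party(); note partykit also exports
                                    # cforest()/varimp(), so load order matters
library(AppliedPredictiveModeling)  # ChemicalManufacturingProcess data
library(dplyr)                      # %>% pipe used with sort()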
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
head(simulated)
## V1 V2 V3 V4 V5 V6 V7
## 1 0.5337724 0.6478064 0.85078526 0.18159957 0.92903976 0.36179060 0.8266609
## 2 0.5837650 0.4381528 0.67272659 0.66924914 0.16379784 0.45305931 0.6489601
## 3 0.5895783 0.5879065 0.40967108 0.33812728 0.89409334 0.02681911 0.1785614
## 4 0.6910399 0.2259548 0.03335447 0.06691274 0.63744519 0.52500637 0.5133614
## 5 0.6673315 0.8188985 0.71676079 0.80324287 0.08306864 0.22344157 0.6644906
## 6 0.8392937 0.3862983 0.64618857 0.86105431 0.63038947 0.43703891 0.3360117
## V8 V9 V10 y
## 1 0.4214081 0.59111440 0.5886216 18.46398
## 2 0.8446239 0.92819306 0.7584008 16.09836
## 3 0.3495908 0.01759542 0.4441185 17.76165
## 4 0.7970260 0.68986918 0.4450716 13.78730
## 5 0.9038919 0.39696995 0.5500808 18.42984
## 6 0.6489177 0.53116033 0.9066182 20.85817
str(simulated)
## 'data.frame': 200 obs. of 11 variables:
## $ V1 : num 0.534 0.584 0.59 0.691 0.667 ...
## $ V2 : num 0.648 0.438 0.588 0.226 0.819 ...
## $ V3 : num 0.8508 0.6727 0.4097 0.0334 0.7168 ...
## $ V4 : num 0.1816 0.6692 0.3381 0.0669 0.8032 ...
## $ V5 : num 0.929 0.1638 0.8941 0.6374 0.0831 ...
## $ V6 : num 0.3618 0.4531 0.0268 0.525 0.2234 ...
## $ V7 : num 0.827 0.649 0.179 0.513 0.664 ...
## $ V8 : num 0.421 0.845 0.35 0.797 0.904 ...
## $ V9 : num 0.5911 0.9282 0.0176 0.6899 0.397 ...
## $ V10: num 0.589 0.758 0.444 0.445 0.55 ...
## $ y : num 18.5 16.1 17.8 13.8 18.4 ...
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.83890885
## V2 6.49023056
## V3 0.67583163
## V4 7.58822553
## V5 2.27426009
## V6 0.17436781
## V7 0.15136583
## V8 -0.03078937
## V9 -0.02989832
## V10 -0.08529218
Did the random forest model significantly use the uninformative predictors (V6 – V10)? No. From the table above, the importance scores for V6 – V10 are close to zero (and in some cases slightly negative), which is very small compared to the scores for V1 – V5, so the random forest did not make meaningful use of the uninformative predictors.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 6.29780744
## V2 6.08038134
## V3 0.58410718
## V4 6.93924427
## V5 2.03104094
## V6 0.07947642
## V7 -0.02566414
## V8 -0.11007435
## V9 -0.08839463
## V10 -0.00715093
## duplicate1 3.56411581
From the output above, the importance of V1 decreased (from about 8.84 to about 6.30) after adding duplicate1. Because V1 and duplicate1 are highly correlated, the trees in the random forest are roughly as likely to pick duplicate1 for a split as V1, so the split credit that previously belonged entirely to V1 is now shared with duplicate1.
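To answer the second part of the question, the sketch below (my own addition, results not shown) adds a second predictor, duplicate2, that is also highly correlated with V1 and refits the forest on a copy of the data; the expectation is that V1's importance is diluted further because the split credit is now shared three ways.
# Hypothetical follow-up: a second correlated copy of V1, fit on a copy of the data
sim_dup2 <- simulated
sim_dup2$duplicate2 <- sim_dup2$V1 + rnorm(200) * .1
cor(sim_dup2$duplicate2, sim_dup2$V1)
model3 <- randomForest(y ~ ., data = sim_dup2, importance = TRUE, ntree = 1000)
varImp(model3, scale = FALSE)   # V1 should drop further, sharing credit with both duplicates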
cforest_model <- cforest(y~., data= simulated)
# Unconditional importance measure
varimp(cforest_model) %>% sort(decreasing = T)
## V4 V1 V2 duplicate1 V5 V3
## 7.05287555 6.81916581 5.76581501 4.61092783 2.23557425 0.24301265
## V7 V6 V9 V8 V10
## 0.23570674 0.05463455 -0.07485005 -0.16012095 -0.23819254
# Conditional importance measure
varimp(cforest_model, conditional=T) %>% sort(decreasing = T)
## V4 V2 V1 duplicate1 V5 V3
## 5.90953963 4.88997499 3.97578757 1.80565856 1.54768911 0.04714052
## V7 V9 V6 V8 V10
## -0.08006441 -0.18315215 -0.19461785 -0.26034939 -0.35914281
From the tables above, V4 is rated most important under both measures, and the uninformative predictors (V6 – V10) still receive low importance. The conditional importance scores differ somewhat from the traditional (unconditional) ones: the importances of the correlated predictors are reduced, especially for V1 and duplicate1, because the conditional permutation scheme accounts for the correlation between them.
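Permutation importances from a single cforest fit can be noisy. The sketch below (illustrative settings of my own choosing, not part of the original analysis) grows more trees via cforest_unbiased() before recomputing the conditional importance, which should stabilize the ranking.
# Illustrative only: more trees for a more stable conditional importance
set.seed(200)
cforest_model2 <- cforest(y ~ ., data = simulated,
                          controls = cforest_unbiased(ntree = 1000))
varimp(cforest_model2, conditional = TRUE) %>% sort(decreasing = TRUE)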
I will start with the gradient boosted trees model.
gbmmodel <- gbm(y~., data = simulated, distribution = "gaussian")
summary(gbmmodel)
## var rel.inf
## V4 V4 29.1007123
## V1 V1 25.4265069
## V2 V2 24.4644959
## V5 V5 10.5766499
## V3 V3 7.2246950
## duplicate1 duplicate1 2.5932312
## V6 V6 0.3390871
## V9 V9 0.2746217
## V7 V7 0.0000000
## V8 V8 0.0000000
## V10 V10 0.0000000
From the table above, the same overall pattern occurs: V6 – V10 are still very low, and V4, V1, and V2 are at the top. Unlike the random forest, however, boosting assigns duplicate1 only a small share of the importance (about 2.6 versus 25.4 for V1). Because boosted trees are grown sequentially and depend on one another, once V1 has been used the correlated copy adds little, so the relative influence is concentrated on the top predictors and looks more "stretched" than in the other tree models.
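The gbm() call above relies on the package defaults (100 trees and default shrinkage). As a quick stability check, the sketch below (illustrative, untuned settings of my own choosing) refits the booster with more trees and an explicit interaction depth; the relative-influence ranking can then be compared with the table above.
# Illustrative, untuned settings to check whether the ranking is stable
set.seed(200)
gbmmodel2 <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 3, shrinkage = 0.05)
summary(gbmmodel2, plotit = FALSE)   # relative influence table only, no plot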
Now looking at the Cubist model
cubistmodel <- cubist(x=simulated[, -(ncol(simulated)-1)], y=simulated$y, committees=100)
varImp(cubistmodel)
## Overall
## V1 64.5
## V3 41.0
## V2 60.0
## V4 48.0
## V5 31.0
## V6 9.0
## duplicate1 6.0
## V8 2.0
## V10 0.5
## V7 0.0
## V9 0.0
The Cubist result is a little different: the uninformative predictors V6 – V10 are still rated very low, but the top predictor is now V1 rather than V4, and the duplicated predictor duplicate1 is rated well below V1.
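For completeness, one can also look at the raw usage statistics that caret's varImp() is built on: the fitted Cubist object stores, for each predictor, how often it appears in rule conditions versus in the per-rule linear models. The sketch below assumes the object exposes a usage table with Conditions, Model, and Variable columns (true for the Cubist versions I have used, but treat it as an assumption).
# Raw usage statistics behind the importance scores (column names assumed)
head(cubistmodel$usage)
# Hand-rolled equivalent of the default varImp() weighting, if the columns match
with(cubistmodel$usage, setNames((Conditions + Model) / 2, Variable))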
Use a simulation to show tree bias with different granularities.
Single regression trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors (those with fewer distinct values). When the data contain a mix of informative and noise variables, a noise variable with many distinct values can be chosen for splits more often than an informative variable with few distinct values, and the resulting trees can be misleading.
To demonstrate this, a predictor v1 is created with 100 values equal to 1 and 100 values equal to 2. The response y is set equal to v1 plus pseudo-random Gaussian noise with μ = 0, σ = 4. A second predictor v2, unrelated to the response but with many distinct values, is generated as pseudo-random Gaussian noise with μ = 0, σ = 4.
I will use the rpart() function, which makes splits based on the CART methodology, and then the varImp() function to calculate the variable importance for the fitted tree.
set.seed(123)
v1 <- rep(1:2, each= 100)
y<- v1+ rnorm(200, mean = 0, sd = 4)
v2 <- rnorm(200, mean = 0, sd=4)
mydata <- data.frame(y=y, v1=v1, v2=v2)
fit <- rpart(y~ ., data = mydata)
varImp(fit)
## Overall
## v1 0.01511841
## v2 0.34894660
From the output above, v1, the informative predictor with only two distinct values, receives far less importance than v2, the uninformative predictor with many distinct values; the importance of v2 is much higher, illustrating the selection bias toward finer-grained predictors.
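To make the granularity effect more explicit, the sketch below (my own extension of the exercise) repeats the simulation with uninformative predictors of increasing numbers of distinct values and tallies which variable wins the root split across 100 replications; the finer-grained noise variables should win more often than the informative two-level predictor.
# Repeat the simulation and record which predictor is chosen at the root split
set.seed(123)
n_sims <- 100
first_split <- character(n_sims)
for (i in seq_len(n_sims)) {
  v1         <- rep(1:2, each = 100)               # informative, 2 distinct values
  y          <- v1 + rnorm(200, mean = 0, sd = 4)
  noise10    <- sample(1:10,  200, replace = TRUE) # uninformative, 10 distinct values
  noise100   <- sample(1:100, 200, replace = TRUE) # uninformative, 100 distinct values
  noise_cont <- rnorm(200)                         # uninformative, continuous
  dat <- data.frame(y, v1, noise10, noise100, noise_cont)
  fit <- rpart(y ~ ., data = dat)
  first_split[i] <- if (nrow(fit$frame) > 1) as.character(fit$frame$var[1]) else NA
}
table(first_split, useNA = "ifany")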
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
The model on the right has a higher bagging fraction and learning rate. With a larger bagging fraction, a larger portion of the training data is used to construct each tree; as the fraction approaches 1, the samples drawn for each tree become very similar to one another, so the same few predictors dominate tree after tree and the importance concentrates on them. A higher learning rate also makes the fitting greedier, which reinforces that concentration.
The model on the left has a lower bagging fraction and learning rate. The lower the learning rate, the less greedy the model is, which makes it more likely to identify a larger number of predictors as important.
The model on the left, with the lower bagging fraction and learning rate, is likely to be more predictive of other samples because it is less prone to overfitting a few dominant predictors.
When we increase the tree depth (interaction depth), the variable importance is more likely to be spread out over more predictors, because more predictors are considered for the additional splits in each tree.
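Since the solubility data from the book are not loaded here, the sketch below illustrates the same contrast on the simulated Friedman data used earlier: one gbm fit with both the bagging fraction and learning rate at 0.1 and one with both at 0.9 (the two extreme settings from the exercise; everything else is my own illustrative choice). Comparing the two relative-influence tables should show the 0.9/0.9 fit concentrating importance on fewer predictors.
# Illustrative contrast of the two extreme settings on the simulated data
set.seed(100)
gbm_low  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbm_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
summary(gbm_low,  plotit = FALSE)   # importance spread over more predictors
summary(gbm_high, plotit = FALSE)   # importance concentrated in a few predictors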
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
data(ChemicalManufacturingProcess)
set.seed(123)
cmp_pred <- as.matrix(ChemicalManufacturingProcess[,2:58])
cmp_yield <- ChemicalManufacturingProcess[,1]
train <- createDataPartition(cmp_yield, p=0.75, list = F) #create train set
train_x <- ChemicalManufacturingProcess[train, -1]
train_y <- ChemicalManufacturingProcess[train, 1]
test_x <- ChemicalManufacturingProcess[-train, -1]
test_y <- ChemicalManufacturingProcess[-train, 1]
pre_process <- c("nzv", "corr", "center", "scale", "medianImpute")
ctrl <- trainControl(method = "boot", number = 25)
Based on the resampled RMSE results below, the Cubist model gives the lowest value.
Recursive Partitioning.
set.seed(123)
rpartGrid <- expand.grid(maxdepth= seq(1,10,by=1))
rp_model <- train(x = train_x, y = train_y, method = "rpart2",metric = "Rsquared", tuneGrid = rpartGrid, trControl = ctrl, preProcess=pre_process)
rp_model
## CART
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (47), scaled (47), median imputation (47), remove (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 1.587109 0.3222359 1.262216
## 2 1.588304 0.3471138 1.260578
## 3 1.592221 0.3548035 1.257264
## 4 1.577274 0.3744462 1.237809
## 5 1.596035 0.3695650 1.259102
## 6 1.590127 0.3799986 1.248773
## 7 1.594061 0.3795646 1.250253
## 8 1.588828 0.3833099 1.239152
## 9 1.591847 0.3840905 1.239384
## 10 1.593658 0.3852311 1.240069
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 10.
Recursive Partitioning: RMSE 1.496662
Random Forest
set.seed(123)
rfGrid <- expand.grid(mtry=seq(2,38,by=3))
rf_model <- train(x = train_x, y = train_y, method = "rf", tuneGrid = rfGrid, metric = "Rsquared", importance = TRUE, trControl = ctrl,preProcess=pre_process)
rf_model
## Random Forest
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (47), scaled (47), median imputation (47), remove (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.322484 0.6228670 1.0575703
## 5 1.258284 0.6315350 1.0022734
## 8 1.233013 0.6361845 0.9780715
## 11 1.222241 0.6352190 0.9674696
## 14 1.223318 0.6284183 0.9651774
## 17 1.221142 0.6257127 0.9622421
## 20 1.219153 0.6220413 0.9578579
## 23 1.218108 0.6195639 0.9558484
## 26 1.220456 0.6150509 0.9569360
## 29 1.222664 0.6112294 0.9566539
## 32 1.226227 0.6062553 0.9562464
## 35 1.227096 0.6038911 0.9586372
## 38 1.231679 0.5993550 0.9592835
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 8.
Generalized Boosted Regression
set.seed(300)
gbmGrid <- expand.grid(interaction.depth=seq(1,6,by=1),
n.trees=c(25,50,100,200),
shrinkage=c(0.01,0.05,0.1,0.2),
n.minobsinnode=5)
gb_model <- train(x = train_x, y = train_y,method = "gbm", metric = "Rsquared",verbose = FALSE, tuneGrid = gbmGrid, trControl = ctrl, preProcess=pre_process)
gb_model
## Stochastic Gradient Boosting
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (47), scaled (47), median imputation (47), remove (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.01 1 25 1.750868 0.4436225 1.4232258
## 0.01 1 50 1.655577 0.4723749 1.3428954
## 0.01 1 100 1.526614 0.5115909 1.2361620
## 0.01 1 200 1.387282 0.5448973 1.1180862
## 0.01 2 25 1.720034 0.4909505 1.3971250
## 0.01 2 50 1.604482 0.5129094 1.3008413
## 0.01 2 100 1.457907 0.5388108 1.1768789
## 0.01 2 200 1.320333 0.5676001 1.0537606
## 0.01 3 25 1.709386 0.5048068 1.3898578
## 0.01 3 50 1.586327 0.5187980 1.2868118
## 0.01 3 100 1.431423 0.5429917 1.1560628
## 0.01 3 200 1.296479 0.5721581 1.0305485
## 0.01 4 25 1.698272 0.5251595 1.3803802
## 0.01 4 50 1.567858 0.5429447 1.2741709
## 0.01 4 100 1.405934 0.5618872 1.1358119
## 0.01 4 200 1.280113 0.5807098 1.0165825
## 0.01 5 25 1.697737 0.5119898 1.3801486
## 0.01 5 50 1.566274 0.5345568 1.2723130
## 0.01 5 100 1.402910 0.5593929 1.1322326
## 0.01 5 200 1.273215 0.5830422 1.0078441
## 0.01 6 25 1.691419 0.5274626 1.3740537
## 0.01 6 50 1.558710 0.5410137 1.2656960
## 0.01 6 100 1.395360 0.5600297 1.1245079
## 0.01 6 200 1.268217 0.5822350 1.0007987
## 0.05 1 25 1.481059 0.5057445 1.2009515
## 0.05 1 50 1.350340 0.5444165 1.0838001
## 0.05 1 100 1.261149 0.5746823 0.9951545
## 0.05 1 200 1.230040 0.5852977 0.9564855
## 0.05 2 25 1.411503 0.5359489 1.1366300
## 0.05 2 50 1.292021 0.5710135 1.0217085
## 0.05 2 100 1.227254 0.5908100 0.9521033
## 0.05 2 200 1.190506 0.6086009 0.9128289
## 0.05 3 25 1.382928 0.5433680 1.1121320
## 0.05 3 50 1.271772 0.5715253 1.0053490
## 0.05 3 100 1.217050 0.5927125 0.9388424
## 0.05 3 200 1.193700 0.6039165 0.9100972
## 0.05 4 25 1.358562 0.5515938 1.0875376
## 0.05 4 50 1.256933 0.5753566 0.9839688
## 0.05 4 100 1.212496 0.5933630 0.9287913
## 0.05 4 200 1.190970 0.6053405 0.9028941
## 0.05 5 25 1.361084 0.5458255 1.0890699
## 0.05 5 50 1.257658 0.5758353 0.9827664
## 0.05 5 100 1.208479 0.5974218 0.9290219
## 0.05 5 200 1.183366 0.6117928 0.9004172
## 0.05 6 25 1.356040 0.5487619 1.0884287
## 0.05 6 50 1.251162 0.5766617 0.9819329
## 0.05 6 100 1.204640 0.5983575 0.9283533
## 0.05 6 200 1.184062 0.6100934 0.9039285
## 0.10 1 25 1.329125 0.5602768 1.0666733
## 0.10 1 50 1.256313 0.5813212 0.9910342
## 0.10 1 100 1.226950 0.5886404 0.9545939
## 0.10 1 200 1.213943 0.5924137 0.9324331
## 0.10 2 25 1.304996 0.5482027 1.0350321
## 0.10 2 50 1.249796 0.5703396 0.9734349
## 0.10 2 100 1.221744 0.5846132 0.9379748
## 0.10 2 200 1.207236 0.5928956 0.9181553
## 0.10 3 25 1.284438 0.5543862 1.0162224
## 0.10 3 50 1.242525 0.5726031 0.9663431
## 0.10 3 100 1.220125 0.5849813 0.9376313
## 0.10 3 200 1.205650 0.5946167 0.9221334
## 0.10 4 25 1.274906 0.5615247 1.0019898
## 0.10 4 50 1.230322 0.5782859 0.9477100
## 0.10 4 100 1.207437 0.5900834 0.9173836
## 0.10 4 200 1.203434 0.5924397 0.9086169
## 0.10 5 25 1.264447 0.5689347 0.9940740
## 0.10 5 50 1.220251 0.5885744 0.9449055
## 0.10 5 100 1.200152 0.6022066 0.9195133
## 0.10 5 200 1.192557 0.6068172 0.9103507
## 0.10 6 25 1.276518 0.5551852 0.9991284
## 0.10 6 50 1.229838 0.5791035 0.9501079
## 0.10 6 100 1.209216 0.5899568 0.9265784
## 0.10 6 200 1.200604 0.5946278 0.9161597
## 0.20 1 25 1.290425 0.5441367 1.0113265
## 0.20 1 50 1.268703 0.5519874 0.9848023
## 0.20 1 100 1.260119 0.5592280 0.9735457
## 0.20 1 200 1.258171 0.5621760 0.9641298
## 0.20 2 25 1.284798 0.5440384 1.0076286
## 0.20 2 50 1.273004 0.5531786 0.9822348
## 0.20 2 100 1.261320 0.5619601 0.9670267
## 0.20 2 200 1.260892 0.5642607 0.9614738
## 0.20 3 25 1.278338 0.5464713 0.9887245
## 0.20 3 50 1.248787 0.5660209 0.9528273
## 0.20 3 100 1.233801 0.5762532 0.9380424
## 0.20 3 200 1.232703 0.5772614 0.9354164
## 0.20 4 25 1.264492 0.5534458 0.9796558
## 0.20 4 50 1.241729 0.5640622 0.9516062
## 0.20 4 100 1.233250 0.5715214 0.9428358
## 0.20 4 200 1.233453 0.5718412 0.9417097
## 0.20 5 25 1.240226 0.5667739 0.9559535
## 0.20 5 50 1.228031 0.5756949 0.9381049
## 0.20 5 100 1.226838 0.5778495 0.9343807
## 0.20 5 200 1.227169 0.5781840 0.9329630
## 0.20 6 25 1.263097 0.5566761 0.9682854
## 0.20 6 50 1.251723 0.5640812 0.9549264
## 0.20 6 100 1.250546 0.5666356 0.9511011
## 0.20 6 200 1.250309 0.5673331 0.9484911
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 5
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 200, interaction.depth =
## 5, shrinkage = 0.05 and n.minobsinnode = 5.
Cubist
set.seed(300)
cubistGrid <- expand.grid(committees = c(1, 5, 10, 20, 50, 100),
neighbors = c(0, 1, 3, 5, 7))
cubist_model <- train(x = train_x, y = train_y,method = "cubist",
verbose = FALSE, metric = "Rsquared", tuneGrid = cubistGrid,trControl = ctrl, preProcess=pre_process)
cubist_model
## Cubist
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (47), scaled (47), median imputation (47), remove (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.783484 0.3447336 1.3008054
## 1 1 1.760482 0.3695971 1.2533727
## 1 3 1.749436 0.3673567 1.2596560
## 1 5 1.763659 0.3589877 1.2719841
## 1 7 1.774063 0.3518269 1.2839886
## 5 0 1.408501 0.4751252 1.0681905
## 5 1 1.370121 0.5084215 1.0178053
## 5 3 1.371397 0.5013727 1.0294790
## 5 5 1.384226 0.4938183 1.0376905
## 5 7 1.394223 0.4876719 1.0489456
## 10 0 1.304946 0.5319941 0.9931924
## 10 1 1.258686 0.5671439 0.9394749
## 10 3 1.264749 0.5595739 0.9533675
## 10 5 1.280093 0.5503698 0.9638110
## 10 7 1.290437 0.5436020 0.9749352
## 20 0 1.237286 0.5751213 0.9499917
## 20 1 1.192742 0.6059267 0.9006055
## 20 3 1.199731 0.5995782 0.9129920
## 20 5 1.214937 0.5910349 0.9254097
## 20 7 1.224631 0.5847037 0.9356405
## 50 0 1.185544 0.6091600 0.9103050
## 50 1 1.129865 0.6430393 0.8499082
## 50 3 1.142375 0.6350705 0.8664559
## 50 5 1.161150 0.6245744 0.8811305
## 50 7 1.172664 0.6176112 0.8928533
## 100 0 1.173871 0.6170264 0.8979396
## 100 1 1.116110 0.6513634 0.8336709
## 100 3 1.129138 0.6433284 0.8508466
## 100 5 1.148746 0.6325300 0.8665395
## 100 7 1.160761 0.6253304 0.8792325
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were committees = 100 and neighbors = 1.
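To substantiate the earlier claim that Cubist gives the lowest RMSE, and to obtain test-set numbers like the one quoted for the recursive partitioning model, a comparison along the following lines could be run; it assumes the four trained caret objects above are still in the workspace, and predict() on a train object re-applies the preprocessing (including the median imputation) to the test set.
# Compare resampling distributions across the four tree-based models
model_list <- list(rpart = rp_model, rf = rf_model, gbm = gb_model, cubist = cubist_model)
summary(resamples(model_list))
# Test-set performance (RMSE, Rsquared, MAE) for each model
sapply(model_list, function(m) postResample(predict(m, newdata = test_x), test_y))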
cubist_imp <- varImp(cubist_model, scale = FALSE)
plot(cubist_imp, top=15, scales = list(y = list(cex = 0.8)))
From the Cubist model, which was selected as the optimal model, ManufacturingProcess32 is the most important predictor, and several other manufacturing process variables appear near the top of the plot. Compared with the variable importance from the optimal linear and nonlinear models, ManufacturingProcess32 is still within the top five most important predictors.
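As a quick check on whether process or biological variables dominate, the short sketch below counts each type among the top 15 Cubist importances (the grepl pattern assumes the standard ManufacturingProcess/BiologicalMaterial column naming in this data set).
# Count process vs. biological variables among the top 15 predictors
top15 <- rownames(cubist_imp$importance)[order(cubist_imp$importance$Overall,
                                               decreasing = TRUE)][1:15]
table(ifelse(grepl("^Manufacturing", top15), "Process", "Biological"))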
Here using the "partykit" library
plot(as.party(rp_model$finalModel),gp=gpar(fontsize=10))
From the plot, ManufacturingProcess32 is the root node, and values above its split point lead to an increased predicted yield.
Manufacturing processes 32 and 13 are at the top of the tree, with higher values of process 32 associated with larger yields and lower values with smaller yields. However, a lower value of process 32 may be counteracted by a correspondingly lower value of manufacturing process 13.
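As a simple sanity check on this reading, the sketch below plots yield against ManufacturingProcess32 in the training data; a positive trend would support the interpretation of the root split.
# Visual check: yield vs. the root-split predictor in the training data
plot(train_x$ManufacturingProcess32, train_y,
     xlab = "ManufacturingProcess32", ylab = "Yield")
abline(lm(train_y ~ train_x$ManufacturingProcess32), col = "red")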