CUNY DATA624 HW9
- 8.1 Recreate the simulated data from Exercise 7.2:
- a) Fit a random forest model to all of the predictors, then estimate the variable importance scores. Did the random forest model significantly use the uninformative predictors (V6 – V10)?
- b) Now add an additional predictor that is highly correlated with one of the informative predictors. Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
- c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
- d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
- 8.2 Use a simulation to show tree bias with different granularities.
- 8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
- a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
- b) Which model do you think would be more predictive of other samples?
- c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
- 8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
- a) Which tree-based regression model gives the optimal resampling and test set performance?
- b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
- c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
8.1 Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
a) Fit a random forest model to all of the predictors, then estimate the variable importance scores. Did the random forest model significantly use the uninformative predictors (V6 – V10)?
library(randomForest)
library(caret)
rf_simimp <- function(Data){
set.seed(200)
mod <- randomForest(y ~ ., data = Data,
importance = TRUE,
ntree = 1000)
rfImp <- varImp(mod, scale = FALSE)
return(rfImp)
}
knitr::kable(rf_simimp(simulated))
|    | Overall |
|---|---|
| V1 | 8.6887988 |
| V2 | 6.4748856 |
| V3 | 0.7180118 |
| V4 | 7.7379189 |
| V5 | 2.2867714 |
| V6 | 0.1857716 |
| V7 | 0.0337107 |
| V8 | -0.0653486 |
| V9 | -0.1012404 |
| V10 | -0.1005368 |
The random forest model did not significantly use the uninformative predictors: the importance scores for V6-V10 are all near zero, and some are slightly negative.
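b) Now add an additional predictor that is highly correlated with one of the informative predictors. Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
This part isn't worked out above; below is a minimal sketch of how it could be done, following the approach suggested in the text. The copy simulated2 and the column names duplicate1 and duplicate2 are assumptions (a copy is used so the later models still see only the original ten predictors). In general, a random forest splits importance between highly correlated predictors, so V1's score would be expected to drop with each near-duplicate added.
# work on a copy so the original simulated data stay unchanged
simulated2 <- simulated
# add a predictor highly correlated with V1, then refit the forest
simulated2$duplicate1 <- simulated2$V1 + rnorm(200) * 0.1
cor(simulated2$duplicate1, simulated2$V1)
knitr::kable(rf_simimp(simulated2))
# add a second predictor that is also highly correlated with V1 and refit again
simulated2$duplicate2 <- simulated2$V1 + rnorm(200) * 0.1
knitr::kable(rf_simimp(simulated2))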
c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
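The code that produced the importance scores below isn't shown; here is a sketch of how the conditional inference forest could have been fit (the object name cforest_sim and ntree = 1000 are assumptions):
library(party)
set.seed(200)
cforest_sim <- cforest(y ~ ., data = simulated,
controls = cforest_unbiased(ntree = 1000))
# traditional (unconditional) permutation importance
varimp(cforest_sim)
# conditional importance as described in Strobl et al. (2007); slower to compute
# varimp(cforest_sim, conditional = TRUE)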
## V1 V2 V3 V4 V5 V6
## 8.889688266 6.475678898 0.048199904 8.374714465 1.911482592 -0.026382693
## V7 V8 V9 V10
## 0.003470791 -0.020945307 -0.040833731 -0.017820271
From the two sets of importance scores above, the conditional inference forest shows essentially the same pattern as the traditional random forest: V1, V2, and V4 dominate, V6-V10 contribute essentially nothing, and V3 again receives little importance.
d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Boosted Trees
library(gbm)
set.seed(200)
gbmsim <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmsim)
## var rel.inf
## V4 V4 30.6216330
## V1 V1 26.7533742
## V2 V2 22.2563998
## V5 V5 12.5198810
## V3 V3 7.3475220
## V6 V6 0.3458915
## V8 V8 0.1552985
## V7 V7 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
For the boosted tree model, the same pattern does not quite occur. V6-V10 remain essentially unused and V3 is still the least important of V1-V5, but the order of importance has changed: V4, rather than V1, is the most influential predictor.
Cubist
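The Cubist fit itself isn't shown; a sketch of how the usage table below could be produced (the object name cubist_sim is an assumption; the usage component of a Cubist fit reports how often each predictor appears in rule conditions and in the linear models):
library(Cubist)
set.seed(200)
cubist_sim <- cubist(x = simulated[, -ncol(simulated)], y = simulated$y)
cubist_sim$usage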
## Conditions Model Variable
## 1 0 100 V1
## 2 0 100 V2
## 3 0 100 V4
## 4 0 100 V5
## 5 0 0 V3
## 6 0 0 V6
## 7 0 0 V7
## 8 0 0 V8
## 9 0 0 V9
## 10 0 0 V10
The Cubist model above also makes no use of V6-V10, but unlike the other tree models it drops V3 entirely. It also shows no differentiation among V1, V2, V4, and V5, all of which appear in 100% of the model equations.
8.2 Use a simulation to show tree bias with different granularities.
We'll create a simulated data set with a low-granularity predictor (X1, which takes only two distinct values) and a high-granularity continuous predictor (X2). The response will be generated from X1, so it is correlated with X1 but has no true association with X2.
library(rpart)
set.seed(200)
X1 <- rep(1:2, each = 100)
X2 <- rnorm(200, mean=0,sd=2)
Y <- X1 + rnorm(200,mean=0,sd=4)
sim <- data.frame(Y=Y, X1=X1, X2=X2)
sim_tree <- rpart(Y~., data = sim)
varImp(sim_tree)
## Overall
## X1 0.1390440
## X2 0.4393341
The fitted tree assigns more importance to X2 than to X1, even though only X1 is related to the response. Because X2 has many more distinct values, it offers many more candidate split points and is therefore more likely to be selected. This illustrates the selection bias of regression trees toward predictors with higher granularity.
8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Figure 8.24
a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The right-hand plot has both the bagging fraction and the learning rate set high (0.9), whereas the left-hand plot has both set to 0.1. With a bagging fraction of 0.9, nearly all of the training data is used for every tree, so there is little randomness from tree to tree; with a learning rate of 0.9, each tree's contribution is barely shrunk, so the first few trees of this greedy procedure (and the predictors they split on) dominate the final model. Both effects concentrate importance on a handful of predictors and make the model prone to overfitting, while the 0.1/0.1 model takes many small, varied steps and spreads importance across more predictors.
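To see this directly, one could refit boosted trees on the solubility data with the two extreme settings and compare the relative influence values; a sketch, where the object names and n.trees = 100 are assumptions:
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)  # provides solTrainXtrans and solTrainY
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
set.seed(100)
gbm_left <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbm_right <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
# the 0.9/0.9 fit should concentrate relative influence on far fewer predictors
head(summary(gbm_left, plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)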
b) Which model do you think would be more predictive of other samples?
Because the bagging fraction and learning rate for the model on the right are set very high, it is likely overfit, so the model on the left would probably be more predictive of other samples.
c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
As interaction depth increases, each tree can use more predictors, so importance is spread across a larger set of predictors and the slope of the predictor-importance plot flattens for both models. In the lollipop plots above, the lines for the lower-ranked predictors would lengthen relative to the top few.
8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
We'll reuse the imputation, data-splitting, and pre-processing code from the previous two assignments:
library(AppliedPredictiveModeling)
library(VIM)
data(ChemicalManufacturingProcess)
cmpImp <- kNN(ChemicalManufacturingProcess, imp_var = FALSE) # kNN imputation
cmp_non_pred <- nearZeroVar(cmpImp) # identify near-zero variance predictors
cmp <- cmpImp[,-cmp_non_pred]
set.seed(8)
trainrows <- createDataPartition(cmp$Yield,
p=0.8,
list=FALSE)
cmp_train <- cmp[trainrows,]
cmp_test <- cmp[-trainrows,]
X_train <- cmp_train[,-1]
Y_train <- cmp_train[,1]
X_test <- cmp_test[,-1]
Y_test <- cmp_test[,1]
# cross validation
ctrl = trainControl(method='cv', number = 10)
Single Trees (CART)
set.seed(8)
cmp_cart <- train(X_train, Y_train,
method = "rpart2",
tuneLength = 10,
trControl = ctrl)
cmp_cart
## CART
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 132, 129, 128, 131, 131, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 1.458115 0.4068974 1.163036
## 2 1.387326 0.4628283 1.087864
## 3 1.422046 0.4422614 1.120483
## 4 1.409582 0.4770331 1.108151
## 5 1.376247 0.5025622 1.085459
## 6 1.427623 0.4695895 1.092411
## 7 1.421760 0.4713292 1.089925
## 8 1.394941 0.5017504 1.066067
## 9 1.392049 0.5063458 1.057686
## 10 1.390395 0.5082862 1.059367
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 5.
## maxdepth RMSE Rsquared MAE
## 6 6 1.427623 0.4695895 1.092411
## RMSE Rsquared MAE
## 1.4198787 0.4381377 0.9752138
Bagged Trees
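The code for this model isn't shown; a sketch assuming ipred::bagging was used (the object names cmp_bag and cmp_bag_pred are assumptions):
library(ipred)
set.seed(8)
# bag 25 regression trees on the training set (Yield is the response)
cmp_bag <- bagging(Yield ~ ., data = cmp_train, nbagg = 25)
cmp_bag
cmp_bag_pred <- predict(cmp_bag, newdata = X_test)
postResample(obs = Y_test, pred = cmp_bag_pred)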
##
## Bagging regression trees with 25 bootstrap replications
## RMSE Rsquared MAE
## 1.2073504 0.5507839 0.9404803
Random Forest
set.seed(8)
cmp_rf <- train(X_train, Y_train,
method = "rf",
tuneLength = 10,
trControl = ctrl)
cmp_rf
## Random Forest
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 132, 129, 128, 131, 131, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.224664 0.6357345 0.9865845
## 8 1.114779 0.6731384 0.8851360
## 14 1.100216 0.6716059 0.8625890
## 20 1.096407 0.6666550 0.8511977
## 26 1.096274 0.6627013 0.8497513
## 32 1.089063 0.6632595 0.8468469
## 38 1.097846 0.6569936 0.8453105
## 44 1.108649 0.6490903 0.8574726
## 50 1.100782 0.6563632 0.8499740
## 56 1.098181 0.6560735 0.8500226
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 32.
## mtry RMSE Rsquared MAE
## 5 26 1.096274 0.6627013 0.8497513
## RMSE Rsquared MAE
## 1.1319717 0.6178381 0.8720294
Boosted Trees
gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
.n.trees = seq(150, 1000, by = 50),
.shrinkage = c(0.01, 0.1),
.n.minobsinnode=10)
set.seed(8)
cmp_boost <- train(X_train, Y_train,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 72 0.01 7 10 1000 1.241094 0.5751436
cmp_boost_pred <- predict(cmp_boost, newdata = X_test)
postResample(obs = Y_test, pred = cmp_boost_pred)
## RMSE Rsquared MAE
## 1.0350636 0.6732565 0.7834616
Cubist
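The fitting code isn't shown. The bootstrapped resampling and default committees/neighbors grid in the output below are what caret's "cubist" method uses when no trainControl or tuning grid is supplied, so a sketch might look like this (the object names cmp_cubist and cmp_cubist_pred are assumptions):
set.seed(8)
cmp_cubist <- train(X_train, Y_train,
method = "cubist")
cmp_cubist
cmp_cubist_pred <- predict(cmp_cubist, newdata = X_test)
postResample(obs = Y_test, pred = cmp_cubist_pred)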
## Cubist
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.872870 0.3353305 1.3371172
## 1 5 1.843279 0.3536104 1.3090912
## 1 9 1.857232 0.3446822 1.3231901
## 10 0 1.312002 0.5307383 1.0047465
## 10 5 1.279140 0.5541529 0.9671315
## 10 9 1.295849 0.5430002 0.9848205
## 20 0 1.251475 0.5681472 0.9649637
## 20 5 1.219542 0.5906075 0.9276127
## 20 9 1.236099 0.5801048 0.9464027
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
## committees neighbors RMSE Rsquared MAE
## 8 20 5 1.219542 0.5906075 0.9276127
## RMSE Rsquared MAE
## 0.8128278 0.8230894 0.6829123
a) Which tree-based regression model gives the optimal resampling and test set performance?
The Cubist model is the best-performing model. Its resampling performance metrics at the optimal tuning parameters:
## committees neighbors RMSE Rsquared MAE
## 8 20 5 1.219542 0.5906075 0.9276127
and the performance metrics for the test set:
## RMSE Rsquared MAE
## 0.8128278 0.8230894 0.6829123
b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
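The importance scores below appear to come from caret's varImp() applied to the Cubist fit; a one-line sketch (cmp_cubist is the assumed object name from above):
varImp(cmp_cubist)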
## cubist variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 60.87
## ManufacturingProcess04 56.52
## ManufacturingProcess17 45.65
## ManufacturingProcess39 44.57
## ManufacturingProcess27 44.57
## ManufacturingProcess13 44.57
## ManufacturingProcess33 40.22
## ManufacturingProcess29 39.13
## BiologicalMaterial03 39.13
## ManufacturingProcess09 38.04
## BiologicalMaterial02 29.35
## ManufacturingProcess30 22.83
## ManufacturingProcess02 17.39
## BiologicalMaterial08 14.13
## ManufacturingProcess19 14.13
## ManufacturingProcess20 14.13
## BiologicalMaterial11 13.04
## ManufacturingProcess18 13.04
## ManufacturingProcess31 13.04
The manufacturing process variables dominate the top 20 important predictors. ManufacturingProcess32, ManufacturingProcess17, and ManufacturingProcess09 appear to be consistently important across all of the optimal models (linear, nonlinear, and tree-based). The Cubist model appears to emphasize other manufacturing processes more than the optimal linear (lasso) and nonlinear (SVM) models did. More specifics can be accessed here:
add link to homework 7 add link to homework 8
c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(partykit)
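# fit a single regression tree on the training data and convert it to a party
# object so the distribution of yield in each terminal node can be plotted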
rpartTree <- rpart(Y_train ~., data = X_train)
rpartTree2 <- as.party(rpartTree)
plot(rpartTree2)