CUNY DATA624 HW9
- 8.1 Recreate the simulated data from Exercise 7.2:
- a) Fit a random forest model to all of the predictors, then estimate the variable importance scores. Did the random forest model significantly use the uninformative predictors (V6 – V10)?
- b) Now add an additional predictor that is highly correlated with one of the informative predictors. Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
- c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
- d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
- 8.2 Use a simulation to show tree bias with different granularities.
- 8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
- a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
- b) Which model do you think would be more predictive of other samples?
- c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
- 8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
- a) Which tree-based regression model gives the optimal resampling and test set performance?
- b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
- c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
8.1 Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
a) Fit a random forest model to all of the predictors, then estimate the variable importance scores. Did the random forest model significantly use the uninformative predictors (V6 – V10)?
library(randomForest)
library(caret)
rf_simimp <- function(Data){
set.seed(200)
mod <- randomForest(y ~ ., data = Data,
importance = TRUE,
ntree = 1000)
rfImp <- varImp(mod, scale = FALSE)
return(rfImp)
}
knitr::kable(rf_simimp(simulated))
|    | Overall |
|---|---|
| V1 | 8.6887988 |
| V2 | 6.4748856 |
| V3 | 0.7180118 |
| V4 | 7.7379189 |
| V5 | 2.2867714 |
| V6 | 0.1857716 |
| V7 | 0.0337107 |
| V8 | -0.0653486 |
| V9 | -0.1012404 |
| V10 | -0.1005368 |
The random forest model did not significantly use the uninformative predictors: the importance scores for V6-V10 are all near zero, and some are slightly negative.
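b) Now add an additional predictor that is highly correlated with one of the informative predictors. Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
This part isn't worked out above; below is a minimal sketch of how it could be done, following the approach suggested in the text. The copy simulated2 and the column names duplicate1 and duplicate2 are assumptions (a copy is used so the later models still see only the original ten predictors). In general, a random forest splits importance between highly correlated predictors, so V1's score would be expected to drop with each near-duplicate added.
# work on a copy so the original simulated data stay unchanged
simulated2 <- simulated
# add a predictor highly correlated with V1, then refit the forest
simulated2$duplicate1 <- simulated2$V1 + rnorm(200) * 0.1
cor(simulated2$duplicate1, simulated2$V1)
knitr::kable(rf_simimp(simulated2))
# add a second predictor that is also highly correlated with V1 and refit again
simulated2$duplicate2 <- simulated2$V1 + rnorm(200) * 0.1
knitr::kable(rf_simimp(simulated2))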
c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
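The code that produced the importance scores below isn't shown; here is a sketch of how the conditional inference forest could have been fit (the object name cforest_sim and ntree = 1000 are assumptions):
library(party)
set.seed(200)
cforest_sim <- cforest(y ~ ., data = simulated,
controls = cforest_unbiased(ntree = 1000))
# traditional (unconditional) permutation importance
varimp(cforest_sim)
# conditional importance as described in Strobl et al. (2007); slower to compute
# varimp(cforest_sim, conditional = TRUE)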
## V1 V2 V3 V4 V5 V6
## 8.889688266 6.475678898 0.048199904 8.374714465 1.911482592 -0.026382693
## V7 V8 V9 V10
## 0.003470791 -0.020945307 -0.040833731 -0.017820271
From the two sets of importance scores above, the conditional inference forest shows essentially the same pattern as the traditional random forest: V1, V2, and V4 dominate, V6-V10 contribute essentially nothing, and V3 again receives little importance.
d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Boosted Trees
library(gbm)
set.seed(200)
gbmsim <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmsim)
## var rel.inf
## V4 V4 30.6216330
## V1 V1 26.7533742
## V2 V2 22.2563998
## V5 V5 12.5198810
## V3 V3 7.3475220
## V6 V6 0.3458915
## V8 V8 0.1552985
## V7 V7 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
For the boosted tree model, the same pattern does not quite occur. V6-V10 remain essentially unused and V3 is still the least important of V1-V5, but the order of importance has changed: V4, rather than V1, is the most influential predictor.
Cubist
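The Cubist fit itself isn't shown; a sketch of how the usage table below could be produced (the object name cubist_sim is an assumption; the usage component of a Cubist fit reports how often each predictor appears in rule conditions and in the linear models):
library(Cubist)
set.seed(200)
cubist_sim <- cubist(x = simulated[, -ncol(simulated)], y = simulated$y)
cubist_sim$usage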
## Conditions Model Variable
## 1 0 100 V1
## 2 0 100 V2
## 3 0 100 V4
## 4 0 100 V5
## 5 0 0 V3
## 6 0 0 V6
## 7 0 0 V7
## 8 0 0 V8
## 9 0 0 V9
## 10 0 0 V10
The Cubist model above also makes no use of V6-V10, but unlike the other tree models it drops V3 entirely. It also shows no differentiation among V1, V2, V4, and V5, all of which appear in 100% of the model equations.
8.2 Use a simulation to show tree bias with different granularities.
We'll create a simulated data set with a low-granularity predictor (X1, which takes only two distinct values) and a high-granularity continuous predictor (X2). The response will be generated from X1, so it is correlated with X1 but has no true association with X2.
library(rpart)
set.seed(200)
X1 <- rep(1:2, each = 100)
X2 <- rnorm(200, mean=0,sd=2)
Y <- X1 + rnorm(200,mean=0,sd=4)
sim <- data.frame(Y=Y, X1=X1, X2=X2)
sim_tree <- rpart(Y~., data = sim)
varImp(sim_tree)
## Overall
## X1 0.1390440
## X2 0.4393341
The fitted tree assigns more importance to X2 than to X1, even though only X1 is related to the response. Because X2 has many more distinct values, it offers many more candidate split points and is therefore more likely to be selected. This illustrates the selection bias of regression trees toward predictors with higher granularity.
8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Figure 8.24
a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The right-hand plot has both the bagging fraction and the learning rate set high (0.9), whereas the left-hand plot has both set to 0.1. With a bagging fraction of 0.9, nearly all of the training data is used for every tree, so there is little randomness from tree to tree; with a learning rate of 0.9, each tree's contribution is barely shrunk, so the first few trees of this greedy procedure (and the predictors they split on) dominate the final model. Both effects concentrate importance on a handful of predictors and make the model prone to overfitting, while the 0.1/0.1 model takes many small, varied steps and spreads importance across more predictors.
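To see this directly, one could refit boosted trees on the solubility data with the two extreme settings and compare the relative influence values; a sketch, where the object names and n.trees = 100 are assumptions:
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)  # provides solTrainXtrans and solTrainY
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
set.seed(100)
gbm_left <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbm_right <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
# the 0.9/0.9 fit should concentrate relative influence on far fewer predictors
head(summary(gbm_left, plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)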
b) Which model do you think would be more predictive of other samples?
Because the bagging fraction and learning rate for the model on the right are set very high, it is likely overfit, so the model on the left would probably be more predictive of other samples.
c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
As interaction depth increases, each tree can use more predictors, so importance is spread across a larger set of predictors and the slope of the predictor-importance plot flattens for both models. In the lollipop plots above, the lines for the lower-ranked predictors would lengthen relative to the top few.
8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
We'll reuse the imputation, data-splitting, and pre-processing code from the previous two assignments:
library(AppliedPredictiveModeling)
library(VIM)
data(ChemicalManufacturingProcess)
cmpImp <- kNN(ChemicalManufacturingProcess, imp_var = FALSE) # kNN imputation
cmp_non_pred <- nearZeroVar(cmpImp) # identify near-zero variance predictors
cmp <- cmpImp[,-cmp_non_pred]
set.seed(8)
trainrows <- createDataPartition(cmp$Yield,
p=0.8,
list=FALSE)
cmp_train <- cmp[trainrows,]
cmp_test <- cmp[-trainrows,]
X_train <- cmp_train[,-1]
Y_train <- cmp_train[,1]
X_test <- cmp_test[,-1]
Y_test <- cmp_test[,1]
# cross validation
ctrl = trainControl(method='cv', number = 10)
Single Trees (CART)
set.seed(8)
cmp_cart <- train(X_train, Y_train,
method = "rpart2",
tuneLength = 10,
trControl = ctrl)
cmp_cart
## CART
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 132, 129, 128, 131, 131, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 1.458115 0.4068974 1.163036
## 2 1.387326 0.4628283 1.087864
## 3 1.422046 0.4422614 1.120483
## 4 1.409582 0.4770331 1.108151
## 5 1.376247 0.5025622 1.085459
## 6 1.427623 0.4695895 1.092411
## 7 1.421760 0.4713292 1.089925
## 8 1.394941 0.5017504 1.066067
## 9 1.392049 0.5063458 1.057686
## 10 1.390395 0.5082862 1.059367
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 5.
## maxdepth RMSE Rsquared MAE
## 6 6 1.427623 0.4695895 1.092411
## RMSE Rsquared MAE
## 1.4198787 0.4381377 0.9752138
Bagged Trees
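The code for this model isn't shown; a sketch assuming ipred::bagging was used (the object names cmp_bag and cmp_bag_pred are assumptions):
library(ipred)
set.seed(8)
# bag 25 regression trees on the training set (Yield is the response)
cmp_bag <- bagging(Yield ~ ., data = cmp_train, nbagg = 25)
cmp_bag
cmp_bag_pred <- predict(cmp_bag, newdata = X_test)
postResample(obs = Y_test, pred = cmp_bag_pred)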
##
## Bagging regression trees with 25 bootstrap replications
## RMSE Rsquared MAE
## 1.2073504 0.5507839 0.9404803
Random Forest
set.seed(8)
cmp_rf <- train(X_train, Y_train,
method = "rf",
tuneLength = 10,
trControl = ctrl)
cmp_rf
## Random Forest
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 132, 129, 128, 131, 131, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.224664 0.6357345 0.9865845
## 8 1.114779 0.6731384 0.8851360
## 14 1.100216 0.6716059 0.8625890
## 20 1.096407 0.6666550 0.8511977
## 26 1.096274 0.6627013 0.8497513
## 32 1.089063 0.6632595 0.8468469
## 38 1.097846 0.6569936 0.8453105
## 44 1.108649 0.6490903 0.8574726
## 50 1.100782 0.6563632 0.8499740
## 56 1.098181 0.6560735 0.8500226
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 32.
## mtry RMSE Rsquared MAE
## 5 26 1.096274 0.6627013 0.8497513
## RMSE Rsquared MAE
## 1.1319717 0.6178381 0.8720294
Boosted Trees
gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
.n.trees = seq(150, 1000, by = 50),
.shrinkage = c(0.01, 0.1),
.n.minobsinnode=10)
set.seed(8)
cmp_boost <- train(X_train, Y_train,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 72 0.01 7 10 1000 1.241094 0.5751436
cmp_boost_pred <- predict(cmp_boost, newdata = X_test)
postResample(obs = Y_test, pred = cmp_boost_pred)
## RMSE Rsquared MAE
## 1.0350636 0.6732565 0.7834616
Cubist
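The fitting code isn't shown. The bootstrapped resampling and default committees/neighbors grid in the output below are what caret's "cubist" method uses when no trainControl or tuning grid is supplied, so a sketch might look like this (the object names cmp_cubist and cmp_cubist_pred are assumptions):
set.seed(8)
cmp_cubist <- train(X_train, Y_train,
method = "cubist")
cmp_cubist
cmp_cubist_pred <- predict(cmp_cubist, newdata = X_test)
postResample(obs = Y_test, pred = cmp_cubist_pred)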
## Cubist
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.872870 0.3353305 1.3371172
## 1 5 1.843279 0.3536104 1.3090912
## 1 9 1.857232 0.3446822 1.3231901
## 10 0 1.312002 0.5307383 1.0047465
## 10 5 1.279140 0.5541529 0.9671315
## 10 9 1.295849 0.5430002 0.9848205
## 20 0 1.251475 0.5681472 0.9649637
## 20 5 1.219542 0.5906075 0.9276127
## 20 9 1.236099 0.5801048 0.9464027
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
## committees neighbors RMSE Rsquared MAE
## 8 20 5 1.219542 0.5906075 0.9276127
## RMSE Rsquared MAE
## 0.8128278 0.8230894 0.6829123
a) Which tree-based regression model gives the optimal resampling and test set performance?
The Cubist model is the best-performing model. Its resampling performance metrics at the optimal tuning parameters:
## committees neighbors RMSE Rsquared MAE
## 8 20 5 1.219542 0.5906075 0.9276127
and the performance metrics for the test set:
## RMSE Rsquared MAE
## 0.8128278 0.8230894 0.6829123
b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
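The importance scores below appear to come from caret's varImp() applied to the Cubist fit; a one-line sketch (cmp_cubist is the assumed object name from above):
varImp(cmp_cubist)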
## cubist variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 60.87
## ManufacturingProcess04 56.52
## ManufacturingProcess17 45.65
## ManufacturingProcess39 44.57
## ManufacturingProcess27 44.57
## ManufacturingProcess13 44.57
## ManufacturingProcess33 40.22
## ManufacturingProcess29 39.13
## BiologicalMaterial03 39.13
## ManufacturingProcess09 38.04
## BiologicalMaterial02 29.35
## ManufacturingProcess30 22.83
## ManufacturingProcess02 17.39
## BiologicalMaterial08 14.13
## ManufacturingProcess19 14.13
## ManufacturingProcess20 14.13
## BiologicalMaterial11 13.04
## ManufacturingProcess18 13.04
## ManufacturingProcess31 13.04
The manufacturing process variables dominate the top 20 important predictors. ManufacturingProcess32, ManufacturingProcess17, and ManufacturingProcess09 appear to be consistently important across all of the optimal models (linear, nonlinear, and tree-based). The Cubist model appears to emphasize other manufacturing processes more than the optimal linear (lasso) and nonlinear (SVM) models did. More specifics can be accessed here:
add link to homework 7 add link to homework 8
c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(partykit)
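# fit a single regression tree on the training data and convert it to a party
# object so the distribution of yield in each terminal node can be plotted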
rpartTree <- rpart(Y_train ~., data = X_train)
rpartTree2 <- as.party(rpartTree)
plot(rpartTree2)