Data 624 HW9

11/21/2021

Gabe Abreu

8.1

Recreate the simulated data from Exercise 7.2:

suppressMessages(library(mlbench))
suppressMessages(library(randomForest))
suppressMessages(library(caret))

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
  1. Fit a random forest model to all of the predictors, then estimate the variable importance scores:

model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Based on the importance scores shown above, V6-V10 are not really used by the random forest algorithm: V6 is only slightly above zero and the rest of the values are negative.

  2. Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

When another predictor that is highly correlated with V1 is added, the overall importance of the original predictor is reduced: duplicate1 cuts the importance of V1 almost in half.

Adding another predictor correlated with V1:

simulated$dup3 <- simulated$V1 + rnorm(200) * .1

model3 <- randomForest(y ~., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
##                Overall
## V1          4.91687329
## V2          6.52816504
## V3          0.58711552
## V4          7.04870917
## V5          2.03115561
## V6          0.14213148
## V7          0.10991985
## V8         -0.08405687
## V9         -0.01075028
## V10         0.09230576
## duplicate1  3.80068234
## dup3        1.87721959

Adding correlated predictors dilutes the importance scores. In this scenario V1 decreases in importance again, and so does duplicate1; dup3 is not as important as either of the two earlier variables.
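
A quick check confirms that the three copies are highly intercorrelated:

# Pairwise correlations among V1 and its two noisy copies
cor(simulated[, c("V1", "duplicate1", "dup3")])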

  3. Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
suppressMessages(library(party))
cf_model <- cforest(y ~ ., data = simulated)
varimp(cf_model, conditional = TRUE)
##           V1           V2           V3           V4           V5           V6 
##  1.634564971  4.686043840  0.009352431  5.653333661  1.001318061  0.019234933 
##           V7           V8           V9          V10   duplicate1         dup3 
## -0.036099766 -0.015387282  0.002306261 -0.022986160  1.825986440  0.279429060

The same pattern is apparent with the cforest function: the same variables remain important to the model, although the conditional measure shrinks the scores of the correlated predictors (V1, duplicate1, dup3) relative to V2 and V4.
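
For comparison, the traditional (unconditional) permutation importance can be computed from the same fit by toggling the argument:

# Traditional (unconditional) permutation importance for the same cforest fit
varimp(cf_model, conditional = FALSE)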

  4. Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted Regression Trees

suppressMessages(library(gbm))
# The caret tuning approach below kept failing with
# "Error in relative.influence(object, n.trees = numTrees) :
#  could not find function 'relative.influence'",
# so it is left commented out and a plain gbm() fit is used instead.
#
# simulated1 <- simulated %>% dplyr::select(-duplicate1, -dup3)
#
# gbmGrid <- expand.grid(
#          interaction.depth = seq(1, 7, by = 2),
#          n.trees = seq(100, 1000, by = 50),
#          shrinkage = c(0.01, 0.1),
#          n.minobsinnode = 10
#          )
#
# gbm_model <- train(y ~ .,
#                    data = simulated1,
#                    method = "gbm",
#                    tuneGrid = gbmGrid,
#                    verbose = FALSE)
#
# gbm_imp <- varImp(gbm_model)
#
# gbm_imp

gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")

summary(gbmModel)

##                   var    rel.inf
## V4                 V4 29.3374776
## V2                 V2 23.6493221
## V1                 V1 15.3478582
## duplicate1 duplicate1 13.0384824
## V5                 V5 10.3514098
## V3                 V3  7.6565238
## V6                 V6  0.6189262
## V7                 V7  0.0000000
## V8                 V8  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000
## dup3             dup3  0.0000000

Cubist

# create a data set containing only the predictors
suppressMessages(library(Cubist))

simulatedP <- subset(simulated, select = c(-y))

cModel <- cubist(x = simulatedP,
                 y = simulated$y)
cubist_imp <- varImp(cModel)
cubist_imp
##            Overall
## V1              50
## V2              50
## V4              50
## V5              50
## duplicate1      50
## V3               0
## V6               0
## V7               0
## V8               0
## V9               0
## V10              0
## dup3             0

The same variables are flagged as important, but the scoring scale is quite different. I suspect that if the duplicate1 variable were removed, the V1 predictor might receive a higher score.
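
One way to check that suspicion is to refit Cubist without the two correlated copies of V1 (a quick sketch; output not shown):

# Refit Cubist after dropping the correlated copies of V1
simulatedP2 <- subset(simulated, select = -c(y, duplicate1, dup3))

cModel2 <- cubist(x = simulatedP2, y = simulated$y)
varImp(cModel2)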

8.2

Use a simulation to show tree bias with different granularities.

X1 takes only 2 distinct values, while X2 is continuous with 200 distinct random values.

set.seed(150)

X1 <- rep(1:2, each=100)

Y <- X1 + rnorm(200, mean=0, sd=4)

set.seed(100)

X2 <- rnorm(200, mean= 0, sd=2)


sim <- data.frame(Y=Y, X1=X1, X2=X2)
rpSim <- randomForest(Y~., data=sim)
varImp(rpSim)
##      Overall
## X1  98.17826
## X2 466.41040

The random forest relied heavily on X2 over X1, even though X1 is the only predictor actually related to Y. The difference in the number of possible split points clearly skewed the algorithm: more granularity in a predictor leads to higher importance.
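
The same selection bias can be checked with a single CART tree fit to the same data (a brief sketch, assuming rpart is available):

suppressMessages(library(rpart))

# A single regression tree on the same simulated data; check whether the
# more granular X2 again dominates the importance scores
rpSingle <- rpart(Y ~ ., data = sim)
rpSingle$variable.importance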

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

  1. Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The model on the right concentrates importance on a few predictors because a learning rate close to 1 makes the model greedier: each tree contributes a large fraction of the fit, so the same strong predictors are selected over and over. Similarly, as the bagging fraction increases, each tree sees most of the data, so the trees are more alike and keep splitting on the same dominant predictors. The larger the learning rate and bagging fraction, the fewer predictors are identified as important.
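
To see this on the solubility data, a minimal sketch might compare the two extremes directly (assuming the solubility objects from AppliedPredictiveModeling; the tree count and other settings here are my own simplifications, not those used for Figure 8.24):

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # provides solTrainXtrans and solTrainY

solTrain <- cbind(solTrainXtrans, y = solTrainY)

# Both tuning parameters at 0.1 (left-hand plot) vs. 0.9 (right-hand plot)
gbm_low  <- gbm(y ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(y ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

# Compare how concentrated the relative influence is in each fit
head(summary(gbm_low,  plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)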

  2. Which model do you think would be more predictive of other samples?

Since the model on the right concentrates on a limited number of predictors and is more prone to overfitting, the model on the left would likely be more predictive of other samples.

  3. How would increasing interaction depth affect the slope of predictor importance for either model in Figure 8.24?

Interaction depth does affect variable importance. As the tree depth increases, each tree can split on more predictors, so importance spreads over more of them and the slope of the predictor-importance plot decreases (flattens) for both models.
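
A similar sketch isolates the effect of interaction depth alone (again assuming the solubility data; the settings are placeholders):

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)

solTrain <- cbind(solTrainXtrans, y = solTrainY)

# Shallow stumps vs. deeper trees, other settings left at their defaults
gbm_shallow <- gbm(y ~ ., data = solTrain, distribution = "gaussian",
                   n.trees = 100, interaction.depth = 1)
gbm_deep    <- gbm(y ~ ., data = solTrain, distribution = "gaussian",
                   n.trees = 100, interaction.depth = 7)

# Deeper trees should spread relative influence over more predictors,
# flattening the sorted importance curve
head(summary(gbm_shallow, plotit = FALSE), 15)
head(summary(gbm_deep,    plotit = FALSE), 15)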

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

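The imputation, splitting, and pre-processing chunk is not echoed in this output. A minimal sketch of what it might look like is below (kNN imputation, a 70/30 split, and centering/scaling are my assumptions; the objects training2, test2, training_yield, and test_yield used later come from this kind of chunk):

suppressMessages(library(AppliedPredictiveModeling))
suppressMessages(library(caret))

data(ChemicalManufacturingProcess)

# Impute missing predictor values and standardize
pp <- preProcess(ChemicalManufacturingProcess[, -1],
                 method = c("knnImpute", "center", "scale"))
predictors <- predict(pp, ChemicalManufacturingProcess[, -1])

# Split into training and test sets (seed and split fraction are placeholders)
set.seed(500)
inTrain <- createDataPartition(ChemicalManufacturingProcess$Yield,
                               p = 0.7, list = FALSE)

training2      <- predictors[inTrain, ]
test2          <- predictors[-inTrain, ]
training_yield <- ChemicalManufacturingProcess$Yield[inTrain]
test_yield     <- ChemicalManufacturingProcess$Yield[-inTrain]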

Build the different tree-based models:

Simple Tree (Rpart and Ctree)

set.seed(500)

rpartTree <- train(training2, training_yield,
                   method = "rpart2",
                   tuneLength = 10,
                   trControl = trainControl(method = "cv")
                   )

ctTree <- train(training2, training_yield,
                   method = "ctree2",
                   tuneLength = 10,
                   trControl = trainControl(method = "cv")
                   )

Random Forest

rfModel <- randomForest(training2, training_yield, importance=TRUE, ntree = 500)

Gradient Boosting Machine

I’m going to use a tuning grid for the gradient boosting machine.

gbmGrid <- expand.grid(
         interaction.depth = seq(1, 7, by = 2),
         n.trees = seq(100, 1000, by = 50),
         shrinkage = c(0.01, 0.1),
         n.minobsinnode = 10
         )

set.seed(100)

gbmTune <- train(training2, training_yield,
    method = "gbm",
    tuneGrid = gbmGrid,
    verbose = FALSE)

Cubist

set.seed(100)

cubistTune <- train(training2, training_yield,
    method = "cubist",
    verbose = FALSE)

Predictions:

set.seed(150)

simplePred <- predict(rpartTree, newdata=test2)
simpleResample <- postResample(pred = simplePred, obs = test_yield)

ctTreePred <- predict(ctTree, newdata = test2)
ctResample <- postResample(pred=ctTreePred, obs = test_yield)

randomPred <- predict(rfModel, newdata = test2)
randomResample <- postResample(pred=randomPred, obs = test_yield)

gbmPred <-predict(gbmTune, newdata=test2)
gbmResample <- postResample(pred=gbmPred, obs=test_yield)

cubistPred <-predict(cubistTune, newdata=test2)
cubistSample <- postResample(pred=cubistPred, obs = test_yield)
display <- rbind(
"Simple Tree" = simpleResample, 
"Simple Tree - ct Tree" = ctResample,
"Random Forest" = randomResample,
"Gradient Boosted Machine" = gbmResample,
"Cubist" = cubistSample
)


display %>% kable() %>% kable_paper()
| Model                    |      RMSE |  Rsquared |       MAE |
|--------------------------|-----------|-----------|-----------|
| Simple Tree              | 1.2042221 | 0.5920196 | 1.0287404 |
| Simple Tree - ct Tree    | 1.2060249 | 0.5424733 | 0.9708932 |
| Random Forest            | 0.8293321 | 0.8040729 | 0.6303472 |
| Gradient Boosted Machine | 0.8300481 | 0.7585793 | 0.5661109 |
| Cubist                   | 1.0819031 | 0.6591117 | 0.6696670 |

  1. Which tree-based regression model gives the optimal resampling and test set performance?

Out of all the models, the random forest gave the best R-squared value at 0.8040, even though the GBM model was not far behind at 0.7585. I suspect that with further tuning the GBM model could be improved. The simple trees did not perform as well and were roughly on par with the linear models.
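
One possible direction for that further tuning is a wider grid with explicit cross-validation (a sketch only; these ranges are my own choices and were not run here):

# A broader grid plus 10-fold CV for the GBM (hypothetical settings)
gbmGrid2 <- expand.grid(
         interaction.depth = seq(1, 9, by = 2),
         n.trees = seq(500, 2000, by = 250),
         shrinkage = c(0.005, 0.01, 0.05, 0.1),
         n.minobsinnode = c(5, 10)
         )

set.seed(100)
gbmTune2 <- train(training2, training_yield,
    method = "gbm",
    tuneGrid = gbmGrid2,
    trControl = trainControl(method = "cv", number = 10),
    verbose = FALSE)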

  2. Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

varImpPlot(rfModel, n.var = 10)

ManufacturingProcess32 and ManufacturingProcess13 top the list of most important variables once again; these variables were also important in the nonlinear and linear models. ManufacturingProcess variables are the most important contributors to the model, although BiologicalMaterial06 and BiologicalMaterial03 also appear in the top ten. This shows that the most important predictors are largely consistent across the top-performing models.

The main difference between the random forest and nonlinear models compared with the linear models is that the importance of ManufacturingProcess17 is reduced, though it is still in the top 10.
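
To make the comparison concrete, the top 10 random forest importances can also be pulled out as a table (a short sketch using the same rfModel object):

# Extract and sort the permutation importances, keeping the top 10
rfImp <- varImp(rfModel, scale = FALSE)
head(rfImp[order(-rfImp$Overall), , drop = FALSE], 10)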

  3. Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Looking at the optimal single tree, it is interesting to see that BiologicalMaterial11 is used as a decision node right after the root node ManufacturingProcess32. The right side of the tree does not surprise me.

suppressMessages(library(partykit))

party_tree <- as.party(rpartTree$finalModel)
plot(party_tree)