Kuhn and Johnson Chapter 8
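
The code throughout assumes the following packages are attached. This setup chunk is reconstructed from the functions used below (an assumption) and may need adjusting for your environment:

library(mlbench)                    # mlbench.friedman1
library(randomForest)               # randomForest
library(caret)                      # varImp, train, preProcess, postResample, nearZeroVar
library(gbm)                        # gbm
library(Cubist)                     # engine behind caret's method = "cubist"
library(rpart)                      # rpart
library(partykit)                   # as.party
library(party)                      # cforest, varimp (loaded after partykit so these are not masked)
library(ipred)                      # bagging
library(rpart.plot)                 # rpart.plot
library(dplyr)                      # arrange, select
library(AppliedPredictiveModeling)  # ChemicalManufacturingProcess, solubility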

Exercise 8.1

Recreate the simulated data from Exercise 7.2:

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

model1 <- randomForest(y ~ ., 
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp1 <- varImp(model1, scale = FALSE)

rfImp1 |>
  arrange(desc(Overall)) |>
  knitr::kable()
|    |    Overall|
|:---|----------:|
|V1  |  8.7322354|
|V4  |  7.6151188|
|V2  |  6.4153694|
|V5  |  2.0235246|
|V3  |  0.7635918|
|V6  |  0.1651112|
|V7  | -0.0059617|
|V10 | -0.0749448|
|V9  | -0.0952927|
|V8  | -0.1663626|

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No. The importance scores for V6-V10 are negligible compared to those for V1-V5.
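
As a side check, varImp(model1, scale = FALSE) should correspond to the permutation-based measure (%IncMSE) stored by randomForest; both stored measures can be inspected directly, a minimal sketch:

# view both importance measures stored by randomForest (%IncMSE and IncNodePurity)
importance(model1)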

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., 
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp2 <- varImp(model2, scale = FALSE)

rfImp2 |>
  arrange(desc(Overall)) |>
  knitr::kable()
|           |    Overall|
|:----------|----------:|
|V4         |  7.0475224|
|V2         |  6.0689606|
|V1         |  5.6911997|
|duplicate1 |  4.2833158|
|V5         |  1.8723844|
|V3         |  0.6297022|
|V6         |  0.1356906|
|V10        |  0.0289481|
|V9         |  0.0084044|
|V7         | -0.0134564|
|V8         | -0.0437056|

The importance score for V1 decreased substantially because the trees can now split on either V1 or duplicate1, so the credit is shared between them. The importance scores for the other informative variables also decreased slightly.

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., 
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp3 <- varImp(model3, scale = FALSE)

rfImp3 |>
  arrange(desc(Overall)) |>
  knitr::kable()
|           |    Overall|
|:----------|----------:|
|V4         |  7.0487092|
|V2         |  6.5281650|
|V1         |  4.9168733|
|duplicate1 |  3.8006823|
|V5         |  2.0311556|
|duplicate2 |  1.8772196|
|V3         |  0.5871155|
|V6         |  0.1421315|
|V7         |  0.1099199|
|V10        |  0.0923058|
|V9         | -0.0107503|
|V8         | -0.0840569|

When another highly correlated predictor is added, the importance of V1 decreases even further, since the credit is now split three ways among V1, duplicate1, and duplicate2.

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

cforestModel <- cforest(y ~ .,
        data = simulated[,c(1:11)])

data.frame(Importance = varimp(cforestModel)) |>
  arrange(desc(Importance)) |>
  knitr::kable()
|    | Importance|
|:---|----------:|
|V1  |  8.3013758|
|V4  |  7.3074416|
|V2  |  6.3849498|
|V5  |  2.1097038|
|V3  |  0.2021611|
|V7  |  0.1021704|
|V9  | -0.1127212|
|V10 | -0.1345254|
|V8  | -0.1382315|
|V6  | -0.1612203|
data.frame(Importance = varimp(cforestModel, conditional=T)) |>
    arrange(desc(Importance)) |>
  knitr::kable()
|    | Importance|
|:---|----------:|
|V1  |  6.2461745|
|V4  |  6.0852583|
|V2  |  5.2780486|
|V5  |  1.4900429|
|V3  |  0.0514986|
|V7  | -0.0179179|
|V6  | -0.1412536|
|V9  | -0.1834125|
|V10 | -0.1838288|
|V8  | -0.2463130|

The overall ranking is the same (V1, V4, V2, V5, V3), although V3 receives a much lower score than in the traditional random forest model, and the conditional measure shrinks all of the scores somewhat. V6-V10 remain unimportant.
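
Note that the fit above drops duplicate1 and duplicate2. Since the conditional measure of Strobl et al. (2007) is designed for exactly this correlated-predictor situation, refitting on all columns is instructive; a sketch (the conditional calculation can be slow):

# refit on all predictors, including the correlated duplicates (sketch)
cforestModelDup <- cforest(y ~ ., data = simulated)

# unconditional vs. conditional importance; the conditional version
# adjusts for the correlation between V1 and its duplicates
sort(varimp(cforestModelDup), decreasing = TRUE)
sort(varimp(cforestModelDup, conditional = TRUE), decreasing = TRUE)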

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

# boosted model
boostedModel <- gbm(y ~ .,
                    data = simulated[,c(1:11)],
                    distribution = "gaussian",
                    n.trees=1000)

summary(boostedModel, plotit=F) |>
  dplyr::select(-var) |>
  knitr::kable()
|    |   rel.inf|
|:---|---------:|
|V4  | 25.317426|
|V1  | 24.548002|
|V2  | 17.731543|
|V3  | 10.442501|
|V5  | 10.386558|
|V7  |  3.113415|
|V6  |  2.988204|
|V8  |  2.062039|
|V10 |  1.720418|
|V9  |  1.689893|

The boosted model identifies the same informative (V1-V5) and uninformative (V6-V10) predictors. However, the ordering differs slightly, with V4 having a higher relative influence than V1.

# cubist model
cubistModel <- train(y ~ .,
                     data = simulated[,c(1:11)],
                     method = "cubist")

varImp(cubistModel$finalModel, scale = FALSE) |>
  arrange(desc(Overall)) |>
  knitr::kable()
|    | Overall|
|:---|-------:|
|V1  |    72.0|
|V2  |    54.5|
|V4  |    49.0|
|V3  |    42.0|
|V5  |    40.0|
|V6  |    11.0|
|V7  |     0.0|
|V8  |     0.0|
|V9  |     0.0|
|V10 |     0.0|

In the Cubist model, V1 is again the most important variable, but here V2 is rated more important than V4 and V3 more important than V5. V7-V10 receive zero importance, while V6 gets a slightly higher score than in the other models, although it still pales in comparison to the informative predictors.
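
Cubist's importance scores reflect how often each predictor is used in rule conditions and in the rule-level linear models; those usage statistics can be inspected from the fitted model, a sketch:

# print the rules; the "Attribute usage" section underlies the importance scores (sketch)
summary(cubistModel$finalModel)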

Exercise 8.2

Use a simulation to show tree bias with different granularities.

Tree-based predictor selection tends to be biased toward predictors with more distinct values. In the simulation below, a, b, c, and d are sampled from grids of 10, 100, 1,000, and 10,000 possible values, respectively, and y depends on a, b, and c but not on d.

set.seed(613)
a <- sample(1:10 / 10, 200, replace = TRUE)
b <- sample(1:100 / 100, 200, replace = TRUE)
c <- sample(1:1000 / 1000, 200, replace = TRUE)
d <- sample(1:10000 / 10000, 200, replace = TRUE)

y <- a + b + c + rnorm(200, mean=0, sd=5)

simData <- data.frame(a, b, c, d, y) 

rpartTree <- rpart(y ~ ., data = simData)

plot(as.party(rpartTree))

d, the most finely grained predictor, appears in the tree's splits frequently despite having no relationship with y, illustrating the selection bias toward predictors with more distinct values.
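
A single tree can be noisy, so the bias is easier to see by repeating the simulation and tabulating which predictor is chosen at the root split; a sketch under the same data-generating assumptions as above:

# repeat the simulation and record the variable used at the root split (sketch)
set.seed(613)
firstSplit <- replicate(200, {
  a <- sample(1:10 / 10, 200, replace = TRUE)
  b <- sample(1:100 / 100, 200, replace = TRUE)
  c <- sample(1:1000 / 1000, 200, replace = TRUE)
  d <- sample(1:10000 / 10000, 200, replace = TRUE)
  y <- a + b + c + rnorm(200, mean = 0, sd = 5)
  fit <- rpart(y ~ ., data = data.frame(a, b, c, d, y))
  as.character(fit$frame$var[1])  # "<leaf>" if the tree did not split
})
table(firstSplit)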

Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The model on the right uses a bagging fraction and learning rate of 0.9. With a bagging fraction of 0.9, each tree is built on nearly all of the training data, so the trees are very similar and repeatedly reward the same strong predictors; with a learning rate of 0.9, each tree's contribution is barely shrunk, so the model commits to those few predictors within the first several iterations. The model on the left, with both parameters set to 0.1, takes many small steps on different random subsets of the data, which lets additional predictors contribute and spreads the importance across more of them.
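
This can be checked directly on the solubility data by fitting gbm with the two extreme settings and comparing the relative-influence profiles. A minimal sketch, assuming the solubility data from the AppliedPredictiveModeling package and using illustrative values for n.trees rather than the tuned values behind Fig. 8.24:

# compare importance profiles at the two extreme settings (sketch)
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

gbmLow  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbmHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

head(summary(gbmLow,  plotit = FALSE), 10)   # influence spread over more predictors
head(summary(gbmHigh, plotit = FALSE), 10)   # influence concentrated on a few predictors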

(b) Which model do you think would be more predictive of other samples?

The left model, with the lower bagging fraction and learning rate, is probably more predictive of other samples because it is less likely to overfit the training data. The model on the right has likely overfit and will not generalize as well to new data.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth allows deeper trees that split on more predictors, which would spread the variable importance over more predictors and flatten the slope of the importance profile for both models, raising the importance of some of the less influential variables.
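
Reusing solTrain and the settings from the sketch above, this can be checked by refitting with a larger interaction.depth and comparing the influence profile:

# same settings as gbmHigh but with deeper trees (sketch)
gbmDeep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
               interaction.depth = 10)

head(summary(gbmDeep, plotit = FALSE), 10)   # influence should spread across more predictors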

Exercise 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

# load data
data(ChemicalManufacturingProcess)

# apply same imputations, data splitting, and pre-processing as done in HW7

# impute
imputations <- preProcess(ChemicalManufacturingProcess, 
               method = c("knnImpute"), 
               k=5)

chem_man_imputed <- predict(imputations, ChemicalManufacturingProcess)

# filter nonzero variance variables
chem_man_filtered <- chem_man_imputed[,-nearZeroVar(chem_man_imputed)]

set.seed(613)  # for reproducibility

# split into training and testing
train_indices <- sample(nrow(chem_man_filtered), nrow(chem_man_filtered)*.8, replace=F)

trainChem <- chem_man_filtered[train_indices,]
testChem <- chem_man_filtered[-train_indices,]

# train models
# random forest model
rf_mod <- randomForest(Yield~.,
                       data=trainChem,
                       importance=T,
                       ntree=1000)

rf_pred <- predict(rf_mod, testChem)

# bagged tree
bagged_mod <- bagging(Yield ~ ., 
                     data = trainChem)

bagged_pred <- predict(bagged_mod, testChem)

# boosted tree
boosted_mod <- gbm(Yield ~ .,
                   data = trainChem,
                   distribution = "gaussian",
                   n.trees=1000)

boosted_pred <- predict(boosted_mod, testChem)
## Using 1000 trees...
# cubist 
cubist_mod <- train(Yield ~ .,
                    data = trainChem,
                    method = "cubist")

cubist_pred <- predict(cubist_mod, testChem)

(a) Which tree-based regression model gives the optimal resampling and test set performance?

rbind(RandomForest = postResample(rf_pred, testChem$Yield),
      BaggedTree = postResample(bagged_pred, testChem$Yield),
      BoostedTree = postResample(boosted_pred, testChem$Yield),
      Cubist = postResample(cubist_pred, testChem$Yield))
##                   RMSE  Rsquared       MAE
## RandomForest 0.6240582 0.5225684 0.5052709
## BaggedTree   0.6770303 0.4435007 0.5418391
## BoostedTree  0.6618618 0.5656576 0.5076747
## Cubist       0.5847081 0.6158759 0.4676564

The Cubist model has the highest \(R^2\) and the lowest \(RMSE\) and \(MAE\) on the test set, so it gives the best test set performance of the tree-based models.
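
The comparison above uses the test set only. Resampling performance can be compared by refitting the models through caret with a shared cross-validation scheme; a sketch with default tuning grids (method names are caret's, e.g. "treebag" for the bagged tree):

# compare cross-validated (resampling) performance with a common scheme (sketch)
ctrl <- trainControl(method = "cv", number = 10)

set.seed(613)
rf_cv     <- train(Yield ~ ., data = trainChem, method = "rf",      trControl = ctrl)
set.seed(613)
bag_cv    <- train(Yield ~ ., data = trainChem, method = "treebag", trControl = ctrl)
set.seed(613)
gbm_cv    <- train(Yield ~ ., data = trainChem, method = "gbm",     trControl = ctrl, verbose = FALSE)
set.seed(613)
cubist_cv <- train(Yield ~ ., data = trainChem, method = "cubist",  trControl = ctrl)

summary(resamples(list(RandomForest = rf_cv, BaggedTree = bag_cv,
                       BoostedTree = gbm_cv, Cubist = cubist_cv)))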

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

We can look at the top 10 most important predictors in the Cubist model.

plot(varImp(cubist_mod), 10)

The top 10 predictors are mostly process variables, with only one biological predictor present.
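
The split between process and biological predictors in the top 10 can also be counted directly from the importance scores; a sketch:

# count biological vs. process predictors among the top 10 (sketch)
imp <- varImp(cubist_mod)$importance
top10 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl("^Biological", top10), "Biological", "Process"))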

SVM Non-Linear Model

The top two predictors are the same for the non-linear SVM model and the tree-based model. The third most influential variable in the tree-based model, ManufacturingProcess09, is the tenth most important in the non-linear SVM model. BiologicalMaterial06 appears among the top four most important predictors for both models. ManufacturingProcess17 is the only other important predictor shared by the two models.

Lasso Linear Model

ManufacturingProcess32 is the top predictor for both the lasso linear model and the tree-based model. ManufacturingProcess13 and ManufacturingProcess09 appear in the top three for both, although their order of importance is switched between the two models. None of the biological predictors are shared between these models. ManufacturingProcess17 also appears as an important predictor in this model.

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

single_tree <- rpart(Yield ~ ., 
                     data = trainChem)

rpart.plot(single_tree)

The top predictor is the same, which is unsurprising. More biological predictors appear in the single tree than in the ensemble models. The tree also gives an interpretable view of the splits, showing the cutoff value for each predictor used and how those splits relate to higher or lower yield.
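
rpart.plot labels each node with the mean yield; to show the distribution of yield within the terminal nodes, as the question asks, the same tree can be plotted through partykit, a sketch:

# boxplots of yield within each terminal node (sketch)
plot(as.party(single_tree))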