8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
  1. Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
library(caret)   # varImp()

model1 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

The random forest model did not make meaningful use of the uninformative predictors (V6 – V10): their importance scores are all near zero, and several are slightly negative.
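
As a quick check (a sketch using the objects created above), sorting the scores confirms that V6 – V10 all sit at the bottom:

# Sort the permutation importance scores; the uninformative predictors
# V6-V10 fall at the bottom with values near zero.
rfImp1[order(rfImp1$Overall, decreasing = TRUE), , drop = FALSE]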

  2. Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change?

The importance score for V1 dropped (from about 8.7 to 5.7), with much of it shifting to the new duplicate1 predictor; V2 and V4 now rank above V1 even though their own scores changed very little.

model2 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

rfImp2
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

What happens when you add another predictor that is also highly correlated with V1?

When a second highly correlated predictor is added, the importance score for V1 decreases even further, and the importance originally attributed to V1 is split across V1, duplicate1, and duplicate2.

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)

rfImp3
##                Overall
## V1          4.91687329
## V2          6.52816504
## V3          0.58711552
## V4          7.04870917
## V5          2.03115561
## V6          0.14213148
## V7          0.10991985
## V8         -0.08405687
## V9         -0.01075028
## V10         0.09230576
## duplicate1  3.80068234
## duplicate2  1.87721959
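
As a rough check (a sketch using the importance objects above), much of the importance that V1 loses shows up in its correlated copies:

# Importance of V1 alone in the first model, versus the combined importance
# of V1 and its near-duplicates after they are added.
rfImp1["V1", "Overall"]
sum(rfImp2[c("V1", "duplicate1"), "Overall"])
sum(rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall"])
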
  3. Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

These importance scores show largely the same pattern as the traditional random forest model: V1, V4, and V2 remain the three most important predictors, in the same order as in the first model, and the uninformative predictors V6 – V10 again receive scores near zero.

library(party)

model4 <- cforest(y ~ ., data = simulated[, c(1:11)])

rfImp4 <- varimp(model4, conditional = TRUE)

rfImp4
##          V1          V2          V3          V4          V5          V6 
##  6.35767602  5.33898936  0.14934113  6.13948255  1.38546210 -0.22927142 
##          V7          V8          V9         V10 
## -0.12173581 -0.18135897 -0.02920453 -0.20105653
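
For comparison (a sketch), the traditional, unconditional importance can be computed from the same cforest fit by setting conditional = FALSE:

# Traditional (unconditional) permutation importance from the same cforest,
# for comparison with the conditional scores above.
varimp(model4, conditional = FALSE)
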
  4. Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted Trees

Although the importance scores are on a much larger scale, the pattern is broadly the same as in the traditional random forest model: V1 is the most important predictor, though V2 now narrowly edges out V4 for second place.

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(100)

gbmTune <- train(y ~ ., data = simulated[, c(1:11)],
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

rfImp5 <- varImp(gbmTune$finalModel, scale = FALSE)

rfImp5
##       Overall
## V1  4634.0234
## V2  4316.6044
## V3  1310.7178
## V4  4287.0793
## V5  1844.2147
## V6   416.3966
## V7   368.9383
## V8   224.4769
## V9   246.1400
## V10  250.6263
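
The tuning parameters that caret selected for this fit can be checked as well (a small sketch):

# Tuning parameter combination chosen by caret for the boosted model.
gbmTune$bestTune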

Cubist

Cubist shows a similar pattern as well: V1 is by far the most important predictor and the uninformative predictors V6 – V10 are barely used or not at all, although V2 ranks second rather than V4.

set.seed(100)
cubistTuned <- train(y ~ ., data = simulated[, c(1:11)], method = "cubist")
rfImp6 <- varImp(cubistTuned$finalModel, scale = FALSE)

rfImp6
##     Overall
## V1     72.0
## V3     42.0
## V2     54.5
## V4     49.0
## V5     40.0
## V6     11.0
## V7      0.0
## V8      0.0
## V9      0.0
## V10     0.0
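
To compare the pattern across models more directly, a small sketch ranks the ten original predictors under each importance measure computed above (the zero-usage Cubist predictors receive tied, averaged ranks):

# Rank (1 = most important) of each original predictor under the random
# forest, boosted tree, and Cubist importance measures.
vars <- paste0("V", 1:10)
data.frame(rf      = rank(-rfImp1[vars, "Overall"]),
           boosted = rank(-rfImp5[vars, "Overall"]),
           cubist  = rank(-rfImp6[vars, "Overall"]),
           row.names = vars)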

8.2

Use a simulation to show tree bias with different granularities.

Tree-based models are biased toward predictors with higher granularity: a predictor with more distinct values offers more candidate split points, so it is more likely to be chosen for a split, and to appear in the top nodes, even when it is no more informative than the other predictors.

In this simulation each of the five predictors contributes equally to y; they differ only in how many distinct values they take. The fitted tree nevertheless assigns the coarsest predictor, a (only 10 distinct values), a noticeably lower importance score than the finer-grained predictors b through e, and the higher-granularity variables also tend to be chosen for the splits near the top of the tree.

library(rpart)
library(partykit)   # as.party() for plotting

set.seed(624)

a <- sample(1:10 / 10, 500, replace = TRUE)
b <- sample(1:100 / 100, 500, replace = TRUE)
c <- sample(1:1000 / 1000, 500, replace = TRUE)
d <- sample(1:10000 / 10000, 500, replace = TRUE)
e <- sample(1:100000 / 100000, 500, replace = TRUE)

y <- a + b + c + d + e

simData <- data.frame(a,b,c,d,e,y) 

rpartTree <- rpart(y ~ ., data = simData)

plot(as.party(rpartTree), gp = gpar(fontsize = 7))

varImp(rpartTree)
##    Overall
## a 1.702702
## b 4.769899
## c 3.551946
## d 3.205420
## e 3.957698
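
The source of the bias can be made explicit by counting how many distinct values (candidate split points) each predictor offers (a sketch using the simulated data above):

# Number of distinct values per predictor; more distinct values means more
# candidate split points, which drives the selection bias.
sapply(simData[, c("a", "b", "c", "d", "e")], function(x) length(unique(x)))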

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

  1. Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The bagging fraction is the proportion of the training data sampled to build each tree, and the learning rate is the fraction of each new tree's prediction that is added to the model from the previous iteration, i.e., how quickly the model learns. The model on the left uses a small learning rate and a small bagging fraction: each iteration contributes only a little and sees only a small random subset of the data, so many different predictors get a chance to contribute across the iterations and the importance is spread out over more of them. The model on the right learns quickly from nearly all of the training data at every iteration, so the few strongest predictors dominate the early trees and absorb most of the importance; with these settings the model is also more prone to overfitting.
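
A hedged sketch of how the two extremes could be reproduced on the solubility data (the exact settings behind Fig. 8.24 are not given, so n.trees = 100 and interaction.depth = 1 are assumptions):

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)   # provides solTrainXtrans and solTrainY

# Fit boosted models at the two extremes of bagging fraction and shrinkage.
fitGbm <- function(bagFrac, shrink) {
  gbm.fit(solTrainXtrans, solTrainY,
          distribution = "gaussian",
          n.trees = 100,
          interaction.depth = 1,
          shrinkage = shrink,
          bag.fraction = bagFrac,
          verbose = FALSE)
}
gbmLeft  <- fitGbm(0.1, 0.1)   # left-hand plot: importance spread out
gbmRight <- fitGbm(0.9, 0.9)   # right-hand plot: importance concentrated
head(summary(gbmLeft,  plotit = FALSE), 10)
head(summary(gbmRight, plotit = FALSE), 10)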

  2. Which model do you think would be more predictive of other samples?

The model on the left would likely be more predictive of other samples. Its small learning rate and bagging fraction act as a form of regularization, spreading the fit across more iterations and more predictors, so it should generalize better to new data.

  3. How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Interaction depth is the maximum depth of each tree, i.e., how many splits each tree can make. Increasing it lets each tree use more predictors, so importance is spread over a larger set of variables and the lesser predictors pick up a bigger share of the total. For both models this would flatten the slope of the predictor importance plot.
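
A sketch of how this could be checked empirically, reusing solTrainXtrans, solTrainY, and gbmRight from the sketch above (interaction.depth = 10 is an arbitrary choice):

# Refit the high learning-rate / high bagging-fraction model with deeper
# trees and compare how quickly the sorted importances fall off; the deeper
# model should spread importance over more predictors.
gbmDeep <- gbm.fit(solTrainXtrans, solTrainY,
                   distribution = "gaussian",
                   n.trees = 100,
                   interaction.depth = 10,
                   shrinkage = 0.9,
                   bag.fraction = 0.9,
                   verbose = FALSE)
head(summary(gbmRight, plotit = FALSE), 10)   # shallow trees (depth 1)
head(summary(gbmDeep,  plotit = FALSE), 10)   # deep trees (depth 10)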

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)

set.seed(100)

data(ChemicalManufacturingProcess)

# imputation
miss <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(miss, ChemicalManufacturingProcess)

# filtering low frequencies
Chemical <- Chemical[, -nearZeroVar(Chemical)]

set.seed(624)

# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)

# train 
train_x <- Chemical[index, -1]
train_y <- Chemical[index, 1]

# test
test_x <- Chemical[-index, -1]
test_y <- Chemical[-index, 1]
  1. Which tree-based regression model gives the optimal resampling and test set performance?

Single Tree (CART)

set.seed(100)

cartTune <- train(train_x, train_y,
                  method = "rpart",
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))

cartPred <- predict(cartTune, test_x)

postResample(cartPred, test_y)
##      RMSE  Rsquared       MAE 
## 1.5632389 0.5035625 1.1913334

Bagged Trees

library(ipred)

set.seed(100)

baggedTree <- ipredbagg(train_y, train_x)
 
baggedPred <- predict(baggedTree, test_x)

postResample(baggedPred, test_y)
##      RMSE  Rsquared       MAE 
## 1.2311358 0.7141633 0.8911824

Random Forest

set.seed(100)

rfModel <- randomForest(train_x, train_y, 
                        importance = TRUE,
                        ntree = 1000)


rfPred <- predict(rfModel, test_x)

postResample(rfPred, test_y)
##      RMSE  Rsquared       MAE 
## 1.2598015 0.7216121 0.8998200

Boosted Trees

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(100)

gbmTune <- train(train_x, train_y,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

gbmPred <- predict(gbmTune, test_x)

postResample(gbmPred, test_y)
##      RMSE  Rsquared       MAE 
## 1.1824945 0.7195344 0.8678498

Cubist

set.seed(100)
cubistTuned <- train(train_x, train_y, 
                     method = "cubist")

cubistPred <- predict(cubistTuned, test_x)

postResample(cubistPred, test_y)
##      RMSE  Rsquared       MAE 
## 0.9804618 0.7964499 0.7363520

The Cubist model has the lowest RMSE (and the highest R-squared), giving the best test set performance among the tree-based models:

rbind(cart = postResample(cartPred, test_y),
      bagged = postResample(baggedPred, test_y),
      randomForest = postResample(rfPred, test_y),
      boosted = postResample(gbmPred, test_y),
      cubist = postResample(cubistPred, test_y))
##                   RMSE  Rsquared       MAE
## cart         1.5632389 0.5035625 1.1913334
## bagged       1.2311358 0.7141633 0.8911824
## randomForest 1.2598015 0.7216121 0.8998200
## boosted      1.1824945 0.7195344 0.8678498
## cubist       0.9804618 0.7964499 0.7363520
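
The Cubist model's resampled (bootstrap) performance during tuning can also be inspected (a sketch; the bagged-tree and random forest fits above were trained outside of caret, so only the caret-tuned model is shown):

# Resampled performance and chosen tuning parameters of the Cubist model.
getTrainPerf(cubistTuned)
cubistTuned$bestTune
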
  2. Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

The manufacturing process variables dominate the top-20 list at a ratio of 16:4 (process to biological), whereas the optimal linear and nonlinear models had ratios of 11:9.

varImp(cubistTuned, scale = TRUE)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess17   62.35
## ManufacturingProcess09   54.12
## BiologicalMaterial03     54.12
## BiologicalMaterial02     52.94
## ManufacturingProcess13   38.82
## BiologicalMaterial06     36.47
## ManufacturingProcess04   34.12
## ManufacturingProcess33   32.94
## ManufacturingProcess10   22.35
## ManufacturingProcess11   18.82
## ManufacturingProcess14   17.65
## ManufacturingProcess39   17.65
## ManufacturingProcess28   14.12
## ManufacturingProcess16   12.94
## BiologicalMaterial12     12.94
## ManufacturingProcess31   12.94
## ManufacturingProcess29   12.94
## ManufacturingProcess02   12.94
## ManufacturingProcess21   11.76
plot(varImp(cubistTuned), top = 20) 

For the tree-based model, only 3 of the top 10 predictors are biological variables, compared with 4 in the linear and nonlinear models. ManufacturingProcess32 is still the most important predictor, and the remaining predictors carry much less importance. BiologicalMaterial06 drops to seventh, whereas it was the second most important in the other models. Some predictors that were not previously in the top 10 now appear there, such as ManufacturingProcess17, 04, 33, and 10.
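
A quick tally (sketch) of process versus biological variables among the top 20 Cubist predictors, matching the 16:4 ratio noted above:

# Count process vs. biological variables among the 20 most important
# Cubist predictors.
cubistImp <- varImp(cubistTuned)$importance
top20 <- rownames(cubistImp)[order(cubistImp$Overall, decreasing = TRUE)][1:20]
table(ifelse(grepl("^Manufacturing", top20), "Process", "Biological"))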

  3. Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
rpartTree <- rpart(Yield ~ ., data = Chemical[index, ])

plot(as.party(rpartTree), ip_args = list(abbreviate = 4), gp = gpar(fontsize = 7))
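
As a complement to the plot (a sketch), printing the tree gives the sample count and mean yield in each node, which makes the split values easier to read off:

# Text version of the tree: each node shows n, deviance, and the mean
# yield (yval) of the samples that fall in it.
print(rpartTree)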