Exercise 8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)

set.seed(10)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

# Print out the variable importance scores.
rfImp1
##         Overall
## V1   8.90420548
## V2   6.70898998
## V3   0.86791755
## V4   7.62604729
## V5   2.23267325
## V6   0.06923985
## V7   0.06089436
## V8  -0.10874321
## V9  -0.07789474
## V10 -0.06803808

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

Answer:

From the variable importance table above, the model relied primarily on variables V1 through V5, with V1, V4, and V2 receiving the largest scores. The uninformative predictors V6 through V10 were not meaningfully used: their importance scores are all near zero, and several are slightly negative.
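
As a quick supporting check (a sketch that is not part of the original output), we could refit the forest without V6 through V10 and compare its out-of-bag error with that of model1; a similar error would confirm that these predictors contribute little:

set.seed(10)
# Refit the random forest using only the informative predictors V1-V5.
model1Reduced <- randomForest(y ~ . - V6 - V7 - V8 - V9 - V10,
                              data = simulated,
                              importance = TRUE,
                              ntree = 1000)

# Compare the final out-of-bag MSE of the full and reduced fits.
c(full = tail(model1$mse, 1), reduced = tail(model1Reduced$mse, 1))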

 

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9427268

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

set.seed(10)
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

# Print out the variable importance scores.
rfImp2
##                Overall
## V1          5.31827901
## V2          5.99427268
## V3          0.48697184
## V4          7.10547761
## V5          2.00014094
## V6          0.07296670
## V7          0.08425677
## V8         -0.19300835
## V9         -0.02235592
## V10        -0.07353255
## duplicate1  5.21717340

Answer:

After adding a highly correlated predictor, the importance score for V1 dropped substantially (from about 8.9 to 5.3), with duplicate1 receiving a similar score of about 5.2. The importance that previously went to V1 alone is now split between V1 and its correlated copy.

What happens when you add another predictor that is also highly correlated with V1?

set.seed(10)
# Add an additional highly correlated variable.
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9450441
# Fit an additional random forest model. 
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp3 <- varImp(model3, scale = FALSE)

# Print out the variable importance scores.
rfImp3
##                Overall
## V1          3.43278224
## V2          6.21300854
## V3          0.37855134
## V4          6.93456970
## V5          1.94111181
## V6          0.20595267
## V7          0.06673948
## V8         -0.07343753
## V9         -0.03910869
## V10         0.08350986
## duplicate1  3.27864985
## duplicate2  3.53117795

Answer:

Adding a second highly correlated predictor reduces the importance of V1 further (to about 3.4), although the drop is not as dramatic as the one caused by the first duplicate. The importance is now shared roughly equally among V1, duplicate1, and duplicate2.
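
One informal way to see this dilution (a sketch based on the importance tables already computed above) is to add up the scores of V1 and its correlated copies; the combined score stays in the same general range as V1's original importance of roughly 8.9:

# Combined importance of V1 and its correlated copies after each addition.
sum(rfImp2[c("V1", "duplicate1"), "Overall"])
sum(rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall"])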

 

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

Traditional Importance Measure

library(party)

set.seed(10)
# Fit a random forest model using conditional inference trees.
cforest_model <- cforest(y ~ ., data = simulated)

# Print out the variable importance scores without conditional argument.
cforestImp3 <- varimp(cforest_model)
cforestImp3
##           V1           V2           V3           V4           V5           V6 
##  5.799253750  5.298015397  0.258691754  6.330561629  1.823288605  0.007444318 
##           V7           V8           V9          V10   duplicate1   duplicate2 
##  0.208115563 -0.116038987  0.035354230 -0.074702917  4.430253665  4.557321703

Conditional Importance Measure

# Print out the variable importance scores using conditional argument.
cforestImp3Conditional <- varimp(cforest_model, conditional = TRUE)
cforestImp3Conditional
##          V1          V2          V3          V4          V5          V6 
##  2.59979292  4.27632052  0.03614943  5.61307920  1.18052467 -0.28216376 
##          V7          V8          V9         V10  duplicate1  duplicate2 
## -0.10972358 -0.15616185 -0.16504915 -0.16416097  1.73207732  1.83901858

Answer:

The conditional importance measure shows the same broad pattern as the traditional random forest (V1 through V5 and the duplicates stand out, while V6 through V10 sit near zero), but it shrinks most of the scores and penalizes the correlated predictors most heavily: V1, duplicate1, and duplicate2 lose more than half of their traditional importance, whereas V2 and V4 change comparatively little. This is the adjustment for correlated predictors that the modified measure of Strobl et al. is designed to make.
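
To make the comparison easier to read, the two sets of scores computed above can be placed side by side (a small sketch reusing cforestImp3 and cforestImp3Conditional):

# Side-by-side comparison of the traditional and conditional measures.
data.frame(traditional = cforestImp3,
           conditional = cforestImp3Conditional)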

 

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Generalized Boosted Model

set.seed(10)
# Fit a Generalized Boosted model.
grid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
                    .n.trees = seq(100, 1000, by = 50),
                    .shrinkage = c(0.01, 0.1),
                    .n.minobsinnode = 5)

gbmModel <- train(y ~ ., data = simulated,
                  method = 'gbm',
                  tuneGrid = grid,
                  verbose = FALSE)

# Print out the variable importance scores.
gbmImp <- varImp(gbmModel)
gbmImp
## gbm variable importance
## 
##             Overall
## V4         100.0000
## V2          76.9717
## duplicate2  53.5461
## V5          40.5607
## V3          35.6464
## V1          28.2897
## duplicate1  22.7830
## V6           4.3984
## V7           2.3665
## V9           1.0731
## V8           0.3006
## V10          0.0000

Answer:

A broadly similar pattern occurs for the GBM model: V4 and V2 remain the most influential predictors, the uninformative predictors V6 through V10 receive very little importance, and V1's importance is again diluted across duplicate1 and duplicate2.
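
For reference, the tuning parameter combination that caret selected can be inspected as well (a short sketch; bestTune is stored on every caret train object):

# The tuning parameters chosen by resampling.
gbmModel$bestTune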

Cubist Model

library(Cubist)

set.seed(10)
# Fit a Cubist model using the cubist() function (column 11 is the response y).
cubistModel <- cubist(simulated[, -11], simulated[, 11])

# Print out the variable importance scores.
cubistImp <- varImp(cubistModel)
cubistImp
##            Overall
## V1              50
## V2              50
## V4              50
## V5              50
## V6              50
## V3               0
## V7               0
## V8               0
## V9               0
## V10              0
## duplicate1       0
## duplicate2       0

Answer:

The Cubist model does not show quite the same pattern. V1, V2, V4, V5, and (somewhat surprisingly) the uninformative V6 all receive the same importance score of 50, while V3, the remaining uninformative predictors, and both correlated duplicates receive none. Notably, V1 keeps its importance here; the duplicates are simply ignored rather than sharing it.
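
Cubist's own attribute-usage statistics explain these scores (a sketch; summary() on a cubist object prints the rules along with the percentage of cases in which each predictor is used in rule conditions or in the linear models):

# Rules plus the attribute-usage table (conditions vs. model terms).
summary(cubistModel)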

 

Exercise 8.2

Use a simulation to show tree bias with different granularities.

library(dplyr)   # %>% and mutate()
library(rpart)   # rpart()
library(rattle)  # fancyRpartPlot()

set.seed(10)
# Simulate the data using the twoClassSim() function, rounding two of the
# predictors into factors so that they have a much coarser granularity.
simulation <- twoClassSim(200, noiseVars = 6, corrVar = 4, corrValue = 0.8) %>%
              mutate(TwoFactor1 = as.factor(round(TwoFactor1, 0)),
                     TwoFactor2 = as.factor(round(TwoFactor2, 0)))

# Fit a regression tree for Linear01 and plot it.
simulationTree <- rpart(Linear01 ~ ., data = simulation)
fancyRpartPlot(simulationTree, caption = NULL)

# Print out the variable importance scores.
varImp(simulationTree)
##               Overall
## Class      0.12899309
## Corr1      0.56877057
## Corr2      0.80484357
## Corr3      0.45148582
## Corr4      0.31354494
## Linear02   0.09092615
## Linear04   0.57816796
## Linear05   0.30368897
## Linear06   0.34327812
## Linear07   0.21178385
## Linear08   0.21668061
## Linear09   0.08907005
## Linear10   0.64657871
## Noise1     0.15824703
## Noise2     0.25301676
## Noise3     0.42263110
## Noise4     0.64929630
## Noise5     0.85549633
## Noise6     0.24896270
## Nonlinear1 0.57689821
## Nonlinear2 0.32892122
## Nonlinear3 0.66215725
## TwoFactor1 0.46778884
## TwoFactor2 0.67971422
## Linear03   0.00000000

Answer:

Single regression trees are biased toward predictors with many distinct values, because such predictors offer more candidate split points. That bias is visible here: the noise variables have no relationship with Linear01 by construction, yet several of them receive substantial importance scores, with Noise5 ranking highest of all the predictors.

Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

Fig. 8.24: A comparison of variable importance magnitudes for differing values of the bagging fraction and shrinkage parameters. Both tuning parameters are set to 0.1 in the left figure. Both are set to 0.9 in the right figure.

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

Answer:

The model on the right uses a bagging fraction and learning rate of 0.9, so each tree is built on nearly all of the training data and each tree's contribution is added to the ensemble with very little shrinkage. As a result, the first trees (and the predictors they select) dominate the fit, and later trees keep reusing the same predictors. According to the text, "The importance profile for boosting has a much steeper importance slope than the one for random forests. This is due to the fact that the trees from boosting are dependent on each other and hence will have correlated structures as the method follows by the gradient. Therefore many of the same predictors will be selected across the trees, increasing their contribution to the importance metric".
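
A rough way to reproduce the pattern in Fig. 8.24 (a sketch, assuming the solubility data from the AppliedPredictiveModeling package, which is the data set the figure is based on; the tree depth and number of trees used here are assumptions) is to fit gbm directly with the two extreme settings and compare the relative influence tables:

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)   # provides solTrainXtrans and solTrainY

set.seed(10)
# Both the bagging fraction and the learning rate set to 0.1.
gbmLow  <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                   n.trees = 100, interaction.depth = 1,
                   shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)
# Both set to 0.9.
gbmHigh <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                   n.trees = 100, interaction.depth = 1,
                   shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)

# Relative influence: the 0.9/0.9 fit is expected to concentrate importance
# on far fewer predictors than the 0.1/0.1 fit.
head(summary(gbmLow,  plotit = FALSE), 10)
head(summary(gbmHigh, plotit = FALSE), 10)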

(b) Which model do you think would be more predictive of other samples?

Answer:

Smaller learning rates and bagging fractions act as a form of regularization: the model learns more slowly and each tree sees a different small subsample, which reduces the risk of overfitting. I therefore believe the model on the left (both parameters at 0.1) would be more predictive of other samples.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Answer:

Increasing the interaction depth allows each tree to split on more predictors, which spreads importance across a larger number of variables. This would flatten the slope of predictor importance for either model, since the top few predictors would account for a smaller share of the total importance.
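
This could be checked empirically with the same solubility-data sketch as above, simply increasing interaction.depth (again a sketch, with the depth value chosen arbitrarily):

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)

set.seed(10)
# Same extreme settings as the right-hand model, but with much deeper trees.
gbmHighDeep <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                       n.trees = 100, interaction.depth = 10,
                       shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)

# With deeper trees more predictors enter the splits, so the relative
# influence should be spread across a longer list (a flatter slope).
head(summary(gbmHighDeep, plotit = FALSE), 20)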

Exercise 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)

set.seed(10)
data(ChemicalManufacturingProcess)

# Impute the missing values using KNN.
cmpImputed <- preProcess(ChemicalManufacturingProcess, 'knnImpute')

# Predict after imputation.
chemicalMPData <- predict(cmpImputed, ChemicalManufacturingProcess)

# Split the training data using an 80% training data split.
trainingData <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.8, list = FALSE)
xTrainData <- chemicalMPData[trainingData, ]
yTrainData <- ChemicalManufacturingProcess$Yield[trainingData]

# Split the test data.
xTestData <- chemicalMPData[-trainingData, ]
yTestData <- ChemicalManufacturingProcess$Yield[-trainingData]

 

Single Tree Model

set.seed(10)
# Define the R Part Single Tree model.
rPartSingleTree <- train(x = xTrainData,
                         y = yTrainData,
                         method = 'rpart',
                         tuneLength = 10,
                         trControl = trainControl(method = 'cv'))

# Run predict() and postResample() on the model.
rPartSingleTreePred <- predict(rPartSingleTree, newdata = xTestData)
rPartSingleTreePerformance <- postResample(pred = rPartSingleTreePred, obs = yTestData)
rPartSingleTreePerformance
##      RMSE  Rsquared       MAE 
## 0.2897274 0.9749795 0.2309521

 

Random Forest Model

set.seed(10)
# Define the Random Forest model.
randomForest <- train(x = xTrainData,
                      y = yTrainData,
                      method = 'rf',
                      tuneLength = 10,
                      importance = TRUE,
                      trControl = trainControl(method = 'cv'))

# Run predict() and postResample() on the model.
randomForestPred <- predict(randomForest, newdata = xTestData)
randomForestPerformance <- postResample(pred = randomForestPred, obs = yTestData)
randomForestPerformance
##       RMSE   Rsquared        MAE 
## 0.06814128 0.99860581 0.05070449

 

GBM Model

set.seed(10)

# Define the GBM model.
grid <- expand.grid(interaction.depth = seq(1, 6, by = 1),
                    n.trees = c(25, 50, 100, 200),
                    shrinkage = c(0.01, 0.05, 0.1, 0.2),
                    n.minobsinnode = c(5, 10, 15))

gbmModel <- train(x = xTrainData,
                  y = yTrainData,
                  method = 'gbm',
                  tuneGrid = grid,
                  trControl = trainControl(method = 'cv'),
                  verbose = FALSE)

# Run predict() and postResample() on the model.
gbmModelPred <- predict(gbmModel, newdata = xTestData)
gbmModelPerformance <- postResample(pred = gbmModelPred, obs = yTestData)
gbmModelPerformance
##       RMSE   Rsquared        MAE 
## 0.08243419 0.99776327 0.06678381

 

(a) Which tree-based regression model gives the optimal resampling and test set performance?

library(knitr)
library(kableExtra)

rbind('Single Tree Model' = rPartSingleTreePerformance,
      'Random Forest Model' = randomForestPerformance,
      'GBM Model' = gbmModelPerformance) %>%
  kable() %>% kable_styling()
                         RMSE  Rsquared       MAE
Single Tree Model   0.2897274 0.9749795 0.2309521
Random Forest Model 0.0681413 0.9986058 0.0507045
GBM Model           0.0824342 0.9977633 0.0667838

Answer:

Based on the lowest RMSE value and the highest Rsquared value, the Random Forest model gives the optimal resampling and test set performance.
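
The table above compares test-set performance; the cross-validated (resampling) performance can be compared in the same way (a sketch using caret's resamples(), which works here because all three models were tuned with 10-fold cross-validation):

# Collect and summarize the cross-validation results of the three models.
resamps <- resamples(list('Single Tree'   = rPartSingleTree,
                          'Random Forest' = randomForest,
                          'GBM'           = gbmModel))
summary(resamps)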

 

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list?

varImp(randomForest)
## rf variable importance
## 
##   only 20 most important variables shown (out of 58)
## 
##                        Overall
## Yield                  100.000
## ManufacturingProcess33   5.433
## BiologicalMaterial03     5.425
## BiologicalMaterial11     5.354
## ManufacturingProcess30   4.390
## ManufacturingProcess38   4.386
## ManufacturingProcess05   4.363
## ManufacturingProcess12   4.156
## ManufacturingProcess40   4.093
## ManufacturingProcess02   4.092
## ManufacturingProcess11   3.983
## BiologicalMaterial09     3.978
## ManufacturingProcess14   3.953
## ManufacturingProcess19   3.828
## BiologicalMaterial10     3.685
## ManufacturingProcess27   3.659
## ManufacturingProcess44   3.645
## ManufacturingProcess06   3.603
## ManufacturingProcess39   3.598
## ManufacturingProcess09   3.592
library(vip)
vip(randomForest, aesthetics = list(color = 'red', fill = 'green'))

Answer:

The most important predictors in the optimal tree-based regression model (Random Forest) are shown above. As we can see, the process variables dominate the list.
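
A quick tally makes the split explicit (a sketch that strips the trailing digits from the top-10 predictor names and counts each group):

# Importance scores from the tuned random forest.
rfImportance <- varImp(randomForest)$importance

# Names of the ten highest-scoring variables.
top10 <- head(rownames(rfImportance)[order(rfImportance$Overall,
                                           decreasing = TRUE)], 10)

# Count how many belong to each group of predictors.
table(sub("[0-9]+$", "", top10))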

How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

# Define and tune the SVM model.
svmModel <- train(x = xTrainData,
                  y = yTrainData,
                  method = 'svmRadial',
                  preProc = c('center', 'scale'),
                  tuneLength = 14,
                  trControl = trainControl(method = 'cv'))

varImp(svmModel)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 58)
## 
##                        Overall
## Yield                   100.00
## ManufacturingProcess32   39.43
## ManufacturingProcess13   36.09
## BiologicalMaterial06     31.25
## ManufacturingProcess17   30.28
## ManufacturingProcess36   28.51
## ManufacturingProcess09   27.98
## BiologicalMaterial03     26.77
## BiologicalMaterial12     26.22
## ManufacturingProcess06   24.76
## BiologicalMaterial02     23.26
## BiologicalMaterial11     22.08
## ManufacturingProcess31   20.56
## ManufacturingProcess33   19.15
## BiologicalMaterial08     18.34
## BiologicalMaterial04     17.97
## ManufacturingProcess11   16.75
## BiologicalMaterial01     15.42
## ManufacturingProcess29   14.38
## ManufacturingProcess12   13.94
vip(svmModel, aesthetics = list(color = 'red', fill='purple'))

Answer:

In the earlier exercises on linear and nonlinear models, the SVM model was found to be the optimal model. As the importance scores above show, the manufacturing process variables again dominate the top of the list, just as they do for the random forest model, although the specific predictors and their ordering in the top 10 differ between the two models.
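
The two top-10 lists can also be compared directly (a sketch using the same approach as above for both models):

rfImportance  <- varImp(randomForest)$importance
svmImportance <- varImp(svmModel)$importance

rfTop10  <- head(rownames(rfImportance)[order(rfImportance$Overall,
                                              decreasing = TRUE)], 10)
svmTop10 <- head(rownames(svmImportance)[order(svmImportance$Overall,
                                               decreasing = TRUE)], 10)

# Predictors that appear in both top-10 lists.
intersect(rfTop10, svmTop10)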

 

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

rpartTree <- rpart(Yield ~ ., data = xTrainData)
fancyRpartPlot(rpartTree, caption = 'Distribution of Yield')

Answer:

The tree diagram does provide some additional insight into the predictors and their relationship with yield. The splits nearest the root are the ones that best separate samples with high and low yields, so the predictors used there are the most influential, and following the branches down to the terminal nodes shows which combinations of process and biological values are associated with higher or lower yield distributions.
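
For a text version of the same information (a sketch; rpart.rules() from the rpart.plot package prints each terminal node's fitted yield, on the preprocessed scale, together with the split conditions that define it):

library(rpart.plot)

# One row per terminal node: the fitted yield and the path of splits leading to it.
rpart.rules(rpartTree)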