Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson

8.1. Recreate the simulated data from Exercise 7.2:

library(mlbench)
library(randomForest)
library(caret)
library(data.table)
library(party)
library(tidyverse)
library(rpart)
library(rpart.plot)
library(AppliedPredictiveModeling)
library(missForest)
library(Cubist)
library(partykit)
set.seed(200)
simu <- mlbench.friedman1(200, sd=1)
simu <- cbind(simu$x, simu$y)
simu <- as.data.frame(simu)
colnames(simu)[ncol(simu)] <- "y"
  1. Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model_rf<- randomForest(y ~ ., data = simu,
                       importance = TRUE,
                       ntree = 1000)
model_rf
## 
## Call:
##  randomForest(formula = y ~ ., data = simu, importance = TRUE,      ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 6.754258
##                     % Var explained: 72.3

Below you will find the importance feature, and the overall score along with the variables

rf_imp <- varImp(model_rf, scale = FALSE)
rf_imp
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

From using the VarImpPlot feature you can view the variables in the plot, this will ensure an understanding of their importance/ranking

varImpPlot(model_rf)

Here you will visualize V6 to V10 which do not appear important

  1. Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simu$dup1 <- simu$V1 + rnorm(200) * .1
cor(simu$dup1, simu$V1)
## [1] 0.9460206

above you can see the correlation of V1 is 94%

model_rf2 <- randomForest(y ~ ., data = simu,
                       importance = TRUE,
                       ntree = 1000)
model_rf2
## 
## Call:
##  randomForest(formula = y ~ ., data = simu, importance = TRUE,      ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 6.922537
##                     % Var explained: 71.61
rf2_Imp <- varImp(model_rf2, scale = FALSE)
rf2_Imp
##          Overall
## V1    5.69119973
## V2    6.06896061
## V3    0.62970218
## V4    7.04752238
## V5    1.87238438
## V6    0.13569065
## V7   -0.01345645
## V8   -0.04370565
## V9    0.00840438
## V10   0.02894814
## dup1  4.28331581
varImpPlot(model_rf2)

Here we can see V1 importance going down significantly with a predictor (duplicate 1 variable ) added.

  1. Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
simu <- subset(simu, select = -dup1)

model_3 <- cforest(y ~ ., data = simu)
feature_imp <- varimp(model_3)
feature_imp
##           V1           V2           V3           V4           V5           V6 
##  7.926955696  5.860766989  0.208635817  7.208444729  2.077012973 -0.032252040 
##           V7           V8           V9          V10 
##  0.008656239 -0.133778236 -0.072364217 -0.016155247

From visualizing, cforest has V1-V5 as the important variables, and V6-V10 are low ranking features.

  1. Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
model_cub <- cubist(simu[,-11], simu$y)
model_cub
## 
## Call:
## cubist.default(x = simu[, -11], y = simu$y)
## 
## Number of samples: 200 
## Number of predictors: 10 
## 
## Number of committees: 1 
## Number of rules: 1
summary(model_cub)
## 
## Call:
## cubist.default(x = simu[, -11], y = simu$y)
## 
## 
## Cubist [Release 2.07 GPL Edition]  Tue May 04 23:07:11 2021
## ---------------------------------
## 
##     Target attribute `outcome'
## 
## Read 200 cases (11 attributes) from undefined.data
## 
## Model:
## 
##   Rule 1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.944664]
## 
##  outcome = 0.183529 + 8.9 V4 + 7.9 V1 + 7.1 V2 + 5.3 V5
## 
## 
## Evaluation on training data (200 cases):
## 
##     Average  |error|           2.224012
##     Relative |error|               0.55
##     Correlation coefficient        0.84
## 
## 
##  Attribute usage:
##    Conds  Model
## 
##           100%    V1
##           100%    V2
##           100%    V4
##           100%    V5
## 
## 
## Time: 0.0 secs

We can see that Cubist will only uses highly important variables. Which are V1, V2, V4 and V5.

8.2 Use a simulation to show tree bias with different granularities.

tree_sim <- twoClassSim(200, 
                        noiseVars = 10, 
                        corrVar = 4, 
                        corrValue = 0.8,
                        mislabel= 0) 

fit <- rpart(Linear01 ~ ., tree_sim)
rpart.plot(fit)

8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

  1. Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

From looking at the model on the right it focuses its importance on the first few predictors, then the model on the left which spreads its importance across multiple predictors. This is because as the training of the model goes up, the number of focuses predictors goes down.Therefore, the model on the right uses a higher bagging fraction, so this causes the data to have less variation between samples. Finally, the model also has a high shrinkage parameter, this mean the model can learn quicker vs. one has smaller shrinkage parameter.

  1. Which model do you think would be more predictive of other samples?

I would think the model on the left will perform better. For the reason that the left model uses more predictors, which might be more representative of the data.

  1. How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

From Increasing the interaction depth this could affect the predictors importance by impacting how the model learns. Therefore, the higher the interaction depth, the faster the learner works. This could have a negative impact on the predictors.

8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

  1. Which tree-based regression model gives the optimal resampling and test set performance?
data(ChemicalManufacturingProcess)
df <- ChemicalManufacturingProcess



df_1 <- missForest(df)
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
df_2 <- df_1$ximp
head(df_2)
##   Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00                 6.25                49.58                56.97
## 2 42.44                 8.01                60.97                67.48
## 3 42.03                 8.01                60.97                67.48
## 4 41.42                 8.01                60.97                67.48
## 5 42.49                 7.47                63.33                72.25
## 6 43.57                 6.12                58.36                65.31
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1                12.74                19.51                43.73
## 2                14.65                19.36                53.14
## 3                14.65                19.36                53.14
## 4                14.65                19.36                53.14
## 5                14.02                17.91                54.66
## 6                15.17                21.79                51.23
##   BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1                  100                16.66                11.44
## 2                  100                19.04                12.55
## 3                  100                19.04                12.55
## 4                  100                19.04                12.55
## 5                  100                18.22                12.80
## 6                  100                18.30                12.13
##   BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1                 3.46               138.09                18.83
## 2                 3.46               153.67                21.05
## 3                 3.46               153.67                21.05
## 4                 3.46               153.67                21.05
## 5                 3.05               147.61                21.05
## 6                 3.78               151.88                20.76
##   ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1                 10.937                 11.217                 1.5341
## 2                  0.000                  0.000                 1.5454
## 3                  0.000                  0.000                 1.5457
## 4                  0.000                  0.000                 1.5442
## 5                 10.700                  0.000                 1.5409
## 6                 12.000                  0.000                 1.5425
##   ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1                 929.82                998.286                205.919
## 2                 917.00               1032.200                210.000
## 3                 912.00               1003.600                207.100
## 4                 911.00               1014.600                213.300
## 5                 918.00               1027.500                205.700
## 6                 924.00               1016.800                208.900
##   ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1                 177.53                 177.73                  43.00
## 2                 177.00                 178.00                  46.57
## 3                 178.00                 178.00                  45.07
## 4                 177.00                 177.00                  44.92
## 5                 178.00                 178.00                  44.96
## 6                 178.00                 178.00                  45.32
##   ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1                  9.098                  9.404                  45.49
## 2                  9.491                  9.887                   0.00
## 3                  9.216                  9.565                   0.00
## 4                  9.238                  9.620                   0.00
## 5                  8.912                  9.268                   0.00
## 6                  9.182                  9.912                   0.00
##   ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1                   35.5                   4898                   6108
## 2                   34.0                   4869                   6095
## 3                   34.8                   4878                   6087
## 4                   34.8                   4897                   6102
## 5                   34.6                   4992                   6233
## 6                   34.0                   4985                   6222
##   ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1                   4682                   35.5                   4865
## 2                   4617                   34.0                   4867
## 3                   4617                   34.8                   4877
## 4                   4635                   34.8                   4872
## 5                   4733                   33.9                   4886
## 6                   4786                   33.4                   4862
##   ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1                   6049                   4665                    0.0
## 2                   6097                   4621                    0.0
## 3                   6078                   4621                    0.0
## 4                   6073                   4611                    0.0
## 5                   6102                   4659                   -0.7
## 6                   6115                   4696                   -0.6
##   ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1                    5.4                   3.59                  12.33
## 2                    3.0                   0.00                   3.00
## 3                    4.0                   1.00                   4.00
## 4                    5.0                   2.00                   5.00
## 5                    8.0                   4.00                  18.00
## 6                    9.0                   1.00                   1.00
##   ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1                   4873                   6074                   4685
## 2                   4869                   6107                   4630
## 3                   4897                   6116                   4637
## 4                   4892                   6111                   4630
## 5                   4930                   6151                   4684
## 6                   4871                   6128                   4687
##   ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1                   10.7                   21.0                    9.9
## 2                   11.2                   21.4                    9.9
## 3                   11.1                   21.3                    9.4
## 4                   11.1                   21.3                    9.4
## 5                   11.3                   21.6                    9.0
## 6                   11.4                   21.7                   10.1
##   ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1                   69.1                    156                     66
## 2                   68.7                    169                     66
## 3                   69.3                    173                     66
## 4                   69.3                    171                     68
## 5                   69.4                    171                     70
## 6                   68.2                    173                     70
##   ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1                    2.4                    486                  0.019
## 2                    2.6                    508                  0.019
## 3                    2.6                    509                  0.018
## 4                    2.5                    496                  0.018
## 5                    2.5                    468                  0.017
## 6                    2.5                    490                  0.018
##   ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1                    0.5                      3                    7.2
## 2                    2.0                      2                    7.2
## 3                    0.7                      2                    7.2
## 4                    1.2                      2                    7.2
## 5                    0.2                      2                    7.3
## 6                    0.4                      2                    7.2
##   ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1                  0.014                  0.018                   11.6
## 2                  0.100                  0.150                   11.1
## 3                  0.000                  0.000                   12.0
## 4                  0.000                  0.000                   10.6
## 5                  0.000                  0.000                   11.0
## 6                  0.000                  0.000                   11.5
##   ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1                    3.0                    1.8                    2.4
## 2                    0.9                    1.9                    2.2
## 3                    1.0                    1.8                    2.3
## 4                    1.1                    1.8                    2.1
## 5                    1.1                    1.7                    2.1
## 6                    2.2                    1.8                    2.0
data <- df_2[, 2:58]
target <- df_2[,1]
train <- createDataPartition(target, p=0.75)
train_pred <- data[train$Resample1,]
train_tar <- target[train$Resample]
test_pred <- data[-train$Resample1,]
test_target <- target[-train$Resample1]
control <- trainControl(method = "cv", number=10)

Cubist

cube_model = train(train_pred, train_tar, method="cubist", metric = 
"Rsquared", tuneLength=10, trainControl=control)
predict_cube = predict(cube_model, test_pred)
postResample(predict_cube, obs = test_target)
##      RMSE  Rsquared       MAE 
## 1.0742298 0.7184455 0.8087956

Random Forest

rf_model <- train(train_pred, train_tar,method = "rf",tuneLength = 
10,trControl = control, metric ="Rsquared", importance=TRUE )
predict_rf = predict(rf_model, test_pred)
postResample(predict_rf, obs = test_target)
##      RMSE  Rsquared       MAE 
## 1.2334240 0.6479936 0.9771199

Single Tree

rt_model <- train(train_pred, train_tar,method ="rpart2", metric 
="Rsquared", tuneLength=9, trControl=control)
predict_rt = predict(rt_model, test_pred)
postResample(predict_rt, obs = test_target)
##      RMSE  Rsquared       MAE 
## 1.7911437 0.2064795 1.3719619

I believe the cubist model has the best RMSE.

  1. Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
plot(varImp(cube_model), top=10)

From visualizing the graph above ManufacturingProcess32 is ranked the highest. I believe the cubist model shows us the top predictors are very similar with the results from the optimal linear and nonlinear models.

  1. Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
plot(as.party(rt_model$finalModel))

The ManufacturingProcess32 has a higher impact on the yield, this is because it’s related to the previous models. The biological process does not appear to have as much of an impact with the response yield.