DATA624 - Homework 9

Author

Anthony Josue Roman

library(caret)
library(lattice)
library(randomForest)
library(mlbench)
library(party)
library(gbm)
library(Cubist)
library(rpart)
library(rpart.plot)

set.seed(200)

Exercise 8.1

simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

model1 <- randomForest(
  y ~ ., 
  data = simulated,
  importance = TRUE,
  ntree = 1000
)

rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
        Overall
V1   8.86329776
V2   6.72851763
V3   0.84145353
V4   7.60284159
V5   2.26864193
V6   0.11268425
V7   0.07374772
V8  -0.07210708
V9  -0.06913906
V10 -0.10577619

Exercise 8.1 A

rfImp1 <- varImp(model1, scale = FALSE)

imp_df <- data.frame(
  Variable = rownames(rfImp1),
  Importance = rfImp1$Overall
)

imp_df <- imp_df[order(imp_df$Importance), ]

barplot(
  imp_df$Importance,
  names.arg = imp_df$Variable,
  horiz = TRUE,
  las = 1,
  main = "Random Forest Variable Importance",
  xlab = "Importance"
)

According to the variable importance plot, V1, V4, and V2 are the most important predictors, with smaller contributions from V5 and V3. Predictors V6-V10 have near-zero importance, meaning they contribute essentially nothing to the predictions.

Under the Friedman 1 data-generating mechanism, only V1-V5 influence the response, so the random forest correctly identified the informative predictors and effectively ignored the noise variables.
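For reference, mlbench.friedman1 generates the response as y = 10*sin(pi*V1*V2) + 20*(V3 - 0.5)^2 + 10*V4 + 5*V5 + noise, so only V1-V5 carry signal. As a quick sanity check (a small sketch, not required by the exercise), the mean function can be reconstructed from the simulated predictors and compared against the observed response.

friedman_mean <- with(simulated,
  10 * sin(pi * V1 * V2) + 20 * (V3 - 0.5)^2 + 10 * V4 + 5 * V5)

# Correlation should be high, since y is this mean function plus N(0, 1) noise
cor(friedman_mean, simulated$y)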

Exercise 8.1 B

simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)
[1] 0.9356508
model2 <- randomForest(
  y ~ ., 
  data = simulated,
  importance = TRUE,
  ntree = 1000
)

rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
               Overall
V1          5.78865821
V2          6.48386619
V3          0.55469974
V4          6.78484850
V5          1.96248183
V6          0.10126938
V7          0.14210730
V8         -0.09726812
V9         -0.08440763
V10         0.04878300
duplicate1  4.64551303
rfImp_df <- as.data.frame(rfImp2)

rfImp_df$Variable <- rownames(rfImp_df)

rfImp_df <- rfImp_df[order(rfImp_df$Overall), ]
rfImp_df$Variable[rfImp_df$Variable == "duplicate1"] <- "Duplicate of V1 (1)"
par(mar = c(5, 8, 4, 2))
barplot(
  rfImp_df$Overall,
  names.arg = rfImp_df$Variable,
  horiz = TRUE,
  las = 1,
  col = "steelblue",
  main = "Random Forest Importance with One Duplicate Predictor",
  xlab = "Importance"
)

simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate2, simulated$V1)
[1] 0.9432429
model3 <- randomForest(
  y ~ ., 
  data = simulated,
  importance = TRUE,
  ntree = 1000
)

rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
                Overall
V1          4.642486671
V2          6.999126963
V3          0.410061203
V4          7.088994719
V5          1.989570698
V6          0.152581329
V7         -0.020144331
V8         -0.077797272
V9         -0.019773933
V10         0.004167046
duplicate1  3.625925699
duplicate2  2.729716790
rfImp_df <- as.data.frame(rfImp3)

rfImp_df$Variable <- rownames(rfImp_df)

rfImp_df <- rfImp_df[order(rfImp_df$Overall), ]
rfImp_df$Variable[rfImp_df$Variable == "duplicate1"] <- "Duplicate of V1 (1)"
rfImp_df$Variable[rfImp_df$Variable == "duplicate2"] <- "Duplicate of V1 (2)"
par(mar = c(5, 8, 4, 2))
barplot(
  rfImp_df$Overall,
  names.arg = rfImp_df$Variable,
  horiz = TRUE,
  las = 1,
  col = "steelblue",
  main = "Random Forest Importance with Correlated Predictors",
  xlab = "Importance"
)

After adding a predictor highly correlated with V1, the importance of V1 dropped, while the new predictor (duplicate1) received a substantial score of its own. This demonstrates that random forests split importance across correlated predictors rather than assigning it all to a single one.

Adding a second duplicate (duplicate2) reduced the importance of V1 further, spreading it across all three correlated predictors. Because the three carry nearly the same information, the forest uses them interchangeably across its trees, and each one ends up with only part of the credit.

These results show that random forest variable importance is sensitive to multicollinearity: when predictors are highly correlated, importance is divided among them, and each appears less important individually than it really is.
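One way to see the sharing effect directly is to compare V1's importance in the original model with the combined importance of V1 and its duplicates afterwards. The lines below are a rough illustration using the importance tables already computed; it is not an exact accounting, since the models were refit on different predictor sets.

rfImp1["V1", "Overall"]                                      # V1 alone (model1)
sum(rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall"])  # shared across the group (model3)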

Exercise 8.1 C

set.seed(200)

cf_model <- cforest(
  y ~ ., 
  data = simulated,
  controls = cforest_unbiased(ntree = 500)
)

cf_imp <- varimp(cf_model)
cf_imp
         V1          V2          V3          V4          V5          V6 
 4.97896327  6.15922332 -0.02823395  7.22184597  1.62500755  0.02438630 
         V7          V8          V9         V10  duplicate1  duplicate2 
 0.02172171 -0.02996967 -0.02823848 -0.06106928  1.86875797  2.95969824 
cf_imp_cond <- varimp(cf_model, conditional = TRUE)
cf_imp_cond
          V1           V2           V3           V4           V5           V6 
 1.880891248  5.006040202  0.012850815  5.754878336  1.074777998  0.032360933 
          V7           V8           V9          V10   duplicate1   duplicate2 
 0.018411717 -0.011146824 -0.004839891 -0.012942948  0.695643563  1.229281188 

The conditional inference forest produced a different variable importance ranking than the traditional random forest. With the standard (unconditional) importance measure, the pattern resembled the random forest results: importance was again spread across the correlated variables V1, duplicate1, and duplicate2.

With the conditional measure, however, the importance of V1 and its duplicates shrank considerably while predictors such as V2 and V4 dominated. This is because conditional importance measures each predictor's contribution after adjusting for its correlation with the other predictors.

In short, the conditional importance measure reduces the bias that correlated predictors introduce into importance estimates.
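To make the contrast easier to read, the short sketch below lines the two sets of scores up in a single table, reusing the two importance vectors computed above.

cf_order <- names(sort(cf_imp, decreasing = TRUE))
data.frame(
  Unconditional = cf_imp[cf_order],
  Conditional   = cf_imp_cond[cf_order]
)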

Exercise 8.1 D

ctrl <- trainControl(method = "cv", number = 10)

set.seed(200)
gbm_model <- train(
  y ~ ., 
  data = simulated,
  method = "gbm",
  trControl = ctrl,
  verbose = FALSE
)

gbmImp <- varImp(gbm_model)
gbmImp
gbm variable importance

            Overall
V4         100.0000
V2          69.5897
V1          43.7215
V5          38.6493
V3          29.1566
duplicate2  27.6569
duplicate1  19.8065
V6           2.9481
V7           0.4883
V10          0.4299
V9           0.3435
V8           0.0000
plot(gbmImp, main = "Boosted Tree Variable Importance")

set.seed(200)
cubist_model <- train(
  y ~ ., 
  data = simulated,
  method = "cubist",
  trControl = ctrl
)

cubistImp <- varImp(cubist_model)
cubistImp
cubist variable importance

           Overall
V1         100.000
V2          85.507
V4          71.014
V3          58.696
V5          49.275
V6          11.594
duplicate2   7.971
V8           0.000
V7           0.000
duplicate1   0.000
V9           0.000
V10          0.000
plot(cubistImp, main = "Cubist Variable Importance")

The boosted tree model (GBM) behaves similarly: importance is still partially spread among the correlated predictors. The main variables (V1, V2, and V4) remain influential, but the duplicate predictors also receive meaningful scores, showing that boosting, like the random forest, shares importance across correlated predictors.

In contrast, the Cubist model concentrates importance on the original variables (V1, V2, and V4), while the duplicates receive very little credit. This suggests that Cubist's importance scores are far less affected by multicollinearity.
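A crude way to quantify this difference is to add up the importance each model assigns to the V1 "family" (V1 plus its two duplicates), using the caret importance tables shown above. This is only a rough sketch: the scores are scaled to 0-100 within each model and are not directly comparable across models.

v1_family <- c("V1", "duplicate1", "duplicate2")

sum(gbmImp$importance[v1_family, "Overall"])     # boosted tree
sum(cubistImp$importance[v1_family, "Overall"])  # Cubist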

Exercise 8.2

set.seed(123)

n <- 200

# True signal
x_signal <- runif(n)

# Noise variables
x_continuous <- runif(n)          # high granularity
x_binary <- sample(c(0, 1), n, replace = TRUE)  # low granularity

# Response depends ONLY on signal
y <- 3 * x_signal + rnorm(n)

df <- data.frame(
  y,
  x_signal,
  x_continuous,
  x_binary
)

library(rpart)
library(rpart.plot)

tree_model <- rpart(y ~ ., data = df)

rpart.plot(tree_model, main = "Tree Bias Example")

tree_model$variable.importance
    x_signal x_continuous     x_binary 
  114.388953    13.911352     2.228692 

This simulation shows how regression trees favor predictors with more distinct values (higher granularity). The response depends only on x_signal, while the other two predictors are pure noise: one continuous (x_continuous) and one binary (x_binary).

As expected, most splits use x_signal. However, the continuous noise predictor (x_continuous) is still used for at least one split, whereas the binary noise predictor (x_binary) is never selected. The variable importance scores likewise give x_continuous a considerably higher value than x_binary, even though neither carries any real information.

This happens because a continuous variable offers many candidate split points, giving the algorithm more chances to find a split that reduces the error by chance alone, while a binary variable offers only one possible split. Low-granularity predictors are therefore selected less often, which is the selection bias of single trees.
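The point about split candidates can be checked directly: rpart can place a split between any two adjacent distinct values of a predictor, so the count of candidate cut points (a simple sketch) is far larger for the continuous predictors than for the binary one.

sapply(df[, c("x_signal", "x_continuous", "x_binary")],
       function(x) length(unique(x)) - 1)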

Exercise 8.3

Exercise 8.3 A

In the left-hand plot, the low learning rate and small bagging fraction make the model learn slowly: each boosting iteration contributes only a little and sees a different small subsample, so many different predictors are used across the iterations and importance is spread over many variables.

In the right-hand plot, the high learning rate and large bagging fraction let each tree take large corrective steps on nearly the full training set, so the fit is dominated by the strongest predictors early on. Importance is therefore concentrated in just a few predictors, with a steep drop-off after them.
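As a rough check of this reasoning, the sketch below refits boosted models at the two parameter settings from the figure (0.1 versus 0.9 for both shrinkage and bagging fraction). It uses the simulated data from Exercise 8.1 rather than the solubility data, and n.minobsinnode is lowered so the small 10% bags can still be split, so the numbers are only illustrative.

set.seed(200)
gbm_left <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1,
                n.minobsinnode = 5)
set.seed(200)
gbm_right <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9,
                 n.minobsinnode = 5)

# Relative influence: flatter profile on the left, concentrated on the right
head(summary(gbm_left, plotit = FALSE))
head(summary(gbm_right, plotit = FALSE))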

Exercise 8.3 B

The left-hand model would likely generalize better to new samples. Its lower learning rate and smaller bagging fraction make the boosting process more gradual and more varied, letting it draw on information from a wider set of predictors and reducing the risk of overfitting.

The right-hand model leans heavily on a few predictors, which makes it more likely to overfit the training data and perform worse on new observations.

Exercise 8.3 C

Increasing the interaction depth lets each tree grow deeper, so more predictors can enter the model through additional splits and interaction effects, spreading the contributions over a larger set of variables.

As a result, the importance profile would flatten: the slope of importance would be less steep, and the top few predictors would dominate less.
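A similar sketch (same assumptions as above: the simulated data standing in for the solubility data) shows the effect of interaction depth on how spread out the relative influence is.

set.seed(200)
gbm_depth1 <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                  n.trees = 1000, shrinkage = 0.1, interaction.depth = 1)
set.seed(200)
gbm_depth7 <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                  n.trees = 1000, shrinkage = 0.1, interaction.depth = 7)

# Deeper trees draw more predictors into the splits, flattening the profile
head(summary(gbm_depth1, plotit = FALSE), 10)
head(summary(gbm_depth7, plotit = FALSE), 10)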

Exercise 8.7

library(AppliedPredictiveModeling)
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
library(gbm)
library(Cubist)

data("ChemicalManufacturingProcess", package = "AppliedPredictiveModeling")

chem_x <- ChemicalManufacturingProcess[, -which(names(ChemicalManufacturingProcess) == "Yield")]
chem_y <- ChemicalManufacturingProcess$Yield

set.seed(624)
train_index <- createDataPartition(chem_y, p = 0.8, list = FALSE)

train_x <- chem_x[train_index, ]
test_x  <- chem_x[-train_index, ]

train_y <- chem_y[train_index]
test_y  <- chem_y[-train_index]

pp <- preProcess(train_x, method = c("medianImpute", "center", "scale"))

train_x <- predict(pp, train_x)
test_x  <- predict(pp, test_x)

Exercise 8.7 A

ctrl <- trainControl(method = "cv", number = 10)

set.seed(624)
tree_model <- train(train_x, train_y, method = "rpart", trControl = ctrl)

set.seed(624)
rf_model <- train(train_x, train_y, method = "rf", trControl = ctrl)

set.seed(624)
gbm_model <- train(train_x, train_y, method = "gbm", trControl = ctrl, verbose = FALSE)

set.seed(624)
cubist_model <- train(train_x, train_y, method = "cubist", trControl = ctrl)
results <- resamples(list(
  Tree = tree_model,
  RF = rf_model,
  GBM = gbm_model,
  Cubist = cubist_model
))

summary(results)

Call:
summary.resamples(object = results)

Models: Tree, RF, GBM, Cubist 
Number of resamples: 10 

MAE 
            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
Tree   0.9415783 1.0738451 1.2239231 1.2104317 1.2970522 1.4734498    0
RF     0.7465905 0.7835893 0.8939922 0.9036463 1.0327632 1.0776648    0
GBM    0.6318791 0.7350843 0.9025933 0.8892779 1.0094747 1.2318036    0
Cubist 0.6701826 0.7447992 0.8441574 0.8287095 0.8829658 0.9725766    0

RMSE 
            Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
Tree   1.1802098 1.3344422 1.466828 1.498851 1.649388 1.993385    0
RF     0.8871032 0.9287008 1.192248 1.130584 1.245867 1.391208    0
GBM    0.8009295 0.9339374 1.157815 1.141010 1.334901 1.511166    0
Cubist 0.7992482 0.9069234 1.031942 1.039236 1.180590 1.281195    0

Rsquared 
             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
Tree   0.08860495 0.1854612 0.3550092 0.3367805 0.4584719 0.5806388    0
RF     0.57363345 0.6244933 0.6580719 0.6757886 0.7229862 0.8091094    0
GBM    0.44267369 0.5894177 0.6443627 0.6501936 0.7363517 0.8366172    0
Cubist 0.49247798 0.6146083 0.6567250 0.6903806 0.7892502 0.9379103    0
postResample(predict(tree_model, test_x), test_y)
     RMSE  Rsquared       MAE 
1.5632389 0.5035625 1.1913334 
postResample(predict(rf_model, test_x), test_y)
     RMSE  Rsquared       MAE 
1.2436291 0.7200462 0.8853601 
postResample(predict(gbm_model, test_x), test_y)
     RMSE  Rsquared       MAE 
1.1262388 0.7425400 0.8478939 
postResample(predict(cubist_model, test_x), test_y)
     RMSE  Rsquared       MAE 
0.8836276 0.8399471 0.6663759 

Four tree-based models were trained and compared: a single regression tree, a random forest, a boosted tree (GBM), and Cubist. In the resampling results, the single tree performs worst, with the highest RMSE and the lowest R², indicating weak predictive ability. The other three models perform considerably better, with lower RMSE and higher R².

Among them, the Cubist model is the strongest: it has the lowest RMSE and MAE and the highest R² on both the cross-validation resamples and the test set.

The Cubist model is therefore selected as the optimal tree-based model for this data, giving the best resampling and test-set performance.
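For a quick visual comparison of the resampling results, the resamples object can be passed to a lattice dotplot (an optional extra view, not required by the exercise).

dotplot(results, metric = "RMSE")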

Exercise 8.7 B

varImp(rf_model)
rf variable importance

  only 20 most important variables shown (out of 57)

                       Overall
ManufacturingProcess32 100.000
BiologicalMaterial03    22.317
BiologicalMaterial06    21.384
BiologicalMaterial12    20.190
ManufacturingProcess31  17.968
ManufacturingProcess13  15.126
ManufacturingProcess09  11.922
ManufacturingProcess06   9.420
ManufacturingProcess36   8.557
BiologicalMaterial11     8.487
BiologicalMaterial02     7.554
BiologicalMaterial08     7.264
ManufacturingProcess25   7.023
ManufacturingProcess17   6.410
ManufacturingProcess27   5.984
BiologicalMaterial04     5.713
ManufacturingProcess30   5.617
BiologicalMaterial05     5.438
BiologicalMaterial09     5.342
ManufacturingProcess24   4.479
varImp(gbm_model)
gbm variable importance

  only 20 most important variables shown (out of 57)

                       Overall
ManufacturingProcess32 100.000
ManufacturingProcess31  28.301
ManufacturingProcess09  20.643
BiologicalMaterial12    20.043
ManufacturingProcess25  13.885
BiologicalMaterial03    13.553
ManufacturingProcess13  12.296
ManufacturingProcess27  12.207
BiologicalMaterial05    10.383
BiologicalMaterial08     9.808
ManufacturingProcess06   9.549
ManufacturingProcess17   8.229
BiologicalMaterial09     8.008
BiologicalMaterial10     7.458
ManufacturingProcess19   7.431
ManufacturingProcess01   7.135
BiologicalMaterial06     6.362
ManufacturingProcess43   6.191
ManufacturingProcess35   6.006
ManufacturingProcess37   5.900
varImp(cubist_model)
cubist variable importance

  only 20 most important variables shown (out of 57)

                       Overall
ManufacturingProcess32  100.00
ManufacturingProcess17   50.00
BiologicalMaterial03     42.05
ManufacturingProcess09   40.91
ManufacturingProcess33   38.64
BiologicalMaterial06     38.64
BiologicalMaterial02     37.50
ManufacturingProcess13   32.95
ManufacturingProcess04   32.95
ManufacturingProcess11   23.86
ManufacturingProcess15   21.59
ManufacturingProcess21   18.18
BiologicalMaterial12     18.18
ManufacturingProcess31   17.05
ManufacturingProcess26   17.05
ManufacturingProcess10   17.05
ManufacturingProcess14   14.77
ManufacturingProcess39   14.77
ManufacturingProcess36   13.64
ManufacturingProcess01   12.50

The most important predictors are dominated by the manufacturing process variables. In all three models (random forest, GBM, and Cubist), ManufacturingProcess32 has the highest importance.

Although a few biological material predictors (BiologicalMaterial03, BiologicalMaterial06, and BiologicalMaterial12) rank fairly high, process predictors make up most of the top of the list. This is especially clear in the Cubist model, the best-performing model in this tree-based analysis.

This suggests the process variables drive the yield predictions more than the biological measurements do. From a practical point of view, that is encouraging: process settings can be adjusted in an attempt to improve yield, whereas the biological material variables only describe the quality of the incoming material.
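To back this up with a number, the sketch below totals the Cubist importance scores by predictor group (process versus biological), reusing the varImp table shown above.

cub_imp <- varImp(cubist_model)$importance
pred_group <- ifelse(grepl("^Manufacturing", rownames(cub_imp)),
                     "Process", "Biological")
tapply(cub_imp$Overall, pred_group, sum)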

The ranking is also broadly consistent with the linear and nonlinear models fit previously, which likewise placed ManufacturingProcess32 at or near the top of the list.

Exercise 8.7 C

rpart.plot(tree_model$finalModel, main = "Optimal Tree Model")

In the single regression tree, ManufacturingProcess32 again stands out as the most important variable: it is the only variable used for the first-level split. The tree separates observations according to whether ManufacturingProcess32 is below 0.17 (on the centered and scaled predictor scale used for training).

Observations with ManufacturingProcess32 below 0.17 have an average yield of about 39, while those above 0.17 average about 41. In other words, higher values of this process variable are associated with higher yields.
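Note that the 0.17 cut point is on the standardized scale; if needed, it can be mapped back to the original ManufacturingProcess32 units using the statistics stored in the preProcess object (a small sketch).

0.17 * pp$std["ManufacturingProcess32"] + pp$mean["ManufacturingProcess32"]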

Since this variable is also the top predictor in every model fit in the earlier sections, it is a natural candidate for adjustment: increasing it may help improve future yields, subject to process constraints.