library(caret)
library(lattice)
library(randomForest)
library(mlbench)
library(party)
library(gbm)
library(Cubist)
library(rpart)
library(rpart.plot)
set.seed(200)DATA624 - Homework 9
Exercise 8.1
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(
y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000
)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1 Overall
V1 8.86329776
V2 6.72851763
V3 0.84145353
V4 7.60284159
V5 2.26864193
V6 0.11268425
V7 0.07374772
V8 -0.07210708
V9 -0.06913906
V10 -0.10577619
Exercise 8.1 A
rfImp1 <- varImp(model1, scale = FALSE)
imp_df <- data.frame(
Variable = rownames(rfImp1),
Importance = rfImp1$Overall
)
imp_df <- imp_df[order(imp_df$Importance), ]
barplot(
imp_df$Importance,
names.arg = imp_df$Variable,
horiz = TRUE,
las = 1,
main = "Random Forest Variable Importance",
xlab = "Importance"
)According to the predictor importance graph, variables V1, V4, and V2 have been found to be the most important predictors, while there is relatively lesser contribution from V5 and V3. On the other hand, predictors V6-V10 are near-zero importance predictors, meaning that their contribution towards the prediction task is not significant.
As per the data generating mechanism, V1-V5 are the only important variables. It can be concluded that the random forest model has made the correct decision of considering these variables for prediction.
Exercise 8.1 B
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)[1] 0.9356508
model2 <- randomForest(
y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000
)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2 Overall
V1 5.78865821
V2 6.48386619
V3 0.55469974
V4 6.78484850
V5 1.96248183
V6 0.10126938
V7 0.14210730
V8 -0.09726812
V9 -0.08440763
V10 0.04878300
duplicate1 4.64551303
rfImp_df <- as.data.frame(rfImp2)
rfImp_df$Variable <- rownames(rfImp_df)
rfImp_df <- rfImp_df[order(rfImp_df$Overall), ]
rfImp_df$Variable[rfImp_df$Variable == "duplicate1"] <- "Duplicate of V1 (1)"
par(mar = c(5, 8, 4, 2))
barplot(
rfImp_df$Overall,
names.arg = rfImp_df$Variable,
horiz = TRUE,
las = 1,
col = "steelblue",
main = "Random Forest Importance with One Duplicate Predictor",
xlab = "Importance"
)simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate2, simulated$V1)[1] 0.9432429
model3 <- randomForest(
y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000
)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3 Overall
V1 4.642486671
V2 6.999126963
V3 0.410061203
V4 7.088994719
V5 1.989570698
V6 0.152581329
V7 -0.020144331
V8 -0.077797272
V9 -0.019773933
V10 0.004167046
duplicate1 3.625925699
duplicate2 2.729716790
rfImp_df <- as.data.frame(rfImp3)
rfImp_df$Variable <- rownames(rfImp_df)
rfImp_df <- rfImp_df[order(rfImp_df$Overall), ]
rfImp_df$Variable[rfImp_df$Variable == "duplicate1"] <- "Duplicate of V1 (1)"
rfImp_df$Variable[rfImp_df$Variable == "duplicate2"] <- "Duplicate of V1 (2)"
par(mar = c(5, 8, 4, 2))
barplot(
rfImp_df$Overall,
names.arg = rfImp_df$Variable,
horiz = TRUE,
las = 1,
col = "steelblue",
main = "Random Forest Importance with Correlated Predictors",
xlab = "Importance"
)Following the inclusion of a predictor variable highly correlated with V1, the importance of V1 reduced, with the newly included predictor variable (duplicate1) getting high importance. This is a demonstration that the random forest algorithm splits the importance of correlated predictors, rather than assigning it to one single predictor.
Inclusion of a third predictor highly correlated with V1 (duplicate2) further reduced the importance of V1, distributing it among all three predictors. The three predictors contain similar information, and the random forest algorithm substitutes them across various decision trees in its process of prediction.
From the above demonstrations, it can be observed that the importance of variables in a random forest algorithm is prone to multicollinearity. In the case of highly correlated predictor variables, importance is shared between the predictors, reducing their individual importance in the model.
Exercise 8.1 C
set.seed(200)
cf_model <- cforest(
y ~ .,
data = simulated,
controls = cforest_unbiased(ntree = 500)
)
cf_imp <- varimp(cf_model)
cf_imp V1 V2 V3 V4 V5 V6
4.97896327 6.15922332 -0.02823395 7.22184597 1.62500755 0.02438630
V7 V8 V9 V10 duplicate1 duplicate2
0.02172171 -0.02996967 -0.02823848 -0.06106928 1.86875797 2.95969824
cf_imp_cond <- varimp(cf_model, conditional = TRUE)
cf_imp_cond V1 V2 V3 V4 V5 V6
1.880891248 5.006040202 0.012850815 5.754878336 1.074777998 0.032360933
V7 V8 V9 V10 duplicate1 duplicate2
0.018411717 -0.011146824 -0.004839891 -0.012942948 0.695643563 1.229281188
The conditional inference forest showed a different importance variable ranking compared to the traditional random forest. In the case where the importance measure is based on the standard approach, the findings were similar to those of the random forest, as the importance was distributed among correlated variables like V1, duplicate1, and duplicate2.
On the other hand, when the importance measure is based on the conditional approach, the importance of V1 and its correlated duplicates was greatly reduced, while predictors such as V2 and V4 dominated. This happens because the conditional approach measures the effect of each predictor after accounting for the correlation with other predictors.
In summary, the conditional inference forest minimizes the bias from correlated predictors in estimating the importance of the predictors.
Exercise 8.1 D
ctrl <- trainControl(method = "cv", number = 10)
set.seed(200)
gbm_model <- train(
y ~ .,
data = simulated,
method = "gbm",
trControl = ctrl,
verbose = FALSE
)
gbmImp <- varImp(gbm_model)
gbmImpgbm variable importance
Overall
V4 100.0000
V2 69.5897
V1 43.7215
V5 38.6493
V3 29.1566
duplicate2 27.6569
duplicate1 19.8065
V6 2.9481
V7 0.4883
V10 0.4299
V9 0.3435
V8 0.0000
plot(gbmImp, main = "Boosted Tree Variable Importance")set.seed(200)
cubist_model <- train(
y ~ .,
data = simulated,
method = "cubist",
trControl = ctrl
)
cubistImp <- varImp(cubist_model)
cubistImpcubist variable importance
Overall
V1 100.000
V2 85.507
V4 71.014
V3 58.696
V5 49.275
V6 11.594
duplicate2 7.971
V8 0.000
V7 0.000
duplicate1 0.000
V9 0.000
V10 0.000
plot(cubistImp, main = "Cubist Variable Importance")Similar results can be observed from the analysis of the boosted tree model (GBM). Here we can see that even for the boosted tree model, the distribution of importance is still partially spread among correlated predictors. The main variables (V1, V2, and V4) continue to play an important role, but also the duplicate predictors are considered important. This is indicative of how important variables’ importance continues to be shared among correlated predictors in the boosted tree algorithm.
On the contrary, the Cubist model yields different results. For this model, the concentration of predictor importance is seen mostly for the initial variables (V1, V2, and V4), while the duplicate variables get very low importance. This indicates that Cubist is not affected much by multicollinearity.
Exercise 8.2
set.seed(123)
n <- 200
# True signal
x_signal <- runif(n)
# Noise variables
x_continuous <- runif(n) # high granularity
x_binary <- sample(c(0, 1), n, replace = TRUE) # low granularity
# Response depends ONLY on signal
y <- 3 * x_signal + rnorm(n)
df <- data.frame(
y,
x_signal,
x_continuous,
x_binary
)
library(rpart)
library(rpart.plot)
tree_model <- rpart(y ~ ., data = df)
rpart.plot(tree_model, main = "Tree Bias Example")tree_model$variable.importance x_signal x_continuous x_binary
114.388953 13.911352 2.228692
A simulation was conducted to show how regression trees prefer predictors with higher granularity. For this purpose, we created a model where one predictor (x_signal) determined our response variable, while two other predictors acted as noises – a continuous one (x_continuous) and a binary one (x_binary).
As expected, most of the splits occurred using our predictor (x_signal). However, the continuous noise predictor (x_continuous) was used for splitting at least once, while our binary noise predictor (x_binary) was never used. Moreover, as we can see from variable importance scores, the importance of our continuous noise predictor is significantly higher than the importance of our binary predictor, despite their equal informativeness.
This happens because of the fact that continuous variables allow splitting at multiple different points, giving more opportunity for splits with lower error rates to be found, whereas binary variables give only one option. Thus, binary predictors become less likely to be used by the model.
Exercise 8.3
Exercise 8.3 A
For the plot on the left, because of the lower learning rate and bagging fraction, the model learns slowly and adds many predictors in the process. As a consequence, the importance of the predictors will be shared among many different predictors.
For the plot on the right, due to the high learning rate and large bagging fraction, the model will concentrate its effort on minimizing the error rate through the use of fewer predictors. Therefore, the predictors’ importance will be concentrated among just a few predictors, with a steep fall-off after that.
Exercise 8.3 B
The left model will probably have more predictive power than the right model because of the lower learning rate and smaller bagging fraction used. The learning rate makes the algorithm learn slowly and allows the model to include information from various predictors, thus making the algorithm more generalized.
The right model puts too much emphasis on a few predictors, which may cause noise in the algorithm and make it less predictive.
Exercise 8.3 C
A higher level of interaction will enable the model to capture more complex interactions between independent variables. This will enable many variables to contribute via interaction terms, spreading their contributions to more predictors.
This will make the curve less steep and will lower the degree of domination by a few predictors.
Exercise 8.7
library(AppliedPredictiveModeling)
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
library(gbm)
library(Cubist)
data("ChemicalManufacturingProcess", package = "AppliedPredictiveModeling")
chem_x <- ChemicalManufacturingProcess[, -which(names(ChemicalManufacturingProcess) == "Yield")]
chem_y <- ChemicalManufacturingProcess$Yield
set.seed(624)
train_index <- createDataPartition(chem_y, p = 0.8, list = FALSE)
train_x <- chem_x[train_index, ]
test_x <- chem_x[-train_index, ]
train_y <- chem_y[train_index]
test_y <- chem_y[-train_index]
pp <- preProcess(train_x, method = c("medianImpute", "center", "scale"))
train_x <- predict(pp, train_x)
test_x <- predict(pp, test_x)Exercise 8.7 A
ctrl <- trainControl(method = "cv", number = 10)
set.seed(624)
tree_model <- train(train_x, train_y, method = "rpart", trControl = ctrl)
set.seed(624)
rf_model <- train(train_x, train_y, method = "rf", trControl = ctrl)
set.seed(624)
gbm_model <- train(train_x, train_y, method = "gbm", trControl = ctrl, verbose = FALSE)
set.seed(624)
cubist_model <- train(train_x, train_y, method = "cubist", trControl = ctrl)results <- resamples(list(
Tree = tree_model,
RF = rf_model,
GBM = gbm_model,
Cubist = cubist_model
))
summary(results)
Call:
summary.resamples(object = results)
Models: Tree, RF, GBM, Cubist
Number of resamples: 10
MAE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
Tree 0.9415783 1.0738451 1.2239231 1.2104317 1.2970522 1.4734498 0
RF 0.7465905 0.7835893 0.8939922 0.9036463 1.0327632 1.0776648 0
GBM 0.6318791 0.7350843 0.9025933 0.8892779 1.0094747 1.2318036 0
Cubist 0.6701826 0.7447992 0.8441574 0.8287095 0.8829658 0.9725766 0
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
Tree 1.1802098 1.3344422 1.466828 1.498851 1.649388 1.993385 0
RF 0.8871032 0.9287008 1.192248 1.130584 1.245867 1.391208 0
GBM 0.8009295 0.9339374 1.157815 1.141010 1.334901 1.511166 0
Cubist 0.7992482 0.9069234 1.031942 1.039236 1.180590 1.281195 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
Tree 0.08860495 0.1854612 0.3550092 0.3367805 0.4584719 0.5806388 0
RF 0.57363345 0.6244933 0.6580719 0.6757886 0.7229862 0.8091094 0
GBM 0.44267369 0.5894177 0.6443627 0.6501936 0.7363517 0.8366172 0
Cubist 0.49247798 0.6146083 0.6567250 0.6903806 0.7892502 0.9379103 0
postResample(predict(tree_model, test_x), test_y) RMSE Rsquared MAE
1.5632389 0.5035625 1.1913334
postResample(predict(rf_model, test_x), test_y) RMSE Rsquared MAE
1.2436291 0.7200462 0.8853601
postResample(predict(gbm_model, test_x), test_y) RMSE Rsquared MAE
1.1262388 0.7425400 0.8478939
postResample(predict(cubist_model, test_x), test_y) RMSE Rsquared MAE
0.8836276 0.8399471 0.6663759
A number of different tree-based models have been constructed and analyzed; some of them include decision tree, random forest, GBM, and Cubist. According to the results of resampling, the worst performance belongs to the decision tree model; it had the highest RMSE and the lowest R², which means that its prediction ability is quite low. As for the other two algorithms, they demonstrate good performance, showing lower RMSE and higher R².
All tree-based models have shown rather good results in predicting the yield, but among them, the best one is the Cubist model; it showed the lowest value of RMSE and MAE, as well as the highest value of R², both on the resampling and test sets.
In general, the Cubist model was chosen as the best one because of its performance.
Exercise 8.7 B
varImp(rf_model)rf variable importance
only 20 most important variables shown (out of 57)
Overall
ManufacturingProcess32 100.000
BiologicalMaterial03 22.317
BiologicalMaterial06 21.384
BiologicalMaterial12 20.190
ManufacturingProcess31 17.968
ManufacturingProcess13 15.126
ManufacturingProcess09 11.922
ManufacturingProcess06 9.420
ManufacturingProcess36 8.557
BiologicalMaterial11 8.487
BiologicalMaterial02 7.554
BiologicalMaterial08 7.264
ManufacturingProcess25 7.023
ManufacturingProcess17 6.410
ManufacturingProcess27 5.984
BiologicalMaterial04 5.713
ManufacturingProcess30 5.617
BiologicalMaterial05 5.438
BiologicalMaterial09 5.342
ManufacturingProcess24 4.479
varImp(gbm_model)gbm variable importance
only 20 most important variables shown (out of 57)
Overall
ManufacturingProcess32 100.000
ManufacturingProcess31 28.301
ManufacturingProcess09 20.643
BiologicalMaterial12 20.043
ManufacturingProcess25 13.885
BiologicalMaterial03 13.553
ManufacturingProcess13 12.296
ManufacturingProcess27 12.207
BiologicalMaterial05 10.383
BiologicalMaterial08 9.808
ManufacturingProcess06 9.549
ManufacturingProcess17 8.229
BiologicalMaterial09 8.008
BiologicalMaterial10 7.458
ManufacturingProcess19 7.431
ManufacturingProcess01 7.135
BiologicalMaterial06 6.362
ManufacturingProcess43 6.191
ManufacturingProcess35 6.006
ManufacturingProcess37 5.900
varImp(cubist_model)cubist variable importance
only 20 most important variables shown (out of 57)
Overall
ManufacturingProcess32 100.00
ManufacturingProcess17 50.00
BiologicalMaterial03 42.05
ManufacturingProcess09 40.91
ManufacturingProcess33 38.64
BiologicalMaterial06 38.64
BiologicalMaterial02 37.50
ManufacturingProcess13 32.95
ManufacturingProcess04 32.95
ManufacturingProcess11 23.86
ManufacturingProcess15 21.59
ManufacturingProcess21 18.18
BiologicalMaterial12 18.18
ManufacturingProcess31 17.05
ManufacturingProcess26 17.05
ManufacturingProcess10 17.05
ManufacturingProcess14 14.77
ManufacturingProcess39 14.77
ManufacturingProcess36 13.64
ManufacturingProcess01 12.50
In terms of variable importance, the predictors that were considered most important were mostly the manufacturing process predictors. In all three models; namely, Random Forest, GBM, and Cubist, the predictor ManufacturingProcess32 had the highest importance.
While some biological material predictors like BiologicalMaterial03, BiologicalMaterial06, and BiologicalMaterial12 seemed to be high up in terms of importance, the majority of the predictors were the manufacturing process predictors. This can be seen more in the Cubist model, which is the best performing model in the tree-based analysis performed.
This could mean that the manufacturing process predictors had more impact on the prediction of yield than the biological material predictors. From a practical point of view, it means that the process predictors may be changed to improve yield whereas the biological material predictors only serve to measure material quality.
Similarities with the previous models (linear and nonlinear) include the presence of similar predictors at the top of the list, including ManufacturingProcess32.
Exercise 8.7 C
rpart.plot(tree_model$finalModel, main = "Optimal Tree Model")From the decision tree, ManufacturingProcess32 stands out as the most important variable since it is the only variable considered for the split in the first level. The tree splits the observations depending on whether the variable is below 0.17.
The average yield for observations that have ManufacturingProcess32 below 0.17 is around 39, while the average yield for those with the variable value above 0.17 is around 41. In other words, higher levels of this process variable are linked with better yields.
In addition, the variable is the most important predictor in all the models used in previous sections. Consequently, this implies that modifying this process variable will positively affect future yields.