library(mlbench)
library(AppliedPredictiveModeling)
library(caret)
library(randomForest)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(patchwork)
library(partykit)
library(Cubist)
library(gbm)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
# Horizontal bar chart of importance, most important predictor on top
p1 <- rfImp1 %>%
  mutate(var = rownames(rfImp1)) %>%
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = "seagreen") +
  labs(title = "Variable Importance", y = "Variable")
p1
Did the random forest model significantly use the uninformative predictors (V6–V10)?
Response: No. Based on the output above, the RF model gave the uninformative predictors V6-V10 importance scores close to zero (between about -0.17 and 0.17), well below the scores of V1, V2, V4, and V5 (about 2.0 to 8.7).
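As a rough check of the response above, the sketch below compares the largest absolute score among V6-V10 with the smallest score among V1-V5. The values are copied by hand from the printed `rfImp1` output rather than read from the model object, and the names `imp1`, `noise`, `max_noise`, and `min_signal` are illustrative only.

```r
# Importance scores copied from the rfImp1 output printed above
imp1 <- c(V1 = 8.732235404, V2 = 6.415369387, V3 = 0.763591825,
          V4 = 7.615118809, V5 = 2.023524577, V6 = 0.165111172,
          V7 = -0.005961659, V8 = -0.166362581, V9 = -0.095292651,
          V10 = -0.074944788)

noise <- paste0("V", 6:10)

# Largest absolute score among the uninformative predictors
max_noise <- max(abs(imp1[noise]))

# Smallest score among the informative predictors V1-V5
min_signal <- min(imp1[paste0("V", 1:5)])

# Noise relative to the weakest and the strongest informative predictor
c(vs_weakest = max_noise / min_signal,
  vs_strongest = max_noise / max(imp1))
```

The first ratio is the stricter comparison, since V3 is by far the weakest of the informative predictors in this run.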
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
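The observed correlation is close to what the construction implies. As a sketch, assuming V1 is uniform on [0, 1] (as the Friedman 1 inputs are) and the added noise is independent with standard deviation 0.1, the population correlation is sd(V1) / sqrt(Var(V1) + Var(noise)). The variable names below are illustrative.

```r
# Assumption: V1 ~ Uniform(0, 1), so Var(V1) = 1/12
var_v1 <- 1 / 12

# Noise added in the exercise: rnorm(200) * .1, i.e. sd = 0.1
var_noise <- 0.1^2

# cor(V1, V1 + e) = sd(V1) / sqrt(Var(V1) + Var(e)) for independent e
expected_cor <- sqrt(var_v1) / sqrt(var_v1 + var_noise)
expected_cor
```

The result is about 0.945, so the sample value above is in line with this noise level.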
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
## Overall
## V1 4.91687329
## V2 6.52816504
## V3 0.58711552
## V4 7.04870917
## V5 2.03115561
## V6 0.14213148
## V7 0.10991985
## V8 -0.08405687
## V9 -0.01075028
## V10 0.09230576
## duplicate1 3.80068234
## duplicate2 1.87721959
# Importance after adding duplicate1
p2 <- rfImp2 %>%
  mutate(var = rownames(rfImp2)) %>%
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = "seagreen") +
  labs(title = "VI Duplicate 1", y = "Variable")
# Importance after adding duplicate1 and duplicate2
p3 <- rfImp3 %>%
  mutate(var = rownames(rfImp3)) %>%
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = "seagreen") +
  labs(title = "VI Duplicate 2", y = "Variable")
# Side-by-side comparison of the three models (patchwork)
(p1 | p2 | p3)
Response: Yes. In the original model, V1 is the most important variable with a score of 8.73. After adding duplicate1 (correlation about 0.946 with V1), its score drops to 5.69 and it falls to third place behind V4 and V2. After adding duplicate2 (correlation about 0.941), its score drops further to 4.92 and it remains third. RF splits importance across correlated variables, which is why the importance is distributed across V1, duplicate1, and duplicate2. This shows why feature selection matters when predictors are as highly correlated as the ones above.
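The splitting of importance can be made concrete by adding up the scores of V1 and its near-copies in each model. This is a sketch using values copied by hand from the three printed outputs above; `v1_group`, `group_total`, and `v1_share` are illustrative names, not objects from the analysis.

```r
# Importance scores copied from the rfImp1, rfImp2 and rfImp3 outputs above
v1_group <- list(
  model1 = c(V1 = 8.732235404),
  model2 = c(V1 = 5.69119973, duplicate1 = 4.28331581),
  model3 = c(V1 = 4.91687329, duplicate1 = 3.80068234,
             duplicate2 = 1.87721959)
)

# Total importance carried by V1 and its near-copies in each model
group_total <- sapply(v1_group, sum)

# Share of that total still credited to V1 itself
v1_share <- sapply(v1_group, function(x) unname(x["V1"]) / sum(x))

round(rbind(group_total, v1_share), 2)
```

In this run the group total does not shrink as duplicates are added (it goes from about 8.7 to about 10.0 and 10.6), while the share credited to V1 itself falls below one half once both duplicates are present.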
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
model4 <- cforest(y ~ . , data = simulated)
cfImp4False <- varimp(model4, conditional = FALSE) %>% as.data.frame()
cfImp4True <- varimp(model4, conditional = TRUE) %>% as.data.frame()
cfImp4False
## .
## V1 5.42152778
## V2 5.79075433
## V3 -0.01425384
## V4 5.96294311
## V5 1.87652947
## V6 -0.06822257
## V7 0.07696942
## V8 -0.17067028
## V9 0.02155362
## V10 -0.13376134
## duplicate1 5.27935126
## duplicate2 2.92224516
cfImp4True
## .
## V1 2.442631466
## V2 5.022117020
## V3 -0.007635221
## V4 5.209258304
## V5 1.321172122
## V6 -0.079239512
## V7 0.012405765
## V8 -0.372933317
## V9 -0.106808893
## V10 -0.206104896
## duplicate1 2.130422253
## duplicate2 0.592237637
# Traditional (unconditional) permutation importance
p4 <- cfImp4False %>%
  rename(Overall = '.') %>%
  mutate(var = rownames(cfImp4False)) %>%
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = "seagreen") +
  labs(title = "Condition = False", y = "Variable")
# Conditional importance (Strobl et al., 2007)
p5 <- cfImp4True %>%
  rename(Overall = '.') %>%
  mutate(var = rownames(cfImp4True)) %>%
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = "seagreen") +
  labs(title = "Condition = True", y = "Variable")
p4 | p5
Response: When the conditional parameter is set to FALSE, V1's importance value is 5.42, making it the third most important variable behind V4 and V2, with duplicate1 close behind at 5.28 and duplicate2 at 2.92. This follows the pattern of a traditional RF, where the importance of V1 decreases as the duplicates take a share of it. When set to TRUE, V1's importance value drops to 2.44 and it is still the third most important variable, but the duplicates also drop (to 2.13 and 0.59) while V2 and V4 stay near 5. The conditional measure therefore adjusts for correlation by shrinking the whole correlated group, and does not simply follow the pattern of a traditional RF.
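One way to see the adjustment is to compute, per predictor, the fraction of the unconditional score that survives under the conditional measure. The sketch below uses values copied by hand from the two printed outputs above, restricted to predictors with clearly non-zero scores; `uncond`, `cond`, and `retained` are illustrative names.

```r
# Scores copied from the cfImp4False (unconditional) output above
uncond <- c(V1 = 5.42152778, V2 = 5.79075433, V4 = 5.96294311,
            V5 = 1.87652947, duplicate1 = 5.27935126,
            duplicate2 = 2.92224516)

# Scores copied from the cfImp4True (conditional) output above
cond <- c(V1 = 2.442631466, V2 = 5.022117020, V4 = 5.209258304,
          V5 = 1.321172122, duplicate1 = 2.130422253,
          duplicate2 = 0.592237637)

# Fraction of the unconditional score retained under conditioning
retained <- cond / uncond
round(sort(retained, decreasing = TRUE), 2)
```

In this run V2, V4, and V5 keep roughly 70% to 87% of their scores, while V1 and both duplicates keep less than half, which is consistent with the conditional measure penalizing the correlated group.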
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
gbmModel <- train(y ~ ., data = simulated, method = "gbm", distribution = "gaussian", verbose = FALSE)
gbmImp <- varImp(gbmModel)
gbmImp
## gbm variable importance
##
## Overall
## V4 100.0000
## V2 78.4226
## duplicate1 50.8513
## V1 43.5310
## V5 36.4829
## V3 32.2035
## duplicate2 11.8373
## V6 2.6307
## V7 2.1681
## V9 1.1758
## V8 0.5614
## V10 0.0000
cubistModel <- train(y ~ ., data = simulated, method = "cubist")
cubImp <- varImp(cubistModel)
cubImp
## cubist variable importance
##
## Overall
## V2 100.00
## V1 77.70
## V4 71.94
## V5 54.68
## V3 46.04
## duplicate2 35.97
## duplicate1 35.97
## V6 14.39
## V9 0.00
## V8 0.00
## V7 0.00
## V10 0.00
# caret scales these importances to 0-100; use new names so p5 is not overwritten
p6 <- gbmImp$importance %>%
  mutate(var = rownames(gbmImp$importance)) %>%
  ggplot(aes(Overall, reorder(var, Overall))) +
  geom_col(fill = "seagreen") +
  labs(title = "Boosted Tree Model: VI", y = "Variable")
p7 <- cubImp$importance %>%
  mutate(var = rownames(cubImp$importance)) %>%
  ggplot(aes(Overall, reorder(var, Overall))) +
  geom_col(fill = "seagreen") +
  labs(title = "Cubist Model: VI", y = "Variable")
p6 | p7
Response: Partly. When fitting boosted trees to the simulated data, importance is still shared within the correlated group: duplicate1 (50.85) scores slightly above V1 (43.53), which falls to fourth place, while duplicate2 (11.84) gets much less, because once one of the correlated predictors has been used, the others add little new information. The Cubist model displays a mixed pattern relative to the traditional RF: V1 remains the second most important variable (77.70), and the two duplicates tie at 35.97, below all five informative predictors.
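The rankings discussed in this response can be checked with a short sketch. The importance values are copied by hand from the printed `gbmImp` and `cubImp` outputs above (scaled 0 to 100 by caret); `rank_of` is a hypothetical helper written for this illustration, not a caret function.

```r
# Scaled importances copied from the gbm output above
gbm_imp <- c(V4 = 100, V2 = 78.4226, duplicate1 = 50.8513, V1 = 43.5310,
             V5 = 36.4829, V3 = 32.2035, duplicate2 = 11.8373, V6 = 2.6307,
             V7 = 2.1681, V9 = 1.1758, V8 = 0.5614, V10 = 0)

# Scaled importances copied from the Cubist output above
cubist_imp <- c(V2 = 100, V1 = 77.70, V4 = 71.94, V5 = 54.68, V3 = 46.04,
                duplicate2 = 35.97, duplicate1 = 35.97, V6 = 14.39,
                V9 = 0, V8 = 0, V7 = 0, V10 = 0)

# Hypothetical helper: rank 1 = most important; ties share the best rank
rank_of <- function(imp, vars) rank(-imp, ties.method = "min")[vars]

group <- c("V1", "duplicate1", "duplicate2")
rbind(gbm = rank_of(gbm_imp, group), cubist = rank_of(cubist_imp, group))
```

Ranks are used here rather than raw scores because the scaled values are relative to each model's top predictor and are not directly comparable across models.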