library(mlbench)
library(dplyr)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1 %>% arrange(desc(Overall))
As shown above, the variables V6-V10 were correctly recognized as far less significant to the model than V1-V5.
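The comparison further down refers to a highly correlated copy of V1, but that step is missing from the chunk above. Here is a minimal sketch of it, following the exercise text in Kuhn and Johnson (the 0.1 noise multiplier comes from the exercise; the refit is my own addition):
# Add a predictor highly correlated with V1 and refit the random forest
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
varImp(model2, scale = FALSE) %>% arrange(desc(Overall))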
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
# With mtry equal to the number of predictors, cforest effectively fits bagged trees
bagCtrl <- cforest_control(mtry = ncol(simulated) - 1)
baggedTree <- cforest(y ~ ., data = simulated, controls = bagCtrl)
set.seed(233)
condImp <- varimp(baggedTree, conditional = TRUE)
nonCondImp <- varimp(baggedTree, conditional = FALSE)
library(tibble)
library(DataExplorer)
# plot_bar returns ggplot objects, so par(mfrow = c(1, 2)) has no effect on them;
# draw the two importance plots separately
as.data.frame(condImp) %>% rownames_to_column("Var") %>% plot_bar(with = "condImp")
as.data.frame(nonCondImp) %>% rownames_to_column("Var") %>% plot_bar(with = "nonCondImp")
We can see above that both the conditional and unconditional variable importances handled the duplicated predictor far better than the random forest model, driving its importance to nearly zero. The conditional measure handled it best overall, properly recognizing the shared information and lowering the importance of both V1 and duplicate1.
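For a direct numeric comparison, the two measures can be placed side by side (a minimal sketch reusing the objects computed above):
# Side-by-side table of conditional vs. unconditional importances
data.frame(conditional = condImp,
           unconditional = nonCondImp[names(condImp)])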
library(gbm)
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(100)
gbmModel <- train(x = select(simulated, -y), y = simulated$y, method = "gbm",
                  tuneGrid = gbmGrid,
                  verbose = FALSE)
gbmImp <- varImp(gbmModel)
gbmImp$importance %>% rownames_to_column("Var") %>% plot_bar(with = "Overall")
library(Cubist)
cubistMod <- cubist(select(simulated, -y), simulated$y)
cubistImp <- varImp(cubistMod)
cubistImp %>% rownames_to_column("Var") %>% plot_bar(with = "Overall")
Because caret's varImp does not take a conditional argument, I computed a single importance ranking for each of the gbm and Cubist models. Cubist has the most interesting result: it shows exactly what we would expect, with the top five variables all important and the rest essentially worthless. The gbm importances look almost identical to the conditional cforest output. Overall, Cubist appears to have performed best.
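To make that comparison concrete, the top of each ranking can be printed directly (a sketch using the importance objects above):
# Top five predictors from each model's importance ranking
head(gbmImp$importance[order(-gbmImp$importance$Overall), , drop = FALSE], 5)
head(cubistImp[order(-cubistImp$Overall), , drop = FALSE], 5)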
We can see below that despite all three variables contributing to y with equal weight, the only variable the model ends up caring about is the first one, which has a large number of distinct values.
set.seed(123)
v1 <- runif(1000, 1, 1000)  # continuous: ~1000 distinct values
v2 <- rnorm(1000, 2, 300)
v3 <- rpois(1000, 2.4)      # discrete: only a handful of distinct values
y <- v1 + v2 + v3           # all three contribute with equal weight
sim <- data.frame(y, v1, v2, v3)
simModel <- randomForest(y ~ ., data = sim,
                         importance = TRUE,
                         ntree = 1000)
varImp(simModel)
Repeating the same process but reducing the number of unique values in the first variable, we find the model will always find the variable with the larger number of unique values and prefer it.
set.seed(123)
v1 <- round(runif(1000, 1, 5))  # rounded so v1 now has only ~5 distinct values
v2 <- rnorm(1000, 2, 300)
v3 <- rpois(1000, 2.4)
y <- v1 + v2 + v3
sim <- data.frame(y, v1, v2, v3)
simModel <- randomForest(y ~ ., data = sim,
                         importance = TRUE,
                         ntree = 1000)
varImp(simModel)
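A quick count of distinct values per predictor substantiates the granularity claim (a sketch):
# v1 now has only a handful of distinct values; v2 remains fully continuous
sapply(sim[c("v1", "v2", "v3")], function(x) length(unique(x)))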
The results seen are due to the different learning rates and bagging fractions chosen. The 0.1 values used for the left-hand model mean each tree sees only a small random sample of the data and each iteration contributes only a small step, so importance is spread across many predictors; the 0.9 values on the right make the model much greedier, concentrating importance on a few dominant predictors.
The model on the left, as it is far more balanced and far less likely to overfit to the few dominant predictors relied on by the model on the right.
Doing so would increase the number of predictors used in each tree, spreading importance across more of them, making the results more balanced and the importance slope flatter.
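The first point can be sketched by refitting two boosted models on the simulated data from earlier (the text's figure uses the solubility data; every setting other than the 0.1 and 0.9 shrinkage and bag fractions is an assumption):
# Two boosted models: 0.1 vs. 0.9 for both the learning rate and bag fraction.
# n.minobsinnode is lowered so trees can still split on the small 10% bags.
gbmLeft <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1,
               n.minobsinnode = 5)
gbmRight <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9,
                n.minobsinnode = 5)
summary(gbmLeft, plotit = FALSE)   # importance spread across many predictors
summary(gbmRight, plotit = FALSE)  # importance concentrated in a few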
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
library(RANN)
impute <- preProcess(ChemicalManufacturingProcess, "knnImpute")
chem_data <- predict(impute, ChemicalManufacturingProcess)
chem_data <- chem_data %>% select(-nearZeroVar(.))  # drop near-zero-variance predictors
train_index_chem <- createDataPartition(chem_data$Yield, p = .8, list = FALSE)
train_chem <- chem_data[train_index_chem, ]
test_chem <- chem_data[-train_index_chem, ]
randomForChem <- randomForest(Yield ~ ., data = train_chem,
                              importance = TRUE,
                              ntree = 1000)
# As before, mtry equal to the number of predictors gives bagged trees via cforest
bagCtrl <- cforest_control(mtry = ncol(train_chem) - 1)
baggedTreeChem <- cforest(Yield ~ ., data = train_chem, controls = bagCtrl)
set.seed(100)
gbmModelChem <- train(Yield ~ ., data = train_chem, method = "gbm",
                      tuneGrid = gbmGrid,
                      verbose = FALSE)
set.seed(122)
cubistModChem <- cubist(select(train_chem, -Yield), train_chem$Yield)
cubistModChemAlt <- train(Yield ~ ., data = train_chem,
                          method = "cubist")
As seen below, the caret-tuned Cubist model has the best overall score (lowest RMSE and MAE, highest R-squared).
postResample(predict(randomForChem, test_chem), test_chem$Yield)
RMSE Rsquared MAE
0.6112199 0.5735795 0.4738552
postResample(predict(baggedTreeChem, newdata = test_chem), test_chem$Yield)
RMSE Rsquared MAE
0.6389091 0.5458071 0.4931080
postResample(predict(gbmModelChem, test_chem), test_chem$Yield)
RMSE Rsquared MAE
0.5924979 0.5978230 0.4380292
postResample(predict(cubistModChem, test_chem), test_chem$Yield)
RMSE Rsquared MAE
0.4931156 0.7252979 0.3967717
postResample(predict(cubistModChemAlt , test_chem), test_chem$Yield)
RMSE Rsquared MAE
0.4718199 0.7488810 0.3751559
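Collecting the metrics above into one table makes the ranking easier to read (a sketch reusing the fitted models):
# One comparison table, sorted by test-set RMSE
results <- rbind(
  rf = postResample(predict(randomForChem, test_chem), test_chem$Yield),
  bagged = postResample(predict(baggedTreeChem, newdata = test_chem), test_chem$Yield),
  gbm = postResample(predict(gbmModelChem, test_chem), test_chem$Yield),
  cubist = postResample(predict(cubistModChem, test_chem), test_chem$Yield),
  cubistCaret = postResample(predict(cubistModChemAlt, test_chem), test_chem$Yield))
results[order(results[, "RMSE"]), ]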
plot(varImp(cubistModChemAlt),top=10)
This model further reinforces the importance of the process variables: they dominate the single tree below. It also gives a clearer view of how to increase the yield of a given sample.
library(rpart.plot)
# Fit a single, interpretable regression tree for Yield
yieldTree <- rpart(Yield ~ ., data = train_chem)
rpart.plot(yieldTree)
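To quantify how much the process variables dominate, the top ten predictors from the tuned Cubist model can be tallied by type (a sketch; the name prefixes come from the data set itself):
# Count manufacturing-process vs. biological predictors among the top ten
imp <- varImp(cubistModChemAlt)$importance
top10 <- rownames(imp)[order(-imp$Overall)][1:10]
table(ifelse(grepl("^Manufacturing", top10), "process", "biological"))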