Exercises from Chapter 8 of the textbook Applied Predictive Modeling by Kuhn & Johnson
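The chunks below assume a setup chunk (not shown in the knitted output) that loads the following packages; this is my reconstruction of it based on the functions used later.
library(mlbench)                     # mlbench.friedman1()
library(randomForest)                # randomForest()
library(caret)                       # varImp(), preProcess(), postResample(), train()
library(gbm)                         # gbm(), gbm.fit()
library(Cubist)                      # cubist()
library(rpart)                       # rpart()
library(partykit)                    # as.party() for plotting the rpart tree
library(party)                       # cforest(), varimp(); attached last so its versions are found first
library(AppliedPredictiveModeling)   # ChemicalManufacturingProcess data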
set.seed(200)
# Friedman (1991) benchmark data: y depends only on V1-V5; V6-V10 are uninformative noise
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
From the output below it appears V6-V10 had minimal importance in the model; the random forest did not make significant use of these uninformative predictors.
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Yes, the importance of V1 decreased in this second model, though the drop is not quite proportional to the importance that duplicate1 (the added copy of V1) picked up.
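(The step that creates the correlated predictor isn't shown in this write-up; per the exercise text it is along these lines, with the exact noise scale being my assumption.)
# Add a predictor highly correlated with V1 (noise scale assumed)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)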
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
Next, I use the cforest function in the party package to fit a random forest model using conditional inference trees; the party function varimp can then calculate predictor importance, and its conditional argument toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model? They do appear to show the same pattern, with V6-V10 again not ranking highly.
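(A second duplicated predictor, duplicate2, also appears in the output below without its construction being shown; presumably it was built the same way, e.g. as another, noisier copy of V1, which would fit its lower ranking. The line below is purely a hypothetical reconstruction.)
# Hypothetical reconstruction only: a second, noisier copy of V1 (actual construction not shown)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .5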
model4 <- cforest(y ~ ., data = simulated)
varimp(model4)
## V1 V2 V3 V4 V5 V6
## 5.42152778 5.79075433 -0.01425384 5.96294311 1.87652947 -0.06822257
## V7 V8 V9 V10 duplicate1 duplicate2
## 0.07696942 -0.17067028 0.02155362 -0.13376134 5.27935126 2.92224516
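The modified, conditional importance from Strobl et al. (2007) can be requested directly through the conditional argument; it is considerably slower to compute, so it is not run here.
# Conditional (Strobl et al. 2007) importance; adjusts for correlated predictors such as duplicate1
varimp(model4, conditional = TRUE)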
boost_model <- gbm(y ~., data = simulated, distribution = "gaussian")
summary.gbm(boost_model)
## var rel.inf
## V4 V4 30.8498158
## V2 V2 20.9639117
## V1 V1 16.1293953
## duplicate1 duplicate1 11.5441703
## V5 V5 11.0492470
## V3 V3 8.5487055
## duplicate2 duplicate2 0.7137245
## V6 V6 0.2010299
## V7 V7 0.0000000
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
cub_sim <- cubist(simulated[, -11], simulated$y)   # column 11 is the response, y
cub_sim$usage
## Conditions Model Variable
## 1 0 100 V1
## 2 0 100 V2
## 3 0 100 V4
## 4 0 100 V5
## 5 0 100 duplicate1
## 6 0 0 V3
## 7 0 0 V6
## 8 0 0 V7
## 9 0 0 V8
## 10 0 0 V9
## 11 0 0 V10
## 12 0 0 duplicate2
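As a side note (not part of the original run), caret's varImp can also be applied directly to the cubist object; it averages the Conditions and Model usage columns above into a single ranking that is easier to compare with the other models.
# Single importance score per predictor (average of the Conditions and Model usage columns)
varImp(cub_sim)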
The additional models above (boosted trees and Cubist) also largely avoided relying on the uninformative predictors V6-V10.
Below we see that more importance is placed on x2 than on x, even though y is a function of x alone. x takes only two distinct values while x2 is continuous noise with many possible split points, so the tree’s split-selection bias favors the more granular predictor.
x <- rep(1:2, each = 100)              # informative predictor, but only two distinct values
x2 <- rnorm(200, mean = 0, sd = 4)     # uninformative continuous noise with many distinct values
y <- x + rnorm(200, mean = 0, sd = 1)  # response depends only on x
data <- data.frame(y, x, x2)
tree_simulation <- rpart(y ~ ., data = data)
varImp(tree_simulation)
## Overall
## x 0.2157706
## x2 0.6133085
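To underline that this is a granularity effect rather than anything about the response, a small follow-up sketch (my own addition; results are seed-dependent) replaces x2 with a version that has only four distinct values, which should shrink its importance.
# Same uninformative predictor, but with only 4 distinct values instead of a continuum
set.seed(624)                                      # arbitrary seed for the sketch
x2_coarse <- sample(seq(-8, 8, length.out = 4), 200, replace = TRUE)
coarse_data <- data.frame(y = y, x = x, x2 = x2_coarse)
varImp(rpart(y ~ ., data = coarse_data))           # x2's importance typically drops sharply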
Figure: predictor importance for the two stochastic gradient boosting models with different tuning parameters (left and right panels discussed below).
I believe it’s accurate to say that the right-hand model is essentially more stringent, due to its tuning parameters, in its threshold for considering a predictor important; it behaves like a greedy, rapid learner that concentrates importance on a handful of predictors. Alternatively, the left-hand model could be characterized as more methodical: it doesn’t need to learn fast, so it spends more time finding relationships in the data and spreads importance across more predictors.
While these are both extreme settings, I think the left-hand model would be better at predicting new data, as it has incorporated more diverse information, which should limit the overfitting issues the model on the right would likely have.
As the interaction depth increases, we would see more predictors receive non-zero importance, and importance would be spread more evenly among them; as the maximum number of nodes per tree grows, there is simply more room for additional predictors to be used in splits and accumulate importance.
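A quick way to sanity-check this on the simulated data from earlier (a sketch of my own; exact counts will vary with the seed and settings):
# Compare how many predictors get non-zero relative influence at two interaction depths
shallow_boost <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                     n.trees = 1000, interaction.depth = 1, shrinkage = 0.1)
deep_boost    <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                     n.trees = 1000, interaction.depth = 10, shrinkage = 0.1)
sum(summary(shallow_boost, plotit = FALSE)$rel.inf > 0)   # typically fewer active predictors
sum(summary(deep_boost, plotit = FALSE)$rel.inf > 0)      # importance spread over more predictors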
First I copy in the code from my previous work.
data(ChemicalManufacturingProcess)
chem <- as.data.frame(ChemicalManufacturingProcess)
knn_imp <- preProcess(chem, "knnImpute")   # kNN imputation; caret also centers and scales with this method
chem_imp <- predict(knn_imp, chem)
#split train/test
set.seed(3190)
sample_set <- sample(nrow(chem_imp), round(nrow(chem_imp)*0.75), replace = FALSE)
chem_train <-chem_imp[sample_set, ]
chem_train_x <- chem_train[, -1]
chem_train_y <- chem_train[, 1]
chem_test <-chem_imp[-sample_set, ]
chem_test_x <- chem_test[, -1]
chem_test_y <- chem_test[, 1]
Random Forest
This model doesn’t look spectacular: it explains about 57% of the variance in the test data, with a somewhat high RMSE of 0.684.
# note: randomForest's argument is `ntree`, so `ntrees = 1000` is ignored and the default 500 trees are grown (see output)
randf_model <- randomForest(chem_train_x, chem_train_y, importance = TRUE, ntrees = 1000)
randf_model
##
## Call:
## randomForest(x = chem_train_x, y = chem_train_y, importance = TRUE, ntrees = 1000)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 19
##
## Mean of squared residuals: 0.3666167
## % Var explained: 63.09
randf_pred <- predict(randf_model, newdata = chem_test_x)
postResample(pred = randf_pred, obs = chem_test_y)
## RMSE Rsquared MAE
## 0.6839209 0.5676939 0.4982828
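Tuning mtry might recover some performance; a sketch with caret (my own settings, not part of the original fit):
# Cross-validated search over mtry; tuneLength and fold count are arbitrary choices
set.seed(3190)
rf_tuned <- train(chem_train_x, chem_train_y, method = "rf",
                  importance = TRUE, tuneLength = 5,
                  trControl = trainControl(method = "cv", number = 5))
rf_tuned$bestTune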
Boosted
This boosted model appears to be worse than the random forest above: it has a lower R-squared and an even higher RMSE/MAE. That is largely down to the defaults used by gbm.fit, only 100 iterations at a shrinkage of 0.001 (visible in the trace below), which leave the model badly under-fit.
boosted_model <- gbm.fit(chem_train_x, chem_train_y, distribution = "gaussian")
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.9925 nan 0.0010 0.0008
## 2 0.9918 nan 0.0010 0.0007
## 3 0.9910 nan 0.0010 0.0008
## 4 0.9902 nan 0.0010 0.0008
## 5 0.9895 nan 0.0010 0.0007
## 6 0.9889 nan 0.0010 0.0002
## 7 0.9881 nan 0.0010 0.0008
## 8 0.9874 nan 0.0010 0.0008
## 9 0.9867 nan 0.0010 0.0005
## 10 0.9858 nan 0.0010 0.0008
## 20 0.9783 nan 0.0010 0.0006
## 40 0.9626 nan 0.0010 0.0008
## 60 0.9485 nan 0.0010 0.0008
## 80 0.9343 nan 0.0010 0.0007
## 100 0.9212 nan 0.0010 0.0006
boosted_model
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## There were 57 predictors of which 7 had non-zero influence.
boosted_pred <- predict(boosted_model, newdata = chem_test_x)
postResample(pred = boosted_pred, obs = chem_test_y)
## RMSE Rsquared MAE
## 0.9679884 0.4364299 0.7602427
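A fairer comparison would tune the boosting parameters rather than rely on gbm.fit's defaults; a sketch (grid values are my own choices):
# Cross-validated gbm over a small, arbitrary grid of boosting parameters
gbm_grid <- expand.grid(n.trees = c(500, 1000),
                        interaction.depth = c(1, 3, 5),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)
set.seed(3190)
gbm_tuned <- train(chem_train_x, chem_train_y, method = "gbm",
                   tuneGrid = gbm_grid, verbose = FALSE,
                   trControl = trainControl(method = "cv", number = 5))
gbm_tuned$bestTune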
Cubist
The Cubist model is comparable to the first model, the random forest, although the random forest does technically outperform it. Further tuning might find slight advantages, but at their base/default settings the random forest performs best on this data.
cub_model <- cubist(chem_train_x, chem_train_y)
cub_model
##
## Call:
## cubist.default(x = chem_train_x, y = chem_train_y)
##
## Number of samples: 132
## Number of predictors: 57
##
## Number of committees: 1
## Number of rules: 1
cub_pred <- predict(cub_model, newdata = chem_test_x)
postResample(pred = cub_pred, obs = chem_test_y)
## RMSE Rsquared MAE
## 0.6926415 0.5227459 0.5258773
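Cubist's two obvious knobs, committees at fit time and neighbors at prediction time, were left at their defaults above; a sketch of a slightly stronger fit (values assumed, not tuned):
# More committees when fitting; instance-based (neighbors) correction when predicting
cub_model10 <- cubist(chem_train_x, chem_train_y, committees = 10)
cub_pred10  <- predict(cub_model10, newdata = chem_test_x, neighbors = 5)
postResample(pred = cub_pred10, obs = chem_test_y)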
For the random forest model, the top 10 most important predictors are listed below. Again, over half are manufacturing process measures, although there is one more biological measure here than in my previous two models, and the biological measures appear a bit higher in the list.
randf_imp <- varImp(randf_model)
head(rownames(randf_imp)[order(randf_imp$Overall, decreasing=TRUE)], 10)
## [1] "ManufacturingProcess32" "BiologicalMaterial12" "ManufacturingProcess31"
## [4] "ManufacturingProcess36" "ManufacturingProcess13" "BiologicalMaterial06"
## [7] "ManufacturingProcess17" "ManufacturingProcess39" "BiologicalMaterial03"
## [10] "BiologicalMaterial02"
With this visualization I see that the biological predictors are only used when the first split, ManufacturingProcess32 < 0.192, is satisfied; that is the branch where the four biological predictors come into play.
rpart_tree <- rpart(Yield ~., data = chem_train)
tree_plot <- as.party(rpart_tree)
plot(tree_plot)
Further, a quick look at how ManufacturingProcess32 is distributed shows that the node's split point of 0.192 sits somewhat above the median, meaning that for roughly half of the samples no biological predictors are used at all.
summary(chem_train$ManufacturingProcess32)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.86552 -0.64216 -0.08632 0.02456 0.65480 2.69287
boxplot(chem_train$ManufacturingProcess32)
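The share of training samples that fall on the branch where no biological predictor is used can be checked directly, using the split value read off the tree above.
# Proportion of training samples with ManufacturingProcess32 >= the split point of 0.192
mean(chem_train$ManufacturingProcess32 >= 0.192)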