library(caret)
library(tidyverse)
library(mlbench)
library(AppliedPredictiveModeling)
library(randomForest)
library(mice)
library(kableExtra)
library(gridExtra)
library(party)
library(xgboost)
set.seed(42)
set.seed(7)
train <- mlbench.friedman1(300)
train_x <- data.frame(train$x)
train_y <- train$y
test <- mlbench.friedman1(5000)
test_x <- data.frame(test$x)
test_y <- test$y
mod1 <- randomForest(y = train_y,x = train_x, ntree = 1000, importance = T)
varImp(mod1, scale = F) %>% data.frame() %>%
rownames_to_column() %>%
rename( 'Importance' = Overall, 'Variable' =rowname) %>%
ggplot() +
geom_text(aes(x = Variable, y = Importance, label = Variable), hjust = -.2, size = 3) +
geom_point(aes(x = Variable, y = Importance), size = 1) +
labs(title = 'Variable Importance') +
coord_flip() +
theme_minimal() +
theme(
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.text = element_text(color = 'Grey25'),
axis.title = element_text(color = 'Grey50')
)
Random Forest correctly identifies the variables contributing to the function. Each of the five informative variables has at least five times the importance of the noise features.
We perform the same exercise with conditional inference (CI) trees, first using the standard feature set.
f_data <- train_x
f_data$y <- train_y
modci <- cforest(y ~ ., data = f_data)
df1 <- varimp(modci) %>%
data.frame() %>%
rownames_to_column()
colnames(df1) <- c('Variable', 'Importance')
ggplot(df1) +
geom_text(aes(x = Variable, y = Importance, label = Variable), hjust = -.2, size = 3) +
geom_point(aes(x = Variable, y = Importance), size = 1) +
labs(title = 'Variable Importance') +
coord_flip() +
theme_minimal() +
theme(
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.text = element_text(color = 'Grey25'),
axis.title = element_text(color = 'Grey50')
)
X3 has an importance closer to the noise variables here than it did in the other models. If we treat splits like hypothesis tests, CI trees may be more prone to type 2 errors (failing to split when they should) but less prone to type 1 errors (splitting when they shouldn't).
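To make that trade-off concrete, the split threshold in party's CI forests can be tightened. This is a minimal sketch outside the original analysis; it assumes the cforest_unbiased() control interface from the party package, and the mincriterion value (1 minus the significance level for a split) is only illustrative.
# Illustrative only: a stricter split criterion should push borderline variables
# like X3 further toward the noise features.
modci_strict <- cforest(y ~ ., data = f_data,
                        controls = cforest_unbiased(mincriterion = 0.99))
varimp(modci_strict)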
Now we add a variable correlated with X1:
train_x_mod <- train_x  # modified feature set, reused for the XGBoost and Cubist fits below
train_x_mod$dupe1 <- train_x$X1 + rnorm(300) * .1
f_data_mod <- f_data
f_data_mod$dupe1 <- train_x_mod$dupe1
modci1 <- cforest(y ~ ., data = f_data_mod)
df1 <- varimp(modci1, conditional = F) %>%
data.frame() %>%
rownames_to_column()
colnames(df1) <- c('Variable', 'Importance')
ggplot(df1) +
geom_text(aes(x = Variable, y = Importance, label = Variable), hjust = -.2, size = 3) +
geom_point(aes(x = Variable, y = Importance), size = 1) +
labs(title = 'Variable Importance') +
coord_flip() +
theme_minimal() +
theme(
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.text = element_text(color = 'Grey25'),
axis.title = element_text(color = 'Grey50')
)
We see the usual pattern, with the new feature taking importance away from X1.
Conditional = T
df1 <- varimp(modci1, conditional = T) %>%
data.frame() %>%
rownames_to_column()
colnames(df1) <- c('Variable', 'Importance')
ggplot(df1) +
geom_text(aes(x = Variable, y = Importance, label = Variable), hjust = -.2, size = 3) +
geom_point(aes(x = Variable, y = Importance), size = 1) +
labs(title = 'Variable Importance') +
coord_flip() +
theme_minimal() +
theme(
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.text = element_text(color = 'Grey25'),
axis.title = element_text(color = 'Grey50')
)
The duplicated variable has a profound effect on the importance of X1: the two together come nowhere near the original importance of X1, although all of the variable importances have been reduced in magnitude. This overall reduction must be caused by the adjustment for correlation.
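As a quick sanity check on the correlation argument (not part of the original write-up), we can measure how strongly the duplicated feature tracks X1:
# dupe1 was built as X1 plus small noise, so the correlation should be very high.
cor(f_data_mod$X1, f_data_mod$dupe1)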
XGBoost
library(xgboost)
xgd <- train_x %>% as.matrix %>% xgb.DMatrix(label = train_y)
modxg <- xgb.train(data = xgd, nrounds = 750, subsample = .6, eta = .1)
xgb.importance(model = modxg)
Using the XGBoost feature importance function, the five signal features are at the top, although Features 2 and 4 (X3 and X5 in the other models) aren't far above the noise features.
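If the importance table reports positional labels like "Feature 2" rather than the column names, xgb.importance accepts a feature_names argument. A small sketch, assuming the column order of train_x matches the matrix used to build xgd:
# Report importance with X1..X10 labels instead of positional feature indices.
xgb.importance(feature_names = colnames(train_x), model = modxg)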
Correlated Feature (feature 10)
library(xgboost)
xgd <- train_x_mod %>% as.matrix %>% xgb.DMatrix(label = train_y)
modxg <- xgb.train(data = xgd, nrounds = 750, subsample = .6, eta = .1)
xgb.importance(model = modxg)
Using gain for variable importance seems to do a better job of recognizing the duplicated feature as a noise variable.
Cubist
library(Cubist)
mod_cub <- cubist(train_x, train_y, committees = 3)  # neighbors applies at predict() time, not here
varImp(mod_cub) %>%
rownames_to_column() %>%
rename( 'Importance' = Overall, 'Variable' =rowname) %>%
ggplot() +
geom_text(aes(x = Variable, y = Importance, label = Variable), hjust = -.2, size = 3) +
geom_point(aes(x = Variable, y = Importance), size = 1) +
labs(title = 'Variable Importance') +
coord_flip() +
theme_minimal() +
theme(
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.text = element_text(color = 'Grey25'),
axis.title = element_text(color = 'Grey50')
)
Going back to the standard varImp function from caret, we see that the Cubist model does a very good job of identifying the correct variables.
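Since neighbors in the Cubist package is applied at prediction time rather than at training time, it can be supplied to predict(). A brief sketch on the Friedman test set defined earlier; the value of 3 is illustrative.
# Instance-based correction with 3 neighbors at prediction time.
pred_cub <- predict(mod_cub, newdata = test_x, neighbors = 3)
postResample(pred_cub, test_y)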
Correlated Predictor
mod_cub <- cubist(train_x_mod, train_y, committees = 3)
varImp(mod_cub) %>%
rownames_to_column() %>%
rename( 'Importance' = Overall, 'Variable' =rowname) %>%
ggplot() +
geom_text(aes(x = Variable, y = Importance, label = Variable), hjust = -.2, size = 3) +
geom_point(aes(x = Variable, y = Importance), size = 1) +
labs(title = 'Variable Importance') +
coord_flip() +
theme_minimal() +
theme(
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.text = element_text(color = 'Grey25'),
axis.title = element_text(color = 'Grey50')
)
Cubist does well in not assigning much importance to the duplicated predictor, but the noise variable X6 is bumped up somewhat. Still, neither is close in magnitude to the signal predictors.
Our goal is to show that variables with more distinct values are more likely to be selected to split on. We start by creating four random variables:
- X11: continuous uniform random variable on (0, 10)
- X1: rounded version of X11, making it discrete
- X2: normally distributed random variable
- X3: normally distributed random variable
And a function:
\(y = 2X_2 + e^{X_3} + \varepsilon\), where \(\varepsilon\) is normally distributed noise.
X11 <- runif(1000,0,10)
X1 <- round(X11)
X2 <- rnorm(1000,10)
X3 <- rnorm(1000,1)
e <- rnorm(1000,0,10)
sim_x <- data.frame(X1,X2,X3)
sim_y <- X2*2 + exp(X3) + e
sim_x_gran <- data.frame(X11, X2, X3)
sim_y_gran <- X2*2 + exp(X3) + e
Model with X1
mod_sim <- randomForest(x = sim_x, y = sim_y)
importance(mod_sim, scale = T)
## IncNodePurity
## X1 11068.18
## X2 33802.57
## X3 53965.42
Model with X11
mod_sim1 <- randomForest(x = sim_x_gran, y = sim_y_gran, importance = T)
importance(mod_sim1, scale = T)
## %IncMSE IncNodePurity
## X11 1.530631 29907.38
## X2 18.987845 35454.45
## X3 57.349556 55926.01
Even though X11 is a noise variable, its importance is about the same as X2's according to node purity, which measures the total decrease in residual sum of squares from splits on the variable, averaged over all trees. The %IncMSE measure, computed by permuting each predictor in the OOB samples, correctly identifies X11 as unimportant.
Importance measures are defined here: https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/importance
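To make the two measures explicit, importance() can return them separately; a small sketch using the second model, which was fit with importance = TRUE:
# type = 1 is the permutation-based %IncMSE measure (computed on OOB samples);
# type = 2 is the node-purity (RSS reduction) measure.
importance(mod_sim1, type = 1)
importance(mod_sim1, type = 2)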
The high bagging fraction means there is little variation in the training data between boosting iterations. The variables that explain most of the variance in the response are chosen in the initial boosting iterations, and the high learning rate increases each tree's contribution, so trees built on nearly the whole dataset use up the explainable variance quickly. Few trees built from few variables means fewer important predictors.
A low bagging fraction will find a variety of predictors that explain variance over different subsets of the data. The low learning rate keeps the explainable variance from disappearing as quickly, so many trees wind up contributing to the model. Many trees built from many variables means more important predictors.
Stochastic gradient boosting does tend to yield better predictive performance because subsampling reduces variance by making successive trees less correlated. That said, I've had success with bagging fractions around .5, so .1 seems quite low.
A lower learning rate is also generally beneficial because it prevents individual trees from having an extreme effect on the model. .1 is in the normal range, though this does depend on the number of trees used. With these factors considered, I would expect the model with the learning rate and bagging fraction of .1 to perform better.
Increasing the interaction depth will cause the variable importance to drop off faster. The variables that explain more variance will be chosen for more splits (likely several per tree) in the initial trees, so the explainable variance evaporates quickly. If the interaction depth were set high enough, with a high learning rate and no subsampling, the model would be close to what a single standard regression tree would produce.
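As a rough illustration of the two settings being contrasted, here is a hedged sketch using the gbm package on the Friedman training data defined earlier as a stand-in (the exercise's original data is not loaded here); the parameter values are illustrative, not tuned.
library(gbm)
# Low learning rate and low bagging fraction: importance spread over many predictors.
gbm_low <- gbm(y ~ ., data = f_data, distribution = 'gaussian',
               n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1,
               interaction.depth = 1)
# High learning rate and high bagging fraction: importance concentrated in a few predictors.
gbm_high <- gbm(y ~ ., data = f_data, distribution = 'gaussian',
                n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9,
                interaction.depth = 1)
# Compare how quickly relative influence drops off across predictors.
summary(gbm_low, plotit = F)
summary(gbm_high, plotit = F)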
For this part, we'll mostly rely on cross-validation to evaluate model performance, given the small test-set sample. The test set can still be used to make sure there are no outlandish sets of predictions.
data(ChemicalManufacturingProcess)
zero_vars <- nearZeroVar(ChemicalManufacturingProcess)
ChemicalManufacturingProcess_new <- ChemicalManufacturingProcess[, -zero_vars]
pred <- dplyr::select(ChemicalManufacturingProcess_new, -ManufacturingProcess36) %>%
  mice(m = 3, method = 'rf', maxit = 2)
chem_imputed <- complete(pred)
yield <- chem_imputed$Yield
all_x <- select(chem_imputed, - Yield)
samples <- createDataPartition(yield, p = .8, list = F)
bc_trans <- preProcess(all_x, method = 'BoxCox')
cm_bc <- predict(bc_trans, all_x)
cm_train <- cm_bc[samples, ]
cm_test <- cm_bc[-samples, ]
yield_train <- yield[samples]
yield_test<- yield[-samples]
show_results <- function(model){
rslts <- model$results %>%
arrange(RMSE)
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
}
library(doParallel)
Random Forest
We try two different random forest packages, the first of which uses regularization. Random forests are fairly robust to overfitting, but having the additional hyperparameter to tune could help.
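For reference, the regularization in the RRF package penalizes splitting on features that have not yet been used; a coefReg below 1 favors reusing already-selected features and yields a more compact feature set. A hedged sketch calling RRF directly; the coefReg and ntree values are illustrative, and feaSet is the package's record of the features actually used.
library(RRF)
rrf_demo <- RRF(x = cm_train, y = yield_train, coefReg = 0.5, ntree = 200)
rrf_demo$feaSet  # indices of the features the regularized forest actually used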
cl <- makeCluster(detectCores())
registerDoParallel(cl)
param_grid <- expand.grid(mtry = c(2, 27, 39, 55), .coefReg = c(.1, .3, .5, .7))
rf_mod <- train(x = cm_train,
y = yield_train,
method = 'RRFglobal',
trControl = trainControl('repeatedcv', repeats = 3),
tuneGrid = param_grid,
preProcess = c('center', 'scale')
)
show_results(rf_mod)
| mtry | coefReg | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|---|
| 28 | 0.505 | 1.157263 | 0.6654183 | 0.9110522 | 0.2593933 | 0.1143435 | 0.1903930 |
| 28 | 1.000 | 1.167069 | 0.6561134 | 0.9160033 | 0.2597896 | 0.1152764 | 0.1906972 |
| 55 | 0.505 | 1.183974 | 0.6249334 | 0.9357544 | 0.2821034 | 0.1216053 | 0.2015996 |
| 55 | 1.000 | 1.193772 | 0.6225802 | 0.9356461 | 0.2814222 | 0.1186544 | 0.1953220 |
| 28 | 0.010 | 1.251883 | 0.5648011 | 1.0057630 | 0.2175844 | 0.1643909 | 0.1663055 |
| 2 | 1.000 | 1.309957 | 0.6145841 | 1.0648710 | 0.2473953 | 0.1438305 | 0.1751985 |
| 2 | 0.010 | 1.311756 | 0.6114346 | 1.0687536 | 0.2471386 | 0.1410520 | 0.1749794 |
| 2 | 0.505 | 1.312801 | 0.6094919 | 1.0680927 | 0.2439122 | 0.1405857 | 0.1742360 |
| 55 | 0.010 | 1.479277 | 0.4133462 | 1.2049655 | 0.3484808 | 0.2123346 | 0.2903198 |
stopCluster(cl)
Test Set
predict(rf_mod, cm_test) %>%
postResample(yield_test)
## RMSE Rsquared MAE
## 0.7622490 0.8821255 0.5641624
The lack of a fully expanded grid is a known issue with this model type, but we are still able to get an idea of how it will perform.
https://github.com/topepo/caret/issues/442
cl <- makeCluster(detectCores())
registerDoParallel(cl)
rf_mod <- train(x = cm_train,
y = yield_train,
method = 'rf',
trControl = trainControl('repeatedcv', repeats = 3),
tuneLength = 10,
preProcess = c('center', 'scale')
)
show_results(rf_mod)
| mtry | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| 25 | 1.154881 | 0.6627669 | 0.9077380 | 0.3054115 | 0.1506337 | 0.2325729 |
| 19 | 1.157538 | 0.6670418 | 0.9105194 | 0.3072278 | 0.1488363 | 0.2348299 |
| 31 | 1.158146 | 0.6566100 | 0.9118567 | 0.3103867 | 0.1533239 | 0.2388814 |
| 37 | 1.162195 | 0.6494036 | 0.9184967 | 0.3086253 | 0.1531924 | 0.2369912 |
| 43 | 1.165508 | 0.6444061 | 0.9193881 | 0.3099862 | 0.1539599 | 0.2419811 |
| 13 | 1.166363 | 0.6681474 | 0.9222304 | 0.2991675 | 0.1511563 | 0.2261212 |
| 49 | 1.176992 | 0.6349022 | 0.9295061 | 0.3072818 | 0.1572418 | 0.2380722 |
| 55 | 1.187580 | 0.6255677 | 0.9344489 | 0.3084683 | 0.1574402 | 0.2398400 |
| 7 | 1.194262 | 0.6606631 | 0.9591552 | 0.3008275 | 0.1536315 | 0.2288058 |
| 2 | 1.309829 | 0.6078894 | 1.0640405 | 0.2922168 | 0.1594827 | 0.2145089 |
stopCluster(cl)
predict(rf_mod, cm_test) %>%
postResample(yield_test)
## RMSE Rsquared MAE
## 0.7659082 0.8861514 0.5859631
Their performance is approximately the same, so we would choose the standard random forest of these two.
Cubist
cl <- makeCluster(detectCores())
registerDoParallel(cl)
param_grid <- expand.grid(.committees = c(1,3,5), .neighbors = c(1,3,5,7,9))
cub_mod <- train(x = cm_train,
y = yield_train,
method = 'cubist',
trControl = trainControl('repeatedcv', repeats = 3),
tuneGrid = param_grid,
preProcess = c('center', 'scale')
)
show_results(cub_mod)
| committees | neighbors | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|---|
| 5 | 3 | 1.029583 | 0.7108799 | 0.8010303 | 0.2747806 | 0.1470178 | 0.2079192 |
| 5 | 1 | 1.038426 | 0.7111855 | 0.7575884 | 0.3550227 | 0.1613835 | 0.2169354 |
| 5 | 5 | 1.089419 | 0.6844687 | 0.8508846 | 0.2614677 | 0.1509766 | 0.2027682 |
| 3 | 3 | 1.114090 | 0.6799769 | 0.8316262 | 0.3543068 | 0.1602972 | 0.2099361 |
| 3 | 1 | 1.114557 | 0.6901926 | 0.7833258 | 0.4155811 | 0.1717713 | 0.2398745 |
| 5 | 7 | 1.118693 | 0.6651688 | 0.8645219 | 0.2593058 | 0.1503381 | 0.1993100 |
| 5 | 9 | 1.163215 | 0.6295794 | 0.9023814 | 0.2549453 | 0.1643529 | 0.2021574 |
| 3 | 5 | 1.168829 | 0.6517273 | 0.8763304 | 0.3281129 | 0.1597636 | 0.2099609 |
| 3 | 7 | 1.198960 | 0.6316902 | 0.8907295 | 0.3288696 | 0.1572190 | 0.2004083 |
| 3 | 9 | 1.243138 | 0.5971839 | 0.9255320 | 0.3223195 | 0.1695675 | 0.2090795 |
stopCluster(cl)
The Cubist model performs much better than the random forest. I suspect this has to do with dataset size: parametric models tend to perform closer to nonparametric models on small datasets, and Cubist is something of a semiparametric model since it fits linear regression models within its rules.
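To see those within-rule linear regressions directly, the fitted Cubist object inside the caret model can be summarized (a quick check, not part of the tuning above):
# Prints each rule's conditions together with the linear model fit within it.
summary(cub_mod$finalModel)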
Test Set
predict(cub_mod, cm_test) %>%
postResample(yield_test)
## RMSE Rsquared MAE
## 0.6244370 0.8833883 0.4968323
XGBoost
cl <- makeCluster(detectCores())
registerDoParallel(cl)
param_grid <- expand.grid(eta = c(.025, .05), nrounds = c(1000), max_depth = c(3,4,5), gamma = c(0),
colsample_bytree = c(.6),
min_child_weight = c(.6,1),
subsample = c(.6))
xgb_mod <- train(x = cm_train,
y = yield_train,
method = 'xgbTree',
trControl = trainControl('repeatedcv', repeats = 2),
tuneGrid = param_grid,
preProcess = c('center', 'scale')
)
show_results(xgb_mod)
| eta | max_depth | gamma | colsample_bytree | min_child_weight | subsample | nrounds | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.025 | 5 | 0 | 0.6 | 1.0 | 0.6 | 1000 | 1.064211 | 0.6907474 | 0.8453107 | 0.2135513 | 0.1538367 | 0.1875079 |
| 0.025 | 3 | 0 | 0.6 | 1.0 | 0.6 | 1000 | 1.066742 | 0.6789256 | 0.8562385 | 0.2195220 | 0.1646512 | 0.1985797 |
| 0.050 | 4 | 0 | 0.6 | 0.6 | 0.6 | 1000 | 1.069007 | 0.6770233 | 0.8481751 | 0.2240927 | 0.1566163 | 0.1754718 |
| 0.050 | 5 | 0 | 0.6 | 1.0 | 0.6 | 1000 | 1.073203 | 0.6838848 | 0.8555583 | 0.2183363 | 0.1508189 | 0.1972080 |
| 0.025 | 4 | 0 | 0.6 | 0.6 | 0.6 | 1000 | 1.075705 | 0.6755844 | 0.8470594 | 0.1937811 | 0.1520972 | 0.1650256 |
| 0.025 | 4 | 0 | 0.6 | 1.0 | 0.6 | 1000 | 1.076213 | 0.6805889 | 0.8522590 | 0.1919182 | 0.1502313 | 0.1722959 |
| 0.050 | 5 | 0 | 0.6 | 0.6 | 0.6 | 1000 | 1.087120 | 0.6688570 | 0.8631735 | 0.2196770 | 0.1543973 | 0.1745584 |
| 0.050 | 3 | 0 | 0.6 | 1.0 | 0.6 | 1000 | 1.089527 | 0.6649618 | 0.8613373 | 0.2299412 | 0.1578870 | 0.1956012 |
| 0.025 | 5 | 0 | 0.6 | 0.6 | 0.6 | 1000 | 1.089731 | 0.6692567 | 0.8541825 | 0.2145866 | 0.1424727 | 0.1756773 |
| 0.025 | 3 | 0 | 0.6 | 0.6 | 0.6 | 1000 | 1.090687 | 0.6646234 | 0.8667459 | 0.2033608 | 0.1457812 | 0.1788588 |
stopCluster(cl)
Test Set
predict(xgb_mod, cm_test) %>%
postResample(yield_test)
## RMSE Rsquared MAE
## 0.7275301 0.8981322 0.5470986
XGBoost is the best so far, but isn't too different from Cubist. We'll keep regression trees as the base learner and see whether the semiparametric explanation above holds up.
XGBoost DART
cl <- makeCluster(detectCores())
registerDoParallel(cl)
param_grid <- expand.grid(eta = c(.025), nrounds = c(1000), gamma = c(.1), skip_drop = c(.2,.6), rate_drop = c(.2,.6), max_depth = c(9),
colsample_bytree = c(.6),
min_child_weight = c(.6),
subsample = c(.6))
xgb_dart <- train(x = cm_train,
y = yield_train,
method = 'xgbDART',
trControl = trainControl('repeatedcv', repeats = 2),
tuneGrid = param_grid,
preProcess = c('center', 'scale')
)
show_results(xgb_dart)
| max_depth | eta | rate_drop | skip_drop | min_child_weight | subsample | colsample_bytree | gamma | nrounds | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 0.025 | 0.2 | 0.6 | 0.6 | 0.6 | 0.6 | 0.1 | 1000 | 1.060240 | 0.7039839 | 0.8364756 | 0.2260349 | 0.1061089 | 0.1559207 |
| 9 | 0.025 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.1 | 1000 | 1.072227 | 0.6910420 | 0.8374601 | 0.2245514 | 0.1138196 | 0.1496667 |
| 9 | 0.025 | 0.2 | 0.2 | 0.6 | 0.6 | 0.6 | 0.1 | 1000 | 1.161304 | 0.6650710 | 0.8966838 | 0.2487948 | 0.1104787 | 0.1596107 |
| 9 | 0.025 | 0.6 | 0.2 | 0.6 | 0.6 | 0.6 | 0.1 | 1000 | 1.180713 | 0.6665166 | 0.9212160 | 0.2726684 | 0.1264783 | 0.1796319 |
stopCluster(cl)
predict(xgb_dart, cm_test) %>%
postResample(yield_test)
## RMSE Rsquared MAE
## 0.7288000 0.8718641 0.4993318
Even though we didn't tune over a very large grid, the DART booster gave similar performance to the standard tree booster. It's possible it could be tuned to beat the standard booster, but in this case we will choose the xgbTree model as our final model.
plot(varImp(xgb_mod), top = 10)
vs. Other Nonlinear Models
Comparisons can be found here: http://rpubs.com/thefexexpress/486553
First, we visualize our XGBoost model using the xgb.plot.multi.trees function, which projects all of the boosted trees onto a single tree and shows how often the top four predictors appear at each node position.
dm <- xgb.DMatrix(data = cm_bc %>% as.matrix, label = yield)
params <- list(eta = .025, gamma = 0, max_depth = 4,
               colsample_bytree = .6,
               min_child_weight = .6,
               subsample = .6,
               booster = 'gbtree')
final_model <- xgb.train(params = params, data = dm, nrounds = 1000, verbose = 0)
f_names <- colnames(cm_bc)
xgboost::xgb.plot.multi.trees(feature_names = f_names, model = final_model, features_keep = 4)
This doesn't tell us much that the variable importance measure hadn't already revealed. One thing worth noting: even though ManufacturingProcess32 had more than double the importance of any other feature, the distribution evens out after the root node.
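For comparison with the projected-tree plot, the importance table for this final model can be pulled directly, using the feature names defined above:
xgb.importance(feature_names = f_names, model = final_model)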
We'll use a single regression tree to try to get a better visual.
library(rpart.plot)
rp_mod <- rpart::rpart(Yield ~ ., data = chem_imputed, method = 'anova')
rpart.plot(rp_mod)
We can now see the paths that lead to high and low yields. With domain knowledge, this tree could be quite useful.
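A couple of quick companions to the plotted tree, if more detail is wanted; these are standard rpart summaries of the model fit above.
# Complexity-parameter table, useful for pruning decisions.
printcp(rp_mod)
# rpart's own variable importance for the single tree.
rp_mod$variable.importance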