library(caret)
library(tidyverse)
library(mlbench)
library(AppliedPredictiveModeling)
library(randomForest)
library(mice)
library(kableExtra)
library(gridExtra)
library(party)
library(xgboost)
set.seed(42)
set.seed(7)
train <- mlbench.friedman1(300)
train_x <- data.frame(train$x)
train_y <- train$y
test <- mlbench.friedman1(5000)
test_x <- data.frame(test$x)
test_y <- test$y
mod1 <- randomForest(y = train_y,x = train_x, ntree = 1000, importance = T)
varImp(mod1, scale = F) %>% data.frame() %>%
rownames_to_column() %>%
rename( 'Importance' = Overall, 'Variable' =rowname) %>%
ggplot() +
geom_text(aes(x = Variable, y = Importance, label = Variable), hjust = -.2, size = 3) +
geom_point(aes(x = Variable, y = Importance), size = 1) +
labs(title = 'Variable Importance') +
coord_flip() +
theme_minimal() +
theme(
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.text = element_text(color = 'Grey25'),
axis.title = element_text(color = 'Grey50')
)
Random Forest correctly identifies the variables contributing to the function. Each of the five informative variables has at least five times the importance of the noise features.
We perform the same exercise with conditional inference (CI) trees, first using the standard feature set.
f_data <- train_x
f_data$y <- train_y
modci <- cforest(y ~ ., data = f_data)
df1 <- varimp(modci) %>%
data.frame() %>%
rownames_to_column()
colnames(df1) <- c('Variable', 'Importance')
ggplot(df1) +
geom_text(aes(x = Variable, y = Importance, label = Variable), hjust = -.2, size = 3) +
geom_point(aes(x = Variable, y = Importance), size = 1) +
labs(title = 'Variable Importance') +
coord_flip() +
theme_minimal() +
theme(
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.text = element_text(color = 'Grey25'),
axis.title = element_text(color = 'Grey50')
)
X3 has an importance closer to the noise variables here than it did in the other models. If we treat splits like hypothesis tests, CI trees may be more prone to type 2 errors (failing to split when they should) but less prone to type 1 errors (splitting when they shouldn't).
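To make that trade-off concrete, the split threshold in party's CI forests can be tightened. This is a minimal sketch outside the original analysis; it assumes the cforest_unbiased() control interface from the party package, and the mincriterion value (1 minus the significance level for a split) is only illustrative.
# Illustrative only: a stricter split criterion should push borderline variables
# like X3 further toward the noise features.
modci_strict <- cforest(y ~ ., data = f_data,
                        controls = cforest_unbiased(mincriterion = 0.99))
varimp(modci_strict)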
Now we add a variable correlated with X1:
train_x_mod <- train_x  # modified feature set, reused for the XGBoost and Cubist fits below
train_x_mod$dupe1 <- train_x$X1 + rnorm(300) * .1
f_data_mod <- f_data
f_data_mod$dupe1 <- train_x_mod$dupe1
modci1 <- cforest(y ~ ., data = f_data_mod)
df1 <- varimp(modci1, conditional = F) %>%
data.frame() %>%
rownames_to_column()
colnames(df1) <- c('Variable', 'Importance')
ggplot(df1) +
geom_text(aes(x = Variable, y = Importance, label = Variable), hjust = -.2, size = 3) +
geom_point(aes(x = Variable, y = Importance), size = 1) +
labs(title = 'Variable Importance') +
coord_flip() +
theme_minimal() +
theme(
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.text = element_text(color = 'Grey25'),
axis.title = element_text(color = 'Grey50')
)
We see the usual pattern, with the new feature taking importance away from X1.
Conditional = T
df1 <- varimp(modci1, conditional = T) %>%
data.frame() %>%
rownames_to_column()
colnames(df1) <- c('Variable', 'Importance')
ggplot(df1) +
geom_text(aes(x = Variable, y = Importance, label = Variable), hjust = -.2, size = 3) +
geom_point(aes(x = Variable, y = Importance), size = 1) +
labs(title = 'Variable Importance') +
coord_flip() +
theme_minimal() +
theme(
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.text = element_text(color = 'Grey25'),
axis.title = element_text(color = 'Grey50')
)
The duplicated variable has a profound effect on the importance of X1: the two together come nowhere near the original importance of X1, although all of the variable importances have been reduced in magnitude. This overall reduction must be caused by the adjustment for correlation.
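As a quick sanity check on the correlation argument (not part of the original write-up), we can measure how strongly the duplicated feature tracks X1:
# dupe1 was built as X1 plus small noise, so the correlation should be very high.
cor(f_data_mod$X1, f_data_mod$dupe1)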
XGBoost
library(xgboost)
xgd <- train_x %>% as.matrix %>% xgb.DMatrix(label = train_y)
modxg <- xgb.train(data = xgd, nrounds = 750, subsample = .6, eta = .1)
xgb.importance(model = modxg)
Using the XGBoost feature importance function, the five signal features are at the top, although Features 2 and 4 (X3 and X5 in the other models) aren't far above the noise features.
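If the importance table reports positional labels like "Feature 2" rather than the column names, xgb.importance accepts a feature_names argument. A small sketch, assuming the column order of train_x matches the matrix used to build xgd:
# Report importance with X1..X10 labels instead of positional feature indices.
xgb.importance(feature_names = colnames(train_x), model = modxg)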
Correlated Feature (feature 10)
library(xgboost)
xgd <- train_x_mod %>% as.matrix %>% xgb.DMatrix(label = train_y)
modxg <- xgb.train(data = xgd, nrounds = 750, subsample = .6, eta = .1)
xgb.importance(model = modxg)
Using gain for variable importance seems to do a better job of recognizing the duplicated feature as a noise variable.
Cubist
library(Cubist)
mod_cub <- cubist(train_x, train_y, committees = 3)  # neighbors applies at predict() time, not here
varImp(mod_cub) %>%
rownames_to_column() %>%
rename( 'Importance' = Overall, 'Variable' =rowname) %>%
ggplot() +
geom_text(aes(x = Variable, y = Importance, label = Variable), hjust = -.2, size = 3) +
geom_point(aes(x = Variable, y = Importance), size = 1) +
labs(title = 'Variable Importance') +
coord_flip() +
theme_minimal() +
theme(
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.text = element_text(color = 'Grey25'),
axis.title = element_text(color = 'Grey50')
)
Going back to the standard varImp function from caret, we see that the Cubist model does a very good job of identifying the correct variables.
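Since neighbors in the Cubist package is applied at prediction time rather than at training time, it can be supplied to predict(). A brief sketch on the Friedman test set defined earlier; the value of 3 is illustrative.
# Instance-based correction with 3 neighbors at prediction time.
pred_cub <- predict(mod_cub, newdata = test_x, neighbors = 3)
postResample(pred_cub, test_y)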
Correlated Predictor
mod_cub <- cubist(train_x_mod, train_y, committees = 3)
varImp(mod_cub) %>%
rownames_to_column() %>%
rename( 'Importance' = Overall, 'Variable' =rowname) %>%
ggplot() +
geom_text(aes(x = Variable, y = Importance, label = Variable), hjust = -.2, size = 3) +
geom_point(aes(x = Variable, y = Importance), size = 1) +
labs(title = 'Variable Importance') +
coord_flip() +
theme_minimal() +
theme(
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.text = element_text(color = 'Grey25'),
axis.title = element_text(color = 'Grey50')
)
Cubist does well in not assigning much importance to the duplicated predictor, but the noise variable X6 is bumped up somewhat. Still, neither is close in magnitude to the signal predictors.
Our goal is to show that variables with more distinct values are more likely to be selected to split on. We start by creating four random variables:
- X11: continuous uniform random variable on (0, 10)
- X1: rounded version of X11, making it discrete
- X2: normally distributed random variable
- X3: normally distributed random variable
And a function:
\(y = 2X_2 + e^{X_3} + \varepsilon\), where \(\varepsilon\) is normally distributed noise.
X11 <- runif(1000,0,10)
X1 <- round(X11)
X2 <- rnorm(1000,10)
X3 <- rnorm(1000,1)
e <- rnorm(1000,0,10)
sim_x <- data.frame(X1,X2,X3)
sim_y <- X2*2 + exp(X3) + e
sim_x_gran <- data.frame(X11, X2, X3)
sim_y_gran <- X2*2 + exp(X3) + e
Model with X1
mod_sim <- randomForest(x = sim_x, y = sim_y)
importance(mod_sim, scale = T)
## IncNodePurity
## X1 11068.18
## X2 33802.57
## X3 53965.42
Model with X11
mod_sim1 <- randomForest(x = sim_x_gran, y = sim_y_gran, importance = T)
importance(mod_sim1, scale = T)
## %IncMSE IncNodePurity
## X11 1.530631 29907.38
## X2 18.987845 35454.45
## X3 57.349556 55926.01
Even though X11 is a noise variable, its importance is about the same as X2's according to node purity, which measures the total decrease in residual sum of squares from splits on the variable, averaged over all trees. The %IncMSE measure, computed by permuting each predictor in the OOB samples, correctly identifies X11 as unimportant.
Importance measures are defined here: https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/importance
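To make the two measures explicit, importance() can return them separately; a small sketch using the second model, which was fit with importance = TRUE:
# type = 1 is the permutation-based %IncMSE measure (computed on OOB samples);
# type = 2 is the node-purity (RSS reduction) measure.
importance(mod_sim1, type = 1)
importance(mod_sim1, type = 2)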
The high bagging fraction means there is little variation in the training data between boosting iterations. The variables that explain most of the variance in the response are chosen in the initial boosting iterations, and the high learning rate increases each tree's contribution, so trees built on nearly the whole dataset use up the explainable variance quickly. Few trees built from few variables means fewer important predictors.
A low bagging fraction will find a variety of predictors that explain variance over different subsets of the data. The low learning rate keeps the explainable variance from disappearing as quickly, so many trees wind up contributing to the model. Many trees built from many variables means more important predictors.
Stochastic gradient boosting does tend to yield better predictive performance because subsampling reduces variance by making successive trees less correlated. That said, I've had success with bagging fractions around .5, so .1 seems quite low.
A lower learning rate is also generally beneficial because it prevents individual trees from having an extreme effect on the model. .1 is in the normal range, though this does depend on the number of trees used. With these factors considered, I would expect the model with the learning rate and bagging fraction of .1 to perform better.
Increasing the interaction depth will cause the variable importance to drop off faster. The variables that explain more variance will be chosen for more splits (likely several per tree) in the initial trees, so the explainable variance evaporates quickly. If the interaction depth were set high enough, with a high learning rate and no subsampling, the model would be close to what a single standard regression tree would produce.
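As a rough illustration of the two settings being contrasted, here is a hedged sketch using the gbm package on the Friedman training data defined earlier as a stand-in (the exercise's original data is not loaded here); the parameter values are illustrative, not tuned.
library(gbm)
# Low learning rate and low bagging fraction: importance spread over many predictors.
gbm_low <- gbm(y ~ ., data = f_data, distribution = 'gaussian',
               n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1,
               interaction.depth = 1)
# High learning rate and high bagging fraction: importance concentrated in a few predictors.
gbm_high <- gbm(y ~ ., data = f_data, distribution = 'gaussian',
                n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9,
                interaction.depth = 1)
# Compare how quickly relative influence drops off across predictors.
summary(gbm_low, plotit = F)
summary(gbm_high, plotit = F)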
For this part, we'll mostly rely on cross-validation to evaluate model performance, given the small test-set sample. The test set can still be used to make sure there are no outlandish sets of predictions.
data(ChemicalManufacturingProcess)
zero_vars <- nearZeroVar(ChemicalManufacturingProcess)
ChemicalManufacturingProcess_new <- ChemicalManufacturingProcess[, -zero_vars]
pred <- dplyr::select(ChemicalManufacturingProcess_new, -ManufacturingProcess36) %>%
  mice(m = 3, method = 'rf', maxit = 2)
chem_imputed <- complete(pred)
yield <- chem_imputed$Yield
all_x <- select(chem_imputed, - Yield)
samples <- createDataPartition(yield, p = .8, list = F)
bc_trans <- preProcess(all_x, method = 'BoxCox')
cm_bc <- predict(bc_trans, all_x)
cm_train <- cm_bc[samples, ]
cm_test <- cm_bc[-samples, ]
yield_train <- yield[samples]
yield_test<- yield[-samples]
show_results <- function(model){
rslts <- model$results %>%
arrange(RMSE)
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
}
library(doParallel)
Random Forest
We try two different random forest packages, the first of which uses regularization. Random forests are fairly robust to overfitting, but having the additional hyperparameter to tune could help.
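For reference, the regularization in the RRF package penalizes splitting on features that have not yet been used; a coefReg below 1 favors reusing already-selected features and yields a more compact feature set. A hedged sketch calling RRF directly; the coefReg and ntree values are illustrative, and feaSet is the package's record of the features actually used.
library(RRF)
rrf_demo <- RRF(x = cm_train, y = yield_train, coefReg = 0.5, ntree = 200)
rrf_demo$feaSet  # indices of the features the regularized forest actually used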
cl <- makeCluster(detectCores())
registerDoParallel(cl)
param_grid <- expand.grid(mtry = c(2, 27, 39, 55), .coefReg = c(.1, .3, .5, .7))
rf_mod <- train(x = cm_train,
y = yield_train,
method = 'RRFglobal',
trControl = trainControl('repeatedcv', repeats = 3),
tuneGrid = param_grid,
preProcess = c('center', 'scale')
)
show_results(rf_mod)
| mtry | coefReg | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|---|
| 28 | 0.505 | 1.157263 | 0.6654183 | 0.9110522 | 0.2593933 | 0.1143435 | 0.1903930 |
| 28 | 1.000 | 1.167069 | 0.6561134 | 0.9160033 | 0.2597896 | 0.1152764 | 0.1906972 |
| 55 | 0.505 | 1.183974 | 0.6249334 | 0.9357544 | 0.2821034 | 0.1216053 | 0.2015996 |
| 55 | 1.000 | 1.193772 | 0.6225802 | 0.9356461 | 0.2814222 | 0.1186544 | 0.1953220 |
| 28 | 0.010 | 1.251883 | 0.5648011 | 1.0057630 | 0.2175844 | 0.1643909 | 0.1663055 |
| 2 | 1.000 | 1.309957 | 0.6145841 | 1.0648710 | 0.2473953 | 0.1438305 | 0.1751985 |
| 2 | 0.010 | 1.311756 | 0.6114346 | 1.0687536 | 0.2471386 | 0.1410520 | 0.1749794 |
| 2 | 0.505 | 1.312801 | 0.6094919 | 1.0680927 | 0.2439122 | 0.1405857 | 0.1742360 |
| 55 | 0.010 | 1.479277 | 0.4133462 | 1.2049655 | 0.3484808 | 0.2123346 | 0.2903198 |
stopCluster(cl)
Test Set
predict(rf_mod, cm_test) %>%
postResample(yield_test)
## RMSE Rsquared MAE
## 0.7622490 0.8821255 0.5641624
The lack of a fully expanded grid is a known issue with this model type, but we are still able to get an idea of how it will perform.
https://github.com/topepo/caret/issues/442
cl <- makeCluster(detectCores())
registerDoParallel(cl)
rf_mod <- train(x = cm_train,
y = yield_train,
method = 'rf',
trControl = trainControl('repeatedcv', repeats = 3),
tuneLength = 10,
preProcess = c('center', 'scale')
)
show_results(rf_mod)
| mtry | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| 25 | 1.154881 | 0.6627669 | 0.9077380 | 0.3054115 | 0.1506337 | 0.2325729 |
| 19 | 1.157538 | 0.6670418 | 0.9105194 | 0.3072278 | 0.1488363 | 0.2348299 |
| 31 | 1.158146 | 0.6566100 | 0.9118567 | 0.3103867 | 0.1533239 | 0.2388814 |
| 37 | 1.162195 | 0.6494036 | 0.9184967 | 0.3086253 | 0.1531924 | 0.2369912 |
| 43 | 1.165508 | 0.6444061 | 0.9193881 | 0.3099862 | 0.1539599 | 0.2419811 |
| 13 | 1.166363 | 0.6681474 | 0.9222304 | 0.2991675 | 0.1511563 | 0.2261212 |
| 49 | 1.176992 | 0.6349022 | 0.9295061 | 0.3072818 | 0.1572418 | 0.2380722 |
| 55 | 1.187580 | 0.6255677 | 0.9344489 | 0.3084683 | 0.1574402 | 0.2398400 |
| 7 | 1.194262 | 0.6606631 | 0.9591552 | 0.3008275 | 0.1536315 | 0.2288058 |
| 2 | 1.309829 | 0.6078894 | 1.0640405 | 0.2922168 | 0.1594827 | 0.2145089 |
stopCluster(cl)
predict(rf_mod, cm_test) %>%
postResample(yield_test)
## RMSE Rsquared MAE
## 0.7659082 0.8861514 0.5859631
Their performance is approximately the same, so we would choose the standard random forest of these two.
Cubist
cl <- makeCluster(detectCores())
registerDoParallel(cl)
param_grid <- expand.grid(.committees = c(1,3,5), .neighbors = c(1,3,5,7,9))
cub_mod <- train(x = cm_train,
y = yield_train,
method = 'cubist',
trControl = trainControl('repeatedcv', repeats = 3),
tuneGrid = param_grid,
preProcess = c('center', 'scale')
)
show_results(cub_mod)
| committees | neighbors | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|---|
| 5 | 3 | 1.029583 | 0.7108799 | 0.8010303 | 0.2747806 | 0.1470178 | 0.2079192 |
| 5 | 1 | 1.038426 | 0.7111855 | 0.7575884 | 0.3550227 | 0.1613835 | 0.2169354 |
| 5 | 5 | 1.089419 | 0.6844687 | 0.8508846 | 0.2614677 | 0.1509766 | 0.2027682 |
| 3 | 3 | 1.114090 | 0.6799769 | 0.8316262 | 0.3543068 | 0.1602972 | 0.2099361 |
| 3 | 1 | 1.114557 | 0.6901926 | 0.7833258 | 0.4155811 | 0.1717713 | 0.2398745 |
| 5 | 7 | 1.118693 | 0.6651688 | 0.8645219 | 0.2593058 | 0.1503381 | 0.1993100 |
| 5 | 9 | 1.163215 | 0.6295794 | 0.9023814 | 0.2549453 | 0.1643529 | 0.2021574 |
| 3 | 5 | 1.168829 | 0.6517273 | 0.8763304 | 0.3281129 | 0.1597636 | 0.2099609 |
| 3 | 7 | 1.198960 | 0.6316902 | 0.8907295 | 0.3288696 | 0.1572190 | 0.2004083 |
| 3 | 9 | 1.243138 | 0.5971839 | 0.9255320 | 0.3223195 | 0.1695675 | 0.2090795 |
stopCluster(cl)
The Cubist model performs much better than the random forest. I suspect this has to do with dataset size: parametric models tend to perform closer to nonparametric models on small datasets, and Cubist is something of a semiparametric model since it fits linear regression models within its rules.
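To see those within-rule linear regressions directly, the fitted Cubist object inside the caret model can be summarized (a quick check, not part of the tuning above):
# Prints each rule's conditions together with the linear model fit within it.
summary(cub_mod$finalModel)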
Test Set
predict(cub_mod, cm_test) %>%
postResample(yield_test)
## RMSE Rsquared MAE
## 0.6244370 0.8833883 0.4968323
XGBoost
cl <- makeCluster(detectCores())
registerDoParallel(cl)
param_grid <- expand.grid(eta = c(.025, .05), nrounds = c(1000), max_depth = c(3,4,5), gamma = c(0),
colsample_bytree = c(.6),
min_child_weight = c(.6,1),
subsample = c(.6))
xgb_mod <- train(x = cm_train,
y = yield_train,
method = 'xgbTree',
trControl = trainControl('repeatedcv', repeats = 2),
tuneGrid = param_grid,
preProcess = c('center', 'scale')
)
show_results(xgb_mod)
| eta | max_depth | gamma | colsample_bytree | min_child_weight | subsample | nrounds | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.025 | 5 | 0 | 0.6 | 1.0 | 0.6 | 1000 | 1.064211 | 0.6907474 | 0.8453107 | 0.2135513 | 0.1538367 | 0.1875079 |
| 0.025 | 3 | 0 | 0.6 | 1.0 | 0.6 | 1000 | 1.066742 | 0.6789256 | 0.8562385 | 0.2195220 | 0.1646512 | 0.1985797 |
| 0.050 | 4 | 0 | 0.6 | 0.6 | 0.6 | 1000 | 1.069007 | 0.6770233 | 0.8481751 | 0.2240927 | 0.1566163 | 0.1754718 |
| 0.050 | 5 | 0 | 0.6 | 1.0 | 0.6 | 1000 | 1.073203 | 0.6838848 | 0.8555583 | 0.2183363 | 0.1508189 | 0.1972080 |
| 0.025 | 4 | 0 | 0.6 | 0.6 | 0.6 | 1000 | 1.075705 | 0.6755844 | 0.8470594 | 0.1937811 | 0.1520972 | 0.1650256 |
| 0.025 | 4 | 0 | 0.6 | 1.0 | 0.6 | 1000 | 1.076213 | 0.6805889 | 0.8522590 | 0.1919182 | 0.1502313 | 0.1722959 |
| 0.050 | 5 | 0 | 0.6 | 0.6 | 0.6 | 1000 | 1.087120 | 0.6688570 | 0.8631735 | 0.2196770 | 0.1543973 | 0.1745584 |
| 0.050 | 3 | 0 | 0.6 | 1.0 | 0.6 | 1000 | 1.089527 | 0.6649618 | 0.8613373 | 0.2299412 | 0.1578870 | 0.1956012 |
| 0.025 | 5 | 0 | 0.6 | 0.6 | 0.6 | 1000 | 1.089731 | 0.6692567 | 0.8541825 | 0.2145866 | 0.1424727 | 0.1756773 |
| 0.025 | 3 | 0 | 0.6 | 0.6 | 0.6 | 1000 | 1.090687 | 0.6646234 | 0.8667459 | 0.2033608 | 0.1457812 | 0.1788588 |
stopCluster(cl)
Test Set
predict(xgb_mod, cm_test) %>%
postResample(yield_test)
## RMSE Rsquared MAE
## 0.7275301 0.8981322 0.5470986
XGBoost is the best so far, but isn't too different from Cubist. We'll keep regression trees as the base learner and see whether the semiparametric explanation above holds up.
XGBoost DART
cl <- makeCluster(detectCores())
registerDoParallel(cl)
param_grid <- expand.grid(eta = c(.025), nrounds = c(1000), gamma = c(.1), skip_drop = c(.2,.6), rate_drop = c(.2,.6), max_depth = c(9),
colsample_bytree = c(.6),
min_child_weight = c(.6),
subsample = c(.6))
xgb_dart <- train(x = cm_train,
y = yield_train,
method = 'xgbDART',
trControl = trainControl('repeatedcv', repeats = 2),
tuneGrid = param_grid,
preProcess = c('center', 'scale')
)
show_results(xgb_dart)
| max_depth | eta | rate_drop | skip_drop | min_child_weight | subsample | colsample_bytree | gamma | nrounds | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 0.025 | 0.2 | 0.6 | 0.6 | 0.6 | 0.6 | 0.1 | 1000 | 1.060240 | 0.7039839 | 0.8364756 | 0.2260349 | 0.1061089 | 0.1559207 |
| 9 | 0.025 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.1 | 1000 | 1.072227 | 0.6910420 | 0.8374601 | 0.2245514 | 0.1138196 | 0.1496667 |
| 9 | 0.025 | 0.2 | 0.2 | 0.6 | 0.6 | 0.6 | 0.1 | 1000 | 1.161304 | 0.6650710 | 0.8966838 | 0.2487948 | 0.1104787 | 0.1596107 |
| 9 | 0.025 | 0.6 | 0.2 | 0.6 | 0.6 | 0.6 | 0.1 | 1000 | 1.180713 | 0.6665166 | 0.9212160 | 0.2726684 | 0.1264783 | 0.1796319 |
stopCluster(cl)
predict(xgb_dart, cm_test) %>%
postResample(yield_test)
## RMSE Rsquared MAE
## 0.7288000 0.8718641 0.4993318
Even though we didn't tune over a very large grid, the DART booster gave similar performance to the standard tree booster. It's possible it could be tuned to beat the standard booster, but in this case we will choose the xgbTree model as our final model.
plot(varImp(xgb_mod), top = 10)
vs. Other Nonlinear Models
Comparisons can be found here: http://rpubs.com/thefexexpress/486553
First, we visualize our XGBoost model using the xgb.plot.multi.trees function, which projects all of the boosted trees onto a single tree and shows how often the top four predictors appear at each node position.
dm <- xgb.DMatrix(data = cm_bc %>% as.matrix, label = yield)
params <- list(eta = .025, gamma = 0, max_depth = 4,
               colsample_bytree = .6,
               min_child_weight = .6,
               subsample = .6,
               booster = 'gbtree')
final_model <- xgb.train(params = params, data = dm, nrounds = 1000, verbose = 0)
f_names <- colnames(cm_bc)
xgboost::xgb.plot.multi.trees(feature_names = f_names, model = final_model, features_keep = 4)
This doesn't tell us much that the variable importance measure hadn't already revealed. One thing worth noting: even though ManufacturingProcess32 had more than double the importance of any other feature, the distribution evens out after the root node.
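For comparison with the projected-tree plot, the importance table for this final model can be pulled directly, using the feature names defined above:
xgb.importance(feature_names = f_names, model = final_model)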
We'll use a single regression tree to try to get a better visual.
library(rpart.plot)
rp_mod <- rpart::rpart(Yield ~ ., data = chem_imputed, method = 'anova')
rpart.plot(rp_mod)
We can now see the paths that lead to high and low yields. With domain knowledge, this tree could be quite useful.
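A couple of quick companions to the plotted tree, if more detail is wanted; these are standard rpart summaries of the model fit above.
# Complexity-parameter table, useful for pruning decisions.
printcp(rp_mod)
# rpart's own variable importance for the single tree.
rp_mod$variable.importance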