# library modules
suppressWarnings(suppressMessages(library(grid)))
suppressWarnings(suppressMessages(library(gridExtra)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(tidyr)))
suppressWarnings(suppressMessages(library(caret)))
suppressWarnings(suppressMessages(library(C50)))
suppressWarnings(suppressMessages(library(rpart)))
suppressWarnings(suppressMessages(library(rpart.plot)))
suppressWarnings(suppressMessages(library(mlbench)))
suppressWarnings(suppressMessages(library(psych)))
suppressWarnings(suppressMessages(library(mi)))
# tree packages
suppressWarnings(suppressMessages(library(randomForest)))
suppressWarnings(suppressMessages(library(party)))
suppressWarnings(suppressMessages(library(gbm)))
suppressWarnings(suppressMessages(library(AppliedPredictiveModeling)))
suppressWarnings(suppressMessages(library(xgboost)))
suppressWarnings(suppressMessages(library(MLmetrics)))
suppressWarnings(suppressMessages(library(rattle)))
For each response and commentary provided by us, the text will be bolded, as it appears here.
Recreate the simulated data from Exercise 7.2:
set.seed(200)
simulated1 <- mlbench.friedman1(200, sd = 1)
simulated1 <- cbind(simulated1$x, simulated1$y)
simulated1 <- as.data.frame(simulated1)
colnames(simulated1)[ncol(simulated1)] <- "y"
It did not. The plot below shows that the uninformative predictors (V6–V10) are the 5 least important variables.
set.seed(200)
model1 <- randomForest(y ~ ., data = simulated1, importance = T, ntree = 1000)
rfImp1 <-
varImp(model1, scale = FALSE) %>%
mutate(Var = rownames(.)) %>%
arrange(desc(Overall)) %>%
dplyr::select(Var, Overall)
ggplot(rfImp1) +
geom_col(aes(x = reorder(Var, Overall), y = Overall)) +
labs(x = "", y = "") +
coord_flip() +
theme(axis.text = element_text(size=20, color="black", face = "bold"), plot.title = element_text(size=16)) +
ggtitle("No correlation")
set.seed(200)
simulated2 <-
simulated1 %>%
mutate(duplicate1 = V1 + rnorm(200) * .1)
cor(simulated2$duplicate1, simulated2$V1)
## [1] 0.9497025
Fit another random forest model to these data. Did the importance score for V1 change?
Yes, the importance score for V1 decreased from 8.61 to 6.80, and the highly correlated variable duplicate1 is the next most important at 4.21.
set.seed(200)
model2 <- randomForest(y ~ ., data = simulated2,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE) %>%
mutate(Var = rownames(.)) %>%
arrange(desc(Overall)) %>%
dplyr::select(Var, Overall)
ggplot(rfImp2) +
geom_col(aes(x = reorder(Var, Overall), y = Overall)) +
labs(x = "", y = "") +
coord_flip() +
theme(axis.text = element_text(size=20, color="black", face = "bold"), plot.title = element_text(size=16)) +
ggtitle("2 correlated variables")
What happens when you add another predictor that is also highly correlated with V1?
# add another correlated variable
set.seed(5)
simulated3 <-
simulated2 %>%
mutate(duplicate2 = V1 + rnorm(200) * .1)
cor(simulated3$duplicate2, simulated3$V1)
## [1] 0.9412195
From the plot below, we can see that adding another highly correlated variable further diluted the importance scores of V1 and duplicate1.
As KJ note on p. 180, when variables are extremely correlated, the “choice of which to use in a split is somewhat random.”
Highly correlated variables make the correct interpretation of predictor importance difficult. Using the model with 3 highly correlated variables, one gets the impression that V1, duplicate1, and duplicate2 are of only moderate importance to the model. However, without the other 2 variables, one would see that the signal coming from V1 is actually the most important to the model.
model3 <- randomForest(y ~ ., data = simulated3, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE) %>%
mutate(Var = rownames(.)) %>%
arrange(desc(Overall)) %>%
dplyr::select(Var, Overall)
ggplot(rfImp3) +
geom_col(aes(x = reorder(Var, Overall), y = Overall)) +
labs(x = "", y = "") +
coord_flip() +
theme(axis.text = element_text(size=20, color="black", face = "bold"), plot.title = element_text(size=16)) +
ggtitle("3 correlated variables")
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
Yes, in the 3 plots below, the same pattern is displayed, although perhaps to a lesser degree. The importance of V1 becomes more diluted as more highly correlated variables are added.
crf1 <- cforest(y ~ ., data = simulated1, controls = cforest_unbiased(ntree = 1000))
crf2 <- cforest(y ~ ., data = simulated2, controls = cforest_unbiased(ntree = 1000))
crf3 <- cforest(y ~ ., data = simulated3, controls = cforest_unbiased(ntree = 1000))
crf_imp <- data.frame(
crf3 = varimp(crf3),
crf1 = c(varimp(crf1), NA, NA),
crf2 = c(varimp(crf2), NA)
) %>%
dplyr::select(sort(names(.))) %>%
mutate(Var = row.names(.))
gridExtra::grid.arrange(
ggplot(na.omit(crf_imp)) +
geom_col(aes(x = reorder(Var, crf1), y = crf1)) +
labs(x = "", y = "") +
coord_flip() +
theme(axis.text = element_text(size=20, color="black", face = "bold"), plot.title = element_text(size=16)) +
ggtitle("No correlation"),
ggplot(na.omit(crf_imp[, c("Var", "crf2")])) +
geom_col(aes(x = reorder(Var, crf2), y = crf2)) +
labs(x = "", y = "") +
coord_flip() +
theme(axis.text = element_text(size=20, color="black", face = "bold"), plot.title = element_text(size=16)) +
ggtitle("2 correlated variables"),
ggplot(crf_imp) +
geom_col(aes(x = reorder(Var, crf3), y = crf3)) +
labs(x = "", y = "") +
coord_flip() +
theme(axis.text = element_text(size=20, color="black", face = "bold"), plot.title = element_text(size=16)) +
ggtitle("3 correlated variables"),
ncol = 3
)
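The plots above use the default (unconditional) permutation importance. As a sketch of the toggle described earlier, the Strobl et al. (2007) version can be requested through the conditional argument of party::varimp; this call is our own addition, it can be slow with 1000 trees, and its scores are not reported here.
# conditional importance (Strobl et al. 2007) for the 3-correlated-variable fit;
# it adjusts the permutation scheme for correlated predictors
set.seed(200)
varimp(crf3, conditional = TRUE)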
For the boosted tree (gbm) models below, V1 is, interestingly, only the second most important variable on the non-correlated dataset. When we add the first highly correlated variable, V1's position declines somewhat, but not as much as in the first random forest model. When the second highly correlated variable is added, V1's importance remains relatively unchanged.
gbm1 <- train(y ~ ., data = simulated1, method = "gbm", verbose = F)
gbm2 <- train(y ~ ., data = simulated2, method = "gbm", verbose = F)
gbm3 <- train(y ~ ., data = simulated3, method = "gbm", verbose = F)
gridExtra::grid.arrange(
plot(varImp(gbm1, scale = F), main = "No correlation"),
plot(varImp(gbm2, scale = F), main = "2 correlated variables"),
plot(varImp(gbm3, scale = F), main = "3 correlated variables"),
ncol = 3
)
From the plots below, the Cubist model's variable importance is the least affected by the addition of highly correlated variables. V1's variable importance remains largely unchanged when the first highly correlated variable is added. When the second one is added, V1 does drop a position in importance, but its score declines only marginally.
cubist1 <- train(y ~ ., data = simulated1, method = "cubist", control = Cubist::cubistControl(seed = 1))
cubist2 <- train(y ~ ., data = simulated2, method = "cubist", control = Cubist::cubistControl(seed = 1))
cubist3 <- train(y ~ ., data = simulated3, method = "cubist", control = Cubist::cubistControl(seed = 1))
gridExtra::grid.arrange(
plot(varImp(cubist1, scale = F), main = "No correlation"),
plot(varImp(cubist2, scale = F), main = "2 correlated variables"),
plot(varImp(cubist3, scale = F), main = "3 correlated variables"),
ncol = 3
)
Figure 8.24
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
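The textbook figure itself is not reproduced here; the sketch below shows one way the two extreme settings could be refit with caret on the solubility data. The solTrainXtrans and solTrainY objects come from the AppliedPredictiveModeling package, and the remaining gbm settings (n.trees, interaction.depth, n.minobsinnode) are our own illustrative choices rather than values from the text.
# sketch: variable importance under the two extreme (bagging fraction, learning rate)
# settings of Figure 8.24; bag.fraction is passed through caret to gbm()
data(solubility)
set.seed(100)
gbm_low <- train(solTrainXtrans, solTrainY, method = "gbm", verbose = FALSE,
bag.fraction = 0.1,
trControl = trainControl(method = "cv", number = 5),
tuneGrid = data.frame(n.trees = 1000, interaction.depth = 5,
shrinkage = 0.1, n.minobsinnode = 10))
set.seed(100)
gbm_high <- train(solTrainXtrans, solTrainY, method = "gbm", verbose = FALSE,
bag.fraction = 0.9,
trControl = trainControl(method = "cv", number = 5),
tuneGrid = data.frame(n.trees = 1000, interaction.depth = 5,
shrinkage = 0.9, n.minobsinnode = 10))
gridExtra::grid.arrange(
plot(varImp(gbm_low, scale = F), top = 25, main = "Bagging fraction / learning rate = 0.1"),
plot(varImp(gbm_high, scale = F), top = 25, main = "Bagging fraction / learning rate = 0.9"),
ncol = 2
)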
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models.
For the 4 models, we used 5-fold cross-validation in order to tune the hyperparameters. We will compare the models using RMSE and MAPE.
For the linear model and the 3 tree-based models, we created 2 preprocessed datasets. For both, we imputed the missing values using tree-based bagging. Additionally, for the elastic-net benchmark model, we removed near-zero-variance variables and, for interpretability, highly correlated variables. These steps reduced the dataset from 58 to 47 variables. At first, we also attempted to center and scale the elastic-net data. However, this made inter-model comparisons difficult: the RMSE units were no longer the same as those of the tree models, and centering the means at zero seemed to vastly inflate the MAPE values.
We set aside 20% of the observations to be the test dataset.
data(ChemicalManufacturingProcess)
set.seed(5)
# add mape to the summary functions
mapeSummary <- function (data,
lev = NULL,
model = NULL) {
c(MAPE = MLmetrics::MAPE(data$pred, data$obs) * 100,
RMSE = MLmetrics::RMSE(data$pred, data$obs),
Rsquared = summary(lm(pred ~ obs, data))$r.squared)
}
measure_cv_results <- function(train_obj){
# average the held-out fold metrics for the best tune (drop the Resample id column)
colMeans(train_obj$resample[, -4])
}
measure_test_results <- function(y_pred, y_test){
# measures the test set results
c(MAPE = MLmetrics::MAPE(y_pred, y_test) * 100,
RMSE = MLmetrics::RMSE(y_pred, y_test))
}
# preprocess for trees, impute missing
(chem_preprocess <- preProcess(ChemicalManufacturingProcess, method = c("bagImpute")))
## Created from 152 samples and 58 variables
##
## Pre-processing:
## - bagged tree imputation (58)
## - ignored (0)
chem_df <- predict(chem_preprocess, ChemicalManufacturingProcess)
# train-test partition
training_rows <- createDataPartition(chem_df$Yield, p =.8, list = F)
x_train <- chem_df[training_rows, ]
x_test <- chem_df[-training_rows, ]
y_test <- chem_df[-training_rows, "Yield"]
#### Linear Data ####
(chem_preprocess_lm <- preProcess(ChemicalManufacturingProcess, method = c("nzv", "corr", "bagImpute"))) # "center", "scale",
## Created from 152 samples and 58 variables
##
## Pre-processing:
## - bagged tree imputation (47)
## - ignored (0)
## - removed (11)
chem_df_lm <- predict(chem_preprocess_lm, ChemicalManufacturingProcess)
x_train_lm <- chem_df_lm[training_rows, ]
x_test_lm <- chem_df_lm[-training_rows, ]
y_test_lm <- chem_df_lm[-training_rows, "Yield"]
As previously mentioned, we chose the elastic-net model for the linear-model benchmark. We tested 10 sets of alphas and lambdas to regularize the model.
#### Linear Model ####
set.seed(5)
lm_control <- trainControl(
method = "cv",
number = 5, #num_cvs,
allowParallel = T,
# verboseIter = T,
savePredictions = "final",
summaryFunction = mapeSummary
)
lm_model <- train(Yield ~ .,
data = x_train_lm,
method = "glmnet",
trControl = lm_control,
tuneLength = 10
)
(lm_model$bestTune)
## alpha lambda
## 57 0.6 0.08159858
lm_y_pred <- predict(lm_model, x_test_lm)
# sources: http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/153-penalized-regression-essentials-ridge-lasso-elastic-net/#elastic-net
For the first tree model, we used CART and found that a cost-complexity parameter of approximately 0.09 was optimal.
#### CART Model ####
set.seed(5)
cart_control <- trainControl(
method = "cv",
number = 5, #num_cvs,
allowParallel = T,
# verboseIter = T,
savePredictions = "final",
summaryFunction = mapeSummary
)
cart_model <- train(Yield ~ .,
data = x_train,
method = "rpart",
metric="RMSE",
trControl = cart_control,
tuneLength = 10
)
(cart_model$bestTune)
## cp
## 9 0.08949137
cart_y_pred <- predict(cart_model, x_test)
# http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/141-cart-model-decision-tree-essentials/#loading-required-r-packages
For the second tree model, we tuned a Random Forest model. The optimal number of randomly selected features (mtry) was found to be 15.
#### Random Forest Model ####
set.seed(5)
rf_control <- trainControl(
method = "cv",
number = 5, #num_cvs,
allowParallel = T,
# verboseIter = T,
savePredictions = "final",
summaryFunction = mapeSummary
)
rf_grid <- expand.grid(mtry = seq(5, 30, 5))
rf_model <- train(Yield ~ .,
data = x_train,
method = "rf",
metric="RMSE",
trControl = rf_control,
tuneGrid = rf_grid,
ntree = 1000,
importance = T
)
(rf_model$bestTune)
## mtry
## 3 15
rf_y_pred <- predict(rf_model, x_test)
Finally, for the third tree model, we used Extreme Gradient Boosting. We tuned the learning rate (eta), the maximum tree depth, and the column subsampling ratio (colsample_bytree).
#### Extreme Gradient Boosting Trees ####
set.seed(5)
xgb_control <- trainControl(method = "cv",
number = 5, #num_cvs,
allowParallel = T,
# verboseIter = T,
savePredictions = "final",
summaryFunction = mapeSummary
)
xgb_grid <- expand.grid(nrounds=250,
eta = c(0.025, .05, .1), #.05, #
max_depth = c(5, 10, 20), #5, #
colsample_bytree = seq(.25, 1, .25), #.25, #
gamma = 0,
min_child_weight = 0,
subsample = 1
)
xgb_model <- train(Yield ~ .,
data = x_train,
method = "xgbTree",
trControl = xgb_control,
tuneGrid = xgb_grid
)
xgb_y_pred <- predict(xgb_model, x_test)
In the barplots shown below, we see that while the Extreme Gradient Boosting (EGB) model had the lowest RMSE and MAPE in cross-validation, the Random Forest (RF) model had the lowest RMSE and MAPE on the test dataset. On average, the RF model's test predictions were off by only about 2% (MAPE) and about 1 unit of Yield (RMSE). The EGB model was overfit to the training dataset.
# collect all the train model objects
all_models <- list(elastic_net = lm_model, cart = cart_model, rf = rf_model, xgboost = xgb_model)
# calculate the cv results
all_cv_results <-
lapply(all_models, measure_cv_results) %>%
data.frame() %>%
t() %>%
data.frame() %>%
tibble::rownames_to_column("model")
# collect all the y predictions
all_y_preds <- list(elastic_net = lm_y_pred, cart = cart_y_pred, rf = rf_y_pred, xgboost = xgb_y_pred)
# calculate the test results
all_test_results <-
lapply(all_y_preds, function(x) measure_test_results(x, y_test)) %>%
data.frame() %>%
t() %>%
data.frame() %>%
tibble::rownames_to_column("model")
gridExtra::grid.arrange(
ggplot(all_cv_results) + geom_col(aes(x = reorder(model, -MAPE), y = MAPE)) + coord_flip() + xlab("") + theme(axis.text = element_text(size=20, color="black", face = "bold"), plot.title = element_text(size=16)) + ggtitle("CV"),
ggplot(all_cv_results) + geom_col(aes(x = reorder(model, -RMSE), y = RMSE)) + coord_flip() + xlab("") + theme(axis.text = element_text(size=20, color="black", face = "bold"), plot.title = element_text(size=16)) + ggtitle("CV"),
ggplot(all_test_results) + geom_col(aes(x = reorder(model, -MAPE), y = MAPE)) + coord_flip() + xlab("") + theme(axis.text = element_text(size=20, color="black", face = "bold"), plot.title = element_text(size=16)) + ggtitle("Test"),
ggplot(all_test_results) + geom_col(aes(x = reorder(model, -RMSE), y = RMSE)) + coord_flip() + xlab("") + theme(axis.text = element_text(size=20, color="black", face = "bold"), plot.title = element_text(size=16)) + ggtitle("Test"),
ncol = 2
)
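The same comparison in tabular form (a small convenience sketch of ours; the _cv and _test suffixes are added by the join):
# join the CV and test metrics so the generalization gap is visible numerically
dplyr::inner_join(all_cv_results, all_test_results, by = "model",
suffix = c("_cv", "_test")) %>%
dplyr::arrange(RMSE_test)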
In the RF model, ManufacturingProcess32 is the most important variable; it is approximately twice as important as the next variable. Overall, manufacturing process variables account for 6 of the 10 most important features. However, because we did not remove highly correlated variables from the RF model, this ranking will likely not hold when we examine the elastic-net model.
rf_var_imp <- varImp(rf_model)
plot(rf_var_imp)
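To back up the count given above, a quick sketch (our own helper code, not part of the original chunks) tallies how many of the RF model's top 10 predictors are process versus biological variables:
# count ManufacturingProcess vs BiologicalMaterial variables among the RF top 10
rf_var_imp$importance %>%
tibble::rownames_to_column("Var") %>%
dplyr::arrange(dplyr::desc(Overall)) %>%
head(10) %>%
dplyr::count(type = ifelse(grepl("^ManufacturingProcess", Var), "process", "biological"))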
How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
As suspected, the top 10 important predictors, and the importance distribution in general, are vastly different for the elastic-net (EN) model. The leading ManufacturingProcess variable is more important than everything else by several orders of magnitude, and the top predictors are all manufacturing-process variables. In the EN model, only about 5 features are substantial sources of information, while the rest are near zero. This is in stark contrast to the much more evenly distributed importance in the RF model. Because we removed the highly correlated predictors, the EN model gives a much more accurate representation of predictor importance.
gridExtra::grid.arrange(
plot(rf_var_imp, main = "Random Forest"),
plot(varImp(lm_model), main = "ElasticNet"),
ncol = 2
)
Plot the optimal single tree with the distribution of Yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with Yield?
No, the plot of the tuned CART model does not provide any additional information about the predictors. The tree model had only one split, on the same ManufacturingProcess32 variable that was important to the RF model. If the ManufacturingProcess32 value was greater than 160, then the Yield was predicted to be 39 as opposed to 42.
fancyRpartPlot(cart_model$finalModel)
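For a plain-text view of the same single split, rpart.plot::rpart.rules prints the rule and the predicted Yield in each terminal node (a small sketch of ours; its output is not shown here):
# print the decision rule(s) of the final CART model as text
rpart.plot::rpart.rules(cart_model$finalModel, roundint = FALSE)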