library(dplyr)
library(forecast)
library(ggplot2)
library(VIM) #missing data visualization
library(tidyr)
library(mice)
library(corrplot)
library(MASS)
library(mlbench)
library(caret)
library(randomForest)
library(party)
library(gbm)
library(Cubist)
library(rpart)
library(partykit)

Background

The purpose of this assignment was to explore the Regression Trees and Rule-Based Models exercises from Applied Predictive Modeling.


8.1

Recreate the simulated data from Exercise 7.2

set.seed(333)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

a.

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

model1 <- randomForest(y ~., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

Did the random forest model significantly use the uninformative predictors (V6 - V10)?

rfImp1
##          Overall
## V1   7.953508251
## V2   7.686331812
## V3   1.367135564
## V4  13.814340695
## V5   2.930750358
## V6   0.119768734
## V7   0.002923786
## V8  -0.072054632
## V9  -0.056535920
## V10  0.097763537

Based on the variable importance scores above: no, the random forest model did not significantly use the uninformative predictors. The higher the score, the more the model relied on that variable. From the output we see that V4 is the most important variable while V6 - V10 contribute essentially nothing.

b.

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9406464

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                Overall
## V1          5.31221962
## V2          7.30129078
## V3          1.26616343
## V4         13.18023036
## V5          2.28287156
## V6          0.15279471
## V7          0.03102843
## V8          0.05251663
## V9          0.04134801
## V10         0.03980573
## duplicate1  3.80228553

V1’s variable importance score dropped by roughly 2.6 points (from about 7.95 to 5.31) and the duplicate became a relatively important variable, since the splits that previously used V1 are now shared between the two correlated predictors.

If we add another predictor, we’d anticipate more of the same: further reduction in V1’s importance and another relatively important variable added into the fold:

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
#cor(simulated$duplicate2, simulated$V1)

model3 <- randomForest(y ~., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
##                Overall
## V1          4.83837467
## V2          7.12536811
## V3          0.92893853
## V4         14.71740817
## V5          2.46830532
## V6          0.16500678
## V7         -0.06673372
## V8         -0.05832928
## V9         -0.10428316
## V10         0.03491811
## duplicate1  2.50789007
## duplicate2  3.10902853

We did observe more of the same, but to a lesser extent than anticipated. It may be that the first variable entered into the model carries more weight than the highly correlated variables added later …

c.

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

cf_model <- cforest(y ~., data = simulated)
varimp(cf_model)
##           V1           V2           V3           V4           V5           V6 
##  5.392779814  6.035491423  0.714714499 12.720077988  2.641277048  0.222679079 
##           V7           V8           V9          V10   duplicate1   duplicate2 
##  0.004022258 -0.066346780 -0.177580241 -0.164574272  3.927698174  4.401368343

Aside from the importance score for V3, the pattern is very similar: V4 is the most important predictor and V6 - V10 remain unimportant.
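
The scores above use the traditional (unconditional) permutation importance. As a small sketch (my own addition, and note that this can be slow to compute), the modified measure from Strobl et al. (2007), which adjusts for correlated predictors such as our duplicates, could be requested with the conditional argument:

#conditional permutation importance - accounts for correlation among predictors
varimp(cf_model, conditional = TRUE)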

d.

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted Trees

gbm_model <- gbm(y ~., data = simulated, distribution = "gaussian")
summary.gbm(gbm_model)

##                   var    rel.inf
## V4                 V4 39.5938867
## V2                 V2 16.4944305
## V1                 V1 12.0812703
## V5                 V5 10.7043080
## duplicate2 duplicate2  7.7875330
## V3                 V3  7.0188408
## duplicate1 duplicate1  4.5213129
## V7                 V7  1.0065544
## V9                 V9  0.6335745
## V10               V10  0.1582889
## V6                 V6  0.0000000
## V8                 V8  0.0000000

Yes, it appears that the same pattern occurs. For our Boosted Tree (gbm), we observe that V4 remains our most influential variable while V6 - V10 are least influential.

Cubist

#subset x columns
simulated_x <- subset(simulated, select = -c(y))

#run model and check variable importance
cubist_model <- cubist(x = simulated_x, y = simulated$y, committees = 100)
varImp(cubist_model)
##            Overall
## V3            43.5
## V2            58.0
## V1            57.5
## duplicate2    15.5
## V4            44.0
## V5            29.5
## V9             8.5
## duplicate1     5.5
## V6             3.5
## V8             2.5
## V7             0.5
## V10            0.0

Although the rankings / scoring are different, the same pattern emerges.

8.2

Use a simulation to show tree bias with different granularities.

set.seed(333) #ensure replicability

#specify features with different granularities
hd <- sample(0:100000, 1000, replace = T)
md <- sample(0:1000, 1000, replace = T)
ld <- sample(0:10, 1000, replace = T)

#create y as a combination of our features with the addition of a random term
y <- hd + md + ld + rnorm(1000) #noise vector sized to match the 1000 simulated rows

df <- data.frame(hd, md, ld, y) #merge features into df for RF

rf_sim_model <- randomForest(y ~., data = df, importance = TRUE, ntree = 1000)
varImp(rf_sim_model, scale = FALSE)
##         Overall
## hd 1535806646.9
## md   -3021267.6
## ld     488092.8

We construct y as the sum of our different features (with differing granularities) and see in the output above that, as we move from low to high granularity (ld –> hd), the variable importance score increases. This suggests that, holding all else equal, greater granularity (more distinct values) leads to greater measured variable importance. One caveat is that hd also contributes most of the variance of y here, so part of the gap reflects signal size as well as granularity.
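
As a complementary sketch (my own addition, not part of the original output), the same bias can be seen even when the response is pure noise: a single CART tree still tends to select the high-granularity predictor for its splits, which isolates granularity from signal size:

set.seed(333)
hd2 <- sample(0:100000, 1000, replace = TRUE) #many distinct values
ld2 <- sample(0:10, 1000, replace = TRUE)     #few distinct values
y2 <- rnorm(1000)                             #pure noise, unrelated to either predictor
noise_df <- data.frame(hd2, ld2, y2)

#force a few splits so the selection bias is visible
rpart_noise <- rpart(y2 ~ ., data = noise_df, control = rpart.control(cp = 0, maxdepth = 3))
rpart_noise$variable.importance #hd2 typically dominates despite carrying no true signal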

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

Applied Predictive Modeling, Figure 8.24

a.

Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

For the boosting models, a lower bagging fraction means each tree is fit to a smaller random subset of the training data, and a lower shrinkage parameter means a slower learning rate: more iterations, each carrying a smaller impact. The model on the right (both parameters at 0.9) is therefore “greedy” (a rapid learner) and likely prone to over-fitting: it takes large steps down the gradient and repeatedly exploits the same strong predictors, which is why it concentrates its importance on just a few variables. Conversely, the model on the left (both at 0.1) samples smaller subsets and iterates more slowly, so across many small steps a wider set of predictors gets the chance to contribute, spreading importance across more variables and yielding a more robust, less error-prone model.
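
A small sketch of this on our simulated data from 8.1 (the parameter choices here are my own, not from the text) could compare relative influence when both parameters are set to 0.1 versus 0.9:

#both parameters low: slow learning on small random subsets of the data
gbm_left <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 500, shrinkage = 0.1, bag.fraction = 0.1,
                n.minobsinnode = 5) #small node size so the tiny bags can still split
#both parameters high: greedy learning on most of the data each iteration
gbm_right <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 500, shrinkage = 0.9, bag.fraction = 0.9,
                 n.minobsinnode = 5)

summary(gbm_left, plotit = FALSE)  #expect influence spread across more predictors
summary(gbm_right, plotit = FALSE) #expect influence concentrated in the top few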

b.

Which model do you think would be more predictive of other samples?

I think the model on the left, with the slower learning rate and the smaller bagging fraction, would provide more accurate predictions on other samples. Because it learns gradually and spreads modest weight across a greater breadth of predictors, it is less likely to over-fit the training data and should generalize better.

c.

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Interaction depth specifies the maximum depth of each tree, i.e., the number of splits each tree is allowed to perform. Increasing it would let every tree involve more predictors, so importance would be distributed over a larger set of variables. I’d therefore expect it to dampen the slope of the predictor-importance plot for either model, providing more of an “even spread” of importance across variables.
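
As a rough check (again with my own parameter choices, and using the simulated data rather than the solubility data of Fig. 8.24), we could refit at two interaction depths and compare how quickly the relative influence falls off:

gbm_shallow <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                   n.trees = 500, shrinkage = 0.1, interaction.depth = 1)
gbm_deep <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 500, shrinkage = 0.1, interaction.depth = 10)

summary(gbm_shallow, plotit = FALSE) #expect a steeper drop-off in relative influence
summary(gbm_deep, plotit = FALSE)    #expect a flatter, more even spread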

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

As in Exercises 6.3 and 7.5, we load in our data, convert it to a matrix, impute missing values, remove highly correlated features, center and scale, convert to dataframe (so that we have column names), and then train test split our data using a 75/25 split (75% training data):

# Load the data
library(AppliedPredictiveModeling)
data(package = "AppliedPredictiveModeling")$results[, "Item"]
data(ChemicalManufacturingProcess)

#convert data to matrix
cmp_matrix <- as.matrix(ChemicalManufacturingProcess)

#impute missing values
set.seed(333)
imputed_cmp_matrix <- mice(cmp_matrix, printFlag=F, method="pmm")
cmp_comp <- complete(imputed_cmp_matrix)

#remove highly correlated features
highly_correlated = findCorrelation(cor(cmp_comp), 0.80)
cmp_comp1 = cmp_comp[, -highly_correlated]

#center and scale
cmp_comp2 <- scale(cmp_comp1, center = TRUE, scale = TRUE)

#convert to dataframe (so that we have column names)
cmp_df <- as.data.frame(cmp_comp2)

#select features from model identified during stepAIC optimization - COMMENTED OUT
# cmp_df_simplified <- cmp_df[, c('Yield','ManufacturingProcess10','ManufacturingProcess06','ManufacturingProcess28', 'BiologicalMaterial05', 'ManufacturingProcess43', 'ManufacturingProcess20', 'ManufacturingProcess07', 'ManufacturingProcess02', 'ManufacturingProcess04', 'ManufacturingProcess13', 'ManufacturingProcess39', 'BiologicalMaterial10', 'BiologicalMaterial03', 'ManufacturingProcess34', 'ManufacturingProcess37', 
# 'ManufacturingProcess09', 'ManufacturingProcess32')]
# cmp_df_simplified #verify 

#train test split - 75/25 split
sample_size = round(nrow(cmp_df)*.75)
index <- sample(seq_len(nrow(cmp_df)), size = sample_size)
 
train <- cmp_df[index, ]
test <- cmp_df[-index, ]

Random Forest

We remove Yield from each set so that we have predictor-only (x) data and then train our random forest model. Note that the argument below is misspelled as ntrees (the correct name is ntree), so it is silently ignored and the default of 500 trees is used, as the model output confirms:

set.seed(333)

#remove yield from sets to feed models
train.x <- train[-c(1)]
test.x <- test[-c(1)]

#fit the model
rf_model <- randomForest(train.x, train$Yield, importance = TRUE, ntrees = 1000)
rf_model
## 
## Call:
##  randomForest(x = train.x, y = train$Yield, importance = TRUE,      ntrees = 1000) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##           Mean of squared residuals: 0.3845978
##                     % Var explained: 58.34

We see the model’s output statistics above, which indicate a reasonable model, though it explains only about 58% of the variance in Yield. We consider the accuracy statistics on the test data:

rfPred <- predict(rf_model, newdata = test.x)
postResample(pred = rfPred, obs = test$Yield)
##      RMSE  Rsquared       MAE 
## 0.6426172 0.7131018 0.4846558

As a rough rule of thumb on these centered-and-scaled data, we’d want an RMSE of 0.2 - 0.5, an R-squared >= 0.75, and an MAE as close to 0 as possible.

What we see is that, generally, our model performs well. The RMSE is just a bit high, indicating some remaining variance in the predictions and thus a little room for improvement in fit / predictive accuracy.

With this in mind, we consider a boosted model …

Boosted Trees

We use the generalized boosted regression models package gbm and specify a “gaussian” distribution because Yield is a continuous response variable:

gbm_model <- gbm.fit(train.x, train$Yield, distribution = "gaussian")
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.9223             nan     0.0010    0.0007
##      2        0.9218             nan     0.0010    0.0005
##      3        0.9212             nan     0.0010    0.0004
##      4        0.9203             nan     0.0010    0.0007
##      5        0.9195             nan     0.0010    0.0006
##      6        0.9188             nan     0.0010    0.0006
##      7        0.9183             nan     0.0010    0.0005
##      8        0.9176             nan     0.0010    0.0007
##      9        0.9170             nan     0.0010    0.0004
##     10        0.9163             nan     0.0010    0.0004
##     20        0.9106             nan     0.0010    0.0005
##     40        0.8976             nan     0.0010    0.0006
##     60        0.8854             nan     0.0010    0.0005
##     80        0.8740             nan     0.0010    0.0006
##    100        0.8618             nan     0.0010    0.0006
gbm_model
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## There were 40 predictors of which 8 had non-zero influence.

We review some of the model’s fitting / output statistics and see that a very small step size (shrinkage of 0.001, the default here) was used for only 100 iterations, also the default. With so small a learning rate and so few iterations, the training deviance decreases only modestly, so the model is likely underfit rather than fully boosted. We consider the accuracy statistics on the test data:

gbmPred <- predict(gbm_model, newdata = test.x)
postResample(pred = gbmPred, obs = test$Yield)
##      RMSE  Rsquared       MAE 
## 1.0666441 0.5085017 0.8376582

Our boosting model had higher error and a lower R-squared than the Random Forest model, so we rule out this untuned gradient boosting model for this application.
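
Before ruling boosting out entirely, it is worth remembering that gbm.fit was run with its defaults (100 trees, shrinkage of 0.001). A hedged sketch of how the model could instead be tuned through caret (the grid values below are my own assumptions, not results from the text):

gbm_grid <- expand.grid(n.trees = c(500, 1000),
                        interaction.depth = c(1, 3, 5),
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = 10)
set.seed(333)
gbm_tuned <- train(x = train.x, y = train$Yield, method = "gbm",
                   tuneGrid = gbm_grid,
                   trControl = trainControl(method = "cv", number = 5),
                   verbose = FALSE)
postResample(predict(gbm_tuned, test.x), test$Yield) #compare against the untuned fit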

Cubist

cube_model <- cubist(train.x, train$Yield)
cube_model
## 
## Call:
## cubist.default(x = train.x, y = train$Yield)
## 
## Number of samples: 132 
## Number of predictors: 40 
## 
## Number of committees: 1 
## Number of rules: 3

For background on cubist models:

“Cubist is a rule-based model that is an extension of Quinlan’s M5 model tree. A tree is grown where the terminal leaves contain linear regression models. These models are based on the predictors used in previous splits. Also, there are intermediate linear models at each step of the tree. A prediction is made using the linear regression model at the terminal node of the tree, but is “smoothed” by taking into account the prediction from the linear model in the previous node of the tree (which also occurs recursively up the tree). The tree is reduced to a set of rules, which initially are paths from the top of the tree to the bottom. Rules are eliminated via pruning and/or combined for simplification." - Cubist Regression Models

We note the model’s statistics and see that 1 committee with 3 rules was the result of this out-of-the-box model. We consider the accuracy statistics on test data:

cubePred <- predict(cube_model, newdata = test.x)
postResample(pred = cubePred, obs = test$Yield)
##      RMSE  Rsquared       MAE 
## 0.7480486 0.5561850 0.6085062

The Cubist model appears to perform better than our out-of-the-box Boosting model but still not better than our Random Forest model.
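
Similarly, the Cubist fit above uses a single committee and no instance-based correction. A sketch of tuning committees and neighbors via caret (grid values are assumptions on my part):

cubist_grid <- expand.grid(committees = c(1, 10, 50, 100),
                           neighbors = c(0, 1, 5, 9))
set.seed(333)
cubist_tuned <- train(x = train.x, y = train$Yield, method = "cubist",
                      tuneGrid = cubist_grid,
                      trControl = trainControl(method = "cv", number = 5))
postResample(predict(cubist_tuned, test.x), test$Yield)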

a.

Which tree-based regression model gives the optimal resampling and test set performance?

When we consider resampling and test-set performance, our Random Forest model was the strongest. Tuning the models might have led to a different result, but I wanted to compare them in a simple, baseline form, and in that comparison the Random Forest model, with an RMSE of 0.64, an R-squared of 0.71, and an MAE of 0.48, had the best performance.

Out of the box this is quite a good model, and with some tuning (e.g., of mtry) it could more likely than not be improved further.
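
For instance, a minimal sketch of tuning mtry with randomForest's built-in helper (the step factor and improvement threshold are my own choices):

set.seed(333)
tuned_mtry <- tuneRF(train.x, train$Yield, ntreeTry = 500,
                     stepFactor = 1.5, improve = 0.01, trace = FALSE)
tuned_mtry #OOB error at each mtry value tried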

b.

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

I first explored variable importance as numeric output, but it was unranked and harder to make sense of, so I elected to use the varImpPlot() function instead (a small ranked sketch of the numeric scores follows below for reference):

rf_imp <- varImp(rf_model, scale = FALSE)
#rf_imp #too unclear

varImpPlot(rf_model)
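
As an aside (a small sketch of my own), the numeric importance scores could also simply be ranked to produce a clearer top-10 list:

#rank the caret varImp scores and show the top 10
head(rf_imp[order(rf_imp$Overall, decreasing = TRUE), , drop = FALSE], 10)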

What we can take from the variable importance plot(s) above is that:

  • ManufacturingProcess32 is our most important variable, and
  • of the top 10 most important variables, 7 are manufacturing.

As was the case for our linear and nonlinear regression models, the process variables dominate the list.

When we consider the top 10 variables for our nonlinear regression model (HW8) vs. our linear regression model (HW7), we observe that ManufacturingProcess32, ManufacturingProcess13, ManufacturingProcess09, and ManufacturingProcess06 appeared in both models, whereas ManufacturingProcess17, ManufacturingProcess36, BiologicalMaterial03, BiologicalMaterial11, ManufacturingProcess11, and ManufacturingProcess30 are unique to the nonlinear SVM model.

When we next consider our regression tree model, we see that ManufacturingProcess06 ranks lower than it did in our linear and nonlinear regression models, and that ManufacturingProcess11 and ManufacturingProcess30 rank lower than in our nonlinear regression model. Otherwise the variable rankings are more or less consistent. The biggest difference is that, although importance becomes quite low toward the bottom of the rankings, our Random Forest model assigns some importance to all 40 variables.

This may be one reason that this form of bagging was most robust on test data …

c.

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

To plot the optimal single tree, we utilize the rpart() function in combination with as.party():

#train single tree model
rpart_tree <- rpart(Yield ~., data = train)

#produce tree plot
rpart_tree2 <- as.party(rpart_tree)
plot(rpart_tree2)

The resulting tree diagram does provide additional knowledge about the biological or process predictors and their relationship with yield.

From the plot we can see which variables sit highest in the tree (higher up –> greater importance) as well as the split values at which sub-branches are created. This information is useful for understanding which process variables are most influential on yield, and it also indicates the specific levels to which a process variable should be tuned in order to increase yield. The boxplots in the terminal nodes show the distribution of (scaled) Yield for the observations falling into each node, so tracing a path down the tree indicates which combinations of splits lead to greater vs. lesser Yield (e.g., a node centered near 2 is better than one centered near -2), though clearer labeling on those node charts would help.
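
To make those terminal-node panels easier to read, a small sketch (my own addition) that summarizes the mean scaled Yield in each terminal node:

#identify each training observation's terminal node and summarize Yield there
node_id <- predict(rpart_tree2, newdata = train, type = "node")
tapply(train$Yield, node_id, mean) #mean (scaled) Yield per terminal node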

Overall, I believe the past 3 assignments have reiterated that focusing on the most influential manufacturing processes (i.e., 32, 09, 11, and 17 from the chart above) would lead to the greatest net positive impact on Yield.