library(mlbench)
library(mice)
library(randomForest)
library(caret)
library(ggplot2)
library(party)
library(gbm)
library(Cubist)
library(dplyr)
library(rpart)
library("partykit")
8.1. Recreate the simulated data from Exercise 7.2:
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
(a) Did the random forest model significantly use the uninformative predictors (V6 – V10)?
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
We can see that V1 has the highest importance score, followed by V4 and V2, while V6 - V10 have scores near zero. Therefore, the random forest model did not significantly use the uninformative predictors.
(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
The importance score for V1 in the first model was 8.73. In the second model it drops to 5.69, a reduction of about 3 points, while the duplicate becomes a relatively important variable.
If we add another predictor that is highly correlated with V1, we would anticipate a further reduction in V1’s importance.
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
#cor(simulated$duplicate2, simulated$V1)
model3 <- randomForest(y ~., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
## Overall
## V1 4.91687329
## V2 6.52816504
## V3 0.58711552
## V4 7.04870917
## V5 2.03115561
## V6 0.14213148
## V7 0.10991985
## V8 -0.08405687
## V9 -0.01075028
## V10 0.09230576
## duplicate1 3.80068234
## duplicate2 1.87721959
As expected, V1’s importance score is reduced further, with the importance now shared between V1 and its two duplicates.
Conditional Inference Forest (cforest)
cf_model <- cforest(y ~., data = simulated)
varimp(cf_model)
## V1 V2 V3 V4 V5 V6
## 5.42152778 5.79075433 -0.01425384 5.96294311 1.87652947 -0.06822257
## V7 V8 V9 V10 duplicate1 duplicate2
## 0.07696942 -0.17067028 0.02155362 -0.13376134 5.27935126 2.92224516
With the exception of the importance score for V3, the pattern is similar: V4 is the most important predictor and V6 - V10 are unimportant.
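By default varimp() reports unconditional permutation importance. A minimal sketch of the conditional scheme, which adjusts for correlated predictors such as the duplicates (assuming the varimp() loaded above accepts the documented conditional argument; the call is slower and its output is not shown here):
#Conditional permutation importance: each predictor is permuted within strata
#of correlated predictors, so V1 and its duplicates share less credit
varimp(cf_model, conditional = TRUE)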
Generalized Boosted Regression Modeling (GBM)
gbm_model <- gbm(y ~., data = simulated, distribution = "gaussian")
summary.gbm(gbm_model)
## var rel.inf
## V4 V4 30.8498158
## V2 V2 20.9639117
## V1 V1 16.1293953
## duplicate1 duplicate1 11.5441703
## V5 V5 11.0492470
## V3 V3 8.5487055
## duplicate2 duplicate2 0.7137245
## V6 V6 0.2010299
## V7 V7 0.0000000
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
V4 is ranked as the top predictor, followed by V2.
The uninformative predictors V6 - V10 are still ranked as the least important.
The GBM model indicates that V1 is much more important than duplicate1. This is because the trees in boosting are dependent on each other (each is built on the errors of the previous one), so they have correlated structures and tend to keep reusing the same predictors.
Cubist
#subset x columns
simulated_x <- subset(simulated, select = -c(y))
#run model and check variable importance
cubist_model <- cubist(x = simulated_x, y = simulated$y, committees = 100)
varImp(cubist_model)
## Overall
## V1 50.5
## V3 50.5
## V2 63.0
## duplicate2 20.5
## V4 46.0
## duplicate1 28.5
## V8 3.5
## V5 28.5
## V6 8.5
## V7 0.0
## V9 0.0
## V10 0.0
The Cubist rankings differ from the previous models: V2 receives the highest usage score, V1 and V3 are tied, the duplicates are ranked well below V1, and V7, V9, and V10 are not used at all.
8.2. Use a simulation to show tree bias with different granularities.
set.seed(555)
#specify features with different granularities
x1 <- sample(0:100000, 1000, replace = T)
x2 <- sample(0:1000, 1000, replace = T)
x3 <- sample(0:10, 1000, replace = T)
#Create y as a combination of the features plus a random noise term
y <- x1 + x2 + x3 + rnorm(1000)
#Merge features into df for RF
df <- data.frame(x1, x2, x3, y)
rf_sim_model <- randomForest(y ~., data = df, importance = TRUE, ntree = 1000)
varImp(rf_sim_model, scale = FALSE)
## Overall
## x1 1473915120.6
## x2 2102414.2
## x3 -584484.8
I generated random samples for three predictors (x1, x2, x3) with different granularities and created y as their sum plus a noise term. The output above shows that as we move from low to high granularity (x3 to x1), the variable importance score increases.
This suggests that the trees are biased toward predictors with greater granularity (more distinct values).
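Because y above is built directly from x1, x2, and x3, their differing scales also contribute to the importance scores. To isolate the granularity effect, here is a minimal single-tree sketch (assuming rpart, loaded above) in which neither predictor is related to the response at all:
#Repeatedly fit a regression tree to pure-noise data and record which predictor
#is chosen for the first split; a preference for x_fine reflects granularity
#bias rather than any real relationship with y
set.seed(555)
first_split <- replicate(200, {
  noise_df <- data.frame(
    x_fine   = sample(0:100000, 100, replace = TRUE), #many distinct values
    x_coarse = sample(0:10, 100, replace = TRUE),     #few distinct values
    y        = rnorm(100)                             #unrelated to both
  )
  fit <- rpart(y ~ ., data = noise_df)
  as.character(fit$frame$var[1]) #"<leaf>" when no split is made
})
table(first_split)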
8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
knitr::include_graphics("Figure8.24.PNG")
(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
Gradient boosting trees make predictions by adding the predictions of successive trees. These trees are dependent on each other: each tree is built on the errors of the previous tree, so they have correlated structures. Suppose one tree splits on a particular feature frequently. The next tree, built on its errors, will likely split on that feature frequently as well. As a result, the same predictors are selected across many of the trees, increasing their contribution to the importance metric.
In both boosting models, a lower bagging fraction means that a smaller random subset of the training data is used to build each tree, while a lower shrinkage (learning rate) means slower learning: more iterations, each carrying a smaller impact. From this we can infer that the model on the right is “greedy” (a rapid learner) and likely prone to over-fitting: it takes larger steps down the gradient and, for the sake of learning faster, concentrates its importance on just a few predictors. Conversely, the model on the left samples a smaller subset of the data and learns more slowly; iterating in this way, and in series, reduces error and spreads importance across a larger set of predictors, giving a more robust, less error-prone model.
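A rough sketch of how the two parameter settings in Fig. 8.24 might be reproduced (assuming the solubility data that ships with AppliedPredictiveModeling, i.e. solTrainXtrans and solTrainY; the tree count and interaction depth are placeholder choices, not the book’s exact settings):
library(AppliedPredictiveModeling)
data(solubility) #provides solTrainXtrans and solTrainY, among others
#Left-hand panel: small bagging fraction and small learning rate
gbm_left <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 1,
                    shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)
#Right-hand panel: large bagging fraction and large learning rate
gbm_right <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                     n.trees = 100, interaction.depth = 1,
                     shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)
#Compare how concentrated the relative influence is in each fit
head(summary(gbm_left, plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)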
(b) Which model do you think would be more predictive of other samples?
I think that the model on the left, the one with the slower learning rate and the smaller bagging fraction, would provide more accurate predictions on other samples. It learns slowly and reduces error across more iterations while giving less weight to a greater breadth of predictors, which should reduce over-fitting and improve accuracy on new data.
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
The interaction depth specifies the maximum depth of each tree, i.e. the number of splits performed per tree. Increasing the interaction depth lets each tree use more predictors, spreading importance across a larger set of variables. Therefore, increasing the depth would reduce (flatten) the slope of the predictor-importance plot for either model.
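Continuing the hypothetical sketch above, this effect can be checked by refitting the right-hand configuration with a much deeper interaction depth and seeing how the relative influence spreads out:
#Deeper trees use more predictors per iteration, flattening the importance slope
gbm_right_deep <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                          n.trees = 100, interaction.depth = 10,
                          shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)
head(summary(gbm_right_deep, plotit = FALSE), 10)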
8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
#Load the data
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
#convert data to matrix
Data_matrix <- as.matrix(ChemicalManufacturingProcess)
#impute missing values
set.seed(333)
imputed_Data_matrix <- mice(Data_matrix, printFlag=F, method="pmm")
## Warning: Number of logged events: 675
data_comp <- complete(imputed_Data_matrix)
#remove highly correlated features
highly_correlated = findCorrelation(cor(data_comp), 0.80)
data_comp1 = data_comp[, -highly_correlated]
#center and scale
data_comp2 <- scale(data_comp1, center = TRUE, scale = TRUE)
#convert back to a data frame
cmp_df <- as.data.frame(data_comp2)
#train test split - 75/25 split
sample_size = round(nrow(cmp_df)*.75)
index <- sample(seq_len(nrow(cmp_df)), size = sample_size)
train <- cmp_df[index, ]
test <- cmp_df[-index, ]
#remove yield from sets to feed models
train.x <- train[-c(1)]
test.x <- test[-c(1)]
set.seed(333)
#fit the model
#(note: randomForest's tree-count argument is spelled ntree; the misspelled ntrees
#below is silently ignored, so the default of 500 trees is used, as the model
#summary confirms)
rf_model <- randomForest(train.x, train$Yield, importance = TRUE, ntrees = 1000)
rf_model
##
## Call:
## randomForest(x = train.x, y = train$Yield, importance = TRUE, ntrees = 1000)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 0.3845978
## % Var explained: 58.34
This might be a good model, but we still need to check the accuracy statistics on the test data.
rfPred <- predict(rf_model, newdata = test.x)
postResample(pred = rfPred, obs = test$Yield)
## RMSE Rsquared MAE
## 0.6426172 0.7131018 0.4846558
The RF model gives a test-set RMSE of 0.64, an R-squared of 0.71, and an MAE of 0.48. Ideally we are looking for an RMSE of 0.2 - 0.5, an R-squared >= 0.75, and an MAE as close to 0 as possible.
gbm_model <- gbm.fit(train.x, train$Yield, distribution = "gaussian")
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.9223 nan 0.0010 0.0007
## 2 0.9218 nan 0.0010 0.0005
## 3 0.9212 nan 0.0010 0.0004
## 4 0.9203 nan 0.0010 0.0007
## 5 0.9195 nan 0.0010 0.0006
## 6 0.9188 nan 0.0010 0.0006
## 7 0.9183 nan 0.0010 0.0005
## 8 0.9176 nan 0.0010 0.0007
## 9 0.9170 nan 0.0010 0.0004
## 10 0.9163 nan 0.0010 0.0004
## 20 0.9106 nan 0.0010 0.0005
## 40 0.8976 nan 0.0010 0.0006
## 60 0.8854 nan 0.0010 0.0005
## 80 0.8740 nan 0.0010 0.0006
## 100 0.8618 nan 0.0010 0.0006
#gbm_model
gbmPred <- predict(gbm_model, newdata = test.x)
## Using 100 trees...
postResample(pred = gbmPred, obs = test$Yield)
## RMSE Rsquared MAE
## 1.0666441 0.5085017 0.8376582
The boosting model gives a higher error and a lower R-squared than the random forest model. This is largely because gbm.fit was run with its defaults (only 100 trees and a shrinkage of 0.001, as the trace above shows), so the model is under-trained; tuning these parameters would likely improve it considerably.
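A minimal sketch of how the boosted model could be tuned by cross-validation (assuming caret, loaded above; the grid values are illustrative choices, not tuned results):
#Tune tree count, depth, and learning rate by 10-fold cross-validation
set.seed(333)
gbm_grid <- expand.grid(n.trees = c(100, 500, 1000),
                        interaction.depth = c(1, 3, 5),
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = 10)
gbm_tuned <- train(train.x, train$Yield, method = "gbm", verbose = FALSE,
                   tuneGrid = gbm_grid,
                   trControl = trainControl(method = "cv", number = 10))
postResample(predict(gbm_tuned, newdata = test.x), test$Yield)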
cube_model <- cubist(train.x, train$Yield)
cube_model
##
## Call:
## cubist.default(x = train.x, y = train$Yield)
##
## Number of samples: 132
## Number of predictors: 40
##
## Number of committees: 1
## Number of rules: 3
cubePred <- predict(cube_model, newdata = test.x)
postResample(pred = cubePred, obs = test$Yield)
## RMSE Rsquared MAE
## 0.7480486 0.5561850 0.6085062
The Cubist model performs better than the boosting model but not as well as the random forest model.
(a) Which tree-based regression model gives the optimal resampling and test set performance?
The random forest model gave the best performance of the three tree-based models, with a test-set RMSE of 0.64, an R-squared of 0.71, and an MAE of 0.48.
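Strictly speaking, only test-set estimates were computed above. A sketch of how a resampled estimate for the random forest could be obtained for a like-for-like comparison (assuming caret, loaded above):
#10-fold cross-validated RMSE and R-squared for a random forest on the training split
set.seed(333)
rf_cv <- train(train.x, train$Yield, method = "rf",
               trControl = trainControl(method = "cv", number = 10))
rf_cv$results[rf_cv$results$mtry == rf_cv$bestTune$mtry, ]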
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
The predictor importance can be found using the varImp function:
rf_imp <- varImp(rf_model)
rf_imp
## Overall
## BiologicalMaterial03 13.32003757
## BiologicalMaterial05 4.99087118
## BiologicalMaterial07 0.00000000
## BiologicalMaterial09 7.18392726
## BiologicalMaterial10 0.83173075
## BiologicalMaterial11 10.03737068
## ManufacturingProcess01 4.29511831
## ManufacturingProcess02 5.42691254
## ManufacturingProcess03 0.71751741
## ManufacturingProcess04 2.74021695
## ManufacturingProcess05 0.07638607
## ManufacturingProcess06 6.03235126
## ManufacturingProcess07 0.98781428
## ManufacturingProcess08 -0.53741719
## ManufacturingProcess09 11.65000301
## ManufacturingProcess10 2.79321544
## ManufacturingProcess11 7.40268624
## ManufacturingProcess12 1.85381976
## ManufacturingProcess13 10.37593848
## ManufacturingProcess16 1.73644414
## ManufacturingProcess17 9.08009916
## ManufacturingProcess19 1.43832058
## ManufacturingProcess20 5.17072854
## ManufacturingProcess21 1.05840688
## ManufacturingProcess22 3.42967163
## ManufacturingProcess23 2.76171735
## ManufacturingProcess24 3.89228496
## ManufacturingProcess26 4.35926201
## ManufacturingProcess28 7.29283763
## ManufacturingProcess30 4.55310702
## ManufacturingProcess32 22.95513243
## ManufacturingProcess34 2.38355098
## ManufacturingProcess35 -0.86557592
## ManufacturingProcess36 6.98750655
## ManufacturingProcess37 2.46940060
## ManufacturingProcess38 0.74832282
## ManufacturingProcess39 5.07388574
## ManufacturingProcess41 -0.01032035
## ManufacturingProcess43 5.16455119
## ManufacturingProcess44 2.99567777
ManufacturingProcess32 is the most important variable.
varImpPlot(rf_model)
From the variable importance plot above, we can conclude that 7 of the top 10 most important variables are manufacturing process variables. As was the case for our linear and nonlinear regression models, the process variables dominate the list.
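To make the top-10 list explicit, a quick base-R sketch using the rf_imp object computed above:
#Sort the importance scores and keep the ten largest
head(rf_imp[order(rf_imp$Overall, decreasing = TRUE), , drop = FALSE], 10)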
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
#Train single tree model
rpart_tree <- rpart(Yield ~., data = train)
#Produce tree plot
rpart_tree2 <- as.party(rpart_tree)
plot(rpart_tree2)
The tree above provides more information about the biological and process predictors and their relationship with yield. We can see which variables sit highest in the tree (higher up means greater importance) as well as the cutoff values at which the branches split. We can also determine the specific level at which a process variable should be set to increase yield. Overall, I believe the past three assignments have reiterated that focusing on the highest-ranked manufacturing processes (32, 9, 11, 17 from the chart above) would lead to the greatest net increase in Yield.