Recreate the simulated data from Exercise 7.2:
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
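Note: this write-up assumes the packages below are loaded in a setup chunk that is not shown; the list is inferred from the functions used later and the attach order is chosen so that party::cforest and party::varimp are the versions used.
library(mlbench)                    # mlbench.friedman1
library(randomForest)               # randomForest, varImpPlot
library(caret)                      # varImp, preProcess, createDataPartition, train, postResample
library(partykit)                   # as.party
library(party)                      # cforest, varimp (attached after partykit so these are not masked)
library(gbm)                        # gbm
library(Cubist)                     # cubist
library(rpart)                      # rpart
library(ipred)                      # ipredbagg
library(grid)                       # gpar
library(earth)                      # MARS via caret method "earth"
library(elasticnet)                 # enet via caret method "enet"
library(AppliedPredictiveModeling)  # ChemicalManufacturingProcess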
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
rfImp1
## Overall
## V1 8.83890885
## V2 6.49023056
## V3 0.67583163
## V4 7.58822553
## V5 2.27426009
## V6 0.17436781
## V7 0.15136583
## V8 -0.03078937
## V9 -0.02989832
## V10 -0.08529218
No: the uninformative predictors V6-V10 have importance scores near zero, while the informative predictors V1-V5 have much higher importance.
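As an optional visual check (not required by the exercise), randomForest's built-in importance plot tells the same story:
varImpPlot(model1)  # permutation (%IncMSE) and node-purity importance, side by side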
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 6.29780744
## V2 6.08038134
## V3 0.58410718
## V4 6.93924427
## V5 2.03104094
## V6 0.07947642
## V7 -0.02566414
## V8 -0.11007435
## V9 -0.08839463
## V10 -0.00715093
## duplicate1 3.56411581
The importance score of V1 decreased (from about 8.84 to 6.30) when we introduced the highly correlated variable “duplicate1”, because part of its importance is now shared with the near-duplicate.
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9312569
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
## Overall
## V1 5.656397024
## V2 6.957366954
## V3 0.539700105
## V4 7.280227792
## V5 2.094226861
## V6 0.141163232
## V7 0.092792498
## V8 -0.096325566
## V9 -0.007463533
## V10 0.016839393
## duplicate1 2.566313355
## duplicate2 2.654958084
The importance score of V1 decreased even further when we introduced another highly correlated variable, “duplicate2”, with the importance now spread across V1 and its two near-duplicates.
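A rough sanity check (my own sketch, not part of the exercise): the importance that V1 lost does not disappear - much of it is redistributed to its near-duplicates, so the correlated group as a whole still looks important.
# Combined importance of V1 and its two near-duplicates
sum(rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall"])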
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
simulated <- subset(simulated, select=-c(duplicate1,duplicate2))
model4 <- cforest(y ~., data = simulated)
rfImp4 <- varimp(model4, conditional = TRUE)
rfImp4
## V1 V2 V3 V4 V5 V6
## 6.13813531 5.38693691 -0.01588944 6.37698904 1.39631552 -0.23028740
## V7 V8 V9 V10
## -0.15979153 -0.43336111 -0.20556298 -0.32988314
Yes, the pattern is similar to that of the traditional random forest: the V6-V10 predictors do not have high importance, while V1, V2, V4, and V5 have much higher importance.
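For comparison (optional), the traditional permutation importance from party can be obtained by toggling the conditional argument; the conditional version adjusts for correlations among predictors, whereas conditional = FALSE gives the traditional measure.
# Traditional (unconditional) permutation importance for the cforest model
varimp(model4, conditional = FALSE)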
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmModel)
## var rel.inf
## V4 V4 29.1936158
## V1 V1 28.0197047
## V2 V2 23.1225041
## V5 V5 10.7920685
## V3 V3 7.9672867
## V6 V6 0.7108707
## V7 V7 0.1939495
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
Yes, the pattern is similar to the traditional random forest and the conditional inference forest: the V6-V10 predictors receive little or no relative influence, while V1-V5 have much higher importance.
simulated2 <- subset(simulated, select=-c(y))
CubistMod <- cubist(simulated2, simulated$y)
summary(CubistMod)
##
## Call:
## cubist.default(x = simulated2, y = simulated$y)
##
##
## Cubist [Release 2.07 GPL Edition] Sat Nov 21 17:42:51 2020
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 200 cases (11 attributes) from undefined.data
##
## Model:
##
## Rule 1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.944664]
##
## outcome = 0.183529 + 8.9 V4 + 7.9 V1 + 7.1 V2 + 5.3 V5
##
##
## Evaluation on training data (200 cases):
##
## Average |error| 2.224012
## Relative |error| 0.55
## Correlation coefficient 0.84
##
##
## Attribute usage:
## Conds Model
##
## 100% V1
## 100% V2
## 100% V4
## 100% V5
##
##
## Time: 0.0 secs
Yes, the pattern is similar to the traditional random forest, the conditional inference forest, and the boosted trees: V6-V10 do not appear in the Cubist model at all, while V1, V2, V4, and V5 have the highest importance and are the only predictors used in the model.
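caret's varImp also works on Cubist models (an optional check), turning the attribute-usage percentages above into a single importance score per predictor:
varImp(CubistMod)  # combines rule-condition and model usage into one score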
Use a simulation to show tree bias with different granularities.
As our textbook indicates: “Basic regression trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors.”
I will simulate a simple dataset with 300 rows and three columns: two predictors, X1 and X3, plus the response. X3 is a random sample of integers between 1 and 10, so it has only 10 distinct values; X1 is drawn from the same range of numbers but with Gaussian noise added, so it has many more distinct values. The response Y is the sum of the two predictors plus a small amount of noise.
set.seed(200)
X1 <- sample(1:10, 300, replace=TRUE) + rnorm(300, mean=0, sd=1)
X3 <- sample(1:10, 300, replace=TRUE)
Y <- X1+X3 + rnorm(300) * .1
simulated2 <- as.data.frame(cbind(X1, X3, Y))
rpartTree <- rpart(Y ~ ., data = simulated2)
varImp(rpartTree)
## Overall
## X1 3.389582
## X3 5.311540
Variable X3 shows a significantly higher variable importance because it has lower variance (it carries no added noise), even though both variables are directly related to Y.
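To make the selection bias quoted above even more visible, here is an optional sketch of my own (not part of the original exercise code): a completely uninformative but very granular predictor competes against an informative predictor with only two distinct values. Because the noise variable offers many more candidate split points (and shows up as a competing split), it can still receive a non-trivial importance score.
# Optional sketch: uninformative-but-granular vs. informative-but-coarse predictor
set.seed(100)
X_coarse <- sample(1:2, 300, replace = TRUE)   # informative, only 2 distinct values
X_noise  <- rnorm(300)                         # unrelated to the response, ~300 distinct values
Y2       <- X_coarse + rnorm(300, sd = 2)      # response depends only on X_coarse
biasTree <- rpart(Y2 ~ X_coarse + X_noise, data = data.frame(X_coarse, X_noise, Y2))
varImp(biasTree)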
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
According to our textbook, “the trees from boosting are dependent on each other and hence will have correlated structures as the method follows by the gradient. Therefore many of the same predictors will be selected across the trees, increasing their contribution to the importance metric.” With the bagging fraction and learning rate both set to 0.9, each tree sees most of the training data and each update takes a large, greedy step, so the same few dominant predictors are selected over and over and absorb nearly all of the importance. A small bagging fraction (only a small randomly selected fraction of the training data is used to build each tree) combined with a small learning rate makes the model less greedy; it requires more computation time and memory, but it catches a larger number of important predictors and spreads importance across them.
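A minimal sketch of this effect (my own illustration on the simulated data from earlier - the book's Figure 8.24 uses the solubility data, which is not loaded here): fitting gbm with both parameters at 0.1 versus both at 0.9 should show importance spread over more predictors in the first case and concentrated on a few in the second.
# Both parameters low: importance tends to spread over more predictors
gbmLoose  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
# Both parameters high: importance tends to concentrate on a few predictors
gbmGreedy <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
summary(gbmLoose, plotit = FALSE)
summary(gbmGreedy, plotit = FALSE)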
Which model do you think would be more predictive of other samples?
The model on the left will be more predictive of other samples: it is less greedy, so it is likely to capture more of the important predictors and is less likely to overfit to a handful of dominant variables.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
I am assuming that increasing the interaction depth will give more predictors a chance to demonstrate their importance, since each tree is grown deeper and can split on additional variables. Therefore, increasing the depth should reduce (flatten) the slope of the importance plot for either model: a wider set of predictors will be selected across the trees, increasing their contribution to the importance metric.
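A quick sketch of the same intuition (again on the simulated data rather than the solubility data): increasing interaction.depth lets each tree split on more variables, which should flatten the importance profile.
gbmShallow <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                  n.trees = 500, interaction.depth = 1)
gbmDeep    <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                  n.trees = 500, interaction.depth = 7)
summary(gbmShallow, plotit = FALSE)  # importance concentrated on the top predictors
summary(gbmDeep, plotit = FALSE)     # more predictors receive non-zero importance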
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
The first step is to read the data into a data frame and use k-nearest-neighbors (kNN) imputation to fill in the missing values.
data(ChemicalManufacturingProcess)
df1<-ChemicalManufacturingProcess
dim(df1)
## [1] 176 58
#head(df1)
df1pp <- preProcess(df1[,2:ncol(df1)], method=c('knnImpute'))
df1 <- cbind(df1$Yield,predict(df1pp, df1[,2:ncol(df1)]))
colnames(df1)[1] <- "Yield"
The second step is splitting the data into training and test sets - I will use an 80%/20% split.
set.seed(333)
trainingRows <- createDataPartition(df1$Yield, p = 0.8, list = FALSE)
df1_train <- df1[trainingRows, ]
df1_test <- df1[-trainingRows, ]
set.seed(333)
rpartTune <- train(x = df1_train[,2:ncol(df1_train)],
y = df1_train$Yield,
method = "rpart2",
tuneLength = 10,
trControl = trainControl(method = "cv"))
Evaluating the model’s performance:
rpartPred <- predict(rpartTune, newdata = df1_test[,2:ncol(df1_test)])
postResample(pred = rpartPred, obs = df1_test$Yield)
## RMSE Rsquared MAE
## 1.1952529 0.6376276 0.9207985
set.seed(333)
baggedTree <- ipredbagg(df1_train$Yield, df1_train[,2:ncol(df1_train)])
baggedTreePred <- predict(baggedTree, newdata = df1_test[,2:ncol(df1_test)])
postResample(pred = baggedTreePred, obs = df1_test$Yield)
## RMSE Rsquared MAE
## 0.9762990 0.8048250 0.7108411
set.seed(333)
rfModel <- randomForest(df1_train[,2:ncol(df1_train)], df1_train$Yield,
importance = TRUE,
ntree = 1000)
rfModelPred <- predict(rfModel, newdata = df1_test[,2:ncol(df1_test)])
postResample(pred = rfModelPred, obs = df1_test$Yield)
## RMSE Rsquared MAE
## 0.9668035 0.8240946 0.6793100
set.seed(333)
gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
.n.trees = seq(100, 1000, by = 50),
.shrinkage = c(0.01, 0.1),
.n.minobsinnode = c(5, 10))
gbmTune <- suppressWarnings(train(df1_train[,2:ncol(df1_train)], df1_train$Yield,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE))
gbmTunePred <- predict(gbmTune, newdata = df1_test[,2:ncol(df1_test)])
postResample(pred = gbmTunePred, obs = df1_test$Yield)
## RMSE Rsquared MAE
## 0.9089947 0.8251410 0.6251523
cubistMod <- cubist(df1_train[,2:ncol(df1_train)], df1_train$Yield)
cubistModPred <- predict(cubistMod, newdata = df1_test[,2:ncol(df1_test)])
postResample(pred = cubistModPred, obs = df1_test$Yield)
## RMSE Rsquared MAE
## 1.1659034 0.6501737 0.8872566
Let’s compare the models and determine which one has the best performance:
z<- rbind(
postResample(pred = rpartPred, obs = df1_test$Yield),
postResample(pred = baggedTreePred, obs = df1_test$Yield),
postResample(pred = rfModelPred, obs = df1_test$Yield),
postResample(pred = gbmTunePred, obs = df1_test$Yield),
postResample(pred = cubistModPred, obs = df1_test$Yield)
)
data.frame(z,row.names = c('single tree', 'bagged tree', 'random forest', 'boosted tree', 'cubist'))
## RMSE Rsquared MAE
## single tree 1.1952529 0.6376276 0.9207985
## bagged tree 0.9762990 0.8048250 0.7108411
## random forest 0.9668035 0.8240946 0.6793100
## boosted tree 0.9089947 0.8251410 0.6251523
## cubist 1.1659034 0.6501737 0.8872566
The Boosted Tree model has the best result, with a test-set R-squared of 0.825 (RMSE 0.909) - but that model took a very long time to tune and was computationally intensive, so it is possible that the Cubist model would be more usable even though its R-squared is lower (0.650).
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
varImp(gbmTune)
## gbm variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## BiologicalMaterial12 29.749
## ManufacturingProcess13 22.124
## ManufacturingProcess09 21.421
## ManufacturingProcess06 20.227
## BiologicalMaterial03 16.322
## ManufacturingProcess17 16.252
## BiologicalMaterial09 11.750
## BiologicalMaterial05 9.811
## ManufacturingProcess01 9.287
## ManufacturingProcess31 8.783
## BiologicalMaterial11 7.704
## ManufacturingProcess24 7.207
## BiologicalMaterial06 7.181
## ManufacturingProcess30 6.812
## ManufacturingProcess37 6.783
## ManufacturingProcess43 6.237
## BiologicalMaterial08 6.132
## ManufacturingProcess05 6.040
## ManufacturingProcess04 5.939
The list is dominated by Manufacturing Process variables (six of the top 10, led by ManufacturingProcess32), which matches our linear and non-linear model findings.
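A quick tally of the top 10 (an optional check based on the varImp output above):
gbmImp   <- varImp(gbmTune)$importance
gbmTop10 <- head(rownames(gbmImp)[order(gbmImp$Overall, decreasing = TRUE)], 10)
table(ifelse(grepl("^Manufacturing", gbmTop10), "Process", "Biological"))  # 6 process, 4 biological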
MARS model (refit to identify its most important predictors)
set.seed(200)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
marsTune <- train(df1_train[,2:ncol(df1_train)],
y = df1_train$Yield,
method = "earth",
preProc=c("center", "scale"),
tuneGrid= marsGrid,
trControl = trainControl(method = "cv"))
marsTune
## Multivariate Adaptive Regression Spline
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 129, 130, 129, 132, 129, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 1.469694 0.4099538 1.1524962
## 1 3 1.270394 0.5461842 1.0135638
## 1 4 1.237360 0.5645598 0.9940006
## 1 5 1.243428 0.5465747 0.9978504
## 1 6 1.243240 0.5625398 0.9675818
## 1 7 1.281018 0.5448316 1.0106055
## 1 8 1.261312 0.5602795 1.0181434
## 1 9 1.267466 0.5564907 1.0329659
## 1 10 1.302172 0.5403139 1.0488237
## 1 11 1.296341 0.5445436 1.0586088
## 1 12 1.273445 0.5541629 1.0393424
## 1 13 1.252598 0.5698303 1.0246146
## 1 14 1.229202 0.5807273 1.0091250
## 1 15 1.232702 0.5796333 1.0146955
## 1 16 1.231305 0.5806299 1.0147462
## 1 17 1.235551 0.5768138 1.0097048
## 1 18 1.235551 0.5768138 1.0097048
## 1 19 1.235551 0.5768138 1.0097048
## 1 20 1.235551 0.5768138 1.0097048
## 2 2 1.469694 0.4099538 1.1524962
## 2 3 1.281032 0.5539317 1.0244959
## 2 4 1.245999 0.5686795 1.0027749
## 2 5 1.239588 0.5755069 1.0083431
## 2 6 1.277475 0.5596173 1.0404891
## 2 7 1.290410 0.5601103 1.0537775
## 2 8 1.284861 0.5524504 1.0509238
## 2 9 1.278701 0.5694702 1.0456271
## 2 10 1.294947 0.5617347 1.0661414
## 2 11 1.251266 0.5946481 1.0199754
## 2 12 1.296675 0.5954595 1.0328179
## 2 13 1.313246 0.5910848 1.0133991
## 2 14 1.310340 0.5943687 1.0090514
## 2 15 1.311538 0.5921334 1.0090578
## 2 16 1.320320 0.5905494 1.0068345
## 2 17 1.336474 0.5847754 1.0142035
## 2 18 1.326958 0.5945070 1.0169338
## 2 19 1.314935 0.5768725 1.0182396
## 2 20 1.354841 0.5597531 1.0389166
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 1.
Linear Model Coefficients
df1_enet <- train(df1_train[,2:ncol(df1_train)], df1_train$Yield,
method='enet',
metric='RMSE',
trControl = trainControl(method = "repeatedcv"),
preProcess=c('center','scale'))
enetCoef<- predict(df1_enet$finalModel, newx = as.matrix(df1_test[,2:ncol(df1_test)]),
s = .1, mode = "fraction",
type = "coefficients")
head(sort(enetCoef$coefficients),3)
## ManufacturingProcess13 ManufacturingProcess37 ManufacturingProcess17
## -0.25153249 -0.11488070 -0.06413573
tail(sort(enetCoef$coefficients),7)
## BiologicalMaterial03 ManufacturingProcess44 ManufacturingProcess06
## 0.07826640 0.09057314 0.10343127
## ManufacturingProcess15 ManufacturingProcess34 ManufacturingProcess09
## 0.15819222 0.17985524 0.51029411
## ManufacturingProcess32
## 0.72011230
Process 32, 9, 34, 15, 6, 44, 3 (biological), 17, 37, and 13 were the strongest predictors in my ENET linear model; 6 out of those 10 are present in the Boosted Tree model’s most important variables.
Process 32, 9, 13, 33, 28, 15, 39, 4, 3 (biological), and 23 were the strongest predictors in my MARS non-linear model; 4 out of those 10 are present in the Boosted Tree model’s most important variables.
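The ENET overlap can also be computed directly (an optional check; gbmTop10 was computed above, and enetCoef comes from the elastic net fit):
enetTop10 <- head(names(sort(abs(enetCoef$coefficients), decreasing = TRUE)), 10)
intersect(gbmTop10, enetTop10)  # should list the 6 predictors shared with the boosted tree's top 10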
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
rpartTree2 <- as.party(rpartTune$finalModel)
plot(rpartTree2, gp=gpar(fontsize = 8))
This view confirms that, for a single tree, the manufacturing process variables are of higher importance than the biological predictors. Very few branches rely on biological materials, and the initial split is based on ManufacturingProcess32.
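As a numeric complement to the plot (an optional sketch), the mean yield in each terminal node can be pulled from the party object's fitted data:
fit_df <- rpartTree2$fitted   # "(fitted)" = terminal node id, "(response)" = yield
tapply(fit_df[["(response)"]], fit_df[["(fitted)"]], mean)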
Sources: https://stackoverflow.com/questions/13751962/how-to-plot-a-large-ctree-to-avoid-overlapping-nodes