Data 624 - Predictive Analytics
Chapter 8
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
model1 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.83890885
## V2 6.49023056
## V3 0.67583163
## V4 7.58822553
## V5 2.27426009
## V6 0.17436781
## V7 0.15136583
## V8 -0.03078937
## V9 -0.02989832
## V10 -0.08529218
No, it did not. The predictors V6 ~ V10 have very small importance compared to the informative predictors V1, V2, V4, and V5.
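Note that the outputs below reference a predictor named duplicate1, which is highly correlated with V1; the code chunk that adds it is not shown above. A minimal sketch of how it was likely created, following the exercise text (the 0.1 noise scale is an assumption):
# Assumed step (not shown above): add a predictor highly correlated with V1
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)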
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
cforestModel <- cforest(y ~ ., data=simulated)
varimp(cforestModel) %>% sort(decreasing = T)
## V4 V1 V2 duplicate1 V5 V3
## 7.485796796 6.571632689 6.110131511 2.722380892 1.889117623 0.010450332
## V9 V7 V10 V6 V8
## 0.008976088 0.007081381 0.005356254 -0.006751589 -0.036022685
varimp(cforestModel, conditional=T) %>% sort(decreasing = T)
## V4 V2 V1 V5 duplicate1 V6
## 6.109465803 4.807391436 3.077532075 1.253301514 0.910467826 0.012782225
## V9 V3 V10 V7 V8
## 0.009321716 0.009202518 0.003628121 0.002904234 0.002410231
The importance calculated with conditional set to TRUE differs slightly from the traditional measure. The uninformative predictors (V6 ~ V10) still receive low importance. The importance of the other predictors is reduced, with V1 and its highly correlated copy duplicate1 reduced the most. Both measures rate V4 as the most important predictor, with V1 and V2 next (in different orders).
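As an additional illustrative check (not run above), dropping duplicate1 and refitting the conditional inference forest should restore most of V1's importance:
# Illustrative only: refit without the correlated copy and re-examine importance
cforestNoDup <- cforest(y ~ ., data = subset(simulated, select = -duplicate1))
varimp(cforestNoDup) %>% sort(decreasing = TRUE)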
For the gradient boosted trees model, I used the gbm function with its default parameters. The summary function returns the relative influence of each predictor in the model.
library(gbm)
## Loaded gbm 2.1.8
gbmModel <- gbm(y ~ ., data=simulated, distribution='gaussian')
summary(gbmModel)
## var rel.inf
## V4 V4 30.2787835
## V2 V2 22.4240399
## V1 V1 19.5630117
## V5 V5 11.1666898
## V3 V3 8.3490611
## duplicate1 duplicate1 7.7807889
## V7 V7 0.2851571
## V6 V6 0.1524680
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
The same pattern holds: the uninformative predictors V6 ~ V10 are still rated very low, and V4 remains the top-rated predictor, followed by V2. The correlated predictors V1 and duplicate1, however, show a larger gap in their rated importance; the GBM model treats V1 as much more important than duplicate1. This is because boosted trees are dependent on each other: each tree is built on the residuals of the previous ones, so they have correlated structures and many of the same predictors are selected across trees. As a result, their contributions to the importance metric look more “stretched” than in the other tree models.
For the Cubist model, I used the cubist function, setting committees to 100 to match the default number of trees in the gbm model. The varImp function calculates the variable importance of a Cubist model.
library(Cubist)
cubistModel <- cubist(x=simulated[,-(ncol(simulated)-1)], y=simulated$y, committees=100)
varImp(cubistModel)
## Overall
## V1 64.5
## V3 41.0
## V2 60.0
## V4 48.0
## V5 31.0
## V6 9.0
## duplicate1 6.0
## V8 2.0
## V10 0.5
## V7 0.0
## V9 0.0
Again, the uninformative predictors V6 ~ V10 are rated low in importance. This time, however, the top-rated predictor is V1 rather than V4, which differs from the random forest and GBM models. As with the GBM model, the Cubist model rates V1 much higher than its correlated copy duplicate1.
Tree models suffer from a selection bias: predictors with many distinct values are favored over coarser predictors with only a few distinct values.
Below, I simulate four variables using the sample function with replacement. All four take values between 0 and 1, but with decreasing granularity: x1 is drawn from a grid of 10,001 possible values, x2 from 1,001, x3 from 101, and x4 from only 11. The target variable y is the sum of x1 and x4 plus a random error term; x2 and x3 are not used to generate the target.
set.seed(1)
x1 <- sample(0:10000 / 10000, 200, replace = T)
x2 <- sample(0:1000 / 1000, 200, replace = T)
x3 <- sample(0:100 / 100, 200, replace = T)
x4 <- sample(0:10 / 10, 200, replace = T)
y <- x1 + x4 + rnorm(200)
df <- data.frame(x1, x2, x3, x4, y)
str(df)
## 'data.frame': 200 obs. of 5 variables:
## $ x1: num 0.102 0.8 0.477 0.972 0.846 ...
## $ x2: num 0.126 0.809 0.571 0.999 0.96 0.283 0.333 0.03 0.267 0.092 ...
## $ x3: num 0.44 0.89 0.39 0.65 0.4 0.07 0.24 0.94 0.19 0.83 ...
## $ x4: num 0.8 0 0 0.9 0.7 0.2 0 0.3 0.4 0.6 ...
## $ y : num 0.92 2.119 0.412 1.172 2.083 ...
Below, a regression tree is fitted using rpart, and the variable importance is calculated using varImp.
library(rpart)
rpartTree <- rpart(y ~ ., data=df)
varImp(rpartTree)
## Overall
## x1 0.7443663
## x2 0.7563594
## x3 0.5903005
## x4 0.4806487
The tree rates x1 highly but rates x4, which is equally informative, the lowest. Interestingly, the uninformative x2 is rated even slightly higher than x1, and x3 is rated above x4, even though neither is used to generate the target; both simply have many more distinct values than x4. This demonstrates the selection bias in tree models: predictors with more distinct values are favored.
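The granularity difference can be made explicit by counting the distinct values of each predictor (a quick check, output not shown):
# Distinct values per predictor: x1 is the finest-grained, x4 the coarsest
sapply(df[, c("x1", "x2", "x3", "x4")], function(v) length(unique(v)))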
Gradient boosted trees make predictions by summing the predictions of a sequence of trees. These trees are dependent on each other: each tree is built on the residuals of the previous trees, so they have correlated structures. If one tree splits frequently on a particular feature, the next tree, which is fit to its residuals, will likely split frequently on that feature as well. As a result, many of the same predictors are selected across the trees, increasing their contribution to the importance metric.
The learning rate controls the fraction of each tree’s prediction that is added to the ensemble. A higher learning rate means a larger fraction of each tree’s prediction is added, which strengthens the dependence/correlation between trees, so more of the same predictors are selected across them. This is why the right-hand plot concentrates its importance on just the first few predictors and looks very steep.
The bagging fraction is the fraction of the data used to build each tree. With a small bagging fraction, say 0.1, each iteration sees only a random 10% of the full data, so each tree may be built on a very different dataset and the trees split very differently from one another. In contrast, with a large bagging fraction, say 0.9, each tree sees essentially the same dataset and the trees split similarly. A larger bagging fraction therefore increases the dependent/correlated structure of the boosted trees, which is why the right-hand plot, with the larger bagging fraction, concentrates its importance on just the first few predictors.
The learning rate and bagging fraction are important tuning parameters for controlling overfitting in a gradient boosting model. A smaller learning rate and bagging fraction generally lead to better generalization to unseen samples, so if I had to guess, the model with a learning rate and bagging fraction of 0.1 would be more predictive of other samples. However, since this involves a bias-variance trade-off, I could only confirm it using cross-validation or a test set.
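Although not required here, the two importance profiles in the figure could be approximated on the solubility data by fitting one gbm model with both the shrinkage and bagging fraction set to 0.1 and another with both set to 0.9, then plotting varImp for each. A minimal sketch, assuming the solubility data from AppliedPredictiveModeling (the other parameter values are arbitrary choices):
# Illustrative sketch: approximate the two importance profiles in the figure
library(AppliedPredictiveModeling)
data(solubility)
gridLow  <- expand.grid(n.trees = 100, interaction.depth = 5, shrinkage = 0.1, n.minobsinnode = 10)
gridHigh <- expand.grid(n.trees = 100, interaction.depth = 5, shrinkage = 0.9, n.minobsinnode = 10)
gbmLow  <- train(solTrainXtrans, solTrainY, method = 'gbm', tuneGrid = gridLow,
                 bag.fraction = 0.1, verbose = FALSE)  # learning rate and bagging fraction = 0.1
gbmHigh <- train(solTrainXtrans, solTrainY, method = 'gbm', tuneGrid = gridHigh,
                 bag.fraction = 0.9, verbose = FALSE)  # learning rate and bagging fraction = 0.9
plot(varImp(gbmLow))   # expected: flatter importance profile
plot(varImp(gbmHigh))  # expected: steeper importance profile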
Below, I train two GBM models with identical parameters except for the interaction depth: Model 1 uses an interaction depth of 1, while Model 2 uses an interaction depth of 10. The predictor importance of both models is plotted:
library(AppliedPredictiveModeling)
data(solubility)
grid1 <- expand.grid(n.trees=100, interaction.depth=1, shrinkage=0.1, n.minobsinnode=10)
gbm1 <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = grid1, verbose = FALSE)
grid2 <- expand.grid(n.trees=100, interaction.depth=10, shrinkage=0.1, n.minobsinnode=10)
gbm2 <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = grid2, verbose = FALSE)
plot(varImp(gbm1))
plot(varImp(gbm2))
As you can see, increasing the interaction depth spreads the importance out more: each tree can grow deeper, giving more features a chance to be involved in the splits. Increasing the depth therefore reduces the slope of the importance plot.
The missing values in the ChemicalManufacturingProcess data are imputed using the bagImpute method. The data are then split into training and test sets, with 20% assigned to the test set.
data(ChemicalManufacturingProcess)
(cmpImpute <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('bagImpute')))
## Created from 152 samples and 57 variables
##
## Pre-processing:
## - bagged tree imputation (57)
## - ignored (0)
cmp <- predict(cmpImpute, ChemicalManufacturingProcess[,-c(1)])
# Train/test splitting data, 20% testing
set.seed(1)
trainRow <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
X.train <- cmp[trainRow, ]
y.train <- ChemicalManufacturingProcess$Yield[trainRow]
X.test <- cmp[-trainRow, ]
y.test <- ChemicalManufacturingProcess$Yield[-trainRow]
Below, four regression tree models are trained: a single tree, a random forest, gradient boosting, and Cubist. Bootstrapped resampling with 25 repetitions is used to determine the final models, with RMSE as the metric. Centering and scaling are not needed for tree models.
Single tree, tuned over the complexity parameter:
set.seed(1)
rpartModel <- train(x = X.train,
y = y.train,
method = "rpart",
tuneLength = 10,
control = rpart.control(maxdepth=2))
rpartModel
## CART
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.008397041 1.549770 0.3812237 1.216080
## 0.008859356 1.549770 0.3812237 1.216080
## 0.018524970 1.549770 0.3812237 1.216080
## 0.020481868 1.549770 0.3812237 1.216080
## 0.035948795 1.548684 0.3821330 1.214237
## 0.037010750 1.548684 0.3821330 1.214237
## 0.040274452 1.545683 0.3841690 1.212441
## 0.070117057 1.559539 0.3700052 1.218758
## 0.075640038 1.567100 0.3650647 1.222131
## 0.422604552 1.710135 0.3513290 1.347456
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.04027445.
Random forest, tuned over mtry, the number of predictors randomly selected as split candidates at each node:
set.seed(1)
rfModel <- train(x = X.train,
y = y.train,
method = 'rf',
tuneLength = 10)
rfModel
## Random Forest
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.378648 0.5802953 1.0843105
## 8 1.272058 0.6180565 0.9910083
## 14 1.256792 0.6154582 0.9757093
## 20 1.253456 0.6096284 0.9695865
## 26 1.258766 0.5986575 0.9709039
## 32 1.257386 0.5953348 0.9698568
## 38 1.262774 0.5866870 0.9693972
## 44 1.275029 0.5766315 0.9773552
## 50 1.277282 0.5725979 0.9777703
## 57 1.288198 0.5637389 0.9856393
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 20.
Gradient boosting, tuned over the number of trees, interaction depth, shrinkage rate, and minimum terminal node size:
set.seed(1)
grid <- expand.grid(n.trees=c(50, 100, 150, 200),
interaction.depth=c(1, 5, 10, 15),
shrinkage=c(0.01, 0.1, 0.5),
n.minobsinnode=c(5, 10, 15))
gbmModel <- train(x = X.train,
y = y.train,
method = 'gbm',
tuneGrid = grid,
verbose = FALSE)
plot(gbmModel)
gbmModel$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 76 200 10 0.1 5
gbmModel$finalModel
## A gradient boosted model with gaussian loss function.
## 200 iterations were performed.
## There were 57 predictors of which 55 had non-zero influence.
Cubist, tuned over the number of committees and neighbors:
set.seed(1)
cubistModel <- train(x = X.train,
y = y.train,
method = 'cubist')
cubistModel
## Cubist
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.974730 0.3276865 1.3415098
## 1 5 1.953650 0.3397400 1.3197269
## 1 9 1.961401 0.3350359 1.3269185
## 10 0 1.319770 0.5487338 1.0117335
## 10 5 1.294999 0.5647492 0.9812996
## 10 9 1.307784 0.5568598 0.9955234
## 20 0 1.248259 0.5934930 0.9678102
## 20 5 1.222524 0.6085878 0.9385537
## 20 9 1.234662 0.6013848 0.9509907
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
The resampling performance of all the models is compared below:
resamp <- resamples(list(SingleTree=rpartModel, RandomForest=rfModel, GradientBoosting=gbmModel, Cubist=cubistModel))
summary(resamp)
##
## Call:
## summary.resamples(object = resamp)
##
## Models: SingleTree, RandomForest, GradientBoosting, Cubist
## Number of resamples: 25
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## SingleTree 1.0158765 1.1529965 1.2181336 1.2124410 1.2834938 1.511266
## RandomForest 0.7789125 0.9015179 0.9654665 0.9695865 1.0133921 1.193207
## GradientBoosting 0.7473860 0.8650054 0.9133498 0.9263050 0.9786561 1.158549
## Cubist 0.7494319 0.8574981 0.9325365 0.9385537 0.9713097 1.133009
## NA's
## SingleTree 0
## RandomForest 0
## GradientBoosting 0
## Cubist 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SingleTree 1.259445 1.490117 1.568376 1.545683 1.626500 1.815942 0
## RandomForest 1.021405 1.170212 1.254259 1.253456 1.332717 1.502504 0
## GradientBoosting 1.010651 1.098105 1.209567 1.205794 1.291562 1.485723 0
## Cubist 1.004272 1.163104 1.211461 1.222524 1.294660 1.446199 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## SingleTree 0.2335658 0.3185924 0.3740543 0.3841690 0.4188642 0.6073257
## RandomForest 0.4717475 0.5586765 0.5947119 0.6096284 0.6734519 0.7394027
## GradientBoosting 0.5069428 0.5560189 0.6209324 0.6157570 0.6835510 0.7247879
## Cubist 0.4539216 0.5552589 0.6115826 0.6085878 0.6516084 0.7465633
## NA's
## SingleTree 0
## RandomForest 0
## GradientBoosting 0
## Cubist 0
Looking at the mean RMSE across resamples, the gradient boosting model appears to be optimal.
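The same comparison can also be viewed graphically using caret’s lattice methods for resamples objects, for example:
# Visualize the distribution of resampled RMSE for each model
bwplot(resamp, metric = "RMSE")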
The test set performance is calculated below:
# Compute test-set RMSE, R-squared, and MAE for a list of caret models
testPerf <- function(models, testData, testTarget) {
  method <- c()
  res <- data.frame()
  for (model in models) {
    method <- c(method, model$method)            # model label (e.g. "rpart", "rf")
    pred <- predict(model, newdata = testData)   # predict on the held-out set
    res <- rbind(res, t(postResample(pred = pred, obs = testTarget)))
  }
  row.names(res) <- method
  return(res)
}
models <- list(rpartModel, rfModel, gbmModel, cubistModel)
performance <- testPerf(models, X.test, y.test)
performance
## RMSE Rsquared MAE
## rpart 1.4606072 0.2824136 1.1857749
## rf 0.9243633 0.7077904 0.7535632
## gbm 0.9428919 0.6969765 0.7243978
## cubist 0.7248858 0.8187655 0.6265254
The test set performance suggests that the Cubist model is the best. However, I would trust the resampled metrics more than the test-set metrics here, since the dataset is very small and the test-set performance may not be a close approximation of the true model performance. Therefore, I would choose the gradient boosting model.
The predictor importance can be found using the varImp function:
varImp(gbmModel)
## gbm variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17 15.501
## ManufacturingProcess09 15.159
## ManufacturingProcess13 13.372
## ManufacturingProcess15 12.283
## BiologicalMaterial08 10.304
## BiologicalMaterial12 9.952
## BiologicalMaterial03 9.437
## ManufacturingProcess31 8.831
## ManufacturingProcess24 8.797
## ManufacturingProcess06 8.394
## ManufacturingProcess21 7.675
## BiologicalMaterial05 7.542
## BiologicalMaterial06 6.152
## BiologicalMaterial09 5.763
## ManufacturingProcess39 5.695
## ManufacturingProcess30 5.512
## ManufacturingProcess14 5.297
## BiologicalMaterial10 4.912
## ManufacturingProcess43 4.435
varImp(cubistModel)
## cubist variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17 88.235
## BiologicalMaterial06 58.824
## ManufacturingProcess09 49.412
## ManufacturingProcess13 49.412
## BiologicalMaterial12 48.235
## BiologicalMaterial03 43.529
## ManufacturingProcess39 38.824
## BiologicalMaterial02 32.941
## ManufacturingProcess33 25.882
## ManufacturingProcess04 18.824
## BiologicalMaterial11 14.118
## ManufacturingProcess29 12.941
## ManufacturingProcess14 9.412
## ManufacturingProcess25 7.059
## ManufacturingProcess37 7.059
## ManufacturingProcess28 5.882
## ManufacturingProcess35 5.882
## BiologicalMaterial09 5.882
## ManufacturingProcess27 5.882
varImp(rfModel)
## rf variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## BiologicalMaterial12 32.125
## BiologicalMaterial03 25.852
## ManufacturingProcess13 22.917
## BiologicalMaterial06 21.530
## ManufacturingProcess17 15.573
## ManufacturingProcess36 14.563
## ManufacturingProcess09 12.247
## BiologicalMaterial02 10.393
## ManufacturingProcess31 9.802
## ManufacturingProcess06 8.120
## BiologicalMaterial11 8.019
## BiologicalMaterial05 7.511
## BiologicalMaterial09 7.208
## ManufacturingProcess33 7.175
## BiologicalMaterial08 5.870
## BiologicalMaterial04 5.649
## ManufacturingProcess39 5.255
## ManufacturingProcess28 4.964
## ManufacturingProcess25 4.587
plot(varImp(gbmModel),top = 20)
Out of the top 20 most important variables, 13 are ManufacturingProcess predictors and 7 are BiologicalMaterial predictors; the ManufacturingProcess predictors dominate the list.
As you can see, ManufacturingProcess32 is the most important predictor. In the optimal linear model, almost all of the top 10 predictors were ManufacturingProcess predictors, while for the optimal tree and nonlinear models the top 10 are not as dominated by ManufacturingProcess predictors.
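To make this comparison concrete, the top 10 predictors of each tree model can be tabulated by type; a small sketch (the helper topType is my own and its output is not shown):
# Count process vs. biological predictors among each model's top 10
topType <- function(model, n = 10) {
  imp <- varImp(model)$importance
  top <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:n]
  table(ifelse(grepl("^Manufacturing", top), "Process", "Biological"))
}
lapply(list(rf = rfModel, gbm = gbmModel, cubist = cubistModel), topType)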
Below, the optimal single tree model is printed and plotted. The first split uses ManufacturingProcess32 at a value of 159.5: samples with values below 159.5 have a mean yield of 39.12, while samples at or above 159.5 have a mean yield of 41.60. Each branch is then split once more, on BiologicalMaterial12 and ManufacturingProcess17 respectively.
rpartModel$finalModel
## n= 144
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 144 506.83480 40.15444
## 2) ManufacturingProcess32< 159.5 84 145.58300 39.12369
## 4) BiologicalMaterial12< 19.735 36 47.40003 38.34361 *
## 5) BiologicalMaterial12>=19.735 48 59.84592 39.70875 *
## 3) ManufacturingProcess32>=159.5 60 147.06110 41.59750
## 6) ManufacturingProcess17>=33.45 44 76.39399 41.13341 *
## 7) ManufacturingProcess17< 33.45 16 35.12938 42.87375 *
plot(rpartModel$finalModel)
text(rpartModel$finalModel)
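For a more readable rendering of the same tree, the rpart.plot package could be used (assuming it is installed; it is not loaded elsewhere in this document):
# Optional: prettier display of the fitted rpart tree
library(rpart.plot)
rpart.plot(rpartModel$finalModel)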
This view provides additional knowledge about the relationships: higher ManufacturingProcess32 and BiologicalMaterial12 values are associated with higher yield, while lower ManufacturingProcess17 values are associated with higher yield. Computing the branch means of the first split directly from the training data confirms the node values shown above:
trainData <- data.frame(X.train, y.train)
less <- subset(trainData, ManufacturingProcess32 < 159.5)
more <- subset(trainData, ManufacturingProcess32 >= 159.5)
mean(less$y.train)
## [1] 39.12369
mean(more$y.train)
## [1] 41.5975