library(caret)
library(mlbench)
library(randomForest)
library(party)
library(dplyr)
library(gbm)
library(Cubist)
Recreate the simulated data from Exercise 7.2:
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model_random <- randomForest(y ~ ., data= simulated, importance = TRUE, ntree= 1000)
rf_imp <- varImp(model_random, scale=FALSE)
rf_imp
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6-V10)?
The predictors V6-V10 have much smaller importance values than V1-V5, so the random forest did not make significant use of the uninformative predictors.
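The outputs below include a predictor named duplicate1. In Exercise 8.1 (b) this column is added as a copy of V1 plus a small amount of noise; a sketch of that step follows (the exact noise draw, and hence the resulting importances, depends on the random seed, so it may not match the outputs exactly).
# Add a predictor highly correlated with V1 (Exercise 8.1 (b)); the 0.1 noise
# scale follows the textbook, but the exact values depend on the seed.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)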
cforest_model <- cforest(y ~ ., data= simulated)
varimp(cforest_model) %>% sort(decreasing = TRUE) # Without conditional
## V4 V2 duplicate1 V1 V5
## 7.6223892727 6.0579730772 5.0941897280 4.6171158805 1.7161194047
## V7 V9 V3 V6 V10
## 0.0465374951 0.0046062409 0.0003116115 -0.0289427183 -0.0310326410
## V8
## -0.0380965511
varimp(cforest_model, conditional = TRUE) %>% sort(decreasing=TRUE) # With conditional
## V4 V2 duplicate1 V1 V5 V6
## 6.190578351 4.688980360 1.926751729 1.807953145 1.051666850 0.028174759
## V9 V3 V8 V7 V10
## 0.015118505 0.012878752 -0.004356587 -0.011709437 -0.022190587
There are slight differences between the traditional (unconditional) and conditional importance calculations. V4, V2, duplicate1, V1, and V5 are still the most important variables, but their importance values drop under the conditional approach.
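For comparison, the traditional random forest can also be refit with duplicate1 included (a sketch, not part of the original output; the exact values will vary with the seed):
model_random2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
varImp(model_random2, scale = FALSE)  # V1's importance is typically shared with duplicate1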
gbm_model <- gbm(y ~ ., data=simulated, distribution = 'gaussian')
summary(gbm_model)
## var rel.inf
## V4 V4 29.9389549
## V2 V2 21.6012918
## V1 V1 15.6771729
## duplicate1 duplicate1 13.2189845
## V5 V5 10.0440211
## V3 V3 9.0948053
## V6 V6 0.4247697
## V7 V7 0.0000000
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
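Before interpreting the ranking, a quick sensitivity check (a sketch; the n.trees, shrinkage, and bag.fraction values are illustrative and the resulting importances will differ) refits the boosted model with a smaller learning rate:
gbm_model2 <- gbm(y ~ ., data = simulated, distribution = 'gaussian',
                  n.trees = 1000, shrinkage = 0.01, bag.fraction = 0.5)
summary(gbm_model2, plotit = FALSE)  # compare the relative influence ranking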
I still see the same pattern, with V4, V2, V1, and duplicate1 being the important predictors while V6-V10 are not. The main difference from the previous models is that the importance of the top predictors is even more pronounced, but the overall pattern is almost the same. Now let's test Cubist and see if the pattern persists there too.
cubist_model <- cubist(simulated[,-11], simulated[, 11])
varImp(cubist_model)
## Overall
## V1 50
## V2 50
## V4 50
## V5 50
## duplicate1 50
## V3 0
## V6 0
## V7 0
## V8 0
## V9 0
## V10 0
Here the pattern changes slightly: V1, V2, V4, V5, and duplicate1 are the top predictors, each with an importance of 50, while the others, including V3, receive zero importance.
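To check whether this ranking is sensitive to the Cubist settings, the model can also be fit with committees (a sketch; committees = 10 is an illustrative value and the importances may differ):
cubist_model2 <- cubist(simulated[, -11], simulated[, 11], committees = 10)
varImp(cubist_model2)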
Use a simulation to show tree bias with different granularities
library(rpart)
library(partykit)
## Loading required package: libcoin
##
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
##
## cforest, ctree, ctree_control, edge_simple, mob, mob_control,
## node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
## node_terminal, varimp
set.seed(388)
X1 <- sample(1:10, 100, replace = TRUE)
X2 <- rnorm(100)
X3 <- rnorm(100, mean = 0, sd = 4)
X4 <- sample(1:10/10, 100, replace = TRUE)
Y <- X1 + rnorm(100, mean = 0, sd = .5)
df <- data.frame(X1, X2, X3, X4, Y)
# Rpart
rpart_m <- rpart(Y ~., data=df)
varImp(rpart_m)
## Overall
## X1 3.2553667
## X2 0.2997476
## X3 0.8363086
## X4 0.3748365
In the rpart fit, X1 is by far the most important predictor, as expected, since Y is generated from X1 alone; its importance is significantly higher than that of X2-X4. The noise predictors still receive nonzero importance, and the continuous X3 scores highest among them even though none of them is related to Y, which is consistent with the selection bias of trees toward predictors with more distinct values (finer granularity).
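As an additional check (a sketch; the extra seed and the cp = 0 setting are arbitrary choices for illustration), the response can be replaced with pure noise so that no predictor is informative; any importance the tree then assigns reflects granularity alone:
set.seed(389)  # arbitrary seed for this extra check
df_noise <- df
df_noise$Y <- rnorm(100)  # response unrelated to X1-X4
rpart_noise <- rpart(Y ~ ., data = df_noise, control = rpart.control(cp = 0))
varImp(rpart_noise)  # more granular predictors tend to receive higher importance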
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
The model on the right has a larger learning rate and bagging fraction, which means each tree sees more of the data and contributes more to the final prediction, so the importance becomes concentrated on just the first few predictors. This gives a sharper picture of which predictors dominate, although a learning rate of 0.1 is generally the better choice in terms of fitting.
I cannot tell for sure which model would be more predictive of other samples without testing on held-out data; an apparently good model may be overfitting and may not give the best results on new samples.
Interaction depth behaves like tree depth. As the interaction depth increases, the RMSE decreases and the slope of the predictor-importance plot becomes steeper.
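A sketch of how the comparison in Figure 8.24 could be reproduced on the solubility data (assuming data(solubility) from the AppliedPredictiveModeling package, which provides solTrainXtrans and solTrainY; n.trees = 100 is an illustrative value):
library(AppliedPredictiveModeling)
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
# Both shrinkage and bag.fraction set to 0.1 vs. both set to 0.9
gbm_low  <- gbm(Solubility ~ ., data = solTrain, distribution = 'gaussian',
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = 'gaussian',
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbm_low, plotit = FALSE), 10)   # importance typically spread over more predictors
head(summary(gbm_high, plotit = FALSE), 10)  # importance typically concentrated on a few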
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Imputing the missing values using bagImpute
cmp_impute <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('bagImpute'))
# Replacing
cmp <- predict(cmp_impute, ChemicalManufacturingProcess[,-c(1)])
# Splitting the data into training and test datasets
set.seed(3559)
train_r <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
X_train <- cmp[train_r,]
y_train <- ChemicalManufacturingProcess$Yield[train_r]
X_test <- cmp[-train_r,]
y_test <- ChemicalManufacturingProcess$Yield[-train_r]
Now that the data is preprocessed and split into training and test sets, I am going to fit several tree-based models and see how they perform.
set.seed(2330)
rpart_m <- train(x= X_train, y= y_train, method="rpart", tuneLength=10, control= rpart.control(maxdepth=2))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
rpart_m
## CART
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.008448771 1.531011 0.3633249 1.192349
## 0.026985533 1.531011 0.3633249 1.192349
## 0.027870628 1.531011 0.3633249 1.192349
## 0.028155890 1.531011 0.3633249 1.192349
## 0.031178734 1.531011 0.3633249 1.192349
## 0.041325211 1.537862 0.3582518 1.197905
## 0.053936668 1.536746 0.3578463 1.196370
## 0.071685404 1.534111 0.3596184 1.191464
## 0.077778984 1.540440 0.3524684 1.194357
## 0.402184913 1.708874 0.3075143 1.361557
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.03117873.
set.seed(2044)
randomf_m <- train(X_train, y_train, method='rf', tuneLength = 10)
randomf_m
## Random Forest
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.357171 0.5863109 1.0831068
## 8 1.263995 0.6172568 0.9960572
## 14 1.247450 0.6174306 0.9760008
## 20 1.245978 0.6110851 0.9690348
## 26 1.247355 0.6051196 0.9642880
## 32 1.251446 0.5972517 0.9661479
## 38 1.259134 0.5894453 0.9691093
## 44 1.267043 0.5810307 0.9747482
## 50 1.277367 0.5724495 0.9787299
## 57 1.282848 0.5669017 0.9810368
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 20.
set.seed(19828)
grid <- expand.grid(n.trees = c(50, 100, 150, 200),
                    interaction.depth = c(1, 5, 10, 15),
                    shrinkage = c(0.01, 0.1, 0.5),
                    n.minobsinnode = c(5, 10, 15))
gbm_m <- train(x = X_train, y = y_train, method = 'gbm', tuneGrid = grid, verbose = FALSE)
gbm_m$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 76 200 10 0.1 5
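The resampled performance for that selected combination can be pulled from caret's results table:
merge(gbm_m$bestTune, gbm_m$results)  # RMSE, Rsquared, and MAE for the chosen tune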
Let’s resample and compare the distributions of the models:
summary(resamples(list(Single_True = rpart_m, Random_Forest = randomf_m, Gradient_Boosting=gbm_m)))
##
## Call:
## summary.resamples(object = resamples(list(Single_True = rpart_m,
## Random_Forest = randomf_m, Gradient_Boosting = gbm_m)))
##
## Models: Single_True, Random_Forest, Gradient_Boosting
## Number of resamples: 25
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Single_True 0.9444254 1.1065223 1.1844621 1.1923487 1.2964423 1.400361
## Random_Forest 0.7007020 0.9130536 0.9774114 0.9690348 1.0121849 1.210727
## Gradient_Boosting 0.7048638 0.8483657 0.9129005 0.9053819 0.9460865 1.127389
## NA's
## Single_True 0
## Random_Forest 0
## Gradient_Boosting 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Single_True 1.2232769 1.404920 1.554384 1.531011 1.627351 1.797655 0
## Random_Forest 0.9070558 1.158586 1.263932 1.245978 1.337436 1.595936 0
## Gradient_Boosting 0.9365434 1.106733 1.154823 1.164502 1.206801 1.428635 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Single_True 0.2166224 0.2787529 0.3634825 0.3633249 0.4349596 0.6043007
## Random_Forest 0.4092469 0.5701930 0.6147885 0.6110851 0.6675758 0.6936807
## Gradient_Boosting 0.5302200 0.5914836 0.6203363 0.6299045 0.6693364 0.7546486
## NA's
## Single_True 0
## Random_Forest 0
## Gradient_Boosting 0
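The resampling distributions can also be compared visually (a sketch using caret's bwplot method for resamples objects):
bwplot(resamples(list(Single_True = rpart_m, Random_Forest = randomf_m,
                      Gradient_Boosting = gbm_m)),
       metric = 'RMSE')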
It seems like Gradient Boosting is doing better, as its RMSE is lower than the others'. I won't base the selection on R-squared, as it does not penalize the addition of uninformative predictors. Now let's test the models on the test set.
test_perf <- function(models, testData, testTarget) {
  method <- c()
  res <- data.frame()
  for (model in models) {
    method <- c(method, model$method)
    pred <- predict(model, newdata = testData)
    res <- rbind(res, t(postResample(pred = pred, obs = testTarget)))
  }
  row.names(res) <- method
  return(res)
}
models <- list(rpart_m, randomf_m, gbm_m)
performance <- test_perf(models, X_test, y_test)
performance
## RMSE Rsquared MAE
## rpart 1.3408087 0.3804309 1.1170494
## rf 0.9892451 0.6391462 0.7023174
## gbm 1.0198520 0.6318556 0.7686820
Random Forest performed slightly better than Gradient Boosting, as its RMSE dropped below Gradient Boosting's on the test set. I would gather more data and run additional tests to see which model stays consistent. Since Random Forest is consistent across the training and test sets, I would choose Random Forest.
varImp(randomf_m)
## rf variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess13 32.681
## BiologicalMaterial03 27.716
## BiologicalMaterial12 25.506
## ManufacturingProcess09 19.357
## ManufacturingProcess17 18.443
## ManufacturingProcess06 16.333
## BiologicalMaterial06 15.269
## ManufacturingProcess31 11.872
## ManufacturingProcess28 11.836
## BiologicalMaterial11 11.301
## BiologicalMaterial02 11.010
## BiologicalMaterial09 9.054
## BiologicalMaterial08 8.160
## BiologicalMaterial04 8.030
## ManufacturingProcess36 7.620
## BiologicalMaterial05 6.614
## BiologicalMaterial01 6.000
## ManufacturingProcess39 5.258
## ManufacturingProcess27 5.164
ManufacturingProcess32 is again the leading predictor, followed by a mix of ManufacturingProcess and BiologicalMaterial variables, which is broadly consistent with the linear and nonlinear regression models tested previously. Neither group dominates; both process and biological predictors are important in the model.
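As a quick tally of the two groups among the top 20 predictors (a sketch using base R on caret's importance table):
imp <- varImp(randomf_m)$importance
top20 <- head(imp[order(-imp$Overall), , drop = FALSE], 20)
table(ifelse(grepl('^Manufacturing', rownames(top20)), 'Process', 'Biological'))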
rpart_m$finalModel
## n= 144
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 144 510.06710 40.17806
## 2) ManufacturingProcess32< 159.5 81 137.14040 39.12543
## 4) BiologicalMaterial11< 145.47 44 50.63688 38.50932 *
## 5) BiologicalMaterial11>=145.47 37 49.93917 39.85811 *
## 3) ManufacturingProcess32>=159.5 63 167.78540 41.53143
## 6) ManufacturingProcess06< 208.1 37 81.27967 40.86622 *
## 7) ManufacturingProcess06>=208.1 26 46.83320 42.47808 *
plot(rpart_m$finalModel)
text(rpart_m$finalModel)
The tree shows that the first and most informative split uses ManufacturingProcess32 at 159.5. The branch with larger ManufacturingProcess32 values has a higher predicted yield, so increasing ManufacturingProcess32 is associated with higher yield.
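For a more readable display of the same tree, the rpart fit can also be converted to a party object (a sketch using the partykit package loaded earlier):
plot(partykit::as.party(rpart_m$finalModel))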