Recreate the simulated data from Exercise 7.2:
library(mlbench)        # mlbench.friedman1()
library(randomForest)
library(caret)          # varImp()
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.83890885
## V2 6.49023056
## V3 0.67583163
## V4 7.58822553
## V5 2.27426009
## V6 0.17436781
## V7 0.15136583
## V8 -0.03078937
## V9 -0.02989832
## V10 -0.08529218
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No. The random forest ranked V1, V4, and V2 as most important, followed by V5 and V3, while the uninformative predictors V6-V10 received importance scores near zero.
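The comparison table further below also references rfImp2 and a duplicated predictor; those come from the exercise's next step, which adds a predictor highly correlated with V1 and refits the random forest. That code is not shown above, so here is a minimal sketch of it, using the duplicate1 construction from the exercise text:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1   # highly correlated with V1
cor(simulated$duplicate1, simulated$V1)
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)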
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
# controls <- cforest_control(ntree = 1000)
library(party)          # cforest(), varimp()
cf <- cforest(y ~ ., data = simulated[, 1:11])   # without the duplicated predictor
cfImp1 <- varimp(cf)
ccfImp1 <- varimp(cf, conditional = TRUE)
cf <- cforest(y ~ ., data = simulated)           # with the duplicated predictor
cfImp2 <- varimp(cf)
ccfImp2 <- varimp(cf, conditional = TRUE)
df1 <- data.frame(rfImp1)
df2 <- data.frame(rfImp2)
df3 <- data.frame(cfImp1)
df4 <- data.frame(cfImp2)
df5 <- data.frame(ccfImp1)
df6 <- data.frame(ccfImp2)
res <- cbind(df1$Overall, df2$Overall,
             df3$cfImp1, df4$cfImp2,
             df5$ccfImp1, df6$ccfImp2)
colnames(res) <- c('rfImp1','rfImp2','cfImp1','cfImp2','ccfImp1','ccfImp2')
rownames(res) <- c('v1','v2','v3','v4','v5','v6','v7','v8','v9','v10','duplicated')
res <- data.frame(res)
res['duplicated','rfImp1'] <- 'n/a'
res['duplicated','cfImp1'] <- 'n/a'
res['duplicated','ccfImp1'] <- 'n/a'
res
## rfImp1 rfImp2 cfImp1 cfImp2
## v1 8.83890884687881 6.29780744 8.24398027002184 6.997928079
## v2 6.49023055671384 6.08038134 6.25754168767664 5.721403952
## v3 0.675831632065687 0.58410718 0.238955295737507 0.176865757
## v4 7.58822552962725 6.93924427 7.11821066817076 6.679439805
## v5 2.27426008558579 2.03104094 2.08822910203085 2.164754408
## v6 0.174367811427058 0.07947642 -0.0339077123981816 0.002175835
## v7 0.151365830084575 -0.02566414 0.132461148719195 0.158646126
## v8 -0.0307893674609138 -0.11007435 -0.203131687868431 -0.072300812
## v9 -0.0298983168753266 -0.08839463 -0.042280789046421 -0.025558718
## v10 -0.0852921822465674 -0.00715093 -0.233782680168299 -0.100624738
## duplicated n/a 3.56411581 n/a 4.358041616
## ccfImp1 ccfImp2
## v1 6.16756829289277 3.71232062
## v2 5.3518538178137 4.81247129
## v3 0.153015077857874 0.02150080
## v4 6.1535726781082 5.80857084
## v5 1.45791688301616 1.48899900
## v6 -0.263060836697065 -0.26393405
## v7 -0.240956852289023 -0.03673253
## v8 -0.384942286104933 -0.33199344
## v9 -0.244790640262602 -0.24349702
## v10 -0.27229862015385 -0.27427204
## duplicated n/a 1.82777831
Yes, the cforest importances show a similar pattern, and adding the duplicated variable again results in a drop in V1's importance (for both the traditional and the conditional measures), since the two correlated predictors share the credit.
library(ipred)          # bagging()
set.seed(200)
bag1 <- bagging(y ~ ., data = simulated[, 1:11], nbagg = 50)   # without duplicate1
bag2 <- bagging(y ~ ., data = simulated, nbagg = 50)           # with duplicate1
(bag1Imp <- varImp(bag1))
## Overall
## V1 1.8013841
## V10 0.7482432
## V2 2.2479822
## V3 1.1596239
## V4 2.6335381
## V5 2.5472188
## V6 0.9449452
## V7 0.9698789
## V8 0.6338984
## V9 0.7085276
(bag2Imp <- varImp(bag2))
## Overall
## duplicate1 1.5253939
## V1 1.7771897
## V10 0.6422615
## V2 1.9752538
## V3 1.1272768
## V4 2.6417971
## V5 2.3159016
## V6 0.7537269
## V7 0.9205175
## V8 0.4518462
## V9 0.4889416
library(gbm)
boost1 <- gbm(y ~ ., data = simulated[, 1:11], n.trees = 100, distribution = "gaussian")
boost2 <- gbm(y ~ ., data = simulated, n.trees = 100, distribution = "gaussian")
(boost1Imp <- varImp(boost1))
## Error in 1:n.trees: argument of length 0
(boost2Imp <- varImp(boost2))
## Error in 1:n.trees: argument of length 0
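The varImp calls above error out in this setup. As a workaround (a sketch assuming only the boost1 and boost2 fits above), gbm's own relative-influence summary provides comparable importance information:
summary(boost1, n.trees = 100, plotit = FALSE)   # relative influence, without duplicate1
summary(boost2, n.trees = 100, plotit = FALSE)   # relative influence, with duplicate1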
library(Cubist)
cubist1 <- cubist(x = simulated[, 1:10], y = simulated$y, committees = 100)
# Note: 'simulated' here still contains the response column y (and duplicate1),
# so y is passed in as a "predictor" and dominates the importances shown below.
cubist2 <- cubist(x = simulated, y = simulated$y, committees = 100)
(cubist1Imp <- varImp(cubist1))
## Overall
## V1 71.5
## V3 47.0
## V2 58.5
## V4 48.0
## V5 33.0
## V6 13.0
## V7 0.0
## V8 0.0
## V9 0.0
## V10 0.0
(cubist2Imp <- varImp(cubist2))
## Overall
## y 50
## V1 0
## V2 0
## V3 0
## V4 0
## V5 0
## V6 0
## V7 0
## V8 0
## V9 0
## V10 0
## duplicate1 0
Use a simulation to show tree bias with different granularities.
Per Kuhn and Johnson, trees suffer from selection bias: predictors with a higher number of distinct values (and therefore more candidate split points) are favored over predictors with only a few distinct values, even when the coarser predictor is the informative one. The simulation below creates two variables: x1 takes only 3 distinct values and is related to the target (the target differs from it only by random noise), while x2 is continuous random data that is unrelated to the target but far more granular, offering many more potential split points. Although x1 carries the signal, the more granular x2 is given more importance.
set.seed(200)
x1 <- rep(1:3, each=100) # 3 categories, 100 records for each
e1 <- rnorm(300,mean=0, sd=3) # random error to be added to target variable
y <- x1+e1 # closely related to x1
x2 <- rnorm(300, mean=0, sd=2) # continuous, many distinct values, drawn independently of e1 so it has no relation to y
sim <- data.frame(y=y, x1=x1, x2=x2)
library(rpart)
fit <- rpart(y ~ ., data = sim)
varImp(fit)
## Overall
## x1 1.485791
## x2 3.812263
Even though x1 is the variable related to the target, the more finely grained x2 receives the higher importance score: its many candidate split points give it more opportunities to produce a split that looks good, illustrating the selection bias described above.
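A hedged extension of this simulation (my own sketch, not part of the original exercise): when neither predictor is related to the response, an unbiased procedure should choose each variable for the root split about equally often, whereas rpart should pick the more granular x2 far more often.
set.seed(300)
first_split <- replicate(200, {
  dat <- data.frame(y  = rnorm(300),            # pure-noise response
                    x1 = rep(1:3, each = 100),  # coarse: only 2 candidate split points
                    x2 = rnorm(300))            # granular: ~299 candidate split points
  fit <- rpart(y ~ ., data = dat,
               control = rpart.control(maxdepth = 1, cp = 0))  # force a single split
  as.character(fit$frame$var[1])                # variable used at the root
})
table(first_split)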
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Per Kuhn and Johnson, boosting employs a greedy strategy, choosing the optimal weak learner at each iteration, and one way to constrain that greediness is the learning rate. Small values of the learning parameter (< 0.01) typically work best, but the smaller the rate, the more iterations, and therefore the more computation time, are required to reach a good model. Stochastic gradient boosting adds a bagging-like step: at each iteration only a random fraction of the training data is used to fit that iteration's tree, and a smaller fraction means less data per tree and less computation. With both the learning rate and the bagging fraction set to 0.9, each tree sees nearly all of the data and contributes strongly to the fit, so the greedy selection keeps returning to the same few dominant predictors and importance is concentrated on them. With both set to 0.1, more randomness enters the fit and each tree contributes only a little, so importance is spread across many more predictors.
Per Ridgeway, smaller learning rates tend to perform better. Selecting a smaller fraction of the data at each iteration likewise reduces the model's dependence on a few dominant variables, so the 0.1/0.1 model would be expected to generalize better to new samples.
Trees in boosting are not independent: each one depends on the previous trees, their depth is limited by the interaction depth, and they contribute unequally to the final model. Increasing the interaction depth makes each weak learner behave more like a full single tree, so different variables can be selected across splits and iterations. Because variable importance is the reduction in squared error attributed to each predictor, summed over all trees, deeper trees give more predictors a chance to contribute to that reduction; importance is therefore spread across more predictors, flattening the slope of the importance plot instead of concentrating weight on the few predictors that dominate shallow trees.
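For reference, a sketch only (not the book's code) of how the two extremes in Figure 8.24 map onto gbm arguments, assuming the solubility data (solTrainXtrans, solTrainY) from the AppliedPredictiveModeling package:
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
gbm_low  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
# Relative influence: expect gbm_high to concentrate on a few predictors and
# gbm_low to spread influence more evenly; increasing interaction.depth would
# flatten these profiles further.
head(summary(gbm_low,  plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)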
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
library(AppliedPredictiveModeling)   # ChemicalManufacturingProcess
library(caret)                       # preProcess, train, varImp, ...
data(ChemicalManufacturingProcess)
features <- subset(ChemicalManufacturingProcess, select = -Yield)
yield <- subset(ChemicalManufacturingProcess, select = Yield)
prep <- preProcess(features, method = c('scale', 'center', 'knnImpute'))
prep_features <- predict(prep, features)
# Train and test
set.seed(1)
split <- createDataPartition(yield$Yield,p=0.75,list=FALSE)
x_train <- prep_features[split,]
y_train <- yield[split,]
x_test <- prep_features[-split,]
y_test <- yield[-split,]
# Additional preprocessing
## remove near zero variance predictors that carry no information
pred_to_remove <- nearZeroVar(features)
x_train <- x_train[-pred_to_remove]
x_test <- x_test[-pred_to_remove]
## Remove highly correlated features
corThresh <- 0.9
tooHigh <- findCorrelation(cor(x_train),corThresh)
x_train <- x_train[,-tooHigh]
x_test <- x_test[,-tooHigh]
set.seed(1)
ctrl <- trainControl(method='cv',number=10)
cart_model <- train(x = x_train, y = y_train,
                    method = 'rpart2',
                    metric = 'Rsquared',
                    trControl = ctrl)
cart_model
## CART
##
## 132 samples
## 46 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 119, 118, 119, 118, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 1.370534 0.4384761 1.099251
## 2 1.325843 0.4672979 1.076872
## 3 1.362245 0.4667858 1.066312
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 2.
plot(cart_model)
set.seed(1)
predictions <- predict(cart_model,x_test)
postResample(obs = y_test,pred = predictions)
## RMSE Rsquared MAE
## 1.6472041 0.3466194 1.2919750
# rf_model <- randomForest(x=x_train,y=y_train,
# importance = TRUE,
# ntree = 1000)
rf_grid <- expand.grid(mtry = c(1, 3, 5, 7, 10))
rf_model <- train(x = x_train, y = y_train,
                  method = 'rf',
                  metric = 'Rsquared',
                  importance = TRUE,
                  tuneGrid = rf_grid,
                  trControl = ctrl)
rf_model
## Random Forest
##
## 132 samples
## 46 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 119, 118, 119, 118, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 1.313275 0.6320500 1.0759019
## 3 1.144113 0.6929042 0.9334935
## 5 1.125750 0.6840603 0.9130703
## 7 1.096581 0.6998949 0.8820174
## 10 1.080599 0.7022848 0.8748689
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 10.
plot(rf_model)
set.seed(1)
predictions <- predict(rf_model,x_test)
postResample(obs = y_test,pred = predictions)
## RMSE Rsquared MAE
## 1.3387788 0.6217914 1.0128653
library(gbm)
set.seed(1)
gbmGrid <- expand.grid(interaction.depth = seq(1, 5, by = 1),
                       n.trees = c(10, 20, 50, 100),
                       shrinkage = c(0.01, 0.1, 0.5),
                       n.minobsinnode = 10)
gbm_model <- train(x = x_train, y = y_train,
                   method = 'gbm',
                   metric = 'Rsquared',
                   verbose = FALSE,
                   tuneGrid = gbmGrid,
                   trControl = ctrl)
gbm_model
## Stochastic Gradient Boosting
##
## 132 samples
## 46 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 119, 118, 119, 118, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.01 1 10 1.723398 0.4983709 1.4013663
## 0.01 1 20 1.671157 0.4968049 1.3552942
## 0.01 1 50 1.548643 0.5255066 1.2532318
## 0.01 1 100 1.400241 0.5692484 1.1282749
## 0.01 2 10 1.716561 0.5148675 1.3957618
## 0.01 2 20 1.655387 0.5197523 1.3436512
## 0.01 2 50 1.496011 0.5815619 1.2117334
## 0.01 2 100 1.333279 0.6118879 1.0788588
## 0.01 3 10 1.705582 0.5340190 1.3845724
## 0.01 3 20 1.635189 0.5595889 1.3250517
## 0.01 3 50 1.477134 0.5922749 1.1976969
## 0.01 3 100 1.308911 0.6228345 1.0626236
## 0.01 4 10 1.698382 0.5814462 1.3801726
## 0.01 4 20 1.630054 0.5718858 1.3233621
## 0.01 4 50 1.465891 0.5952168 1.1883195
## 0.01 4 100 1.309565 0.6147817 1.0637739
## 0.01 5 10 1.708137 0.5035807 1.3883022
## 0.01 5 20 1.639681 0.5492787 1.3295971
## 0.01 5 50 1.473248 0.5938854 1.1913155
## 0.01 5 100 1.314841 0.6126844 1.0630924
## 0.10 1 10 1.426153 0.5243323 1.1566885
## 0.10 1 20 1.261989 0.5966798 1.0225136
## 0.10 1 50 1.127329 0.6554557 0.9029270
## 0.10 1 100 1.064618 0.6840756 0.8437091
## 0.10 2 10 1.358825 0.5607452 1.0935824
## 0.10 2 20 1.221661 0.5933675 0.9695901
## 0.10 2 50 1.120007 0.6403005 0.8725922
## 0.10 2 100 1.085203 0.6573020 0.8460159
## 0.10 3 10 1.322121 0.5427148 1.0690831
## 0.10 3 20 1.208183 0.5861849 0.9686540
## 0.10 3 50 1.111969 0.6535825 0.8947736
## 0.10 3 100 1.059770 0.6823334 0.8622593
## 0.10 4 10 1.330432 0.5722127 1.0901729
## 0.10 4 20 1.221179 0.5954643 0.9847109
## 0.10 4 50 1.116395 0.6547404 0.8751232
## 0.10 4 100 1.067728 0.6859512 0.8476542
## 0.10 5 10 1.307958 0.6060999 1.0544101
## 0.10 5 20 1.177054 0.6214995 0.9509815
## 0.10 5 50 1.067697 0.6809763 0.8483713
## 0.10 5 100 1.024568 0.7003654 0.8269520
## 0.50 1 10 1.255636 0.5464585 1.0209971
## 0.50 1 20 1.255687 0.5488650 0.9879220
## 0.50 1 50 1.237069 0.5670153 0.9698515
## 0.50 1 100 1.312456 0.5304760 1.0298776
## 0.50 2 10 1.213052 0.5467724 0.9518224
## 0.50 2 20 1.286085 0.5125172 1.0208341
## 0.50 2 50 1.374955 0.4604876 1.0888086
## 0.50 2 100 1.328536 0.4991827 1.0538630
## 0.50 3 10 1.292069 0.5277342 1.0269943
## 0.50 3 20 1.270056 0.5547592 1.0142672
## 0.50 3 50 1.245095 0.6051454 0.9506360
## 0.50 3 100 1.232185 0.6126716 0.9562459
## 0.50 4 10 1.207302 0.5641461 0.9609271
## 0.50 4 20 1.185843 0.5871817 0.9315298
## 0.50 4 50 1.220040 0.5788122 0.9858341
## 0.50 4 100 1.216712 0.5715531 0.9656392
## 0.50 5 10 1.183431 0.5964186 0.9494210
## 0.50 5 20 1.149551 0.5884861 0.9614074
## 0.50 5 50 1.195672 0.5599751 0.9817420
## 0.50 5 100 1.170901 0.5764055 0.9560094
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100, interaction.depth =
## 5, shrinkage = 0.1 and n.minobsinnode = 10.
plot(gbm_model)
set.seed(1)
predictions <- predict(gbm_model,x_test)
postResample(obs = y_test,pred = predictions)
## RMSE Rsquared MAE
## 1.2930695 0.5994154 0.9755731
set.seed(1)
cubGrid <- expand.grid(committees = c(1, 3, 5, 7, 10),
                       neighbors = c(1, 3, 5))
cub_model <- train(x = x_train, y = y_train,
                   method = 'cubist',
                   metric = 'Rsquared',
                   tuneGrid = cubGrid,
                   trControl = ctrl,
                   verbose = FALSE)
cub_model
## Cubist
##
## 132 samples
## 46 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 119, 118, 119, 118, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 1 1.175820 0.6380818 0.8675950
## 1 3 1.187623 0.6000023 0.9152934
## 1 5 1.220173 0.5830305 0.9667796
## 3 1 1.129601 0.6389493 0.8634972
## 3 3 1.117596 0.6328648 0.8709977
## 3 5 1.160379 0.6098827 0.9225344
## 5 1 1.079827 0.6731328 0.8199351
## 5 3 1.058706 0.6748209 0.8441410
## 5 5 1.108418 0.6528887 0.9082350
## 7 1 1.082860 0.6667072 0.8214243
## 7 3 1.038260 0.6842429 0.8249929
## 7 5 1.094210 0.6578758 0.8899572
## 10 1 1.042537 0.6865912 0.7929366
## 10 3 1.002615 0.7022529 0.7934951
## 10 5 1.049291 0.6813160 0.8472975
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were committees = 10 and neighbors = 3.
plot(cub_model)
set.seed(1)
predictions <- predict(cub_model,x_test)
postResample(obs = y_test,pred = predictions)
## RMSE Rsquared MAE
## 0.9869493 0.7798385 0.6792144
The following R-squared values were achieved (cross-validated train / held-out test) for each model:
rf     0.7022848 / 0.6217914
gbm    0.7003654 / 0.5994154
cubist 0.7022529 / 0.7798385
Although all three models produced approximately the same R-squared on the training resamples, cubist achieved the highest R-squared and the lowest RMSE on the test set, outperforming all of the other models, including the previously developed linear and non-linear ones.
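The train/test R-squared pairs above can also be pulled programmatically; a sketch assuming the fitted rf_model, gbm_model, and cub_model objects and the test split created earlier:
models <- list(rf = rf_model, gbm = gbm_model, cubist = cub_model)
sapply(models, function(m) {
  c(train = getTrainPerf(m)$TrainRsquared,                                   # resampled R-squared of the selected model
    test  = unname(postResample(pred = predict(m, x_test), obs = y_test)["Rsquared"]))
})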
plot(varImp(cub_model, scale = FALSE), top=10,scales = list(y = list(cex = 0.8)))
All models identify a similar set of variables as important, though in a different order; in all cases MP32, MP09, MP17, BM12, and BM03 (MP = ManufacturingProcess, BM = BiologicalMaterial) appear most frequently in the top 5.
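The corresponding top-10 importance plots for the other models can be produced the same way (a sketch assuming the fitted model objects above):
plot(varImp(cart_model, scale = FALSE), top = 10, scales = list(y = list(cex = 0.8)))
plot(varImp(rf_model,   scale = FALSE), top = 10, scales = list(y = list(cex = 0.8)))
plot(varImp(gbm_model,  scale = FALSE), top = 10, scales = list(y = list(cex = 0.8)))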
library(partykit)   # as.party() for plotting the rpart tree
plot(as.party(cart_model$finalModel), gp = gpar(fontsize = 8))
In the optimal CART model, MP32 sits at the top as the most important variable, which is consistent with what we have seen from the other models. Lower values of MP32 tend to be associated with lower yield, while the biological material predictors only start to contribute further down the tree, on the lower-value branch.