Textbook: Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer, New York, 2013.
library(caret)
library(mlbench)
library(randomForest)
library(party)
library(dplyr)
library(gbm)
library(Cubist)
library(rpart)
library(partykit)
Recreate the simulated data from Exercise 7.2:
set.seed(420)
simulated = mlbench.friedman1(200, sd = 1)
simulated = cbind(simulated$x, simulated$y)
simulated = as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] = "y"
model_rf = randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rf_Imp = varImp(model_rf, scale = FALSE)
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
rf_Imp
## Overall
## V1 6.29089217
## V2 7.23706200
## V3 1.52229636
## V4 9.93245576
## V5 0.82685129
## V6 0.11773907
## V7 -0.01531565
## V8 0.23933146
## V9 0.01159528
## V10 0.17604773
The importance scores for V6-V10 are much smaller than those of the other predictors, so it does not look like the random forest model significantly used the uninformative predictors.
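The comparison below uses an importance object rf_imp2 and includes a duplicate1 predictor, neither of which appears in the code above. Presumably, as in Exercise 8.1(b), a predictor highly correlated with V1 was added and the random forest was refit; a minimal sketch, assuming the textbook's noise scale:
# Add a noisy copy of V1 (the noise scale is an assumption), then refit the forest
set.seed(420)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)  # should be close to 1
model_rf2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rf_imp2 <- varImp(model_rf2, scale = FALSE)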
model_cforest <- party::cforest(y ~ ., data = simulated)
cf_imp3 <- party::varimp(model_cforest)                     # unconditional importance
cf_imp4 <- party::varimp(model_cforest, conditional = TRUE) # conditional importance
as.data.frame(cbind(rf_imp2, Unconditional = cf_imp3, Conditional = cf_imp4))
## Overall Unconditional Conditional
## V1 4.8648919382 8.84785379 7.4280605
## V2 7.3372450126 6.89128208 5.5693755
## V3 1.2728087767 5.02776556 3.7467140
## V4 9.8316668348 3.49451211 1.5008438
## V5 0.9354855143 1.01746720 0.5642202
## V6 0.1025747835 0.62881291 0.2146344
## V7 -0.0006555202 0.58448420 0.0739921
## V8 0.2455059911 -0.03204360 -0.1407056
## V9 -0.1369595734 -0.04878733 -0.1698899
## V10 0.0811743538 -0.11795138 -0.2001284
## duplicate1 3.3725652358 -0.24604449 -0.2212346
The importance of V1 increased when we compare the traditional random forest importance to the conditional importance, while the importance of V4 and duplicate1 decreased. The conditional approach takes the correlation among predictors into account and adjusts the variable importance accordingly.
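As a side check on that interpretation, party's conditional importance decides which predictors to condition on via a correlation threshold, so varying it shows how much the adjustment depends on what counts as "correlated". A sketch (the threshold value here is arbitrary):
# Stricter thresholds condition the permutation on fewer correlated variables
party::varimp(model_cforest, conditional = TRUE, threshold = 0.5)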
# Running gbm
gbmGrid = expand.grid(interaction.depth = seq(1, 5, by = 2),
                      n.trees = seq(100, 1000, by = 100),
                      shrinkage = 0.1,
                      n.minobsinnode = 5)
model_gbm = train(y ~ ., data = simulated,
                  tuneGrid = gbmGrid, verbose = FALSE,
                  method = 'gbm')
gbm_imp<-varImp(model_gbm)
#Running cubist
model_cubist <- cubist(simulated[,-11], simulated[, 11])
cubist_imp<-varImp(model_cubist)
#Compare
df = as.data.frame(cbind(gbm_imp$importance, cubist_imp))
names(df) = c("boosted", "cubist")
df
## boosted cubist
## V1 56.637970 94.5
## V2 72.026872 78.0
## V3 37.037956 79.5
## V4 100.000000 60.5
## V5 17.725744 39.5
## V6 1.451492 3.5
## V7 1.067405 0.0
## V8 1.136399 0.0
## V9 0.000000 0.0
## V10 3.029049 0.0
## duplicate1 12.501278 0.0
V6-V10 remain among the lowest in importance, just as in the traditional and conditional random forest models. V4 is still the top predictor for boosting, while Cubist ranks V1 highest, but the overall pattern is the same.
Below are five simulated variables with different levels of granularity. The response is a function of X1 plus noise.
set.seed(450)
X1 <- sample(1:1000 / 1000, 252, replace = TRUE) # 1000 distinct values
X2 <- sample(1:100 / 100, 252, replace = TRUE)   # 100 distinct values
X3 <- sample(1:10 / 10, 252, replace = TRUE)     # 10 distinct values
X4 <- sample(1:5 / 5, 252, replace = TRUE)       # 5 distinct values
X5 <- rep(1, 252)                                # constant: 1 distinct value
Y <- X1 + rnorm(252, mean = 0, sd = 1.5)         # noise vector matches n = 252
df <- data.frame(X1, X2, X3, X4, X5, Y)
# Rpart
rpart_model <- rpart(Y ~., data=df)
varImp(rpart_model)
## Overall
## X1 0.8433728
## X2 1.1844165
## X3 0.7082408
## X4 0.7733144
## X5 0.0000000
The two most granular predictors, X1 and X2, receive the highest importance scores even though only X1 is related to the response, and the constant X5 receives none. This suggests a selection bias: predictors with more distinct values are favoured.
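As a contrast, conditional inference trees choose splits with statistical tests rather than an exhaustive search, which is designed to avoid this granularity bias. A quick sketch (with noise this large, the tree may be shallow):
# ctree uses p-value-based split selection; inspect which predictors it splits on
ctree_model <- partykit::ctree(Y ~ ., data = df)
ctree_model  # ideally only X1, the informative predictor, is selected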
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
The model on the right concentrates its importance on a few variables because it uses a higher learning rate together with a higher bagging fraction. With a large bagging fraction, each tree sees most of the data, so the trees are highly correlated from iteration to iteration; with a large learning rate, each tree's contribution is retained almost in full, so the model quickly commits to the first strong predictors. As a result, only a few variables end up with high importance.
The model with the larger learning rate and bagging fraction is more likely to overfit because it relies on fewer predictors. The model on the left, which spreads importance across more predictors, has a better chance of being predictive of new samples.
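Although Figure 8.24 itself comes from the textbook, the two extreme fits can be sketched directly with gbm on the solubility data; the shrinkage and bag.fraction values come from the exercise, while n.trees and the remaining settings are assumptions:
# Two extreme fits as in Fig. 8.24 (solubility data from AppliedPredictiveModeling)
library(AppliedPredictiveModeling)
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
gbm_left <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbm_right <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbm_left, plotit = FALSE), 10)   # importance spread over many predictors
head(summary(gbm_right, plotit = FALSE), 10)  # importance concentrated in a few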
Interaction depth plays the same role as tree depth. As the interaction depth increases, more predictors are included in the model and importance is spread across them, so the RMSE tends to decrease and the slope of the sorted variable-importance plot becomes flatter.
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Impute the missing values using bagImpute
cmp_impute <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('bagImpute'))
# Replace
cmp <- predict(cmp_impute, ChemicalManufacturingProcess[,-c(1)])
# Splitting the data into training and test datasets
set.seed(480)
train_r <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
X_train <- cmp[train_r,]
y_train <- ChemicalManufacturingProcess$Yield[train_r]
X_test <- cmp[-train_r,]
y_test <- ChemicalManufacturingProcess$Yield[-train_r]
set.seed(330)
model_rpart <- train(x= X_train, y= y_train, method="rpart", tuneLength=10, control= rpart.control(maxdepth=2))
model_rpart
## CART
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01639564 1.480376 0.3607456 1.147913
## 0.02137409 1.480376 0.3607456 1.147913
## 0.02341864 1.480376 0.3607456 1.147913
## 0.02532733 1.482527 0.3601484 1.148927
## 0.03111746 1.485388 0.3573289 1.151740
## 0.03600407 1.485388 0.3573289 1.151740
## 0.04559790 1.497955 0.3500961 1.159016
## 0.06391370 1.512886 0.3413150 1.171772
## 0.09164355 1.500473 0.3448356 1.169280
## 0.39332469 1.657623 0.2971253 1.320311
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.02341864.
set.seed(340)
model_rf3<- train(X_train, y_train, method='rf', tuneLength = 10)
model_rf3
## Random Forest
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.306399 0.5247818 1.0456889
## 8 1.235512 0.5490750 0.9683824
## 14 1.220206 0.5526758 0.9492727
## 20 1.214414 0.5510626 0.9382298
## 26 1.214831 0.5469368 0.9329945
## 32 1.219346 0.5416017 0.9360704
## 38 1.221729 0.5372881 0.9360276
## 44 1.227097 0.5317287 0.9368421
## 50 1.232757 0.5261384 0.9411051
## 57 1.234035 0.5242386 0.9425201
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 20.
set.seed(350)
grid <- expand.grid(n.trees = c(50, 100, 150, 200),
                    interaction.depth = c(1, 5, 10, 15),
                    shrinkage = c(0.01, 0.1, 0.5),
                    n.minobsinnode = c(5, 10, 15))
model_gbm1 <- train(x = X_train, y = y_train, method = 'gbm', tuneGrid = grid, verbose = FALSE)
model_gbm1$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 88 200 15 0.1 5
Let us compare the resampling results of the three models:
summary(resamples(list(Single_Tree = model_rpart, Random_Forest = model_rf3, Gradient_Boosting = model_gbm1)))
##
## Call:
## summary.resamples(object = resamples(list(Single_Tree =
## model_rpart, Random_Forest = model_rf3, Gradient_Boosting = model_gbm1)))
##
## Models: Single_Tree, Random_Forest, Gradient_Boosting
## Number of resamples: 25
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Single_Tree 0.9445823 1.0825643 1.1771915 1.1479128 1.2229221 1.318403
## Random_Forest 0.7726775 0.8568233 0.9438754 0.9382298 0.9756329 1.170187
## Gradient_Boosting 0.8104290 0.8763024 0.9236562 0.9237093 0.9625195 1.039068
## NA's
## Single_Tree 0
## Random_Forest 0
## Gradient_Boosting 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Single_Tree 1.2451998 1.405136 1.484287 1.480376 1.577538 1.636725 0
## Random_Forest 0.9779076 1.127082 1.196840 1.214414 1.265581 1.542786 0
## Gradient_Boosting 1.0682624 1.140261 1.188020 1.204190 1.239991 1.469959 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Single_Tree 0.1808142 0.2998683 0.3552829 0.3607456 0.4372972 0.5070873
## Random_Forest 0.3940632 0.4918491 0.5656946 0.5510626 0.6242199 0.6959368
## Gradient_Boosting 0.4210686 0.5505479 0.5950396 0.5802401 0.6193195 0.6888321
## NA's
## Single_Tree 0
## Random_Forest 0
## Gradient_Boosting 0
Gradient boosting appears to be the best model: its mean and median resampled RMSE are slightly lower than the random forest's and well below the single tree's. We still need to evaluate the models on the test set.
# Function to compute test-set performance for a list of caret models
test_performance <- function(models, testData, testTarget) {
  method <- c()
  res <- data.frame()
  for (model in models) {
    method <- c(method, model$method)
    pred <- predict(model, newdata = testData)
    res <- rbind(res, t(postResample(pred = pred, obs = testTarget)))
  }
  row.names(res) <- method
  return(res)
}
# List the models
models <- list(model_rpart, model_rf3, model_gbm1)
#Run the function
performance <- test_performance(models, X_test, y_test)
performance
## RMSE Rsquared MAE
## rpart 1.564727 0.3765960 1.2801338
## rf 1.102755 0.7517902 0.8670284
## gbm 1.005104 0.7617203 0.7987109
Again, gradient boosting gives the lowest test-set RMSE and comes out as the best model.
In the previous homeworks on linear and nonlinear models, PLS and SVM came out as the best models. We run them again here for comparison:
model_pls <- train(x = X_train, y = y_train, method = 'pls', metric = 'RMSE',
                   tuneLength = 20, trControl = trainControl(method = 'cv'))
(pls_imp = varImp(model_pls))
## pls variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial03 36.80
## ManufacturingProcess14 36.58
## ManufacturingProcess27 35.84
## ManufacturingProcess24 34.87
## ManufacturingProcess28 33.07
## ManufacturingProcess35 32.72
## ManufacturingProcess26 30.38
## ManufacturingProcess15 30.33
## ManufacturingProcess06 29.02
## BiologicalMaterial06 28.19
## ManufacturingProcess25 28.17
## ManufacturingProcess04 27.54
## BiologicalMaterial02 27.34
## ManufacturingProcess02 21.66
## ManufacturingProcess05 18.25
## ManufacturingProcess33 17.54
## ManufacturingProcess19 16.02
## ManufacturingProcess20 15.11
## ManufacturingProcess18 15.04
set.seed(424)
svm_model <- train(x = X_train, y = y_train,
                   method = "svmRadial",
                   tuneLength = 10,
                   preProc = c("center", "scale"))
(svm_imp = varImp(svm_model))
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## BiologicalMaterial06 100.00
## ManufacturingProcess13 98.34
## ManufacturingProcess32 96.95
## BiologicalMaterial03 81.49
## ManufacturingProcess17 75.31
## BiologicalMaterial02 73.06
## BiologicalMaterial12 71.59
## ManufacturingProcess09 70.96
## ManufacturingProcess31 67.21
## ManufacturingProcess36 65.94
## ManufacturingProcess06 55.50
## BiologicalMaterial04 54.37
## BiologicalMaterial11 53.94
## ManufacturingProcess33 49.48
## BiologicalMaterial08 48.82
## BiologicalMaterial01 46.53
## ManufacturingProcess29 40.97
## ManufacturingProcess02 38.37
## ManufacturingProcess11 36.10
## ManufacturingProcess12 32.92
Let us compare the three models' lists of important variables:
p1<-plot(svm_imp, top=10, main='SVM')
p2<-plot(pls_imp, top=10, main='PLS')
gbm_imp<-varImp(model_gbm1)
p3<-plot(gbm_imp, top=10, main='GBM')
gridExtra::grid.arrange(p1, p2,p3, ncol = 3)
Gradient boosting and SVM both have biological and manufacturing predictors in their top 10, whereas the partial least squares model's top 10 is dominated by manufacturing predictors. ManufacturingProcess32 ranks as an important variable in all three models.
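That claim can be cross-checked programmatically; a sketch, where top10 is a hypothetical helper that extracts the ten highest-ranked predictor names from a varImp object:
# Intersect the top-10 predictor names across the three models
top10 <- function(imp) {
  ov <- imp$importance
  rownames(ov)[order(ov$Overall, decreasing = TRUE)][1:10]
}
Reduce(intersect, list(top10(gbm_imp), top10(pls_imp), top10(svm_imp)))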
model_rpart$finalModel
## n= 144
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 144 472.70180 40.17111
## 2) ManufacturingProcess32< 159.5 83 142.65450 39.19699
## 4) BiologicalMaterial11< 145.075 39 42.89152 38.55615 *
## 5) BiologicalMaterial11>=145.075 44 69.55090 39.76500 *
## 3) ManufacturingProcess32>=159.5 61 144.12200 41.49656
## 6) BiologicalMaterial06< 51.61 34 60.34664 40.74559 *
## 7) BiologicalMaterial06>=51.61 27 40.45527 42.44222 *
rpart.plot::rpart.plot(model_rpart$finalModel, box.palette = "RdBu", shadow.col = "blue", nn = TRUE)
The tree first splits on ManufacturingProcess32 at 159.5. Below this value the mean yield is about 39.2; above it, about 41.5. The lower branch then splits on BiologicalMaterial11 and the upper branch on BiologicalMaterial06, with the best terminal node reaching a mean yield of about 42.4. According to this tree, higher values of these predictors are associated with higher yield.