Textbook: Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer, New York, 2013.

library(caret)
library(mlbench)
library(randomForest)
library(party)
library(dplyr)
library(gbm)
library(Cubist)
library(rpart)
library(partykit)

Exercise 8.1

Recreate the simulated data from Exercise 7.2:

set.seed(420)
simulated = mlbench.friedman1(200, sd = 1)
simulated = cbind(simulated$x, simulated$y)
simulated = as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] = "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores

model_rf = randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rf_Imp = varImp(model_rf, scale = FALSE)

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

rf_Imp
##         Overall
## V1   6.29089217
## V2   7.23706200
## V3   1.52229636
## V4   9.93245576
## V5   0.82685129
## V6   0.11773907
## V7  -0.01531565
## V8   0.23933146
## V9   0.01159528
## V10  0.17604773

The importance scores for V6-V10 are much smaller than those of V1-V5, so the random forest model does not appear to have made significant use of the uninformative predictors.
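
A quick visual check (a small sketch using randomForest's varImpPlot; type = 1 plots the permutation-based measure, matching the unscaled scores above):

varImpPlot(model_rf, type = 1, scale = FALSE)   # V6-V10 should cluster near zero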

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9377547

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model_rf2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rf_imp2 <- varImp(model_rf2, scale = FALSE)
rf_imp2
##                  Overall
## V1          4.8648919382
## V2          7.3372450126
## V3          1.2728087767
## V4          9.8316668348
## V5          0.9354855143
## V6          0.1025747835
## V7         -0.0006555202
## V8          0.2455059911
## V9         -0.1369595734
## V10         0.0811743538
## duplicate1  3.3725652358

The importance of V1 dropped (from about 6.29 to 4.86) after adding duplicate1, which is very highly correlated with V1 (r of about 0.94). The importance that previously went to V1 alone is now split between V1 and duplicate1, so the second model treats more predictors as important than necessary and the individual scores are diluted.
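
The question also asks what happens when yet another predictor correlated with V1 is added. A minimal sketch, working on a copy of the data so the later parts are unaffected (the seed and object names are illustrative):

set.seed(421)                                 # illustrative seed
sim2 <- simulated
sim2$duplicate2 <- sim2$V1 + rnorm(200) * .1  # a second predictor correlated with V1
model_rf2b <- randomForest(y ~ ., data = sim2, importance = TRUE, ntree = 1000)
varImp(model_rf2b, scale = FALSE)             # V1's score is diluted further, now split
                                              # across V1, duplicate1 and duplicate2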

(c) Use the cforest function in the party package to fit a random forest model using the conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

model_cforest <- cforest(y ~ ., data = simulated)
cf_imp3 <- varimp(model_cforest) %>% sort(decreasing = TRUE)                     # traditional (unconditional)
cf_imp4 <- varimp(model_cforest, conditional = TRUE) %>% sort(decreasing = TRUE) # conditional (Strobl et al. 2007)
as.data.frame(cbind(Model2 = rf_imp2, Unconditional = cf_imp3, Conditional = cf_imp4))
##                  Overall Unconditional Conditional
## V1          4.8648919382    8.84785379   7.4280605
## V2          7.3372450126    6.89128208   5.5693755
## V3          1.2728087767    5.02776556   3.7467140
## V4          9.8316668348    3.49451211   1.5008438
## V5          0.9354855143    1.01746720   0.5642202
## V6          0.1025747835    0.62881291   0.2146344
## V7         -0.0006555202    0.58448420   0.0739921
## V8          0.2455059911   -0.03204360  -0.1407056
## V9         -0.1369595734   -0.04878733  -0.1698899
## V10         0.0811743538   -0.11795138  -0.2001284
## duplicate1  3.3725652358   -0.24604449  -0.2212346

Comparing the traditional random forest importances with the cforest measures, V1's importance increases while V4 and duplicate1 decrease. The conditional measure takes the correlation among predictors into account and adjusts the scores of correlated variables such as duplicate1 downward.
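
One caveat with the table above: sort() reorders the cforest scores while cbind() keeps rf_imp2's row order, so the rows are not guaranteed to line up. Indexing the named varimp vectors by predictor name keeps each score with its variable; a small sketch:

vars <- rownames(rf_imp2)
data.frame(RF            = rf_imp2[vars, "Overall"],
           Unconditional = cf_imp3[vars],
           Conditional   = cf_imp4[vars],
           row.names     = vars)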

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

#Running gbm
gbmGrid = expand.grid(interaction.depth = seq(1,5, by=2), 
                      n.trees = seq(100, 1000, by = 100), 
                      shrinkage = 0.1, 
                      n.minobsinnode = 5)
model_gbm = train(y ~ ., data = simulated, 
                  tuneGrid = gbmGrid, verbose = FALSE, 
                  method = 'gbm')
gbm_imp<-varImp(model_gbm)
#Running cubist
model_cubist <- cubist(simulated[,-11], simulated[, 11])
cubist_imp<-varImp(model_cubist)
#Compare
df = as.data.frame(cbind(gbm_imp$importance, cubist_imp))
names(df) = c("boosted", "cubist")
df
##               boosted cubist
## V1          56.637970   94.5
## V2          72.026872   78.0
## V3          37.037956   79.5
## V4         100.000000   60.5
## V5          17.725744   39.5
## V6           1.451492    3.5
## V7           1.067405    0.0
## V8           1.136399    0.0
## V9           0.000000    0.0
## V10          3.029049    0.0
## duplicate1  12.501278    0.0

V6-V10 remain among the least important predictors, just as in the traditional and conditional random forest models. V4 is still the top predictor for boosting, while Cubist ranks V1 highest, and duplicate1 receives little or no importance in either model. Overall the pattern is the same.
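
The Cubist fit above uses the defaults (a single committee, no instance-based correction). If tuning were desired, caret exposes the committees and neighbors parameters; a hedged sketch with illustrative grid values:

cubistGrid <- expand.grid(committees = c(1, 10, 50), neighbors = c(0, 5))
model_cubist_t <- train(x = simulated[, setdiff(names(simulated), "y")],
                        y = simulated$y,
                        method = "cubist",
                        tuneGrid = cubistGrid)
varImp(model_cubist_t)   # importance pattern can be compared with the table above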

Exercise 8.2

(a) Use a simulation to show tree bias with different granularities

Below, five simulated predictors with different granularities (numbers of distinct values) are generated. The response is a function of X1 plus random noise.

set.seed(450)
X1 <- sample(1:1000 / 1000, 252, replace = TRUE)  # 1000 distinct values
X2 <- sample(1:100 / 100, 252, replace = TRUE)    # 100 distinct values
X3 <- sample(1:10 / 10, 252, replace = TRUE)      # 10 distinct values
X4 <- sample(1:5 / 5, 252, replace = TRUE)        # 5 distinct values
X5 <- rep(1, 252)                                 # constant (1 distinct value)
Y  <- X1 + rnorm(252, mean = 0, sd = 1.5)         # response depends only on X1

df <- data.frame(X1, X2, X3, X4, X5, Y)

# Rpart
rpart_model <- rpart(Y ~., data=df)
varImp(rpart_model)
##      Overall
## X1 0.8433728
## X2 1.1844165
## X3 0.7082408
## X4 0.7733144
## X5 0.0000000

Even though Y depends only on X1, the fitted tree gives X2 (an uninformative predictor with 100 distinct values) the highest importance, with X1, X4, and X3 close behind and the constant X5 at zero. Predictors with more distinct values offer more candidate split points and tend to be favoured, illustrating the selection bias of single trees toward high-granularity predictors.
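
To isolate the bias itself, the simulation below (a sketch; exact counts depend on the seed) uses two uninformative predictors that differ only in granularity and records which one rpart picks for the first split:

set.seed(451)                                 # illustrative seed
first_split <- replicate(500, {
  dat <- data.frame(X_fine   = sample(1:1000 / 1000, 252, replace = TRUE),  # 1000 values
                    X_coarse = sample(1:4 / 4, 252, replace = TRUE),        # 4 values
                    Y        = rnorm(252))                                  # pure noise
  fit <- rpart(Y ~ ., data = dat, control = rpart.control(maxdepth = 1, cp = 0))
  as.character(fit$frame$var[1])              # variable used at the root split
})
table(first_split)                            # the fine-grained predictor wins far more often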

Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

The model on the right uses both a high bagging fraction (0.9) and a high learning rate (0.9). With 90% of the data used at each iteration, the trees are very similar to one another, and with a large learning rate each tree's contribution is added almost in full, so the same strong predictors are selected again and again. Importance therefore concentrates on just a few variables. With both parameters at 0.1, each tree sees a different small subsample and contributes only a small step, so many more predictors get used across iterations and importance is spread out.
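
To see this on the solubility data itself, a sketch comparing the two extremes (it assumes the solTrainXtrans/solTrainY objects loaded by data(solubility) in the AppliedPredictiveModeling package; the tree count is illustrative):

library(AppliedPredictiveModeling)
library(gbm)
data(solubility)                                   # loads solTrainXtrans and solTrainY
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
fit_low  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
                interaction.depth = 1, verbose = FALSE)
fit_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                interaction.depth = 1, verbose = FALSE)
head(summary(fit_low,  plotit = FALSE), 10)        # importance spread over many predictors
head(summary(fit_high, plotit = FALSE), 10)        # importance concentrated in a few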

(b) Which model do you think would be more predictive of other samples?

The model with the larger learning rate and bagging fraction is more likely to overfit the training data, since it fits aggressively and leans on only a few predictors. The model on the left, which learns more slowly and spreads importance across more predictors, has a better chance of being predictive of other samples.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig 8.24?

Interaction depth is the maximum depth of each tree. As it increases, each tree can use more predictors, so importance is shared across a larger set of variables and the sorted importance profile in Fig. 8.24 would likely become flatter (a less steep slope) for both models.
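
Continuing the sketch above (same assumed solubility objects), raising the interaction depth for the 0.9/0.9 model illustrates the flattening:

fit_deep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                interaction.depth = 10, verbose = FALSE)
head(summary(fit_deep, plotit = FALSE), 20)        # importance now spread over more predictors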

Exercise 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Impute the missing values using bagImpute
cmp_impute <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('bagImpute'))
# Replace
cmp <- predict(cmp_impute, ChemicalManufacturingProcess[,-c(1)])
# Splitting the data into training and test datasets
set.seed(480)
train_r <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
X_train <- cmp[train_r,]
y_train <- ChemicalManufacturingProcess$Yield[train_r]
X_test <- cmp[-train_r,]
y_test <- ChemicalManufacturingProcess$Yield[-train_r]
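
A quick sanity check on the imputation (a small sketch):

sum(is.na(ChemicalManufacturingProcess[, -1]))   # missing values in the raw predictors
sum(is.na(cmp))                                  # should be 0 after bagImpute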

Single Tree

set.seed(330)
model_rpart <- train(x= X_train, y= y_train, method="rpart", tuneLength=10, control= rpart.control(maxdepth=2))
model_rpart
## CART 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE      Rsquared   MAE     
##   0.01639564  1.480376  0.3607456  1.147913
##   0.02137409  1.480376  0.3607456  1.147913
##   0.02341864  1.480376  0.3607456  1.147913
##   0.02532733  1.482527  0.3601484  1.148927
##   0.03111746  1.485388  0.3573289  1.151740
##   0.03600407  1.485388  0.3573289  1.151740
##   0.04559790  1.497955  0.3500961  1.159016
##   0.06391370  1.512886  0.3413150  1.171772
##   0.09164355  1.500473  0.3448356  1.169280
##   0.39332469  1.657623  0.2971253  1.320311
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.02341864.

Random Forest

set.seed(340)
model_rf3<- train(X_train, y_train, method='rf', tuneLength = 10)
model_rf3
## Random Forest 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    2    1.306399  0.5247818  1.0456889
##    8    1.235512  0.5490750  0.9683824
##   14    1.220206  0.5526758  0.9492727
##   20    1.214414  0.5510626  0.9382298
##   26    1.214831  0.5469368  0.9329945
##   32    1.219346  0.5416017  0.9360704
##   38    1.221729  0.5372881  0.9360276
##   44    1.227097  0.5317287  0.9368421
##   50    1.232757  0.5261384  0.9411051
##   57    1.234035  0.5242386  0.9425201
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 20.

Gradient Boosting

set.seed(350)
grid <- expand.grid(n.trees=c(50, 100, 150, 200), 
                    interaction.depth=c(1, 5, 10, 15), 
                    shrinkage=c(0.01, 0.1, 0.5), 
                    n.minobsinnode=c(5, 10, 15))
model_gbm1 <- train(x = X_train,y = y_train, method = 'gbm',tuneGrid = grid, verbose = FALSE)

model_gbm1$bestTune
##    n.trees interaction.depth shrinkage n.minobsinnode
## 88     200                15       0.1              5

(a) Which tree-based regression model gives the optimal resampling and test set performance?

Let us resample and have a look

summary(resamples(list(Single_True = model_rpart, Random_Forest = model_rf3, Gradient_Boosting=model_gbm1)))
## 
## Call:
## summary.resamples(object = resamples(list(Single_True =
##  model_rpart, Random_Forest = model_rf3, Gradient_Boosting = model_gbm1)))
## 
## Models: Single_True, Random_Forest, Gradient_Boosting 
## Number of resamples: 25 
## 
## MAE 
##                        Min.   1st Qu.    Median      Mean   3rd Qu.     Max.
## Single_True       0.9445823 1.0825643 1.1771915 1.1479128 1.2229221 1.318403
## Random_Forest     0.7726775 0.8568233 0.9438754 0.9382298 0.9756329 1.170187
## Gradient_Boosting 0.8104290 0.8763024 0.9236562 0.9237093 0.9625195 1.039068
##                   NA's
## Single_True          0
## Random_Forest        0
## Gradient_Boosting    0
## 
## RMSE 
##                        Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## Single_True       1.2451998 1.405136 1.484287 1.480376 1.577538 1.636725    0
## Random_Forest     0.9779076 1.127082 1.196840 1.214414 1.265581 1.542786    0
## Gradient_Boosting 1.0682624 1.140261 1.188020 1.204190 1.239991 1.469959    0
## 
## Rsquared 
##                        Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## Single_True       0.1808142 0.2998683 0.3552829 0.3607456 0.4372972 0.5070873
## Random_Forest     0.3940632 0.4918491 0.5656946 0.5510626 0.6242199 0.6959368
## Gradient_Boosting 0.4210686 0.5505479 0.5950396 0.5802401 0.6193195 0.6888321
##                   NA's
## Single_True          0
## Random_Forest        0
## Gradient_Boosting    0

Gradient boosting looks best on resampling, with a slightly lower mean RMSE (about 1.20 versus 1.21 for the random forest) and the highest mean R-squared. One caveat is that each model was tuned on its own bootstrap samples; a stricter comparison would share the resampling indices across models, as sketched below. We then evaluate all three models on the test set.
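
A sketch of that stricter comparison (the seed is illustrative; grid is the gbm tuning grid defined above):

set.seed(360)                                  # illustrative seed
ctrl <- trainControl(method = "boot", number = 25,
                     index = createResample(y_train, times = 25))
m_rpart <- train(X_train, y_train, method = "rpart", tuneLength = 10, trControl = ctrl)
m_rf    <- train(X_train, y_train, method = "rf",    tuneLength = 5,  trControl = ctrl)
m_gbm   <- train(X_train, y_train, method = "gbm",   tuneGrid = grid,
                 trControl = ctrl, verbose = FALSE)
summary(resamples(list(rpart = m_rpart, rf = m_rf, gbm = m_gbm)))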

#Function for test data
test_performance <- function(models, testData, testTarget) {
  method <- c()
  res <- data.frame()
  for(model in models){
    method <- c(method, model$method)
    pred <- predict(model, newdata=testData)
    res <- rbind(res, t(postResample(pred=pred, obs=testTarget)))
  }
  row.names(res) <- method
  return(res)
}
# List the models
models <- list(model_rpart, model_rf3, model_gbm1)
#Run the function
performance <- test_performance(models, X_test, y_test)
performance
##           RMSE  Rsquared       MAE
## rpart 1.564727 0.3765960 1.2801338
## rf    1.102755 0.7517902 0.8670284
## gbm   1.005104 0.7617203 0.7987109

On the test set, gradient boosting again gives the lowest RMSE (and the highest R-squared), so it remains the best of the three models.

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

In the earlier homework on linear and nonlinear models, PLS and SVM were the best-performing models, so we refit them here for comparison.

model_pls <- train(x = X_train,y = y_train, method='pls', metric='RMSE',
                   tuneLength=20, trControl = trainControl(method='cv'))
(pls_imp = varImp(model_pls))
## pls variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial03     36.80
## ManufacturingProcess14   36.58
## ManufacturingProcess27   35.84
## ManufacturingProcess24   34.87
## ManufacturingProcess28   33.07
## ManufacturingProcess35   32.72
## ManufacturingProcess26   30.38
## ManufacturingProcess15   30.33
## ManufacturingProcess06   29.02
## BiologicalMaterial06     28.19
## ManufacturingProcess25   28.17
## ManufacturingProcess04   27.54
## BiologicalMaterial02     27.34
## ManufacturingProcess02   21.66
## ManufacturingProcess05   18.25
## ManufacturingProcess33   17.54
## ManufacturingProcess19   16.02
## ManufacturingProcess20   15.11
## ManufacturingProcess18   15.04
set.seed(424)
svm_model <- train(x = X_train,y = y_train,
                        method = "svmRadial",
                        tuneLength=10,
                        preProc = c("center", "scale"))
(svm_imp = varImp(svm_model))
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## BiologicalMaterial06    100.00
## ManufacturingProcess13   98.34
## ManufacturingProcess32   96.95
## BiologicalMaterial03     81.49
## ManufacturingProcess17   75.31
## BiologicalMaterial02     73.06
## BiologicalMaterial12     71.59
## ManufacturingProcess09   70.96
## ManufacturingProcess31   67.21
## ManufacturingProcess36   65.94
## ManufacturingProcess06   55.50
## BiologicalMaterial04     54.37
## BiologicalMaterial11     53.94
## ManufacturingProcess33   49.48
## BiologicalMaterial08     48.82
## BiologicalMaterial01     46.53
## ManufacturingProcess29   40.97
## ManufacturingProcess02   38.37
## ManufacturingProcess11   36.10
## ManufacturingProcess12   32.92

Let us compare the three lists of important variables

p1<-plot(svm_imp, top=10, main='SVM')
p2<-plot(pls_imp, top=10, main='PLS')
gbm_imp<-varImp(model_gbm1)
p3<-plot(gbm_imp, top=10, main='GBM')
gridExtra::grid.arrange(p1, p2,p3,  ncol = 3)

Gradient boosting and SVM both include biological and manufacturing predictors in their top 10, whereas the top 10 of the PLS (linear) model is dominated by manufacturing predictors. ManufacturingProcess32 appears as an important variable in all three models.
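
To make the comparison concrete, the top-10 lists can be extracted and intersected (a small sketch; the top10 helper is ours):

top10 <- function(imp) {
  rownames(imp$importance)[order(imp$importance$Overall, decreasing = TRUE)][1:10]
}
Reduce(intersect, list(top10(gbm_imp), top10(pls_imp), top10(svm_imp)))   # predictors common to all three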

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

model_rpart$finalModel
## n= 144 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 144 472.70180 40.17111  
##   2) ManufacturingProcess32< 159.5 83 142.65450 39.19699  
##     4) BiologicalMaterial11< 145.075 39  42.89152 38.55615 *
##     5) BiologicalMaterial11>=145.075 44  69.55090 39.76500 *
##   3) ManufacturingProcess32>=159.5 61 144.12200 41.49656  
##     6) BiologicalMaterial06< 51.61 34  60.34664 40.74559 *
##     7) BiologicalMaterial06>=51.61 27  40.45527 42.44222 *
rpart.plot::rpart.plot(model_rpart$finalModel, box.palette = "RdBu", shadow.col = "blue", nn = TRUE)

The first split is on ManufacturingProcess32 at 159.5. Observations below that threshold have a mean yield of about 39.2 and are further split on BiologicalMaterial11 (at about 145), giving terminal means of roughly 38.6 and 39.8. Observations with ManufacturingProcess32 >= 159.5 have a mean yield of about 41.5 and are further split on BiologicalMaterial06 (at 51.61), with terminal means of about 40.7 and 42.4. In this tree, higher values of both the process and the biological predictors at each split are associated with higher yield, which adds some insight beyond the importance rankings alone.
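
The terminal-node means can also be checked against the observed yields directly; a small sketch using the node assignments stored in the rpart fit:

# finalModel$where gives the terminal node (row of the tree frame) for each training row
boxplot(y_train ~ model_rpart$finalModel$where,
        xlab = "Terminal node (row of tree frame)", ylab = "Yield")
tapply(y_train, model_rpart$finalModel$where, summary)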