library(caret)
library(mlbench)
library(randomForest)
library(party)
library(dplyr)
library(gbm)
library(Cubist)

Exercise 8.1

Recreate the simulated data from Exercise 7.2:

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores

model_random <- randomForest(y ~ ., data= simulated, importance = TRUE, ntree= 1000)

rf_imp <- varImp(model_random, scale=FALSE)
rf_imp
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Did the random forest model significantly use the uninformative predictors (V6-V10)?

The predictors V6-V10 have much smaller importance scores than V1-V5, so the random forest did not make significant use of the uninformative predictors.
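
For a quick visual check (not required by the exercise), randomForest's built-in importance plot can be drawn for the fitted model:

# Optional sketch: plot the permutation (%IncMSE) and node-purity importances side by side
varImpPlot(model_random)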

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model_random2 <- randomForest(y ~ ., data= simulated, importance= TRUE, ntree= 1000)

rf_imp2 <- varImp(model_random2, scale=FALSE)
rf_imp2
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

According to the results, duplicate1 is highly correlated with V1 (about 0.95), and V1's importance dropped from about 8.7 to about 5.7 because the two correlated predictors now share the signal, with duplicate1 itself receiving an importance of about 4.3. The most important predictors remain V4, V2, V1, and V5. Adding yet another predictor that is highly correlated with V1 should dilute V1's importance even further, as shown in the sketch below.
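
To check this directly, here is a minimal sketch (the names sim_extra, duplicate2, and model_random3 are mine) that adds a second predictor correlated with V1 on a copy of the data, so the objects used later in this document are unaffected.

# Sketch: add another predictor highly correlated with V1 and refit the random forest
sim_extra <- simulated
sim_extra$duplicate2 <- sim_extra$V1 + rnorm(200) * .1
model_random3 <- randomForest(y ~ ., data = sim_extra, importance = TRUE, ntree = 1000)
varImp(model_random3, scale = FALSE)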

(c) Use the cforest function in the party package to fit a random forest model using the conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

cforest_model <- cforest(y ~ ., data= simulated)

varimp(cforest_model) %>% sort(decreasing = TRUE) # Without conditional
##            V4            V2    duplicate1            V1            V5 
##  7.6223892727  6.0579730772  5.0941897280  4.6171158805  1.7161194047 
##            V7            V9            V3            V6           V10 
##  0.0465374951  0.0046062409  0.0003116115 -0.0289427183 -0.0310326410 
##            V8 
## -0.0380965511
varimp(cforest_model, conditional = TRUE) %>% sort(decreasing=TRUE) # With conditional 
##           V4           V2   duplicate1           V1           V5           V6 
##  6.190578351  4.688980360  1.926751729  1.807953145  1.051666850  0.028174759 
##           V9           V3           V8           V7          V10 
##  0.015118505  0.012878752 -0.004356587 -0.011709437 -0.022190587

There are some differences between the traditional and conditional importance calculations. V4, V2, duplicate1, V1, and V5 are still the most important variables under both measures, but their scores shrink with the conditional calculation, and the correlated pair (V1 and duplicate1) drops the most, which is exactly what the conditional measure is designed to do. Overall, the ranking shows a similar pattern to the traditional random forest model.

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

gbm_model <- gbm(y ~ ., data=simulated, distribution = 'gaussian')
summary(gbm_model)

##                   var    rel.inf
## V4                 V4 29.9389549
## V2                 V2 21.6012918
## V1                 V1 15.6771729
## duplicate1 duplicate1 13.2189845
## V5                 V5 10.0440211
## V3                 V3  9.0948053
## V6                 V6  0.4247697
## V7                 V7  0.0000000
## V8                 V8  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000

I still see the same pattern: V4, V2, V1, and duplicate1 are the most important predictors, while V6-V10 contribute little or nothing. The main difference from the previous models is that boosting concentrates even more of the relative importance on the top predictors, but the ranking is almost the same. Now let's fit a Cubist model and see if the pattern persists there too.

cubist_model <- cubist(simulated[,-11], simulated[, 11])
varImp(cubist_model)
##            Overall
## V1              50
## V2              50
## V4              50
## V5              50
## duplicate1      50
## V3               0
## V6               0
## V7               0
## V8               0
## V9               0
## V10              0

Here the pattern changes slightly: with the default settings, Cubist assigns equal importance to V1, V2, V4, V5, and duplicate1 and zero to everything else, so the same informative predictors come out on top but the ranking is much coarser.
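
As a quick follow-up (my own sketch, not part of the original answer), fitting Cubist with more than one committee may produce a less coarse ranking; the committees argument below is the only change from the model above.

# Sketch: Cubist with 10 committees instead of the single default committee
cubist_model2 <- cubist(simulated[, -11], simulated[, 11], committees = 10)
varImp(cubist_model2)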

Exercise 8.2

Use a simulation to show tree bias with different granularities

library(rpart)
library(partykit)
## Loading required package: libcoin
## 
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp
set.seed(388)
X1 <- sample(1:10, 100, replace = TRUE)
X2 <- rnorm(100)
X3 <- rnorm(100, mean = 0, sd = 4)
X4 <- sample(1:10/10, 100, replace = TRUE)
Y <- X1 + rnorm(100, mean = 0, sd = .5)

df <- data.frame(X1, X2, X3, X4, Y)

# Rpart
rpart_m <- rpart(Y ~., data=df)
varImp(rpart_m)
##      Overall
## X1 3.2553667
## X2 0.2997476
## X3 0.8363086
## X4 0.3748365

X1 is, as expected, by far the most important predictor since Y was generated from X1 alone. However, the noise predictors X2-X4 still receive non-zero importance even though they are unrelated to Y, with the high-variance continuous predictor X3 scoring highest among them. This reflects the selection bias of CART-style trees toward predictors with finer granularity: variables with more distinct values offer more candidate split points and are therefore more likely to be chosen. The sketch below shows this bias more directly.
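
Here is a small simulation I am adding to make that point concrete (the variable names and settings are mine): the response is pure noise, so no predictor is informative, and the expectation is that rpart's first split lands on the finer-grained predictors much more often than on the coarse one.

# Sketch: count which predictor wins the first split when Y is unrelated to all of them
set.seed(101)
first_split <- replicate(200, {
  d <- data.frame(
    X_coarse = sample(1:2, 50, replace = TRUE),   #  2 distinct values
    X_mid    = sample(1:10, 50, replace = TRUE),  # 10 distinct values
    X_fine   = rnorm(50),                         # ~50 distinct values
    Y        = rnorm(50)                          # noise response
  )
  fit <- rpart(Y ~ ., data = d,
               control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  as.character(fit$frame$var[1])                  # variable used at the root split
})
table(first_split)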

Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

The model on the right uses a large learning rate and a large bagging fraction: each tree contributes a large share of the final prediction and is built on nearly all of the data, so the same few strong predictors are selected tree after tree and the importance concentrates on them. The model on the left, with both parameters at 0.1, accumulates many small contributions from trees built on different 10% subsamples, which lets more predictors enter the ensemble and spreads the importance across them. In practice, a learning rate around 0.1 is usually the better choice for fitting. A rough sketch of this contrast on the solubility data follows.
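
Below is a rough sketch of the contrast on the solubility data behind Fig. 8.24. The parameter values are chosen only to mimic the two extremes, not the book's exact settings, and the code assumes the solTrainXtrans/solTrainY objects from the AppliedPredictiveModeling package (loaded later for Exercise 8.7).

# Sketch: compare importance concentration at the two extreme settings
library(AppliedPredictiveModeling)
data(solubility)
solub <- cbind(solTrainXtrans, Solubility = solTrainY)
gbm_left  <- gbm(Solubility ~ ., data = solub, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1, interaction.depth = 1)
gbm_right <- gbm(Solubility ~ ., data = solub, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9, interaction.depth = 1)
head(summary(gbm_left, plotit = FALSE), 10)   # expected: importance spread over more predictors
head(summary(gbm_right, plotit = FALSE), 10)  # expected: importance concentrated on a few predictors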

(b) Which model do you think would be more predictive of other samples?

I cannot say for certain without evaluating both models on held-out data, because a model that looks good on the training data may in fact be overfitting. That risk is greater for the right-hand model, whose aggressive learning rate and bagging fraction make it more likely to over-commit to a few predictors, so either model would need to be validated before being preferred.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig 8.24?

Interaction depth plays the same role as tree depth. As the interaction depth increases, each tree can split on more predictors, so importance would be spread across a larger set of predictors and the slope of predictor importance in Fig. 8.24 would likely become flatter for both models, while the resampled RMSE may improve up to a point as more interactions are captured.

Exercise 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

# Imputing the missing values using bagImpute
cmp_impute <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('bagImpute'))

# Replacing
cmp <- predict(cmp_impute, ChemicalManufacturingProcess[,-c(1)])

# Splitting the data into training and test datasets
set.seed(3559)
train_r <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
X_train <- cmp[train_r,]
y_train <- ChemicalManufacturingProcess$Yield[train_r]
X_test <- cmp[-train_r,]
y_test <- ChemicalManufacturingProcess$Yield[-train_r]

Several tree models

Now that the data has been imputed and split into training and test sets, I am going to train several tree-based models and see how they perform.

Single Tree

set.seed(2330)
rpart_m <- train(x= X_train, y= y_train, method="rpart", tuneLength=10, control= rpart.control(maxdepth=2))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
rpart_m
## CART 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   cp           RMSE      Rsquared   MAE     
##   0.008448771  1.531011  0.3633249  1.192349
##   0.026985533  1.531011  0.3633249  1.192349
##   0.027870628  1.531011  0.3633249  1.192349
##   0.028155890  1.531011  0.3633249  1.192349
##   0.031178734  1.531011  0.3633249  1.192349
##   0.041325211  1.537862  0.3582518  1.197905
##   0.053936668  1.536746  0.3578463  1.196370
##   0.071685404  1.534111  0.3596184  1.191464
##   0.077778984  1.540440  0.3524684  1.194357
##   0.402184913  1.708874  0.3075143  1.361557
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.03117873.

Random Forest

set.seed(2044)
randomf_m <- train(X_train, y_train, method='rf', tuneLength = 10)
randomf_m
## Random Forest 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    2    1.357171  0.5863109  1.0831068
##    8    1.263995  0.6172568  0.9960572
##   14    1.247450  0.6174306  0.9760008
##   20    1.245978  0.6110851  0.9690348
##   26    1.247355  0.6051196  0.9642880
##   32    1.251446  0.5972517  0.9661479
##   38    1.259134  0.5894453  0.9691093
##   44    1.267043  0.5810307  0.9747482
##   50    1.277367  0.5724495  0.9787299
##   57    1.282848  0.5669017  0.9810368
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 20.

Gradient Boosting

set.seed(19828)
grid <- expand.grid(n.trees=c(50, 100, 150, 200), 
                    interaction.depth=c(1, 5, 10, 15), 
                    shrinkage=c(0.01, 0.1, 0.5), 
                    n.minobsinnode=c(5, 10, 15))
gbm_m <- train(x = X_train, y = y_train, method = 'gbm', tuneGrid = grid, verbose = FALSE)

gbm_m$bestTune
##    n.trees interaction.depth shrinkage n.minobsinnode
## 76     200                10       0.1              5
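
For reference (a small addition, since train only prints the chosen tuning values here), the resampled performance of that best combination can be pulled from the results table:

# Sketch: resampled RMSE/Rsquared/MAE for the best tuning parameter combination
gbm_m$results[which.min(gbm_m$results$RMSE), ]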

(a) Which tree-based regression model gives the optimal resampling and test set performance?

Let's compare the resampling distributions of the three models.

summary(resamples(list(Single_Tree = rpart_m, Random_Forest = randomf_m, Gradient_Boosting = gbm_m)))
## 
## Call:
## summary.resamples(object = resamples(list(Single_Tree = rpart_m,
##  Random_Forest = randomf_m, Gradient_Boosting = gbm_m)))
## 
## Models: Single_Tree, Random_Forest, Gradient_Boosting 
## Number of resamples: 25 
## 
## MAE 
##                        Min.   1st Qu.    Median      Mean   3rd Qu.     Max.
## Single_Tree       0.9444254 1.1065223 1.1844621 1.1923487 1.2964423 1.400361
## Random_Forest     0.7007020 0.9130536 0.9774114 0.9690348 1.0121849 1.210727
## Gradient_Boosting 0.7048638 0.8483657 0.9129005 0.9053819 0.9460865 1.127389
##                   NA's
## Single_Tree          0
## Random_Forest        0
## Gradient_Boosting    0
## 
## RMSE 
##                        Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## Single_Tree       1.2232769 1.404920 1.554384 1.531011 1.627351 1.797655    0
## Random_Forest     0.9070558 1.158586 1.263932 1.245978 1.337436 1.595936    0
## Gradient_Boosting 0.9365434 1.106733 1.154823 1.164502 1.206801 1.428635    0
## 
## Rsquared 
##                        Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## Single_Tree       0.2166224 0.2787529 0.3634825 0.3633249 0.4349596 0.6043007
## Random_Forest     0.4092469 0.5701930 0.6147885 0.6110851 0.6675758 0.6936807
## Gradient_Boosting 0.5302200 0.5914836 0.6203363 0.6299045 0.6693364 0.7546486
##                   NA's
## Single_Tree          0
## Random_Forest        0
## Gradient_Boosting    0

It seems that gradient boosting is doing better, as its resampled RMSE is lower than that of the other models. I am not choosing on R-squared alone, since it does not account for the addition of uninformative predictors. Now let's evaluate the models on the test set.

# Helper: compute test-set RMSE, R-squared, and MAE for a list of caret models
test_perf <- function(models, testData, testTarget) {
  method <- c()
  res <- data.frame()
  for (model in models) {
    method <- c(method, model$method)              # caret method name, used as the row label
    pred <- predict(model, newdata = testData)     # predictions on the held-out test set
    res <- rbind(res, t(postResample(pred = pred, obs = testTarget)))
  }
  row.names(res) <- method
  return(res)
}

models <- list(rpart_m, randomf_m, gbm_m)

performance <- test_perf(models, X_test, y_test)
performance
##            RMSE  Rsquared       MAE
## rpart 1.3408087 0.3804309 1.1170494
## rf    0.9892451 0.6391462 0.7023174
## gbm   1.0198520 0.6318556 0.7686820

Random forest performed slightly better than gradient boosting on the test set, with lower RMSE and MAE and a slightly higher R-squared. It is also more consistent between the resampling estimates and the test results, so I would choose the random forest model, although with more data I would run additional comparisons to confirm which model is the more reliable.

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

varImp(randomf_m)
## rf variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess13  32.681
## BiologicalMaterial03    27.716
## BiologicalMaterial12    25.506
## ManufacturingProcess09  19.357
## ManufacturingProcess17  18.443
## ManufacturingProcess06  16.333
## BiologicalMaterial06    15.269
## ManufacturingProcess31  11.872
## ManufacturingProcess28  11.836
## BiologicalMaterial11    11.301
## BiologicalMaterial02    11.010
## BiologicalMaterial09     9.054
## BiologicalMaterial08     8.160
## BiologicalMaterial04     8.030
## ManufacturingProcess36   7.620
## BiologicalMaterial05     6.614
## BiologicalMaterial01     6.000
## ManufacturingProcess39   5.258
## ManufacturingProcess27   5.164

ManufacturingProcess32 is again the most important predictor by a wide margin, which is consistent with the optimal linear and nonlinear models from the earlier exercises. The rest of the top 10 mixes the two types (seven process and three biological variables in total), so the manufacturing process variables lead the list, but the biological materials are also well represented rather than being dominated outright. The top 10 can also be plotted for a quick visual comparison, as sketched below.
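
For an easier side-by-side look against the earlier top-10 lists, the importances can be plotted with caret's plot method (a quick sketch):

# Sketch: plot the 10 most important predictors from the random forest model
plot(varImp(randomf_m), top = 10)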

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

rpart_m$finalModel
## n= 144 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 144 510.06710 40.17806  
##   2) ManufacturingProcess32< 159.5 81 137.14040 39.12543  
##     4) BiologicalMaterial11< 145.47 44  50.63688 38.50932 *
##     5) BiologicalMaterial11>=145.47 37  49.93917 39.85811 *
##   3) ManufacturingProcess32>=159.5 63 167.78540 41.53143  
##     6) ManufacturingProcess06< 208.1 37  81.27967 40.86622 *
##     7) ManufacturingProcess06>=208.1 26  46.83320 42.47808 *
plot(rpart_m$finalModel)
text(rpart_m$finalModel)
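
Since partykit is already loaded, converting the final rpart model with as.party gives a display that shows the distribution of yield in each terminal node as a boxplot, which is closer to what the question asks for (my own addition):

# Sketch: terminal-node boxplots of yield via partykit
plot(partykit::as.party(rpart_m$finalModel))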

The tree's first split is on ManufacturingProcess32 at 159.5: samples at or above this cutoff have a noticeably higher average yield (about 41.5 versus 39.1), so higher values of ManufacturingProcess32 are associated with higher yield. The subsequent splits on BiologicalMaterial11 and ManufacturingProcess06 show that both biological and process predictors refine the prediction within each branch, which adds some insight beyond the importance rankings alone.