8.1

Recreate the simulated data from Exercise 7.2:

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

A. Fit a random forest model to all of the predictors, then estimate the variable importance scores:
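A sketch of the fit behind the importance scores shown below, mirroring the model2 and model3 fits later in this exercise; it assumes the randomForest and caret packages:

library(randomForest)
library(caret)

model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1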

##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

The random forest model did not make significant use of the uninformative predictors: the importance scores for V6–V10 are all essentially zero or negative, while the informative predictors carry clearly positive scores.

B. Now add an additional predictor that is highly correlated with one of the informative predictors.

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change?

The importance score for V1 dropped (from about 8.7 to 5.7) because the highly correlated duplicate1 now shares the splits that V1 would otherwise receive; as a result, V2 and V4 rank above V1, and duplicate1 itself gets a substantial score.

model2 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

rfImp2
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

What happens when you add another predictor that is also highly correlated with V1?

When a second highly correlated predictor is added, V1's importance drops even further (to about 4.9), with the credit now split across V1, duplicate1, and duplicate2; the scores for the other informative predictors stay roughly the same.

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)

rfImp3
##                Overall
## V1          4.91687329
## V2          6.52816504
## V3          0.58711552
## V4          7.04870917
## V5          2.03115561
## V6          0.14213148
## V7          0.10991985
## V8         -0.08405687
## V9         -0.01075028
## V10         0.09230576
## duplicate1  3.80068234
## duplicate2  1.87721959

C

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

The conditional importances show a broadly similar pattern to the traditional random forest model: V1 and V4 remain the two most important predictors, followed by V2 and V5, while the uninformative predictors V6–V10 stay near zero. The conditional scores for the top predictors are somewhat smaller than the traditional random forest scores.

library(party)

# Columns 1:11 are V1-V10 and y; the duplicate predictors are excluded here
model4 <- cforest(y ~ ., data = simulated[, c(1:11)])

rfImp4 <- varimp(model4, conditional = TRUE)

rfImp4
##          V1          V2          V3          V4          V5          V6 
##  6.35767602  5.33898936  0.14934113  6.13948255  1.38546210 -0.22927142 
##          V7          V8          V9         V10 
## -0.12173581 -0.18135897 -0.02920453 -0.20105653
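For reference, the traditional (unconditional) permutation importance from the same cforest fit can be obtained by leaving the conditional argument at its default of FALSE; a minimal sketch:

# Traditional permutation importance from the same conditional-inference forest
varimp(model4, conditional = FALSE)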

D

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted trees

library(gbm)

gbm_model <- gbm(y ~ ., data = simulated, distribution = 'gaussian')
summary(gbm_model)

##                   var    rel.inf
## V4                 V4 29.8141647
## V2                 V2 24.4292714
## V1                 V1 16.0407649
## V5                 V5 10.6369734
## duplicate1 duplicate1  9.8929397
## V3                 V3  8.2372162
## duplicate2 duplicate2  0.4203948
## V6                 V6  0.3658364
## V8                 V8  0.1624385
## V7                 V7  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000

The same general pattern holds for the boosted model: V4, V2, V1, V5, and duplicate1 account for essentially all of the relative influence, while V6–V10 and duplicate2 contribute little or nothing. The main difference from the earlier models is that the importance is even more concentrated in the top predictors, but the pattern is essentially the same. Next, Cubist, to see whether the pattern persists there too.

Cubist

library(Cubist)

# Column 11 is y; all remaining columns (including the duplicates) are predictors
cubist_model <- cubist(simulated[, -11], simulated[, 11])
varImp(cubist_model)
##            Overall
## V1              50
## V2              50
## V4              50
## V5              50
## duplicate1      50
## V3               0
## V6               0
## V7               0
## V8               0
## V9               0
## V10              0
## duplicate2       0

With Cubist the pattern changes slightly: V1, V2, V4, V5, and duplicate1 are the only predictors the model uses (each with an importance of 50), while the remaining predictors, including V3 and duplicate2, are not used at all.
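The scores above reflect how often each predictor is used by the Cubist rule conditions and linear models; the underlying usage table can be inspected directly (a small sketch using the cubist_model object fit above):

# Per-predictor usage (%) in rule conditions and in the linear models
cubist_model$usage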

8.2

Use a simulation to show tree bias with different granularities

set.seed(624)

a <- sample(1:10 / 10, 500, replace = TRUE)
b <- sample(1:100 / 100, 500, replace = TRUE)
c <- sample(1:1000 / 1000, 500, replace = TRUE)
d <- sample(1:10000 / 10000, 500, replace = TRUE)
e <- sample(1:100000 / 100000, 500, replace = TRUE)

y <- a + b + c + d + e

simData <- data.frame(a,b,c,d,e,y) 

library(rpart)
library(partykit)

rpartTree <- rpart(y ~ ., data = simData)
plot(as.party(rpartTree), gp = gpar(fontsize = 7))

varImp(rpartTree)
##    Overall
## a 1.702702
## b 4.769899
## c 3.551946
## d 3.205420
## e 3.957698

The predictor a, which has the coarsest granularity (only 10 distinct values), receives the lowest importance score, while the more finely grained predictors b through e all score noticeably higher, even though every predictor contributes equally to y. This illustrates the selection bias of regression trees toward predictors with more distinct values (finer granularity).
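As an additional check (a small sketch using the fitted rpart object above), one can count how often each predictor is chosen as a splitting variable; the coarser-grained predictors tend to be selected less frequently:

# Count how many times each predictor is used for a split in the fitted tree
splits <- rpartTree$frame$var
table(droplevels(splits[splits != "<leaf>"]))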

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

knitr::include_graphics('downloadIM.png')

A

Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The bagging fraction is the fraction of the training data randomly sampled at each iteration, and the learning rate is the fraction of the current iteration's predicted value that is added to the previous iteration's predicted value. The model on the left uses small values of both (0.1): each iteration sees only a small subset of the data and takes small steps, so the fit is built up over many iterations and many different predictors end up contributing, spreading the importance. The model on the right uses large values (0.9): each iteration uses most of the training data and learns quickly, so the strongest predictors dominate the fit early and the importance concentrates on just the first few of them. The right-hand model is also more likely to overfit.

B

Which model do you think would be more predictive of other samples?

The model on the left would likely be more predictive of other samples. Its smaller learning rate and bagging fraction act as regularization, spreading the contribution across more predictors rather than relying heavily on a few, so it should generalize better.

C

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Interaction depth is the maximum number of splits performed in each tree (the maximum tree depth). As the interaction depth increases, each tree uses more predictors, so the less dominant but still informative predictors contribute more to the ensemble. The importance would therefore be spread across more predictors, and the slope of the predictor-importance plot would likely become flatter (less steep) for both models.
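To make the preceding discussion concrete, here is a sketch (not run) of how the two extreme settings could be fit on the solubility data. It assumes the solTrainXtrans and solTrainY objects from the AppliedPredictiveModeling package; the values of n.trees and interaction.depth are illustrative assumptions, not the exact settings behind Fig. 8.24.

library(AppliedPredictiveModeling)
library(gbm)
data(solubility)

solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

fit_gbm <- function(shrinkage, bag.fraction) {
  gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
      n.trees = 500, interaction.depth = 7,
      shrinkage = shrinkage, bag.fraction = bag.fraction)
}

gbm_left  <- fit_gbm(shrinkage = 0.1, bag.fraction = 0.1)  # left-hand plot settings
gbm_right <- fit_gbm(shrinkage = 0.9, bag.fraction = 0.9)  # right-hand plot settings

# Relative influence: the 0.9/0.9 fit should concentrate on a few predictors
head(summary(gbm_left,  plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)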

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)

data(ChemicalManufacturingProcess)

# Impute missing predictor values (column 1 is Yield, which is excluded)
cmp_impute <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('bagImpute'))
cmp <- predict(cmp_impute, ChemicalManufacturingProcess[,-c(1)])

# Splitting the data into training and test datasets
set.seed(3559)
train_r <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
X_train <- cmp[train_r,]
y_train <- ChemicalManufacturingProcess$Yield[train_r]
X_test <- cmp[-train_r,]
y_test <- ChemicalManufacturingProcess$Yield[-train_r]

Single Tree

set.seed(2330)
rpart_m <- train(x = X_train, y = y_train, method = "rpart", tuneLength = 10, control = rpart.control(maxdepth = 2))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
rpart_m
## CART 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   cp           RMSE      Rsquared   MAE     
##   0.008448771  1.531011  0.3633249  1.192349
##   0.026985533  1.531011  0.3633249  1.192349
##   0.028078955  1.531011  0.3633249  1.192349
##   0.028155890  1.531011  0.3633249  1.192349
##   0.031178734  1.531011  0.3633249  1.192349
##   0.041325211  1.537862  0.3582518  1.197905
##   0.053936668  1.536746  0.3578463  1.196370
##   0.071685404  1.534111  0.3596184  1.191464
##   0.077778984  1.540440  0.3524684  1.194357
##   0.402184913  1.708874  0.3075143  1.361557
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.03117873.

Random Forest

set.seed(2044)
randomf_m <- train(X_train, y_train, method='rf', tuneLength = 10)
randomf_m
## Random Forest 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    2    1.357513  0.5865459  1.0841770
##    8    1.261666  0.6180566  0.9930581
##   14    1.246031  0.6195084  0.9752118
##   20    1.248092  0.6098999  0.9704285
##   26    1.250049  0.6023135  0.9667885
##   32    1.256488  0.5933266  0.9700237
##   38    1.263279  0.5860354  0.9709287
##   44    1.267245  0.5806719  0.9721596
##   50    1.281082  0.5697832  0.9806418
##   57    1.285406  0.5648538  0.9830251
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 14.

Gradient Boosting

set.seed(19828)
grid <- expand.grid(n.trees=c(50, 100, 150, 200), 
                    interaction.depth=c(1, 5, 10, 15), 
                    shrinkage=c(0.01, 0.1, 0.5), 
                    n.minobsinnode=c(5, 10, 15))
gbm_m <- train(x = X_train, y = y_train, method = 'gbm', tuneGrid = grid, verbose = FALSE)
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## (warning repeated for each resampled gbm fit; remaining repetitions omitted)
gbm_m$bestTune
##    n.trees interaction.depth shrinkage n.minobsinnode
## 76     200                10       0.1              5

Cubist

set.seed(100)
cubistTuned <- train(X_train, y_train, 
                     method = "cubist")

# Note: these metrics are computed on the training data, so they are optimistic
cubistPred <- predict(cubistTuned, X_train)

postResample(cubistPred, y_train)
##      RMSE  Rsquared       MAE 
## 0.1778292 0.9924303 0.1350261

A

Which tree-based regression model gives the optimal resampling and test set performance?

summary(resamples(list(Single_True = rpart_m, Random_Forest = randomf_m, Gradient_Boosting=gbm_m, Cubist = cubistTuned)))
## 
## Call:
## summary.resamples(object = resamples(list(Single_True = rpart_m,
##  Random_Forest = randomf_m, Gradient_Boosting = gbm_m, Cubist = cubistTuned)))
## 
## Models: Single_True, Random_Forest, Gradient_Boosting, Cubist 
## Number of resamples: 25 
## 
## MAE 
##                        Min.   1st Qu.    Median      Mean   3rd Qu.     Max.
## Single_True       0.9444254 1.1065223 1.1844621 1.1923487 1.2964423 1.400361
## Random_Forest     0.6937116 0.9214488 0.9832082 0.9752118 1.0181030 1.215950
## Gradient_Boosting 0.7354389 0.8622051 0.9133852 0.9176087 0.9611315 1.120374
## Cubist            0.6964094 0.7935556 0.8785461 0.8803166 0.9345770 1.090198
##                   NA's
## Single_True          0
## Random_Forest        0
## Gradient_Boosting    0
## Cubist               0
## 
## RMSE 
##                        Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## Single_True       1.2232769 1.404920 1.554384 1.531011 1.627351 1.797655    0
## Random_Forest     0.8990020 1.142196 1.263823 1.246031 1.340165 1.592363    0
## Gradient_Boosting 0.9808989 1.110621 1.168824 1.177305 1.220824 1.409576    0
## Cubist            0.8324293 1.075671 1.150489 1.157567 1.245419 1.492862    0
## 
## Rsquared 
##                        Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## Single_True       0.2166224 0.2787529 0.3634825 0.3633249 0.4349596 0.6043007
## Random_Forest     0.4117737 0.5772708 0.6265109 0.6195084 0.6742267 0.7068351
## Gradient_Boosting 0.5131900 0.5714078 0.6173689 0.6193288 0.6596906 0.7327710
## Cubist            0.4128364 0.5707274 0.6326029 0.6166150 0.6830906 0.7875603
##                   NA's
## Single_True          0
## Random_Forest        0
## Gradient_Boosting    0
## Cubist               0

Based on the resampling results, Cubist performs best: it has the lowest mean RMSE and MAE, with Gradient Boosting and Random Forest close behind, and the single tree clearly worst.

Now evaluate each model on the held-out test set.

test_perf <- function(models, testData, testTarget) {
  method <- c()
  res <- data.frame()
  for(model in models){
    method <- c(method, model$method)
    pred <- predict(model, newdata=testData)
    res <- rbind(res, t(postResample(pred=pred, obs=testTarget)))
  }
  row.names(res) <- method
  return(res)
}

models <- list(rpart_m, randomf_m, gbm_m, cubistTuned)

performance <- test_perf(models, X_test, y_test)
performance
##             RMSE  Rsquared       MAE
## rpart  1.3408087 0.3804309 1.1170494
## rf     0.9768228 0.6534475 0.7017081
## gbm    0.9865217 0.6505681 0.6990646
## cubist 0.9215937 0.7004253 0.7143488

On the test set, Random Forest performed slightly better than Gradient Boosting, and its performance is consistent between the resampling and test-set results, so I would choose Random Forest. (Cubist has the lowest test RMSE, but Random Forest holds up well across both evaluations.)

B

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

varImp(randomf_m)
## rf variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess13  39.661
## BiologicalMaterial03    35.566
## BiologicalMaterial12    34.529
## BiologicalMaterial06    28.473
## ManufacturingProcess09  26.066
## ManufacturingProcess17  23.584
## ManufacturingProcess28  21.755
## ManufacturingProcess06  18.692
## ManufacturingProcess31  16.636
## BiologicalMaterial02    15.598
## BiologicalMaterial11    15.042
## BiologicalMaterial04    12.363
## BiologicalMaterial09    10.890
## ManufacturingProcess36  10.506
## BiologicalMaterial08     9.648
## BiologicalMaterial05     8.754
## BiologicalMaterial01     8.183
## ManufacturingProcess11   7.465
## ManufacturingProcess27   6.937

ManufacturingProcess32 is again by far the most important predictor, which is consistent with the optimal linear and nonlinear models fit previously. Manufacturing process variables occupy 7 of the top 10 spots, but several biological materials also rank highly, so the process variables lead the list without completely dominating it; both types contribute to the model.
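To compare the top 10 predictors directly across model types, a small sketch along the following lines could be used; here pls_m and svm_m are hypothetical names standing in for the optimal linear and nonlinear models fit in Exercises 6.3 and 7.5.

top10 <- function(model) {
  # Extract the caret importance table and return the 10 highest-ranked predictors
  imp <- varImp(model)$importance
  rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
}

data.frame(RandomForest = top10(randomf_m),
           Linear       = top10(pls_m),
           Nonlinear    = top10(svm_m))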

C

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

rpart_m$finalModel
## n= 144 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 144 510.06710 40.17806  
##   2) ManufacturingProcess32< 159.5 81 137.14040 39.12543  
##     4) BiologicalMaterial11< 145.47 44  50.63688 38.50932 *
##     5) BiologicalMaterial11>=145.47 37  49.93917 39.85811 *
##   3) ManufacturingProcess32>=159.5 63 167.78540 41.53143  
##     6) ManufacturingProcess06< 208.1 37  81.27967 40.86622 *
##     7) ManufacturingProcess06>=208.1 26  46.83320 42.47808 *
plot(rpart_m$finalModel)
text(rpart_m$finalModel)

This view does add some information: the first split is on ManufacturingProcess32, and the branch with ManufacturingProcess32 >= 159.5 has a noticeably higher mean yield (about 41.5 versus 39.1). Within the low branch, higher BiologicalMaterial11 is associated with higher yield, and within the high branch, ManufacturingProcess06 >= 208.1 identifies the highest-yield node (mean about 42.5). So a process variable drives the main separation in yield, with a biological material refining the low-yield branch, which is consistent with the variable importance results above.
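The base plot above shows only the splits. To also display the distribution of yield in each terminal node, one option is the partykit plot already used in Exercise 8.2; the sketch below refits the selected tree (same maxdepth, with the tuned cp) with model = TRUE so the model frame is available for conversion.

library(partykit)

# Refit the chosen tree so as.party() can recover the data, then plot it;
# the terminal nodes are drawn as boxplots of yield
final_tree <- rpart(y ~ ., data = data.frame(X_train, y = y_train),
                    control = rpart.control(maxdepth = 2, cp = rpart_m$bestTune$cp),
                    model = TRUE)
plot(as.party(final_tree), gp = gpar(fontsize = 7))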