Load packages

library(tidyverse)
library(imputeTS)
library(lubridate)
library(caret)
library(mlbench)
library(randomForest)
library(party)
library(partykit)
library(rJava)
library(RWeka)
library(rpart)
library(gbm)
library(xgboost)
library(Cubist)
library(rpart.plot)
library(AppliedPredictiveModeling)

Homework 9

Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.

Question 8.1

Recreate the simulated data from Exercise 7.2:

Start R and use these commands to load the data

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
  (a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

The random forest model does not make meaningful use of predictors V6-V10, as shown by their near-zero (and in some cases negative) importance scores.

  (b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

The importance score for V1 did decrease. As more predictors that are highly correlated with V1 are added, V1's importance continues to drop, because the splits (and the importance credit) are shared among the correlated predictors.

model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
##                Overall
## V1          4.91687329
## V2          6.52816504
## V3          0.58711552
## V4          7.04870917
## V5          2.03115561
## V6          0.14213148
## V7          0.10991985
## V8         -0.08405687
## V9         -0.01075028
## V10         0.09230576
## duplicate1  3.80068234
## duplicate2  1.87721959
  (c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

The importance values from the cforest model are not identical to those from the traditional random forest, but they show a similar overall pattern: V1, V2, V4, V5, and the duplicates carry most of the importance, while V6-V10 remain near zero. (A sketch of the conditional version of varimp follows the output below.)

cforestmodel <- cforest(y ~ ., data = simulated)
varimp(cforestmodel)
##          V1          V2          V3          V4          V5          V6 
##  5.42152778  5.79075433 -0.01425384  5.96294311  1.87652947 -0.06822257 
##          V7          V8          V9         V10  duplicate1  duplicate2 
##  0.07696942 -0.17067028  0.02155362 -0.13376134  5.27935126  2.92224516
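For completeness, the conditional permutation importance of Strobl et al. (2007) can be requested through the conditional argument (a quick sketch; this version is slower and its values will differ from the default importances shown above):

# Conditional permutation importance; conditional = TRUE adjusts for correlated predictors
set.seed(200)
varimp(cforestmodel, conditional = TRUE)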
  (d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

The same general pattern occurs: V6-V10 are consistently low in variable importance across all of the models. There are differences between the models, however. The rpart model also assigns low importance to V3, while the Cubist model gives V6 a noticeably higher importance (about 14 on caret's 0-100 scale) than V7-V10, which are essentially zero. With the highly correlated duplicates still in the data, V4 ranks highest in the rpart, gbm, and xgboost models, whereas Cubist ranks V2 first. The correlated predictors also split the credit unevenly: in the gbm and xgboost models, duplicate1 actually edges out V1 (roughly 59 versus 40 for xgboost) while duplicate2 drops to nearly zero, whereas Cubist spreads the credit more evenly, giving V1 about 78 and both duplicates about 36.

rpartmodel <- rpart(y ~ ., data = simulated)
varImp(rpartmodel)
##              Overall
## duplicate1 1.6165532
## duplicate2 0.3521296
## V1         1.5149230
## V10        0.5254491
## V2         2.0690787
## V3         0.5805123
## V4         2.9099218
## V5         2.3438469
## V6         0.3027516
## V7         0.5113927
## V9         0.2054104
## V8         0.0000000
gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 100)
# Store the relative influence under a name that does not mask caret::varImp()
gbmImp <- summary(gbmModel, n.trees = 100)

print(gbmImp)
##                   var    rel.inf
## V4                 V4 29.0834679
## V2                 V2 23.2667156
## duplicate1 duplicate1 15.6028982
## V1                 V1 12.9259358
## V5                 V5  9.9657720
## V3                 V3  8.6571358
## V9                 V9  0.3081622
## V6                 V6  0.1899125
## V7                 V7  0.0000000
## V8                 V8  0.0000000
## V10               V10  0.0000000
## duplicate2 duplicate2  0.0000000
# xgbTree emits deprecation warnings from the underlying xgboost package;
# wrap the call so that output does not clutter the document
invisible(capture.output({
  xgbModel <- train(y ~ ., data = simulated, method = "xgbTree")
}))
varImp(xgbModel)
## xgbTree variable importance
## 
##              Overall
## V4         100.00000
## V2          76.40826
## duplicate1  59.28970
## V1          40.06618
## V5          31.60700
## V3          30.13565
## V6           1.38945
## duplicate2   1.03944
## V10          0.51353
## V7           0.06086
## V9           0.05165
## V8           0.00000
cubistModel <- train(y ~ ., data = simulated, method = "cubist")
varImp(cubistModel)
## cubist variable importance
## 
##            Overall
## V2         100.000
## V1          77.698
## V4          71.942
## V5          54.676
## V3          46.043
## duplicate1  35.971
## duplicate2  35.971
## V6          14.388
## V8           4.317
## V10          0.000
## V9           0.000
## V7           0.000

Question 8.2

Use a simulation to show tree bias with different granularities.

Applied Predictive Modeling notes that individual trees suffer from selection bias, in which “predictors with a higher number of distinct values are favored over more granular predictors”. In other words, a predictor that offers many distinct values (and therefore many candidate split points) can be chosen over a coarser predictor with few distinct values, even when the coarser predictor is more strongly related to the outcome.

To simulate this, I create a data set of 200 outcome values y by sampling integers from 1 to 10. I then add a column, duplicatey, that is highly correlated with y but rounded to the nearest ten, so it has very few distinct values. Another column, distinct1, is uncorrelated with y but consists of many distinct randomly generated values. Fitting rpart() and inspecting the variable importance shows that duplicatey is the more important predictor, yet the tree still splits on distinct1 despite it being uncorrelated with y.

In the second simulation, duplicatey is left unrounded. The resulting tree no longer splits on distinct1 at all, and the importance of duplicatey increases much more than that of distinct1; this version of duplicatey is both more highly correlated with y and has many more distinct values. A final comparison rounds the uncorrelated distinct1 as well, which reduces its number of distinct values and, as expected, further decreases its importance in the rpart() model. (A quick check of the distinct-value counts appears after the third simulation below.)

set.seed(200)
simulatedData <- data.frame()
for(i in (1:200)){
  simulatedData <- rbind(sample(1:10,1, replace = TRUE),simulatedData)
}
colname <- colnames(simulatedData)
simulatedData <- simulatedData %>% rename(y = all_of(colname))
simulatedData$duplicatey <- round(simulatedData$y + rnorm(200),-1)
cor(simulatedData$duplicatey, simulatedData$y)
## [1] 0.8185743
simulatedData$distinct1 <- rnorm(200)
cor(simulatedData$distinct1, simulatedData$y)
## [1] -0.04767912
rpartModelsim <- rpart(y ~ ., data = simulatedData)
varImp(rpartModelsim)
##               Overall
## distinct1  0.07705602
## duplicatey 0.67006384
rpart.plot(rpartModelsim)

simulatedData <- data.frame()
for(i in (1:200)){
  simulatedData <- rbind(sample(1:10,1, replace = TRUE),simulatedData)
}
colname <- colnames(simulatedData)
simulatedData <- simulatedData %>% rename(y = all_of(colname))
simulatedData$duplicatey <- simulatedData$y + rnorm(200)
cor(simulatedData$duplicatey, simulatedData$y)
## [1] 0.928973
simulatedData$distinct1 <- rnorm(200)
cor(simulatedData$distinct1, simulatedData$y)
## [1] 0.05930572
rpartModelsim <- rpart(y ~ ., data = simulatedData)
varImp(rpartModelsim)
##             Overall
## distinct1  0.143018
## duplicatey 2.024346
rpart.plot(rpartModelsim)

simulatedData <- data.frame()
for(i in (1:200)){
  simulatedData <- rbind(sample(1:10,1, replace = TRUE),simulatedData)
}
colname <- colnames(simulatedData)
simulatedData <- simulatedData %>% rename(y = all_of(colname))
simulatedData$duplicatey <- simulatedData$y + rnorm(200)
cor(simulatedData$duplicatey, simulatedData$y)
## [1] 0.9467756
simulatedData$distinct1 <- round(rnorm(200))
cor(simulatedData$distinct1, simulatedData$y)
## [1] -0.0233493
rpartModelsim <- rpart(y ~ ., data = simulatedData)
varImp(rpartModelsim)
##             Overall
## distinct1  0.122513
## duplicatey 2.543658
rpart.plot(rpartModelsim)
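As a quick check on granularity (a small helper, not part of the original write-up), the number of distinct values in each column of the final simulated data set can be counted directly; the unrounded duplicatey has many distinct values while the rounded distinct1 has only a handful:

# Distinct values per column (n_distinct comes from dplyr, loaded via tidyverse)
sapply(simulatedData, n_distinct)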

Question 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

  (a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

Gradient boosting is an iterative process, and the learning rate (shrinkage) controls the fraction of each new tree's prediction that is added to the current model. Small values slow learning down so that no single tree, and therefore no single dominant predictor, drives the fit. A learning rate of 0.9 adds 90% of each tree's contribution, which gives little protection against repeatedly splitting on the same dominant predictors. The bagging fraction is the fraction of the training data used to build each tree. With a bagging fraction of 0.9, each iteration sees 90% of the training data, so the trees are very similar and again tend to select the same dominant predictors. With both parameters at 0.1, each tree contributes only a little and is fit to a small, different subsample, so importance is spread across many more predictors.
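To make the two extremes concrete, the settings behind Fig. 8.24 can be sketched with gbm on the solubility data (a rough sketch only, assuming the solubility objects from AppliedPredictiveModeling; the book's exact tuning is not reproduced here):

data(solubility)
solAll <- data.frame(solTrainXtrans, Solubility = solTrainY)
# Left-hand panel: low learning rate and low bagging fraction
gbmLeft <- gbm(Solubility ~ ., data = solAll, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
# Right-hand panel: high learning rate and high bagging fraction
gbmRight <- gbm(Solubility ~ ., data = solAll, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbmLeft, plotit = FALSE), 10)   # influence spread across many predictors
head(summary(gbmRight, plotit = FALSE), 10)  # influence concentrated on a few predictors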

  (b) Which model do you think would be more predictive of other samples?

Since the learning rate (shrinkage) was introduced to guard against overfitting, I would expect the model with the 0.1 learning rate to be more predictive of other samples. The 0.1 bagging fraction also helps prevent overfitting, since only a small fraction of the training data is used at each step, which should help the model generalize to samples that do not closely resemble the training data.

  (c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing interaction depth (tree depth) would likely flatten the slope of predictor importance for either model. A shallow tree makes few splits, and those splits fall on the most dominant predictors, which concentrates importance and steepens the slope; deeper trees make many more splits, so additional, less dominant predictors get used and accumulate importance, spreading it more evenly.
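A minimal way to check this (assuming the solAll data frame from the sketch above) is to refit the right-hand settings with a much larger interaction.depth and see whether the relative influence falls off more gradually:

gbmDeep <- gbm(Solubility ~ ., data = solAll, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
               interaction.depth = 10)
head(summary(gbmDeep, plotit = FALSE), 10)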

Question 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

Load the Data:

data(ChemicalManufacturingProcess)
set.seed(63)

KNN imputation was used to fill in the missing values:

set.seed(63)
ChemicalManufacturingProcess_preProc <- 
  preProcess(ChemicalManufacturingProcess, 
             method = "knnImpute")
transformed_ChemMan <- 
  predict(ChemicalManufacturingProcess_preProc, 
          newdata = ChemicalManufacturingProcess)
df <- as.data.frame(transformed_ChemMan$Yield) %>% 
  rename(Yield = `transformed_ChemMan$Yield`) 
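A quick sanity check (not part of the original steps) confirms that the imputation removed all missing values:

sum(is.na(ChemicalManufacturingProcess))  # missing values before imputation
sum(is.na(transformed_ChemMan))           # should be 0 after KNN imputation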

Split the data into a training and a test set, pre-process the data:

set.seed(63)
smp_size <- floor(0.80 * nrow(ChemicalManufacturingProcess))
trainingDataindex <- 
  sample(seq_len(nrow(df)), 
         size = smp_size)

trainY <- df[trainingDataindex,]
testY <- df[-trainingDataindex,]
trainX <- 
  transformed_ChemMan[trainingDataindex,] %>% 
  select(-Yield,-BiologicalMaterial07)
testX <- 
  transformed_ChemMan[-trainingDataindex,] %>% 
  select(-Yield,-BiologicalMaterial07)
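As a brief check on the 80/20 split (a helper not in the original write-up):

c(train = length(trainY), test = length(testY))  # 140 training and 36 test observations
dim(trainX)  # 140 rows x 56 predictors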

Train several tree-based models:

set.seed(63)
rpartTune <- train(x = trainX,
                   y = trainY,
                   method = "rpart2",
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))
rpartTune
## CART 
## 
## 140 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 126, 126, 126, 126, 127, 125, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  RMSE       Rsquared   MAE      
##    1        0.8246060  0.3990210  0.6385586
##    2        0.8585395  0.3294589  0.6737085
##    3        0.8478948  0.3358086  0.6692808
##    4        0.8429965  0.3509759  0.6484788
##    5        0.7995522  0.4085217  0.6155064
##    6        0.7963984  0.4224709  0.6041594
##    7        0.7969954  0.4292998  0.5960959
##    8        0.8053248  0.4181225  0.5992211
##    9        0.8031675  0.4215811  0.6101907
##   10        0.8028446  0.4226631  0.6068626
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 6.
# Note: randomForest's argument is `ntree`; the misspelled `ntrees` below is ignored,
# so the default of 500 trees is used (confirmed in the printed output).
rfModel <- randomForest(trainX, trainY,
                        importance = TRUE,
                        mtry = 3,
                        ntrees = 1000)
rfModel
## 
## Call:
##  randomForest(x = trainX, y = trainY, mtry = 3, importance = TRUE,      ntrees = 1000) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.4607105
##                     % Var explained: 55.76
gbmTune <- train(trainX, trainY,
                 method = "gbm",
                 tuneLength = 10,
                 trControl = trainControl(method = "cv"),
                 verbose = FALSE)
gbmTune
## Stochastic Gradient Boosting 
## 
## 140 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 125, 126, 126, 127, 128, 126, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE       Rsquared   MAE      
##    1                  50      0.6619548  0.6347425  0.5207302
##    1                 100      0.6553629  0.6453852  0.5216436
##    1                 150      0.6560975  0.6508306  0.5313237
##    1                 200      0.6440480  0.6594225  0.5244150
##    1                 250      0.6528182  0.6568716  0.5305933
##    1                 300      0.6506265  0.6573907  0.5286091
##    1                 350      0.6481291  0.6615171  0.5266143
##    1                 400      0.6516427  0.6570397  0.5279734
##    1                 450      0.6568860  0.6540144  0.5349976
##    1                 500      0.6523359  0.6611039  0.5319127
##    2                  50      0.6464292  0.6481027  0.5116357
##    2                 100      0.6426869  0.6478847  0.5085874
##    2                 150      0.6468147  0.6488001  0.5172338
##    2                 200      0.6549768  0.6451463  0.5238394
##    2                 250      0.6519943  0.6470021  0.5184311
##    2                 300      0.6531169  0.6470905  0.5186146
##    2                 350      0.6506190  0.6488732  0.5176903
##    2                 400      0.6488769  0.6499406  0.5178891
##    2                 450      0.6470029  0.6526123  0.5150915
##    2                 500      0.6470525  0.6523934  0.5156345
##    3                  50      0.6381306  0.6432709  0.5035347
##    3                 100      0.6251244  0.6551793  0.4984719
##    3                 150      0.6106679  0.6694123  0.4852842
##    3                 200      0.6090971  0.6715847  0.4832230
##    3                 250      0.6088921  0.6701510  0.4819457
##    3                 300      0.6083596  0.6702040  0.4818891
##    3                 350      0.6089424  0.6703035  0.4815626
##    3                 400      0.6082414  0.6712286  0.4817171
##    3                 450      0.6070799  0.6723708  0.4814129
##    3                 500      0.6064531  0.6725437  0.4811732
##    4                  50      0.6214456  0.6760762  0.4914931
##    4                 100      0.6241328  0.6686813  0.4983421
##    4                 150      0.6203860  0.6767502  0.4971338
##    4                 200      0.6198565  0.6783092  0.4925379
##    4                 250      0.6163597  0.6802872  0.4888787
##    4                 300      0.6145497  0.6828694  0.4873391
##    4                 350      0.6137432  0.6834633  0.4862273
##    4                 400      0.6138293  0.6829844  0.4865868
##    4                 450      0.6134375  0.6834863  0.4859357
##    4                 500      0.6130335  0.6840680  0.4854928
##    5                  50      0.6116931  0.6811965  0.4968391
##    5                 100      0.5979165  0.6877146  0.4907408
##    5                 150      0.5972286  0.6928620  0.4869924
##    5                 200      0.5940537  0.6973794  0.4848480
##    5                 250      0.5948442  0.6953157  0.4846855
##    5                 300      0.5943929  0.6941034  0.4845107
##    5                 350      0.5933019  0.6947074  0.4824759
##    5                 400      0.5934477  0.6948879  0.4834339
##    5                 450      0.5927216  0.6952168  0.4831185
##    5                 500      0.5926664  0.6952811  0.4830603
##    6                  50      0.6174232  0.6728348  0.4913589
##    6                 100      0.6119428  0.6840052  0.4920060
##    6                 150      0.6048565  0.6905264  0.4895501
##    6                 200      0.6032738  0.6934546  0.4907144
##    6                 250      0.6005979  0.6960954  0.4881325
##    6                 300      0.5988501  0.6986023  0.4873794
##    6                 350      0.5978103  0.6995812  0.4868447
##    6                 400      0.5980040  0.6998468  0.4872227
##    6                 450      0.5975675  0.7001093  0.4865538
##    6                 500      0.5967565  0.7011452  0.4862442
##    7                  50      0.6416382  0.6577286  0.5216524
##    7                 100      0.6264580  0.6817679  0.5123127
##    7                 150      0.6277958  0.6784691  0.5140732
##    7                 200      0.6302152  0.6756910  0.5160497
##    7                 250      0.6295272  0.6772938  0.5135817
##    7                 300      0.6284829  0.6794963  0.5123717
##    7                 350      0.6290959  0.6793974  0.5124335
##    7                 400      0.6294118  0.6790561  0.5122755
##    7                 450      0.6290627  0.6794954  0.5121803
##    7                 500      0.6291371  0.6797166  0.5125577
##    8                  50      0.6847926  0.6168126  0.5448827
##    8                 100      0.6758213  0.6227684  0.5232168
##    8                 150      0.6710163  0.6285341  0.5202417
##    8                 200      0.6682207  0.6335803  0.5204282
##    8                 250      0.6665798  0.6362466  0.5219160
##    8                 300      0.6658941  0.6369302  0.5226856
##    8                 350      0.6658490  0.6371099  0.5239016
##    8                 400      0.6661013  0.6374771  0.5245621
##    8                 450      0.6657955  0.6376807  0.5246843
##    8                 500      0.6663121  0.6377156  0.5250454
##    9                  50      0.6311070  0.6628740  0.5054270
##    9                 100      0.6210939  0.6779088  0.4958137
##    9                 150      0.6204869  0.6771817  0.4969351
##    9                 200      0.6169604  0.6812653  0.4939517
##    9                 250      0.6121686  0.6855438  0.4892885
##    9                 300      0.6123702  0.6848747  0.4900230
##    9                 350      0.6118291  0.6865609  0.4909422
##    9                 400      0.6111679  0.6869801  0.4910799
##    9                 450      0.6100228  0.6881993  0.4905098
##    9                 500      0.6102115  0.6880007  0.4908984
##   10                  50      0.6460078  0.6526971  0.5061813
##   10                 100      0.6161771  0.6890507  0.4863725
##   10                 150      0.6143909  0.6915696  0.4825131
##   10                 200      0.6044788  0.6998403  0.4798898
##   10                 250      0.6004079  0.7028617  0.4774439
##   10                 300      0.5990039  0.7048271  0.4765001
##   10                 350      0.5959254  0.7080689  0.4739420
##   10                 400      0.5945544  0.7096971  0.4730968
##   10                 450      0.5934899  0.7105101  0.4723497
##   10                 500      0.5932616  0.7105859  0.4727235
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 500, interaction.depth =
##  5, shrinkage = 0.1 and n.minobsinnode = 10.
cubistTuned <- train(trainX,trainY, 
                     method = "cubist",
                     tuneLength = 10,
                     trControl = trainControl(method = "cv"))
cubistTuned
## Cubist 
## 
## 140 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 127, 128, 124, 126, 127, 124, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE       Rsquared   MAE      
##    1          0          0.6862313  0.5693033  0.5569406
##    1          5          0.6059314  0.6686309  0.4822351
##    1          9          0.6355792  0.6336424  0.5019384
##   10          0          0.6393939  0.6210108  0.4979464
##   10          5          0.5783206  0.6952543  0.4433245
##   10          9          0.6075965  0.6641991  0.4581421
##   20          0          0.6187116  0.6478953  0.4842196
##   20          5          0.5576969  0.7201651  0.4286916
##   20          9          0.5845153  0.6942945  0.4428382
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
  (a) Which tree-based regression model gives the optimal resampling and test set performance?

The Cubist model gives the best resampling and test set performance: its cross-validated RMSE (0.56) and R-squared (0.72) are the best among the tuned models, and it also has the best test set RMSE (0.48), R-squared (0.71), and MAE (0.40).

set.seed(63)
rpartPred <- predict(rpartTune, newdata = testX)
postResample(pred = rpartPred, obs = testY)
##      RMSE  Rsquared       MAE 
## 0.7582571 0.4553510 0.6158285
rfPred <- predict(rfModel, newdata = testX)
postResample(pred = rfPred, obs = testY)
##      RMSE  Rsquared       MAE 
## 0.6128878 0.6099963 0.4976785
gbmPred <- predict(gbmTune, newdata = testX)
postResample(pred = gbmPred, obs = testY)
##      RMSE  Rsquared       MAE 
## 0.5395483 0.6381464 0.4225340
cubistPred <- predict(cubistTuned, newdata = testX)
postResample(pred = cubistPred, obs = testY)
##      RMSE  Rsquared       MAE 
## 0.4823869 0.7134893 0.3954121
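For easier side-by-side comparison, the four sets of test-set metrics can be stacked into one table (a small helper, not part of the original):

rbind(CART   = postResample(rpartPred,  testY),
      RF     = postResample(rfPred,     testY),
      GBM    = postResample(gbmPred,    testY),
      Cubist = postResample(cubistPred, testY))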
  (b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Manufacturing process variables dominate the list of important predictors for the Cubist model. As in the optimal linear and nonlinear models, ManufacturingProcess32 remains the top predictor. As in the linear model, ManufacturingProcess09 is the second most important predictor, and as in the nonlinear model, ManufacturingProcess17 is third. Across the three models, six predictors appear in all of the top-ten lists: ManufacturingProcess32, ManufacturingProcess09, ManufacturingProcess17, BiologicalMaterial03, BiologicalMaterial06, and BiologicalMaterial02. The Cubist model also puts much more weight on its top two variables; the other models show a steadier decline in importance across their top ten, whereas the Cubist importances drop off steeply (see the importance plot sketched after the table below).

varImp(cubistTuned)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess09  71.910
## ManufacturingProcess17  53.933
## BiologicalMaterial12    34.831
## ManufacturingProcess33  33.708
## ManufacturingProcess04  32.584
## BiologicalMaterial03    31.461
## BiologicalMaterial06    23.596
## BiologicalMaterial08    19.101
## BiologicalMaterial02    17.978
## ManufacturingProcess13  16.854
## ManufacturingProcess10  15.730
## ManufacturingProcess29  15.730
## ManufacturingProcess25  14.607
## ManufacturingProcess39  13.483
## ManufacturingProcess31  11.236
## ManufacturingProcess18  11.236
## ManufacturingProcess20  11.236
## ManufacturingProcess14  10.112
## BiologicalMaterial01     8.989
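The steep drop-off in importance after the top two predictors is easiest to see graphically; caret's plot method for varImp objects can display the top predictors (a sketch):

plot(varImp(cubistTuned), top = 10)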
  (c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Per the tuning above, the optimal single tree was found at maxdepth = 6. In the plotted tree, the first split is on ManufacturingProcess32, which has consistently been the most important predictor in all of our optimal models. Since the overall goal is to increase yield, viewing the yield distribution in the terminal nodes helps highlight which splits matter. All of the terminal nodes under the BiologicalMaterial11 < -0.39 branch already have negative (below-average) yields, so the subsequent splits on BiologicalMaterial05 add little. Further attention can instead focus on ManufacturingProcess18 and ManufacturingProcess09, the splits that separate negative from positive outcomes. Because only the manufacturing processes can be altered, ManufacturingProcess18 is of particular interest: values below 0.02 lead toward positive yields in the model. Similarly, ManufacturingProcess09 splits at -0.47, and values above that threshold all lead to terminal nodes with positive yields. (A boxplot view of the terminal-node yield distributions is sketched after the tree output and plot below.)

rpartTune$finalModel
## n= 140 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 140 145.8064000  0.03415424  
##    2) ManufacturingProcess32< 0.191596 82  46.2908900 -0.50055150  
##      4) BiologicalMaterial11< -0.3896263 37  12.5158600 -0.90821600  
##        8) BiologicalMaterial05>=0.4725039 9   3.6129330 -1.48809900 *
##        9) BiologicalMaterial05< 0.4725039 28   4.9037900 -0.72182510  
##         18) BiologicalMaterial05< 0.01429589 19   2.3237940 -0.88811590 *
##         19) BiologicalMaterial05>=0.01429589 9   0.9454191 -0.37076680 *
##      5) BiologicalMaterial11>=-0.3896263 45  22.5701100 -0.16536070  
##       10) ManufacturingProcess18>=0.01991463 32   8.2799280 -0.44088230  
##         20) ManufacturingProcess27>=0.001387709 24   5.8982750 -0.57831000 *
##         21) ManufacturingProcess27< 0.001387709 8   0.5685590 -0.02859893 *
##       11) ManufacturingProcess18< 0.01991463 13   5.8814530  0.51284610 *
##    3) ManufacturingProcess32>=0.191596 58  42.9249700  0.79011760  
##      6) ManufacturingProcess09< -0.4721252 11   5.1234900 -0.13948520 *
##      7) ManufacturingProcess09>=-0.4721252 47  26.0709500  1.00768400  
##       14) BiologicalMaterial03< 1.102207 36  14.6134900  0.79713410  
##         28) ManufacturingProcess24>=-0.05764121 15   2.9820890  0.38547910 *
##         29) ManufacturingProcess24< -0.05764121 21   7.2738610  1.09117300  
##           58) ManufacturingProcess21>=-0.2387217 13   3.8428740  0.85210230 *
##           59) ManufacturingProcess21< -0.2387217 8   1.4805730  1.47966400 *
##       15) BiologicalMaterial03>=1.102207 11   4.6384960  1.69675700 *
rpart.plot(rpartTune$finalModel)
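To show the distribution of yield (rather than just the node mean) in each terminal node, the rpart tree can also be converted with the already-loaded partykit package (a sketch; yields are on the centered and scaled knnImpute scale):

# Boxplots of the preprocessed yield within each terminal node
plot(as.party(rpartTune$finalModel))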