DATA 624 - HOMEWORK 9
Load Package
Question 8.1
Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
data_x <- simulated$x
data_y <- simulated$y
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
(a)
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
#library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
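For reference, the resulting scores can be listed in descending order (this mirrors the display used for model 2 in part (b) and assumes the tidyverse is loaded in the Load Package chunk):
rfImp1 %>% arrange(desc(Overall))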
Did the random forest model significantly use the uninformative predictors (V6 - V10)?
Answer: No, the RF model did not make significant use of V6 - V10; the variable importance scores show that the top five variables are V1 - V5, while V6 - V10 rank at the bottom.
(b)
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
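A minimal sketch of how such a predictor can be added, following the textbook's construction (the exact noise scale is an assumption, and the correlation printed below depends on the RNG state of the original run):
# add a noisy copy of V1 so the two predictors are highly correlated
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)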
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
Answer: The importance score for V1 changed. When another predictor is highly correlated with V1, the importance of V1 is split between V1 itself and the newly added duplicate. The importance score of V1 in model 1 is almost the same as the sum of the importance scores of V1 and duplicate1 in model 2.
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2 %>% arrange(desc(Overall))
(c)
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
Answer: The importances show a similar pattern between the conditional inference forest (model 3) and the traditional random forest (model 1). The main difference is V3: in the traditional RF model, V3 is one of the top five variables, whereas in the conditional inference forest its importance is very small.
#remove the column duplicate1 from (b)
simulated = subset(simulated, select = -duplicate1)
simulated
model3 <- cforest(y ~ ., data = simulated)
compare1 <- cbind(data.frame(varImp(model3)),
                  VarImp_RF = rfImp1$Overall) %>%
  rownames_to_column() %>%
  rename(VarImp_CF = Overall,
         Variable = rowname)
compare1 %>%
  gather(key = 'Model', value = 'VarImp', -Variable) %>%
  ggplot(aes(x = reorder(Variable, desc(Variable %>% str_remove('V') %>% as.integer())), y = VarImp, fill = Model)) +
  geom_bar(stat = "identity") +
  facet_grid(~Model) +
  coord_flip() +
  ggtitle('Variable Importance Scores Between Conditional Inference Tree Model and Traditional RF Model') +
  xlab('Variable')
(d)
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Answer: A similar pattern occurs for all of the tree models. Although the patterns are not identical, all of them show that tree models are good at identifying the informative variables.
model4 <- gbm(y ~ ., data = simulated, distribution = 'gaussian', n.trees = 1000)
VarImp_GBM <- varImp(model4, numTrees = 1000)
model5 <- cubist(simulated[-11], simulated$y)
VarImp_Cubist <- varImp(model5)
VarImp_GBM %>%
  rownames_to_column() %>%
  arrange(rowname %>% str_remove('V') %>% as.integer()) %>%
  rename(VarImp_GBM = Overall) %>%
  left_join(VarImp_Cubist %>%
              rownames_to_column() %>%
              arrange(rowname %>% str_remove('V') %>% as.integer()) %>%
              rename(VarImp_Cubist = Overall)) %>%
  rename(Variable = rowname) %>%
  left_join(compare1) %>%
  gather(key = 'Model', value = 'VarImp', -Variable) %>%
  ggplot(aes(x = reorder(Variable, desc(Variable %>% str_remove('V') %>% as.integer())), y = VarImp, fill = Model)) +
  geom_bar(stat = "identity") +
  facet_grid(~Model, scales = "free") +
  coord_flip() +
  ggtitle('Variable Importance Scores Across Models') +
  xlab('Variable')
Question 8.2
Use a simulation to show tree bias with different granularities.
Answer: The simulation below shows that the more granular a variable is (i.e., the more distinct values it has), the higher its importance score, which reflects the selection bias of tree models.
Create 10 variables with different granularities
data <- NULL
for(i in 1:10){
data = cbind(data, sample(1:(i^5), 10000, replace = TRUE))
}
data <- data %>%
as.data.frame() %>%
mutate(y = rowSums(data)+ sample(-5:5,1))
str(data)
## 'data.frame': 10000 obs. of 11 variables:
## $ V1 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ V2 : int 24 4 28 8 9 30 30 27 31 17 ...
## $ V3 : int 16 87 88 10 242 225 142 3 2 227 ...
## $ V4 : int 138 158 927 679 547 324 443 667 884 226 ...
## $ V5 : int 3049 1583 231 1054 3120 2506 727 1922 1947 796 ...
## $ V6 : int 3090 5817 6474 3992 6810 1387 7329 6363 1763 6838 ...
## $ V7 : int 6365 3688 4815 15926 14581 1488 14195 7213 10543 6156 ...
## $ V8 : int 31585 5334 13519 10613 31408 4681 14373 21929 7054 9651 ...
## $ V9 : int 42139 40891 47229 30611 45299 20004 12303 58097 1982 19223 ...
## $ V10: int 7114 54193 90438 34194 80444 12361 44669 25885 4217 39986 ...
## $ y : num 93524 111759 163753 97091 182464 ...
Build a regression tree
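A minimal sketch of the fit that produces the printed tree below, assuming rpart with default settings:
library(rpart)
model_bias <- rpart(y ~ ., data = data)  # single regression tree on the simulated data
model_bias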
## n= 10000
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 10000 1.241369e+13 110515.00
## 2) V10< 48754.5 4792 2.861160e+12 84324.14
## 4) V9< 32041 2618 1.014627e+12 70758.65
## 8) V10< 23134.5 1280 3.224180e+11 58752.65 *
## 9) V10>=23134.5 1338 3.311985e+11 82244.21 *
## 5) V9>=32041 2174 7.845966e+11 100660.10
## 10) V10< 25596.5 1160 2.470058e+11 89390.88 *
## 11) V10>=25596.5 1014 2.217479e+11 113552.00 *
## 3) V10>=48754.5 5208 3.240864e+12 134613.70
## 6) V9< 28705.5 2557 1.022576e+12 119828.10
## 12) V10< 74971.5 1298 3.045487e+11 107268.00 *
## 13) V10>=74971.5 1259 3.021485e+11 132777.30 *
## 7) V9>=28705.5 2651 1.120112e+12 148875.10
## 14) V10< 75799.5 1394 3.414189e+11 136317.20 *
## 15) V10>=75799.5 1257 3.150588e+11 162801.80 *
Variable Importance Score vs # of Distinct Values
data %>%
  select(-y) %>%
  summarise_all(n_distinct) %>%
  gather(key = 'Variable', value = 'Distinct_Cnt') %>%
  left_join(varImp(model_bias) %>%
              rownames_to_column() %>%
              rename(Variable = rowname,
                     VarImp = Overall)) %>%
  select(-Distinct_Cnt, Distinct_Cnt) %>%
  arrange(desc(VarImp), desc(Distinct_Cnt))
Question 8.3
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a)
Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
Answer: 1. The bagging fraction controls the fraction of the training data sampled at each iteration. As the bagging fraction increases, each iteration's sample gets closer to the full training set, so there is less randomness during the learning process and the same highly important variables are chosen for the splits in nearly every iteration. Hence, as the bagging fraction increases, the importance scores concentrate on a few dominant variables, and vice versa.
2. The learning rate shrinks the contribution of each iteration to the ensemble. With a higher learning rate each tree contributes more, the model fits the training data in fewer effective steps, and fewer distinct variables ever get the chance to be used in splits. Hence, variables with lower importance are used even less often and the importance profile becomes even more concentrated as the learning rate increases, and vice versa. (A sketch of fitting the two extreme settings is given after this list.)
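A sketch of how the two extremes in Fig. 8.24 could be reproduced on the solubility data, assuming the solubility objects from the AppliedPredictiveModeling package; the remaining gbm settings are assumptions, not necessarily those used for the figure:
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # loads solTrainXtrans and solTrainY
sol_train <- cbind(solTrainXtrans, Solubility = solTrainY)

# left-hand panel: bagging fraction = 0.1, learning rate = 0.1
gbm_left <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                n.trees = 100, bag.fraction = 0.1, shrinkage = 0.1)
# right-hand panel: bagging fraction = 0.9, learning rate = 0.9
gbm_right <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                 n.trees = 100, bag.fraction = 0.9, shrinkage = 0.9)

head(summary(gbm_left, plotit = FALSE), 10)   # importance spread across many predictors
head(summary(gbm_right, plotit = FALSE), 10)  # importance concentrated on a few predictors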
(b)
Which model do you think would be more predictive of other samples? The model with both the bagging fraction and the learning rate set to 0.1 (the model on the left) would be more predictive of other samples. Its greater randomness and slower learning limit over-fitting, so it should generalize better to new data.
(c)
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24? As the interaction depth (tree depth) increases, more variables are used in the splits, so the slope of the predictor importance profile becomes flatter for either model. However, as the tree depth increases, the risk of over-fitting increases as well.
Question 8.7
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Load Data
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
chem_predictors <- ChemicalManufacturingProcess %>% select(-Yield) %>% as.matrix()
chem_response <- ChemicalManufacturingProcess %>% select(Yield) %>% as.matrix()
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
Data Imputation
Impute missing values using the missForest package.
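A minimal sketch of the imputation call, assuming missForest defaults (the iteration messages below come from a call of this form):
library(missForest)
set.seed(200)  # assumption: seed used for reproducibility
imp_chem_predictors <- missForest(chem_predictors)$ximp  # impute missing predictor values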
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
Train/Test Split
set.seed(200)
train_index <- createDataPartition(chem_response,
                                   p = 0.75,
                                   list = FALSE,
                                   times = 1) %>%
  as.vector()
data_train_X <- imp_chem_predictors[train_index,]
data_train_Y <- chem_response[train_index,]
data_test_X <- imp_chem_predictors[-train_index,]
data_test_Y <- chem_response[-train_index,]
(a)
Which tree-based regression model gives the optimal resampling and test set performance?
Single Tree
set.seed(200)
Model_Tree <- train(x = data_train_X,
                    y = data_train_Y,
                    method = "rpart",
                    tuneLength = 10,
                    trControl = trainControl(method = 'cv'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## CART
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 117, 119, 119, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01329166 1.411477 0.4332574 1.170937
## 0.01915705 1.418131 0.4248740 1.171664
## 0.02096592 1.386849 0.4451542 1.147608
## 0.02929826 1.428505 0.4251838 1.168758
## 0.03394084 1.429358 0.4230241 1.170566
## 0.04299920 1.507825 0.3659646 1.217390
## 0.04755934 1.547018 0.3371451 1.244473
## 0.06066735 1.577123 0.3185186 1.279035
## 0.09672549 1.487679 0.3609714 1.212389
## 0.39433963 1.799685 0.1909683 1.477203
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.02096592.
Model_Tree_Pred <- predict(Model_Tree, newdata = data_test_X)
Model_Tree_metrics <- postResample(pred = Model_Tree_Pred, obs = data_test_Y)
Model_Tree_metrics
## RMSE Rsquared MAE
## 1.4175629 0.4559385 1.0205033
Random Forest
set.seed(200)
Model_RF <- train(x = data_train_X,
                  y = data_train_Y,
                  method = "rf",
                  tuneLength = 10,
                  trControl = trainControl(method = 'cv'))
Model_RF
## Random Forest
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 117, 119, 119, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.280177 0.5780109 1.0431936
## 8 1.201788 0.6074522 0.9641183
## 14 1.181556 0.6040104 0.9348861
## 20 1.173658 0.6080744 0.9304153
## 26 1.176563 0.5986897 0.9272136
## 32 1.171192 0.5940031 0.9192327
## 38 1.173748 0.5881406 0.9249056
## 44 1.175994 0.5847389 0.9251666
## 50 1.181656 0.5750656 0.9309045
## 57 1.183814 0.5708200 0.9309317
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 32.
Model_RF_Pred <- predict(Model_RF, newdata = data_test_X)
Model_RF_metrics <- postResample(pred = Model_RF_Pred, obs = data_test_Y)
Model_RF_metrics
## RMSE Rsquared MAE
## 1.1096016 0.6650494 0.8349992
Gradient Boosting Machine
set.seed(200)
Model_GBM <- train(x = data_train_X,
                   y = data_train_Y,
                   method = "gbm",
                   tuneGrid = expand.grid(.interaction.depth = seq(1, 7, by = 2),
                                          .n.trees = seq(100, 1000, by = 50),
                                          .shrinkage = c(0.01, 0.1),
                                          .n.minobsinnode = c(5, 10)),
                   tuneLength = 10,
                   trControl = trainControl(method = 'cv'),
                   verbose = FALSE)
Model_GBM
## Stochastic Gradient Boosting
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 117, 119, 119, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 0.01 1 5 100 1.473271 0.4971408
## 0.01 1 5 150 1.401951 0.5108903
## 0.01 1 5 200 1.347653 0.5241497
## 0.01 1 5 250 1.302130 0.5431186
## 0.01 1 5 300 1.276601 0.5505714
## 0.01 1 5 350 1.256872 0.5548120
## 0.01 1 5 400 1.244275 0.5569807
## 0.01 1 5 450 1.231780 0.5622279
## 0.01 1 5 500 1.220940 0.5669893
## 0.01 1 5 550 1.214718 0.5677741
## 0.01 1 5 600 1.202997 0.5748523
## 0.01 1 5 650 1.198182 0.5773780
## 0.01 1 5 700 1.193181 0.5801056
## 0.01 1 5 750 1.189787 0.5820069
## 0.01 1 5 800 1.189040 0.5833627
## 0.01 1 5 850 1.187963 0.5829365
## 0.01 1 5 900 1.182699 0.5862234
## 0.01 1 5 950 1.177332 0.5900764
## 0.01 1 5 1000 1.174548 0.5916716
## 0.01 1 10 100 1.469223 0.4986417
## 0.01 1 10 150 1.379558 0.5323742
## 0.01 1 10 200 1.312912 0.5550574
## 0.01 1 10 250 1.277211 0.5568709
## 0.01 1 10 300 1.248318 0.5631208
## 0.01 1 10 350 1.219843 0.5744880
## 0.01 1 10 400 1.204593 0.5804048
## 0.01 1 10 450 1.193994 0.5876713
## 0.01 1 10 500 1.186797 0.5905007
## 0.01 1 10 550 1.181715 0.5909184
## 0.01 1 10 600 1.173786 0.5942702
## 0.01 1 10 650 1.168697 0.5966921
## 0.01 1 10 700 1.164736 0.5993087
## 0.01 1 10 750 1.159973 0.6011236
## 0.01 1 10 800 1.155070 0.6042505
## 0.01 1 10 850 1.153125 0.6058711
## 0.01 1 10 900 1.149685 0.6079912
## 0.01 1 10 950 1.147126 0.6084951
## 0.01 1 10 1000 1.145244 0.6085612
## 0.01 3 5 100 1.357419 0.5575750
## 0.01 3 5 150 1.281459 0.5656745
## 0.01 3 5 200 1.233399 0.5805007
## 0.01 3 5 250 1.206555 0.5872058
## 0.01 3 5 300 1.180641 0.5995960
## 0.01 3 5 350 1.167211 0.6046259
## 0.01 3 5 400 1.160957 0.6068200
## 0.01 3 5 450 1.149703 0.6119305
## 0.01 3 5 500 1.139680 0.6185247
## 0.01 3 5 550 1.138527 0.6190965
## 0.01 3 5 600 1.133788 0.6223919
## 0.01 3 5 650 1.129173 0.6255791
## 0.01 3 5 700 1.123173 0.6288987
## 0.01 3 5 750 1.120296 0.6296473
## 0.01 3 5 800 1.113729 0.6336928
## 0.01 3 5 850 1.108716 0.6377241
## 0.01 3 5 900 1.102462 0.6413880
## 0.01 3 5 950 1.099419 0.6434850
## 0.01 3 5 1000 1.093640 0.6467310
## 0.01 3 10 100 1.353018 0.5439765
## 0.01 3 10 150 1.278130 0.5541052
## 0.01 3 10 200 1.234987 0.5689872
## 0.01 3 10 250 1.208427 0.5782269
## 0.01 3 10 300 1.187394 0.5894530
## 0.01 3 10 350 1.179411 0.5919592
## 0.01 3 10 400 1.166413 0.5985721
## 0.01 3 10 450 1.155720 0.6039674
## 0.01 3 10 500 1.147734 0.6088759
## 0.01 3 10 550 1.140973 0.6120866
## 0.01 3 10 600 1.136109 0.6150817
## 0.01 3 10 650 1.137142 0.6151410
## 0.01 3 10 700 1.135291 0.6168131
## 0.01 3 10 750 1.132945 0.6184990
## 0.01 3 10 800 1.129369 0.6212336
## 0.01 3 10 850 1.127654 0.6229035
## 0.01 3 10 900 1.125279 0.6245485
## 0.01 3 10 950 1.121298 0.6271545
## 0.01 3 10 1000 1.119513 0.6278567
## 0.01 5 5 100 1.352070 0.5387531
## 0.01 5 5 150 1.265658 0.5644120
## 0.01 5 5 200 1.221763 0.5768345
## 0.01 5 5 250 1.193480 0.5868202
## 0.01 5 5 300 1.172793 0.5946769
## 0.01 5 5 350 1.154933 0.6059675
## 0.01 5 5 400 1.143557 0.6140158
## 0.01 5 5 450 1.137082 0.6178650
## 0.01 5 5 500 1.126103 0.6249817
## 0.01 5 5 550 1.117452 0.6314733
## 0.01 5 5 600 1.110066 0.6359088
## 0.01 5 5 650 1.105907 0.6386559
## 0.01 5 5 700 1.100785 0.6428167
## 0.01 5 5 750 1.096279 0.6458396
## 0.01 5 5 800 1.093262 0.6479510
## 0.01 5 5 850 1.087343 0.6523651
## 0.01 5 5 900 1.083837 0.6544320
## 0.01 5 5 950 1.080603 0.6566982
## 0.01 5 5 1000 1.076789 0.6592701
## 0.01 5 10 100 1.355303 0.5410077
## 0.01 5 10 150 1.273471 0.5600165
## 0.01 5 10 200 1.233007 0.5683194
## 0.01 5 10 250 1.204450 0.5817770
## 0.01 5 10 300 1.185174 0.5888532
## 0.01 5 10 350 1.170836 0.5971954
## 0.01 5 10 400 1.160551 0.6025170
## 0.01 5 10 450 1.153460 0.6076243
## 0.01 5 10 500 1.146073 0.6115883
## 0.01 5 10 550 1.140595 0.6148295
## 0.01 5 10 600 1.134250 0.6192855
## 0.01 5 10 650 1.130072 0.6215453
## 0.01 5 10 700 1.126778 0.6238277
## 0.01 5 10 750 1.121955 0.6274533
## 0.01 5 10 800 1.119345 0.6297200
## 0.01 5 10 850 1.117002 0.6312709
## 0.01 5 10 900 1.115410 0.6315943
## 0.01 5 10 950 1.116111 0.6309625
## 0.01 5 10 1000 1.110871 0.6349099
## 0.01 7 5 100 1.332351 0.5679392
## 0.01 7 5 150 1.258166 0.5784708
## 0.01 7 5 200 1.223478 0.5848596
## 0.01 7 5 250 1.195245 0.5977128
## 0.01 7 5 300 1.176840 0.6065940
## 0.01 7 5 350 1.149984 0.6229779
## 0.01 7 5 400 1.137787 0.6315342
## 0.01 7 5 450 1.127172 0.6385869
## 0.01 7 5 500 1.120314 0.6426285
## 0.01 7 5 550 1.113475 0.6481183
## 0.01 7 5 600 1.105928 0.6531131
## 0.01 7 5 650 1.098367 0.6583023
## 0.01 7 5 700 1.093940 0.6611717
## 0.01 7 5 750 1.089843 0.6635773
## 0.01 7 5 800 1.086594 0.6656776
## 0.01 7 5 850 1.082312 0.6680985
## 0.01 7 5 900 1.077774 0.6711760
## 0.01 7 5 950 1.075615 0.6724435
## 0.01 7 5 1000 1.073592 0.6738310
## 0.01 7 10 100 1.350033 0.5481168
## 0.01 7 10 150 1.270210 0.5603304
## 0.01 7 10 200 1.234464 0.5665183
## 0.01 7 10 250 1.203020 0.5788659
## 0.01 7 10 300 1.187338 0.5860930
## 0.01 7 10 350 1.166703 0.5965186
## 0.01 7 10 400 1.156279 0.6033520
## 0.01 7 10 450 1.150075 0.6066079
## 0.01 7 10 500 1.142495 0.6118756
## 0.01 7 10 550 1.136866 0.6151499
## 0.01 7 10 600 1.132201 0.6169032
## 0.01 7 10 650 1.130974 0.6175154
## 0.01 7 10 700 1.129253 0.6194197
## 0.01 7 10 750 1.123705 0.6229924
## 0.01 7 10 800 1.119819 0.6261337
## 0.01 7 10 850 1.117081 0.6280077
## 0.01 7 10 900 1.114474 0.6295583
## 0.01 7 10 950 1.112168 0.6307374
## 0.01 7 10 1000 1.109954 0.6321106
## 0.10 1 5 100 1.205486 0.5677768
## 0.10 1 5 150 1.175042 0.5844183
## 0.10 1 5 200 1.152607 0.6076593
## 0.10 1 5 250 1.160685 0.6029135
## 0.10 1 5 300 1.157491 0.6034124
## 0.10 1 5 350 1.130745 0.6132242
## 0.10 1 5 400 1.121013 0.6204278
## 0.10 1 5 450 1.115600 0.6240434
## 0.10 1 5 500 1.116106 0.6263161
## 0.10 1 5 550 1.110899 0.6286426
## 0.10 1 5 600 1.106577 0.6329896
## 0.10 1 5 650 1.103042 0.6338017
## 0.10 1 5 700 1.100927 0.6354050
## 0.10 1 5 750 1.099427 0.6361862
## 0.10 1 5 800 1.097322 0.6371476
## 0.10 1 5 850 1.099803 0.6347005
## 0.10 1 5 900 1.101217 0.6325188
## 0.10 1 5 950 1.097741 0.6342132
## 0.10 1 5 1000 1.095818 0.6353135
## 0.10 1 10 100 1.156099 0.6022677
## 0.10 1 10 150 1.126084 0.6277226
## 0.10 1 10 200 1.115084 0.6323078
## 0.10 1 10 250 1.109829 0.6289807
## 0.10 1 10 300 1.105368 0.6328331
## 0.10 1 10 350 1.095910 0.6345427
## 0.10 1 10 400 1.090515 0.6356221
## 0.10 1 10 450 1.090684 0.6336019
## 0.10 1 10 500 1.089658 0.6348636
## 0.10 1 10 550 1.089942 0.6339399
## 0.10 1 10 600 1.087060 0.6346821
## 0.10 1 10 650 1.085620 0.6359460
## 0.10 1 10 700 1.085802 0.6349203
## 0.10 1 10 750 1.083855 0.6378528
## 0.10 1 10 800 1.083868 0.6364452
## 0.10 1 10 850 1.083774 0.6369773
## 0.10 1 10 900 1.083053 0.6383456
## 0.10 1 10 950 1.083398 0.6388623
## 0.10 1 10 1000 1.081692 0.6392557
## 0.10 3 5 100 1.161578 0.6027382
## 0.10 3 5 150 1.139567 0.6170143
## 0.10 3 5 200 1.129315 0.6224467
## 0.10 3 5 250 1.126903 0.6239068
## 0.10 3 5 300 1.122502 0.6267327
## 0.10 3 5 350 1.120190 0.6275448
## 0.10 3 5 400 1.119628 0.6286442
## 0.10 3 5 450 1.119524 0.6286704
## 0.10 3 5 500 1.119184 0.6287459
## 0.10 3 5 550 1.118670 0.6291689
## 0.10 3 5 600 1.118383 0.6294669
## 0.10 3 5 650 1.118430 0.6293559
## 0.10 3 5 700 1.118540 0.6292865
## 0.10 3 5 750 1.118637 0.6292287
## 0.10 3 5 800 1.118527 0.6292881
## 0.10 3 5 850 1.118485 0.6292992
## 0.10 3 5 900 1.118444 0.6293195
## 0.10 3 5 950 1.118417 0.6293371
## 0.10 3 5 1000 1.118414 0.6293425
## 0.10 3 10 100 1.128901 0.6202220
## 0.10 3 10 150 1.140720 0.6128924
## 0.10 3 10 200 1.130160 0.6167806
## 0.10 3 10 250 1.124844 0.6191742
## 0.10 3 10 300 1.114858 0.6246581
## 0.10 3 10 350 1.115825 0.6239249
## 0.10 3 10 400 1.114552 0.6257944
## 0.10 3 10 450 1.112355 0.6272502
## 0.10 3 10 500 1.111626 0.6275872
## 0.10 3 10 550 1.111609 0.6277501
## 0.10 3 10 600 1.111973 0.6273088
## 0.10 3 10 650 1.112320 0.6273120
## 0.10 3 10 700 1.112134 0.6274487
## 0.10 3 10 750 1.112341 0.6271668
## 0.10 3 10 800 1.112421 0.6271241
## 0.10 3 10 850 1.112343 0.6272230
## 0.10 3 10 900 1.112908 0.6268414
## 0.10 3 10 950 1.113044 0.6267142
## 0.10 3 10 1000 1.113118 0.6265828
## 0.10 5 5 100 1.117222 0.6404199
## 0.10 5 5 150 1.094067 0.6531126
## 0.10 5 5 200 1.084484 0.6586126
## 0.10 5 5 250 1.076384 0.6651060
## 0.10 5 5 300 1.073427 0.6675055
## 0.10 5 5 350 1.071206 0.6691173
## 0.10 5 5 400 1.069600 0.6704350
## 0.10 5 5 450 1.068172 0.6713196
## 0.10 5 5 500 1.067289 0.6721906
## 0.10 5 5 550 1.066623 0.6726282
## 0.10 5 5 600 1.066561 0.6728267
## 0.10 5 5 650 1.066265 0.6730006
## 0.10 5 5 700 1.066203 0.6730741
## 0.10 5 5 750 1.066086 0.6731796
## 0.10 5 5 800 1.066007 0.6732731
## 0.10 5 5 850 1.065986 0.6732960
## 0.10 5 5 900 1.065952 0.6733221
## 0.10 5 5 950 1.065947 0.6733245
## 0.10 5 5 1000 1.065938 0.6733276
## 0.10 5 10 100 1.121702 0.6321996
## 0.10 5 10 150 1.119793 0.6357847
## 0.10 5 10 200 1.106667 0.6443552
## 0.10 5 10 250 1.097924 0.6496728
## 0.10 5 10 300 1.096011 0.6518810
## 0.10 5 10 350 1.094623 0.6534330
## 0.10 5 10 400 1.094889 0.6536713
## 0.10 5 10 450 1.093868 0.6545526
## 0.10 5 10 500 1.094878 0.6545101
## 0.10 5 10 550 1.094252 0.6554586
## 0.10 5 10 600 1.093277 0.6560280
## 0.10 5 10 650 1.093562 0.6563242
## 0.10 5 10 700 1.095014 0.6554559
## 0.10 5 10 750 1.095030 0.6554822
## 0.10 5 10 800 1.094371 0.6560415
## 0.10 5 10 850 1.094430 0.6562128
## 0.10 5 10 900 1.094596 0.6562757
## 0.10 5 10 950 1.094709 0.6564110
## 0.10 5 10 1000 1.094926 0.6563539
## 0.10 7 5 100 1.189057 0.5871892
## 0.10 7 5 150 1.168465 0.6040973
## 0.10 7 5 200 1.159321 0.6124949
## 0.10 7 5 250 1.156649 0.6155535
## 0.10 7 5 300 1.154411 0.6174031
## 0.10 7 5 350 1.153361 0.6185001
## 0.10 7 5 400 1.152837 0.6188745
## 0.10 7 5 450 1.152504 0.6194350
## 0.10 7 5 500 1.152004 0.6199290
## 0.10 7 5 550 1.151818 0.6201870
## 0.10 7 5 600 1.151836 0.6202544
## 0.10 7 5 650 1.151765 0.6203476
## 0.10 7 5 700 1.151729 0.6204390
## 0.10 7 5 750 1.151742 0.6204779
## 0.10 7 5 800 1.151691 0.6205179
## 0.10 7 5 850 1.151695 0.6205329
## 0.10 7 5 900 1.151662 0.6205623
## 0.10 7 5 950 1.151664 0.6205692
## 0.10 7 5 1000 1.151668 0.6205709
## 0.10 7 10 100 1.190652 0.5762821
## 0.10 7 10 150 1.180201 0.5834694
## 0.10 7 10 200 1.167701 0.5866883
## 0.10 7 10 250 1.150748 0.5986952
## 0.10 7 10 300 1.146134 0.6008925
## 0.10 7 10 350 1.142632 0.6025330
## 0.10 7 10 400 1.140351 0.6035123
## 0.10 7 10 450 1.139099 0.6042302
## 0.10 7 10 500 1.137666 0.6054228
## 0.10 7 10 550 1.134858 0.6067432
## 0.10 7 10 600 1.133763 0.6075150
## 0.10 7 10 650 1.133845 0.6073639
## 0.10 7 10 700 1.133342 0.6078599
## 0.10 7 10 750 1.133814 0.6077471
## 0.10 7 10 800 1.133704 0.6078972
## 0.10 7 10 850 1.134101 0.6076897
## 0.10 7 10 900 1.134260 0.6075400
## 0.10 7 10 950 1.133912 0.6077740
## 0.10 7 10 1000 1.134178 0.6076601
## MAE
## 1.2043518
## 1.1421541
## 1.0882058
## 1.0402271
## 1.0115711
## 0.9922845
## 0.9769365
## 0.9644533
## 0.9548605
## 0.9481121
## 0.9359568
## 0.9280852
## 0.9238474
## 0.9207813
## 0.9203403
## 0.9193643
## 0.9161853
## 0.9115853
## 0.9090842
## 1.2002632
## 1.1234821
## 1.0596538
## 1.0223125
## 0.9917492
## 0.9602612
## 0.9413721
## 0.9266708
## 0.9168448
## 0.9081947
## 0.9005390
## 0.8957363
## 0.8911517
## 0.8861443
## 0.8827197
## 0.8817305
## 0.8775892
## 0.8763731
## 0.8752682
## 1.1015150
## 1.0238667
## 0.9744013
## 0.9470636
## 0.9266289
## 0.9160160
## 0.9079582
## 0.8995514
## 0.8934877
## 0.8913169
## 0.8909973
## 0.8876236
## 0.8821467
## 0.8798356
## 0.8742158
## 0.8717258
## 0.8668497
## 0.8653056
## 0.8607952
## 1.0949619
## 1.0230255
## 0.9806685
## 0.9535297
## 0.9294918
## 0.9164396
## 0.9034637
## 0.8920558
## 0.8838081
## 0.8788421
## 0.8749954
## 0.8774099
## 0.8749346
## 0.8740594
## 0.8714432
## 0.8718563
## 0.8709054
## 0.8691689
## 0.8680796
## 1.1001487
## 1.0150763
## 0.9673993
## 0.9385435
## 0.9226789
## 0.9061504
## 0.8996265
## 0.8928975
## 0.8855824
## 0.8789483
## 0.8741524
## 0.8717220
## 0.8684747
## 0.8646725
## 0.8629497
## 0.8578387
## 0.8545945
## 0.8518930
## 0.8497360
## 1.0950860
## 1.0151876
## 0.9707329
## 0.9390401
## 0.9199674
## 0.9067851
## 0.8956035
## 0.8903520
## 0.8850489
## 0.8802560
## 0.8746180
## 0.8719435
## 0.8690298
## 0.8642113
## 0.8620942
## 0.8587785
## 0.8582023
## 0.8587495
## 0.8562072
## 1.0717283
## 1.0046766
## 0.9652709
## 0.9368655
## 0.9191283
## 0.8989342
## 0.8887754
## 0.8812429
## 0.8743986
## 0.8687471
## 0.8647713
## 0.8594903
## 0.8562926
## 0.8534261
## 0.8515857
## 0.8494407
## 0.8472237
## 0.8461133
## 0.8449297
## 1.0942387
## 1.0155898
## 0.9778825
## 0.9447624
## 0.9250637
## 0.9047899
## 0.8904787
## 0.8822174
## 0.8782820
## 0.8712289
## 0.8659259
## 0.8673838
## 0.8653880
## 0.8602866
## 0.8593037
## 0.8596426
## 0.8580370
## 0.8548019
## 0.8534993
## 0.9350977
## 0.9209409
## 0.9072459
## 0.9164076
## 0.9180271
## 0.8916412
## 0.8804452
## 0.8726802
## 0.8700616
## 0.8665726
## 0.8602514
## 0.8527719
## 0.8555090
## 0.8547317
## 0.8530406
## 0.8583742
## 0.8596178
## 0.8557754
## 0.8524109
## 0.8825193
## 0.8573866
## 0.8555280
## 0.8511323
## 0.8568328
## 0.8504406
## 0.8511253
## 0.8559759
## 0.8555801
## 0.8506090
## 0.8555627
## 0.8555471
## 0.8561722
## 0.8591346
## 0.8610010
## 0.8624592
## 0.8595501
## 0.8609681
## 0.8620273
## 0.9230787
## 0.9004964
## 0.9001334
## 0.8995979
## 0.8965742
## 0.8960315
## 0.8955442
## 0.8957099
## 0.8955243
## 0.8950421
## 0.8948210
## 0.8948255
## 0.8949695
## 0.8949607
## 0.8948477
## 0.8947963
## 0.8947576
## 0.8947357
## 0.8947378
## 0.8841140
## 0.8934855
## 0.8796558
## 0.8764818
## 0.8719990
## 0.8735289
## 0.8739195
## 0.8734412
## 0.8738539
## 0.8745281
## 0.8753331
## 0.8759031
## 0.8763275
## 0.8769341
## 0.8770040
## 0.8771321
## 0.8776571
## 0.8779181
## 0.8779328
## 0.8744934
## 0.8629142
## 0.8590920
## 0.8542072
## 0.8534808
## 0.8521780
## 0.8516822
## 0.8509193
## 0.8504283
## 0.8499112
## 0.8499754
## 0.8498767
## 0.8499288
## 0.8498336
## 0.8497946
## 0.8497829
## 0.8497617
## 0.8497671
## 0.8497727
## 0.8667003
## 0.8585501
## 0.8543481
## 0.8502296
## 0.8477846
## 0.8447928
## 0.8443338
## 0.8440546
## 0.8456569
## 0.8460636
## 0.8444986
## 0.8449172
## 0.8465640
## 0.8459460
## 0.8452996
## 0.8454180
## 0.8455203
## 0.8459147
## 0.8460901
## 0.9231650
## 0.9075976
## 0.9012684
## 0.8999699
## 0.8988860
## 0.8976014
## 0.8971639
## 0.8968914
## 0.8961339
## 0.8959511
## 0.8958689
## 0.8958212
## 0.8957781
## 0.8957706
## 0.8957223
## 0.8957200
## 0.8956821
## 0.8956833
## 0.8956911
## 0.9163567
## 0.9121291
## 0.9033163
## 0.8960191
## 0.8936307
## 0.8909682
## 0.8915852
## 0.8922805
## 0.8915413
## 0.8907899
## 0.8899397
## 0.8905556
## 0.8906511
## 0.8913763
## 0.8915522
## 0.8921929
## 0.8924841
## 0.8926450
## 0.8929164
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
## 5, shrinkage = 0.1 and n.minobsinnode = 5.
Model_GBM_Pred <- predict(Model_GBM, newdata = data_test_X)
Model_GBM_metrics <- postResample(pred = Model_GBM_Pred, obs = data_test_Y)
Model_GBM_metrics
## RMSE Rsquared MAE
## 1.0896377 0.6769028 0.8801046
Cubist
set.seed(200)
Model_Cubist <- train(x = data_train_X,
                      y = data_train_Y,
                      method = "cubist",
                      trControl = trainControl(method = 'cv'))
Model_Cubist
## Cubist
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 117, 119, 119, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.411332 0.5004927 1.1268350
## 1 5 1.180666 0.6664762 0.9322419
## 1 9 1.286096 0.5924816 1.0076844
## 10 0 1.135672 0.6351204 0.9539481
## 10 5 1.011371 0.7171035 0.8451593
## 10 9 1.065767 0.6811225 0.8876194
## 20 0 1.169131 0.6014467 0.9710317
## 20 5 1.045657 0.6882689 0.8606767
## 20 9 1.100499 0.6493282 0.9044529
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 10 and neighbors = 5.
Model_Cubist_Pred <- predict(Model_Cubist, newdata = data_test_X)
Model_Cubist_metrics <- postResample(pred = Model_Cubist_Pred, obs = data_test_Y)
Model_Cubist_metrics
## RMSE Rsquared MAE
## 0.8879711 0.7813045 0.6587988
Model Comparison
The best model selected by both RMSE & R2 is Cubist.
rbind(Model_Tree_metrics,
      Model_RF_metrics,
      Model_GBM_metrics,
      Model_Cubist_metrics) %>%
  data.frame() %>%
  arrange(RMSE)
(b)
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
Answer: 1. The most important variable in the Cubist model is ManufacturingProcess17.
2. The list is dominated by the process variables (8 of the top 10 are process variables).
3. For both the optimal linear and nonlinear models from the previous homeworks, the manufacturing process variables also dominate the importance list; the linear, nonlinear, and tree-based models all show a similar pattern.
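The ranking below was presumably produced with caret's varImp on the tuned Cubist model:
varImp(Model_Cubist)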
## cubist variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess17 100.000
## ManufacturingProcess32 97.959
## ManufacturingProcess01 62.245
## ManufacturingProcess39 41.837
## BiologicalMaterial12 40.816
## ManufacturingProcess09 39.796
## BiologicalMaterial06 33.673
## ManufacturingProcess33 31.633
## ManufacturingProcess29 30.612
## ManufacturingProcess37 25.510
## ManufacturingProcess04 22.449
## ManufacturingProcess27 20.408
## BiologicalMaterial02 16.327
## BiologicalMaterial08 14.286
## ManufacturingProcess45 14.286
## ManufacturingProcess13 11.224
## ManufacturingProcess15 10.204
## ManufacturingProcess42 8.163
## ManufacturingProcess38 6.122
## BiologicalMaterial10 6.122
(c)
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
Answer: Yes. The view provides the additional insight that most of the splits in the tree are made on manufacturing process variables, which is consistent with the process predictors dominating the importance list.
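A sketch of one way to draw this view, converting the tuned rpart model to a party object so the terminal nodes show box plots of the yield (the original plotting code is not shown, so this is an assumption):
library(partykit)
# terminal nodes display the distribution of Yield for the training samples in each leaf
plot(as.party(Model_Tree$finalModel))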