DATA 624 - HOMEWORK 9
Load Package
Question 8.1
Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
data_x <- simulated$x
data_y <- simulated$y
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
(a)
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
#library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
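For reference, the resulting scores can be listed in descending order (this mirrors the display used for model 2 in part (b) and assumes the tidyverse is loaded in the Load Package chunk):
rfImp1 %>% arrange(desc(Overall))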
Did the random forest model significantly use the uninformative predictors (V6 - V10)?
Answer: No, the RF model did not make significant use of V6 - V10; the variable importance scores show that the top five variables are V1 - V5, while V6 - V10 rank at the bottom.
(b)
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
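A minimal sketch of how such a predictor can be added, following the textbook's construction (the exact noise scale is an assumption, and the correlation printed below depends on the RNG state of the original run):
# add a noisy copy of V1 so the two predictors are highly correlated
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)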
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
Answer: The importance score for V1 changed. When another predictor is highly correlated with V1, the importance of V1 is split between V1 itself and the newly added duplicate. The importance score of V1 in model 1 is almost the same as the sum of the importance scores of V1 and duplicate1 in model 2.
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2 %>% arrange(desc(Overall))
(c)
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
Answer: The importances show a similar pattern between the conditional inference forest (model 3) and the traditional random forest (model 1). The main difference is V3: in the traditional RF model, V3 is one of the top five variables, whereas in the conditional inference forest its importance is very small.
#remove the column duplicate1 from (b)
simulated = subset(simulated, select = -duplicate1)
simulated
model3 <- cforest(y ~ ., data = simulated)
compare1 <- cbind(data.frame(varImp(model3)),
                  VarImp_RF = rfImp1$Overall) %>%
  rownames_to_column() %>%
  rename(VarImp_CF = Overall,
         Variable = rowname)
compare1 %>%
  gather(key = 'Model', value = 'VarImp', -Variable) %>%
  ggplot(aes(x = reorder(Variable, desc(Variable %>% str_remove('V') %>% as.integer())), y = VarImp, fill = Model)) +
  geom_bar(stat = "identity") +
  facet_grid(~Model) +
  coord_flip() +
  ggtitle('Variable Importance Scores Between Conditional Inference Tree Model and Traditional RF Model') +
  xlab('Variable')
(d)
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Answer: A similar pattern occurs for all of the tree models. Although the patterns are not identical, all of them show that tree models are good at identifying the informative variables.
model4 <- gbm(y ~ ., data = simulated, distribution = 'gaussian', n.trees = 1000)
VarImp_GBM <- varImp(model4, numTrees = 1000)
model5 <- cubist(simulated[-11], simulated$y)
VarImp_Cubist <- varImp(model5)
VarImp_GBM %>%
  rownames_to_column() %>%
  arrange(rowname %>% str_remove('V') %>% as.integer()) %>%
  rename(VarImp_GBM = Overall) %>%
  left_join(VarImp_Cubist %>%
              rownames_to_column() %>%
              arrange(rowname %>% str_remove('V') %>% as.integer()) %>%
              rename(VarImp_Cubist = Overall)) %>%
  rename(Variable = rowname) %>%
  left_join(compare1) %>%
  gather(key = 'Model', value = 'VarImp', -Variable) %>%
  ggplot(aes(x = reorder(Variable, desc(Variable %>% str_remove('V') %>% as.integer())), y = VarImp, fill = Model)) +
  geom_bar(stat = "identity") +
  facet_grid(~Model, scales = "free") +
  coord_flip() +
  ggtitle('Variable Importance Scores Across Models') +
  xlab('Variable')
Question 8.2
Use a simulation to show tree bias with different granularities.
Answer: The simulation below shows that the more granular a variable is (i.e., the more distinct values it has), the higher its importance score, which reflects the selection bias of tree models.
Create 10 variables with different granularities
data <- NULL
for(i in 1:10){
data = cbind(data, sample(1:(i^5), 10000, replace = TRUE))
}
data <- data %>%
as.data.frame() %>%
mutate(y = rowSums(data)+ sample(-5:5,1))
str(data)
## 'data.frame': 10000 obs. of 11 variables:
## $ V1 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ V2 : int 24 4 28 8 9 30 30 27 31 17 ...
## $ V3 : int 16 87 88 10 242 225 142 3 2 227 ...
## $ V4 : int 138 158 927 679 547 324 443 667 884 226 ...
## $ V5 : int 3049 1583 231 1054 3120 2506 727 1922 1947 796 ...
## $ V6 : int 3090 5817 6474 3992 6810 1387 7329 6363 1763 6838 ...
## $ V7 : int 6365 3688 4815 15926 14581 1488 14195 7213 10543 6156 ...
## $ V8 : int 31585 5334 13519 10613 31408 4681 14373 21929 7054 9651 ...
## $ V9 : int 42139 40891 47229 30611 45299 20004 12303 58097 1982 19223 ...
## $ V10: int 7114 54193 90438 34194 80444 12361 44669 25885 4217 39986 ...
## $ y : num 93524 111759 163753 97091 182464 ...
Build a regression tree
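A minimal sketch of the fit that produces the printed tree below, assuming rpart with default settings:
library(rpart)
model_bias <- rpart(y ~ ., data = data)  # single regression tree on the simulated data
model_bias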
## n= 10000
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 10000 1.241369e+13 110515.00
## 2) V10< 48754.5 4792 2.861160e+12 84324.14
## 4) V9< 32041 2618 1.014627e+12 70758.65
## 8) V10< 23134.5 1280 3.224180e+11 58752.65 *
## 9) V10>=23134.5 1338 3.311985e+11 82244.21 *
## 5) V9>=32041 2174 7.845966e+11 100660.10
## 10) V10< 25596.5 1160 2.470058e+11 89390.88 *
## 11) V10>=25596.5 1014 2.217479e+11 113552.00 *
## 3) V10>=48754.5 5208 3.240864e+12 134613.70
## 6) V9< 28705.5 2557 1.022576e+12 119828.10
## 12) V10< 74971.5 1298 3.045487e+11 107268.00 *
## 13) V10>=74971.5 1259 3.021485e+11 132777.30 *
## 7) V9>=28705.5 2651 1.120112e+12 148875.10
## 14) V10< 75799.5 1394 3.414189e+11 136317.20 *
## 15) V10>=75799.5 1257 3.150588e+11 162801.80 *
Variable Importance Score vs # of Distinct Values
data %>%
  select(-y) %>%
  summarise_all(n_distinct) %>%
  gather(key = 'Variable', value = 'Distinct_Cnt') %>%
  left_join(varImp(model_bias) %>%
              rownames_to_column() %>%
              rename(Variable = rowname,
                     VarImp = Overall)) %>%
  select(-Distinct_Cnt, Distinct_Cnt) %>%
  arrange(desc(VarImp), desc(Distinct_Cnt))
Question 8.3
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a)
Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
Answer: 1. The bagging fraction controls the fraction of the training data sampled at each iteration. As the bagging fraction increases, each iteration's sample gets closer to the full training set, so there is less randomness during the learning process and the same highly important variables are chosen for the splits in nearly every iteration. Hence, as the bagging fraction increases, the importance scores concentrate on a few dominant variables, and vice versa.
2. The learning rate shrinks the contribution of each iteration to the ensemble. With a higher learning rate each tree contributes more, the model fits the training data in fewer effective steps, and fewer distinct variables ever get the chance to be used in splits. Hence, variables with lower importance are used even less often and the importance profile becomes even more concentrated as the learning rate increases, and vice versa. (A sketch of fitting the two extreme settings is given after this list.)
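A sketch of how the two extremes in Fig. 8.24 could be reproduced on the solubility data, assuming the solubility objects from the AppliedPredictiveModeling package; the remaining gbm settings are assumptions, not necessarily those used for the figure:
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # loads solTrainXtrans and solTrainY
sol_train <- cbind(solTrainXtrans, Solubility = solTrainY)

# left-hand panel: bagging fraction = 0.1, learning rate = 0.1
gbm_left <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                n.trees = 100, bag.fraction = 0.1, shrinkage = 0.1)
# right-hand panel: bagging fraction = 0.9, learning rate = 0.9
gbm_right <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                 n.trees = 100, bag.fraction = 0.9, shrinkage = 0.9)

head(summary(gbm_left, plotit = FALSE), 10)   # importance spread across many predictors
head(summary(gbm_right, plotit = FALSE), 10)  # importance concentrated on a few predictors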
(b)
Which model do you think would be more predictive of other samples? The model with both the bagging fraction and the learning rate set to 0.1 (the model on the left) would be more predictive of other samples. Its greater randomness and slower learning limit over-fitting, so it should generalize better to new data.
(c)
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24? As the interaction depth (tree depth) increases, more variables are used in the splits, so the slope of the predictor importance profile becomes flatter for either model. However, as the tree depth increases, the risk of over-fitting increases as well.
Question 8.7
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Load Data
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
chem_predictors <- ChemicalManufacturingProcess %>% select(-Yield) %>% as.matrix()
chem_response <- ChemicalManufacturingProcess %>% select(Yield) %>% as.matrix()
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
Data Imputation
Impute missing values using the missForest package.
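A minimal sketch of the imputation call, assuming missForest defaults (the iteration messages below come from a call of this form):
library(missForest)
set.seed(200)  # assumption: seed used for reproducibility
imp_chem_predictors <- missForest(chem_predictors)$ximp  # impute missing predictor values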
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
Train/Test Split
set.seed(200)
train_index <- createDataPartition(chem_response,
                                   p = 0.75,
                                   list = FALSE,
                                   times = 1) %>%
  as.vector()
data_train_X <- imp_chem_predictors[train_index,]
data_train_Y <- chem_response[train_index,]
data_test_X <- imp_chem_predictors[-train_index,]
data_test_Y <- chem_response[-train_index,]
(a)
Which tree-based regression model gives the optimal resampling and test set performance?
Single Tree
set.seed(200)
Model_Tree <- train(x = data_train_X,
                    y = data_train_Y,
                    method = "rpart",
                    tuneLength = 10,
                    trControl = trainControl(method = 'cv'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## CART
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 117, 119, 119, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01329166 1.411477 0.4332574 1.170937
## 0.01915705 1.418131 0.4248740 1.171664
## 0.02096592 1.386849 0.4451542 1.147608
## 0.02929826 1.428505 0.4251838 1.168758
## 0.03394084 1.429358 0.4230241 1.170566
## 0.04299920 1.507825 0.3659646 1.217390
## 0.04755934 1.547018 0.3371451 1.244473
## 0.06066735 1.577123 0.3185186 1.279035
## 0.09672549 1.487679 0.3609714 1.212389
## 0.39433963 1.799685 0.1909683 1.477203
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.02096592.
Model_Tree_Pred <- predict(Model_Tree, newdata = data_test_X)
Model_Tree_metrics <- postResample(pred = Model_Tree_Pred, obs = data_test_Y)
Model_Tree_metrics
## RMSE Rsquared MAE
## 1.4175629 0.4559385 1.0205033
Random Forest
set.seed(200)
Model_RF <- train(x = data_train_X,
                  y = data_train_Y,
                  method = "rf",
                  tuneLength = 10,
                  trControl = trainControl(method = 'cv'))
Model_RF
## Random Forest
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 117, 119, 119, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.280177 0.5780109 1.0431936
## 8 1.201788 0.6074522 0.9641183
## 14 1.181556 0.6040104 0.9348861
## 20 1.173658 0.6080744 0.9304153
## 26 1.176563 0.5986897 0.9272136
## 32 1.171192 0.5940031 0.9192327
## 38 1.173748 0.5881406 0.9249056
## 44 1.175994 0.5847389 0.9251666
## 50 1.181656 0.5750656 0.9309045
## 57 1.183814 0.5708200 0.9309317
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 32.
Model_RF_Pred <- predict(Model_RF, newdata = data_test_X)
Model_RF_metrics <- postResample(pred = Model_RF_Pred, obs = data_test_Y)
Model_RF_metrics
## RMSE Rsquared MAE
## 1.1096016 0.6650494 0.8349992
Gradient Boosting Machine
set.seed(200)
Model_GBM <- train(x = data_train_X,
                   y = data_train_Y,
                   method = "gbm",
                   tuneGrid = expand.grid(.interaction.depth = seq(1, 7, by = 2),
                                          .n.trees = seq(100, 1000, by = 50),
                                          .shrinkage = c(0.01, 0.1),
                                          .n.minobsinnode = c(5, 10)),
                   tuneLength = 10,
                   trControl = trainControl(method = 'cv'),
                   verbose = FALSE)
Model_GBM
## Stochastic Gradient Boosting
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 117, 119, 119, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 0.01 1 5 100 1.473271 0.4971408
## 0.01 1 5 150 1.401951 0.5108903
## 0.01 1 5 200 1.347653 0.5241497
## 0.01 1 5 250 1.302130 0.5431186
## 0.01 1 5 300 1.276601 0.5505714
## 0.01 1 5 350 1.256872 0.5548120
## 0.01 1 5 400 1.244275 0.5569807
## 0.01 1 5 450 1.231780 0.5622279
## 0.01 1 5 500 1.220940 0.5669893
## 0.01 1 5 550 1.214718 0.5677741
## 0.01 1 5 600 1.202997 0.5748523
## 0.01 1 5 650 1.198182 0.5773780
## 0.01 1 5 700 1.193181 0.5801056
## 0.01 1 5 750 1.189787 0.5820069
## 0.01 1 5 800 1.189040 0.5833627
## 0.01 1 5 850 1.187963 0.5829365
## 0.01 1 5 900 1.182699 0.5862234
## 0.01 1 5 950 1.177332 0.5900764
## 0.01 1 5 1000 1.174548 0.5916716
## 0.01 1 10 100 1.469223 0.4986417
## 0.01 1 10 150 1.379558 0.5323742
## 0.01 1 10 200 1.312912 0.5550574
## 0.01 1 10 250 1.277211 0.5568709
## 0.01 1 10 300 1.248318 0.5631208
## 0.01 1 10 350 1.219843 0.5744880
## 0.01 1 10 400 1.204593 0.5804048
## 0.01 1 10 450 1.193994 0.5876713
## 0.01 1 10 500 1.186797 0.5905007
## 0.01 1 10 550 1.181715 0.5909184
## 0.01 1 10 600 1.173786 0.5942702
## 0.01 1 10 650 1.168697 0.5966921
## 0.01 1 10 700 1.164736 0.5993087
## 0.01 1 10 750 1.159973 0.6011236
## 0.01 1 10 800 1.155070 0.6042505
## 0.01 1 10 850 1.153125 0.6058711
## 0.01 1 10 900 1.149685 0.6079912
## 0.01 1 10 950 1.147126 0.6084951
## 0.01 1 10 1000 1.145244 0.6085612
## 0.01 3 5 100 1.357419 0.5575750
## 0.01 3 5 150 1.281459 0.5656745
## 0.01 3 5 200 1.233399 0.5805007
## 0.01 3 5 250 1.206555 0.5872058
## 0.01 3 5 300 1.180641 0.5995960
## 0.01 3 5 350 1.167211 0.6046259
## 0.01 3 5 400 1.160957 0.6068200
## 0.01 3 5 450 1.149703 0.6119305
## 0.01 3 5 500 1.139680 0.6185247
## 0.01 3 5 550 1.138527 0.6190965
## 0.01 3 5 600 1.133788 0.6223919
## 0.01 3 5 650 1.129173 0.6255791
## 0.01 3 5 700 1.123173 0.6288987
## 0.01 3 5 750 1.120296 0.6296473
## 0.01 3 5 800 1.113729 0.6336928
## 0.01 3 5 850 1.108716 0.6377241
## 0.01 3 5 900 1.102462 0.6413880
## 0.01 3 5 950 1.099419 0.6434850
## 0.01 3 5 1000 1.093640 0.6467310
## 0.01 3 10 100 1.353018 0.5439765
## 0.01 3 10 150 1.278130 0.5541052
## 0.01 3 10 200 1.234987 0.5689872
## 0.01 3 10 250 1.208427 0.5782269
## 0.01 3 10 300 1.187394 0.5894530
## 0.01 3 10 350 1.179411 0.5919592
## 0.01 3 10 400 1.166413 0.5985721
## 0.01 3 10 450 1.155720 0.6039674
## 0.01 3 10 500 1.147734 0.6088759
## 0.01 3 10 550 1.140973 0.6120866
## 0.01 3 10 600 1.136109 0.6150817
## 0.01 3 10 650 1.137142 0.6151410
## 0.01 3 10 700 1.135291 0.6168131
## 0.01 3 10 750 1.132945 0.6184990
## 0.01 3 10 800 1.129369 0.6212336
## 0.01 3 10 850 1.127654 0.6229035
## 0.01 3 10 900 1.125279 0.6245485
## 0.01 3 10 950 1.121298 0.6271545
## 0.01 3 10 1000 1.119513 0.6278567
## 0.01 5 5 100 1.352070 0.5387531
## 0.01 5 5 150 1.265658 0.5644120
## 0.01 5 5 200 1.221763 0.5768345
## 0.01 5 5 250 1.193480 0.5868202
## 0.01 5 5 300 1.172793 0.5946769
## 0.01 5 5 350 1.154933 0.6059675
## 0.01 5 5 400 1.143557 0.6140158
## 0.01 5 5 450 1.137082 0.6178650
## 0.01 5 5 500 1.126103 0.6249817
## 0.01 5 5 550 1.117452 0.6314733
## 0.01 5 5 600 1.110066 0.6359088
## 0.01 5 5 650 1.105907 0.6386559
## 0.01 5 5 700 1.100785 0.6428167
## 0.01 5 5 750 1.096279 0.6458396
## 0.01 5 5 800 1.093262 0.6479510
## 0.01 5 5 850 1.087343 0.6523651
## 0.01 5 5 900 1.083837 0.6544320
## 0.01 5 5 950 1.080603 0.6566982
## 0.01 5 5 1000 1.076789 0.6592701
## 0.01 5 10 100 1.355303 0.5410077
## 0.01 5 10 150 1.273471 0.5600165
## 0.01 5 10 200 1.233007 0.5683194
## 0.01 5 10 250 1.204450 0.5817770
## 0.01 5 10 300 1.185174 0.5888532
## 0.01 5 10 350 1.170836 0.5971954
## 0.01 5 10 400 1.160551 0.6025170
## 0.01 5 10 450 1.153460 0.6076243
## 0.01 5 10 500 1.146073 0.6115883
## 0.01 5 10 550 1.140595 0.6148295
## 0.01 5 10 600 1.134250 0.6192855
## 0.01 5 10 650 1.130072 0.6215453
## 0.01 5 10 700 1.126778 0.6238277
## 0.01 5 10 750 1.121955 0.6274533
## 0.01 5 10 800 1.119345 0.6297200
## 0.01 5 10 850 1.117002 0.6312709
## 0.01 5 10 900 1.115410 0.6315943
## 0.01 5 10 950 1.116111 0.6309625
## 0.01 5 10 1000 1.110871 0.6349099
## 0.01 7 5 100 1.332351 0.5679392
## 0.01 7 5 150 1.258166 0.5784708
## 0.01 7 5 200 1.223478 0.5848596
## 0.01 7 5 250 1.195245 0.5977128
## 0.01 7 5 300 1.176840 0.6065940
## 0.01 7 5 350 1.149984 0.6229779
## 0.01 7 5 400 1.137787 0.6315342
## 0.01 7 5 450 1.127172 0.6385869
## 0.01 7 5 500 1.120314 0.6426285
## 0.01 7 5 550 1.113475 0.6481183
## 0.01 7 5 600 1.105928 0.6531131
## 0.01 7 5 650 1.098367 0.6583023
## 0.01 7 5 700 1.093940 0.6611717
## 0.01 7 5 750 1.089843 0.6635773
## 0.01 7 5 800 1.086594 0.6656776
## 0.01 7 5 850 1.082312 0.6680985
## 0.01 7 5 900 1.077774 0.6711760
## 0.01 7 5 950 1.075615 0.6724435
## 0.01 7 5 1000 1.073592 0.6738310
## 0.01 7 10 100 1.350033 0.5481168
## 0.01 7 10 150 1.270210 0.5603304
## 0.01 7 10 200 1.234464 0.5665183
## 0.01 7 10 250 1.203020 0.5788659
## 0.01 7 10 300 1.187338 0.5860930
## 0.01 7 10 350 1.166703 0.5965186
## 0.01 7 10 400 1.156279 0.6033520
## 0.01 7 10 450 1.150075 0.6066079
## 0.01 7 10 500 1.142495 0.6118756
## 0.01 7 10 550 1.136866 0.6151499
## 0.01 7 10 600 1.132201 0.6169032
## 0.01 7 10 650 1.130974 0.6175154
## 0.01 7 10 700 1.129253 0.6194197
## 0.01 7 10 750 1.123705 0.6229924
## 0.01 7 10 800 1.119819 0.6261337
## 0.01 7 10 850 1.117081 0.6280077
## 0.01 7 10 900 1.114474 0.6295583
## 0.01 7 10 950 1.112168 0.6307374
## 0.01 7 10 1000 1.109954 0.6321106
## 0.10 1 5 100 1.205486 0.5677768
## 0.10 1 5 150 1.175042 0.5844183
## 0.10 1 5 200 1.152607 0.6076593
## 0.10 1 5 250 1.160685 0.6029135
## 0.10 1 5 300 1.157491 0.6034124
## 0.10 1 5 350 1.130745 0.6132242
## 0.10 1 5 400 1.121013 0.6204278
## 0.10 1 5 450 1.115600 0.6240434
## 0.10 1 5 500 1.116106 0.6263161
## 0.10 1 5 550 1.110899 0.6286426
## 0.10 1 5 600 1.106577 0.6329896
## 0.10 1 5 650 1.103042 0.6338017
## 0.10 1 5 700 1.100927 0.6354050
## 0.10 1 5 750 1.099427 0.6361862
## 0.10 1 5 800 1.097322 0.6371476
## 0.10 1 5 850 1.099803 0.6347005
## 0.10 1 5 900 1.101217 0.6325188
## 0.10 1 5 950 1.097741 0.6342132
## 0.10 1 5 1000 1.095818 0.6353135
## 0.10 1 10 100 1.156099 0.6022677
## 0.10 1 10 150 1.126084 0.6277226
## 0.10 1 10 200 1.115084 0.6323078
## 0.10 1 10 250 1.109829 0.6289807
## 0.10 1 10 300 1.105368 0.6328331
## 0.10 1 10 350 1.095910 0.6345427
## 0.10 1 10 400 1.090515 0.6356221
## 0.10 1 10 450 1.090684 0.6336019
## 0.10 1 10 500 1.089658 0.6348636
## 0.10 1 10 550 1.089942 0.6339399
## 0.10 1 10 600 1.087060 0.6346821
## 0.10 1 10 650 1.085620 0.6359460
## 0.10 1 10 700 1.085802 0.6349203
## 0.10 1 10 750 1.083855 0.6378528
## 0.10 1 10 800 1.083868 0.6364452
## 0.10 1 10 850 1.083774 0.6369773
## 0.10 1 10 900 1.083053 0.6383456
## 0.10 1 10 950 1.083398 0.6388623
## 0.10 1 10 1000 1.081692 0.6392557
## 0.10 3 5 100 1.161578 0.6027382
## 0.10 3 5 150 1.139567 0.6170143
## 0.10 3 5 200 1.129315 0.6224467
## 0.10 3 5 250 1.126903 0.6239068
## 0.10 3 5 300 1.122502 0.6267327
## 0.10 3 5 350 1.120190 0.6275448
## 0.10 3 5 400 1.119628 0.6286442
## 0.10 3 5 450 1.119524 0.6286704
## 0.10 3 5 500 1.119184 0.6287459
## 0.10 3 5 550 1.118670 0.6291689
## 0.10 3 5 600 1.118383 0.6294669
## 0.10 3 5 650 1.118430 0.6293559
## 0.10 3 5 700 1.118540 0.6292865
## 0.10 3 5 750 1.118637 0.6292287
## 0.10 3 5 800 1.118527 0.6292881
## 0.10 3 5 850 1.118485 0.6292992
## 0.10 3 5 900 1.118444 0.6293195
## 0.10 3 5 950 1.118417 0.6293371
## 0.10 3 5 1000 1.118414 0.6293425
## 0.10 3 10 100 1.128901 0.6202220
## 0.10 3 10 150 1.140720 0.6128924
## 0.10 3 10 200 1.130160 0.6167806
## 0.10 3 10 250 1.124844 0.6191742
## 0.10 3 10 300 1.114858 0.6246581
## 0.10 3 10 350 1.115825 0.6239249
## 0.10 3 10 400 1.114552 0.6257944
## 0.10 3 10 450 1.112355 0.6272502
## 0.10 3 10 500 1.111626 0.6275872
## 0.10 3 10 550 1.111609 0.6277501
## 0.10 3 10 600 1.111973 0.6273088
## 0.10 3 10 650 1.112320 0.6273120
## 0.10 3 10 700 1.112134 0.6274487
## 0.10 3 10 750 1.112341 0.6271668
## 0.10 3 10 800 1.112421 0.6271241
## 0.10 3 10 850 1.112343 0.6272230
## 0.10 3 10 900 1.112908 0.6268414
## 0.10 3 10 950 1.113044 0.6267142
## 0.10 3 10 1000 1.113118 0.6265828
## 0.10 5 5 100 1.117222 0.6404199
## 0.10 5 5 150 1.094067 0.6531126
## 0.10 5 5 200 1.084484 0.6586126
## 0.10 5 5 250 1.076384 0.6651060
## 0.10 5 5 300 1.073427 0.6675055
## 0.10 5 5 350 1.071206 0.6691173
## 0.10 5 5 400 1.069600 0.6704350
## 0.10 5 5 450 1.068172 0.6713196
## 0.10 5 5 500 1.067289 0.6721906
## 0.10 5 5 550 1.066623 0.6726282
## 0.10 5 5 600 1.066561 0.6728267
## 0.10 5 5 650 1.066265 0.6730006
## 0.10 5 5 700 1.066203 0.6730741
## 0.10 5 5 750 1.066086 0.6731796
## 0.10 5 5 800 1.066007 0.6732731
## 0.10 5 5 850 1.065986 0.6732960
## 0.10 5 5 900 1.065952 0.6733221
## 0.10 5 5 950 1.065947 0.6733245
## 0.10 5 5 1000 1.065938 0.6733276
## 0.10 5 10 100 1.121702 0.6321996
## 0.10 5 10 150 1.119793 0.6357847
## 0.10 5 10 200 1.106667 0.6443552
## 0.10 5 10 250 1.097924 0.6496728
## 0.10 5 10 300 1.096011 0.6518810
## 0.10 5 10 350 1.094623 0.6534330
## 0.10 5 10 400 1.094889 0.6536713
## 0.10 5 10 450 1.093868 0.6545526
## 0.10 5 10 500 1.094878 0.6545101
## 0.10 5 10 550 1.094252 0.6554586
## 0.10 5 10 600 1.093277 0.6560280
## 0.10 5 10 650 1.093562 0.6563242
## 0.10 5 10 700 1.095014 0.6554559
## 0.10 5 10 750 1.095030 0.6554822
## 0.10 5 10 800 1.094371 0.6560415
## 0.10 5 10 850 1.094430 0.6562128
## 0.10 5 10 900 1.094596 0.6562757
## 0.10 5 10 950 1.094709 0.6564110
## 0.10 5 10 1000 1.094926 0.6563539
## 0.10 7 5 100 1.189057 0.5871892
## 0.10 7 5 150 1.168465 0.6040973
## 0.10 7 5 200 1.159321 0.6124949
## 0.10 7 5 250 1.156649 0.6155535
## 0.10 7 5 300 1.154411 0.6174031
## 0.10 7 5 350 1.153361 0.6185001
## 0.10 7 5 400 1.152837 0.6188745
## 0.10 7 5 450 1.152504 0.6194350
## 0.10 7 5 500 1.152004 0.6199290
## 0.10 7 5 550 1.151818 0.6201870
## 0.10 7 5 600 1.151836 0.6202544
## 0.10 7 5 650 1.151765 0.6203476
## 0.10 7 5 700 1.151729 0.6204390
## 0.10 7 5 750 1.151742 0.6204779
## 0.10 7 5 800 1.151691 0.6205179
## 0.10 7 5 850 1.151695 0.6205329
## 0.10 7 5 900 1.151662 0.6205623
## 0.10 7 5 950 1.151664 0.6205692
## 0.10 7 5 1000 1.151668 0.6205709
## 0.10 7 10 100 1.190652 0.5762821
## 0.10 7 10 150 1.180201 0.5834694
## 0.10 7 10 200 1.167701 0.5866883
## 0.10 7 10 250 1.150748 0.5986952
## 0.10 7 10 300 1.146134 0.6008925
## 0.10 7 10 350 1.142632 0.6025330
## 0.10 7 10 400 1.140351 0.6035123
## 0.10 7 10 450 1.139099 0.6042302
## 0.10 7 10 500 1.137666 0.6054228
## 0.10 7 10 550 1.134858 0.6067432
## 0.10 7 10 600 1.133763 0.6075150
## 0.10 7 10 650 1.133845 0.6073639
## 0.10 7 10 700 1.133342 0.6078599
## 0.10 7 10 750 1.133814 0.6077471
## 0.10 7 10 800 1.133704 0.6078972
## 0.10 7 10 850 1.134101 0.6076897
## 0.10 7 10 900 1.134260 0.6075400
## 0.10 7 10 950 1.133912 0.6077740
## 0.10 7 10 1000 1.134178 0.6076601
## MAE
## 1.2043518
## 1.1421541
## 1.0882058
## 1.0402271
## 1.0115711
## 0.9922845
## 0.9769365
## 0.9644533
## 0.9548605
## 0.9481121
## 0.9359568
## 0.9280852
## 0.9238474
## 0.9207813
## 0.9203403
## 0.9193643
## 0.9161853
## 0.9115853
## 0.9090842
## 1.2002632
## 1.1234821
## 1.0596538
## 1.0223125
## 0.9917492
## 0.9602612
## 0.9413721
## 0.9266708
## 0.9168448
## 0.9081947
## 0.9005390
## 0.8957363
## 0.8911517
## 0.8861443
## 0.8827197
## 0.8817305
## 0.8775892
## 0.8763731
## 0.8752682
## 1.1015150
## 1.0238667
## 0.9744013
## 0.9470636
## 0.9266289
## 0.9160160
## 0.9079582
## 0.8995514
## 0.8934877
## 0.8913169
## 0.8909973
## 0.8876236
## 0.8821467
## 0.8798356
## 0.8742158
## 0.8717258
## 0.8668497
## 0.8653056
## 0.8607952
## 1.0949619
## 1.0230255
## 0.9806685
## 0.9535297
## 0.9294918
## 0.9164396
## 0.9034637
## 0.8920558
## 0.8838081
## 0.8788421
## 0.8749954
## 0.8774099
## 0.8749346
## 0.8740594
## 0.8714432
## 0.8718563
## 0.8709054
## 0.8691689
## 0.8680796
## 1.1001487
## 1.0150763
## 0.9673993
## 0.9385435
## 0.9226789
## 0.9061504
## 0.8996265
## 0.8928975
## 0.8855824
## 0.8789483
## 0.8741524
## 0.8717220
## 0.8684747
## 0.8646725
## 0.8629497
## 0.8578387
## 0.8545945
## 0.8518930
## 0.8497360
## 1.0950860
## 1.0151876
## 0.9707329
## 0.9390401
## 0.9199674
## 0.9067851
## 0.8956035
## 0.8903520
## 0.8850489
## 0.8802560
## 0.8746180
## 0.8719435
## 0.8690298
## 0.8642113
## 0.8620942
## 0.8587785
## 0.8582023
## 0.8587495
## 0.8562072
## 1.0717283
## 1.0046766
## 0.9652709
## 0.9368655
## 0.9191283
## 0.8989342
## 0.8887754
## 0.8812429
## 0.8743986
## 0.8687471
## 0.8647713
## 0.8594903
## 0.8562926
## 0.8534261
## 0.8515857
## 0.8494407
## 0.8472237
## 0.8461133
## 0.8449297
## 1.0942387
## 1.0155898
## 0.9778825
## 0.9447624
## 0.9250637
## 0.9047899
## 0.8904787
## 0.8822174
## 0.8782820
## 0.8712289
## 0.8659259
## 0.8673838
## 0.8653880
## 0.8602866
## 0.8593037
## 0.8596426
## 0.8580370
## 0.8548019
## 0.8534993
## 0.9350977
## 0.9209409
## 0.9072459
## 0.9164076
## 0.9180271
## 0.8916412
## 0.8804452
## 0.8726802
## 0.8700616
## 0.8665726
## 0.8602514
## 0.8527719
## 0.8555090
## 0.8547317
## 0.8530406
## 0.8583742
## 0.8596178
## 0.8557754
## 0.8524109
## 0.8825193
## 0.8573866
## 0.8555280
## 0.8511323
## 0.8568328
## 0.8504406
## 0.8511253
## 0.8559759
## 0.8555801
## 0.8506090
## 0.8555627
## 0.8555471
## 0.8561722
## 0.8591346
## 0.8610010
## 0.8624592
## 0.8595501
## 0.8609681
## 0.8620273
## 0.9230787
## 0.9004964
## 0.9001334
## 0.8995979
## 0.8965742
## 0.8960315
## 0.8955442
## 0.8957099
## 0.8955243
## 0.8950421
## 0.8948210
## 0.8948255
## 0.8949695
## 0.8949607
## 0.8948477
## 0.8947963
## 0.8947576
## 0.8947357
## 0.8947378
## 0.8841140
## 0.8934855
## 0.8796558
## 0.8764818
## 0.8719990
## 0.8735289
## 0.8739195
## 0.8734412
## 0.8738539
## 0.8745281
## 0.8753331
## 0.8759031
## 0.8763275
## 0.8769341
## 0.8770040
## 0.8771321
## 0.8776571
## 0.8779181
## 0.8779328
## 0.8744934
## 0.8629142
## 0.8590920
## 0.8542072
## 0.8534808
## 0.8521780
## 0.8516822
## 0.8509193
## 0.8504283
## 0.8499112
## 0.8499754
## 0.8498767
## 0.8499288
## 0.8498336
## 0.8497946
## 0.8497829
## 0.8497617
## 0.8497671
## 0.8497727
## 0.8667003
## 0.8585501
## 0.8543481
## 0.8502296
## 0.8477846
## 0.8447928
## 0.8443338
## 0.8440546
## 0.8456569
## 0.8460636
## 0.8444986
## 0.8449172
## 0.8465640
## 0.8459460
## 0.8452996
## 0.8454180
## 0.8455203
## 0.8459147
## 0.8460901
## 0.9231650
## 0.9075976
## 0.9012684
## 0.8999699
## 0.8988860
## 0.8976014
## 0.8971639
## 0.8968914
## 0.8961339
## 0.8959511
## 0.8958689
## 0.8958212
## 0.8957781
## 0.8957706
## 0.8957223
## 0.8957200
## 0.8956821
## 0.8956833
## 0.8956911
## 0.9163567
## 0.9121291
## 0.9033163
## 0.8960191
## 0.8936307
## 0.8909682
## 0.8915852
## 0.8922805
## 0.8915413
## 0.8907899
## 0.8899397
## 0.8905556
## 0.8906511
## 0.8913763
## 0.8915522
## 0.8921929
## 0.8924841
## 0.8926450
## 0.8929164
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
## 5, shrinkage = 0.1 and n.minobsinnode = 5.
Model_GBM_Pred <- predict(Model_GBM, newdata = data_test_X)
Model_GBM_metrics <- postResample(pred = Model_GBM_Pred, obs = data_test_Y)
Model_GBM_metrics
## RMSE Rsquared MAE
## 1.0896377 0.6769028 0.8801046
Cubist
set.seed(200)
Model_Cubist <- train(x = data_train_X,
                      y = data_train_Y,
                      method = "cubist",
                      trControl = trainControl(method = 'cv'))
Model_Cubist
## Cubist
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 117, 119, 119, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.411332 0.5004927 1.1268350
## 1 5 1.180666 0.6664762 0.9322419
## 1 9 1.286096 0.5924816 1.0076844
## 10 0 1.135672 0.6351204 0.9539481
## 10 5 1.011371 0.7171035 0.8451593
## 10 9 1.065767 0.6811225 0.8876194
## 20 0 1.169131 0.6014467 0.9710317
## 20 5 1.045657 0.6882689 0.8606767
## 20 9 1.100499 0.6493282 0.9044529
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 10 and neighbors = 5.
Model_Cubist_Pred <- predict(Model_Cubist, newdata = data_test_X)
Model_Cubist_metrics <- postResample(pred = Model_Cubist_Pred, obs = data_test_Y)
Model_Cubist_metrics
## RMSE Rsquared MAE
## 0.8879711 0.7813045 0.6587988
Model Comparison
The best model selected by both RMSE & R2 is Cubist.
rbind(Model_Tree_metrics,
      Model_RF_metrics,
      Model_GBM_metrics,
      Model_Cubist_metrics) %>%
  data.frame() %>%
  arrange(RMSE)
(b)
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
Answer: 1. The most important variable in the Cubist model is ManufacturingProcess17.
2. The list is dominated by the process variables (8 of the top 10 are process variables).
3. For both the optimal linear and nonlinear models from the previous homeworks, the manufacturing process variables also dominate the importance list; the linear, nonlinear, and tree-based models all show a similar pattern.
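The ranking below was presumably produced with caret's varImp on the tuned Cubist model:
varImp(Model_Cubist)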
## cubist variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess17 100.000
## ManufacturingProcess32 97.959
## ManufacturingProcess01 62.245
## ManufacturingProcess39 41.837
## BiologicalMaterial12 40.816
## ManufacturingProcess09 39.796
## BiologicalMaterial06 33.673
## ManufacturingProcess33 31.633
## ManufacturingProcess29 30.612
## ManufacturingProcess37 25.510
## ManufacturingProcess04 22.449
## ManufacturingProcess27 20.408
## BiologicalMaterial02 16.327
## BiologicalMaterial08 14.286
## ManufacturingProcess45 14.286
## ManufacturingProcess13 11.224
## ManufacturingProcess15 10.204
## ManufacturingProcess42 8.163
## ManufacturingProcess38 6.122
## BiologicalMaterial10 6.122
(c)
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
Answer: Yes. The view provides the additional insight that most of the splits in the tree are made on manufacturing process variables, which is consistent with the process predictors dominating the importance list.
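A sketch of one way to draw this view, converting the tuned rpart model to a party object so the terminal nodes show box plots of the yield (the original plotting code is not shown, so this is an assumption):
library(partykit)
# terminal nodes display the distribution of Yield for the training samples in each leaf
plot(as.party(Model_Tree$finalModel))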