Q 8.1 A- First we fit the random forest and, as expected, the uninformative predictors were not given significant weight in the model. V4 had the highest importance value, and while one of the uninformative predictors (V7) did have a positive score of about 0.0086, this was still orders of magnitude smaller than the informative predictors, showing it is not significant.
#A
set.seed(250)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(
  y ~ .,
  data = simulated,
  importance = TRUE,
  ntree = 1000
)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 7.11857750
## V2 8.38027260
## V3 0.92975796
## V4 13.57435414
## V5 1.40444789
## V6 -0.09009484
## V7 0.00858833
## V8 -0.13804201
## V9 -0.10573801
## V10 -0.15306925
#B As expected, the duplicate was highly correlated with V1 (about 0.94) and picked up similar importance in the random forest. Adding the duplicate decreased the importance score of V1. This makes sense: because the two variables are so similar, they now share predictive value in the model.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9354848
model2 <- randomForest(
  y ~ .,
  data = simulated,
  importance = TRUE,
  ntree = 1000
)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.19432776
## V2 8.50341178
## V3 0.82764300
## V4 12.43702218
## V5 1.01911956
## V6 -0.06259129
## V7 -0.04025951
## V8 -0.15164836
## V9 -0.07489225
## V10 -0.12376839
## duplicate1 3.28452905
#C The conditional inference forest produced a similar result to the traditional random forest. The informative predictors V1 to V5 received the largest importance scores, especially V4, V2, and V1. The uninformative predictors V6 to V10 had importance values near zero or negative, so they were not meaningfully used in the model. Compared with the traditional random forest, the conditional inference forest produced smaller importance values overall, but the relative spacing between predictors was similar. The rankings were mostly unchanged, which suggests that both methods identified the same general signal; a side-by-side table after the conditional importance output below makes the comparison concrete.
cforest_model <- cforest(
  y ~ .,
  data = simulated,
  controls = cforest_unbiased(ntree = 1000, mtry = 3)
)
traditional_imp <- varimp(cforest_model, conditional = FALSE)
traditional_imp
## V1 V2 V3 V4 V5 V6
## 4.367238708 7.456416260 0.237840033 10.512089680 1.023676070 0.015234297
## V7 V8 V9 V10 duplicate1
## -0.065968523 -0.041008522 0.005394484 -0.091130425 2.706393320
conditional_imp <- varimp(cforest_model, conditional = TRUE)
conditional_imp
## V1 V2 V3 V4 V5 V6
## 2.356158991 4.995099702 0.170594093 6.225531720 0.623869873 -0.010172844
## V7 V8 V9 V10 duplicate1
## -0.006975007 -0.011198308 -0.038897243 -0.020477433 1.060613376
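To make the comparison concrete, here is a small sketch (imp_compare is a name introduced here; it assumes rfImp2, traditional_imp, and conditional_imp are still in memory) that lines the three importance measures up in one table:
# Align the three importance vectors by predictor name
vars <- rownames(rfImp2)  # rfImp2 includes duplicate1, matching the cforest vectors
imp_compare <- data.frame(
  rf_permutation   = rfImp2[vars, "Overall"],
  cf_unconditional = traditional_imp[vars],
  cf_conditional   = conditional_imp[vars],
  row.names = vars
)
imp_compare[order(-imp_compare$rf_permutation), ]  # sorted by RF importance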
#D The boosted tree and Cubist models showed the same general pattern: the informative predictors V1 to V5 received the largest importance scores, while the uninformative predictors V6 to V10 had much smaller scores. The boosted tree model placed the most importance on V4, followed by V2 and V1. Cubist ranked the variables differently, with V1, V2, and V4 receiving the largest importance values. Both models separated the informative predictors from the noise predictors, but Cubist ordered them differently than the random forest and our boosted tree model. The same overall pattern occurs across the different tree-based models: each identifies the true predictors, but the ranking and magnitude of the importance scores vary by model. The magnitudes are not comparable at all here, since the boosted tree scores are in the thousands while the Cubist scores are at most in the tens. For this reason, variable importance scores are useful, but the values should be cross-checked against other models rather than read in absolute terms; a rank comparison across all four models follows the Cubist importance output below.
#Boosted Trees
ctrl <- trainControl(method = "cv", number = 10)
gbm_fit <- train(
  y ~ .,
  data = simulated,
  method = "gbm",
  trControl = ctrl,
  verbose = FALSE,
  tuneLength = 5
)
gbm_imp <- varImp(gbm_fit, scale = FALSE)
gbm_imp
## gbm variable importance
##
## Overall
## V4 6852.75
## V2 4142.87
## V1 3052.19
## V5 1292.70
## V3 1215.50
## duplicate1 900.85
## V6 312.60
## V7 226.86
## V10 170.70
## V9 121.60
## V8 95.03
plot(gbm_imp, top = 15)
#Cubist
cubist_fit <- train(
  y ~ .,
  data = simulated,
  method = "cubist",
  trControl = ctrl,
  tuneLength = 5
)
cubist_imp <- varImp(cubist_fit, scale = FALSE)
cubist_imp
## cubist variable importance
##
## Overall
## V1 62.0
## V2 60.0
## V4 58.0
## V5 40.0
## V3 33.5
## duplicate1 8.0
## V10 3.5
## V6 3.0
## V7 1.5
## V9 0.0
## V8 0.0
plot(cubist_imp, top = 15)
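To back up the ranking claims, the sketch below (rank_of is a helper introduced here; it assumes the fitted objects above are still in memory) converts each model's importance scores to within-model ranks so the orderings can be compared directly:
# Rank predictors within each model (1 = most important)
rank_of <- function(v) rank(-v, ties.method = "min")
vars <- rownames(rfImp2)
rank_compare <- data.frame(
  rf      = rank_of(rfImp2[vars, "Overall"]),
  cforest = rank_of(traditional_imp[vars]),
  gbm     = rank_of(gbm_imp$importance[vars, "Overall"]),
  cubist  = rank_of(cubist_imp$importance[vars, "Overall"]),
  row.names = vars
)
rank_compare[order(rank_compare$rf), ]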
8.2 In order to compare tree bias across predictors of different granularities, I simulated a random response variable and several random predictors with different numbers of possible split points. If granularity bias exists, predictors with more possible split points have more opportunities to split and should therefore appear as stronger predictors even when no predictor carries real signal. The predictors created included a binary variable, a 5-level categorical variable, a 20-level categorical variable, and a continuous variable. The response was generated randomly, so none of the predictors should have real predictive value. The fitted tree nevertheless assigned different importance scores to the predictors: the continuous variable had the largest importance score, followed by the 20-level and then the 5-level categorical variables. The binary predictor had the lowest importance score, which is expected since it has the fewest possible split points. This demonstrates the tree's selection bias, because the variables with the most possible split points were more likely to be selected even though none of the predictors were truly related to the response. These results show that tree-based models can favor predictors with higher granularity: a continuous variable or a categorical variable with many levels simply has more opportunities to create splits. A repeated version of this simulation, tallying the winner across many draws, appears after the importance output below.
n <- 1000
tree_bias_sim <- data.frame(
  y = rnorm(n),
  x_binary = factor(sample(c("A", "B"), n, replace = TRUE)),
  x_5level = factor(sample(paste0("L", 1:5), n, replace = TRUE)),
  x_20level = factor(sample(paste0("L", 1:20), n, replace = TRUE)),
  x_continuous = runif(n)
)
tree_model <- rpart(
  y ~ .,
  data = tree_bias_sim,
  method = "anova",
  control = rpart.control(cp = 0.001)
)
tree_model
## n= 1000
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 1000 964.8026000 0.010391590
## 2) x_20level=L1,L15,L17,L18,L19,L2,L4,L9 394 402.5564000 -0.103042300
## 4) x_5level=L1,L2,L3,L5 321 309.2136000 -0.140363700
## 8) x_continuous>=0.05378612 303 293.3037000 -0.162180300
## 16) x_continuous< 0.2172965 46 33.5073900 -0.372743300
## 32) x_20level=L1,L15,L17,L19 23 13.7557800 -0.619349600
## 64) x_binary=B 10 5.4351040 -0.965756400 *
## 65) x_binary=A 13 6.1976390 -0.352882800 *
## 33) x_20level=L18,L2,L4,L9 23 16.9541300 -0.126137000 *
## 17) x_continuous>=0.2172965 257 257.3918000 -0.124491900
## 34) x_continuous>=0.9346207 23 18.2606100 -0.456300600
## 68) x_20level=L17,L18,L4 7 0.7768168 -1.026604000 *
## 69) x_20level=L1,L15,L19,L9 16 14.2110100 -0.206793000 *
## 35) x_continuous< 0.9346207 234 236.3500000 -0.091878270
## 70) x_continuous< 0.8933792 225 229.3959000 -0.108775500
## 140) x_continuous>=0.8684971 18 13.7522300 -0.467081200 *
## 141) x_continuous< 0.8684971 207 213.1318000 -0.077618430
## 282) x_continuous< 0.8543512 200 208.2798000 -0.100446800
## 564) x_continuous>=0.8248875 10 11.3065800 -0.741660900 *
## 565) x_continuous< 0.8248875 190 192.6452000 -0.066698640
## 1130) x_continuous< 0.7967408 181 176.0120000 -0.111810800
## 2260) x_continuous>=0.5131531 83 90.9795200 -0.213913300
## 4520) x_20level=L17,L18,L19,L2,L4 51 48.0022200 -0.446194000
## 9040) x_5level=L2,L3,L5 42 35.7118100 -0.544781600
## 18080) x_continuous>=0.5532513 32 26.7962400 -0.635846100
## 36160) x_continuous< 0.6652611 15 7.3100900 -0.913237300 *
## 36161) x_continuous>=0.6652611 17 17.3135600 -0.391089200 *
## 18081) x_continuous< 0.5532513 10 7.8010310 -0.253375200 *
## 9041) x_5level=L1 9 9.9771620 0.013881750 *
## 4521) x_20level=L1,L15,L9 32 35.8401500 0.156284200
## 9042) x_continuous< 0.6617579 13 10.7261800 -0.174391200 *
## 9043) x_continuous>=0.6617579 19 22.7198600 0.382535700 *
## 2261) x_continuous< 0.5131531 98 83.4343400 -0.025336300
## 4522) x_continuous< 0.4723028 83 64.1946200 -0.146907600
## 9044) x_continuous>=0.4179507 27 18.0710300 -0.456322100
## 18088) x_20level=L15,L19,L4,L9 14 7.5574400 -0.822941300 *
## 18089) x_20level=L1,L17,L18,L2 13 6.6053680 -0.061501530 *
## 9045) x_continuous< 0.4179507 56 42.2923900 0.002274361
## 18090) x_20level=L1,L15,L17,L18,L2,L4 43 27.3827300 -0.137489200
## 36180) x_5level=L1,L2,L3 29 19.1369300 -0.331689300
## 72360) x_continuous< 0.302812 15 7.2469680 -0.609061400 *
## 72361) x_continuous>=0.302812 14 9.4994750 -0.034504840 *
## 36181) x_5level=L5 14 4.8865890 0.264782400 *
## 18091) x_20level=L19,L9 13 11.2913900 0.464569200 *
## 4523) x_continuous>=0.4723028 15 11.2252400 0.647358400 *
## 1131) x_continuous>=0.7967408 9 8.8569190 0.840557400 *
## 283) x_continuous>=0.8543512 7 1.7699310 0.574619200 *
## 71) x_continuous>=0.8933792 9 5.2838640 0.330551200 *
## 9) x_continuous< 0.05378612 18 13.3380500 0.226882700 *
## 5) x_5level=L4 73 90.9295500 0.061069190
## 10) x_20level=L1,L2 14 7.9741450 -0.421406400 *
## 11) x_20level=L15,L17,L18,L19,L4,L9 59 78.9231300 0.175554900
## 22) x_continuous>=0.4878807 30 46.4313300 -0.024174390
## 44) x_20level=L17,L18,L19 14 25.8407900 -0.332190700 *
## 45) x_20level=L15,L4,L9 16 18.1000900 0.245339900 *
## 23) x_continuous< 0.4878807 29 30.0570300 0.382171500
## 46) x_20level=L15,L9 10 9.7338560 -0.160765100 *
## 47) x_20level=L17,L18,L19,L4 19 15.8239000 0.667927600 *
## 3) x_20level=L10,L11,L12,L13,L14,L16,L20,L3,L5,L6,L7,L8 606 553.8804000 0.084142370
## 6) x_5level=L2,L3,L4 337 297.4489000 -0.024645470
## 12) x_20level=L10,L13,L20,L3,L6,L8 182 159.7282000 -0.121940900
## 24) x_continuous< 0.9621274 174 155.3224000 -0.148424700
## 48) x_continuous>=0.2135687 139 111.4317000 -0.214172800
## 96) x_5level=L2 56 48.6139800 -0.333711900
## 192) x_20level=L10,L6 19 10.0833600 -0.799196900 *
## 193) x_20level=L13,L20,L3,L8 37 32.2997100 -0.094679050
## 386) x_continuous< 0.5302982 18 16.7556300 -0.334470200 *
## 387) x_continuous>=0.5302982 19 13.5285600 0.132491500 *
## 97) x_5level=L3,L4 83 61.4776300 -0.133519800
## 194) x_20level=L13,L3,L8 41 32.7481000 -0.345889600
## 388) x_continuous< 0.7882385 31 23.6747500 -0.430678500
## 776) x_continuous>=0.6566202 7 3.5319050 -1.071911000 *
## 777) x_continuous< 0.6566202 24 16.4251000 -0.243652200
## 1554) x_continuous< 0.397126 11 6.1756230 -0.561323300 *
## 1555) x_continuous>=0.397126 13 8.2001270 0.025146340 *
## 389) x_continuous>=0.7882385 10 8.1596040 -0.083044250 *
## 195) x_20level=L10,L20,L6 42 25.0752800 0.073793520
## 390) x_binary=A 19 8.8470600 -0.205318400 *
## 391) x_binary=B 23 13.5253100 0.304364300 *
## 49) x_continuous< 0.2135687 35 40.9035100 0.112689300
## 98) x_5level=L3,L4 24 28.4115100 -0.220097700
## 196) x_20level=L20,L3,L8 7 6.0804890 -1.312469000 *
## 197) x_20level=L10,L13,L6 17 10.5386600 0.229702300 *
## 99) x_5level=L2 11 4.0349470 0.838769800 *
## 25) x_continuous>=0.9621274 8 1.6292840 0.454080600 *
## 13) x_20level=L11,L12,L14,L16,L5,L7 155 133.9748000 0.089598200
## 26) x_continuous>=0.4840581 86 71.4895600 -0.020038150
## 52) x_20level=L12,L5,L7 46 32.9274600 -0.245308900
## 104) x_continuous< 0.5689724 9 7.8150500 -0.901700000 *
## 105) x_continuous>=0.5689724 37 20.2915500 -0.085646250
## 210) x_binary=B 16 7.9623210 -0.268590800 *
## 211) x_binary=A 21 11.3857300 0.053740090
## 422) x_continuous< 0.7812383 13 5.2512020 -0.118265400 *
## 423) x_continuous>=0.7812383 8 5.1249130 0.333249000 *
## 53) x_20level=L11,L14,L16 40 33.5432300 0.239023300
## 106) x_continuous>=0.644022 32 26.9267500 0.094422940
## 212) x_continuous< 0.8900425 20 15.7208500 -0.173422100 *
## 213) x_continuous>=0.8900425 12 7.3797190 0.540831400 *
## 107) x_continuous< 0.644022 8 3.2709930 0.817424600 *
## 27) x_continuous< 0.4840581 69 60.1631400 0.226246400
## 54) x_continuous< 0.3495906 46 36.0462400 -0.028023140
## 108) x_continuous>=0.1027615 37 31.0931600 -0.130620500
## 216) x_20level=L16,L5,L7 22 18.8638800 -0.296505200
## 432) x_5level=L4 7 9.1976110 -0.723475400 *
## 433) x_5level=L2,L3 15 7.7946220 -0.097252360 *
## 217) x_20level=L11,L12,L14 15 10.7359800 0.112677000 *
## 109) x_continuous< 0.1027615 9 2.9624550 0.393765900 *
## 55) x_continuous>=0.3495906 23 15.1947800 0.734785500
## 110) x_20level=L11,L12,L14,L16 15 7.3175030 0.472744500 *
## 111) x_20level=L5,L7 8 4.9160810 1.226112000 *
## 7) x_5level=L1,L5 269 247.4467000 0.220430500
## 14) x_continuous< 0.03435689 9 2.2294210 -0.572351400 *
## 15) x_continuous>=0.03435689 260 239.3649000 0.247872900
## 30) x_20level=L11,L12,L7 60 64.7436400 -0.038419530
## 60) x_5level=L1 33 33.8951900 -0.238181300
## 120) x_continuous< 0.7865633 25 28.3834300 -0.432181000
## 240) x_20level=L12,L7 16 15.6382900 -0.653000200 *
## 241) x_20level=L11 9 10.5779800 -0.039613620 *
## 121) x_continuous>=0.7865633 8 1.6305630 0.368067800 *
## 61) x_5level=L5 27 27.9221000 0.205733800
## 122) x_continuous>=0.4425729 19 15.1314900 -0.114936600 *
## 123) x_continuous< 0.4425729 8 6.1966650 0.967326000 *
## 31) x_20level=L10,L13,L14,L16,L20,L3,L5,L6,L8 200 168.2281000 0.333760700
## 62) x_continuous< 0.1426957 22 18.1401400 0.107892600
## 124) x_binary=A 14 9.4158880 -0.234131000 *
## 125) x_binary=B 8 4.2205160 0.706433800 *
## 63) x_continuous>=0.1426957 178 148.8269000 0.361677000
## 126) x_binary=B 88 65.0022800 0.244513500
## 252) x_20level=L10,L14,L16,L20 40 31.6352800 0.016187800
## 504) x_5level=L1 20 14.0783800 -0.284226900
## 1008) x_20level=L14,L16 8 5.0362460 -0.751947400 *
## 1009) x_20level=L10,L20 12 6.1253000 0.027586670 *
## 505) x_5level=L5 20 13.9469400 0.316602500
## 1010) x_20level=L10,L20 9 7.4697510 0.030234970 *
## 1011) x_20level=L14,L16 11 5.1352680 0.550903300 *
## 253) x_20level=L13,L3,L5,L6,L8 48 29.5439400 0.434785000
## 506) x_continuous< 0.2818695 8 5.2436800 -0.062367340 *
## 507) x_continuous>=0.2818695 40 21.9275200 0.534215400
## 1014) x_20level=L13,L3,L5,L8 33 17.7086500 0.423206000 *
## 1015) x_20level=L6 7 1.8950860 1.057546000 *
## 127) x_binary=A 90 81.4354900 0.476236800
## 254) x_20level=L6 10 9.0011960 0.011433210 *
## 255) x_20level=L10,L13,L14,L16,L20,L3,L5,L8 80 70.0038200 0.534337200
## 510) x_continuous>=0.2258926 71 63.4998800 0.485919200
## 1020) x_20level=L5 8 5.7908440 -0.080636420 *
## 1021) x_20level=L10,L13,L14,L16,L20,L3,L8 63 54.8150800 0.557862800
## 2042) x_continuous>=0.6528626 25 20.4604100 0.367019300
## 4084) x_continuous< 0.857048 12 5.6336960 -0.152276100 *
## 4085) x_continuous>=0.857048 13 8.6036160 0.846368800 *
## 2043) x_continuous< 0.6528626 38 32.8451000 0.683417800
## 4086) x_20level=L10,L13,L14,L3,L8 28 18.4987400 0.526869900
## 8172) x_continuous< 0.5174447 15 8.4743020 0.352140200 *
## 8173) x_continuous>=0.5174447 13 9.0380650 0.728481000 *
## 4087) x_20level=L16,L20 10 11.7388000 1.121752000 *
## 511) x_continuous< 0.2258926 9 5.0244190 0.916301200 *
tree_imp <- varImp(tree_model, scale = FALSE)
tree_imp
## Overall
## x_20level 3.934052
## x_5level 1.653124
## x_binary 1.126148
## x_continuous 4.122719
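Because a single draw could be a fluke, the sketch below (winners is a name introduced here; it reuses n and the rpart/caret setup above) repeats the null simulation 25 times and tallies which predictor receives the top importance score in each fit. If the granularity bias is systematic, the continuous and 20-level predictors should win most often:
# Repeat the null simulation to show the bias is systematic
set.seed(1)
winners <- replicate(25, {
  sim <- data.frame(
    y = rnorm(n),
    x_binary = factor(sample(c("A", "B"), n, replace = TRUE)),
    x_5level = factor(sample(paste0("L", 1:5), n, replace = TRUE)),
    x_20level = factor(sample(paste0("L", 1:20), n, replace = TRUE)),
    x_continuous = runif(n)
  )
  fit <- rpart(y ~ ., data = sim, method = "anova",
               control = rpart.control(cp = 0.001))
  imp <- varImp(fit, scale = FALSE)
  rownames(imp)[which.max(imp$Overall)]
})
table(winners)  # high-granularity predictors should win most often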
8.3 #A The model on the right concentrates its importance on only a few predictors because the bagging fraction and learning rate are both large (0.9). With a bagging fraction of 0.9, each tree is trained on most of the data, so the same strong predictors are available to every tree. With a learning rate of 0.9, each tree also has a large impact on the final model, so the model locks onto the strongest predictors early and gives them very large importance scores. The model on the left uses a smaller bagging fraction and learning rate, so each tree has less influence and is trained on a smaller random subset of the data, which spreads the importance across a larger number of predictors. There is value in both settings, and tuning these parameters is the most reliable way to find the right balance. The sketch below illustrates the contrast.
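This is only a rough illustration, since the exercise's original data set is not loaded here: the sketch refits the two settings on the simulated Friedman data from 8.1, so the exact numbers will differ from the figure, but the concentration pattern should not. The objects friedman, gbm_left, and gbm_right are names introduced here, and n.minobsinnode is lowered to 5 because a 0.1 bagging fraction of 200 observations leaves only 20 rows per tree.
library(gbm)
set.seed(100)
friedman <- simulated[, 1:11]  # V1-V10 and y, without the duplicate column
# Left panel: small learning rate and bagging fraction
gbm_left <- gbm(y ~ ., data = friedman, distribution = "gaussian",
                n.trees = 1000, interaction.depth = 3,
                shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5)
# Right panel: both parameters near 1
gbm_right <- gbm(y ~ ., data = friedman, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 3,
                 shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5)
summary(gbm_left, plotit = FALSE)   # influence spread over more predictors
summary(gbm_right, plotit = FALSE)  # influence concentrated in a few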
#B I would expect the model on the left to be more predictive on future samples. The smaller learning rate makes the model less likely to overfit the training data, and the smaller bagging fraction adds randomness that further reduces overfitting, which should generalize better to new samples. The model on the right gives too much weight to early trees and concentrates importance on only a few variables, so it is more likely to be overfit and to perform worse on new data.
#C Increasing the interaction depth makes each tree more complex and allows it to model relationships among several predictors at once. With shallow trees, the boosting model mostly captures main effects, which concentrates importance on the strongest individual predictors. Deeper trees can include multiple splits per tree and capture interactions among variables, which tends to spread importance across more predictors. However, deeper trees also increase model complexity and the risk of overfitting, which can be exaggerated further by a high learning rate. The sketch below checks the spread effect on the simulated data.
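As a rough check on the #C claim, this sketch (reusing friedman from the previous sketch; n_for_90 is a name introduced here) fits the same boosted model at two interaction depths and counts how many predictors are needed to cover roughly 90% of the relative influence. Whether the spread actually widens is data-dependent, so treat this as illustrative rather than conclusive:
set.seed(100)
depths <- c(1, 7)
n_for_90 <- sapply(depths, function(d) {
  fit <- gbm(y ~ ., data = friedman, distribution = "gaussian",
             n.trees = 1000, interaction.depth = d,
             shrinkage = 0.1, bag.fraction = 0.5)
  ri <- sort(summary(fit, plotit = FALSE)$rel.inf, decreasing = TRUE)
  sum(cumsum(ri) < 90) + 1  # predictors covering ~90% of relative influence
})
data.frame(interaction.depth = depths, predictors_for_90pct = n_for_90)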
8.7 #A
data(ChemicalManufacturingProcess)
set.seed(62462)
chem <- ChemicalManufacturingProcess
x <- chem[, names(chem) != "Yield"]
y <- chem$Yield
training_index <- createDataPartition(y, p = 0.80, list = FALSE)
train_x <- x[training_index, ]
test_x <- x[-training_index, ]
train_y <- y[training_index]
test_y <- y[-training_index]
chem_preprocess <- preProcess(
  train_x,
  method = c("medianImpute", "center", "scale")
)
train_x_processed <- predict(chem_preprocess, train_x)
test_x_processed <- predict(chem_preprocess, test_x)
train_data <- data.frame(train_x_processed, Yield = train_y)
test_data <- data.frame(test_x_processed, Yield = test_y)
ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5
)
#Single regression tree
rpart_fit <- train(
  Yield ~ .,
  data = train_data,
  method = "rpart",
  trControl = ctrl,
  tuneLength = 10,
  metric = "RMSE"
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
rpart_fit
## CART
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 129, 130, 129, 129, 131, 129, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01166171 1.391212 0.4916981 1.073797
## 0.01764571 1.404991 0.4815042 1.082046
## 0.02409976 1.426726 0.4696542 1.101242
## 0.02823040 1.434066 0.4669813 1.108598
## 0.02948997 1.439802 0.4649547 1.111288
## 0.04539905 1.447049 0.4554312 1.129731
## 0.04980354 1.453354 0.4437482 1.142253
## 0.06551772 1.437705 0.4434301 1.137715
## 0.10724566 1.463395 0.4154622 1.157379
## 0.38447502 1.745086 0.2369196 1.407932
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01166171.
#Random Forest
rf_fit <- train(
  Yield ~ .,
  data = train_data,
  method = "rf",
  trControl = ctrl,
  tuneLength = 5,
  metric = "RMSE",
  importance = TRUE
)
rf_fit
## Random Forest
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 131, 130, 130, 131, 129, 129, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.201648 0.6623120 0.9642171
## 15 1.073680 0.6985497 0.8450098
## 29 1.061131 0.6989627 0.8316197
## 43 1.061003 0.6922837 0.8287020
## 57 1.069024 0.6847127 0.8323899
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 43.
#Boosted Trees
gbm_fit <- train(
  Yield ~ .,
  data = train_data,
  method = "gbm",
  trControl = ctrl,
  tuneLength = 5,
  metric = "RMSE",
  verbose = FALSE
)
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
gbm_fit
## Stochastic Gradient Boosting
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 129, 131, 128, 131, 130, 130, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 1.173778 0.6261683 0.9136573
## 1 100 1.157318 0.6338251 0.9077326
## 1 150 1.152453 0.6399458 0.9044024
## 1 200 1.158590 0.6339452 0.9092358
## 1 250 1.149980 0.6405769 0.9057639
## 2 50 1.158012 0.6307061 0.8998786
## 2 100 1.128549 0.6463140 0.8799975
## 2 150 1.117425 0.6535445 0.8780331
## 2 200 1.107756 0.6604819 0.8736083
## 2 250 1.101950 0.6665681 0.8714308
## 3 50 1.126964 0.6544021 0.8804137
## 3 100 1.089063 0.6749766 0.8506750
## 3 150 1.075617 0.6839268 0.8444513
## 3 200 1.071206 0.6868750 0.8418709
## 3 250 1.066179 0.6905517 0.8392647
## 4 50 1.143261 0.6379582 0.8917140
## 4 100 1.115667 0.6572577 0.8758878
## 4 150 1.097671 0.6689795 0.8623324
## 4 200 1.091684 0.6734568 0.8580749
## 4 250 1.086844 0.6770774 0.8548799
## 5 50 1.131507 0.6490796 0.8869771
## 5 100 1.100319 0.6677153 0.8640611
## 5 150 1.080082 0.6800792 0.8508346
## 5 200 1.070822 0.6860334 0.8446090
## 5 250 1.065587 0.6889975 0.8418312
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 250, interaction.depth =
## 5, shrinkage = 0.1 and n.minobsinnode = 10.
#Cubist model
cubist_fit <- train(
  Yield ~ .,
  data = train_data,
  method = "cubist",
  trControl = ctrl,
  tuneLength = 5,
  metric = "RMSE"
)
cubist_fit
## Cubist
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 130, 131, 129, 130, 131, 129, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.4325830 0.5632584 1.0307735
## 1 5 1.3100360 0.6379680 0.9153750
## 1 9 1.3406924 0.6128971 0.9406832
## 10 0 1.1149835 0.6508739 0.8767342
## 10 5 1.0056971 0.7178674 0.7906116
## 10 9 1.0500193 0.6860021 0.8140653
## 20 0 1.0764719 0.6742535 0.8485175
## 20 5 0.9740984 0.7359200 0.7612285
## 20 9 1.0175441 0.7045330 0.7840011
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
tree_results <- resamples(
  list(
    Single_Tree = rpart_fit,
    Random_Forest = rf_fit,
    Boosted_Trees = gbm_fit,
    Cubist = cubist_fit
  )
)
summary(tree_results)
##
## Call:
## summary.resamples(object = tree_results)
##
## Models: Single_Tree, Random_Forest, Boosted_Trees, Cubist
## Number of resamples: 50
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Single_Tree 0.6079992 0.9147214 1.0707895 1.0737967 1.2287255 1.796440 0
## Random_Forest 0.4965885 0.7054470 0.8209578 0.8287020 0.9159793 1.257064 0
## Boosted_Trees 0.5264021 0.7087707 0.8365575 0.8418312 0.9178593 1.507299 0
## Cubist 0.5231806 0.6651997 0.7318827 0.7612285 0.8263787 1.275589 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Single_Tree 0.7748994 1.1669465 1.4232353 1.3912118 1.533369 2.174696 0
## Random_Forest 0.5623351 0.8772929 1.0534444 1.0610033 1.243026 1.708418 0
## Boosted_Trees 0.6167073 0.9077503 1.0597840 1.0655873 1.168317 1.793513 0
## Cubist 0.5882040 0.8259086 0.9382308 0.9740984 1.062399 1.928599 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Single_Tree 0.08464544 0.4080889 0.5169410 0.4916981 0.5681905 0.8445140 0
## Random_Forest 0.41697995 0.5837081 0.7338248 0.6922837 0.7907269 0.9200819 0
## Boosted_Trees 0.26555430 0.6146996 0.6909801 0.6889975 0.7836062 0.9061588 0
## Cubist 0.39100517 0.6821785 0.7653913 0.7359200 0.8285277 0.9150012 0
bwplot(tree_results, metric = "RMSE")
dotplot(tree_results, metric = "RMSE")
# Generate predictions
rpart_pred <- predict(rpart_fit, newdata = test_data)
rf_pred <- predict(rf_fit, newdata = test_data)
gbm_pred <- predict(gbm_fit, newdata = test_data)
cubist_pred <- predict(cubist_fit, newdata = test_data)
# Calculate performance using caret::postResample
rpart_perf <- postResample(pred = rpart_pred, obs = test_data$Yield)
rf_perf <- postResample(pred = rf_pred, obs = test_data$Yield)
gbm_perf <- postResample(pred = gbm_pred, obs = test_data$Yield)
cubist_perf <- postResample(pred = cubist_pred, obs = test_data$Yield)
# Combine into one table
tree_test_results <- data.frame(
  Model = c("Single Tree", "Random Forest", "Boosted Trees", "Cubist"),
  RMSE = c(rpart_perf["RMSE"], rf_perf["RMSE"], gbm_perf["RMSE"], cubist_perf["RMSE"]),
  Rsquared = c(rpart_perf["Rsquared"], rf_perf["Rsquared"], gbm_perf["Rsquared"], cubist_perf["Rsquared"]),
  MAE = c(rpart_perf["MAE"], rf_perf["MAE"], gbm_perf["MAE"], cubist_perf["MAE"])
)
tree_test_results <- tree_test_results |>
  arrange(RMSE)
tree_test_results
## Model RMSE Rsquared MAE
## 1 Cubist 0.951341 0.7377865 0.7644967
## 2 Boosted Trees 1.103514 0.6431468 0.8684359
## 3 Random Forest 1.263732 0.5166297 0.9378793
## 4 Single Tree 1.623386 0.2968931 1.2120498
Above we can see that Cubist was our best performing model: it has the lowest RMSE and MAE as well as the highest R^2 value (RMSE 0.951, R^2 0.738, MAE 0.764). The boosted tree model was the next best, followed by random forest, and the single regression tree performed the worst. The resampling results agreed, with Cubist also having the lowest cross-validated RMSE. This suggests that Cubist was the optimal tree-based regression model for predicting yield in the chemical manufacturing data. The strong performance of Cubist may be due to its ability to combine tree-based rules with linear regression models, allowing it to capture both nonlinear structure and local linear relationships in the data.
#B
best_tree_model <- cubist_fit
best_tree_imp <- varImp(best_tree_model, scale = FALSE)
plot(best_tree_imp, top = 20)
best_tree_imp_table <- best_tree_imp$importance |>
  as.data.frame() |>
  tibble::rownames_to_column("Variable") |>
  arrange(desc(Overall))
top10_tree <- best_tree_imp_table |>
  slice(1:10)
top10_tree
## Variable Overall
## 1 ManufacturingProcess32 50.0
## 2 ManufacturingProcess09 26.0
## 3 BiologicalMaterial03 23.0
## 4 ManufacturingProcess17 17.0
## 5 ManufacturingProcess33 16.5
## 6 ManufacturingProcess29 14.0
## 7 BiologicalMaterial06 13.0
## 8 ManufacturingProcess25 13.0
## 9 ManufacturingProcess28 12.5
## 10 ManufacturingProcess13 11.0
In the optimal model, Cubist, the most important predictor was ManufacturingProcess32 with an importance score of 50. ManufacturingProcess09, BiologicalMaterial03, ManufacturingProcess17, and ManufacturingProcess33 were also highly ranked predictors. The top-10 list was dominated by manufacturing process variables: 8 of the top 10 predictors were process variables and only two were biological material variables. This suggests that the manufacturing process measurements were more influential than the biological material measurements for predicting yield, although the biological predictors were still relevant since BiologicalMaterial03 and BiologicalMaterial06 both appeared in the top 10.
Compared with the optimal linear and nonlinear models from the previous exercises, there was noticeable overlap. Predictors such as ManufacturingProcess32, ManufacturingProcess17, ManufacturingProcess13, BiologicalMaterial03, and BiologicalMaterial06 appeared among the most important predictors in each model. The exact ranking changed depending on the model, and our previous best model had 6 process variables and 4 biological variables in its top 10, but process variables were even more dominant in our best model here.
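As a quick overlap check within this exercise, the sketch below (top10_of is a helper introduced here; it assumes rf_fit and gbm_fit from part A are still in memory) intersects the Cubist top 10 with each of the other tree-based models' top 10s:
# Top-10 predictors of a caret fit, by unscaled importance
top10_of <- function(fit) {
  imp <- varImp(fit, scale = FALSE)$importance
  rownames(imp)[order(-imp$Overall)][1:10]
}
intersect(top10_tree$Variable, top10_of(rf_fit))   # shared with random forest
intersect(top10_tree$Variable, top10_of(gbm_fit))  # shared with boosted trees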
#C
rpart.plot(
  rpart_fit$finalModel,
  type = 2,
  extra = 101,
  fallen.leaves = TRUE,
  main = "Optimal Single Regression Tree for Yield"
)
train_data$terminal_node <- factor(rpart_fit$finalModel$where)
ggplot(train_data, aes(x = terminal_node, y = Yield)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Yield in Terminal Nodes",
    x = "Terminal Node",
    y = "Yield"
  )
terminal_summary <- train_data |>
  group_by(terminal_node) |>
  summarize(
    n = n(),
    mean_yield = mean(Yield),
    median_yield = median(Yield),
    sd_yield = sd(Yield),
    min_yield = min(Yield),
    max_yield = max(Yield),
    .groups = "drop"
  ) |>
  arrange(mean_yield)
terminal_summary
## # A tibble: 10 × 7
## terminal_node n mean_yield median_yield sd_yield min_yield max_yield
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 7 36.9 37.1 0.943 35.2 37.9
## 2 6 16 38.5 38.2 0.751 37.4 40.2
## 3 7 25 39.1 39.1 0.816 37.5 40.7
## 4 9 21 39.4 39.5 0.684 38.2 40.4
## 5 14 18 40.1 39.9 1.23 36.8 42.7
## 6 10 16 41.1 41.1 1.13 38.0 43.1
## 7 17 10 41.2 41.2 0.709 39.8 42.2
## 8 15 8 41.7 41.7 0.917 40.1 43.4
## 9 18 13 42.5 42.4 0.905 40.9 44.4
## 10 19 10 43.2 43.0 1.40 41.2 46.3
The optimal single regression tree provides a useful visual summary of how the predictors separate observations into different yield groups. The first split in the tree was on ManufacturingProcess32, which was also the most important predictor in the Cubist model; as expected, this suggests that ManufacturingProcess32 is strongly associated with yield. The single tree used both manufacturing process variables (ManufacturingProcess32, ManufacturingProcess09, ManufacturingProcess17, ManufacturingProcess30, and ManufacturingProcess24) and biological material variables (BiologicalMaterial11, BiologicalMaterial03, and BiologicalMaterial05). This suggests that both types of predictors help explain yield, but the process variables appear earlier in the tree and so seem more important, as expected.
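As a one-line sanity check on that claim, the first row of an rpart frame is always the root node, so the root split variable can be read off directly:
# Variable used at the root split of the final single tree
as.character(rpart_fit$finalModel$frame$var[1])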
The terminal node boxplot also shows that the tree created groups with clearly different yield distributions: the lowest-yield terminal node had a mean yield of about 36.9 and the highest-yield terminal node had a mean of about 43.2, indicating that the tree separated lower- and higher-yield observations. That said, the single tree should not be viewed as the best predictive model, since Cubist had much better test set performance; based on our results, the single tree is useful for explanation, while Cubist is better for prediction.