Q 8.1 A- First we fit the random forest and, as expected, the uninformative predictors were not given significant weight in the model. V4 had the highest importance value, and while one of the uninformative predictors (V7) did have a positive score of about 0.0086, this was still orders of magnitude smaller than the informative predictors, showing it is not significant.
#A
set.seed(250)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(
  y ~ .,
  data = simulated,
  importance = TRUE,
  ntree = 1000
)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 7.11857750
## V2 8.38027260
## V3 0.92975796
## V4 13.57435414
## V5 1.40444789
## V6 -0.09009484
## V7 0.00858833
## V8 -0.13804201
## V9 -0.10573801
## V10 -0.15306925
#B As expected, the duplicate was highly correlated with V1 (about 0.94) and picked up similar importance in the random forest. Adding the duplicate decreased the importance score of V1. This makes sense: because the two variables are so similar, they now share predictive value in the model.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9354848
model2 <- randomForest(
  y ~ .,
  data = simulated,
  importance = TRUE,
  ntree = 1000
)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.19432776
## V2 8.50341178
## V3 0.82764300
## V4 12.43702218
## V5 1.01911956
## V6 -0.06259129
## V7 -0.04025951
## V8 -0.15164836
## V9 -0.07489225
## V10 -0.12376839
## duplicate1 3.28452905
#C The conditional inference forest produced a similar result to the traditional random forest. The informative predictors V1 to V5 received the largest importance scores, especially V4, V2, and V1. The uninformative predictors V6 to V10 had importance values near zero or negative, so they were not meaningfully used in the model. Compared with the traditional random forest, the conditional inference forest produced smaller importance values overall, but the relative spacing between predictors was similar. The rankings were mostly unchanged, which suggests that both methods identified the same general signal; a side-by-side table after the conditional importance output below makes the comparison concrete.
cforest_model <- cforest(
  y ~ .,
  data = simulated,
  controls = cforest_unbiased(ntree = 1000, mtry = 3)
)
traditional_imp <- varimp(cforest_model, conditional = FALSE)
traditional_imp
## V1 V2 V3 V4 V5 V6
## 4.367238708 7.456416260 0.237840033 10.512089680 1.023676070 0.015234297
## V7 V8 V9 V10 duplicate1
## -0.065968523 -0.041008522 0.005394484 -0.091130425 2.706393320
conditional_imp <- varimp(cforest_model, conditional = TRUE)
conditional_imp
## V1 V2 V3 V4 V5 V6
## 2.356158991 4.995099702 0.170594093 6.225531720 0.623869873 -0.010172844
## V7 V8 V9 V10 duplicate1
## -0.006975007 -0.011198308 -0.038897243 -0.020477433 1.060613376
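To make the comparison concrete, here is a small sketch (imp_compare is a name introduced here; it assumes rfImp2, traditional_imp, and conditional_imp are still in memory) that lines the three importance measures up in one table:
# Align the three importance vectors by predictor name
vars <- rownames(rfImp2)  # rfImp2 includes duplicate1, matching the cforest vectors
imp_compare <- data.frame(
  rf_permutation   = rfImp2[vars, "Overall"],
  cf_unconditional = traditional_imp[vars],
  cf_conditional   = conditional_imp[vars],
  row.names = vars
)
imp_compare[order(-imp_compare$rf_permutation), ]  # sorted by RF importance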
#D The boosted tree and Cubist models showed the same general pattern: the informative predictors V1 to V5 received the largest importance scores, while the uninformative predictors V6 to V10 had much smaller scores. The boosted tree model placed the most importance on V4, followed by V2 and V1. Cubist ranked the variables differently, with V1, V2, and V4 receiving the largest importance values. Both models separated the informative predictors from the noise predictors, but Cubist ordered them differently than the random forest and our boosted tree model. The same overall pattern occurs across the different tree-based models: each identifies the true predictors, but the ranking and magnitude of the importance scores vary by model. The magnitudes are not comparable at all here, since the boosted tree scores are in the thousands while the Cubist scores are at most in the tens. For this reason, variable importance scores are useful, but the values should be cross-checked against other models rather than read in absolute terms; a rank comparison across all four models follows the Cubist importance output below.
#Boosted Trees
ctrl <- trainControl(method = "cv", number = 10)
gbm_fit <- train(
  y ~ .,
  data = simulated,
  method = "gbm",
  trControl = ctrl,
  verbose = FALSE,
  tuneLength = 5
)
gbm_imp <- varImp(gbm_fit, scale = FALSE)
gbm_imp
## gbm variable importance
##
## Overall
## V4 6852.75
## V2 4142.87
## V1 3052.19
## V5 1292.70
## V3 1215.50
## duplicate1 900.85
## V6 312.60
## V7 226.86
## V10 170.70
## V9 121.60
## V8 95.03
plot(gbm_imp, top = 15)
#Cubist
cubist_fit <- train(
  y ~ .,
  data = simulated,
  method = "cubist",
  trControl = ctrl,
  tuneLength = 5
)
cubist_imp <- varImp(cubist_fit, scale = FALSE)
cubist_imp
## cubist variable importance
##
## Overall
## V1 62.0
## V2 60.0
## V4 58.0
## V5 40.0
## V3 33.5
## duplicate1 8.0
## V10 3.5
## V6 3.0
## V7 1.5
## V9 0.0
## V8 0.0
plot(cubist_imp, top = 15)
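To back up the ranking claims, the sketch below (rank_of is a helper introduced here; it assumes the fitted objects above are still in memory) converts each model's importance scores to within-model ranks so the orderings can be compared directly:
# Rank predictors within each model (1 = most important)
rank_of <- function(v) rank(-v, ties.method = "min")
vars <- rownames(rfImp2)
rank_compare <- data.frame(
  rf      = rank_of(rfImp2[vars, "Overall"]),
  cforest = rank_of(traditional_imp[vars]),
  gbm     = rank_of(gbm_imp$importance[vars, "Overall"]),
  cubist  = rank_of(cubist_imp$importance[vars, "Overall"]),
  row.names = vars
)
rank_compare[order(rank_compare$rf), ]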
8.2 In order to compare tree bias across predictors of different granularities, I simulated a random response variable and several random predictors with different numbers of possible split points. If granularity bias exists, predictors with more possible split points have more opportunities to split and should therefore appear as stronger predictors even when no predictor carries real signal. The predictors created included a binary variable, a 5-level categorical variable, a 20-level categorical variable, and a continuous variable. The response was generated randomly, so none of the predictors should have real predictive value. The fitted tree nevertheless assigned different importance scores to the predictors: the continuous variable had the largest importance score, followed by the 20-level and then the 5-level categorical variables. The binary predictor had the lowest importance score, which is expected since it has the fewest possible split points. This demonstrates the tree's selection bias, because the variables with the most possible split points were more likely to be selected even though none of the predictors were truly related to the response. These results show that tree-based models can favor predictors with higher granularity: a continuous variable or a categorical variable with many levels simply has more opportunities to create splits. A repeated version of this simulation, tallying the winner across many draws, appears after the importance output below.
n <- 1000
tree_bias_sim <- data.frame(
  y = rnorm(n),
  x_binary = factor(sample(c("A", "B"), n, replace = TRUE)),
  x_5level = factor(sample(paste0("L", 1:5), n, replace = TRUE)),
  x_20level = factor(sample(paste0("L", 1:20), n, replace = TRUE)),
  x_continuous = runif(n)
)
tree_model <- rpart(
  y ~ .,
  data = tree_bias_sim,
  method = "anova",
  control = rpart.control(cp = 0.001)
)
tree_model
## n= 1000
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 1000 964.8026000 0.010391590
## 2) x_20level=L1,L15,L17,L18,L19,L2,L4,L9 394 402.5564000 -0.103042300
## 4) x_5level=L1,L2,L3,L5 321 309.2136000 -0.140363700
## 8) x_continuous>=0.05378612 303 293.3037000 -0.162180300
## 16) x_continuous< 0.2172965 46 33.5073900 -0.372743300
## 32) x_20level=L1,L15,L17,L19 23 13.7557800 -0.619349600
## 64) x_binary=B 10 5.4351040 -0.965756400 *
## 65) x_binary=A 13 6.1976390 -0.352882800 *
## 33) x_20level=L18,L2,L4,L9 23 16.9541300 -0.126137000 *
## 17) x_continuous>=0.2172965 257 257.3918000 -0.124491900
## 34) x_continuous>=0.9346207 23 18.2606100 -0.456300600
## 68) x_20level=L17,L18,L4 7 0.7768168 -1.026604000 *
## 69) x_20level=L1,L15,L19,L9 16 14.2110100 -0.206793000 *
## 35) x_continuous< 0.9346207 234 236.3500000 -0.091878270
## 70) x_continuous< 0.8933792 225 229.3959000 -0.108775500
## 140) x_continuous>=0.8684971 18 13.7522300 -0.467081200 *
## 141) x_continuous< 0.8684971 207 213.1318000 -0.077618430
## 282) x_continuous< 0.8543512 200 208.2798000 -0.100446800
## 564) x_continuous>=0.8248875 10 11.3065800 -0.741660900 *
## 565) x_continuous< 0.8248875 190 192.6452000 -0.066698640
## 1130) x_continuous< 0.7967408 181 176.0120000 -0.111810800
## 2260) x_continuous>=0.5131531 83 90.9795200 -0.213913300
## 4520) x_20level=L17,L18,L19,L2,L4 51 48.0022200 -0.446194000
## 9040) x_5level=L2,L3,L5 42 35.7118100 -0.544781600
## 18080) x_continuous>=0.5532513 32 26.7962400 -0.635846100
## 36160) x_continuous< 0.6652611 15 7.3100900 -0.913237300 *
## 36161) x_continuous>=0.6652611 17 17.3135600 -0.391089200 *
## 18081) x_continuous< 0.5532513 10 7.8010310 -0.253375200 *
## 9041) x_5level=L1 9 9.9771620 0.013881750 *
## 4521) x_20level=L1,L15,L9 32 35.8401500 0.156284200
## 9042) x_continuous< 0.6617579 13 10.7261800 -0.174391200 *
## 9043) x_continuous>=0.6617579 19 22.7198600 0.382535700 *
## 2261) x_continuous< 0.5131531 98 83.4343400 -0.025336300
## 4522) x_continuous< 0.4723028 83 64.1946200 -0.146907600
## 9044) x_continuous>=0.4179507 27 18.0710300 -0.456322100
## 18088) x_20level=L15,L19,L4,L9 14 7.5574400 -0.822941300 *
## 18089) x_20level=L1,L17,L18,L2 13 6.6053680 -0.061501530 *
## 9045) x_continuous< 0.4179507 56 42.2923900 0.002274361
## 18090) x_20level=L1,L15,L17,L18,L2,L4 43 27.3827300 -0.137489200
## 36180) x_5level=L1,L2,L3 29 19.1369300 -0.331689300
## 72360) x_continuous< 0.302812 15 7.2469680 -0.609061400 *
## 72361) x_continuous>=0.302812 14 9.4994750 -0.034504840 *
## 36181) x_5level=L5 14 4.8865890 0.264782400 *
## 18091) x_20level=L19,L9 13 11.2913900 0.464569200 *
## 4523) x_continuous>=0.4723028 15 11.2252400 0.647358400 *
## 1131) x_continuous>=0.7967408 9 8.8569190 0.840557400 *
## 283) x_continuous>=0.8543512 7 1.7699310 0.574619200 *
## 71) x_continuous>=0.8933792 9 5.2838640 0.330551200 *
## 9) x_continuous< 0.05378612 18 13.3380500 0.226882700 *
## 5) x_5level=L4 73 90.9295500 0.061069190
## 10) x_20level=L1,L2 14 7.9741450 -0.421406400 *
## 11) x_20level=L15,L17,L18,L19,L4,L9 59 78.9231300 0.175554900
## 22) x_continuous>=0.4878807 30 46.4313300 -0.024174390
## 44) x_20level=L17,L18,L19 14 25.8407900 -0.332190700 *
## 45) x_20level=L15,L4,L9 16 18.1000900 0.245339900 *
## 23) x_continuous< 0.4878807 29 30.0570300 0.382171500
## 46) x_20level=L15,L9 10 9.7338560 -0.160765100 *
## 47) x_20level=L17,L18,L19,L4 19 15.8239000 0.667927600 *
## 3) x_20level=L10,L11,L12,L13,L14,L16,L20,L3,L5,L6,L7,L8 606 553.8804000 0.084142370
## 6) x_5level=L2,L3,L4 337 297.4489000 -0.024645470
## 12) x_20level=L10,L13,L20,L3,L6,L8 182 159.7282000 -0.121940900
## 24) x_continuous< 0.9621274 174 155.3224000 -0.148424700
## 48) x_continuous>=0.2135687 139 111.4317000 -0.214172800
## 96) x_5level=L2 56 48.6139800 -0.333711900
## 192) x_20level=L10,L6 19 10.0833600 -0.799196900 *
## 193) x_20level=L13,L20,L3,L8 37 32.2997100 -0.094679050
## 386) x_continuous< 0.5302982 18 16.7556300 -0.334470200 *
## 387) x_continuous>=0.5302982 19 13.5285600 0.132491500 *
## 97) x_5level=L3,L4 83 61.4776300 -0.133519800
## 194) x_20level=L13,L3,L8 41 32.7481000 -0.345889600
## 388) x_continuous< 0.7882385 31 23.6747500 -0.430678500
## 776) x_continuous>=0.6566202 7 3.5319050 -1.071911000 *
## 777) x_continuous< 0.6566202 24 16.4251000 -0.243652200
## 1554) x_continuous< 0.397126 11 6.1756230 -0.561323300 *
## 1555) x_continuous>=0.397126 13 8.2001270 0.025146340 *
## 389) x_continuous>=0.7882385 10 8.1596040 -0.083044250 *
## 195) x_20level=L10,L20,L6 42 25.0752800 0.073793520
## 390) x_binary=A 19 8.8470600 -0.205318400 *
## 391) x_binary=B 23 13.5253100 0.304364300 *
## 49) x_continuous< 0.2135687 35 40.9035100 0.112689300
## 98) x_5level=L3,L4 24 28.4115100 -0.220097700
## 196) x_20level=L20,L3,L8 7 6.0804890 -1.312469000 *
## 197) x_20level=L10,L13,L6 17 10.5386600 0.229702300 *
## 99) x_5level=L2 11 4.0349470 0.838769800 *
## 25) x_continuous>=0.9621274 8 1.6292840 0.454080600 *
## 13) x_20level=L11,L12,L14,L16,L5,L7 155 133.9748000 0.089598200
## 26) x_continuous>=0.4840581 86 71.4895600 -0.020038150
## 52) x_20level=L12,L5,L7 46 32.9274600 -0.245308900
## 104) x_continuous< 0.5689724 9 7.8150500 -0.901700000 *
## 105) x_continuous>=0.5689724 37 20.2915500 -0.085646250
## 210) x_binary=B 16 7.9623210 -0.268590800 *
## 211) x_binary=A 21 11.3857300 0.053740090
## 422) x_continuous< 0.7812383 13 5.2512020 -0.118265400 *
## 423) x_continuous>=0.7812383 8 5.1249130 0.333249000 *
## 53) x_20level=L11,L14,L16 40 33.5432300 0.239023300
## 106) x_continuous>=0.644022 32 26.9267500 0.094422940
## 212) x_continuous< 0.8900425 20 15.7208500 -0.173422100 *
## 213) x_continuous>=0.8900425 12 7.3797190 0.540831400 *
## 107) x_continuous< 0.644022 8 3.2709930 0.817424600 *
## 27) x_continuous< 0.4840581 69 60.1631400 0.226246400
## 54) x_continuous< 0.3495906 46 36.0462400 -0.028023140
## 108) x_continuous>=0.1027615 37 31.0931600 -0.130620500
## 216) x_20level=L16,L5,L7 22 18.8638800 -0.296505200
## 432) x_5level=L4 7 9.1976110 -0.723475400 *
## 433) x_5level=L2,L3 15 7.7946220 -0.097252360 *
## 217) x_20level=L11,L12,L14 15 10.7359800 0.112677000 *
## 109) x_continuous< 0.1027615 9 2.9624550 0.393765900 *
## 55) x_continuous>=0.3495906 23 15.1947800 0.734785500
## 110) x_20level=L11,L12,L14,L16 15 7.3175030 0.472744500 *
## 111) x_20level=L5,L7 8 4.9160810 1.226112000 *
## 7) x_5level=L1,L5 269 247.4467000 0.220430500
## 14) x_continuous< 0.03435689 9 2.2294210 -0.572351400 *
## 15) x_continuous>=0.03435689 260 239.3649000 0.247872900
## 30) x_20level=L11,L12,L7 60 64.7436400 -0.038419530
## 60) x_5level=L1 33 33.8951900 -0.238181300
## 120) x_continuous< 0.7865633 25 28.3834300 -0.432181000
## 240) x_20level=L12,L7 16 15.6382900 -0.653000200 *
## 241) x_20level=L11 9 10.5779800 -0.039613620 *
## 121) x_continuous>=0.7865633 8 1.6305630 0.368067800 *
## 61) x_5level=L5 27 27.9221000 0.205733800
## 122) x_continuous>=0.4425729 19 15.1314900 -0.114936600 *
## 123) x_continuous< 0.4425729 8 6.1966650 0.967326000 *
## 31) x_20level=L10,L13,L14,L16,L20,L3,L5,L6,L8 200 168.2281000 0.333760700
## 62) x_continuous< 0.1426957 22 18.1401400 0.107892600
## 124) x_binary=A 14 9.4158880 -0.234131000 *
## 125) x_binary=B 8 4.2205160 0.706433800 *
## 63) x_continuous>=0.1426957 178 148.8269000 0.361677000
## 126) x_binary=B 88 65.0022800 0.244513500
## 252) x_20level=L10,L14,L16,L20 40 31.6352800 0.016187800
## 504) x_5level=L1 20 14.0783800 -0.284226900
## 1008) x_20level=L14,L16 8 5.0362460 -0.751947400 *
## 1009) x_20level=L10,L20 12 6.1253000 0.027586670 *
## 505) x_5level=L5 20 13.9469400 0.316602500
## 1010) x_20level=L10,L20 9 7.4697510 0.030234970 *
## 1011) x_20level=L14,L16 11 5.1352680 0.550903300 *
## 253) x_20level=L13,L3,L5,L6,L8 48 29.5439400 0.434785000
## 506) x_continuous< 0.2818695 8 5.2436800 -0.062367340 *
## 507) x_continuous>=0.2818695 40 21.9275200 0.534215400
## 1014) x_20level=L13,L3,L5,L8 33 17.7086500 0.423206000 *
## 1015) x_20level=L6 7 1.8950860 1.057546000 *
## 127) x_binary=A 90 81.4354900 0.476236800
## 254) x_20level=L6 10 9.0011960 0.011433210 *
## 255) x_20level=L10,L13,L14,L16,L20,L3,L5,L8 80 70.0038200 0.534337200
## 510) x_continuous>=0.2258926 71 63.4998800 0.485919200
## 1020) x_20level=L5 8 5.7908440 -0.080636420 *
## 1021) x_20level=L10,L13,L14,L16,L20,L3,L8 63 54.8150800 0.557862800
## 2042) x_continuous>=0.6528626 25 20.4604100 0.367019300
## 4084) x_continuous< 0.857048 12 5.6336960 -0.152276100 *
## 4085) x_continuous>=0.857048 13 8.6036160 0.846368800 *
## 2043) x_continuous< 0.6528626 38 32.8451000 0.683417800
## 4086) x_20level=L10,L13,L14,L3,L8 28 18.4987400 0.526869900
## 8172) x_continuous< 0.5174447 15 8.4743020 0.352140200 *
## 8173) x_continuous>=0.5174447 13 9.0380650 0.728481000 *
## 4087) x_20level=L16,L20 10 11.7388000 1.121752000 *
## 511) x_continuous< 0.2258926 9 5.0244190 0.916301200 *
tree_imp <- varImp(tree_model, scale = FALSE)
tree_imp
## Overall
## x_20level 3.934052
## x_5level 1.653124
## x_binary 1.126148
## x_continuous 4.122719
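Because a single draw could be a fluke, the sketch below (winners is a name introduced here; it reuses n and the rpart/caret setup above) repeats the null simulation 25 times and tallies which predictor receives the top importance score in each fit. If the granularity bias is systematic, the continuous and 20-level predictors should win most often:
# Repeat the null simulation to show the bias is systematic
set.seed(1)
winners <- replicate(25, {
  sim <- data.frame(
    y = rnorm(n),
    x_binary = factor(sample(c("A", "B"), n, replace = TRUE)),
    x_5level = factor(sample(paste0("L", 1:5), n, replace = TRUE)),
    x_20level = factor(sample(paste0("L", 1:20), n, replace = TRUE)),
    x_continuous = runif(n)
  )
  fit <- rpart(y ~ ., data = sim, method = "anova",
               control = rpart.control(cp = 0.001))
  imp <- varImp(fit, scale = FALSE)
  rownames(imp)[which.max(imp$Overall)]
})
table(winners)  # high-granularity predictors should win most often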
8.3 #A The model on the right concentrates its importance on only a few predictors because the bagging fraction and learning rate are both large (0.9). With a bagging fraction of 0.9, each tree is trained on most of the data, so the same strong predictors are available to every tree. With a learning rate of 0.9, each tree also has a large impact on the final model, so the model locks onto the strongest predictors early and gives them very large importance scores. The model on the left uses a smaller bagging fraction and learning rate, so each tree has less influence and is trained on a smaller random subset of the data, which spreads the importance across a larger number of predictors. There is value in both settings, and tuning these parameters is the most reliable way to find the right balance. The sketch below illustrates the contrast.
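This is only a rough illustration, since the exercise's original data set is not loaded here: the sketch refits the two settings on the simulated Friedman data from 8.1, so the exact numbers will differ from the figure, but the concentration pattern should not. The objects friedman, gbm_left, and gbm_right are names introduced here, and n.minobsinnode is lowered to 5 because a 0.1 bagging fraction of 200 observations leaves only 20 rows per tree.
library(gbm)
set.seed(100)
friedman <- simulated[, 1:11]  # V1-V10 and y, without the duplicate column
# Left panel: small learning rate and bagging fraction
gbm_left <- gbm(y ~ ., data = friedman, distribution = "gaussian",
                n.trees = 1000, interaction.depth = 3,
                shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5)
# Right panel: both parameters near 1
gbm_right <- gbm(y ~ ., data = friedman, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 3,
                 shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5)
summary(gbm_left, plotit = FALSE)   # influence spread over more predictors
summary(gbm_right, plotit = FALSE)  # influence concentrated in a few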
#B I would expect the model on the left to be more predictive on future samples. The smaller learning rate makes the model less likely to overfit the training data, and the smaller bagging fraction adds randomness that further reduces overfitting, which should generalize better to new samples. The model on the right gives too much weight to early trees and concentrates importance on only a few variables, so it is more likely to be overfit and to perform worse on new data.
#C Increasing the interaction depth makes each tree more complex and allows it to model relationships among several predictors at once. With shallow trees, the boosting model mostly captures main effects, which concentrates importance on the strongest individual predictors. Deeper trees can include multiple splits per tree and capture interactions among variables, which tends to spread importance across more predictors. However, deeper trees also increase model complexity and the risk of overfitting, which can be exaggerated further by a high learning rate. The sketch below checks the spread effect on the simulated data.
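As a rough check on the #C claim, this sketch (reusing friedman from the previous sketch; n_for_90 is a name introduced here) fits the same boosted model at two interaction depths and counts how many predictors are needed to cover roughly 90% of the relative influence. Whether the spread actually widens is data-dependent, so treat this as illustrative rather than conclusive:
set.seed(100)
depths <- c(1, 7)
n_for_90 <- sapply(depths, function(d) {
  fit <- gbm(y ~ ., data = friedman, distribution = "gaussian",
             n.trees = 1000, interaction.depth = d,
             shrinkage = 0.1, bag.fraction = 0.5)
  ri <- sort(summary(fit, plotit = FALSE)$rel.inf, decreasing = TRUE)
  sum(cumsum(ri) < 90) + 1  # predictors covering ~90% of relative influence
})
data.frame(interaction.depth = depths, predictors_for_90pct = n_for_90)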
8.7 #A
data(ChemicalManufacturingProcess)
set.seed(62462)
chem <- ChemicalManufacturingProcess
x <- chem[, names(chem) != "Yield"]
y <- chem$Yield
training_index <- createDataPartition(y, p = 0.80, list = FALSE)
train_x <- x[training_index, ]
test_x <- x[-training_index, ]
train_y <- y[training_index]
test_y <- y[-training_index]
chem_preprocess <- preProcess(
  train_x,
  method = c("medianImpute", "center", "scale")
)
train_x_processed <- predict(chem_preprocess, train_x)
test_x_processed <- predict(chem_preprocess, test_x)
train_data <- data.frame(train_x_processed, Yield = train_y)
test_data <- data.frame(test_x_processed, Yield = test_y)
ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5
)
#Single regression tree
rpart_fit <- train(
  Yield ~ .,
  data = train_data,
  method = "rpart",
  trControl = ctrl,
  tuneLength = 10,
  metric = "RMSE"
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
rpart_fit
## CART
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 129, 130, 129, 129, 131, 129, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01166171 1.391212 0.4916981 1.073797
## 0.01764571 1.404991 0.4815042 1.082046
## 0.02409976 1.426726 0.4696542 1.101242
## 0.02823040 1.434066 0.4669813 1.108598
## 0.02948997 1.439802 0.4649547 1.111288
## 0.04539905 1.447049 0.4554312 1.129731
## 0.04980354 1.453354 0.4437482 1.142253
## 0.06551772 1.437705 0.4434301 1.137715
## 0.10724566 1.463395 0.4154622 1.157379
## 0.38447502 1.745086 0.2369196 1.407932
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01166171.
#Random Forest
rf_fit <- train(
  Yield ~ .,
  data = train_data,
  method = "rf",
  trControl = ctrl,
  tuneLength = 5,
  metric = "RMSE",
  importance = TRUE
)
rf_fit
## Random Forest
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 131, 130, 130, 131, 129, 129, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.201648 0.6623120 0.9642171
## 15 1.073680 0.6985497 0.8450098
## 29 1.061131 0.6989627 0.8316197
## 43 1.061003 0.6922837 0.8287020
## 57 1.069024 0.6847127 0.8323899
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 43.
#Boosted Trees
gbm_fit <- train(
  Yield ~ .,
  data = train_data,
  method = "gbm",
  trControl = ctrl,
  tuneLength = 5,
  metric = "RMSE",
  verbose = FALSE
)
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
gbm_fit
## Stochastic Gradient Boosting
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 129, 131, 128, 131, 130, 130, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 1.173778 0.6261683 0.9136573
## 1 100 1.157318 0.6338251 0.9077326
## 1 150 1.152453 0.6399458 0.9044024
## 1 200 1.158590 0.6339452 0.9092358
## 1 250 1.149980 0.6405769 0.9057639
## 2 50 1.158012 0.6307061 0.8998786
## 2 100 1.128549 0.6463140 0.8799975
## 2 150 1.117425 0.6535445 0.8780331
## 2 200 1.107756 0.6604819 0.8736083
## 2 250 1.101950 0.6665681 0.8714308
## 3 50 1.126964 0.6544021 0.8804137
## 3 100 1.089063 0.6749766 0.8506750
## 3 150 1.075617 0.6839268 0.8444513
## 3 200 1.071206 0.6868750 0.8418709
## 3 250 1.066179 0.6905517 0.8392647
## 4 50 1.143261 0.6379582 0.8917140
## 4 100 1.115667 0.6572577 0.8758878
## 4 150 1.097671 0.6689795 0.8623324
## 4 200 1.091684 0.6734568 0.8580749
## 4 250 1.086844 0.6770774 0.8548799
## 5 50 1.131507 0.6490796 0.8869771
## 5 100 1.100319 0.6677153 0.8640611
## 5 150 1.080082 0.6800792 0.8508346
## 5 200 1.070822 0.6860334 0.8446090
## 5 250 1.065587 0.6889975 0.8418312
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 250, interaction.depth =
## 5, shrinkage = 0.1 and n.minobsinnode = 10.
#Cubist model
cubist_fit <- train(
  Yield ~ .,
  data = train_data,
  method = "cubist",
  trControl = ctrl,
  tuneLength = 5,
  metric = "RMSE"
)
cubist_fit
## Cubist
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 130, 131, 129, 130, 131, 129, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.4325830 0.5632584 1.0307735
## 1 5 1.3100360 0.6379680 0.9153750
## 1 9 1.3406924 0.6128971 0.9406832
## 10 0 1.1149835 0.6508739 0.8767342
## 10 5 1.0056971 0.7178674 0.7906116
## 10 9 1.0500193 0.6860021 0.8140653
## 20 0 1.0764719 0.6742535 0.8485175
## 20 5 0.9740984 0.7359200 0.7612285
## 20 9 1.0175441 0.7045330 0.7840011
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
tree_results <- resamples(
  list(
    Single_Tree = rpart_fit,
    Random_Forest = rf_fit,
    Boosted_Trees = gbm_fit,
    Cubist = cubist_fit
  )
)
summary(tree_results)
##
## Call:
## summary.resamples(object = tree_results)
##
## Models: Single_Tree, Random_Forest, Boosted_Trees, Cubist
## Number of resamples: 50
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Single_Tree 0.6079992 0.9147214 1.0707895 1.0737967 1.2287255 1.796440 0
## Random_Forest 0.4965885 0.7054470 0.8209578 0.8287020 0.9159793 1.257064 0
## Boosted_Trees 0.5264021 0.7087707 0.8365575 0.8418312 0.9178593 1.507299 0
## Cubist 0.5231806 0.6651997 0.7318827 0.7612285 0.8263787 1.275589 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Single_Tree 0.7748994 1.1669465 1.4232353 1.3912118 1.533369 2.174696 0
## Random_Forest 0.5623351 0.8772929 1.0534444 1.0610033 1.243026 1.708418 0
## Boosted_Trees 0.6167073 0.9077503 1.0597840 1.0655873 1.168317 1.793513 0
## Cubist 0.5882040 0.8259086 0.9382308 0.9740984 1.062399 1.928599 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Single_Tree 0.08464544 0.4080889 0.5169410 0.4916981 0.5681905 0.8445140 0
## Random_Forest 0.41697995 0.5837081 0.7338248 0.6922837 0.7907269 0.9200819 0
## Boosted_Trees 0.26555430 0.6146996 0.6909801 0.6889975 0.7836062 0.9061588 0
## Cubist 0.39100517 0.6821785 0.7653913 0.7359200 0.8285277 0.9150012 0
bwplot(tree_results, metric = "RMSE")
dotplot(tree_results, metric = "RMSE")
# Generate predictions
rpart_pred <- predict(rpart_fit, newdata = test_data)
rf_pred <- predict(rf_fit, newdata = test_data)
gbm_pred <- predict(gbm_fit, newdata = test_data)
cubist_pred <- predict(cubist_fit, newdata = test_data)
# Calculate performance using caret::postResample
rpart_perf <- postResample(pred = rpart_pred, obs = test_data$Yield)
rf_perf <- postResample(pred = rf_pred, obs = test_data$Yield)
gbm_perf <- postResample(pred = gbm_pred, obs = test_data$Yield)
cubist_perf <- postResample(pred = cubist_pred, obs = test_data$Yield)
# Combine into one table
tree_test_results <- data.frame(
  Model = c("Single Tree", "Random Forest", "Boosted Trees", "Cubist"),
  RMSE = c(rpart_perf["RMSE"], rf_perf["RMSE"], gbm_perf["RMSE"], cubist_perf["RMSE"]),
  Rsquared = c(rpart_perf["Rsquared"], rf_perf["Rsquared"], gbm_perf["Rsquared"], cubist_perf["Rsquared"]),
  MAE = c(rpart_perf["MAE"], rf_perf["MAE"], gbm_perf["MAE"], cubist_perf["MAE"])
)
tree_test_results <- tree_test_results |>
  arrange(RMSE)
tree_test_results
## Model RMSE Rsquared MAE
## 1 Cubist 0.951341 0.7377865 0.7644967
## 2 Boosted Trees 1.103514 0.6431468 0.8684359
## 3 Random Forest 1.263732 0.5166297 0.9378793
## 4 Single Tree 1.623386 0.2968931 1.2120498
Above we can see that Cubist was our best performing model: it has the lowest RMSE and MAE as well as the highest R^2 value (RMSE 0.951, R^2 0.738, MAE 0.764). The boosted tree model was the next best, followed by random forest, and the single regression tree performed the worst. The resampling results agreed, with Cubist also having the lowest cross-validated RMSE. This suggests that Cubist was the optimal tree-based regression model for predicting yield in the chemical manufacturing data. The strong performance of Cubist may be due to its ability to combine tree-based rules with linear regression models, allowing it to capture both nonlinear structure and local linear relationships in the data.
#B
best_tree_model <- cubist_fit
best_tree_imp <- varImp(best_tree_model, scale = FALSE)
plot(best_tree_imp, top = 20)
best_tree_imp_table <- best_tree_imp$importance |>
  as.data.frame() |>
  tibble::rownames_to_column("Variable") |>
  arrange(desc(Overall))
top10_tree <- best_tree_imp_table |>
  slice(1:10)
top10_tree
## Variable Overall
## 1 ManufacturingProcess32 50.0
## 2 ManufacturingProcess09 26.0
## 3 BiologicalMaterial03 23.0
## 4 ManufacturingProcess17 17.0
## 5 ManufacturingProcess33 16.5
## 6 ManufacturingProcess29 14.0
## 7 BiologicalMaterial06 13.0
## 8 ManufacturingProcess25 13.0
## 9 ManufacturingProcess28 12.5
## 10 ManufacturingProcess13 11.0
In the optimal model, Cubist, the most important predictor was ManufacturingProcess32 with an importance score of 50. ManufacturingProcess09, BiologicalMaterial03, ManufacturingProcess17, and ManufacturingProcess33 were also highly ranked predictors. The top-10 list was dominated by manufacturing process variables: 8 of the top 10 predictors were process variables and only two were biological material variables. This suggests that the manufacturing process measurements were more influential than the biological material measurements for predicting yield, although the biological predictors were still relevant since BiologicalMaterial03 and BiologicalMaterial06 both appeared in the top 10.
Compared with the optimal linear and nonlinear models from the previous exercises, there was noticeable overlap. Predictors such as ManufacturingProcess32, ManufacturingProcess17, ManufacturingProcess13, BiologicalMaterial03, and BiologicalMaterial06 appeared among the most important predictors in each model. The exact ranking changed depending on the model, and our previous best model had 6 process variables and 4 biological variables in its top 10, but process variables were even more dominant in our best model here.
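As a quick overlap check within this exercise, the sketch below (top10_of is a helper introduced here; it assumes rf_fit and gbm_fit from part A are still in memory) intersects the Cubist top 10 with each of the other tree-based models' top 10s:
# Top-10 predictors of a caret fit, by unscaled importance
top10_of <- function(fit) {
  imp <- varImp(fit, scale = FALSE)$importance
  rownames(imp)[order(-imp$Overall)][1:10]
}
intersect(top10_tree$Variable, top10_of(rf_fit))   # shared with random forest
intersect(top10_tree$Variable, top10_of(gbm_fit))  # shared with boosted trees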
#C
rpart.plot(
  rpart_fit$finalModel,
  type = 2,
  extra = 101,
  fallen.leaves = TRUE,
  main = "Optimal Single Regression Tree for Yield"
)
train_data$terminal_node <- factor(rpart_fit$finalModel$where)
ggplot(train_data, aes(x = terminal_node, y = Yield)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Yield in Terminal Nodes",
    x = "Terminal Node",
    y = "Yield"
  )
terminal_summary <- train_data |>
  group_by(terminal_node) |>
  summarize(
    n = n(),
    mean_yield = mean(Yield),
    median_yield = median(Yield),
    sd_yield = sd(Yield),
    min_yield = min(Yield),
    max_yield = max(Yield),
    .groups = "drop"
  ) |>
  arrange(mean_yield)
terminal_summary
## # A tibble: 10 × 7
## terminal_node n mean_yield median_yield sd_yield min_yield max_yield
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 7 36.9 37.1 0.943 35.2 37.9
## 2 6 16 38.5 38.2 0.751 37.4 40.2
## 3 7 25 39.1 39.1 0.816 37.5 40.7
## 4 9 21 39.4 39.5 0.684 38.2 40.4
## 5 14 18 40.1 39.9 1.23 36.8 42.7
## 6 10 16 41.1 41.1 1.13 38.0 43.1
## 7 17 10 41.2 41.2 0.709 39.8 42.2
## 8 15 8 41.7 41.7 0.917 40.1 43.4
## 9 18 13 42.5 42.4 0.905 40.9 44.4
## 10 19 10 43.2 43.0 1.40 41.2 46.3
The optimal single regression tree provides a useful visual summary of how the predictors separate observations into different yield groups. The first split in the tree was on ManufacturingProcess32, which was also the most important predictor in the Cubist model; as expected, this suggests that ManufacturingProcess32 is strongly associated with yield. The single tree used both manufacturing process variables (ManufacturingProcess32, ManufacturingProcess09, ManufacturingProcess17, ManufacturingProcess30, and ManufacturingProcess24) and biological material variables (BiologicalMaterial11, BiologicalMaterial03, and BiologicalMaterial05). This suggests that both types of predictors help explain yield, but the process variables appear earlier in the tree and so seem more important, as expected.
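As a one-line sanity check on that claim, the first row of an rpart frame is always the root node, so the root split variable can be read off directly:
# Variable used at the root split of the final single tree
as.character(rpart_fit$finalModel$frame$var[1])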
The terminal node boxplot also shows that the tree created groups with clearly different yield distributions: the lowest-yield terminal node had a mean yield of about 36.9 and the highest-yield terminal node had a mean of about 43.2, indicating that the tree separated lower- and higher-yield observations. That said, the single tree should not be viewed as the best predictive model, since Cubist had much better test set performance; based on our results, the single tree is useful for explanation, while Cubist is better for prediction.