Exercise 8.1. Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
model1 <- randomForest(y ~ .,
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##         Overall
## V1   8.83890885
## V2   6.49023056
## V3   0.67583163
## V4   7.58822553
## V5   2.27426009
## V6   0.17436781
## V7   0.15136583
## V8  -0.03078937
## V9  -0.02989832
## V10 -0.08529218
Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No, the random forest model made little use of the uninformative predictors. The informative predictors (V1-V5) have importance scores between roughly 0.68 (V3) and 8.84 (V1), whereas the uninformative predictors (V6-V10) all fall between about -0.09 and 0.17, which is close to zero.

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ .,
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

Interestingly, adding a predictor that is highly correlated with V1 lowers the importance score of V1, from 8.84 to 6.30. The importance appears to be shared between the two: the scores for V1 and duplicate1 together (6.30 + 3.56 = 9.86) are in the same range as the score V1 had on its own before duplicate1 was added (8.84). I don’t imagine this holds exactly in every case, but it suggests that a further correlated copy would dilute the score of V1 again.

rfImp2
##                Overall
## V1          6.29780744
## V2          6.08038134
## V3          0.58410718
## V4          6.93924427
## V5          2.03104094
## V6          0.07947642
## V7         -0.02566414
## V8         -0.11007435
## V9         -0.08839463
## V10        -0.00715093
## duplicate1  3.56411581
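Part (b) also asks what happens when a second correlated predictor is added. The sketch below is one way to check this and is not the fit reported above: it rebuilds the simulated data under the same seed, adds a hypothetical second copy named duplicate2, and refits the forest. The names sim, sim_df, duplicate2 and rf_dup2 are illustrative, and because the random number stream differs from the original session the scores will not match the tables above exactly.

```r
library(mlbench)
library(randomForest)

# Rebuild the Friedman data as in the exercise
set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
sim_df <- as.data.frame(cbind(sim$x, sim$y))
colnames(sim_df)[ncol(sim_df)] <- "y"

# Two noisy copies of V1 (duplicate2 is a hypothetical addition)
sim_df$duplicate1 <- sim_df$V1 + rnorm(200) * 0.1
sim_df$duplicate2 <- sim_df$V1 + rnorm(200) * 0.1

rf_dup2 <- randomForest(y ~ ., data = sim_df, importance = TRUE, ntree = 1000)

# Unscaled permutation importance (type = 1) for V1 and its copies
imp <- importance(rf_dup2, type = 1, scale = FALSE)
imp[c("V1", "duplicate1", "duplicate2"), , drop = FALSE]
```

Comparing the three rows with the V1 score from part (a) shows whether the importance is diluted further as more correlated copies are added.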
(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## 
## Attaching package: 'party'
## The following object is masked from 'package:dplyr':
## 
##     where
model3 <- cforest(y ~ .,
                  data = simulated,
                  controls = cforest_unbiased(ntree = 500))

rfImp3 <- varimp(model3)

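The ‘cforest’ importances show broadly the same pattern as the traditional random forest: V1, V2 and V4 dominate, V6-V10 are close to zero, and duplicate1 still receives a sizeable score (2.72 against 6.57 for V1). Note that ‘varimp’ is called here with its default, conditional = FALSE, so these are the traditional permutation importances rather than the conditional version described in Strobl et al. (2007).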

rfImp3
##           V1           V2           V3           V4           V5           V6 
##  6.571632689  6.110131511  0.010450332  7.485796796  1.889117623 -0.006751589 
##           V7           V8           V9          V10   duplicate1 
##  0.007081381 -0.036022685  0.008976088  0.005356254  2.722380892
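The exercise asks about the conditional argument of varimp, which the call above leaves at its default. A minimal sketch of the comparison follows, assuming the same simulated data rebuilt from the seed; the names sim_df, cf, imp_trad and imp_cond are illustrative, and ntree is reduced to 100 only to keep the conditional computation quick.

```r
library(mlbench)
library(party)

# Rebuild the simulated data with the correlated copy of V1
set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
sim_df <- as.data.frame(cbind(sim$x, sim$y))
colnames(sim_df)[ncol(sim_df)] <- "y"
sim_df$duplicate1 <- sim_df$V1 + rnorm(200) * 0.1

cf <- cforest(y ~ ., data = sim_df,
              controls = cforest_unbiased(ntree = 100))

# Traditional permutation importance versus the conditional version
imp_trad <- varimp(cf)
imp_cond <- varimp(cf, conditional = TRUE)

round(cbind(traditional = imp_trad, conditional = imp_cond), 3)
```

Putting the two columns side by side makes it easy to see how the conditional measure treats V1 and duplicate1 relative to the traditional one.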
(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

We tried gradient boosted machines and see a broadly similar pattern to the previous methods: V1-V5 carry the influence while V6-V10 have almost none. The scales are different, since the relative influence is rescaled here so that the largest value is 1. In this fit duplicate1 (0.41) receives nearly as much influence as V1 (0.45), so the importance is split almost evenly between the two correlated predictors, and V1 drops behind V2 (0.72) and V4 (1.00).

Mental note: gbm does not escape the effects of collinearity; it divides the influence between the correlated predictors.

library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")

relative.influence(gbmModel, scale. = TRUE)
## n.trees not given. Using 100 trees.
##         V1         V2         V3         V4         V5         V6         V7 
## 0.45460082 0.71969471 0.26299852 1.00000000 0.33812464 0.01690042 0.00426528 
##         V8         V9        V10 duplicate1 
## 0.00000000 0.00000000 0.00000000 0.40781275
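Part (d) also mentions Cubist, which is not fitted above. The following is a sketch of how it could be added, assuming the Cubist package is installed; the names sim_df, predictors, cubist_fit and cubist_imp are illustrative, and committees = 10 is an arbitrary choice rather than a tuned value.

```r
library(mlbench)
library(Cubist)
library(caret)

# Rebuild the simulated data with the correlated copy of V1
set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
sim_df <- as.data.frame(cbind(sim$x, sim$y))
colnames(sim_df)[ncol(sim_df)] <- "y"
sim_df$duplicate1 <- sim_df$V1 + rnorm(200) * 0.1

# Cubist takes the predictors and the outcome separately
predictors <- sim_df[, names(sim_df) != "y"]
cubist_fit <- cubist(x = predictors, y = sim_df$y, committees = 10)

# caret's varImp combines usage in rule conditions and in the linear models
cubist_imp <- varImp(cubist_fit)
cubist_imp[order(-cubist_imp$Overall), , drop = FALSE]
```

The rows for V1 and duplicate1 can then be compared with the random forest and gbm results above.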

8.2. Use a simulation to show tree bias with different granularities.

“trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors”

high <- sample(1:10, 100, replace = TRUE)
low <- sample(1:5, 100, replace = TRUE)

y <- jitter(low + high)

sim_data <- dplyr::tibble(y = y, high = high, low = low)
head(sim_data)
## # A tibble: 6 × 3
##       y  high   low
##   <dbl> <int> <int>
## 1  9.06     4     5
## 2  5.91     4     2
## 3  4.12     1     3
## 4 12.0      8     4
## 5  5.95     4     2
## 6  7.10     2     5

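We can see the predictor with more distinct values has a much higher variable importance score in our random forest model. Because y is the sum of the two predictors and high has the wider range, high also explains more of the variance in y, so part of this gap reflects real signal rather than selection bias alone.

I want to note that for me this is a counterintuitive use of the word ‘granular’. I think of granular as having more detail or more potential states, but here it appears to mean the opposite: the variable with the fewest possible distinct values.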

model1 <- randomForest(y ~ .,
                       data = sim_data,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

rfImp1
##        Overall
## high 15.036417
## low   3.447798
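Since y above is built from both predictors, a cleaner check of selection bias makes the response independent of every predictor, so that any preference for one predictor can only come from the splitting procedure. The sketch below assumes the rpart package; the names n_sims and first_split, and the choice of 10 versus 2 distinct values, are illustrative.

```r
library(rpart)

set.seed(100)
n_sims <- 200
first_split <- character(n_sims)

for (i in seq_len(n_sims)) {
  # y is pure noise, unrelated to either predictor
  d <- data.frame(
    y    = rnorm(100),
    high = sample(1:10, 100, replace = TRUE),  # many distinct values
    low  = sample(1:2, 100, replace = TRUE)    # few distinct values
  )
  # Force a single split so we can see which predictor is chosen at the root
  fit <- rpart(y ~ high + low, data = d,
               control = rpart.control(cp = 0, maxdepth = 1))
  first_split[i] <- as.character(fit$frame$var[1])
}

# How often each predictor was selected for the first split
table(first_split)
```

If splits were chosen without bias, the two predictors would be picked about equally often; a clear majority for high would reflect the bias described in the quote.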

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

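I think that with a bagging fraction of 0.9 each tree is built on a subsample containing most of the training data, so the trees see essentially the same data set. As a result, whichever predictors are most important in that data set are going to be the most important in the majority of the trees. The learning rate of 0.9 reinforces this, because each tree’s contribution is added almost in full and the first few trees, which split on the strongest predictors, account for most of the fit. With both parameters at 0.1, each tree sees a different small subsample and contributes only a small step, so many more predictors get a chance to be used.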

(b) Which model do you think would be more predictive of other samples?

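The model with a bagging fraction of 0.1 and a learning rate of 0.1 is more likely to generalize to unseen samples. With a smaller bagging fraction each tree is built on a different small subsample of the original data, and with a smaller learning rate each tree contributes only a small correction, which makes the ensemble less likely to overfit the training set. The model on the right is greedier and therefore more likely to overfit.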

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

This is an interesting question. I think that increasing the interaction depth will decrease the slope, most visibly for the model on the right (0.9 bagging fraction). Deeper trees have more splits, which gives other predictors a chance to be used and raises their importance, resulting in a more gradual slope.
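The reasoning in 8.3 can be checked empirically. The sketch below is an assumption-laden illustration, not the code behind Fig. 8.24: it fits gbm directly to the solubility training data from AppliedPredictiveModeling with both parameters at 0.1 and at 0.9. The helper fit_gbm, the choice of 100 trees, and the default depth of 1 are all illustrative.

```r
library(AppliedPredictiveModeling)
library(gbm)

data(solubility)
sol_df <- data.frame(solTrainXtrans, y = solTrainY)

# Hypothetical helper: the same value is used for the learning rate
# (shrinkage) and the bagging fraction, as in the exercise
fit_gbm <- function(rate, depth = 1) {
  gbm(y ~ ., data = sol_df, distribution = "gaussian",
      n.trees = 100, interaction.depth = depth,
      shrinkage = rate, bag.fraction = rate)
}

set.seed(100)
gbm_low  <- fit_gbm(0.1)
gbm_high <- fit_gbm(0.9)

# Relative influence for every predictor, sorted in decreasing order
imp_low  <- summary(gbm_low, plotit = FALSE)
imp_high <- summary(gbm_high, plotit = FALSE)

head(imp_low, 10)
head(imp_high, 10)

# Number of predictors that received any influence in each model
c(low = sum(imp_low$rel.inf > 0), high = sum(imp_high$rel.inf > 0))
```

Calling fit_gbm(0.9, depth = 5) and repeating the summary is one way to explore part (c), by comparing how quickly the influence drops off as the trees get deeper.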

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
cmp_df <- ChemicalManufacturingProcess
sum(is.na(cmp_df))
## [1] 106

We can use the ‘colSums()’ function to understand missingness by predictor. The greatest amount of missingness is 15 values. I’m going to replace the missing values with the column mean.

colSums(is.na(cmp_df))
##                  Yield   BiologicalMaterial01   BiologicalMaterial02 
##                      0                      0                      0 
##   BiologicalMaterial03   BiologicalMaterial04   BiologicalMaterial05 
##                      0                      0                      0 
##   BiologicalMaterial06   BiologicalMaterial07   BiologicalMaterial08 
##                      0                      0                      0 
##   BiologicalMaterial09   BiologicalMaterial10   BiologicalMaterial11 
##                      0                      0                      0 
##   BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02 
##                      0                      1                      3 
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 
##                     15                      1                      1 
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 
##                      2                      1                      1 
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11 
##                      0                      9                     10 
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14 
##                      1                      0                      1 
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17 
##                      0                      0                      0 
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20 
##                      0                      0                      0 
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23 
##                      0                      1                      1 
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 
##                      1                      5                      5 
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 
##                      5                      5                      5 
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32 
##                      5                      5                      0 
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 
##                      5                      5                      5 
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38 
##                      5                      0                      0 
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41 
##                      0                      1                      1 
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44 
##                      0                      0                      0 
## ManufacturingProcess45 
##                      0
library(zoo)
cmp_df <- na.aggregate(cmp_df)
sum(is.na(cmp_df))
## [1] 0
  1. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
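# Drop near-zero-variance predictors, guarding against an empty result
columns_to_remove <- nearZeroVar(cmp_df)
if (length(columns_to_remove) > 0) {
  cmp_df <- cmp_df[, -columns_to_remove]
}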

# Set the random number seed so we can reproduce the results
set.seed(123456789)
# By default, the numbers are returned as a list. Using
# list = FALSE, a matrix of row numbers is generated.
# These samples are allocated to the training set.
trainingRows <- createDataPartition(cmp_df$Yield, p = .70, list = FALSE)
head(trainingRows)
##      Resample1
## [1,]         2
## [2,]         3
## [3,]         4
## [4,]         5
## [5,]         7
## [6,]         8
# Subset the data into objects for training using
# integer sub-setting.
train <- cmp_df[trainingRows,]
train <- train |> select(-Yield)

trans <- preProcess(train, method = c("center", "scale"))
train <- as.data.frame(predict(trans, train))

yield_train <- cmp_df$Yield[trainingRows]
head(train)
##   BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 2            2.0957767             1.252483           -0.1268052
## 3            2.0957767             1.252483           -0.1268052
## 4            2.0957767             1.252483           -0.1268052
## 5            1.3778129             1.837914            1.0949364
## 7            1.3911085             2.120707            1.1359172
## 8            0.6731447             1.904892            1.0462716
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 2            1.1800471            0.3587809             1.089012
## 3            1.1800471            0.3587809             1.089012
## 4            1.1800471            0.3587809             1.089012
## 5            0.8504608           -0.3993435             1.496600
## 7            0.7458303           -0.5039124             1.440288
## 8            1.7293575            0.3901516             1.512689
##   BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 2             2.311820           -0.7707901            1.0177401
## 3             2.311820           -0.7707901            1.0177401
## 4             2.311820           -0.7707901            1.0177401
## 5             1.088856           -0.1581108            0.3809166
## 7             1.834566            0.2094968            0.3653844
## 8             2.028451            0.6506259            1.6234990
##   BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 2           1.38641812            1.1472202             -5.4714537
## 3           1.38641812            1.1472202             -5.4714537
## 4           1.38641812            1.1472202             -5.4714537
## 5           0.12116426            1.1472202             -0.2242367
## 7           0.01677037            0.7370656              0.1680786
## 8           1.49707564            1.6940931              0.4132757
##   ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 2              -1.969428           -0.008089843             -2.3181036
## 3              -1.969428           -0.008089843             -3.1108362
## 4              -1.969428           -0.008089843             -3.2693827
## 5              -1.969428           -0.008089843             -2.1595570
## 7              -1.969428            1.016858040              0.2186408
## 8              -1.969428            0.515287799             -0.4155453
##   ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 2              0.9688887              0.8891570             -1.0286226
## 3              0.0417505             -0.1547302              0.9643337
## 4              0.3983421              2.0770287             -1.0286226
## 5              0.8165268             -0.6586758              0.9643337
## 7             -0.4347856              0.8891570             -1.0286226
## 8              0.2783977              1.5010909              0.9643337
##   ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 2              0.8604992              0.5591263            -0.02499045
## 3              0.8604992             -0.4168261            -0.02499045
## 4             -1.1527442             -0.5144214            -0.02499045
## 5              0.8604992             -0.4883960            -0.02499045
## 7              0.8604992              2.3743977             3.10870450
## 8              0.8604992              1.9319660             1.29654056
##   ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 2           -0.006996376             -0.4879186             -0.4456372
## 3           -0.006996376             -0.4879186              0.3100416
## 4           -0.006996376             -0.4879186              0.3100416
## 5           -0.006996376             -0.4879186              0.1211219
## 7            3.019944677             -0.4879186             -1.9569949
## 8            2.733635723             -0.4879186             -0.8234766
##   ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 2            0.278423622              0.9457981              0.3928445
## 3            0.440595210              0.8079242              0.3928445
## 4            0.782957451              1.0664377              0.6854134
## 5            2.494768657              3.3241223              2.2782886
## 7           -1.955940478             -0.7086884             -1.7364069
## 8            0.008137642              1.1181404              0.5391290
##   ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 2             -0.2144985              0.6185123              1.4529711
## 3              0.4094971              0.8268674              1.0426138
## 4              0.4094971              0.7226898              0.9346251
## 5             -0.2924979              1.0143870              1.5609598
## 7             -0.3704974             -1.6525586             -0.3612397
## 8             -0.5264963             -1.4858745             -0.1668600
##   ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 2              0.7597951              0.2744285            -0.68368241
## 3              0.7597951              0.2744285            -0.38090877
## 4              0.5570959              0.2744285            -0.07813513
## 5              1.5300521             -0.7018171             0.83018578
## 7             -1.2469271              2.2269197            -1.28922968
## 8             -0.6388295              0.2744285            -0.98645605
##   ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 2             -1.7829988             -0.9201897              0.3224479
## 3             -1.1918441             -0.7459859              1.0168982
## 4             -0.6006894             -0.5717820              0.8928892
## 5              0.5816199              1.6928682              1.8353575
## 7             -1.1918441             -1.2685975             -1.5128850
## 8             -0.6006894             -1.0943936             -1.2400652
##   ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 2              1.2489958             0.91027573              0.8234824
## 3              1.4505103             1.06428140              0.8034346
## 4              1.3385578             0.91027573              0.8034346
## 5              2.2341779             2.09831946              0.8435302
## 7              0.1294708            -0.36577125              0.8234824
## 8              0.1742518            -0.05775991              0.8034346
##   ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 2               2.035497              1.0357345             -1.3228383
## 3               1.874264              0.2846139             -0.8912206
## 4               1.874264              0.2846139             -0.8912206
## 5               2.357963             -0.3162826             -0.8192844
## 7               1.713030              2.9886481             -2.1141372
## 8               1.713030              2.5379758             -1.8983284
##   ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 2              1.9201778              1.0509993              1.9873701
## 3              2.6579069              1.0509993              1.9873701
## 4              2.2890423              1.9190132              0.1249846
## 5              2.2890423              2.7870272              0.1249846
## 7              0.0758552              0.6169924              0.1249846
## 8              0.4447197              0.6169924              0.1249846
##   ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 2             1.05962903              -0.735134             2.14246805
## 3             1.15314672              -1.905827            -0.68145550
## 4            -0.06258329              -1.905827             0.40466895
## 5            -2.68107870              -3.076521            -1.76757994
## 7            -2.02645485              -0.735134            -0.46423061
## 8            -1.74590177              -0.735134            -0.02978083
##   ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 2             -0.6760464              0.2356983              1.9819345
## 3             -0.6760464              0.2356983             -0.5004885
## 4             -0.6760464              0.2356983             -0.5004885
## 5             -0.6760464              0.3000740             -0.5004885
## 7             -0.6760464              0.3000740             -0.5004885
## 8             -0.6760464              0.3000740             -0.5004885
##   ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 2              2.0995668             0.02815815            -0.03692521
## 3             -0.4781192             0.42096439             0.06482425
## 4             -0.4781192            -0.19006753             0.16657371
## 5             -0.4781192            -0.01548698             0.16657371
## 7             -0.4781192             0.29002897            -0.24042412
## 8             -0.4781192             0.15909356            -0.13867466
##   ManufacturingProcess44 ManufacturingProcess45
## 2             0.30054609             0.15921800
## 3             0.03623605             0.37150866
## 4             0.03623605            -0.05307267
## 5            -0.22807398            -0.05307267
## 7             0.56485612             0.15921800
## 8             0.56485612             0.15921800
# Do the same for the test set using negative integers.
test <- cmp_df[-trainingRows,]
test <- test |> select(-Yield)

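# Note: this re-estimates the centering and scaling values from the test set.
# The usual practice is to apply the training-set transformation instead,
# i.e. predict() with the preProcess object that was fitted on train.
trans <- preProcess(test, method = c("center", "scale"))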
test <- as.data.frame(predict(trans, test))

yield_test <- cmp_df$Yield[-trainingRows]
head(test)
##    BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1            -0.1755184           -1.3793558          -2.40754671
## 6            -0.3862653            0.8007961          -0.41775223
## 9             0.9430609            2.1019346           1.19269294
## 15           -0.2079410            1.9355677           0.63917697
## 18            2.0778515            1.0689697           0.03794411
## 21            1.8671047            1.1434624           0.05225918
##    BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1             0.3542804             0.651668           -1.2603016
## 6             2.0820641             2.014069            0.7176365
## 9             2.4589059             0.597889            1.6380370
## 15           -0.3140803             1.267139            1.5984783
## 18            2.4375752             1.840781            0.9259793
## 21            2.4731263             2.020044            0.9998224
##    BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 1             -1.211512          -3.17093088             1.389593
## 6              1.136085          -1.58036748             2.054042
## 9              1.923388           0.72479687             2.199390
## 15             1.980647           0.01019592            -1.019035
## 18             1.737298           0.10240249             2.843075
## 21             1.866130           0.37902222             3.009187
##    BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 1             -1.758777           -1.5392922            -0.10566420
## 6              1.035161            0.6717192             0.59433809
## 9              1.505207            1.4621844             0.59433809
## 15             1.227637            2.2984736            -0.02390474
## 18             1.973228            1.4850964             0.94761971
## 21             2.348049            1.8860570             1.21258092
##    ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 1            -0.002005454             0.01550889            -0.08922738
## 6            -2.008492079             0.01550889            -1.36399595
## 9            -2.008492079             0.83950282            -0.71455053
## 15           -2.008492079             0.01550889            -0.38982783
## 18           -2.008492079             0.01550889            -0.38982783
## 21           -2.008492079             0.83950282            -1.20163459
##    ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 1              0.04971722              0.1250541              0.1764062
## 6              0.55858429              0.7382875              1.2411437
## 9              0.11058032              0.6564290             -0.8064284
## 15             4.05840476              0.2880655             -0.8064284
## 18             0.71353303              0.5336412             -0.8064284
## 21             3.45545206             -0.2030857              1.2411437
##    ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 1              0.08726288             -1.6101083             0.06710068
## 6              0.97879500             -0.1392355             0.06710068
## 9             -1.02143731              1.0526787             0.82638896
## 15            -1.02143731              0.5708411             0.38914364
## 18            -1.02143731              0.1207033             0.53489208
## 21             0.97879500             -0.2026352            -0.19385013
##    ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 1              0.01681627             0.03113167              0.9982485
## 6              0.01681627            -0.46351599             -0.6549019
## 9              2.49064599            -0.46351599             -0.7651119
## 15             1.19201577            -0.46351599              0.7778284
## 18             1.76918475            -0.46351599              0.9982485
## 21             0.90343127            -0.46351599              0.6676184
##    ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 1               0.8324958               1.209825              0.2823790
## 6               2.5025585               3.126916              0.4449408
## 9               0.7365152               1.815222              0.2448647
## 15              1.2164182               2.017021              0.2948837
## 18              0.8516919               1.462074              0.2980099
## 21              1.8114981               2.101104              0.4136789
##    ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 1               0.8562408             0.18038449              0.5551000
## 6              -0.9558037             0.17591996              2.0540011
## 9              -0.5243645             0.03751929              0.1917301
## 15              0.6836651             0.15657363              1.5543674
## 18              0.8562408             0.10002281              0.3734150
## 21              0.5973773             0.18931357              1.8950268
##    ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 1               0.2715747             0.09499888             -0.1044680
## 6               0.3202221            -0.56366000              0.9619927
## 9               0.1005240             0.09499888             -0.8182676
## 15              0.2417585             0.09499888              1.5554127
## 18              0.2260658             0.09499888             -1.1149777
## 21              0.3343456             0.09499888             -0.2248475
##    ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 1            -0.001520213             -0.2299603             0.16487392
## 6            -1.269930888             -1.5985493             0.16190785
## 9            -0.012299907             -1.2491648             0.11889974
## 15           -0.012299907              0.8471423             0.16635696
## 18           -1.269930888             -1.4238571             0.07885771
## 21            0.616515583             -0.8997803             0.17377215
##    ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 1               0.1707123              0.2895910              0.9717201
## 6               0.2350214              0.2927257              1.1000139
## 9               0.2052487              0.1704753              1.0816862
## 15              0.2314487              0.2488409              1.0816862
## 18              0.1480850              0.2049562              0.9900478
## 21              0.2219214              0.3068315              1.0450308
##    ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 1               0.4514727              0.5923419            -0.02438267
## 6               0.6963849              0.7307769            -0.11567137
## 9               0.6264100              1.0076468            -0.13595775
## 15              0.6613974              0.5923419            -0.08524180
## 18              0.4164853              1.0768643            -0.03452586
## 21              0.6264100              0.5923419            -0.06495542
##    ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 1              -0.4040165              0.9122196             -1.7476413
## 6               2.7566080              2.3496567              0.1069984
## 9               0.3396599              0.5528604              0.1069984
## 15              0.3396599              0.1935011              0.1069984
## 18             -0.2180974              0.1935011              0.1069984
## 21             -0.2180974             -0.1658581              0.1069984
##    ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 1              -0.6869526             -0.5069243             -1.2503431
## 6              -0.2965533             -1.6530140             -1.4938557
## 9              -0.1989535             -0.5069243              0.4542445
## 15             -0.2965533             -0.5069243              1.4282946
## 18              0.3866454              0.6391654              0.2107320
## 21              0.1914458              0.6391654             -0.2762931
##    ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 1               0.6901983              0.2200479              0.1810502
## 6              -1.4209964              0.2200479             -0.3685787
## 9               0.6901983              0.3630791             -0.3685787
## 15              0.6901983              0.2200479             -0.3685787
## 18              0.6901983              0.3630791             -0.3685787
## 21              0.6901983              0.2915635              2.7341653
##    ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 1               0.2466006            -0.07204426             4.28571934
## 6              -0.3503305            -0.60723023             2.68817521
## 9              -0.3503305            -1.14241620             0.09216601
## 15             -0.3503305            -0.07204426            -0.10752701
## 18             -0.3503305             0.46314170             1.68971013
## 21              4.6840274            -1.14241620             2.28878918
##    ManufacturingProcess44 ManufacturingProcess45
## 1              -0.5717719              1.2985131
## 6              -0.5717719             -0.9522429
## 9               0.5717719             -0.3895539
## 15              0.5717719             -0.9522429
## 18             -0.5717719              0.1731351
## 21             -2.8588594             -0.9522429
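Pre-processing parameters are normally estimated on the training set and then applied unchanged to the test set, so that no test-set statistics enter the model pipeline. The toy example below sketches that pattern with caret; the data frame toy and the names toy_train, toy_test and toy_trans are made up for illustration and are not part of the chemical manufacturing data.

```r
library(caret)

set.seed(1)
toy <- data.frame(x1 = rnorm(50, mean = 10, sd = 2),
                  x2 = runif(50, min = 0, max = 100))
toy_train <- toy[1:35, ]
toy_test  <- toy[36:50, ]

# Estimate centering and scaling values from the training rows only
toy_trans <- preProcess(toy_train, method = c("center", "scale"))

# Apply the same transformation to both sets
toy_train_pp <- predict(toy_trans, toy_train)
toy_test_pp  <- predict(toy_trans, toy_test)

# Training columns are centered at zero; test columns generally are not
colMeans(toy_train_pp)
colMeans(toy_test_pp)
```

The test-set means are not forced to zero here, which is expected: the test rows are being measured against the training distribution.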
(a) Which tree-based regression model gives the optimal resampling and test set performance?
Random Forest

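Random forest gives me the best test set performance, with an RMSE of about 1.29, ahead of gradient boosting (1.33) and the conditional random forest (1.93). None of the three models was tuned with resampling here, so the comparison rests on default settings and test set RMSE only.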

model1 <- randomForest(yield_train ~ .,
                       data = train,
                       importance = TRUE,
                       ntree = 1000,
                       keep.forest = TRUE)

rf_pred <- predict(model1, test)
RMSE = mean((yield_test - rf_pred)^2) %>% sqrt()
RMSE
## [1] 1.285018
GBM
gbmModel <- gbm(yield_train ~ ., data = train, distribution = "gaussian")

gbm_pred <- predict(gbmModel, test)
## Using 100 trees...
RMSE = mean((yield_test - gbm_pred)^2) %>% sqrt()
RMSE
## [1] 1.327406
Conditional Random Forests
model3 <- cforest(yield_train ~ .,
                  data = train,
                  controls = cforest_unbiased(ntree = 500))

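# Caution: OOB = TRUE requests out-of-bag predictions, which are defined for
# the training rows rather than for new data, and the [1:52] subset below
# would hide any length mismatch. A safer call is
# predict(model3, newdata = test, type = "response"), so treat the RMSE
# reported here as provisional.
crf_pred <- predict(model3, test, type = "response", OOB = TRUE)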
RMSE = mean((yield_test - crf_pred[1:52])^2) %>% sqrt()
RMSE
## [1] 1.932903
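The question also asks about resampling performance, which the fits above do not report. One way to obtain it is caret's train function with cross-validation. The sketch below is an assumption, not the procedure used above: it re-imputes the raw data with column means, uses 5-fold cross-validation, and tries an arbitrary grid of three mtry values. The names cmp, x_tr, y_tr, x_te, y_te, ctrl and rf_tuned are illustrative.

```r
library(caret)
library(AppliedPredictiveModeling)

data(ChemicalManufacturingProcess)
cmp <- ChemicalManufacturingProcess

# Mean-impute each column, mirroring the na.aggregate step
cmp[] <- lapply(cmp, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})

set.seed(123456789)
in_train <- createDataPartition(cmp$Yield, p = 0.70, list = FALSE)[, 1]
x_tr <- cmp[in_train, names(cmp) != "Yield"]
y_tr <- cmp$Yield[in_train]
x_te <- cmp[-in_train, names(cmp) != "Yield"]
y_te <- cmp$Yield[-in_train]

# 5-fold cross-validation over a small, arbitrary mtry grid
ctrl <- trainControl(method = "cv", number = 5)
rf_tuned <- train(x = x_tr, y = y_tr, method = "rf",
                  tuneGrid = data.frame(mtry = c(5, 15, 30)),
                  ntree = 500, trControl = ctrl)

# Resampled RMSE for each candidate, then test set performance
rf_tuned$results[, c("mtry", "RMSE")]
postResample(predict(rf_tuned, x_te), y_te)
```

The same pattern works for the other models by changing method (for example "gbm" or "cforest") and supplying a matching tuning grid.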
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

The results are interesting. Compared with the top 10 predictors for our linear and nonlinear models, there are more biological materials in the top 10: four of the ten, with the process variables holding the other six and still dominating the list. Manufacturing Process 32 is still at the top of the list, but Biological Material 12 in second place is a new addition.

rfImp1 <- varImp(model1, scale = FALSE)
rfImp1 |> arrange(desc(Overall)) |> slice_head(n = 10)
##                          Overall
## ManufacturingProcess32 0.8726530
## BiologicalMaterial12   0.3566173
## ManufacturingProcess13 0.3117625
## ManufacturingProcess36 0.2413422
## ManufacturingProcess31 0.2227998
## ManufacturingProcess09 0.2187102
## BiologicalMaterial11   0.1996627
## BiologicalMaterial06   0.1651863
## ManufacturingProcess17 0.1643580
## BiologicalMaterial03   0.1636269
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

The column names in our data set are too verbose to visualize in a tree. I’m going to rename the columns, retrain our model, and then visualize a single tree from the forest.

This tree is so big that it is a challenge to interpret. I think it’s interesting that, with all the previous focus we have had on the manufacturing process, one of the first splits is based on Biological Material 12. This would reshape my understanding of product yield: the most essential split I am making is based on a biological predictor and not on my process.

names(train) <- gsub("ManufacturingProcess", "MP", names(train), fixed = TRUE)
names(train) <- gsub("BiologicalMaterial", "BM", names(train), fixed = TRUE)

train <- round(train, 2)

model1 <- randomForest(yield_train ~ .,
                       data = train,
                       importance = TRUE,
                       ntree = 1000,
                       keep.forest = TRUE)
library(reprtree)
## Loading required package: tree
## Loading required package: plotrix
## Registered S3 method overwritten by 'reprtree':
##   method    from
##   text.tree tree
reprtree:::plot.getTree(model1)
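A tree taken from a random forest is grown deep and without pruning, which is part of why the plot above is hard to read. As an alternative sketch for part (c), a single conditional inference tree from the party package stops splitting on its own and, for a numeric outcome, draws a boxplot of yield in each terminal node. This is an assumption about what would be useful here, not the model fitted above: it uses the full mean-imputed data set rather than the training split, and the names cmp and single_tree are illustrative.

```r
library(AppliedPredictiveModeling)
library(party)

data(ChemicalManufacturingProcess)
cmp <- ChemicalManufacturingProcess

# Mean-impute each column, mirroring the na.aggregate step
cmp[] <- lapply(cmp, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})

# Shorten the column names so the node labels stay readable
names(cmp) <- sub("ManufacturingProcess", "MP", names(cmp), fixed = TRUE)
names(cmp) <- sub("BiologicalMaterial", "BM", names(cmp), fixed = TRUE)

# Fit one conditional inference tree with default stopping rules
single_tree <- ctree(Yield ~ ., data = cmp)

# Terminal nodes show the distribution of Yield as boxplots
plot(single_tree)
```

Reading the boxplots from left to right shows which combinations of splits are associated with higher or lower yield.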