Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.62743275
## V2 6.27437240
## V3 0.72305459
## V4 7.50258584
## V5 2.13575650
## V6 0.12395003
## V7 0.02927888
## V8 -0.11724317
## V9 -0.10344797
## V10 0.04312556
Did the random forest model significantly use the uninformative predictors (V6–V10)?
No, the random forest model did not make significant use of the uninformative predictors (V6–V10). Those variables received the lowest importance scores, all close to zero (none above about 0.13, and several negative), so the model did not treat any of them as meaningfully important.
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9485201
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 6.774034589
## V2 6.426340527
## V3 0.613805379
## V4 7.135941576
## V5 2.135242904
## V6 0.171933358
## V7 0.142238552
## V8 -0.073192083
## V9 -0.098719872
## V10 -0.009701234
## duplicate1 3.084990840
When we add a highly correlated predictor, the importance score for V1 drops from about 8.6 to 6.8, and the new predictor duplicate1 receives roughly half of V1's importance (about 3.1). The importance of the other informative predictors changes only slightly (V4 drops a little, while V2 and V5 are essentially unchanged). If we added another predictor that is also highly correlated with V1, we would expect V1's importance to be diluted even further, with the importance split across the correlated copies; a sketch of that second step is shown below.
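A minimal sketch (not run here) of adding a second predictor highly correlated with V1; a copy of the data frame is used so the later sections are unaffected.
simulated2 <- simulated
simulated2$duplicate2 <- simulated2$V1 + rnorm(200) * .1
cor(simulated2$duplicate2, simulated2$V1)
# Refit and compare: V1's importance should be diluted further, shared
# across V1, duplicate1, and duplicate2.
model3 <- randomForest(y ~ ., data = simulated2, importance = TRUE, ntree = 1000)
varImp(model3, scale = FALSE)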
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
cforest_model <- cforest(y ~ ., data = simulated, controls = cforest_unbiased(mtry = 2, ntree = 500))
# Conditional = FALSE: standard permutation importance
vi_standard <- varimp(cforest_model, conditional = FALSE)
# Conditional = TRUE: conditional permutation importance (controls for correlated predictors)
vi_conditional <- varimp(cforest_model, conditional = TRUE)
vi_standard
## V1 V2 V3 V4 V5 V6
## 4.88772588 4.26038443 0.26379163 4.84561394 1.36600235 -0.07457394
## V7 V8 V9 V10 duplicate1
## 0.02843078 -0.15398300 -0.01850278 -0.18602640 2.88439981
vi_conditional
## V1 V2 V3 V4 V5 V6
## 2.13845554 3.20962005 0.13958751 3.29907289 0.90217642 -0.03664880
## V7 V8 V9 V10 duplicate1
## -0.04403149 -0.08743505 -0.01021958 -0.07130713 1.19903719
rfImp2
## Overall
## V1 6.774034589
## V2 6.426340527
## V3 0.613805379
## V4 7.135941576
## V5 2.135242904
## V6 0.171933358
## V7 0.142238552
## V8 -0.073192083
## V9 -0.098719872
## V10 -0.009701234
## duplicate1 3.084990840
The patterns are broadly similar: all three measures rank V4, V1, V2, and duplicate1 highly. The conditional importance, however, accounts for correlations between predictors and penalizes redundant ones. In the traditional random forest and the unconditional cforest measures, duplicate1 has fairly high importance (about 3.1 and 2.9, respectively), but under the conditional measure it drops to about 1.2, because duplicate1 only looks important through its correlation with V1. V1's own importance drops as well (from about 4.9 to 2.1), reflecting the same adjustment. The decrease in duplicate1's importance shows the effect of the conditional measure clearly.
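A small sketch to line up the three importance measures side by side, assuming the rfImp2, vi_standard, and vi_conditional objects created above.
data.frame(rf = rfImp2$Overall,
           cforest = vi_standard[rownames(rfImp2)],
           cforest_conditional = vi_conditional[rownames(rfImp2)],
           row.names = rownames(rfImp2))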
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Try a boosted model first.
library(gbm)
gbm_model <- train(y ~ ., data = simulated, method = "gbm", trControl = trainControl(method = "cv", number = 5), verbose = FALSE)
# Extract variable importance (the conditional argument is specific to cforest,
# so it is not needed here)
gbm_imp_standard <- varImp(gbm_model)$importance
gbm_imp_standard
## Overall
## V1 72.5245681
## V2 67.7914214
## V3 24.0023527
## V4 100.0000000
## V5 38.0200868
## V6 2.4421133
## V7 3.1574042
## V8 0.7099166
## V9 0.3748497
## V10 0.0000000
## duplicate1 12.0687168
For the GBM the same pattern appears: V4, V1, and V2 are the most important predictors, the uninformative predictors (V6–V10) are near zero, and duplicate1 again draws some importance away from V1.
Now let's try Cubist.
library(Cubist)
cubist_model <- train(y ~ ., data = simulated, method = "cubist", trControl = trainControl(method = "cv", number = 5))
cubist_imp <- varImp(cubist_model)$importance
cubist_imp$Variable <- rownames(cubist_imp)
cubist_imp
## Overall Variable
## V1 100.000000 V1
## V3 54.166667 V3
## V2 84.027778 V2
## V4 65.277778 V4
## V5 51.388889 V5
## V6 14.583333 V6
## V8 4.166667 V8
## duplicate1 3.472222 duplicate1
## V7 0.000000 V7
## V9 0.000000 V9
## V10 0.000000 V10
The Cubist model follows the same pattern, with V1, V2, and V4 being the most important and V6–V10 contributing little, although it is surprising to see V6 at around 14.6, which is higher than expected.
Use a simulation to show tree bias with different granularities.
Generate a simulated data set with four predictors: X1 is normal with mean 500 and SD 10, X2 is uniform between 50 and 300, X3 is normal with mean 5 and SD 2, and X4 is X1 plus noise.
set.seed(123)
# Predictors
X1 <- rnorm(1000, 500, 10)
X2 <- runif(1000, 50, 300)
X3 <- rnorm(1000, 5, 2)
X4 <- X1 + rnorm(1000, 500, 50)
# Response: a function of X1 only (the noise vector must match the 1000
# predictor values in length to avoid silent recycling)
Y <- X1 + rnorm(1000, 250, 10)
df_sim <- data.frame(Y, X1, X2, X3, X4)
library(rpart)
cart_model <- rpart(Y ~ ., data = df_sim)
cart_imp <- varImp(cart_model)
cart_imp
## Overall
## X1 0.98645819
## X2 0.08460103
## X3 0.07810180
## X4 0.20728459
cforest_model <- cforest(Y ~ ., data = df_sim, controls = cforest_unbiased(mtry = 2, ntree = 500))
varImp(cforest_model,conditional = TRUE)
## Overall
## X1 47.75064536
## X2 0.50367189
## X3 0.05330327
## X4 0.07887213
Looking at the results above, X1 is by far the most important predictor in both models, which makes sense because the response is generated directly from X1. X4, which is X1 plus noise, is the second most important variable in the CART model (about 0.21 versus roughly 0.08 for X2 and X3). In the conditional cforest importance, however, X4 drops to almost nothing and falls below X2, because the conditional measure strips away importance from redundant predictors that only look useful through their correlation with another predictor, which is exactly what X4 is.
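To tie this back to the granularity question more directly, here is a minimal sketch (assuming the rpart package loaded above) with two predictors that are both unrelated to the response but differ in granularity; CART tends to favor the predictor with more distinct values.
set.seed(456)
Z1 <- rnorm(1000)                        # fine granularity: ~1000 distinct values
Z2 <- sample(1:4, 1000, replace = TRUE)  # coarse granularity: only 4 distinct values
noise_y <- rnorm(1000)                   # response unrelated to either predictor
bias_df <- data.frame(noise_y, Z1, Z2)
bias_tree <- rpart(noise_y ~ ., data = bias_df,
                   control = rpart.control(cp = 0, maxdepth = 5))
bias_tree$variable.importance  # Z1 typically accumulates far more importance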
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The model on the right focuses its importance on just the first few predictors because both the bagging fraction and the learning rate are high. With a high bagging fraction, each tree is fit to a large portion of the training data, so every tree sees essentially the same information and identifies the same strong predictors. With a high learning rate, each tree's contribution to the ensemble is large, so the model commits to those strong predictors quickly and later trees add little beyond them.
The model on the left samples a much smaller portion of the data for each tree and uses a low learning rate, so the trees see different subsets of the data and each contributes only a small correction. The ensemble therefore takes many small steps, explores different parts of the data, and spreads importance across more predictors.
In short, the model on the right concentrates on a few strong predictors and is more prone to overfitting them, while the model on the left is forced to explore more predictors and distributes importance more evenly. A sketch of these two settings on the solubility data follows below.
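A hedged sketch of how the two extreme settings could be refit, assuming the solubility data objects (solTrainXtrans, solTrainY) from the AppliedPredictiveModeling package; the importance summaries should show the concentration described above.
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
sol_df <- data.frame(solTrainXtrans, Solubility = solTrainY)
# Left-hand settings of Fig. 8.24: bagging fraction = 0.1, learning rate = 0.1
gbm_left <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
                n.trees = 100, bag.fraction = 0.1, shrinkage = 0.1)
# Right-hand settings: bagging fraction = 0.9, learning rate = 0.9
gbm_right <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
                 n.trees = 100, bag.fraction = 0.9, shrinkage = 0.9)
summary(gbm_left, plotit = FALSE)   # importance spread across many predictors
summary(gbm_right, plotit = FALSE)  # importance concentrated in the top few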
Which model do you think would be more predictive of other samples?
I think the model on the left (bagging fraction and learning rate of 0.1) would be more predictive of other samples because it was forced to explore more predictors and spread the importance across them. The model on the right, with the high bagging fraction and high learning rate, is prone to overfitting a few strong predictors and may not perform as well on new samples.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig.8.24?
Increasing the interaction depth would flatten the slope of predictor importance for both models. Deeper trees allow more splits per tree, so more predictors get a chance to contribute, spreading importance across a larger set of variables. The flattening would likely be more noticeable for the model on the left, which already spreads importance because of its low bagging fraction and learning rate, whereas the model on the right may change less because its strongest predictors would still dominate the early, high-impact splits. A sketch of this check appears below.
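To check the interaction-depth effect, the right-hand sketch above could be refit with deeper trees (again just a sketch, reusing sol_df from the previous block).
gbm_right_deep <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
                      n.trees = 100, bag.fraction = 0.9, shrinkage = 0.9,
                      interaction.depth = 10)
summary(gbm_right_deep, plotit = FALSE)  # importance should spread to more predictors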
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Which tree-based regression model gives the optimal resampling and test set performance?
Load in the data and clean it
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
Impute missing values
# Create a pre-processing object to apply KNN imputation
# (caret's knnImpute also centers and scales all columns, so the imputed
# predictors and Yield end up on a standardized scale)
preProc <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
# Apply the imputation to the data
imputed_data <- predict(preProc, ChemicalManufacturingProcess)
library(dplyr)
set.seed(123)
train_index <- createDataPartition(imputed_data$Yield, p = 0.8, list = FALSE)
train_data <- imputed_data[train_index, ]
test_data <- imputed_data[-train_index, ]
X_train <- train_data %>%
select(-Yield)
Y_train <- train_data$Yield
x_test <- test_data %>%
select(-Yield)
y_test <- test_data$Yield
train_control <- trainControl(method = "cv", number = 10)
Fit Models
cart_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.005))
cart_model <- train( Yield ~ ., data = train_data, method = "rpart", trControl = train_control, tuneGrid = cart_grid)
cart_model$results
## cp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 0.001 0.7813111 0.4884111 0.5827292 0.2323160 0.2031939 0.1527980
## 2 0.006 0.7813111 0.4884111 0.5827292 0.2323160 0.2031939 0.1527980
## 3 0.011 0.7782476 0.4851441 0.5887368 0.2330517 0.2026556 0.1561120
## 4 0.016 0.7743975 0.4896178 0.5866205 0.2279804 0.1850005 0.1563178
## 5 0.021 0.7744553 0.4871144 0.5820834 0.2236935 0.1896188 0.1461648
## 6 0.026 0.7825698 0.4666704 0.5880287 0.2144527 0.1779456 0.1396202
## 7 0.031 0.7899669 0.4637394 0.5974589 0.2117703 0.1751609 0.1311342
## 8 0.036 0.7907492 0.4623011 0.6007929 0.2136451 0.1778936 0.1355622
## 9 0.041 0.8145829 0.4361676 0.6326022 0.2105704 0.1655689 0.1420356
## 10 0.046 0.7969053 0.4472866 0.6281677 0.2134937 0.1705487 0.1569650
## 11 0.051 0.7871051 0.4623219 0.6240567 0.2062825 0.1698932 0.1530901
## 12 0.056 0.7820799 0.4672019 0.6174642 0.2128024 0.1773442 0.1613884
## 13 0.061 0.8079349 0.4409841 0.6401686 0.2212127 0.1765677 0.1677303
## 14 0.066 0.8133496 0.4289846 0.6485062 0.2055559 0.1439758 0.1499052
## 15 0.071 0.8186225 0.4248306 0.6531754 0.2013650 0.1519766 0.1449756
## 16 0.076 0.8009315 0.4238973 0.6499746 0.1767501 0.1446504 0.1297180
## 17 0.081 0.8153200 0.4025292 0.6606076 0.1939710 0.1697826 0.1400331
## 18 0.086 0.8097735 0.4025745 0.6475950 0.2177559 0.1941403 0.1482161
## 19 0.091 0.8082416 0.4021179 0.6460520 0.2148436 0.1950266 0.1467949
## 20 0.096 0.8049685 0.4120894 0.6487962 0.2249105 0.2159824 0.1556594
cart_model$bestTune
## cp
## 4 0.016
The best CART model uses cp = 0.016, giving a cross-validated RMSE of about 0.774.
rf_grid <- expand.grid(mtry = seq(2, ncol(train_data) - 1, by = 2))
rf_model <- train( Yield ~ ., data = train_data, method = "rf", trControl = train_control, tuneGrid = rf_grid)
rf_model$results
## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 2 0.6647108 0.6170756 0.5298046 0.1924333 0.2101954 0.1220010
## 2 4 0.6329151 0.6377918 0.5031495 0.2003447 0.2114336 0.1337663
## 3 6 0.6234007 0.6401338 0.4963178 0.2094614 0.2146643 0.1381301
## 4 8 0.6149989 0.6430567 0.4933024 0.2163277 0.2224814 0.1460229
## 5 10 0.6063040 0.6499405 0.4809787 0.2159995 0.2223692 0.1436248
## 6 12 0.6043247 0.6507022 0.4792980 0.2179135 0.2148121 0.1470091
## 7 14 0.5995312 0.6605086 0.4759586 0.2132600 0.2120453 0.1440215
## 8 16 0.6010140 0.6518358 0.4771918 0.2186478 0.2125606 0.1478376
## 9 18 0.6037117 0.6462688 0.4781628 0.2205801 0.2185438 0.1492423
## 10 20 0.5987741 0.6519650 0.4742222 0.2210815 0.2163074 0.1498532
## 11 22 0.6000498 0.6500945 0.4740219 0.2147167 0.2122590 0.1439288
## 12 24 0.5967707 0.6529065 0.4726573 0.2184349 0.2094549 0.1485644
## 13 26 0.5930400 0.6514006 0.4680033 0.2261438 0.2215060 0.1535388
## 14 28 0.5928316 0.6543010 0.4657563 0.2243283 0.2193681 0.1514911
## 15 30 0.5946242 0.6490275 0.4680762 0.2255124 0.2160562 0.1523174
## 16 32 0.5973824 0.6436347 0.4688344 0.2277198 0.2189048 0.1532000
## 17 34 0.5994237 0.6416863 0.4688036 0.2277011 0.2153068 0.1534835
## 18 36 0.5921494 0.6490422 0.4639692 0.2309897 0.2186465 0.1545614
## 19 38 0.5910163 0.6465660 0.4627557 0.2261759 0.2169202 0.1516395
## 20 40 0.5910284 0.6509026 0.4630451 0.2236190 0.2082423 0.1468849
## 21 42 0.5898556 0.6521454 0.4612157 0.2213279 0.2011252 0.1447405
## 22 44 0.5955598 0.6439432 0.4650698 0.2209907 0.2047364 0.1475274
## 23 46 0.5870762 0.6517545 0.4600810 0.2259972 0.2120677 0.1504442
## 24 48 0.5916755 0.6454745 0.4601160 0.2267792 0.2085834 0.1489553
## 25 50 0.5917448 0.6443899 0.4598028 0.2236107 0.2089107 0.1471884
## 26 52 0.5933610 0.6427798 0.4597137 0.2239722 0.2105642 0.1467469
## 27 54 0.5948514 0.6385155 0.4627169 0.2277840 0.2147819 0.1492559
## 28 56 0.5956152 0.6388303 0.4620724 0.2249647 0.2101621 0.1438432
rf_model$bestTune
## mtry
## 23 46
The best random forest model uses mtry = 46, giving a cross-validated RMSE of about 0.587.
gbm_grid <- expand.grid( n.trees = seq(100, 1000, by = 100), interaction.depth = c(1, 3, 5, 7), shrinkage = c(0.01, 0.1), n.minobsinnode = 10)
gbm_model <- train( Yield ~ ., data = train_data, method = "gbm", trControl = train_control, tuneGrid = gbm_grid, verbose = FALSE)
gbm_model$results
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 1 0.01 1 10 100 0.7809983 0.5760340
## 41 0.10 1 10 100 0.6640866 0.5908073
## 11 0.01 3 10 100 0.7244056 0.6172035
## 51 0.10 3 10 100 0.6505848 0.6144675
## 21 0.01 5 10 100 0.7206939 0.6180078
## 61 0.10 5 10 100 0.6503624 0.6299850
## 31 0.01 7 10 100 0.7171311 0.6340682
## 71 0.10 7 10 100 0.6179723 0.6544190
## 2 0.01 1 10 200 0.6985794 0.6049695
## 42 0.10 1 10 200 0.6598598 0.6053577
## 12 0.01 3 10 200 0.6557285 0.6359397
## 52 0.10 3 10 200 0.6354595 0.6312861
## 22 0.01 5 10 200 0.6539334 0.6380062
## 62 0.10 5 10 200 0.6435069 0.6296736
## 32 0.01 7 10 200 0.6492588 0.6481695
## 72 0.10 7 10 200 0.6167320 0.6553145
## 3 0.01 1 10 300 0.6644312 0.6136083
## 43 0.10 1 10 300 0.6664763 0.5985661
## 13 0.01 3 10 300 0.6369965 0.6408581
## 53 0.10 3 10 300 0.6383337 0.6282624
## 23 0.01 5 10 300 0.6338418 0.6434588
## 63 0.10 5 10 300 0.6429333 0.6298669
## 33 0.01 7 10 300 0.6316618 0.6496956
## 73 0.10 7 10 300 0.6161145 0.6551645
## 4 0.01 1 10 400 0.6518935 0.6192693
## 44 0.10 1 10 400 0.6629387 0.6041118
## 14 0.01 3 10 400 0.6323506 0.6411671
## 54 0.10 3 10 400 0.6358126 0.6311411
## 24 0.01 5 10 400 0.6252358 0.6485216
## 64 0.10 5 10 400 0.6413786 0.6310735
## 34 0.01 7 10 400 0.6242246 0.6541044
## 74 0.10 7 10 400 0.6149523 0.6560486
## 5 0.01 1 10 500 0.6480774 0.6209010
## 45 0.10 1 10 500 0.6607846 0.6053460
## 15 0.01 3 10 500 0.6293832 0.6437873
## 55 0.10 3 10 500 0.6356505 0.6315482
## 25 0.01 5 10 500 0.6220769 0.6520887
## 65 0.10 5 10 500 0.6417018 0.6306456
## 35 0.01 7 10 500 0.6202433 0.6569844
## 75 0.10 7 10 500 0.6148183 0.6560146
## 6 0.01 1 10 600 0.6449296 0.6214355
## 46 0.10 1 10 600 0.6552671 0.6103385
## 16 0.01 3 10 600 0.6274014 0.6460574
## 56 0.10 3 10 600 0.6348087 0.6326545
## 26 0.01 5 10 600 0.6233996 0.6484248
## 66 0.10 5 10 600 0.6401864 0.6322755
## 36 0.01 7 10 600 0.6189423 0.6591177
## 76 0.10 7 10 600 0.6142960 0.6566133
## 7 0.01 1 10 700 0.6439270 0.6223229
## 47 0.10 1 10 700 0.6523919 0.6124403
## 17 0.01 3 10 700 0.6272419 0.6472043
## 57 0.10 3 10 700 0.6345688 0.6329886
## 27 0.01 5 10 700 0.6225233 0.6492774
## 67 0.10 5 10 700 0.6400327 0.6325948
## 37 0.01 7 10 700 0.6173250 0.6590776
## 77 0.10 7 10 700 0.6150070 0.6561211
## 8 0.01 1 10 800 0.6432554 0.6245864
## 48 0.10 1 10 800 0.6530960 0.6138263
## 18 0.01 3 10 800 0.6271387 0.6457687
## 58 0.10 3 10 800 0.6345305 0.6332610
## 28 0.01 5 10 800 0.6199827 0.6516347
## 68 0.10 5 10 800 0.6401342 0.6324348
## 38 0.01 7 10 800 0.6162280 0.6594652
## 78 0.10 7 10 800 0.6149197 0.6562806
## 9 0.01 1 10 900 0.6441928 0.6225042
## 49 0.10 1 10 900 0.6506646 0.6154692
## 19 0.01 3 10 900 0.6268415 0.6450356
## 59 0.10 3 10 900 0.6344205 0.6332638
## 29 0.01 5 10 900 0.6200692 0.6516683
## 69 0.10 5 10 900 0.6401205 0.6324678
## 39 0.01 7 10 900 0.6158605 0.6582995
## 79 0.10 7 10 900 0.6150066 0.6562628
## 10 0.01 1 10 1000 0.6447872 0.6204721
## 50 0.10 1 10 1000 0.6463170 0.6201205
## 20 0.01 3 10 1000 0.6241034 0.6468253
## 60 0.10 3 10 1000 0.6344159 0.6333666
## 30 0.01 5 10 1000 0.6196183 0.6525847
## 70 0.10 5 10 1000 0.6400592 0.6326501
## 40 0.01 7 10 1000 0.6157383 0.6582110
## 80 0.10 7 10 1000 0.6151606 0.6561689
## MAE RMSESD RsquaredSD MAESD
## 1 0.6285084 0.16443036 0.1516553 0.12016847
## 41 0.5076282 0.08990609 0.1509132 0.06842708
## 11 0.5759036 0.15946482 0.1363936 0.10992754
## 51 0.4943505 0.10002350 0.1505903 0.09948099
## 21 0.5694309 0.15550746 0.1397332 0.10841831
## 61 0.4977806 0.10152327 0.1289772 0.08664998
## 31 0.5731071 0.15881993 0.1165412 0.10824970
## 71 0.4747823 0.10819267 0.1060200 0.07173174
## 2 0.5571243 0.14925811 0.1264148 0.11164989
## 42 0.5038580 0.09534245 0.1462850 0.07718247
## 12 0.5137325 0.14060129 0.1237024 0.09982726
## 52 0.4846764 0.10398154 0.1551558 0.10363851
## 22 0.5121446 0.13990199 0.1335460 0.09933155
## 62 0.4985714 0.10743631 0.1478276 0.09109815
## 32 0.5117614 0.13733853 0.1156166 0.09640033
## 72 0.4779749 0.10696435 0.1079221 0.07013431
## 3 0.5222184 0.13445892 0.1207568 0.09964122
## 43 0.5138576 0.11457813 0.1672611 0.09202467
## 13 0.4937927 0.12330602 0.1295427 0.09243104
## 53 0.4903207 0.11104415 0.1588332 0.10962013
## 23 0.4936370 0.12303165 0.1331330 0.09114316
## 63 0.4981790 0.11364234 0.1501363 0.09204837
## 33 0.4911497 0.11796862 0.1202374 0.08747142
## 73 0.4795944 0.10820741 0.1144669 0.07397762
## 4 0.5068306 0.12017286 0.1222800 0.09000111
## 44 0.5084483 0.12488617 0.1680406 0.09260838
## 14 0.4871839 0.11241919 0.1357312 0.08562905
## 54 0.4896592 0.11593248 0.1603144 0.11214187
## 24 0.4821500 0.10822822 0.1312298 0.08290215
## 64 0.4973508 0.11641892 0.1510962 0.09492109
## 34 0.4803781 0.10290082 0.1208292 0.08112290
## 74 0.4799144 0.10726902 0.1148804 0.07274370
## 5 0.5017326 0.10839236 0.1250390 0.08063896
## 45 0.5064897 0.13155826 0.1729824 0.09938307
## 15 0.4835880 0.10282998 0.1367351 0.07972333
## 55 0.4900969 0.11663636 0.1613866 0.11182817
## 25 0.4770692 0.09914728 0.1322074 0.08048758
## 65 0.4979276 0.11751456 0.1520895 0.09620910
## 35 0.4763950 0.09756924 0.1189086 0.07650617
## 75 0.4804050 0.10834247 0.1157598 0.07364424
## 6 0.4968657 0.10055680 0.1267186 0.07778481
## 46 0.5067756 0.12932843 0.1685978 0.09652661
## 16 0.4822654 0.09709117 0.1348243 0.07463410
## 56 0.4899324 0.11802488 0.1617016 0.11278947
## 26 0.4762391 0.09562225 0.1343885 0.07999505
## 66 0.4967783 0.11911588 0.1529129 0.09756950
## 36 0.4744400 0.09056611 0.1160942 0.07283020
## 76 0.4801829 0.10935617 0.1165642 0.07455271
## 7 0.4932830 0.09316419 0.1287774 0.07306938
## 47 0.5031826 0.13401398 0.1708754 0.09860558
## 17 0.4822756 0.09529484 0.1372832 0.07264736
## 57 0.4899748 0.11834597 0.1616426 0.11300202
## 27 0.4747810 0.09225400 0.1358916 0.07785425
## 67 0.4966653 0.11989727 0.1529012 0.09839046
## 37 0.4727232 0.08735079 0.1175702 0.07100221
## 77 0.4807565 0.10996055 0.1172351 0.07535395
## 8 0.4909691 0.09063532 0.1324369 0.07133621
## 48 0.5021194 0.13211170 0.1675897 0.09846761
## 18 0.4820657 0.09425827 0.1412937 0.07272520
## 58 0.4901549 0.11830500 0.1612524 0.11279943
## 28 0.4732901 0.09140975 0.1353172 0.07708045
## 68 0.4965069 0.12047352 0.1531855 0.09887541
## 38 0.4717629 0.08701562 0.1183873 0.06943626
## 78 0.4806449 0.11018409 0.1173793 0.07552141
## 9 0.4909784 0.08695043 0.1350637 0.06926181
## 49 0.5007653 0.13021601 0.1639124 0.09811168
## 19 0.4832094 0.09386598 0.1414543 0.07262097
## 59 0.4900049 0.11860580 0.1616478 0.11301262
## 29 0.4740556 0.09173110 0.1357893 0.07720748
## 69 0.4964976 0.12045783 0.1531325 0.09902283
## 39 0.4710362 0.08977628 0.1219452 0.07265605
## 79 0.4808125 0.11022117 0.1173345 0.07559880
## 10 0.4910317 0.08527586 0.1388171 0.06996667
## 50 0.4969537 0.13326596 0.1659476 0.09992424
## 20 0.4827440 0.09554731 0.1427071 0.07471998
## 60 0.4900608 0.11880256 0.1617082 0.11320096
## 30 0.4747160 0.09100641 0.1343966 0.07648370
## 70 0.4964336 0.12038356 0.1529517 0.09903179
## 40 0.4704661 0.08831407 0.1221065 0.07175592
## 80 0.4808572 0.11021003 0.1174975 0.07566336
gbm_model$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 76 600 7 0.1 10
The best GBM model uses 600 trees, an interaction depth of 7, shrinkage of 0.1, and n.minobsinnode of 10, giving a cross-validated RMSE of about 0.614.
cforest_grid <- expand.grid(mtry = seq(2, ncol(train_data) - 1, by = 2))
cforest_model <- train( Yield ~ ., data = train_data, method = "cforest", trControl = train_control, tuneGrid = cforest_grid)
cforest_model$results
## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 2 0.7734728 0.5594844 0.6289582 0.1837761 0.1714856 0.1333175
## 2 4 0.6997209 0.5994145 0.5654223 0.1742697 0.1763820 0.1308662
## 3 6 0.6775019 0.6109421 0.5415066 0.1711173 0.1768761 0.1269020
## 4 8 0.6647450 0.6168151 0.5276584 0.1718698 0.1762917 0.1259805
## 5 10 0.6554215 0.6253686 0.5207841 0.1757410 0.1878354 0.1290685
## 6 12 0.6515951 0.6225941 0.5156571 0.1784861 0.1860254 0.1301737
## 7 14 0.6489156 0.6246828 0.5121425 0.1710228 0.1841013 0.1231531
## 8 16 0.6428756 0.6284961 0.5046522 0.1693079 0.1870171 0.1270090
## 9 18 0.6423794 0.6272220 0.5047527 0.1729098 0.1887565 0.1241703
## 10 20 0.6428234 0.6236857 0.4999498 0.1736440 0.1881186 0.1252323
## 11 22 0.6411323 0.6256643 0.5026311 0.1773223 0.1845369 0.1277351
## 12 24 0.6381360 0.6264726 0.4979470 0.1750743 0.1913018 0.1253593
## 13 26 0.6392536 0.6271483 0.4997648 0.1711606 0.1846612 0.1261184
## 14 28 0.6428853 0.6192072 0.5009162 0.1774025 0.1882015 0.1250092
## 15 30 0.6434630 0.6191341 0.5011892 0.1746963 0.1892417 0.1248982
## 16 32 0.6436844 0.6172850 0.4993313 0.1765668 0.1910195 0.1241853
## 17 34 0.6423142 0.6162902 0.5000669 0.1798426 0.1928890 0.1285508
## 18 36 0.6435443 0.6145516 0.4972752 0.1776916 0.1890725 0.1262068
## 19 38 0.6426027 0.6163303 0.4974580 0.1730651 0.1938310 0.1205295
## 20 40 0.6458635 0.6134709 0.4993183 0.1766333 0.1956111 0.1224962
## 21 42 0.6439463 0.6143831 0.4979921 0.1774857 0.1899462 0.1239965
## 22 44 0.6435595 0.6163993 0.4991125 0.1754255 0.2026876 0.1219318
## 23 46 0.6482714 0.6073737 0.5021270 0.1790927 0.1998760 0.1245366
## 24 48 0.6459890 0.6088103 0.4983062 0.1778831 0.1939567 0.1237126
## 25 50 0.6492002 0.6051626 0.5020355 0.1813599 0.2005273 0.1247196
## 26 52 0.6477088 0.6085115 0.4976728 0.1802608 0.1954737 0.1226758
## 27 54 0.6486575 0.6055660 0.5000974 0.1824987 0.2066237 0.1242330
## 28 56 0.6499689 0.6022810 0.5009551 0.1836456 0.2043052 0.1247678
cforest_model$bestTune
## mtry
## 12 24
The best cforest model uses mtry = 24, giving a cross-validated RMSE of about 0.638.
cubist_grid <- expand.grid(committees = c(1, 5, 10, 50), neighbors = c(0, 5, 9))
cubist_model <- train(Yield ~ ., data = train_data, method = "cubist", trControl = train_control, tuneGrid = cubist_grid)
cubist_model$results
## committees neighbors RMSE Rsquared MAE RMSESD RsquaredSD
## 1 1 0 0.9706433 0.5013836 0.6401350 0.6343657 0.2571346
## 2 1 5 0.9161788 0.5678770 0.5810554 0.6385652 0.2528166
## 3 1 9 0.9421657 0.5405531 0.5945510 0.6393674 0.2526094
## 4 5 0 0.7261803 0.5366424 0.5660007 0.1721344 0.1570068
## 5 5 5 0.6614559 0.6229593 0.4960204 0.1647036 0.1280034
## 6 5 9 0.6879967 0.5920203 0.5172612 0.1807471 0.1400315
## 7 10 0 0.7190147 0.5178746 0.5677932 0.1456933 0.1480944
## 8 10 5 0.6600953 0.5989722 0.4990179 0.1374905 0.1248024
## 9 10 9 0.6832044 0.5699978 0.5194153 0.1487794 0.1259344
## 10 50 0 0.6637612 0.5838547 0.5154338 0.1439223 0.1555719
## 11 50 5 0.6182942 0.6465053 0.4628700 0.1325178 0.1259115
## 12 50 9 0.6359563 0.6242026 0.4746983 0.1383566 0.1263256
## MAESD
## 1 0.24681456
## 2 0.24119594
## 3 0.23529875
## 4 0.13910000
## 5 0.12744630
## 6 0.14297280
## 7 0.12435678
## 8 0.10718378
## 9 0.12410651
## 10 0.11225113
## 11 0.08845716
## 12 0.10428653
cubist_model$bestTune
## committees neighbors
## 11 50 5
The best Cubist model uses 50 committees and 5 neighbors, giving a cross-validated RMSE of about 0.618.
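As a sanity check on the resampling comparison, caret's resamples() can line up the cross-validation results of the five models side by side (a sketch; for a strictly fair comparison the models should share the same CV folds, for example by setting the same seed or supplying a common index to trainControl before each call to train).
res <- resamples(list(CART = cart_model, RF = rf_model, cforest = cforest_model,
                      GBM = gbm_model, Cubist = cubist_model))
summary(res)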
Now let's calculate the test set RMSE for each tuned model to compare them.
# Predict on test set
cart_pred <- predict(cart_model, test_data)
rf_pred <- predict(rf_model, test_data)
cforest_pred <- predict(cforest_model, test_data)
gbm_pred <- predict(gbm_model, test_data)
cubist_pred <- predict(cubist_model, test_data)
# Compute RMSE for each
cart_rmse <- RMSE(cart_pred, test_data$Yield)
rf_rmse <- RMSE(rf_pred, test_data$Yield)
cforest_rmse <- RMSE(cforest_pred, test_data$Yield)
gbm_rmse <- RMSE(gbm_pred, test_data$Yield)
cubist_rmse <- RMSE(cubist_pred, test_data$Yield)
data.frame(
Model = c("CART", "Random Forest", "cforest", "GBM", "Cubist"),
RMSE = c(cart_rmse, rf_rmse, cforest_rmse, gbm_rmse, cubist_rmse)
)
## Model RMSE
## 1 CART 0.9632089
## 2 Random Forest 0.7054907
## 3 cforest 0.7193889
## 4 GBM 0.6054759
## 5 Cubist 0.5002221
The Cubist model appears to be the best: it gives the lowest test set RMSE by a clear margin, and its cross-validated RMSE is competitive with the best of the other models.
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
varImp(cubist_model)
## cubist variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100
## ManufacturingProcess17 61
## BiologicalMaterial06 36
## ManufacturingProcess13 35
## ManufacturingProcess09 35
## BiologicalMaterial02 34
## ManufacturingProcess04 30
## ManufacturingProcess33 29
## BiologicalMaterial12 23
## ManufacturingProcess25 23
## ManufacturingProcess29 20
## BiologicalMaterial03 19
## ManufacturingProcess26 17
## ManufacturingProcess37 17
## BiologicalMaterial04 14
## ManufacturingProcess14 13
## ManufacturingProcess27 12
## ManufacturingProcess15 12
## BiologicalMaterial08 12
## ManufacturingProcess28 12
The manufacturing process variables dominate the list: 7 of the top 10 predictors are process variables and only 3 are biological. The two most important predictors, ManufacturingProcess32 (100) and ManufacturingProcess17 (61), are both process variables, while the top biological predictor, BiologicalMaterial06, comes in at 36. These top predictors are very similar to those from the optimal linear and nonlinear models: ManufacturingProcess32 and ManufacturingProcess17 rank highly in all of them, and BiologicalMaterial06 is the top biological material in each. The exact ordering varies slightly from model to model, but the set of important predictors is largely the same.
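For a quick visual check of just the top 10 predictors, caret's importance plot can be limited with the top argument (a small sketch using the fitted Cubist model above).
plot(varImp(cubist_model), top = 10)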
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(rpart.plot)
rpart.plot(cart_model$finalModel,type = 4,fallen.leaves = TRUE,tweak = 1.2)
Yes, the tree view does provide additional insight into the process and biological predictors and their relationship with yield. As expected, the first split is on the most important predictor, ManufacturingProcess32. When that value is < 0.19, the next split is on the second most important predictor, ManufacturingProcess17; when it is >= 0.19, the tree instead splits on BiologicalMaterial06, the third most important predictor. This shows how ManufacturingProcess32 interacts with both ManufacturingProcess17 and BiologicalMaterial06. Further down the tree you can see how different combinations of predictor values lead to different outcomes: a sample with ManufacturingProcess32 >= 0.19 and BiologicalMaterial06 >= 0.72 is assigned a (standardized) yield of 1.3, while a sample with the same ManufacturingProcess32 value but BiologicalMaterial06 < 0.72 is split further on ManufacturingProcess23, receiving a yield of 0.099 if that value is >= -0.31 and 1 if it is below -0.31. The tree makes it easy to see how the different combinations of predictor values interact and lead to a predicted yield.
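To see the full distribution of yield within each terminal node (rather than just the node means shown by rpart.plot), one option is to convert the rpart fit with partykit, which draws a boxplot of the response at each leaf (a sketch, assuming the partykit package is installed).
library(partykit)
plot(as.party(cart_model$finalModel))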