Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.62743275
## V2 6.27437240
## V3 0.72305459
## V4 7.50258584
## V5 2.13575650
## V6 0.12395003
## V7 0.02927888
## V8 -0.11724317
## V9 -0.10344797
## V10 0.04312556
Did the random forest model significantly use the uninformative predictors (V6–V10)?
No, the random forest model did not make significant use of the uninformative predictors (V6–V10). Those variables received the lowest importance scores, all close to zero (none above about 0.13, and several negative), so the model did not treat any of them as meaningfully important.
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9485201
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 6.774034589
## V2 6.426340527
## V3 0.613805379
## V4 7.135941576
## V5 2.135242904
## V6 0.171933358
## V7 0.142238552
## V8 -0.073192083
## V9 -0.098719872
## V10 -0.009701234
## duplicate1 3.084990840
When we add a highly correlated predictor, the importance score for V1 drops from about 8.6 to 6.8, and the new predictor duplicate1 receives roughly half of V1's importance (about 3.1). The importance of the other informative predictors changes only slightly (V4 drops a little, while V2 and V5 are essentially unchanged). If we added another predictor that is also highly correlated with V1, we would expect V1's importance to be diluted even further, with the importance split across the correlated copies; a sketch of that second step is shown below.
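A minimal sketch (not run here) of adding a second predictor highly correlated with V1; a copy of the data frame is used so the later sections are unaffected.
simulated2 <- simulated
simulated2$duplicate2 <- simulated2$V1 + rnorm(200) * .1
cor(simulated2$duplicate2, simulated2$V1)
# Refit and compare: V1's importance should be diluted further, shared
# across V1, duplicate1, and duplicate2.
model3 <- randomForest(y ~ ., data = simulated2, importance = TRUE, ntree = 1000)
varImp(model3, scale = FALSE)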
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
cforest_model <- cforest(y ~ ., data = simulated, controls = cforest_unbiased(mtry = 2, ntree = 500))
# Conditional = FALSE: standard permutation importance
vi_standard <- varimp(cforest_model, conditional = FALSE)
# Conditional = TRUE: conditional permutation importance (controls for correlated predictors)
vi_conditional <- varimp(cforest_model, conditional = TRUE)
vi_standard
## V1 V2 V3 V4 V5 V6
## 4.88772588 4.26038443 0.26379163 4.84561394 1.36600235 -0.07457394
## V7 V8 V9 V10 duplicate1
## 0.02843078 -0.15398300 -0.01850278 -0.18602640 2.88439981
vi_conditional
## V1 V2 V3 V4 V5 V6
## 2.13845554 3.20962005 0.13958751 3.29907289 0.90217642 -0.03664880
## V7 V8 V9 V10 duplicate1
## -0.04403149 -0.08743505 -0.01021958 -0.07130713 1.19903719
rfImp2
## Overall
## V1 6.774034589
## V2 6.426340527
## V3 0.613805379
## V4 7.135941576
## V5 2.135242904
## V6 0.171933358
## V7 0.142238552
## V8 -0.073192083
## V9 -0.098719872
## V10 -0.009701234
## duplicate1 3.084990840
The patterns are broadly similar: all three measures rank V4, V1, V2, and duplicate1 highly. The conditional importance, however, accounts for correlations between predictors and penalizes redundant ones. In the traditional random forest and the unconditional cforest measures, duplicate1 has fairly high importance (about 3.1 and 2.9, respectively), but under the conditional measure it drops to about 1.2, because duplicate1 only looks important through its correlation with V1. V1's own importance drops as well (from about 4.9 to 2.1), reflecting the same adjustment. The decrease in duplicate1's importance shows the effect of the conditional measure clearly.
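A small sketch to line up the three importance measures side by side, assuming the rfImp2, vi_standard, and vi_conditional objects created above.
data.frame(rf = rfImp2$Overall,
           cforest = vi_standard[rownames(rfImp2)],
           cforest_conditional = vi_conditional[rownames(rfImp2)],
           row.names = rownames(rfImp2))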
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Try a boosted model first.
library(gbm)
gbm_model <- train(y ~ ., data = simulated, method = "gbm", trControl = trainControl(method = "cv", number = 5), verbose = FALSE)
# Extract variable importance (the conditional argument is specific to cforest,
# so it is not needed here)
gbm_imp_standard <- varImp(gbm_model)$importance
gbm_imp_standard
## Overall
## V1 72.5245681
## V2 67.7914214
## V3 24.0023527
## V4 100.0000000
## V5 38.0200868
## V6 2.4421133
## V7 3.1574042
## V8 0.7099166
## V9 0.3748497
## V10 0.0000000
## duplicate1 12.0687168
For the GBM the same pattern appears: V4, V1, and V2 are the most important predictors, the uninformative predictors (V6–V10) are near zero, and duplicate1 again draws some importance away from V1.
Now let's try Cubist.
library(Cubist)
cubist_model <- train(y ~ ., data = simulated, method = "cubist", trControl = trainControl(method = "cv", number = 5))
cubist_imp <- varImp(cubist_model)$importance
cubist_imp$Variable <- rownames(cubist_imp)
cubist_imp
## Overall Variable
## V1 100.000000 V1
## V3 54.166667 V3
## V2 84.027778 V2
## V4 65.277778 V4
## V5 51.388889 V5
## V6 14.583333 V6
## V8 4.166667 V8
## duplicate1 3.472222 duplicate1
## V7 0.000000 V7
## V9 0.000000 V9
## V10 0.000000 V10
The Cubist model follows the same pattern, with V1, V2, and V4 being the most important and V6–V10 contributing little, although it is surprising to see V6 at around 14.6, which is higher than expected.
Use a simulation to show tree bias with different granularities.
Generate a simulated data set with four predictors: X1 is normal with mean 500 and SD 10, X2 is uniform between 50 and 300, X3 is normal with mean 5 and SD 2, and X4 is X1 plus noise.
set.seed(123)
# Predictors
X1 <- rnorm(1000, 500, 10)
X2 <- runif(1000, 50, 300)
X3 <- rnorm(1000, 5, 2)
X4 <- X1 + rnorm(1000, 500, 50)
# Response: a function of X1 only (the noise vector must match the 1000
# predictor values in length to avoid silent recycling)
Y <- X1 + rnorm(1000, 250, 10)
df_sim <- data.frame(Y, X1, X2, X3, X4)
library(rpart)
cart_model <- rpart(Y ~ ., data = df_sim)
cart_imp <- varImp(cart_model)
cart_imp
## Overall
## X1 0.98645819
## X2 0.08460103
## X3 0.07810180
## X4 0.20728459
cforest_model <- cforest(Y ~ ., data = df_sim, controls = cforest_unbiased(mtry = 2, ntree = 500))
varImp(cforest_model,conditional = TRUE)
## Overall
## X1 47.75064536
## X2 0.50367189
## X3 0.05330327
## X4 0.07887213
Looking at the results above, X1 is by far the most important predictor in both models, which makes sense because the response is generated directly from X1. X4, which is X1 plus noise, is the second most important variable in the CART model (about 0.21 versus roughly 0.08 for X2 and X3). In the conditional cforest importance, however, X4 drops to almost nothing and falls below X2, because the conditional measure strips away importance from redundant predictors that only look useful through their correlation with another predictor, which is exactly what X4 is.
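To tie this back to the granularity question more directly, here is a minimal sketch (assuming the rpart package loaded above) with two predictors that are both unrelated to the response but differ in granularity; CART tends to favor the predictor with more distinct values.
set.seed(456)
Z1 <- rnorm(1000)                        # fine granularity: ~1000 distinct values
Z2 <- sample(1:4, 1000, replace = TRUE)  # coarse granularity: only 4 distinct values
noise_y <- rnorm(1000)                   # response unrelated to either predictor
bias_df <- data.frame(noise_y, Z1, Z2)
bias_tree <- rpart(noise_y ~ ., data = bias_df,
                   control = rpart.control(cp = 0, maxdepth = 5))
bias_tree$variable.importance  # Z1 typically accumulates far more importance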
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The model on the right focuses its importance on just the first few predictors because both the bagging fraction and the learning rate are high. With a high bagging fraction, each tree is fit to a large portion of the training data, so every tree sees essentially the same information and identifies the same strong predictors. With a high learning rate, each tree's contribution to the ensemble is large, so the model commits to those strong predictors quickly and later trees add little beyond them.
The model on the left samples a much smaller portion of the data for each tree and uses a low learning rate, so the trees see different subsets of the data and each contributes only a small correction. The ensemble therefore takes many small steps, explores different parts of the data, and spreads importance across more predictors.
In short, the model on the right concentrates on a few strong predictors and is more prone to overfitting them, while the model on the left is forced to explore more predictors and distributes importance more evenly. A sketch of these two settings on the solubility data follows below.
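A hedged sketch of how the two extreme settings could be refit, assuming the solubility data objects (solTrainXtrans, solTrainY) from the AppliedPredictiveModeling package; the importance summaries should show the concentration described above.
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
sol_df <- data.frame(solTrainXtrans, Solubility = solTrainY)
# Left-hand settings of Fig. 8.24: bagging fraction = 0.1, learning rate = 0.1
gbm_left <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
                n.trees = 100, bag.fraction = 0.1, shrinkage = 0.1)
# Right-hand settings: bagging fraction = 0.9, learning rate = 0.9
gbm_right <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
                 n.trees = 100, bag.fraction = 0.9, shrinkage = 0.9)
summary(gbm_left, plotit = FALSE)   # importance spread across many predictors
summary(gbm_right, plotit = FALSE)  # importance concentrated in the top few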
Which model do you think would be more predictive of other samples?
I think the model on the left (bagging fraction and learning rate of 0.1) would be more predictive of other samples because it was forced to explore more predictors and spread the importance across them. The model on the right, with the high bagging fraction and high learning rate, is prone to overfitting a few strong predictors and may not perform as well on new samples.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig.8.24?
Increasing the interaction depth would flatten the slope of predictor importance for both models. Deeper trees allow more splits per tree, so more predictors get a chance to contribute, spreading importance across a larger set of variables. The flattening would likely be more noticeable for the model on the left, which already spreads importance because of its low bagging fraction and learning rate, whereas the model on the right may change less because its strongest predictors would still dominate the early, high-impact splits. A sketch of this check appears below.
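To check the interaction-depth effect, the right-hand sketch above could be refit with deeper trees (again just a sketch, reusing sol_df from the previous block).
gbm_right_deep <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
                      n.trees = 100, bag.fraction = 0.9, shrinkage = 0.9,
                      interaction.depth = 10)
summary(gbm_right_deep, plotit = FALSE)  # importance should spread to more predictors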
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Which tree-based regression model gives the optimal resampling and test set performance?
Load in the data and clean it
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
Impute missing values
# Create a pre-processing object to apply KNN imputation
# (caret's knnImpute also centers and scales all columns, so the imputed
# predictors and Yield end up on a standardized scale)
preProc <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
# Apply the imputation to the data
imputed_data <- predict(preProc, ChemicalManufacturingProcess)
library(dplyr)
set.seed(123)
train_index <- createDataPartition(imputed_data$Yield, p = 0.8, list = FALSE)
train_data <- imputed_data[train_index, ]
test_data <- imputed_data[-train_index, ]
X_train <- train_data %>%
select(-Yield)
Y_train <- train_data$Yield
x_test <- test_data %>%
select(-Yield)
y_test <- test_data$Yield
train_control <- trainControl(method = "cv", number = 10)
Fit Models
cart_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.005))
cart_model <- train( Yield ~ ., data = train_data, method = "rpart", trControl = train_control, tuneGrid = cart_grid)
cart_model$results
## cp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 0.001 0.7813111 0.4884111 0.5827292 0.2323160 0.2031939 0.1527980
## 2 0.006 0.7813111 0.4884111 0.5827292 0.2323160 0.2031939 0.1527980
## 3 0.011 0.7782476 0.4851441 0.5887368 0.2330517 0.2026556 0.1561120
## 4 0.016 0.7743975 0.4896178 0.5866205 0.2279804 0.1850005 0.1563178
## 5 0.021 0.7744553 0.4871144 0.5820834 0.2236935 0.1896188 0.1461648
## 6 0.026 0.7825698 0.4666704 0.5880287 0.2144527 0.1779456 0.1396202
## 7 0.031 0.7899669 0.4637394 0.5974589 0.2117703 0.1751609 0.1311342
## 8 0.036 0.7907492 0.4623011 0.6007929 0.2136451 0.1778936 0.1355622
## 9 0.041 0.8145829 0.4361676 0.6326022 0.2105704 0.1655689 0.1420356
## 10 0.046 0.7969053 0.4472866 0.6281677 0.2134937 0.1705487 0.1569650
## 11 0.051 0.7871051 0.4623219 0.6240567 0.2062825 0.1698932 0.1530901
## 12 0.056 0.7820799 0.4672019 0.6174642 0.2128024 0.1773442 0.1613884
## 13 0.061 0.8079349 0.4409841 0.6401686 0.2212127 0.1765677 0.1677303
## 14 0.066 0.8133496 0.4289846 0.6485062 0.2055559 0.1439758 0.1499052
## 15 0.071 0.8186225 0.4248306 0.6531754 0.2013650 0.1519766 0.1449756
## 16 0.076 0.8009315 0.4238973 0.6499746 0.1767501 0.1446504 0.1297180
## 17 0.081 0.8153200 0.4025292 0.6606076 0.1939710 0.1697826 0.1400331
## 18 0.086 0.8097735 0.4025745 0.6475950 0.2177559 0.1941403 0.1482161
## 19 0.091 0.8082416 0.4021179 0.6460520 0.2148436 0.1950266 0.1467949
## 20 0.096 0.8049685 0.4120894 0.6487962 0.2249105 0.2159824 0.1556594
cart_model$bestTune
## cp
## 4 0.016
The best CART model uses cp = 0.016, giving a cross-validated RMSE of about 0.774.
rf_grid <- expand.grid(mtry = seq(2, ncol(train_data) - 1, by = 2))
rf_model <- train( Yield ~ ., data = train_data, method = "rf", trControl = train_control, tuneGrid = rf_grid)
rf_model$results
## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 2 0.6647108 0.6170756 0.5298046 0.1924333 0.2101954 0.1220010
## 2 4 0.6329151 0.6377918 0.5031495 0.2003447 0.2114336 0.1337663
## 3 6 0.6234007 0.6401338 0.4963178 0.2094614 0.2146643 0.1381301
## 4 8 0.6149989 0.6430567 0.4933024 0.2163277 0.2224814 0.1460229
## 5 10 0.6063040 0.6499405 0.4809787 0.2159995 0.2223692 0.1436248
## 6 12 0.6043247 0.6507022 0.4792980 0.2179135 0.2148121 0.1470091
## 7 14 0.5995312 0.6605086 0.4759586 0.2132600 0.2120453 0.1440215
## 8 16 0.6010140 0.6518358 0.4771918 0.2186478 0.2125606 0.1478376
## 9 18 0.6037117 0.6462688 0.4781628 0.2205801 0.2185438 0.1492423
## 10 20 0.5987741 0.6519650 0.4742222 0.2210815 0.2163074 0.1498532
## 11 22 0.6000498 0.6500945 0.4740219 0.2147167 0.2122590 0.1439288
## 12 24 0.5967707 0.6529065 0.4726573 0.2184349 0.2094549 0.1485644
## 13 26 0.5930400 0.6514006 0.4680033 0.2261438 0.2215060 0.1535388
## 14 28 0.5928316 0.6543010 0.4657563 0.2243283 0.2193681 0.1514911
## 15 30 0.5946242 0.6490275 0.4680762 0.2255124 0.2160562 0.1523174
## 16 32 0.5973824 0.6436347 0.4688344 0.2277198 0.2189048 0.1532000
## 17 34 0.5994237 0.6416863 0.4688036 0.2277011 0.2153068 0.1534835
## 18 36 0.5921494 0.6490422 0.4639692 0.2309897 0.2186465 0.1545614
## 19 38 0.5910163 0.6465660 0.4627557 0.2261759 0.2169202 0.1516395
## 20 40 0.5910284 0.6509026 0.4630451 0.2236190 0.2082423 0.1468849
## 21 42 0.5898556 0.6521454 0.4612157 0.2213279 0.2011252 0.1447405
## 22 44 0.5955598 0.6439432 0.4650698 0.2209907 0.2047364 0.1475274
## 23 46 0.5870762 0.6517545 0.4600810 0.2259972 0.2120677 0.1504442
## 24 48 0.5916755 0.6454745 0.4601160 0.2267792 0.2085834 0.1489553
## 25 50 0.5917448 0.6443899 0.4598028 0.2236107 0.2089107 0.1471884
## 26 52 0.5933610 0.6427798 0.4597137 0.2239722 0.2105642 0.1467469
## 27 54 0.5948514 0.6385155 0.4627169 0.2277840 0.2147819 0.1492559
## 28 56 0.5956152 0.6388303 0.4620724 0.2249647 0.2101621 0.1438432
rf_model$bestTune
## mtry
## 23 46
The best random forest model uses mtry = 46, giving a cross-validated RMSE of about 0.587.
gbm_grid <- expand.grid( n.trees = seq(100, 1000, by = 100), interaction.depth = c(1, 3, 5, 7), shrinkage = c(0.01, 0.1), n.minobsinnode = 10)
gbm_model <- train( Yield ~ ., data = train_data, method = "gbm", trControl = train_control, tuneGrid = gbm_grid, verbose = FALSE)
gbm_model$results
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 1 0.01 1 10 100 0.7809983 0.5760340
## 41 0.10 1 10 100 0.6640866 0.5908073
## 11 0.01 3 10 100 0.7244056 0.6172035
## 51 0.10 3 10 100 0.6505848 0.6144675
## 21 0.01 5 10 100 0.7206939 0.6180078
## 61 0.10 5 10 100 0.6503624 0.6299850
## 31 0.01 7 10 100 0.7171311 0.6340682
## 71 0.10 7 10 100 0.6179723 0.6544190
## 2 0.01 1 10 200 0.6985794 0.6049695
## 42 0.10 1 10 200 0.6598598 0.6053577
## 12 0.01 3 10 200 0.6557285 0.6359397
## 52 0.10 3 10 200 0.6354595 0.6312861
## 22 0.01 5 10 200 0.6539334 0.6380062
## 62 0.10 5 10 200 0.6435069 0.6296736
## 32 0.01 7 10 200 0.6492588 0.6481695
## 72 0.10 7 10 200 0.6167320 0.6553145
## 3 0.01 1 10 300 0.6644312 0.6136083
## 43 0.10 1 10 300 0.6664763 0.5985661
## 13 0.01 3 10 300 0.6369965 0.6408581
## 53 0.10 3 10 300 0.6383337 0.6282624
## 23 0.01 5 10 300 0.6338418 0.6434588
## 63 0.10 5 10 300 0.6429333 0.6298669
## 33 0.01 7 10 300 0.6316618 0.6496956
## 73 0.10 7 10 300 0.6161145 0.6551645
## 4 0.01 1 10 400 0.6518935 0.6192693
## 44 0.10 1 10 400 0.6629387 0.6041118
## 14 0.01 3 10 400 0.6323506 0.6411671
## 54 0.10 3 10 400 0.6358126 0.6311411
## 24 0.01 5 10 400 0.6252358 0.6485216
## 64 0.10 5 10 400 0.6413786 0.6310735
## 34 0.01 7 10 400 0.6242246 0.6541044
## 74 0.10 7 10 400 0.6149523 0.6560486
## 5 0.01 1 10 500 0.6480774 0.6209010
## 45 0.10 1 10 500 0.6607846 0.6053460
## 15 0.01 3 10 500 0.6293832 0.6437873
## 55 0.10 3 10 500 0.6356505 0.6315482
## 25 0.01 5 10 500 0.6220769 0.6520887
## 65 0.10 5 10 500 0.6417018 0.6306456
## 35 0.01 7 10 500 0.6202433 0.6569844
## 75 0.10 7 10 500 0.6148183 0.6560146
## 6 0.01 1 10 600 0.6449296 0.6214355
## 46 0.10 1 10 600 0.6552671 0.6103385
## 16 0.01 3 10 600 0.6274014 0.6460574
## 56 0.10 3 10 600 0.6348087 0.6326545
## 26 0.01 5 10 600 0.6233996 0.6484248
## 66 0.10 5 10 600 0.6401864 0.6322755
## 36 0.01 7 10 600 0.6189423 0.6591177
## 76 0.10 7 10 600 0.6142960 0.6566133
## 7 0.01 1 10 700 0.6439270 0.6223229
## 47 0.10 1 10 700 0.6523919 0.6124403
## 17 0.01 3 10 700 0.6272419 0.6472043
## 57 0.10 3 10 700 0.6345688 0.6329886
## 27 0.01 5 10 700 0.6225233 0.6492774
## 67 0.10 5 10 700 0.6400327 0.6325948
## 37 0.01 7 10 700 0.6173250 0.6590776
## 77 0.10 7 10 700 0.6150070 0.6561211
## 8 0.01 1 10 800 0.6432554 0.6245864
## 48 0.10 1 10 800 0.6530960 0.6138263
## 18 0.01 3 10 800 0.6271387 0.6457687
## 58 0.10 3 10 800 0.6345305 0.6332610
## 28 0.01 5 10 800 0.6199827 0.6516347
## 68 0.10 5 10 800 0.6401342 0.6324348
## 38 0.01 7 10 800 0.6162280 0.6594652
## 78 0.10 7 10 800 0.6149197 0.6562806
## 9 0.01 1 10 900 0.6441928 0.6225042
## 49 0.10 1 10 900 0.6506646 0.6154692
## 19 0.01 3 10 900 0.6268415 0.6450356
## 59 0.10 3 10 900 0.6344205 0.6332638
## 29 0.01 5 10 900 0.6200692 0.6516683
## 69 0.10 5 10 900 0.6401205 0.6324678
## 39 0.01 7 10 900 0.6158605 0.6582995
## 79 0.10 7 10 900 0.6150066 0.6562628
## 10 0.01 1 10 1000 0.6447872 0.6204721
## 50 0.10 1 10 1000 0.6463170 0.6201205
## 20 0.01 3 10 1000 0.6241034 0.6468253
## 60 0.10 3 10 1000 0.6344159 0.6333666
## 30 0.01 5 10 1000 0.6196183 0.6525847
## 70 0.10 5 10 1000 0.6400592 0.6326501
## 40 0.01 7 10 1000 0.6157383 0.6582110
## 80 0.10 7 10 1000 0.6151606 0.6561689
## MAE RMSESD RsquaredSD MAESD
## 1 0.6285084 0.16443036 0.1516553 0.12016847
## 41 0.5076282 0.08990609 0.1509132 0.06842708
## 11 0.5759036 0.15946482 0.1363936 0.10992754
## 51 0.4943505 0.10002350 0.1505903 0.09948099
## 21 0.5694309 0.15550746 0.1397332 0.10841831
## 61 0.4977806 0.10152327 0.1289772 0.08664998
## 31 0.5731071 0.15881993 0.1165412 0.10824970
## 71 0.4747823 0.10819267 0.1060200 0.07173174
## 2 0.5571243 0.14925811 0.1264148 0.11164989
## 42 0.5038580 0.09534245 0.1462850 0.07718247
## 12 0.5137325 0.14060129 0.1237024 0.09982726
## 52 0.4846764 0.10398154 0.1551558 0.10363851
## 22 0.5121446 0.13990199 0.1335460 0.09933155
## 62 0.4985714 0.10743631 0.1478276 0.09109815
## 32 0.5117614 0.13733853 0.1156166 0.09640033
## 72 0.4779749 0.10696435 0.1079221 0.07013431
## 3 0.5222184 0.13445892 0.1207568 0.09964122
## 43 0.5138576 0.11457813 0.1672611 0.09202467
## 13 0.4937927 0.12330602 0.1295427 0.09243104
## 53 0.4903207 0.11104415 0.1588332 0.10962013
## 23 0.4936370 0.12303165 0.1331330 0.09114316
## 63 0.4981790 0.11364234 0.1501363 0.09204837
## 33 0.4911497 0.11796862 0.1202374 0.08747142
## 73 0.4795944 0.10820741 0.1144669 0.07397762
## 4 0.5068306 0.12017286 0.1222800 0.09000111
## 44 0.5084483 0.12488617 0.1680406 0.09260838
## 14 0.4871839 0.11241919 0.1357312 0.08562905
## 54 0.4896592 0.11593248 0.1603144 0.11214187
## 24 0.4821500 0.10822822 0.1312298 0.08290215
## 64 0.4973508 0.11641892 0.1510962 0.09492109
## 34 0.4803781 0.10290082 0.1208292 0.08112290
## 74 0.4799144 0.10726902 0.1148804 0.07274370
## 5 0.5017326 0.10839236 0.1250390 0.08063896
## 45 0.5064897 0.13155826 0.1729824 0.09938307
## 15 0.4835880 0.10282998 0.1367351 0.07972333
## 55 0.4900969 0.11663636 0.1613866 0.11182817
## 25 0.4770692 0.09914728 0.1322074 0.08048758
## 65 0.4979276 0.11751456 0.1520895 0.09620910
## 35 0.4763950 0.09756924 0.1189086 0.07650617
## 75 0.4804050 0.10834247 0.1157598 0.07364424
## 6 0.4968657 0.10055680 0.1267186 0.07778481
## 46 0.5067756 0.12932843 0.1685978 0.09652661
## 16 0.4822654 0.09709117 0.1348243 0.07463410
## 56 0.4899324 0.11802488 0.1617016 0.11278947
## 26 0.4762391 0.09562225 0.1343885 0.07999505
## 66 0.4967783 0.11911588 0.1529129 0.09756950
## 36 0.4744400 0.09056611 0.1160942 0.07283020
## 76 0.4801829 0.10935617 0.1165642 0.07455271
## 7 0.4932830 0.09316419 0.1287774 0.07306938
## 47 0.5031826 0.13401398 0.1708754 0.09860558
## 17 0.4822756 0.09529484 0.1372832 0.07264736
## 57 0.4899748 0.11834597 0.1616426 0.11300202
## 27 0.4747810 0.09225400 0.1358916 0.07785425
## 67 0.4966653 0.11989727 0.1529012 0.09839046
## 37 0.4727232 0.08735079 0.1175702 0.07100221
## 77 0.4807565 0.10996055 0.1172351 0.07535395
## 8 0.4909691 0.09063532 0.1324369 0.07133621
## 48 0.5021194 0.13211170 0.1675897 0.09846761
## 18 0.4820657 0.09425827 0.1412937 0.07272520
## 58 0.4901549 0.11830500 0.1612524 0.11279943
## 28 0.4732901 0.09140975 0.1353172 0.07708045
## 68 0.4965069 0.12047352 0.1531855 0.09887541
## 38 0.4717629 0.08701562 0.1183873 0.06943626
## 78 0.4806449 0.11018409 0.1173793 0.07552141
## 9 0.4909784 0.08695043 0.1350637 0.06926181
## 49 0.5007653 0.13021601 0.1639124 0.09811168
## 19 0.4832094 0.09386598 0.1414543 0.07262097
## 59 0.4900049 0.11860580 0.1616478 0.11301262
## 29 0.4740556 0.09173110 0.1357893 0.07720748
## 69 0.4964976 0.12045783 0.1531325 0.09902283
## 39 0.4710362 0.08977628 0.1219452 0.07265605
## 79 0.4808125 0.11022117 0.1173345 0.07559880
## 10 0.4910317 0.08527586 0.1388171 0.06996667
## 50 0.4969537 0.13326596 0.1659476 0.09992424
## 20 0.4827440 0.09554731 0.1427071 0.07471998
## 60 0.4900608 0.11880256 0.1617082 0.11320096
## 30 0.4747160 0.09100641 0.1343966 0.07648370
## 70 0.4964336 0.12038356 0.1529517 0.09903179
## 40 0.4704661 0.08831407 0.1221065 0.07175592
## 80 0.4808572 0.11021003 0.1174975 0.07566336
gbm_model$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 76 600 7 0.1 10
The best GBM model uses 600 trees, an interaction depth of 7, shrinkage of 0.1, and n.minobsinnode of 10, giving a cross-validated RMSE of about 0.614.
cforest_grid <- expand.grid(mtry = seq(2, ncol(train_data) - 1, by = 2))
cforest_model <- train( Yield ~ ., data = train_data, method = "cforest", trControl = train_control, tuneGrid = cforest_grid)
cforest_model$results
## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 2 0.7734728 0.5594844 0.6289582 0.1837761 0.1714856 0.1333175
## 2 4 0.6997209 0.5994145 0.5654223 0.1742697 0.1763820 0.1308662
## 3 6 0.6775019 0.6109421 0.5415066 0.1711173 0.1768761 0.1269020
## 4 8 0.6647450 0.6168151 0.5276584 0.1718698 0.1762917 0.1259805
## 5 10 0.6554215 0.6253686 0.5207841 0.1757410 0.1878354 0.1290685
## 6 12 0.6515951 0.6225941 0.5156571 0.1784861 0.1860254 0.1301737
## 7 14 0.6489156 0.6246828 0.5121425 0.1710228 0.1841013 0.1231531
## 8 16 0.6428756 0.6284961 0.5046522 0.1693079 0.1870171 0.1270090
## 9 18 0.6423794 0.6272220 0.5047527 0.1729098 0.1887565 0.1241703
## 10 20 0.6428234 0.6236857 0.4999498 0.1736440 0.1881186 0.1252323
## 11 22 0.6411323 0.6256643 0.5026311 0.1773223 0.1845369 0.1277351
## 12 24 0.6381360 0.6264726 0.4979470 0.1750743 0.1913018 0.1253593
## 13 26 0.6392536 0.6271483 0.4997648 0.1711606 0.1846612 0.1261184
## 14 28 0.6428853 0.6192072 0.5009162 0.1774025 0.1882015 0.1250092
## 15 30 0.6434630 0.6191341 0.5011892 0.1746963 0.1892417 0.1248982
## 16 32 0.6436844 0.6172850 0.4993313 0.1765668 0.1910195 0.1241853
## 17 34 0.6423142 0.6162902 0.5000669 0.1798426 0.1928890 0.1285508
## 18 36 0.6435443 0.6145516 0.4972752 0.1776916 0.1890725 0.1262068
## 19 38 0.6426027 0.6163303 0.4974580 0.1730651 0.1938310 0.1205295
## 20 40 0.6458635 0.6134709 0.4993183 0.1766333 0.1956111 0.1224962
## 21 42 0.6439463 0.6143831 0.4979921 0.1774857 0.1899462 0.1239965
## 22 44 0.6435595 0.6163993 0.4991125 0.1754255 0.2026876 0.1219318
## 23 46 0.6482714 0.6073737 0.5021270 0.1790927 0.1998760 0.1245366
## 24 48 0.6459890 0.6088103 0.4983062 0.1778831 0.1939567 0.1237126
## 25 50 0.6492002 0.6051626 0.5020355 0.1813599 0.2005273 0.1247196
## 26 52 0.6477088 0.6085115 0.4976728 0.1802608 0.1954737 0.1226758
## 27 54 0.6486575 0.6055660 0.5000974 0.1824987 0.2066237 0.1242330
## 28 56 0.6499689 0.6022810 0.5009551 0.1836456 0.2043052 0.1247678
cforest_model$bestTune
## mtry
## 12 24
The best cforest model uses mtry = 24, giving a cross-validated RMSE of about 0.638.
cubist_grid <- expand.grid(committees = c(1, 5, 10, 50), neighbors = c(0, 5, 9))
cubist_model <- train(Yield ~ ., data = train_data, method = "cubist", trControl = train_control, tuneGrid = cubist_grid)
cubist_model$results
## committees neighbors RMSE Rsquared MAE RMSESD RsquaredSD
## 1 1 0 0.9706433 0.5013836 0.6401350 0.6343657 0.2571346
## 2 1 5 0.9161788 0.5678770 0.5810554 0.6385652 0.2528166
## 3 1 9 0.9421657 0.5405531 0.5945510 0.6393674 0.2526094
## 4 5 0 0.7261803 0.5366424 0.5660007 0.1721344 0.1570068
## 5 5 5 0.6614559 0.6229593 0.4960204 0.1647036 0.1280034
## 6 5 9 0.6879967 0.5920203 0.5172612 0.1807471 0.1400315
## 7 10 0 0.7190147 0.5178746 0.5677932 0.1456933 0.1480944
## 8 10 5 0.6600953 0.5989722 0.4990179 0.1374905 0.1248024
## 9 10 9 0.6832044 0.5699978 0.5194153 0.1487794 0.1259344
## 10 50 0 0.6637612 0.5838547 0.5154338 0.1439223 0.1555719
## 11 50 5 0.6182942 0.6465053 0.4628700 0.1325178 0.1259115
## 12 50 9 0.6359563 0.6242026 0.4746983 0.1383566 0.1263256
## MAESD
## 1 0.24681456
## 2 0.24119594
## 3 0.23529875
## 4 0.13910000
## 5 0.12744630
## 6 0.14297280
## 7 0.12435678
## 8 0.10718378
## 9 0.12410651
## 10 0.11225113
## 11 0.08845716
## 12 0.10428653
cubist_model$bestTune
## committees neighbors
## 11 50 5
The best Cubist model uses 50 committees and 5 neighbors, giving a cross-validated RMSE of about 0.618.
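As a sanity check on the resampling comparison, caret's resamples() can line up the cross-validation results of the five models side by side (a sketch; for a strictly fair comparison the models should share the same CV folds, for example by setting the same seed or supplying a common index to trainControl before each call to train).
res <- resamples(list(CART = cart_model, RF = rf_model, cforest = cforest_model,
                      GBM = gbm_model, Cubist = cubist_model))
summary(res)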
Now let's calculate the test set RMSE for each tuned model to compare them.
# Predict on test set
cart_pred <- predict(cart_model, test_data)
rf_pred <- predict(rf_model, test_data)
cforest_pred <- predict(cforest_model, test_data)
gbm_pred <- predict(gbm_model, test_data)
cubist_pred <- predict(cubist_model, test_data)
# Compute RMSE for each
cart_rmse <- RMSE(cart_pred, test_data$Yield)
rf_rmse <- RMSE(rf_pred, test_data$Yield)
cforest_rmse <- RMSE(cforest_pred, test_data$Yield)
gbm_rmse <- RMSE(gbm_pred, test_data$Yield)
cubist_rmse <- RMSE(cubist_pred, test_data$Yield)
data.frame(
Model = c("CART", "Random Forest", "cforest", "GBM", "Cubist"),
RMSE = c(cart_rmse, rf_rmse, cforest_rmse, gbm_rmse, cubist_rmse)
)
## Model RMSE
## 1 CART 0.9632089
## 2 Random Forest 0.7054907
## 3 cforest 0.7193889
## 4 GBM 0.6054759
## 5 Cubist 0.5002221
The Cubist model appears to be the best: it gives the lowest test set RMSE by a clear margin, and its cross-validated RMSE is competitive with the best of the other models.
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
varImp(cubist_model)
## cubist variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100
## ManufacturingProcess17 61
## BiologicalMaterial06 36
## ManufacturingProcess13 35
## ManufacturingProcess09 35
## BiologicalMaterial02 34
## ManufacturingProcess04 30
## ManufacturingProcess33 29
## BiologicalMaterial12 23
## ManufacturingProcess25 23
## ManufacturingProcess29 20
## BiologicalMaterial03 19
## ManufacturingProcess26 17
## ManufacturingProcess37 17
## BiologicalMaterial04 14
## ManufacturingProcess14 13
## ManufacturingProcess27 12
## ManufacturingProcess15 12
## BiologicalMaterial08 12
## ManufacturingProcess28 12
The manufacturing process variables dominate the list: 7 of the top 10 predictors are process variables and only 3 are biological. The two most important predictors, ManufacturingProcess32 (100) and ManufacturingProcess17 (61), are both process variables, while the top biological predictor, BiologicalMaterial06, comes in at 36. These top predictors are very similar to those from the optimal linear and nonlinear models: ManufacturingProcess32 and ManufacturingProcess17 rank highly in all of them, and BiologicalMaterial06 is the top biological material in each. The exact ordering varies slightly from model to model, but the set of important predictors is largely the same.
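For a quick visual check of just the top 10 predictors, caret's importance plot can be limited with the top argument (a small sketch using the fitted Cubist model above).
plot(varImp(cubist_model), top = 10)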
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(rpart.plot)
rpart.plot(cart_model$finalModel,type = 4,fallen.leaves = TRUE,tweak = 1.2)
Yes, the tree view does provide additional insight into the process and biological predictors and their relationship with yield. As expected, the first split is on the most important predictor, ManufacturingProcess32. When that value is < 0.19, the next split is on the second most important predictor, ManufacturingProcess17; when it is >= 0.19, the tree instead splits on BiologicalMaterial06, the third most important predictor. This shows how ManufacturingProcess32 interacts with both ManufacturingProcess17 and BiologicalMaterial06. Further down the tree you can see how different combinations of predictor values lead to different outcomes: a sample with ManufacturingProcess32 >= 0.19 and BiologicalMaterial06 >= 0.72 is assigned a (standardized) yield of 1.3, while a sample with the same ManufacturingProcess32 value but BiologicalMaterial06 < 0.72 is split further on ManufacturingProcess23, receiving a yield of 0.099 if that value is >= -0.31 and 1 if it is below -0.31. The tree makes it easy to see how the different combinations of predictor values interact and lead to a predicted yield.
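To see the full distribution of yield within each terminal node (rather than just the node means shown by rpart.plot), one option is to convert the rpart fit with partykit, which draws a boxplot of the response at each leaf (a sketch, assuming the partykit package is installed).
library(partykit)
plot(as.party(cart_model$finalModel))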