Homework 9

8.1.

Recreate the simulated data from Exercise 7.2:

library(mlbench)

## Warning: package 'mlbench' was built under R version 4.3.3

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores

library(randomForest)

## Warning: package 'randomForest' was built under R version 4.3.3

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

set.seed(122)
model1 <- randomForest(y ~ ., data = simulated,
  importance = TRUE,
  ntree = 1000)

rfImp1 <- varImp(model1, scale = FALSE)

rfImp1 %>% 
  mutate (var = rownames(rfImp1)) %>%
  ggplot(aes(Overall, reorder(var, Overall, sum), var)) + 
  geom_col(fill = 'pink') + 
  labs(title = 'Variable Importance' , y = 'Variable')

varImp(model1)

##        Overall
## V1  56.8894847
## V2  46.7900310
## V3  11.8772831
## V4  51.0737047
## V5  23.4440838
## V6   2.3399939
## V7   0.7521566
## V8  -2.8220872
## V9  -0.4757812
## V10 -1.0011703

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

It appears that the random forest model did not significantly use the uninformative predictors (V6–V10). These predictors have very low or close to zero importance scores compared to V1, V2, V4, and V5, which have much higher importance scores.

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated2 <- simulated

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)

## [1] 0.9378309

Fit another random forest model to these data. Did the importance score for V1 change?

set.seed(123)
model2 <- randomForest(y ~ ., data = simulated2,
                      importance = TRUE,
                      ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
varImp(model2)

##       Overall
## V1  54.694421
## V2  46.803860
## V3  11.010007
## V4  51.399770
## V5  22.598665
## V6   2.498134
## V7   1.277084
## V8  -2.832192
## V9  -2.386409
## V10 -2.883679

V1 value has reduced from 56.8894847 to 54.694421.

What happens when you add another predictor that is also highly correlated with V1?

simulated3 <- simulated2
simulated3$duplicate2 <- simulated3$V1 + rnorm(200) * .1
cor(simulated3$duplicate2, simulated3$V1)

## [1] 0.9498444

set.seed(124)
model3 <- randomForest(y ~ ., data = simulated3,
                      importance = TRUE,
                      ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
varImp(model3)

##               Overall
## V1         36.2976172
## V2         44.1314362
## V3          9.4955850
## V4         53.2184866
## V5         23.5536801
## V6          3.6847594
## V7          1.2445671
## V8         -1.2873440
## V9         -0.5334566
## V10         1.0035823
## duplicate2 25.6476302

When another highly correlated predictor is added the importance of V1 decreased even more now it’s 36.2976172 which is significantly smaller than the two values before.

c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)

## Warning: package 'party' was built under R version 4.3.3

## Loading required package: grid

## Loading required package: mvtnorm

## Loading required package: modeltools

## Loading required package: stats4

## Loading required package: strucchange

## Warning: package 'strucchange' was built under R version 4.3.3

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 4.3.3

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

## Warning: package 'sandwich' was built under R version 4.3.3

## 
## Attaching package: 'party'

## The following object is masked from 'package:dplyr':
## 
##     where

set.seed(125)
party_model <- cforest(y ~ ., data = simulated,
                      control = cforest_control(ntree = 50)
                      )

varimp(party_model, conditional = TRUE)

##           V1           V2           V3           V4           V5           V6 
##  2.174880680  4.696646749  0.008201785  4.248047662  0.580985750  0.018011045 
##           V7           V8           V9          V10   duplicate1 
## -0.118920221  0.017864104  0.005922653 -0.026902135  0.268407770

varimp(party_model, conditional = FALSE)

##           V1           V2           V3           V4           V5           V6 
##  8.068908321  7.183965697 -0.239228211  7.386741675  2.207413850  0.114762931 
##           V7           V8           V9          V10   duplicate1 
## -0.077129200  0.009350458  0.024143583  0.095343477  1.680259504

The conditional importance (conditional = TRUE) and traditional importance (conditional = FALSE) methods show similar patterns, with variables V1, V2, and V4 being the most important in both cases. However, the conditional importance values are generally smaller and more stable, as they account for correlations between variables. In contrast, the traditional importance measure tends to give higher values and can overestimate the importance of some variables.

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

gbmgrid <- expand.grid(interaction.depth = seq(1,7, by = 2),
                       n.trees = seq(100, 500, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 3)
set.seed(125)
gbmtune <- train(y ~ ., data = simulated,
                 method = "gbm",
                 tuneGrid = gbmgrid,
                 verbose = FALSE)
gbmtune

## Stochastic Gradient Boosting 
## 
## 200 samples
##  11 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE      Rsquared   MAE     
##   0.01       1                  100      4.025985  0.6095260  3.292619
##   0.01       1                  150      3.723283  0.6539989  3.049312
##   0.01       1                  200      3.480240  0.6875306  2.852652
##   0.01       1                  250      3.275988  0.7122247  2.686021
##   0.01       1                  300      3.106441  0.7321012  2.545712
##   0.01       1                  350      2.965219  0.7460205  2.428331
##   0.01       1                  400      2.847171  0.7588521  2.327823
##   0.01       1                  450      2.743987  0.7707793  2.237843
##   0.01       1                  500      2.654961  0.7803971  2.159588
##   0.01       3                  100      3.506789  0.7114229  2.863898
##   0.01       3                  150      3.123933  0.7453157  2.551115
##   0.01       3                  200      2.849568  0.7689326  2.322464
##   0.01       3                  250      2.640269  0.7897172  2.147377
##   0.01       3                  300      2.479832  0.8065596  2.009731
##   0.01       3                  350      2.354946  0.8188133  1.902602
##   0.01       3                  400      2.257452  0.8281084  1.819125
##   0.01       3                  450      2.179718  0.8354630  1.754417
##   0.01       3                  500      2.120568  0.8405717  1.704437
##   0.01       5                  100      3.339009  0.7355446  2.726759
##   0.01       5                  150      2.948649  0.7641741  2.405704
##   0.01       5                  200      2.679295  0.7870458  2.182475
##   0.01       5                  250      2.491700  0.8032041  2.024575
##   0.01       5                  300      2.351596  0.8161738  1.904030
##   0.01       5                  350      2.249159  0.8256267  1.816579
##   0.01       5                  400      2.172045  0.8327599  1.751218
##   0.01       5                  450      2.115061  0.8377408  1.701069
##   0.01       5                  500      2.070469  0.8418534  1.661168
##   0.01       7                  100      3.258143  0.7503666  2.656502
##   0.01       7                  150      2.865226  0.7748558  2.335328
##   0.01       7                  200      2.606393  0.7942265  2.120253
##   0.01       7                  250      2.428986  0.8090075  1.971420
##   0.01       7                  300      2.304386  0.8196597  1.865051
##   0.01       7                  350      2.215941  0.8277239  1.790378
##   0.01       7                  400      2.151839  0.8334940  1.734929
##   0.01       7                  450      2.102330  0.8381347  1.691689
##   0.01       7                  500      2.067593  0.8412875  1.661831
##   0.10       1                  100      2.136897  0.8310945  1.715040
##   0.10       1                  150      1.984691  0.8428682  1.582699
##   0.10       1                  200      1.936564  0.8462321  1.537561
##   0.10       1                  250      1.934382  0.8445579  1.533205
##   0.10       1                  300      1.929622  0.8449412  1.528449
##   0.10       1                  350      1.936562  0.8433022  1.535880
##   0.10       1                  400      1.946188  0.8410562  1.542785
##   0.10       1                  450      1.950621  0.8400485  1.545190
##   0.10       1                  500      1.954161  0.8396701  1.546084
##   0.10       3                  100      1.966961  0.8442204  1.555410
##   0.10       3                  150      1.940794  0.8465077  1.535860
##   0.10       3                  200      1.938968  0.8464030  1.532269
##   0.10       3                  250      1.940590  0.8456343  1.532829
##   0.10       3                  300      1.940490  0.8453770  1.531802
##   0.10       3                  350      1.941539  0.8451250  1.532339
##   0.10       3                  400      1.940327  0.8452512  1.530979
##   0.10       3                  450      1.940930  0.8450715  1.531387
##   0.10       3                  500      1.941448  0.8449902  1.531750
##   0.10       5                  100      2.022356  0.8350884  1.610613
##   0.10       5                  150      2.012704  0.8358934  1.602173
##   0.10       5                  200      2.009653  0.8359832  1.601010
##   0.10       5                  250      2.009444  0.8358968  1.600470
##   0.10       5                  300      2.008110  0.8360816  1.599553
##   0.10       5                  350      2.008066  0.8360464  1.599460
##   0.10       5                  400      2.007877  0.8360414  1.599252
##   0.10       5                  450      2.007704  0.8360490  1.599133
##   0.10       5                  500      2.007531  0.8360793  1.598984
##   0.10       7                  100      2.036401  0.8347817  1.628206
##   0.10       7                  150      2.028265  0.8351252  1.622561
##   0.10       7                  200      2.026025  0.8353165  1.620206
##   0.10       7                  250      2.025847  0.8352982  1.619681
##   0.10       7                  300      2.026154  0.8351966  1.619884
##   0.10       7                  350      2.026202  0.8351785  1.619972
##   0.10       7                  400      2.026093  0.8351947  1.619910
##   0.10       7                  450      2.026110  0.8351914  1.619977
##   0.10       7                  500      2.026097  0.8351911  1.619974
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 3
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 300, interaction.depth =
##  1, shrinkage = 0.1 and n.minobsinnode = 3.

The best model shows n.trees = 300, interaction.depth = 1, shrinkage =0.1 and n.minobsinnode = 3.

best_model <- gbmtune$finalModel

# Calculating variable importance
varImp(best_model)

##               Overall
## V1         4346.43701
## V2         4171.54558
## V3         1711.07190
## V4         4827.29219
## V5         1914.76704
## V6          221.73680
## V7          162.21968
## V8          143.19312
## V9           64.69469
## V10          47.48801
## duplicate1  307.19967

V1, V4, and V2 are the most important predictors, as they have the highest importance scores (4346.44, 4827.29, and 4171.55, respectively).

Cubist

Cugrid <- expand.grid(committees = seq(1, 10, by = 1),
                       neighbors = seq(1, 9, by=1)
                       )

set.seed(126)

Cutune <- train(y ~ ., data = simulated,
                 method = "cubist",
                 trControl = trainControl(method = "cv", n = 10),
                 tuneGrid = Cugrid,
                 verbose = FALSE)
Cutune

## Cubist 
## 
## 200 samples
##  11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE      Rsquared   MAE     
##    1          1          2.711553  0.7238966  2.215276
##    1          2          2.348045  0.7876534  1.871487
##    1          3          2.234907  0.8062750  1.765299
##    1          4          2.214369  0.8107231  1.764077
##    1          5          2.163153  0.8178812  1.726898
##    1          6          2.151031  0.8183637  1.711522
##    1          7          2.154950  0.8169671  1.724436
##    1          8          2.151490  0.8177512  1.722473
##    1          9          2.149519  0.8181752  1.723557
##    2          1          2.539919  0.7559803  2.061367
##    2          2          2.284123  0.7983346  1.821549
##    2          3          2.180270  0.8169137  1.744395
##    2          4          2.144847  0.8241536  1.718802
##    2          5          2.119247  0.8271542  1.699976
##    2          6          2.109801  0.8275539  1.699869
##    2          7          2.109570  0.8263442  1.702196
##    2          8          2.108970  0.8264649  1.699717
##    2          9          2.111416  0.8260197  1.703254
##    3          1          2.484749  0.7643661  2.041314
##    3          2          2.211704  0.8096779  1.766315
##    3          3          2.102951  0.8286532  1.673652
##    3          4          2.083523  0.8332421  1.655971
##    3          5          2.054823  0.8364660  1.634239
##    3          6          2.048148  0.8361307  1.631841
##    3          7          2.054346  0.8342385  1.643854
##    3          8          2.055017  0.8344609  1.644232
##    3          9          2.055026  0.8344156  1.648627
##    4          1          2.516065  0.7595799  2.051457
##    4          2          2.239858  0.8045211  1.788455
##    4          3          2.116739  0.8242333  1.693169
##    4          4          2.084080  0.8308408  1.668192
##    4          5          2.055999  0.8340954  1.643708
##    4          6          2.048131  0.8341428  1.645326
##    4          7          2.049288  0.8332597  1.652279
##    4          8          2.046210  0.8340774  1.653007
##    4          9          2.047109  0.8340515  1.657664
##    5          1          2.495456  0.7635865  2.054273
##    5          2          2.211787  0.8099751  1.766575
##    5          3          2.091259  0.8292124  1.665582
##    5          4          2.065969  0.8345129  1.649284
##    5          5          2.031579  0.8386623  1.620889
##    5          6          2.024578  0.8386888  1.616626
##    5          7          2.024815  0.8378728  1.627215
##    5          8          2.020938  0.8388108  1.627668
##    5          9          2.020922  0.8388672  1.631744
##    6          1          2.467345  0.7681707  2.008971
##    6          2          2.186298  0.8142880  1.737625
##    6          3          2.065182  0.8328439  1.643727
##    6          4          2.032385  0.8391985  1.617222
##    6          5          2.001747  0.8425161  1.591420
##    6          6          1.994132  0.8425788  1.591613
##    6          7          1.990264  0.8423694  1.595081
##    6          8          1.985101  0.8435032  1.591010
##    6          9          1.986109  0.8434881  1.596037
##    7          1          2.433043  0.7745426  1.994020
##    7          2          2.145324  0.8212838  1.706675
##    7          3          2.026929  0.8392774  1.610952
##    7          4          2.003993  0.8440500  1.594171
##    7          5          1.975467  0.8470947  1.566880
##    7          6          1.971246  0.8466234  1.565913
##    7          7          1.970731  0.8460242  1.573014
##    7          8          1.966066  0.8470901  1.569619
##    7          9          1.966153  0.8471663  1.573976
##    8          1          2.451774  0.7707967  1.991866
##    8          2          2.172288  0.8163225  1.725709
##    8          3          2.054929  0.8339464  1.635521
##    8          4          2.025539  0.8395677  1.611045
##    8          5          1.997153  0.8427181  1.585633
##    8          6          1.992250  0.8424542  1.586280
##    8          7          1.988710  0.8422403  1.591324
##    8          8          1.983612  0.8434243  1.589274
##    8          9          1.985278  0.8433297  1.595619
##    9          1          2.423801  0.7762024  1.977252
##    9          2          2.146443  0.8209390  1.708224
##    9          3          2.032620  0.8379540  1.614166
##    9          4          2.007640  0.8429132  1.594746
##    9          5          1.978776  0.8461551  1.568630
##    9          6          1.975994  0.8456213  1.569041
##    9          7          1.974672  0.8451025  1.578013
##    9          8          1.968685  0.8463615  1.574942
##    9          9          1.969094  0.8463836  1.580368
##   10          1          2.438556  0.7735543  1.980604
##   10          2          2.165106  0.8179721  1.723205
##   10          3          2.054465  0.8343055  1.638588
##   10          4          2.025128  0.8397046  1.612519
##   10          5          1.995723  0.8431040  1.584958
##   10          6          1.993518  0.8424311  1.588204
##   10          7          1.990957  0.8420289  1.593642
##   10          8          1.985681  0.8431933  1.591474
##   10          9          1.987938  0.8430046  1.598182
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 7 and neighbors = 8.

The best model shows committees = 7 and neighbors = 8.

best_model_cu <- Cutune$finalModel

varImp(best_model_cu)

##            Overall
## V1            71.5
## V3            36.5
## V2            54.0
## V4            50.0
## V5            50.0
## duplicate1    21.5
## V6             9.0
## V7             0.0
## V8             0.0
## V9             0.0
## V10            0.0

GBM gives high importance to many variables, with V1 and V4 as the top features. All variables contribute to some degree. Cubist focuses on fewer variables, with V1 being the most important, and some variables (V7,V8,V9) are not used at all. In conclusion, in the Cubist model, the importance of variables resembles the pattern seen in both the random forest and GBM models: the first five predictors are the most influential, while predictors V6 through V10 hold much lower or negligible importance.

8.2

Use a simulation to show tree bias with different granularities

In my example, we’re predicting whether an employee is highly productive based on their monthly income and weekly work hours.The goal is to determine if an employee is likely to be “Highly Productive” or not, based on two variables: monthly_income and work_hours. Productivity is a binary outcome (Yes/No).

library(rpart)
library(rpart.plot)  

set.seed(128) 

show_tree_productivity <- function(observations, granularity) {
  
  # Variable 1: Monthly Income, with different levels of granularity
  monthly_income <- round(runif(observations, 2000, 10000) / granularity) * granularity
  
  # Variable 2: Work hours per week
  work_hours <- sample(20:60, observations, replace = TRUE)
  
  # Binary outcome: Highly Productive
  productive <- as.factor(ifelse(monthly_income + work_hours * 100 > 7000, "Yes", "No"))
  
  # Create data frame
  data <- data.frame(monthly_income, work_hours, productive)
  
  # Fit a decision tree model
  model <- rpart(productive ~ monthly_income + work_hours, data = data, method = "class")
  
  # Plot the decision tree
  rpart.plot(model, main = paste("Tree with Granularity =", granularity))
}

# Run the function with different levels of granularity
show_tree_productivity(100, 1000)

show_tree_productivity(100, 2000)

show_tree_productivity(100, 100)

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

a) Why does the model on the right focus its importance on just the first fewof predictors, whereas the model on the left spreads importance across more predictors?

The model on the right, with higher bagging fraction and learning rate, focuses on just a few key predictors because it aggressively fits the data and prioritizes the most predictive features. The model on the left, with lower values, is more conservative and spreads importance across many predictors, exploring a wider range of features.

b) Which model would be more predictive of other samples?

The model on the left would likely be more predictive of other samples, as it spreads importance across more features, reducing the chance of overfitting.

c) How would increasing interaction depth affect the slope of predictor importance?

Increasing interaction depth would likely make the slope steeper for both models by giving more importance to predictors involved in complex interactions. This effect would be stronger in the right model,

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
chem <- ChemicalManufacturingProcess

preProcValues <- preProcess(chem, method = c("knnImpute"))
data_imp <- predict(preProcValues, chem)

set.seed(120)
index <- createDataPartition(data_imp$Yield, p=0.8, list=FALSE) 
Train <- data_imp[index, ]
Test <- data_imp[-index, ]


trans_train <- preProcess(Train, method = c("center", "scale"))
trans_test <- preProcess(Test, method = c("center", "scale"))

Train_prep <- predict(trans_train, Train)
Test_prep <- predict(trans_test, Test)

performance2 <- data.frame(matrix(ncol = 3, nrow = 0,
                                  dimnames = list(NULL, c("RMSE", "Rsquared", "MAE"))),
                           stringsAsFactors = FALSE)

Random Forest

set.seed(120)
rf_model <- train(Yield ~ ., data = Train_prep, method = "rf",
                  trControl = trainControl(method = "cv", number = 5))

# Predict on the test set
rf_pred <- predict(rf_model, Test_prep)

# Evaluate the model
rf_performance <- postResample(rf_pred, Test_prep$Yield)
performance2 <- rbind(performance2, data.frame(RMSE = rf_performance[1], Rsquared = rf_performance[2], MAE = rf_performance[3]))
rownames(performance2)[nrow(performance2)] <- "Random Forest"

Bagged Tree

set.seed(120)
bagged_tree_model <- train(Yield ~ ., data = Train_prep, method = "treebag",
                           trControl = trainControl(method = "cv", number = 5))

# Predict on the test set
bagged_pred <- predict(bagged_tree_model, Test_prep)

# Evaluate the model
bagged_performance <- postResample(bagged_pred, Test_prep$Yield)
performance2 <- rbind(performance2, data.frame(RMSE = bagged_performance[1], Rsquared = bagged_performance[2], MAE = bagged_performance[3]))
rownames(performance2)[nrow(performance2)] <- "Bagged Tree"

Boosted Tree (using GBM)

set.seed(120)
boosted_tree_model <- train(Yield ~ ., data = Train_prep, method = "gbm",
                            trControl = trainControl(method = "cv", number = 5),
                            verbose = FALSE)

# Predict on the test set
boosted_pred <- predict(boosted_tree_model, Test_prep)

# Evaluate the model
boosted_performance <- postResample(boosted_pred, Test_prep$Yield)
performance2 <- rbind(performance2, data.frame(RMSE = boosted_performance[1], Rsquared = boosted_performance[2], MAE = boosted_performance[3]))
rownames(performance2)[nrow(performance2)] <- "Boosted Tree"

Cubist

set.seed(120)
cubist_model <- train(Yield ~ ., data = Train_prep, method = "cubist",
                      trControl = trainControl(method = "cv", number = 5))

# Predict on the test set
cubist_pred <- predict(cubist_model, Test_prep)

# Evaluate the model
cubist_performance <- postResample(cubist_pred, Test_prep$Yield)
performance2 <- rbind(performance2, data.frame(RMSE = cubist_performance[1], Rsquared = cubist_performance[2], MAE = cubist_performance[3]))
rownames(performance2)[nrow(performance2)] <- "Cubist"

a) Which tree-based regression model gives the optimal resampling and test set performance?

performance2

##                    RMSE  Rsquared       MAE
## Random Forest 0.4876054 0.8172552 0.3907875
## Bagged Tree   0.5425959 0.7179919 0.4330012
## Boosted Tree  0.5044564 0.7500334 0.4221457
## Cubist        0.7188146 0.7142570 0.5748481

The Random Forest model is the optimal choice for this dataset, as it provides the best performance across RMSE, R-squared, and MAE. This suggests that Random Forest is better suited to capture the patterns in the ChemicalManufacturingProcess dataset compared to Bagged Tree, Boosted Tree, or Cubist.

b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

plot(varImp(rf_model), 10)

Previous analysis noted a more balanced influence between manufacturing and biological predictors.The current findings indicate a shift toward process dominance, where ManufacturingProcess variables clearly outperform biological predictors in terms of importance for yield optimization. This mirrors the conclusion in the previous analysis but with a stronger emphasis on manufacturing processes as the key factor in improving yield.

c) Plot the optimal single tree with the distribution of yield in the terminalnodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

single_tree <- rpart(Yield ~ ., method = "anova", data = Train_prep)

rpart.plot(single_tree)

While ManufacturingProcess32 is at the top accounting for 100% of the yield, BiologicalMaterial11 and ManufacturingProcess31 split that into 58% and 42% respectively, with the BiologiclMaterial having slightly more impact to final yield than the ManufacturingProcess at that level of the split. There are less biological predictors present in the single tree model than were seen in the more robust tree models. That makes sense because the simpler tree model focuses on the most influential variables, potentially simplifying the decision-making process by highlighting the key manufacturing parameters that are most critical for optimizing yield.

Homework 9

Nikoleta Emanouilidi

2024-11-10

8.1.

Recreate the simulated data from Exercise 7.2:

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

It appears that the random forest model did not significantly use the uninformative predictors (V6–V10). These predictors have very low or close to zero importance scores compared to V1, V2, V4, and V5, which have much higher importance scores.

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

Fit another random forest model to these data. Did the importance score for V1 change?

V1 value has reduced from 56.8894847 to 54.694421.

What happens when you add another predictor that is also highly correlated with V1?

When another highly correlated predictor is added the importance of V1 decreased even more now it’s 36.2976172 which is significantly smaller than the two values before.

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

The best model shows n.trees = 300, interaction.depth = 1, shrinkage =0.1 and n.minobsinnode = 3.

V1, V4, and V2 are the most important predictors, as they have the highest importance scores (4346.44, 4827.29, and 4171.55, respectively).

Cubist

The best model shows committees = 7 and neighbors = 8.

8.2

Use a simulation to show tree bias with different granularities

8.3

a) Why does the model on the right focus its importance on just the first fewof predictors, whereas the model on the left spreads importance across more predictors?

b) Which model would be more predictive of other samples?

The model on the left would likely be more predictive of other samples, as it spreads importance across more features, reducing the chance of overfitting.

c) How would increasing interaction depth affect the slope of predictor importance?

Increasing interaction depth would likely make the slope steeper for both models by giving more importance to predictors involved in complex interactions. This effect would be stronger in the right model,

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

Random Forest

Bagged Tree

Boosted Tree (using GBM)

Cubist

a) Which tree-based regression model gives the optimal resampling and test set performance?

The Random Forest model is the optimal choice for this dataset, as it provides the best performance across RMSE, R-squared, and MAE. This suggests that Random Forest is better suited to capture the patterns in the ChemicalManufacturingProcess dataset compared to Bagged Tree, Boosted Tree, or Cubist.

b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

c) Plot the optimal single tree with the distribution of yield in the terminalnodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?