Homework 9 D624

Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.

8.1)

#Simulate the Data
library(mlbench)
#install.packages("mlbench")
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

#Fit a random forest model
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
print(rfImp1)
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788
#Adding a correlated predictor
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
#Fit another random forest model and observe change in the V1 score
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
print(rfImp2)
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581
#Using cforest from the party Package

library(party)
model_cforest <- cforest(y ~ ., data = simulated)

#Calculate predictor importance using the varimp function
traditional_importance <- varimp(model_cforest, conditional = FALSE)
conditional_importance <- varimp(model_cforest, conditional = TRUE)

print(traditional_importance)
##            V1            V2            V3            V4            V5 
##  4.6171158805  6.0579730772  0.0003116115  7.6223892727  1.7161194047 
##            V6            V7            V8            V9           V10 
## -0.0289427183  0.0465374951 -0.0380965511  0.0046062409 -0.0310326410 
##    duplicate1 
##  5.0941897280
print(conditional_importance)
##           V1           V2           V3           V4           V5           V6 
##  1.807953145  4.688980360  0.012878752  6.190578351  1.051666850  0.028174759 
##           V7           V8           V9          V10   duplicate1 
## -0.011709437 -0.004356587  0.015118505 -0.022190587  1.926751729
library(gbm)
#Boosted tree, fit on the original ten predictors (the 1:11 column selection excludes duplicate1)
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(100)

gbmTune <- train(y ~ ., data = simulated[, c(1:11)],
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

rfImp5 <- varImp(gbmTune$finalModel, scale = FALSE)

rfImp5
##       Overall
## V1  4634.0234
## V2  4316.6044
## V3  1310.7178
## V4  4287.0793
## V5  1844.2147
## V6   416.3966
## V7   368.9383
## V8   224.4769
## V9   246.1400
## V10  250.6263
#Fit a Cubist model on the same ten predictors (duplicate1 again excluded)
set.seed(100)
cubistTuned <- train(y ~ ., data = simulated[, c(1:11)], method = "cubist")
rfImp6 <- varImp(cubistTuned$finalModel, scale = FALSE)

rfImp6
##     Overall
## V1     72.0
## V3     42.0
## V2     54.5
## V4     49.0
## V5     40.0
## V6     11.0
## V7      0.0
## V8      0.0
## V9      0.0
## V10     0.0

Boosted trees emphasize the strongest predictors and spread importance across correlated predictors, which can dilute the score of any one member of a correlated pair, yet they still assign some (small) importance to the uninformative predictors V6-V10.

Cubist is stricter about ignoring uninformative predictors: V7-V10 receive an importance of zero, and credit is concentrated on the variables its rules actually use, especially when redundancy is detected among correlated variables.

The difference in scoring reflects how the models operate: boosted trees are iterative and error-focused, while Cubist builds rule-based models and is more selective about which predictors it includes.
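
Note that the boosted tree and Cubist fits above use simulated[, 1:11], which drops duplicate1. A quick way to check whether these models also share importance across the correlated pair is to refit on the full data frame; this is only a sketch, reusing gbmGrid from above, and the exact scores will depend on the tuning results.

#Sketch: refit the boosted tree and Cubist including duplicate1
set.seed(100)
gbmTuneDup <- train(y ~ ., data = simulated,
                    method = "gbm",
                    tuneGrid = gbmGrid,
                    verbose = FALSE)
set.seed(100)
cubistDup <- train(y ~ ., data = simulated, method = "cubist")

#If these models behave like the random forest, part of V1's importance should
#shift to duplicate1 in the boosted model, while Cubist may keep most of it on V1
varImp(gbmTuneDup$finalModel, scale = FALSE)
varImp(cubistDup$finalModel, scale = FALSE)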

8.2)

Use a simulation to show tree bias with different granularities.

# Step 1: Set Seed for Reproducibility
set.seed(323)

# Step 2: Simulate Predictors with Varying Granularities
a <- sample(seq(0.1, 1, by = 0.1), 500, replace = TRUE)      # Low granularity
b <- sample(seq(0.01, 1, by = 0.01), 500, replace = TRUE)    # Moderate granularity
c <- sample(seq(0.001, 1, by = 0.001), 500, replace = TRUE)  # Higher granularity
d <- sample(seq(0.0001, 1, by = 0.0001), 500, replace = TRUE) # Very high granularity
e <- sample(seq(0.00001, 1, by = 0.00001), 500, replace = TRUE) # Extreme granularity

# Step 3: Response Variable (Sum of Predictors)
y <- a + b + c + d + e

# Combine Data into a Data Frame
simData <- data.frame(a, b, c, d, e, y)

# Step 4: Fit a Regression Tree Model
library(rpart)
library(partykit)  # For visualizing the tree
rpartTree <- rpart(y ~ ., data = simData)

# Step 5: Visualize the Tree
plot(as.party(rpartTree), gp = gpar(fontsize = 7))

# Optional Step: Print Tree Summary
summary(rpartTree)
## Call:
## rpart(formula = y ~ ., data = simData)
##   n= 500 
## 
##            CP nsplit rel error    xerror       xstd
## 1  0.23218354      0 1.0000000 1.0021300 0.05658153
## 2  0.10083278      1 0.7678165 0.8048953 0.04818408
## 3  0.07447076      2 0.6669837 0.7962823 0.04708562
## 4  0.05549665      3 0.5925129 0.7465688 0.04459921
## 5  0.04550840      4 0.5370163 0.7008307 0.04173267
## 6  0.03190641      5 0.4915079 0.6558520 0.03842755
## 7  0.02725497      6 0.4596015 0.6168658 0.03461709
## 8  0.02547962      7 0.4323465 0.6021841 0.03448468
## 9  0.02544519      8 0.4068669 0.5988423 0.03405274
## 10 0.02460888      9 0.3814217 0.5926054 0.03394358
## 11 0.02290298     10 0.3568128 0.5771842 0.03336694
## 12 0.01575687     11 0.3339098 0.5193457 0.02950763
## 13 0.01492863     12 0.3181529 0.5011157 0.02873436
## 14 0.01324927     13 0.3032243 0.4816342 0.02815082
## 15 0.01243044     14 0.2899750 0.4635757 0.02731975
## 16 0.01187887     15 0.2775446 0.4555549 0.02704505
## 17 0.01176391     16 0.2656657 0.4473977 0.02633426
## 18 0.01155576     17 0.2539018 0.4461843 0.02613483
## 19 0.01102545     18 0.2423461 0.4381016 0.02600914
## 20 0.01033534     19 0.2313206 0.4290242 0.02530933
## 21 0.01016613     20 0.2209853 0.4271599 0.02515414
## 22 0.01000000     21 0.2108191 0.4238257 0.02516233
## 
## Variable importance
##  a  d  e  c  b 
## 28 21 19 19 13 
## 
## Node number 1: 500 observations,    complexity param=0.2321835
##   mean=2.494754, MSE=0.4280819 
##   left son=2 (294 obs) right son=3 (206 obs)
##   Primary splits:
##       a < 0.65     to the left,  improve=0.2321835, (0 missing)
##       d < 0.43065  to the left,  improve=0.1778844, (0 missing)
##       c < 0.3225   to the left,  improve=0.1661138, (0 missing)
##       e < 0.56415  to the left,  improve=0.1640225, (0 missing)
##       b < 0.565    to the left,  improve=0.1526707, (0 missing)
##   Surrogate splits:
##       e < 0.94187  to the left,  agree=0.610, adj=0.053, (0 split)
##       d < 0.9963   to the left,  agree=0.596, adj=0.019, (0 split)
##       c < 0.0045   to the right, agree=0.594, adj=0.015, (0 split)
##       b < 0.945    to the left,  agree=0.592, adj=0.010, (0 split)
## 
## Node number 2: 294 observations,    complexity param=0.1008328
##   mean=2.230854, MSE=0.3468594 
##   left son=4 (120 obs) right son=5 (174 obs)
##   Primary splits:
##       e < 0.353625 to the left,  improve=0.21164000, (0 missing)
##       c < 0.3365   to the left,  improve=0.20933550, (0 missing)
##       d < 0.3604   to the left,  improve=0.20502470, (0 missing)
##       b < 0.605    to the left,  improve=0.13287080, (0 missing)
##       a < 0.45     to the left,  improve=0.08155338, (0 missing)
##   Surrogate splits:
##       c < 0.162    to the left,  agree=0.605, adj=0.033, (0 split)
##       b < 0.015    to the left,  agree=0.595, adj=0.008, (0 split)
##       d < 0.97175  to the right, agree=0.595, adj=0.008, (0 split)
## 
## Node number 3: 206 observations,    complexity param=0.07447076
##   mean=2.871388, MSE=0.3027549 
##   left son=6 (94 obs) right son=7 (112 obs)
##   Primary splits:
##       d < 0.43055  to the left,  improve=0.255578500, (0 missing)
##       b < 0.435    to the left,  improve=0.210222500, (0 missing)
##       e < 0.54058  to the left,  improve=0.175351600, (0 missing)
##       c < 0.726    to the left,  improve=0.126577600, (0 missing)
##       a < 0.75     to the left,  improve=0.009654511, (0 missing)
##   Surrogate splits:
##       b < 0.075    to the left,  agree=0.563, adj=0.043, (0 split)
##       e < 0.54058  to the left,  agree=0.558, adj=0.032, (0 split)
##       c < 0.0045   to the left,  agree=0.553, adj=0.021, (0 split)
## 
## Node number 4: 120 observations,    complexity param=0.0455084
##   mean=1.904597, MSE=0.2554564 
##   left son=8 (44 obs) right son=9 (76 obs)
##   Primary splits:
##       c < 0.306    to the left,  improve=0.31775360, (0 missing)
##       d < 0.41925  to the left,  improve=0.28262920, (0 missing)
##       b < 0.415    to the left,  improve=0.20582300, (0 missing)
##       a < 0.45     to the left,  improve=0.12533690, (0 missing)
##       e < 0.11365  to the left,  improve=0.09756645, (0 missing)
##   Surrogate splits:
##       e < 0.022775 to the left,  agree=0.667, adj=0.091, (0 split)
##       d < 0.2102   to the left,  agree=0.658, adj=0.068, (0 split)
## 
## Node number 5: 174 observations,    complexity param=0.05549665
##   mean=2.455859, MSE=0.2858595 
##   left son=10 (103 obs) right son=11 (71 obs)
##   Primary splits:
##       c < 0.623    to the left,  improve=0.23881510, (0 missing)
##       d < 0.3604   to the left,  improve=0.21725210, (0 missing)
##       b < 0.34     to the left,  improve=0.20125150, (0 missing)
##       e < 0.69551  to the left,  improve=0.09804316, (0 missing)
##       a < 0.45     to the left,  improve=0.07519922, (0 missing)
##   Surrogate splits:
##       e < 0.87638  to the left,  agree=0.603, adj=0.028, (0 split)
##       b < 0.025    to the right, agree=0.598, adj=0.014, (0 split)
## 
## Node number 6: 94 observations,    complexity param=0.02547962
##   mean=2.567752, MSE=0.2235772 
##   left son=12 (46 obs) right son=13 (48 obs)
##   Primary splits:
##       b < 0.485    to the left,  improve=0.25949830, (0 missing)
##       c < 0.7055   to the left,  improve=0.22704820, (0 missing)
##       e < 0.68266  to the left,  improve=0.18339600, (0 missing)
##       d < 0.10175  to the left,  improve=0.06530835, (0 missing)
##       a < 0.85     to the left,  improve=0.04939411, (0 missing)
##   Surrogate splits:
##       d < 0.167    to the right, agree=0.606, adj=0.196, (0 split)
##       c < 0.517    to the right, agree=0.574, adj=0.130, (0 split)
##       e < 0.390695 to the right, agree=0.553, adj=0.087, (0 split)
##       a < 0.75     to the right, agree=0.543, adj=0.065, (0 split)
## 
## Node number 7: 112 observations,    complexity param=0.03190641
##   mean=3.126225, MSE=0.2268881 
##   left son=14 (36 obs) right son=15 (76 obs)
##   Primary splits:
##       b < 0.295    to the left,  improve=0.26874790, (0 missing)
##       e < 0.56941  to the left,  improve=0.21442510, (0 missing)
##       c < 0.545    to the left,  improve=0.19569370, (0 missing)
##       d < 0.6415   to the left,  improve=0.05703654, (0 missing)
##       a < 0.75     to the left,  improve=0.01405603, (0 missing)
##   Surrogate splits:
##       c < 0.906    to the right, agree=0.688, adj=0.028, (0 split)
##       e < 0.06761  to the left,  agree=0.688, adj=0.028, (0 split)
## 
## Node number 8: 44 observations,    complexity param=0.01102545
##   mean=1.530156, MSE=0.1464124 
##   left son=16 (19 obs) right son=17 (25 obs)
##   Primary splits:
##       d < 0.2996   to the left,  improve=0.36632180, (0 missing)
##       b < 0.515    to the left,  improve=0.29567440, (0 missing)
##       e < 0.100205 to the left,  improve=0.16575780, (0 missing)
##       a < 0.25     to the left,  improve=0.08504861, (0 missing)
##       c < 0.104    to the left,  improve=0.05756090, (0 missing)
##   Surrogate splits:
##       b < 0.04     to the left,  agree=0.614, adj=0.105, (0 split)
##       c < 0.226    to the right, agree=0.614, adj=0.105, (0 split)
##       a < 0.35     to the right, agree=0.591, adj=0.053, (0 split)
## 
## Node number 9: 76 observations,    complexity param=0.02460888
##   mean=2.121379, MSE=0.1904205 
##   left son=18 (31 obs) right son=19 (45 obs)
##   Primary splits:
##       d < 0.3773   to the left,  improve=0.36396660, (0 missing)
##       a < 0.45     to the left,  improve=0.31683720, (0 missing)
##       b < 0.465    to the left,  improve=0.21318170, (0 missing)
##       c < 0.7965   to the left,  improve=0.07443169, (0 missing)
##       e < 0.192335 to the left,  improve=0.05967533, (0 missing)
##   Surrogate splits:
##       b < 0.81     to the right, agree=0.645, adj=0.129, (0 split)
##       e < 0.295615 to the right, agree=0.632, adj=0.097, (0 split)
## 
## Node number 10: 103 observations,    complexity param=0.02544519
##   mean=2.23893, MSE=0.2108027 
##   left son=20 (64 obs) right son=21 (39 obs)
##   Primary splits:
##       b < 0.595    to the left,  improve=0.25083570, (0 missing)
##       d < 0.0839   to the left,  improve=0.23352090, (0 missing)
##       c < 0.0415   to the left,  improve=0.13071970, (0 missing)
##       a < 0.25     to the left,  improve=0.11282860, (0 missing)
##       e < 0.588155 to the left,  improve=0.08867248, (0 missing)
##   Surrogate splits:
##       e < 0.394675 to the right, agree=0.650, adj=0.077, (0 split)
##       d < 0.0109   to the right, agree=0.641, adj=0.051, (0 split)
## 
## Node number 11: 71 observations,    complexity param=0.02725497
##   mean=2.770559, MSE=0.2274412 
##   left son=22 (24 obs) right son=23 (47 obs)
##   Primary splits:
##       d < 0.3604   to the left,  improve=0.36125610, (0 missing)
##       e < 0.6826   to the left,  improve=0.27973680, (0 missing)
##       b < 0.545    to the left,  improve=0.25715570, (0 missing)
##       a < 0.15     to the left,  improve=0.11839750, (0 missing)
##       c < 0.9035   to the left,  improve=0.02365452, (0 missing)
##   Surrogate splits:
##       c < 0.667    to the left,  agree=0.690, adj=0.083, (0 split)
##       b < 0.045    to the left,  agree=0.676, adj=0.042, (0 split)
## 
## Node number 12: 46 observations,    complexity param=0.01033534
##   mean=2.321703, MSE=0.145168 
##   left son=24 (20 obs) right son=25 (26 obs)
##   Primary splits:
##       e < 0.41158  to the left,  improve=0.33127820, (0 missing)
##       c < 0.525    to the left,  improve=0.25854880, (0 missing)
##       a < 0.85     to the left,  improve=0.17549160, (0 missing)
##       d < 0.1208   to the left,  improve=0.12440450, (0 missing)
##       b < 0.29     to the left,  improve=0.05403345, (0 missing)
##   Surrogate splits:
##       b < 0.26     to the right, agree=0.630, adj=0.15, (0 split)
##       c < 0.6115   to the right, agree=0.609, adj=0.10, (0 split)
##       d < 0.00525  to the left,  agree=0.609, adj=0.10, (0 split)
## 
## Node number 13: 48 observations,    complexity param=0.01492863
##   mean=2.80355, MSE=0.1851008 
##   left son=26 (16 obs) right son=27 (32 obs)
##   Primary splits:
##       c < 0.383    to the left,  improve=0.3596393, (0 missing)
##       e < 0.690655 to the left,  improve=0.2643536, (0 missing)
##       b < 0.905    to the left,  improve=0.1551256, (0 missing)
##       d < 0.3325   to the left,  improve=0.1185120, (0 missing)
##       a < 0.85     to the left,  improve=0.0476301, (0 missing)
##   Surrogate splits:
##       d < 0.02435  to the left,  agree=0.688, adj=0.063, (0 split)
## 
## Node number 14: 36 observations,    complexity param=0.01324927
##   mean=2.76744, MSE=0.1543918 
##   left son=28 (19 obs) right son=29 (17 obs)
##   Primary splits:
##       c < 0.5385   to the left,  improve=0.51022560, (0 missing)
##       e < 0.494075 to the left,  improve=0.24646140, (0 missing)
##       d < 0.6821   to the left,  improve=0.14247690, (0 missing)
##       b < 0.145    to the left,  improve=0.08971731, (0 missing)
##       a < 0.95     to the left,  improve=0.00424777, (0 missing)
##   Surrogate splits:
##       a < 0.75     to the right, agree=0.667, adj=0.294, (0 split)
##       d < 0.92445  to the left,  agree=0.667, adj=0.294, (0 split)
##       e < 0.494075 to the left,  agree=0.611, adj=0.176, (0 split)
##       b < 0.045    to the right, agree=0.556, adj=0.059, (0 split)
## 
## Node number 15: 76 observations,    complexity param=0.01575687
##   mean=3.296176, MSE=0.1713695 
##   left son=30 (36 obs) right son=31 (40 obs)
##   Primary splits:
##       e < 0.56941  to the left,  improve=0.25895220, (0 missing)
##       c < 0.797    to the left,  improve=0.22478680, (0 missing)
##       b < 0.635    to the left,  improve=0.11625230, (0 missing)
##       d < 0.90795  to the left,  improve=0.08394581, (0 missing)
##       a < 0.75     to the left,  improve=0.02900237, (0 missing)
##   Surrogate splits:
##       a < 0.75     to the left,  agree=0.592, adj=0.139, (0 split)
##       c < 0.4145   to the right, agree=0.592, adj=0.139, (0 split)
##       b < 0.845    to the right, agree=0.579, adj=0.111, (0 split)
##       d < 0.9885   to the right, agree=0.566, adj=0.083, (0 split)
## 
## Node number 16: 19 observations
##   mean=1.264504, MSE=0.1018277 
## 
## Node number 17: 25 observations
##   mean=1.732052, MSE=0.08590081 
## 
## Node number 18: 31 observations
##   mean=1.804194, MSE=0.07559345 
## 
## Node number 19: 45 observations,    complexity param=0.01155576
##   mean=2.339885, MSE=0.1524723 
##   left son=38 (27 obs) right son=39 (18 obs)
##   Primary splits:
##       a < 0.45     to the left,  improve=0.3604892, (0 missing)
##       b < 0.49     to the left,  improve=0.3585369, (0 missing)
##       d < 0.78125  to the left,  improve=0.2087995, (0 missing)
##       e < 0.243245 to the left,  improve=0.1249551, (0 missing)
##       c < 0.8505   to the left,  improve=0.1240914, (0 missing)
##   Surrogate splits:
##       c < 0.9385   to the left,  agree=0.667, adj=0.167, (0 split)
##       d < 0.9222   to the left,  agree=0.644, adj=0.111, (0 split)
##       e < 0.32271  to the left,  agree=0.644, adj=0.111, (0 split)
## 
## Node number 20: 64 observations,    complexity param=0.02290298
##   mean=2.059426, MSE=0.1716652 
##   left son=40 (21 obs) right son=41 (43 obs)
##   Primary splits:
##       d < 0.3569   to the left,  improve=0.44619690, (0 missing)
##       a < 0.45     to the left,  improve=0.25882950, (0 missing)
##       b < 0.215    to the left,  improve=0.15155740, (0 missing)
##       c < 0.0415   to the left,  improve=0.15020180, (0 missing)
##       e < 0.68938  to the left,  improve=0.07323307, (0 missing)
##   Surrogate splits:
##       b < 0.51     to the right, agree=0.734, adj=0.190, (0 split)
##       c < 0.0415   to the left,  agree=0.719, adj=0.143, (0 split)
##       e < 0.93172  to the right, agree=0.703, adj=0.095, (0 split)
## 
## Node number 21: 39 observations,    complexity param=0.01016613
##   mean=2.533501, MSE=0.1353791 
##   left son=42 (23 obs) right son=43 (16 obs)
##   Primary splits:
##       d < 0.45985  to the left,  improve=0.41213210, (0 missing)
##       e < 0.57714  to the left,  improve=0.23353110, (0 missing)
##       c < 0.2795   to the left,  improve=0.09536788, (0 missing)
##       a < 0.25     to the left,  improve=0.07568759, (0 missing)
##       b < 0.805    to the left,  improve=0.06390883, (0 missing)
##   Surrogate splits:
##       a < 0.55     to the left,  agree=0.641, adj=0.125, (0 split)
##       b < 0.785    to the right, agree=0.615, adj=0.063, (0 split)
##       e < 0.387345 to the right, agree=0.615, adj=0.063, (0 split)
## 
## Node number 22: 24 observations,    complexity param=0.01243044
##   mean=2.369429, MSE=0.1563738 
##   left son=44 (16 obs) right son=45 (8 obs)
##   Primary splits:
##       b < 0.57     to the left,  improve=0.70893770, (0 missing)
##       d < 0.13255  to the left,  improve=0.32503910, (0 missing)
##       a < 0.25     to the left,  improve=0.13540690, (0 missing)
##       e < 0.558405 to the left,  improve=0.10050460, (0 missing)
##       c < 0.9085   to the left,  improve=0.04950711, (0 missing)
##   Surrogate splits:
##       c < 0.6885   to the right, agree=0.750, adj=0.250, (0 split)
##       e < 0.6853   to the left,  agree=0.750, adj=0.250, (0 split)
##       d < 0.1888   to the left,  agree=0.708, adj=0.125, (0 split)
## 
## Node number 23: 47 observations,    complexity param=0.01187887
##   mean=2.975391, MSE=0.13961 
##   left son=46 (26 obs) right son=47 (21 obs)
##   Primary splits:
##       e < 0.70399  to the left,  improve=0.3874874, (0 missing)
##       a < 0.15     to the left,  improve=0.2050755, (0 missing)
##       b < 0.71     to the left,  improve=0.2025735, (0 missing)
##       d < 0.7688   to the left,  improve=0.1106212, (0 missing)
##       c < 0.847    to the left,  improve=0.0293410, (0 missing)
##   Surrogate splits:
##       c < 0.9035   to the left,  agree=0.660, adj=0.238, (0 split)
##       a < 0.35     to the left,  agree=0.596, adj=0.095, (0 split)
##       b < 0.055    to the right, agree=0.574, adj=0.048, (0 split)
##       d < 0.4804   to the left,  agree=0.574, adj=0.048, (0 split)
## 
## Node number 24: 20 observations
##   mean=2.071666, MSE=0.1153821 
## 
## Node number 25: 26 observations
##   mean=2.514038, MSE=0.08299623 
## 
## Node number 26: 16 observations
##   mean=2.438667, MSE=0.1048169 
## 
## Node number 27: 32 observations
##   mean=2.985991, MSE=0.1253885 
## 
## Node number 28: 19 observations
##   mean=2.501955, MSE=0.08534179 
## 
## Node number 29: 17 observations
##   mean=3.064159, MSE=0.06474845 
## 
## Node number 30: 36 observations
##   mean=3.074123, MSE=0.124228 
## 
## Node number 31: 40 observations,    complexity param=0.01176391
##   mean=3.496023, MSE=0.1294815 
##   left son=62 (23 obs) right son=63 (17 obs)
##   Primary splits:
##       c < 0.495    to the left,  improve=0.486161600, (0 missing)
##       d < 0.9054   to the left,  improve=0.235055400, (0 missing)
##       b < 0.625    to the left,  improve=0.187899300, (0 missing)
##       e < 0.86598  to the left,  improve=0.158909100, (0 missing)
##       a < 0.85     to the left,  improve=0.005767893, (0 missing)
##   Surrogate splits:
##       b < 0.935    to the left,  agree=0.650, adj=0.176, (0 split)
##       d < 0.9054   to the left,  agree=0.650, adj=0.176, (0 split)
##       e < 0.86598  to the left,  agree=0.625, adj=0.118, (0 split)
## 
## Node number 38: 27 observations
##   mean=2.148461, MSE=0.07506179 
## 
## Node number 39: 18 observations
##   mean=2.627021, MSE=0.1311765 
## 
## Node number 40: 21 observations
##   mean=1.663395, MSE=0.08807269 
## 
## Node number 41: 43 observations
##   mean=2.252836, MSE=0.09848541 
## 
## Node number 42: 23 observations
##   mean=2.33649, MSE=0.07741102 
## 
## Node number 43: 16 observations
##   mean=2.816704, MSE=0.08271016 
## 
## Node number 44: 16 observations
##   mean=2.133994, MSE=0.0564159 
## 
## Node number 45: 8 observations
##   mean=2.840299, MSE=0.02371172 
## 
## Node number 46: 26 observations
##   mean=2.766361, MSE=0.07663657 
## 
## Node number 47: 21 observations
##   mean=3.234191, MSE=0.09650266 
## 
## Node number 62: 23 observations
##   mean=3.28032, MSE=0.08316986 
## 
## Node number 63: 17 observations
##   mean=3.787855, MSE=0.04402332

Meaning of the results

The variable importance summary shows that variable a, the coarsest predictor (0.1 increments), has by far the highest importance (28), while variable b, the second-coarsest (0.01 increments), is actually the lowest (13); the finer-grained predictors d, e, and c fall in between (21, 19, and 19). The ordering is therefore not strictly monotonic in granularity, but the coarsest predictor clearly dominates even though every predictor contributes equally to y.

The decision tree’s initial split occurs on Variable a at a threshold of a < 0.65. This split, focusing on the coarsest variable, yields the greatest improvement in the model, quantified as 0.2321835. This is particularly noteworthy because the true underlying model (y = a + b + c + d + e) assigns equal importance to all variables. Subsequent splits within the tree continue to favor variables with lower granularity. Even when finer-grained variables are utilized for splitting, they typically appear deeper within the tree structure.

Several factors contribute to this observed preference for coarser variables. Information Gain Preference plays a role, as variables with fewer distinct values (lower granularity) facilitate the creation of cleaner and more decisive splits. Finer-grained variables, possessing a greater number of potential split points, make it more challenging for the algorithm to identify truly optimal splits. Computational Efficiency also influences this bias, as the tree algorithm evaluates fewer potential splits for coarser variables, increasing their likelihood of selection during the greedy splitting process. Furthermore, with a limited dataset size of n=500, finer-grained variables can appear more “noisy,” leading the algorithm to favor the more stable splits offered by their coarser counterparts.

The implications of this granularity bias are significant for both Model Interpretation and Model Building. The tree structure might misleadingly suggest that coarser variables are more important in predicting the outcome, while in reality, as demonstrated by the simulation, all variables contribute equally. In real-world applications where variables are measured at varying granularities, this bias could lead to suboptimal feature selection.
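
One rough way to confirm that this is a systematic bias rather than an artifact of a single seed is to repeat the simulation many times and tally which predictor rpart chooses for the root split. The sketch below assumes the same five granularities and sample size as above.

# Sketch: tally the root-split variable over repeated simulations
rootSplitVar <- function(seed) {
  set.seed(seed)
  a <- sample(seq(0.1, 1, by = 0.1), 500, replace = TRUE)
  b <- sample(seq(0.01, 1, by = 0.01), 500, replace = TRUE)
  c <- sample(seq(0.001, 1, by = 0.001), 500, replace = TRUE)
  d <- sample(seq(0.0001, 1, by = 0.0001), 500, replace = TRUE)
  e <- sample(seq(0.00001, 1, by = 0.00001), 500, replace = TRUE)
  dat <- data.frame(a, b, c, d, e, y = a + b + c + d + e)
  fit <- rpart(y ~ ., data = dat)
  as.character(fit$frame$var[1])  # predictor used at the root node
}

# Counts of how often each predictor wins the first split across 50 data sets
table(sapply(1:50, rootSplitVar))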

8.3)

a) Why does the model on the right focus on just the first few predictors, while the one on the left spreads importance across more predictors?

Model on the right (bagging fraction = 0.9, learning rate = 0.9): A high bagging fraction lets each tree see most of the data set, and combined with a high learning rate the algorithm converges quickly, leaning heavily on a small subset of predictors. The model concentrates importance on only the most influential variables and largely ignores the rest, which risks overfitting to that narrow set of predictors.

Model on the left (bagging fraction = 0.1, learning rate = 0.1): A lower bagging fraction means each tree is built on smaller, randomized subsets of data. Combined with a lower learning rate, this forces the model to spread importance across a larger range of predictors as it learns incrementally and more robustly. Predictors that might have secondary or minor contributions are also given more weight.
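
As a small sketch of this contrast (the textbook figure uses the solubility data, which is not loaded here), the two settings can be imitated on the simulated data from 8.1; n.minobsinnode is lowered so gbm accepts the 10% bagging fraction on such a small training set.

# Sketch: boosted trees mimicking the left (0.1/0.1) and right (0.9/0.9) panels
set.seed(100)
inTrainSim <- createDataPartition(simulated$y, p = 0.75, list = FALSE)
simTrain <- simulated[inTrainSim, 1:11]   # original ten predictors plus y
simTest  <- simulated[-inTrainSim, 1:11]

gbmLow  <- gbm(y ~ ., data = simTrain, distribution = "gaussian", n.trees = 1000,
               shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5)
gbmHigh <- gbm(y ~ ., data = simTrain, distribution = "gaussian", n.trees = 1000,
               shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5)

# Relative influence: expected to be spread out for gbmLow and concentrated for gbmHigh
summary(gbmLow,  plotit = FALSE)
summary(gbmHigh, plotit = FALSE)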

b) Which model would be more predictive of other samples? The left model (bagging fraction = 0.1, learning rate = 0.1) is likely to generalize better to new data. Its slower learning process and the greater tree diversity that comes from using smaller data subsets reduce the risk of overfitting, and by incorporating a wider array of predictors it produces a more balanced, comprehensive model.

The right model, while potentially accurate on the training data, risks overfitting because it relies heavily on just a few predictors; that narrow focus makes it less adaptable to new or unseen samples.
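
The sketch below reuses the two models and held-out rows from the code above to check this claim directly; a lower test RMSE is better.

# Sketch: held-out RMSE for the two settings fit above
rmseTest <- function(fit) sqrt(mean((simTest$y - predict(fit, simTest, n.trees = 1000))^2))
c(low = rmseTest(gbmLow), high = rmseTest(gbmHigh))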

c) How would increasing interaction depth affect the slope of predictor importance for either model? Interaction depth controls how complex each tree can become, i.e., how many variable interactions it can represent:

Increasing interaction depth: deeper, more complex trees tend to amplify the differences in predictor importance. Variables involved in strong interactions, typically the already-influential predictors, have their importance boosted further, steepening the slope of the predictor-importance curve.

For the left model: the importance scores would still be distributed across more predictors, but with deeper trees the stronger predictors might start to dominate more noticeably.

For the right model: since it already concentrates on a few predictors, deeper trees would likely exaggerate this effect, further increasing the dominance of the top predictors while diminishing the importance of the rest.
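
A similar sketch can probe the depth effect: refit the aggressive (0.9/0.9) settings at interaction depths 1 and 7 and compare how quickly the relative influence falls off across predictors (this reuses simTrain from the earlier sketch).

# Sketch: relative influence at two interaction depths for the 0.9/0.9 settings
deepInfl <- lapply(c(1, 7), function(d) {
  set.seed(100)
  summary(gbm(y ~ ., data = simTrain, distribution = "gaussian", n.trees = 1000,
              shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5,
              interaction.depth = d),
          plotit = FALSE)
})
names(deepInfl) <- c("depth_1", "depth_7")
deepInfl  # a steeper drop-off in rel.inf at depth 7 would support the reasoning above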

8.7)

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models.

(Exercise 6.3: A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.)

# Load required libraries
library(AppliedPredictiveModeling)  # For the chemical manufacturing data
library(caret)       # For modeling and preprocessing
library(rpart)       # For single trees
library(rpart.plot)  # For tree visualization
library(randomForest)# For random forests
library(gbm)         # For gradient boosting machines
library(ipred)       # For bagged trees
### Step 1: Load and Prepare Data ###
# Load the chemical manufacturing process data
data(ChemicalManufacturingProcess)

# Extract predictors and response
predictors <- ChemicalManufacturingProcess[, -1]  # All columns except Yield
response <- ChemicalManufacturingProcess$Yield    # Yield is the response variable

### Step 2: Impute Missing Values ###
# Create pre-processing method to impute missing values using k-nearest neighbors
preProc <- preProcess(predictors, method = "knnImpute")

# Apply the imputation to the predictor data
imputedPredictors <- predict(preProc, predictors)

### Step 3: Remove Near-Zero Variance Predictors ###
# Identify and remove predictors with near-zero variance
nzv <- nearZeroVar(imputedPredictors)
if(length(nzv) > 0) {
  imputedPredictors <- imputedPredictors[, -nzv]
}

### Step 4: Identify Correlated Predictors ###
# Calculate correlation matrix and find highly correlated predictors
corMatrix <- cor(imputedPredictors)
highCorr <- findCorrelation(corMatrix, cutoff = 0.9)
if(length(highCorr) > 0) {
  imputedPredictors <- imputedPredictors[, -highCorr]
}

### Step 5: Center and Scale Predictors ###
# Center and scale all predictors (mean=0, sd=1)
preProcParams <- preProcess(imputedPredictors, method = c("center", "scale"))
processedPredictors <- predict(preProcParams, imputedPredictors)

### Step 6: Create Final Dataset ###
# Combine processed predictors with response
finalData <- data.frame(processedPredictors, Yield = response)

### Step 7: Data Splitting ###
# Create stratified random split (75% training, 25% testing)
set.seed(123)  # For reproducibility
trainIndex <- createDataPartition(finalData$Yield, p = 0.75, list = FALSE)
trainData <- finalData[trainIndex, ]
testData <- finalData[-trainIndex, ]

### Step 8: Train Control Setup ###
# Configure resampling method (10-fold cross-validation)
ctrl <- trainControl(method = "cv", number = 10)

### Step 9: Train Different Tree-Based Models ###

# A) Single Regression Tree (CART)
set.seed(123)
cartModel <- train(Yield ~ ., 
                   data = trainData,
                   method = "rpart",
                   tuneLength = 10,
                   trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
# B) Bagged Trees
set.seed(123)
baggedModel <- train(Yield ~ .,
                     data = trainData,
                     method = "treebag",
                     trControl = ctrl)

# C) Random Forest
set.seed(123)
rfModel <- train(Yield ~ .,
                 data = trainData,
                 method = "rf",
                 tuneLength = 5,  # Number of mtry values to try
                 importance = TRUE,
                 trControl = ctrl)

# D) Gradient Boosting Machine (GBM)
set.seed(123)
gbmModel <- train(Yield ~ .,
                  data = trainData,
                  method = "gbm",
                  tuneLength = 5,  # Number of combinations to try
                  verbose = FALSE,
                  trControl = ctrl)

### Step 10: Compare Model Performance ###
# Collect all models
models <- list(CART = cartModel,
               Bagged = baggedModel,
               RF = rfModel,
               GBM = gbmModel)

# Extract resampling results
results <- resamples(models)

# Summarize performance metrics (RMSE, R-squared)
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: CART, Bagged, RF, GBM 
## Number of resamples: 10 
## 
## MAE 
##             Min.   1st Qu.    Median      Mean  3rd Qu.     Max. NA's
## CART   0.6717444 1.0691619 1.1194398 1.1290210 1.284819 1.399238    0
## Bagged 0.5291455 0.6935190 0.9062978 0.8963208 1.101562 1.288520    0
## RF     0.5968466 0.6752065 0.8380164 0.8756641 1.062368 1.195199    0
## GBM    0.6070728 0.7183006 0.8088874 0.8488673 1.020816 1.144700    0
## 
## RMSE 
##             Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## CART   0.7641467 1.3158443 1.395158 1.382098 1.543698 1.704534    0
## Bagged 0.7791655 0.8973311 1.170472 1.161674 1.432176 1.493835    0
## RF     0.7821437 0.8586156 1.156088 1.113859 1.341078 1.419569    0
## GBM    0.8169668 0.8936120 1.012913 1.094928 1.308232 1.455427    0
## 
## Rsquared 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## CART   0.1992020 0.4665218 0.5197715 0.5278977 0.6028799 0.8061917    0
## Bagged 0.4471003 0.6040089 0.6748113 0.6605904 0.7474090 0.8581289    0
## RF     0.5038359 0.6372653 0.7077093 0.7054304 0.7912648 0.8678798    0
## GBM    0.4937063 0.6103678 0.7289240 0.7005498 0.8172402 0.8461209    0
# Visual comparison of models
bwplot(results, metric = "RMSE")  # Lower is better

bwplot(results, metric = "Rsquared")  # Higher is better

### Step 11: Evaluate Best Model on Test Set ###
# Select best model based on cross-validation performance
bestModel <- rfModel  # Random forest selected from the cross-validation results; GBM performs comparably

# Make predictions on test set
predictions <- predict(bestModel, testData)

# Calculate test set performance
postResample(pred = predictions, obs = testData$Yield)
##      RMSE  Rsquared       MAE 
## 1.1828675 0.5445539 0.8565184
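# (Sketch) Check all four models on the test set, not just the selected one,
# to confirm that the cross-validation ranking carries over to new samples
testPerf <- sapply(models, function(m)
  postResample(pred = predict(m, testData), obs = testData$Yield))
round(testPerf, 3)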
### Step 12: Variable Importance ###
# Plot variable importance for the best model
rfVarImp <- varImp(bestModel)  # avoid reusing the name of randomForest::varImpPlot
plot(rfVarImp, top = 20)  # Top 20 most important variables

### Step 13: Visualize Single Tree (for CART model only) ###
# Plot the regression tree structure
prp(cartModel$finalModel, 
    extra = 1,  # Display extra information
    fallen.leaves = TRUE,  # Show leaves at bottom
    roundint = FALSE,  # Don't round to integers
    main = "Regression Tree for Chemical Yield Prediction")

The analysis of model accuracy, as measured by RMSE and R², shows that gradient boosting (GBM) and random forest (RF) achieve the lowest cross-validated RMSE, with means of roughly 1.09 and 1.11 and most resamples falling between about 0.8 and 1.4. This clearly outperforms the single decision tree (CART), whose RMSE averages about 1.38 and reaches 1.70 in the worst resample, so the ensembles cut prediction error by roughly 20% relative to the single tree. On explanatory power, GBM and RF again lead with mean R² values around 0.70, capturing a greater share of the variance in yield, while CART lags behind with a mean R² near 0.53.

An examination of variable importance highlights the dominance of manufacturing process variables, such as ManufacturingProcess32 and ManufacturingProcess30, in the top ranks. This suggests that adjustments and optimization within the manufacturing process are the main levers for maximizing yield. BiologicalMaterial03 also emerges as a key raw-material predictor, underscoring the importance of monitoring its quality. The CART tree's initial split occurs on ManufacturingProcess32 at a threshold of 0.19, followed by a split on ManufacturingProcess30 at 0.6 and then on BiologicalMaterial03 at 1.1; recall that the predictors were centered and scaled, so these thresholds are in standard-deviation units.

The terminal nodes of the tree, with sample sizes such as n=59 and n=17, indicate yield values clustering around 39–43 units for specific decision paths. This simplicity in the CART model’s structure helps explain its lower accuracy compared to the more complex ensemble methods.

The key takeaways from this analysis are that GBM and RF are the superior models for predicting yield due to their higher accuracy (low RMSE) and greater explanatory power (high R²).

Optimizing ManufacturingProcess32 and ManufacturingProcess30 should be prioritized, along with careful monitoring of BiologicalMaterial03 quality. While the single CART tree has higher prediction error because of its rigid splitting rules, it offers intuitive, easily interpretable rules that can provide initial guidance for understanding the process.
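
To see how predicted yield moves with the dominant process variable, a partial dependence sketch can be drawn from the random forest fit in Step 9C; keep in mind that the predictors were centered and scaled, so the x-axis is in standard-deviation units.

# Sketch: partial dependence of predicted Yield on ManufacturingProcess32
partialPlot(rfModel$finalModel,
            pred.data = trainData[, setdiff(names(trainData), "Yield")],
            x.var = "ManufacturingProcess32",
            main = "Partial dependence of Yield on ManufacturingProcess32")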
