Exercises:

8.1 Recreate the simulated data from Exercise 7.2:

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

y = 10 sin(π x1 x2) + 20 (x3 − 0.5)^2 + 10 x4 + 5 x5 + N(0, σ^2)

where the x values are random variables uniformly distributed between [0, 1] (there are also five other non-informative variables created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(mlbench)

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

Answer:

  • The fitted model is a regression random forest with ntree = 1000.
  • No. of variables tried at each split (mtry): 3
  • Mean of squared residuals: 6.754258
  • % Var explained: 72.3
  • Based on the importance() output, V1, V4 and V2 are the most important predictors.
  • The predictors V1-V5 have high %IncMSE values, indicating that the model's prediction error increases when these variables' values are randomly permuted (a manual sketch of this permutation idea follows the code below).
  • The predictors V7-V10 have negative values, which means permuting these variables actually improved model performance slightly, suggesting they are uninformative noise.
  • The increase in node purity (IncNodePurity, based on the reduction in residual sum of squares for regression trees) also shows that V1-V5 have high values and are therefore important.
  • Hence, the random forest model did not significantly use the uninformative predictors (V6-V10).
library(randomForest)

model1 <- randomForest(y ~ ., 
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
# model summary
print(model1)
## 
## Call:
##  randomForest(formula = y ~ ., data = simulated, importance = TRUE,      ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 6.754258
##                     % Var explained: 72.3
print(" ")
## [1] " "
# Extract variable importance:
# gives a table with %IncMSE (increase in mean squared error when a predictor is permuted)
# and IncNodePurity (total decrease in node impurity, i.e. residual sum of squares, from splits on that predictor)
rfImp1 <- importance(model1)
print( rfImp1)
##        %IncMSE IncNodePurity
## V1  57.1035713     1063.5046
## V2  47.0088894      941.2445
## V3  10.8342807      299.2105
## V4  51.6960809     1077.4011
## V5  22.0568333      478.4097
## V6   2.9419111      182.3768
## V7  -0.1001392      204.3141
## V8  -3.2180039      147.7224
## V9  -1.8067946      154.3783
## V10 -1.3228315      184.4402
# Plot variable importance
varImpPlot(model1, main = "Random Forest Default model with trees = 1000")
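The %IncMSE column is a permutation importance. The sketch below illustrates the idea by permuting each predictor in turn and recording the resulting increase in MSE; perm_importance is a hypothetical helper written here only for illustration, and unlike randomForest it re-predicts on the full training data rather than the out-of-bag samples, so its values differ in scale from %IncMSE.

# Minimal sketch of permutation importance (hypothetical helper; assumes model1
# and simulated exist as above)
perm_importance <- function(rf, data, response = "y") {
  base_mse <- mean((predict(rf, data) - data[[response]])^2)
  sapply(setdiff(names(data), response), function(v) {
    shuffled <- data
    shuffled[[v]] <- sample(shuffled[[v]])   # break this predictor's link to y
    mean((predict(rf, shuffled) - shuffled[[response]])^2) - base_mse
  })
}
sort(perm_importance(model1, simulated), decreasing = TRUE)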

b) Now add an additional predictor that is highly correlated with one of the informative predictors.

Answer:

  • The fitted model is again a regression random forest with ntree = 1000.

  • No. of variables tried at each split: 3

  • Mean of squared residuals: 6.922537

  • % Var explained: 71.61

  • As observed in the importance() output, the %IncMSE of V1 dropped significantly to 32.9767755 (the earlier value was 57.1035713). The other important predictors were not noticeably affected.

  • Hence the importance of V1 reduces when we add a predictor that is highly correlated with V1.

  • When we add another duplicate predictor (duplicate2), the importance score of V1 falls further to 28.4726297 (with only duplicate1 it was 32.9767755). It also reduced the importance of the first duplicate variable.

  • This happens because some of the trees in the random forest choose a duplicate predictor over V1 at a split, which is not unusual given their high degree of correlation.

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
model2_corr1 <- randomForest(y ~ ., 
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
# model summary
print(model2_corr1)
## 
## Call:
##  randomForest(formula = y ~ ., data = simulated, importance = TRUE,      ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 6.922537
##                     % Var explained: 71.61
# Extract variable importance:
# %IncMSE (permutation-based increase in MSE) and IncNodePurity (decrease in node RSS)
rfImp2_corr1 <- importance(model2_corr1)
print(rfImp2_corr1)
##               %IncMSE IncNodePurity
## V1         32.9767755      798.1209
## V2         47.1898096      833.3881
## V3         10.4550932      253.4221
## V4         51.0810910      937.5096
## V5         22.5698231      416.2464
## V6          2.8046502      164.9521
## V7         -0.2664751      170.5892
## V8         -1.0005274      126.4463
## V9          0.1925953      136.1965
## V10         0.5977125      161.6598
## duplicate1 28.4077023      701.3079
# Plot variable importance
varImpPlot(model2_corr1, main = "Random Forest Default model with trees = 1000, incl duplicate1")

# Add another duplicate variable
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1


model2_corr2 <- randomForest(y ~ ., 
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
# model summary
print(model2_corr2)
## 
## Call:
##  randomForest(formula = y ~ ., data = simulated, importance = TRUE,      ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 6.784205
##                     % Var explained: 72.18
# Extract variable importance:
# %IncMSE (permutation-based increase in MSE) and IncNodePurity (decrease in node RSS)
rfImp2_corr2 <- importance(model2_corr2)
print(rfImp2_corr2)
##               %IncMSE IncNodePurity
## V1         28.4726297      689.7017
## V2         50.7990567      859.5530
## V3         10.8410623      219.7327
## V4         54.8742896      952.4116
## V5         24.0380225      397.1578
## V6          3.3237982      147.4934
## V7          2.3928835      137.1524
## V8         -2.1256429      107.8981
## V9         -0.2832118      109.2677
## V10         2.1062647      126.0633
## duplicate1 23.2238354      604.6521
## duplicate2 17.0728309      370.5212
# Plot variable importance
varImpPlot(model2_corr2, main = "Random Forest Default model with trees = 1000, incl duplicate2")

c) Use the cforest function in the party package to fit a random forest model using conditional inference trees.

The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

Answer:

  • Using the cforest function with conditional = TRUE, the predictor importance ranked V4, V2, V1 and then duplicate1. The pattern changed relative to the traditional random forest model because the conditional measure corrects the correlation bias and swaps some ranks (a small ranking comparison is sketched after the plots below).

  • This is because, when conditional = TRUE, the function permutes each variable within subgroups defined by its correlated covariates (V1 and duplicate1 in this case), so it measures importance after controlling for that correlation.

  • With conditional = FALSE, the predictor importance ranked V4, V2, V1 and then duplicate1, duplicate2. This pattern also differs from the traditional random forest model.

# Fit a conditional inference forest. The ntree argument used here matches
# partykit::cforest (the exercise refers to the older party package, whose
# cforest takes controls = cforest_unbiased(ntree = 1000) instead).
library(partykit)

model3 <- cforest(y ~ ., 
                  data = simulated, ntree = 1000)

# Calculate variable importance with conditional = TRUE.

cforestImp_true <- varimp(model3, conditional = TRUE)

# Calculate variable importance with conditional = FALSE.
cforestImp_false <- varimp(model3, conditional = FALSE)


# Plot
barplot(sort(cforestImp_true, decreasing = TRUE), 
        main = "Variable Importance from cforest (partykit) with conditional = TRUE",
        horiz = TRUE, las = 1, col = "blue")

# Plot
barplot(sort(cforestImp_false, decreasing = TRUE), 
        main = "Variable Importance from cforest (partykit) with conditional = FALSE",
        horiz = TRUE, las = 1, col = "blue")
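
To see the rank changes directly, the two importance vectors can be tabulated side by side; this is a small sketch assuming the cforestImp_true and cforestImp_false vectors computed above.

# Side-by-side ranks (1 = most important) for the two importance measures
data.frame(
  predictor     = names(cforestImp_false),
  unconditional = rank(-cforestImp_false),
  conditional   = rank(-cforestImp_true[names(cforestImp_false)])
)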

d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Answer:

Using a gradient boosting model (gbm), the predictor importance by relative influence ranked V4, V2, V1 and then V5. The pattern is largely the same as in the traditional random forest model.

Using the Cubist model, the predictor importance ranked V2, V1, V4 and V3 highest, with duplicate1 and V5 next. The informative predictors still dominate, but the ordering differs from the previous models (for example, V3 is ranked much higher here).

# boosted trees
# distribution defines the loss function optimized during boosting;
# for a continuous response it should be "gaussian"
library(gbm)

model4 <- gbm(y ~ .,
              data = simulated,
              distribution = "gaussian",
              n.trees = 1000)

# save the importance to a variable and view it
# shows relative influence of each variable.
gbmImp <- summary(model4, plotit = FALSE)
print(gbmImp)
##                   var   rel.inf
## V4                 V4 24.727673
## V2                 V2 19.265064
## V1                 V1 16.716224
## V5                 V5 10.480562
## V3                 V3  8.650080
## duplicate1 duplicate1  8.234744
## V7                 V7  2.542862
## V6                 V6  2.423038
## V10               V10  1.909883
## V9                 V9  1.859935
## duplicate2 duplicate2  1.809675
## V8                 V8  1.380259
# Plot
barplot(gbmImp$rel.inf, 
        names.arg = rownames(gbmImp), 
        las = 2, 
        main = "Variable Importance from GBM model", 
        col = "blue", 
        horiz = TRUE)

library(Cubist)

# Cubist fit on V1-V10 plus duplicate1 (column 12); column 11 is the response y,
# and duplicate2 (column 13) is left out of this fit
model5 <- cubist(simulated[, c(1:10, 12)], simulated[, 11],
                 committees = 100)

# 1. Get variable importance
cubistImp <- varImp(model5)
print(cubistImp)
##            Overall
## V3            43.5
## V1            52.5
## V2            59.5
## duplicate1    27.5
## V4            46.0
## V8             4.0
## V5            27.0
## V6            10.0
## V10            1.0
## V7             0.0
## V9             0.0
# 2. Plot a simple barplot (sort the values and labels together so they stay aligned)
ord <- order(cubistImp$Overall, decreasing = TRUE)
barplot(cubistImp$Overall[ord], 
        names.arg = rownames(cubistImp)[ord], 
        las = 2, 
        main = "Variable Importance (Cubist)", 
        col = "lightgreen", 
        horiz = TRUE)

8.2 Use a simulation to show tree bias with different granularities.

# Simulating a dataset with one important predictor and some noise
library(rpart)
library(ggplot2)

set.seed(42)
n <- 1000  # Number of observations
x1 <- rnorm(n)
x2 <- rnorm(n)  # Noise feature
y <- 2 * x1 + rnorm(n) * 0.5  # Linear relationship + some noise

# Creating a data frame
simulated_data <- data.frame(x1 = x1, x2 = x2, y = y)

# Train decision trees with different depths (granularities)
tree_depths <- c(1, 2, 3, 5, 10)  # Different tree depths (granularities)
tree_predictions <- data.frame(x1 = x1, x2 = x2, y = y)

for (depth in tree_depths) {
  model <- rpart(y ~ x1 + x2, data = simulated_data, control = rpart.control(maxdepth = depth))
  tree_predictions[paste("depth", depth, sep = "_")] <- predict(model, simulated_data)
}

# Compute errors for each model to visualize bias (mean squared error)
errors <- sapply(tree_depths, function(depth) {
  pred <- tree_predictions[[paste("depth", depth, sep = "_")]]
  mean((pred - y)^2)  # Mean Squared Error (MSE)
})

# Plot the errors to show how tree bias changes with depth
error_df <- data.frame(Depth = tree_depths, Error = errors)

ggplot(error_df, aes(x = Depth, y = Error)) +
  geom_point() +
  geom_line() +
  labs(title = "Tree Bias and Granularity (Tree Depth)", 
       x = "Tree Depth (Granularity)", 
       y = "Mean Squared Error (MSE)") +
  theme_minimal()

# 5. Plot predictions with different tree depths to visualize bias at each granularity
ggplot(tree_predictions, aes(x = x1, y = y)) +
  geom_point(alpha = 0.3) +  # Original data points
  geom_line(aes(y = depth_1), color = "red", linetype = "dashed") +
  geom_line(aes(y = depth_2), color = "blue", linetype = "dashed") +
  geom_line(aes(y = depth_3), color = "green", linetype = "dashed") +
  geom_line(aes(y = depth_5), color = "purple", linetype = "dashed") +
  geom_line(aes(y = depth_10), color = "orange", linetype = "dashed") +
  labs(title = "Effect of Tree Granularity on Predictions",
       x = "Predictor x1", y = "Prediction y") +
  theme_minimal() 
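
The simulation above varies tree depth. A complementary reading of "granularity" in this exercise is the granularity of the predictors themselves: regression trees tend to be biased toward selecting predictors with many distinct values, even when they are uninformative. The sketch below is a minimal illustration of that selection bias under stated assumptions; x_fine and x_coarse are hypothetical names introduced here, and both predictors are pure noise with respect to y.

# Selection-bias sketch: a fine-grained noise predictor is chosen for the first
# split far more often than a coarse-grained one, even though neither predicts y.
set.seed(2025)
first_split <- replicate(200, {
  d <- data.frame(x_fine   = rnorm(100),                        # ~100 distinct values
                  x_coarse = sample(1:4, 100, replace = TRUE),  # only 4 distinct values
                  y        = rnorm(100))                        # pure noise response
  fit <- rpart(y ~ ., data = d,
               control = rpart.control(maxdepth = 1, cp = -1, minsplit = 2))
  as.character(fit$frame$var[1])
})
table(first_split)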

8.3 Bagging fraction and learning rate

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(Figure 8.24: variable importance plots; left: bagging fraction and learning rate both 0.1, right: both 0.9.)

a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The model on the right has a high bagging fraction (0.9) and a high learning rate (0.9). Each tree sees 90% of the data, so the trees are very similar to one another and the ensemble has low bias but a strong tendency to overfit. With a learning rate of 0.9, each tree makes big corrections and the model fits the training data quickly. Hence it latches onto a few dominant variables (the top predictors dominate) and the other variables are largely ignored.

The model on the left has a low bagging fraction (0.1) and a low learning rate (0.1). Each tree sees only 10% of the data, which adds randomness and produces more diverse trees, with less overfitting but slightly higher bias. With a learning rate of 0.1, each tree makes small corrections; the model builds up slowly but carefully, usually needs more trees, and tends to generalize better, allowing more predictors to contribute meaningfully over time. A small sketch contrasting the two settings follows.
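
To make the contrast concrete, the sketch below fits two gbm models at the two extremes. It is a minimal illustration using the `simulated` data from Exercise 8.1 rather than the solubility data of Fig. 8.24, so the exact numbers will differ; the point is only that the 0.9/0.9 model concentrates its relative influence on fewer predictors.

library(gbm)
set.seed(100)
gbm_low  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 500, shrinkage = 0.1, bag.fraction = 0.1,
                n.minobsinnode = 5)
gbm_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 500, shrinkage = 0.9, bag.fraction = 0.9,
                n.minobsinnode = 5)

# Relative influence: expect the high-shrinkage / high-bagging-fraction model
# to pile most of its importance onto the first few predictors.
head(summary(gbm_low,  plotit = FALSE))
head(summary(gbm_high, plotit = FALSE))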

b) Which model do you think would be more predictive of other samples?

The left-hand model (bagging fraction 0.1, learning rate 0.1) would likely be more predictive on new, unseen data. As explained above, the small learning rate and bagging fraction build the model up slowly and carefully, spread importance across more predictors, and make the fit less prone to overfitting the training data, so it should generalize better.

c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

If we increase interaction depth:

In the left-hand model, we would likely see more variables gaining importance, because complex relationships among multiple features would be discovered. Hence the slope of the importance profile would get a little flatter.

In the right-hand model, it might further concentrate importance among the top few predictors, because deeper trees can exploit even finer-grained overfitting, making the dominance of the top features stronger. Hence the slope would get even steeper; a rough sketch follows.
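
A hedged follow-up to the sketch in part (a): refitting one of those illustrative models with a larger interaction.depth and plotting how quickly the relative influence decays gives a rough read on the "slope" of the importance profile (again on the simulated data, not the solubility data of Fig. 8.24).

gbm_deep <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 500, shrinkage = 0.9, bag.fraction = 0.9,
                interaction.depth = 5, n.minobsinnode = 5)
ri <- summary(gbm_deep, plotit = FALSE)$rel.inf

# Steeper decay across ranks = importance concentrated on fewer predictors
plot(ri, type = "b",
     xlab = "Predictor rank (most to least important)",
     ylab = "Relative influence")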

8.7 Refer to Exercises 6.3 and 7.5 and train several tree-based models:

Load the data for chemical manufacturing process

library(AppliedPredictiveModeling)
library(caret)

data(ChemicalManufacturingProcess)

cmp <- data.frame(ChemicalManufacturingProcess)

Data imputation to fill in the missing values.

# summary of missing values
paste0(" Missing values: ",sum(is.na(cmp)))
## [1] " Missing values: 106"
#data imputation
cmp_impute <- preProcess(cmp, method = "knnImpute")
cmp_impute  <- predict(cmp_impute,cmp)


# summary post data imputation
paste0(" Missing values post data kNN Impute:",sum(is.na(cmp_impute)))
## [1] " Missing values post data kNN Impute:0"

Split the data into a training and a test set, pre-process the data.

  • For evaluation I used R-squared as the primary performance metric.
  • Filtered out near-zero-variance predictors.
  • A skewness check showed that some variables are skewed (most notably BiologicalMaterial04, skewness ≈ 1.7), so a Box-Cox transformation was applied.
# pre-process: remove near-zero-variance predictors
cmp_transform <- cmp_impute[, -nearZeroVar(cmp_impute)]

# Check for skewness
library(e1071)
skewVals <- apply(cmp_transform, 2, skewness)
head(skewVals)
##                Yield BiologicalMaterial01 BiologicalMaterial02 
##           0.31095956           0.27331650           0.24412685 
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05 
##           0.02851075           1.73231530           0.30400532
#data transformation 
cmp_boxcox <- preProcess(cmp_transform, method = c("BoxCox","knnImpute"))
cmp_transform  <- predict(cmp_boxcox,cmp_transform)

set.seed(100)

# sample data for training
index <- createDataPartition(cmp_transform$Yield, p = .8, list = FALSE)

# train and test data
train_cmp <- cmp_transform[index, ]
test_cmp <- cmp_transform[-index, ]

# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)

#pre process
pre_process <- c("center","scale")

Single Regression Tree

RMSE Rsquared MAE 0.6943058 0.4138988 0.5543733

cartModel <- train(train_cmp[ ,-1], train_cmp$Yield,
                  method = "rpart",
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"),
                  preProcess=pre_process)

cartPred <- predict(cartModel, test_cmp[ ,-1])

postResample(cartPred, test_cmp$Yield)
##      RMSE  Rsquared       MAE 
## 0.6943058 0.4138988 0.5543733

Bagged Tree

RMSE Rsquared MAE 0.5377956 0.6597275 0.4358674

library(ipred)

baggedTree <- ipredbagg(train_cmp$Yield, train_cmp[ ,-1])

baggedPred <- predict(baggedTree, test_cmp[ ,-1])

postResample(baggedPred, test_cmp$Yield)
##      RMSE  Rsquared       MAE 
## 0.5377956 0.6597275 0.4358674

Random Forest Model

RMSE Rsquared MAE 0.5329399 0.6711639 0.4516756

# default mtry for regression is the number of predictors divided by 3, so about 18 here
# mtry is the number of predictors sampled as split candidates at each split
rfGrid <- expand.grid(mtry = seq(2, 38, by = 3))


rfModel <- train(x = train_cmp[ ,-1], y = train_cmp$Yield, 
                 method = "rf", 
                 tuneGrid = rfGrid, 
                 importance = TRUE, 
                 trControl = ctrl,
                 ntree = 1000,
                 preProcess=pre_process)

# rfModel <- randomForest(train_cmp[ ,-1], train_cmp$Yield, 
#                        importance = TRUE,
#                        tuneGrid = rfGrid)

#validation plot
plot(rfModel)

rfModel
## Random Forest 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 129, 131, 128, 132, 129, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##    2    0.6648184  0.6460245  0.5304425
##    5    0.6136384  0.6847991  0.4861535
##    8    0.6004020  0.6906913  0.4711585
##   11    0.5966490  0.6869064  0.4659671
##   14    0.5980594  0.6807980  0.4638020
##   17    0.5951567  0.6830679  0.4624478
##   20    0.5956728  0.6802582  0.4612384
##   23    0.5939083  0.6803502  0.4569980
##   26    0.5906581  0.6819413  0.4532038
##   29    0.5974513  0.6716757  0.4578175
##   32    0.5952718  0.6743171  0.4548119
##   35    0.5961569  0.6716593  0.4556231
##   38    0.5993002  0.6643732  0.4577768
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 26.
# predict the yield for the test data
rfPred <- predict(rfModel, test_cmp[ ,-1])

# compare the predicted values to the test values
postResample(rfPred, test_cmp$Yield)
##      RMSE  Rsquared       MAE 
## 0.5329399 0.6711639 0.4516756

Gradient Boosting Model (GBM)

The final values used for the model were n.trees = 1000, interaction.depth = 6, shrinkage = 0.01 and n.minobsinnode = 5. RMSE Rsquared MAE 0.4125580 0.8054418 0.3384450

# boosted trees

gbmGrid <- expand.grid(interaction.depth=seq(1,7,by=1),
                       n.trees=c(100,200,1000),
                       shrinkage=c(0.01,0.1,0.2),
                       n.minobsinnode=5)


# distribution defines the loss function optimized during boosting;
# for a continuous response it should be "gaussian"

gbmModel <- train(x = train_cmp[ ,-1], y = train_cmp$Yield,
                 method = "gbm",
                 verbose = FALSE, 
                 tuneGrid = gbmGrid, 
                 trControl = ctrl, 
                 preProcess=pre_process)

# gbmModel <- gbm.fit(train_cmp[ ,-1], train_cmp$Yield,
#                    distribution = "gaussian")


gbmModel
## Stochastic Gradient Boosting 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 128, 130, 128, 130, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE       Rsquared   MAE      
##   0.01       1                   100     0.7949873  0.5614275  0.6439822
##   0.01       1                   200     0.7108323  0.6000413  0.5685745
##   0.01       1                  1000     0.6447759  0.6453416  0.4942067
##   0.01       2                   100     0.7575022  0.5867715  0.6151400
##   0.01       2                   200     0.6776275  0.6242034  0.5360107
##   0.01       2                  1000     0.6310216  0.6540692  0.4851540
##   0.01       3                   100     0.7356672  0.6201522  0.5940267
##   0.01       3                   200     0.6610748  0.6403695  0.5185862
##   0.01       3                  1000     0.6180795  0.6675513  0.4740618
##   0.01       4                   100     0.7322117  0.6042125  0.5892057
##   0.01       4                   200     0.6563940  0.6359813  0.5118489
##   0.01       4                  1000     0.6071720  0.6730395  0.4628975
##   0.01       5                   100     0.7270426  0.6162607  0.5860861
##   0.01       5                   200     0.6540353  0.6370434  0.5074453
##   0.01       5                  1000     0.6092770  0.6649472  0.4603749
##   0.01       6                   100     0.7140387  0.6320304  0.5721952
##   0.01       6                   200     0.6454468  0.6496612  0.5005246
##   0.01       6                  1000     0.5922546  0.6873008  0.4483384
##   0.01       7                   100     0.7146530  0.6202685  0.5755882
##   0.01       7                   200     0.6395350  0.6475126  0.4972033
##   0.01       7                  1000     0.5960109  0.6809773  0.4557017
##   0.10       1                   100     0.6629602  0.6326668  0.5189936
##   0.10       1                   200     0.6763677  0.6213301  0.5252572
##   0.10       1                  1000     0.7197298  0.5796031  0.5618885
##   0.10       2                   100     0.6499817  0.6303965  0.5132366
##   0.10       2                   200     0.6497993  0.6319189  0.5045489
##   0.10       2                  1000     0.6549580  0.6277247  0.5010916
##   0.10       3                   100     0.6325445  0.6459484  0.4875629
##   0.10       3                   200     0.6367840  0.6472474  0.4882845
##   0.10       3                  1000     0.6361177  0.6444604  0.4857790
##   0.10       4                   100     0.6262783  0.6421746  0.4872065
##   0.10       4                   200     0.6230969  0.6445489  0.4829454
##   0.10       4                  1000     0.6233832  0.6434539  0.4843281
##   0.10       5                   100     0.6100910  0.6714449  0.4747903
##   0.10       5                   200     0.6056374  0.6693468  0.4742842
##   0.10       5                  1000     0.6075889  0.6660154  0.4760716
##   0.10       6                   100     0.6277625  0.6474907  0.4974891
##   0.10       6                   200     0.6285653  0.6477977  0.4997165
##   0.10       6                  1000     0.6281427  0.6482223  0.5009528
##   0.10       7                   100     0.5971046  0.6712353  0.4492362
##   0.10       7                   200     0.5967403  0.6684985  0.4497500
##   0.10       7                  1000     0.5972372  0.6660236  0.4511523
##   0.20       1                   100     0.6584252  0.6242897  0.5282094
##   0.20       1                   200     0.6675373  0.6136230  0.5275327
##   0.20       1                  1000     0.6813559  0.6052637  0.5336996
##   0.20       2                   100     0.6773656  0.6021799  0.5203261
##   0.20       2                   200     0.6797596  0.5942099  0.5148179
##   0.20       2                  1000     0.6792201  0.5930233  0.5142170
##   0.20       3                   100     0.6810359  0.5975683  0.5259855
##   0.20       3                   200     0.6779337  0.6019289  0.5270753
##   0.20       3                  1000     0.6773681  0.6020045  0.5274302
##   0.20       4                   100     0.6413194  0.6298069  0.5082579
##   0.20       4                   200     0.6425801  0.6289266  0.5073091
##   0.20       4                  1000     0.6423629  0.6291289  0.5069582
##   0.20       5                   100     0.6509071  0.6377616  0.4984452
##   0.20       5                   200     0.6440456  0.6450716  0.4948838
##   0.20       5                  1000     0.6429377  0.6461838  0.4943461
##   0.20       6                   100     0.6881063  0.5541484  0.5282064
##   0.20       6                   200     0.6868702  0.5557430  0.5252629
##   0.20       6                  1000     0.6871605  0.5554477  0.5254399
##   0.20       7                   100     0.6590602  0.6068400  0.5135123
##   0.20       7                   200     0.6559479  0.6101664  0.5095901
##   0.20       7                  1000     0.6557410  0.6104720  0.5094169
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
##  6, shrinkage = 0.01 and n.minobsinnode = 5.
# predict the yield for the test data
gbmPred <- predict(gbmModel, test_cmp[ ,-1])

# compare the predicted values to the test values
postResample(gbmPred, test_cmp$Yield)
##      RMSE  Rsquared       MAE 
## 0.4125580 0.8054418 0.3384450

Cubist Model

Best tune: committees = 100 and neighbors = 1. RMSE Rsquared MAE 0.3998226 0.8162177 0.2786997

# committees are the number of rule-based model iterations (roughly analogous to boosting iterations)
# neighbors can take an integer value between 0 and 9
cubistGrid <- expand.grid(committees = c(100, 200, 1000), 
                          neighbors = c(0, 1, 3, 5, 7, 9))

cubistModel <- train(x = train_cmp[ ,-1], y = train_cmp$Yield,
                 method = "cubist",
                 verbose = FALSE, 
                 tuneGrid = cubistGrid, 
                 trControl = ctrl, 
                 preProcess=pre_process)

# cubistModel <- cubist(train_cmp[ ,-1], train_cmp$Yield)


cubistModel$bestTune
##   committees neighbors
## 2        100         1
# predict the yield for the test data
cubistPred <- predict(cubistModel, test_cmp[ ,-1])

# compare the predicted values to the test values
postResample(cubistPred, test_cmp$Yield)
##      RMSE  Rsquared       MAE 
## 0.3998226 0.8162177 0.2786997

a) Which tree-based regression model gives the optimal resampling and test set performance?

Answer:

  • Based on R-squared as the performance metric, the Cubist model performed best on both resampling and test-set performance, with the highest R-squared and the lowest RMSE (a resampling comparison is sketched after the table below).
# Define the comparison data (test-set metrics taken from the postResample outputs above)
library(knitr)
library(dplyr)

results_table <- data.frame(
  Model = c("CART", "Bagged", "Random Forest", "GBM", "Cubist"),
  RMSE = c(0.6943058, 0.5377956, 0.5329399, 0.4125580, 0.3998226),
  Rsquared = c(0.4138988, 0.6597275, 0.6711639, 0.8054418, 0.8162177)
)

# Display the table using kable, sorted with the best R-squared first
kable(arrange(results_table, desc(Rsquared)), format = "markdown", align = "l")
Model          RMSE       Rsquared
Cubist         0.3998226  0.8162177
GBM            0.4125580  0.8054418
Random Forest  0.5329399  0.6711639
Bagged         0.5377956  0.6597275
CART           0.6943058  0.4138988
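
For the resampling half of part (a), the cross-validated performance of each caret-tuned model at its chosen tuning parameters can be pulled out directly. This is a minimal sketch assuming the model objects fitted above; the bagged tree is omitted because it was fit with ipredbagg rather than train.

# Resampled (10-fold CV) performance at each model's best tuning parameters
do.call(rbind, lapply(list(CART = cartModel, RF = rfModel,
                           GBM = gbmModel, Cubist = cubistModel),
                      getTrainPerf))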

b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Answer:

  • In the Cubist model, among the top 10 predictors, process variables (6) slightly outnumber biological variables (4); a quick count is sketched after the importance listing below. ManufacturingProcess32 is the most important, followed by ManufacturingProcess13 and BiologicalMaterial03.

  • In the optimal nonlinear model (neural network), the situation is similar: among the top 10, process variables (6) outnumber biological variables (4). Hence neither the biological nor the process variables completely dominate the list, but the process variables outnumber the biological ones. For example, ManufacturingProcess32 had the highest positive correlation with Yield, followed by BiologicalMaterial06 and BiologicalMaterial03, while ManufacturingProcess13, 17 and 36 were negatively correlated with Yield.

  • In the optimal linear model (PLS), ManufacturingProcess32 again has the highest positive correlation with Yield, followed by ManufacturingProcess09 and BiologicalMaterial02, while ManufacturingProcess13, ManufacturingProcess36 and ManufacturingProcess17, which were in the top 5, are negatively correlated with Yield.

  • Hence the models agree on several key predictors, but the order of importance differs across models.

top_predictors <-data.frame(varImp(cubistModel,scale = TRUE)$importance)

top_predictors  |>
  arrange(desc(Overall))
##                           Overall
## ManufacturingProcess32 100.000000
## ManufacturingProcess13  49.438202
## BiologicalMaterial03    48.314607
## ManufacturingProcess09  43.820225
## BiologicalMaterial02    31.460674
## ManufacturingProcess04  30.337079
## BiologicalMaterial06    25.842697
## ManufacturingProcess17  22.471910
## ManufacturingProcess33  20.224719
## BiologicalMaterial04    16.853933
## ManufacturingProcess14  16.853933
## ManufacturingProcess29  16.853933
## ManufacturingProcess10  15.730337
## BiologicalMaterial08    13.483146
## BiologicalMaterial10    12.359551
## BiologicalMaterial09    12.359551
## BiologicalMaterial12    11.235955
## ManufacturingProcess15  11.235955
## ManufacturingProcess39  10.112360
## ManufacturingProcess01  10.112360
## ManufacturingProcess25  10.112360
## BiologicalMaterial11     8.988764
## ManufacturingProcess27   8.988764
## BiologicalMaterial05     7.865169
## ManufacturingProcess28   7.865169
## ManufacturingProcess11   7.865169
## ManufacturingProcess20   5.617978
## ManufacturingProcess26   5.617978
## ManufacturingProcess31   5.617978
## ManufacturingProcess19   4.494382
## ManufacturingProcess18   4.494382
## ManufacturingProcess42   4.494382
## BiologicalMaterial01     3.370787
## ManufacturingProcess22   3.370787
## ManufacturingProcess02   3.370787
## ManufacturingProcess12   3.370787
## ManufacturingProcess21   2.247191
## ManufacturingProcess45   2.247191
## ManufacturingProcess44   2.247191
## ManufacturingProcess23   2.247191
## ManufacturingProcess05   2.247191
## ManufacturingProcess30   2.247191
## ManufacturingProcess06   2.247191
## ManufacturingProcess34   1.123596
## ManufacturingProcess03   1.123596
## ManufacturingProcess07   1.123596
## ManufacturingProcess08   0.000000
## ManufacturingProcess16   0.000000
## ManufacturingProcess24   0.000000
## ManufacturingProcess35   0.000000
## ManufacturingProcess36   0.000000
## ManufacturingProcess37   0.000000
## ManufacturingProcess38   0.000000
## ManufacturingProcess40   0.000000
## ManufacturingProcess41   0.000000
## ManufacturingProcess43   0.000000
top_10_predictors <- top_predictors  |>
  arrange(desc(Overall)) |>
    slice_head(n = 10)

#correlation
cmp_transform |>
  select(c("Yield", row.names(top_10_predictors))) |>
  cor() |>
  corrplot()
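
As a quick check of the process-versus-biological split described above, the top-10 names can simply be tallied; this is a small sketch assuming the top_10_predictors data frame computed above.

# Count process vs. biological variables among the Cubist top 10
table(ifelse(grepl("^ManufacturingProcess", rownames(top_10_predictors)),
             "Process", "Biological"))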

c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Answer:

  • The optimal single tree used the same ManufacturingProcess32 variable for the first split, indicating it too found that variable to be important. For the second split it used BiologicalMaterial12, which was not in the top 10 of the Cubist model above but appears lower on that list.

  • This view of the data is easy to interpret and provides additional knowledge about the predictors and their relationship with Yield (a per-node summary of Yield is sketched after the tree plot below).

  • The first split, ManufacturingProcess32 < 0.199, is an early critical decision point: low values split off a different yield group.

  • The second split, BiologicalMaterial12 < -0.497, indicates that when ManufacturingProcess32 is low, the value of BiologicalMaterial12 strongly affects Yield.

  • ManufacturingProcess06 and ManufacturingProcess17 appear as important thresholds that control whether yield is high or low.

  • High yield (Yield ≈ 1.8) is achieved when ManufacturingProcess32 >= 0.199 and ManufacturingProcess17 < -0.699.

rpartGrid <- expand.grid(maxdepth= seq(1,10,by=1))
rp_model <- train(Yield ~ ., data = train_cmp, method = "rpart2", tuneGrid = rpartGrid,
                       trControl = ctrl, preProcess=pre_process)

#rpartTree <- rpart(Yield ~ ., data = train_cmp)

plot(as.party(rp_model$finalModel), ip_args = list(abbreviate = 4), gp = gpar(fontsize = 6))
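
To see the yield distribution behind each terminal node numerically, the node assignments stored in the rpart object can be summarised; this is a minimal sketch assuming rp_model from above (the node ids here are rpart frame row indices, not the labels shown in the plot, and Yield is on the standardised scale used for modeling).

# Summarise (standardised) Yield within each terminal node of the final single tree
final_tree <- rp_model$finalModel
tapply(train_cmp$Yield, final_tree$where, summary)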

#final_cart <- cartModel$finalModel

# Plot the tree
#rpart.plot(final_cart,
#           type = 3,         # Draw split labels under nodes
#           extra = 101,      # Show number of obs and mean response at each node
#           fallen.leaves = TRUE, # Put leaves at the bottom
#           box.palette = "RdBu", # Color nodes by prediction
#           tweak = 1.2,       # Size of the plot
#           main = "Optimal CART Tree with Yield Distribution")