Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the RPubs link along with the .Rmd file.
8.1. Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
importance_scores <- importance(model1)
importance_scores
## %IncMSE IncNodePurity
## V1 57.1035713 1063.5046
## V2 47.0088894 941.2445
## V3 10.8342807 299.2105
## V4 51.6960809 1077.4011
## V5 22.0568333 478.4097
## V6 2.9419111 182.3768
## V7 -0.1001392 204.3141
## V8 -3.2180039 147.7224
## V9 -1.8067946 154.3783
## V10 -1.3228315 184.4402
# Check predictors V6-V10
importance_scores[6:10, ]
## %IncMSE IncNodePurity
## V6 2.9419111 182.3768
## V7 -0.1001392 204.3141
## V8 -3.2180039 147.7224
## V9 -1.8067946 154.3783
## V10 -1.3228315 184.4402
Predictors V6–V10 have very low or negative %IncMSE values and low IncNodePurity scores. The random forest therefore made little meaningful use of them, which is consistent with these predictors being uninformative by construction.
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
library(gbm)
# Fit a gradient boosting model
set.seed(200)
model2 <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 1000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)
# Summary of variable importance
gbmImp <- summary(model2)
gbmImp
## var rel.inf
## V4 V4 28.2876438
## V2 V2 22.3019746
## V1 V1 16.5804351
## V5 V5 10.9690893
## duplicate1 duplicate1 10.6121171
## V3 V3 7.9011476
## V6 V6 0.8946482
## V7 V7 0.8437566
## V9 V9 0.7311044
## V8 V8 0.4810117
## V10 V10 0.3970716
In the importance table and plot above, V1 still carries high relative influence, but the highly correlated duplicate1 dilutes it: because the two predictors share essentially the same underlying information, the importance that would otherwise go to V1 alone is split between V1 and its correlated counterpart.
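A direct way to check this dilution on the random forest itself (rather than the boosted model) is to refit the model from the first part with duplicate1 included and compare V1's importance before and after. This is a sketch; model2_rf and the Overall column lookup are assumptions about caret's varImp output rather than code from the original assignment.
# Refit the random forest now that duplicate1 is part of `simulated`
set.seed(200)
model2_rf <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2_rf, scale = FALSE)
# Compare V1's importance with and without the correlated predictor
c(V1_before = rfImp1["V1", "Overall"], V1_after = rfImp2["V1", "Overall"])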
# Tune random forest
set.seed(200)
train_control <- trainControl(method = "cv", number = 10)
rf_tuned <- train(y ~ ., data = simulated, method = "rf", trControl = train_control, tuneGrid = expand.grid(mtry = 1:5), ntree = 1000)
# Best model
rf_tuned
## Random Forest
##
## 200 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 3.207039 0.7620831 2.645926
## 2 2.789937 0.7935222 2.304074
## 3 2.620201 0.7984088 2.164653
## 4 2.526978 0.8021796 2.087469
## 5 2.491741 0.7967627 2.053201
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
# Plot results
plot(rf_tuned)
The plot shows RMSE decreasing as the number of randomly selected predictors (mtry) increases, leveling off near mtry = 5. Allowing each split to choose from a larger subset of predictors improves performance, most likely because the informative variables are available at more of the splits. A conditional-importance approach would give a more nuanced view of each predictor's unique contribution when correlated predictors such as duplicate1 are present; see the sketch below.
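One way to get that conditional view is the cforest function from the party package, whose permutation importance can condition on correlated predictors. A minimal sketch, assuming party is installed and that a modest number of trees is enough for illustration (conditional importance is slow):
library(party)
# Conditional inference forest on the same simulated data
set.seed(200)
cf_model <- cforest(y ~ ., data = simulated, controls = cforest_unbiased(ntree = 500))
# conditional = TRUE adjusts each predictor's importance for its correlation
# with the others, so V1 and duplicate1 are not double-counted
varimp(cf_model, conditional = TRUE)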
8.2. Use a simulation to show tree bias with different granularities
library(rpart)
library(partykit)
set.seed(23414)
a <- sample(1:10 / 10, 500, replace = TRUE)
b <- sample(1:100 / 100, 500, replace = TRUE)
c <- sample(1:1000 / 1000, 500, replace = TRUE)
d <- sample(1:10000 / 10000, 500, replace = TRUE)
e <- sample(1:100000 / 100000, 500, replace = TRUE)
y <- a + b + c + d + e
simData <- data.frame(a,b,c,d,e,y)
rpartTree <- rpart(y ~ ., data = simData)
plot(as.party(rpartTree), gp = gpar(fontsize = 7))
Although a through e contribute equally to y, the predictors with finer granularity (more distinct values, such as d and e) offer many more candidate split points, so a CART-style tree is biased toward selecting them; they tend to dominate the splits in the plotted tree while the coarse-grained a appears less often. This selection bias is what the simulation is meant to show.
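To quantify the bias rather than rely on a single tree plot, one could repeat the simulation and tally which predictor rpart picks for the root split; rootSplit is a hypothetical helper introduced here for illustration, not part of the original code.
# Repeat the simulation and record the root-split variable each time;
# a granularity bias shows up as the fine-grained predictors winning most often
rootSplit <- function() {
  a <- sample(1:10 / 10, 500, replace = TRUE)
  b <- sample(1:100 / 100, 500, replace = TRUE)
  c <- sample(1:1000 / 1000, 500, replace = TRUE)
  d <- sample(1:10000 / 10000, 500, replace = TRUE)
  e <- sample(1:100000 / 100000, 500, replace = TRUE)
  y <- a + b + c + d + e
  fit <- rpart(y ~ ., data = data.frame(a, b, c, d, e, y))
  as.character(fit$frame$var[1])
}
set.seed(101)
table(replicate(100, rootSplit()))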
8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
set.seed(20320)
# With only 200 simulated observations, bag.fraction is kept at 0.5 and
# n.minobsinnode at 5 so each tree still has enough data; only the
# learning rate (shrinkage) is pushed between the two extremes here
model_low <- gbm(
y ~ ., data = simulated, distribution = "gaussian",
n.trees = 1000, interaction.depth = 3, shrinkage = 0.1,
bag.fraction = 0.5, n.minobsinnode = 5, cv.folds = 5
)
model_high <- gbm(
y ~ ., data = simulated, distribution = "gaussian",
n.trees = 1000, interaction.depth = 3, shrinkage = 0.9,
bag.fraction = 0.5, n.minobsinnode = 5, cv.folds = 5
)
# Variable importance for each model
importance_low <- summary(model_low)
importance_high <- summary(model_high)
importance_low
## var rel.inf
## V4 V4 26.467740
## V2 V2 21.565109
## duplicate1 duplicate1 13.257782
## V1 V1 13.253856
## V5 V5 10.418938
## V3 V3 8.092216
## V7 V7 2.012433
## V6 V6 1.329586
## V8 V8 1.279923
## V9 V9 1.191215
## V10 V10 1.131201
importance_high
## var rel.inf
## duplicate1 duplicate1 19.368710
## V4 V4 18.805022
## V2 V2 14.139969
## V3 V3 10.264213
## V5 V5 8.225311
## V7 V7 7.691462
## V9 V9 7.434942
## V8 V8 4.484968
## V10 V10 3.840566
## V1 V1 3.270189
## V6 V6 2.474647
The model on the right, with both the learning rate and the bagging fraction set high, concentrates its importance on just a few predictors. A large learning rate lets each tree take big steps toward the strongest signals, so the most influential predictors are exploited quickly and repeatedly, and a large bagging fraction means every tree sees nearly the same data, which reduces the diversity of predictors that get a chance to be selected. With both parameters set to 0.1, learning proceeds in small increments on varied subsamples, so more predictors contribute and importance is spread across them.
The model on the left (low learning rate and low bagging fraction) would likely be more predictive of other samples. Its slower, more diverse fitting process spreads importance across more predictors and is less prone to overfitting, whereas the aggressive optimization on the right is more likely to overfit the training data and generalize poorly.
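To reproduce the contrast in Figure 8.24 more faithfully, the bagging fraction could be pushed to the same extremes as the learning rate. This is a sketch under the assumption that the 200-row simulated data can still support bag.fraction = 0.1 with a small minimum node size:
# Both tuning parameters at 0.1 (left-hand plot) vs. 0.9 (right-hand plot)
set.seed(20320)
model_0.1 <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 3,
                 shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5)
model_0.9 <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 3,
                 shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5)
summary(model_0.1)  # expected: importance spread across more predictors
summary(model_0.9)  # expected: importance concentrated on the top few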
8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
# Load required libraries
library(AppliedPredictiveModeling)
library(caret)
library(randomForest)
library(ipred)
library(gbm)
library(rpart)
library(rpart.plot)
# Load the data
data("ChemicalManufacturingProcess")
# Verify dataset is loaded correctly
if (nrow(ChemicalManufacturingProcess) == 0) {
stop("Dataset is empty. Check the data source.")
}
# Handle missing values using k-NN imputation
imputer <- preProcess(ChemicalManufacturingProcess, method = "knnImpute", k = 5)
imputed_data <- predict(imputer, ChemicalManufacturingProcess)
# Filter out near-zero variance variables
filtered_data <- imputed_data[, -nearZeroVar(imputed_data)]
# Ensure filtered data is not empty
if (nrow(filtered_data) == 0) {
stop("Filtered data is empty. Check preprocessing steps.")
}
# Split data into training and testing sets
set.seed(613) # For reproducibility
train_indices <- sample(nrow(filtered_data), size = floor(0.8 * nrow(filtered_data)), replace = FALSE)
training_data <- filtered_data[train_indices, ]
testing_data <- filtered_data[-train_indices, ]
# Ensure splits are valid
if (nrow(training_data) == 0 || nrow(testing_data) == 0) {
stop("Training or testing data is empty. Check data splitting logic.")
}
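If the goal is to match the splitting scheme from Exercises 6.3 and 7.5 exactly, caret's createDataPartition (stratified on Yield) is the usual choice; this is a sketch with hypothetical object names, assuming an 80/20 split was used there, and it does not overwrite the split above.
# Alternative stratified 80/20 split with caret
set.seed(613)
in_train_alt <- createDataPartition(filtered_data$Yield, p = 0.8, list = FALSE)
training_alt <- filtered_data[in_train_alt, ]
testing_alt <- filtered_data[-in_train_alt, ]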
# Train models
# Random Forest Model
random_forest_model <- randomForest(Yield ~ .,
data = training_data,
importance = TRUE,
ntree = 1000)
rf_predictions <- predict(random_forest_model, testing_data)
# Bagged Tree Model
bagged_tree_model <- bagging(Yield ~ .,
data = training_data)
bagged_tree_predictions <- predict(bagged_tree_model, testing_data)
# Boosted Tree Model
boosted_tree_model <- gbm(Yield ~ .,
data = training_data,
distribution = "gaussian",
n.trees = 1000)
boosted_tree_predictions <- predict(boosted_tree_model, testing_data)
## Using 1000 trees...
# Cubist Model
cubist_model <- train(Yield ~ .,
data = training_data,
method = "cubist")
cubist_predictions <- predict(cubist_model, testing_data)
# Compare model performance
performance_comparison <- rbind(
RandomForest = postResample(rf_predictions, testing_data$Yield),
BaggedTree = postResample(bagged_tree_predictions, testing_data$Yield),
BoostedTree = postResample(boosted_tree_predictions, testing_data$Yield),
Cubist = postResample(cubist_predictions, testing_data$Yield)
)
print(performance_comparison)
## RMSE Rsquared MAE
## RandomForest 0.6240582 0.5225684 0.5052709
## BaggedTree 0.6770303 0.4435007 0.5418391
## BoostedTree 0.6618618 0.5656576 0.5076747
## Cubist 0.5847081 0.6158759 0.4676564
# Identify the optimal model based on RMSE and R-squared
cat("The model with the best performance is:",
rownames(performance_comparison)[which.min(performance_comparison[, "RMSE"])])
## The model with the best performance is: Cubist
# Variable importance for the Cubist model
plot(varImp(cubist_model), 10)
# Plot a single tree
single_tree_model <- rpart(Yield ~ .,
data = training_data)
rpart.plot(single_tree_model)
The Cubist model gives the best performance, with the highest R-squared and the lowest RMSE and MAE of the Random Forest, bagged tree, and boosted tree models, so it is selected as the optimal model.
Manufacturing process variables dominate the list of top predictors. These top predictors largely overlap with those identified by the nonlinear SVM and lasso models in the earlier exercises, although the rankings and a few of the shared variables differ.
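A quick tabulation backs up the claim that process variables dominate; this is a sketch, assuming a top-10 cut-off and that the process predictors all share the ManufacturingProcess name prefix.
# Count process vs. biological predictors among the Cubist top 10
cubist_imp <- varImp(cubist_model)$importance
top10 <- rownames(cubist_imp)[order(cubist_imp$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl("^ManufacturingProcess", top10), "Process", "Biological"))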