Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the RPubs link along with the .Rmd file.
8.1. Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
importance_scores <- importance(model1)
importance_scores
## %IncMSE IncNodePurity
## V1 57.1035713 1063.5046
## V2 47.0088894 941.2445
## V3 10.8342807 299.2105
## V4 51.6960809 1077.4011
## V5 22.0568333 478.4097
## V6 2.9419111 182.3768
## V7 -0.1001392 204.3141
## V8 -3.2180039 147.7224
## V9 -1.8067946 154.3783
## V10 -1.3228315 184.4402
# Check predictors V6-V10
importance_scores[6:10, ]
## %IncMSE IncNodePurity
## V6 2.9419111 182.3768
## V7 -0.1001392 204.3141
## V8 -3.2180039 147.7224
## V9 -1.8067946 154.3783
## V10 -1.3228315 184.4402
Predictors V6–V10 have very low or negative %IncMSE values and low IncNodePurity scores. The random forest therefore made little meaningful use of them, which is consistent with these predictors being uninformative by construction.
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
library(gbm)
# Fit a gradient boosting model
set.seed(200)
model2 <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 1000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)
# Summary of variable importance
gbmImp <- summary(model2)
gbmImp
## var rel.inf
## V4 V4 28.2876438
## V2 V2 22.3019746
## V1 V1 16.5804351
## V5 V5 10.9690893
## duplicate1 duplicate1 10.6121171
## V3 V3 7.9011476
## V6 V6 0.8946482
## V7 V7 0.8437566
## V9 V9 0.7311044
## V8 V8 0.4810117
## V10 V10 0.3970716
In the importance table and plot above, V1 still carries high relative influence, but the highly correlated duplicate1 dilutes it: because the two predictors share essentially the same underlying information, the importance that would otherwise go to V1 alone is split between V1 and its correlated counterpart.
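A direct way to check this dilution on the random forest itself (rather than the boosted model) is to refit the model from the first part with duplicate1 included and compare V1's importance before and after. This is a sketch; model2_rf and the Overall column lookup are assumptions about caret's varImp output rather than code from the original assignment.
# Refit the random forest now that duplicate1 is part of `simulated`
set.seed(200)
model2_rf <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2_rf, scale = FALSE)
# Compare V1's importance with and without the correlated predictor
c(V1_before = rfImp1["V1", "Overall"], V1_after = rfImp2["V1", "Overall"])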
# Tune random forest
set.seed(200)
train_control <- trainControl(method = "cv", number = 10)
rf_tuned <- train(y ~ ., data = simulated, method = "rf", trControl = train_control, tuneGrid = expand.grid(mtry = 1:5), ntree = 1000)
# Best model
rf_tuned
## Random Forest
##
## 200 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 3.207039 0.7620831 2.645926
## 2 2.789937 0.7935222 2.304074
## 3 2.620201 0.7984088 2.164653
## 4 2.526978 0.8021796 2.087469
## 5 2.491741 0.7967627 2.053201
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
# Plot results
plot(rf_tuned)
The plot shows RMSE decreasing as the number of randomly selected predictors (mtry) increases, leveling off near mtry = 5. Allowing each split to choose from a larger subset of predictors improves performance, most likely because the informative variables are available at more of the splits. A conditional-importance approach would give a more nuanced view of each predictor's unique contribution when correlated predictors such as duplicate1 are present; see the sketch below.
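One way to get that conditional view is the cforest function from the party package, whose permutation importance can condition on correlated predictors. A minimal sketch, assuming party is installed and that a modest number of trees is enough for illustration (conditional importance is slow):
library(party)
# Conditional inference forest on the same simulated data
set.seed(200)
cf_model <- cforest(y ~ ., data = simulated, controls = cforest_unbiased(ntree = 500))
# conditional = TRUE adjusts each predictor's importance for its correlation
# with the others, so V1 and duplicate1 are not double-counted
varimp(cf_model, conditional = TRUE)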
8.2. Use a simulation to show tree bias with different granularities
library(rpart)
library(partykit)
set.seed(23414)
a <- sample(1:10 / 10, 500, replace = TRUE)
b <- sample(1:100 / 100, 500, replace = TRUE)
c <- sample(1:1000 / 1000, 500, replace = TRUE)
d <- sample(1:10000 / 10000, 500, replace = TRUE)
e <- sample(1:100000 / 100000, 500, replace = TRUE)
y <- a + b + c + d + e
simData <- data.frame(a,b,c,d,e,y)
rpartTree <- rpart(y ~ ., data = simData)
plot(as.party(rpartTree), gp = gpar(fontsize = 7))
Although a through e contribute equally to y, the predictors with finer granularity (more distinct values, such as d and e) offer many more candidate split points, so a CART-style tree is biased toward selecting them; they tend to dominate the splits in the plotted tree while the coarse-grained a appears less often. This selection bias is what the simulation is meant to show.
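To quantify the bias rather than rely on a single tree plot, one could repeat the simulation and tally which predictor rpart picks for the root split; rootSplit is a hypothetical helper introduced here for illustration, not part of the original code.
# Repeat the simulation and record the root-split variable each time;
# a granularity bias shows up as the fine-grained predictors winning most often
rootSplit <- function() {
  a <- sample(1:10 / 10, 500, replace = TRUE)
  b <- sample(1:100 / 100, 500, replace = TRUE)
  c <- sample(1:1000 / 1000, 500, replace = TRUE)
  d <- sample(1:10000 / 10000, 500, replace = TRUE)
  e <- sample(1:100000 / 100000, 500, replace = TRUE)
  y <- a + b + c + d + e
  fit <- rpart(y ~ ., data = data.frame(a, b, c, d, e, y))
  as.character(fit$frame$var[1])
}
set.seed(101)
table(replicate(100, rootSplit()))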
8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
set.seed(20320)
# With only 200 simulated observations, bag.fraction is kept at 0.5 and
# n.minobsinnode at 5 so each tree still has enough data; only the
# learning rate (shrinkage) is pushed between the two extremes here
model_low <- gbm(
y ~ ., data = simulated, distribution = "gaussian",
n.trees = 1000, interaction.depth = 3, shrinkage = 0.1,
bag.fraction = 0.5, n.minobsinnode = 5, cv.folds = 5
)
model_high <- gbm(
y ~ ., data = simulated, distribution = "gaussian",
n.trees = 1000, interaction.depth = 3, shrinkage = 0.9,
bag.fraction = 0.5, n.minobsinnode = 5, cv.folds = 5
)
# Variable importance for each model
importance_low <- summary(model_low)
importance_high <- summary(model_high)
importance_low
## var rel.inf
## V4 V4 26.467740
## V2 V2 21.565109
## duplicate1 duplicate1 13.257782
## V1 V1 13.253856
## V5 V5 10.418938
## V3 V3 8.092216
## V7 V7 2.012433
## V6 V6 1.329586
## V8 V8 1.279923
## V9 V9 1.191215
## V10 V10 1.131201
importance_high
## var rel.inf
## duplicate1 duplicate1 19.368710
## V4 V4 18.805022
## V2 V2 14.139969
## V3 V3 10.264213
## V5 V5 8.225311
## V7 V7 7.691462
## V9 V9 7.434942
## V8 V8 4.484968
## V10 V10 3.840566
## V1 V1 3.270189
## V6 V6 2.474647
The model on the right, with both the learning rate and the bagging fraction set high, concentrates its importance on just a few predictors. A large learning rate lets each tree take big steps toward the strongest signals, so the most influential predictors are exploited quickly and repeatedly, and a large bagging fraction means every tree sees nearly the same data, which reduces the diversity of predictors that get a chance to be selected. With both parameters set to 0.1, learning proceeds in small increments on varied subsamples, so more predictors contribute and importance is spread across them.
The model on the left (low learning rate and low bagging fraction) would likely be more predictive of other samples. Its slower, more diverse fitting process spreads importance across more predictors and is less prone to overfitting, whereas the aggressive optimization on the right is more likely to overfit the training data and generalize poorly.
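To reproduce the contrast in Figure 8.24 more faithfully, the bagging fraction could be pushed to the same extremes as the learning rate. This is a sketch under the assumption that the 200-row simulated data can still support bag.fraction = 0.1 with a small minimum node size:
# Both tuning parameters at 0.1 (left-hand plot) vs. 0.9 (right-hand plot)
set.seed(20320)
model_0.1 <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 3,
                 shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5)
model_0.9 <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 3,
                 shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5)
summary(model_0.1)  # expected: importance spread across more predictors
summary(model_0.9)  # expected: importance concentrated on the top few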
8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
# Load required libraries
library(AppliedPredictiveModeling)
library(caret)
library(randomForest)
library(ipred)
library(gbm)
library(rpart)
library(rpart.plot)
# Load the data
data("ChemicalManufacturingProcess")
# Verify dataset is loaded correctly
if (nrow(ChemicalManufacturingProcess) == 0) {
stop("Dataset is empty. Check the data source.")
}
# Handle missing values using k-NN imputation
imputer <- preProcess(ChemicalManufacturingProcess, method = "knnImpute", k = 5)
imputed_data <- predict(imputer, ChemicalManufacturingProcess)
# Filter out near-zero variance variables
filtered_data <- imputed_data[, -nearZeroVar(imputed_data)]
# Ensure filtered data is not empty
if (nrow(filtered_data) == 0) {
stop("Filtered data is empty. Check preprocessing steps.")
}
# Split data into training and testing sets
set.seed(613) # For reproducibility
train_indices <- sample(nrow(filtered_data), size = floor(0.8 * nrow(filtered_data)), replace = FALSE)
training_data <- filtered_data[train_indices, ]
testing_data <- filtered_data[-train_indices, ]
# Ensure splits are valid
if (nrow(training_data) == 0 || nrow(testing_data) == 0) {
stop("Training or testing data is empty. Check data splitting logic.")
}
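If the goal is to match the splitting scheme from Exercises 6.3 and 7.5 exactly, caret's createDataPartition (stratified on Yield) is the usual choice; this is a sketch with hypothetical object names, assuming an 80/20 split was used there, and it does not overwrite the split above.
# Alternative stratified 80/20 split with caret
set.seed(613)
in_train_alt <- createDataPartition(filtered_data$Yield, p = 0.8, list = FALSE)
training_alt <- filtered_data[in_train_alt, ]
testing_alt <- filtered_data[-in_train_alt, ]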
# Train models
# Random Forest Model
random_forest_model <- randomForest(Yield ~ .,
data = training_data,
importance = TRUE,
ntree = 1000)
rf_predictions <- predict(random_forest_model, testing_data)
# Bagged Tree Model
bagged_tree_model <- bagging(Yield ~ .,
data = training_data)
bagged_tree_predictions <- predict(bagged_tree_model, testing_data)
# Boosted Tree Model
boosted_tree_model <- gbm(Yield ~ .,
data = training_data,
distribution = "gaussian",
n.trees = 1000)
boosted_tree_predictions <- predict(boosted_tree_model, testing_data)
## Using 1000 trees...
# Cubist Model
cubist_model <- train(Yield ~ .,
data = training_data,
method = "cubist")
cubist_predictions <- predict(cubist_model, testing_data)
# Compare model performance
performance_comparison <- rbind(
RandomForest = postResample(rf_predictions, testing_data$Yield),
BaggedTree = postResample(bagged_tree_predictions, testing_data$Yield),
BoostedTree = postResample(boosted_tree_predictions, testing_data$Yield),
Cubist = postResample(cubist_predictions, testing_data$Yield)
)
print(performance_comparison)
## RMSE Rsquared MAE
## RandomForest 0.6240582 0.5225684 0.5052709
## BaggedTree 0.6770303 0.4435007 0.5418391
## BoostedTree 0.6618618 0.5656576 0.5076747
## Cubist 0.5847081 0.6158759 0.4676564
# Identify the optimal model based on RMSE and R-squared
cat("The model with the best performance is:",
rownames(performance_comparison)[which.min(performance_comparison[, "RMSE"])])
## The model with the best performance is: Cubist
# Variable importance for the Cubist model
plot(varImp(cubist_model), 10)
# Plot a single tree
single_tree_model <- rpart(Yield ~ .,
data = training_data)
rpart.plot(single_tree_model)
The Cubist model gives the best performance, with the highest R-squared and the lowest RMSE and MAE of the Random Forest, bagged tree, and boosted tree models, so it is selected as the optimal model.
Manufacturing process variables dominate the list of top predictors. These top predictors largely overlap with those identified by the nonlinear SVM and lasso models in the earlier exercises, although the rankings and a few of the shared variables differ.
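A quick tabulation backs up the claim that process variables dominate; this is a sketch, assuming a top-10 cut-off and that the process predictors all share the ManufacturingProcess name prefix.
# Count process vs. biological predictors among the Cubist top 10
cubist_imp <- varImp(cubist_model)$importance
top10 <- rownames(cubist_imp)[order(cubist_imp$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl("^ManufacturingProcess", top10), "Process", "Biological"))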