Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson.
library(caret)
library(tidyverse)
library(AppliedPredictiveModeling)
library(corrplot)
library(e1071)
library(mlbench)
library(randomForest)
library(rpart)
library(party)
library(gbm)
library(Cubist)
library(partykit)
Recreate the simulated data from Exercise 7.2:
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
Positive importance scores indicate that a predictor contributes to the model’s performance, while negative scores indicate that permuting the predictor actually improves the out-of-bag error, i.e., the predictor is not helping. Since all of V6 – V10 have non-zero importance scores, the random forest did use them at some point during training. However, the magnitude of the scores also matters: the scores for V6 – V10 are near zero (and several are negative) and are small relative to those of V1 – V5, so I conclude that these predictors are not significant contributors to the model’s performance.
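As a quick visual check (my own addition, not part of the assignment output), the permutation importance can be plotted; V6 – V10 should sit near zero relative to V1 – V5.
# Plot the %IncMSE permutation importance from model1 (type = 1)
varImpPlot(model1, type = 1, scale = FALSE)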
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
Because the new predictor duplicate1 is highly correlated with V1, the importance score for V1 decreases: some of the predictive signal previously attributed to V1 is now shared with duplicate1. Adding yet another predictor that is highly correlated with V1 would dilute V1’s importance even further, since several predictors would then be capturing the same information.
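As an illustrative check (duplicate2 is a name I introduce here), adding a second noisy copy of V1 and refitting shows V1’s importance dropping further; the extra column is removed afterwards so the later models use the same predictors as above.
set.seed(200)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1   # a second copy highly correlated with V1
model2b <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
varImp(model2b, scale = FALSE)[c("V1", "duplicate1", "duplicate2"), , drop = FALSE]
simulated$duplicate2 <- NULL   # note: this check consumes random numbers, so later results may shift slightly when re-run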
model3 <- cforest(y ~ ., data = simulated)
cfImp3 <- varimp(model3, conditional = TRUE)
(as.data.frame(cbind(Overall = rfImp2, Condition = cfImp3)))
## Overall Condition
## V1 5.69119973 3.33065561
## V2 6.06896061 5.22819343
## V3 0.62970218 0.04936459
## V4 7.04752238 5.78489850
## V5 1.87238438 1.39002706
## V6 0.13569065 -0.07029802
## V7 -0.01345645 -0.15477759
## V8 -0.04370565 -0.35461203
## V9 0.00840438 -0.11458990
## V10 0.02894814 -0.17347857
## duplicate1 4.28331581 2.78120557
The conditional importance scores from the conditional inference forest differ from the traditional random forest scores: conditioning on correlated predictors reduces the inflated credit given to V1 and duplicate1, and the uninformative predictors V6 – V10 drop to roughly zero or below.
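To see how much of that difference comes from the conditional permutation scheme itself, the same cforest model can also be scored without conditioning (a quick check, not part of the original output):
cfImp3_uncond <- varimp(model3, conditional = FALSE)   # standard (unconditional) permutation importance
round(cbind(conditional = cfImp3, unconditional = cfImp3_uncond), 3)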
Boosted Tree
modelgbm <- gbm(y ~ ., data = simulated, distribution = "gaussian")
(gbmsum <- summary(modelgbm))
## var rel.inf
## V4 V4 30.3669149
## V1 V1 24.4090151
## V2 V2 21.2680313
## V5 V5 11.4582221
## V3 V3 8.1351552
## duplicate1 duplicate1 4.1515584
## V10 V10 0.2111029
## V6 V6 0.0000000
## V7 V7 0.0000000
## V8 V8 0.0000000
## V9 V9 0.0000000
simulatedx <- simulated[, -which(names(simulated) == "y")]
modelcubist <- cubist(x = simulatedx, y = simulated$y)
(cubeImp <- varImp(modelcubist))
## Overall
## V1 50
## V2 50
## V4 50
## V5 50
## duplicate1 50
## V3 0
## V6 0
## V7 0
## V8 0
## V9 0
## V10 0
The importance rankings continue to show V6 – V10 as the least influential predictors, while V1 – V5 and ‘duplicate1’ remain the most important. The exact ordering among the top-ranked predictors, however, varies from model to model.
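To make the comparison easier, the four sets of scores can be collected into one table (a sketch; the scales differ across model types, so only the rankings are directly comparable):
# Combine the importance scores computed above, matched by predictor name
imp_all <- data.frame(rf      = rfImp2$Overall,
                      cforest = cfImp3[rownames(rfImp2)],
                      gbm     = gbmsum[rownames(rfImp2), "rel.inf"],
                      cubist  = cubeImp[rownames(rfImp2), "Overall"],
                      row.names = rownames(rfImp2))
imp_all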
Use a simulation to show tree bias with different granularities.
In the context of tree models, “granularity” refers to how finely a predictor is measured, i.e., how many distinct values it takes and therefore how many candidate split points it offers. Lower granularity means the tree makes decisions based on coarser information, while higher granularity means it can split on more finely resolved information. With low granularity the tree may over-generalize, while with high granularity it has many more possible splits and can overfit the training data, capturing noise that does not generalize; this is also why trees tend to be biased toward selecting high-granularity predictors.
set.seed(100)
n <- 100                           # each x value is replicated n times
granularities <- c(0.1, 1, 2, 5)   # step sizes used to generate x

# Helper: training-set RMSE for a fitted tree
calculate_rmse <- function(model, data) {
  predictions <- predict(model, data)
  sqrt(mean((predictions - data$y)^2))
}

results <- data.frame()
for (granularity in granularities) {
  x <- seq(0, 10, by = granularity)
  y <- 2 * x + rnorm(length(x), sd = 0.1)   # linear signal plus noise
  data <- data.frame(x = rep(x, each = n), y = rep(y, each = n))
  model <- rpart(y ~ x, data = data)
  results <- rbind(results, data.frame(granularity = granularity, rmse = calculate_rmse(model, data)))
}
ggplot(results, aes(x = granularity, y = rmse)) +
  geom_line() +
  geom_point() +
  labs(title = "Effect of Tree Bias with Different Granularities",
       x = "Granularity",
       y = "Root Mean Squared Error") +
  theme_minimal()
A low RMSE indicates that the tree model is performing well at that granularity, with predictions closely aligned with the actual values. A high RMSE suggests poor performance at that granularity, indicating significant deviation between predicted and actual values.
An RMSE of zero suggests that the model has perfectly captured the relationship between predictor variables and the target variable, without any error on the training data. However, such a model is likely overfitted and may perform poorly on unseen data due to its excessive reliance on training data specifics.
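A complementary sketch of the granularity bias itself (my own illustration, assuming “granularity” is read as the number of distinct values a predictor takes): give a tree two uninformative predictors, one with many distinct values and one with only a few, and record which one is chosen for the first split. The high-granularity predictor is selected far more often even though neither carries any signal.
set.seed(101)
first_split <- replicate(200, {
  x_fine   <- runif(200)                         # roughly 200 distinct values
  x_coarse <- sample(1:4, 200, replace = TRUE)   # only 4 distinct values
  y        <- rnorm(200)                         # pure noise response
  fit <- rpart(y ~ x_fine + x_coarse,
               control = rpart.control(maxdepth = 1, cp = -1))
  as.character(fit$frame$var[1])                 # variable used at the root split
})
table(first_split)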
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The bagging fraction and learning rate in stochastic gradient boosting control how each tree is constructed and how much it contributes to the ensemble. When both parameters are set to high values (0.9), as in the right-hand plot, each tree is fit to almost all of the training data and its contribution is added with very little shrinkage. Successive trees therefore look similar and keep reinforcing the strongest predictors, so the importance is concentrated on just a few of them.
When both parameters are set to low values (0.1), as in the left-hand plot, each tree sees only a small random subset of the data and contributes only a small, shrunken step toward the final model, so many more trees are needed to capture the structure. Different subsets bring different predictors into play, which spreads the importance across more predictors.
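This can be checked on the simulated data from 8.1 (a sketch only; the book’s figure uses the solubility data): fit gbm with both parameters at 0.1 and again with both at 0.9, then compare how concentrated the relative influence is.
set.seed(200)
gbm_low  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
summary(gbm_low,  plotit = FALSE)   # influence spread across more predictors
summary(gbm_high, plotit = FALSE)   # influence concentrated in the top few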
(b) Which model do you think would be more predictive of other samples?
The model on the left, with low values for both the bagging fraction and the learning rate, builds many small, heavily shrunk updates from different random subsets of the data. This typically generalizes better and is more robust to overfitting, so I would expect it to be more predictive of other samples.
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth would likely amplify the differences in predictor importance between the two models.
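A rough way to probe this (again on the simulated data, as a sketch rather than a re-creation of Fig. 8.24) is to refit with a much larger interaction.depth and look at how the relative influence changes:
set.seed(200)
gbm_deep <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.5,
                interaction.depth = 10)
summary(gbm_deep, plotit = FALSE)   # compare the spread of influence against the shallower fits above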
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
data(ChemicalManufacturingProcess)
set.seed(300)
yield_index <- which(names(ChemicalManufacturingProcess) == "Yield")
imputed <- preProcess(ChemicalManufacturingProcess[, -yield_index], method = c("bagImpute"))
preProcess_data <- predict(imputed , ChemicalManufacturingProcess)
low_values <- nearZeroVar(preProcess_data)
chem_data <- preProcess_data[,-low_values]
set.seed(300)
train_index <- createDataPartition(chem_data$Yield, p = 0.80, list = FALSE)
train_cmp <- chem_data[train_index, ]
test_cmp <- chem_data[-train_index, ]
train <- train_cmp[-c(1)]   # drop the Yield column (column 1), keeping predictors only
test <- test_cmp[-c(1)]
Simple Tree
set.seed(300)
model_st <- rpart(Yield ~ ., data = train_cmp)
stPred <- predict(model_st, newdata = test_cmp)
postResample(pred = stPred, obs = test_cmp$Yield)
## RMSE Rsquared MAE
## 1.5531424 0.3771933 1.3252130
Boosted Trees
set.seed(300)
model_gbm <- gbm.fit(train, train_cmp$Yield, distribution = "gaussian")
gbmPred <- predict(model_gbm, newdata = test_cmp)
postResample(pred = gbmPred, obs = test_cmp$Yield)
## RMSE Rsquared MAE
## 1.6841877 0.2440884 1.3033518
Cubist
set.seed(300)
model_cubist <- cubist(train, train_cmp$Yield)
cubistPred <- predict(model_cubist , newdata = test_cmp)
postResample(pred = cubistPred, obs = test_cmp$Yield)
## RMSE Rsquared MAE
## 1.3417320 0.4570655 1.0894852
Random Forest
set.seed(300)
model_rf <- randomForest(train, train_cmp$Yield, importance = TRUE, ntree = 1000)
rfPred <- predict(model_rf, newdata = test_cmp)
postResample(pred = rfPred, obs = test_cmp$Yield)
## RMSE Rsquared MAE
## 1.0150538 0.6411125 0.7783584
(a) Which tree-based regression model gives the optimal resampling and test set performance?
The random forest model, with a test-set RMSE of about 1.02 and the highest R-squared (about 0.64), clearly outperforms the single tree, boosted tree, and Cubist models on the held-out data, making it the preferable choice.
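The models above were compared only on the held-out test split; a resampling estimate could be used to confirm the ranking. A minimal sketch for the random forest alone, assuming 10-fold cross-validation via caret::train:
set.seed(300)
rf_cv <- train(x = train, y = train_cmp$Yield, method = "rf",
               trControl = trainControl(method = "cv", number = 10))
rf_cv$results[rf_cv$results$mtry == rf_cv$bestTune$mtry, ]   # CV RMSE/Rsquared at the chosen mtry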
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
varImpPlot(model_rf)
The process variables still dominate the list even in the tree-based model.
When comparing the linear and nonlinear models, I observed that in the linear model ManufacturingProcess37 emerged as the top predictor, whereas in the nonlinear model ManufacturingProcess32 took precedence. This discrepancy suggests that the linear model may not adequately capture the underlying relationships in the data, or that it misses nonlinear patterns that the nonlinear model is able to capture.
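For an explicit top-10 list to set against the earlier linear and nonlinear fits, the random forest importances can be pulled out directly (a sketch; the earlier models are not re-run here):
rf_imp <- varImp(model_rf, scale = FALSE)
head(rf_imp[order(-rf_imp$Overall), , drop = FALSE], 10)   # top 10 predictors by importance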
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
This view of the data does provide additional insight into the predictors and their relationship with yield: it shows that, for the predictors used in the splits, higher values tend to correspond to higher yields in the terminal nodes.
#produce tree plot
simple_tree <- as.party(model_st)
plot(simple_tree, ip_args = list(abbreviate = 4), gp = gpar(fontsize = 8))
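To read the terminal nodes numerically, the mean yield in each node can be computed on the training data the tree was fit on (a small sketch):
node_id <- predict(simple_tree, newdata = train_cmp, type = "node")   # terminal node assignments
tapply(train_cmp$Yield, node_id, mean)                                # mean yield per terminal node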