Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson.
library(caret)
library(tidyverse)
library(AppliedPredictiveModeling)
library(corrplot)
library(e1071)
library(mlbench)
library(randomForest)
library(rpart)
library(party)
library(gbm)
library(Cubist)
library(partykit)
Recreate the simulated data from Exercise 7.2:
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
Positive importance scores indicate that a predictor contributes to the model’s performance, while negative scores indicate that permuting the predictor actually improves the out-of-bag error, i.e., the predictor is not helping. Since all of V6 – V10 have non-zero importance scores, the random forest did use them at some point during training. However, the magnitude of the scores also matters: the scores for V6 – V10 are near zero (and several are negative) and are small relative to those of V1 – V5, so I conclude that these predictors are not significant contributors to the model’s performance.
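As a quick visual check (my own addition, not part of the assignment output), the permutation importance can be plotted; V6 – V10 should sit near zero relative to V1 – V5.
# Plot the %IncMSE permutation importance from model1 (type = 1)
varImpPlot(model1, type = 1, scale = FALSE)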
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
Because the new predictor duplicate1 is highly correlated with V1, the importance score for V1 decreases: some of the predictive signal previously attributed to V1 is now shared with duplicate1. Adding yet another predictor that is highly correlated with V1 would dilute V1’s importance even further, since several predictors would then be capturing the same information.
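As an illustrative check (duplicate2 is a name I introduce here), adding a second noisy copy of V1 and refitting shows V1’s importance dropping further; the extra column is removed afterwards so the later models use the same predictors as above.
set.seed(200)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1   # a second copy highly correlated with V1
model2b <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
varImp(model2b, scale = FALSE)[c("V1", "duplicate1", "duplicate2"), , drop = FALSE]
simulated$duplicate2 <- NULL   # note: this check consumes random numbers, so later results may shift slightly when re-run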
model3 <- cforest(y ~ ., data = simulated)
cfImp3 <- varimp(model3, conditional = TRUE)
(as.data.frame(cbind(Overall = rfImp2, Condition = cfImp3)))
## Overall Condition
## V1 5.69119973 3.33065561
## V2 6.06896061 5.22819343
## V3 0.62970218 0.04936459
## V4 7.04752238 5.78489850
## V5 1.87238438 1.39002706
## V6 0.13569065 -0.07029802
## V7 -0.01345645 -0.15477759
## V8 -0.04370565 -0.35461203
## V9 0.00840438 -0.11458990
## V10 0.02894814 -0.17347857
## duplicate1 4.28331581 2.78120557
The conditional importance scores from the conditional inference forest differ from the traditional random forest scores: conditioning on correlated predictors reduces the inflated credit given to V1 and duplicate1, and the uninformative predictors V6 – V10 drop to roughly zero or below.
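To see how much of that difference comes from the conditional permutation scheme itself, the same cforest model can also be scored without conditioning (a quick check, not part of the original output):
cfImp3_uncond <- varimp(model3, conditional = FALSE)   # standard (unconditional) permutation importance
round(cbind(conditional = cfImp3, unconditional = cfImp3_uncond), 3)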
Boosted Tree
modelgbm <- gbm(y ~ ., data = simulated, distribution = "gaussian")
(gbmsum <- summary(modelgbm))
## var rel.inf
## V4 V4 30.3669149
## V1 V1 24.4090151
## V2 V2 21.2680313
## V5 V5 11.4582221
## V3 V3 8.1351552
## duplicate1 duplicate1 4.1515584
## V10 V10 0.2111029
## V6 V6 0.0000000
## V7 V7 0.0000000
## V8 V8 0.0000000
## V9 V9 0.0000000
simulatedx <- simulated[, -which(names(simulated) == "y")]
modelcubist <- cubist(x = simulatedx, y = simulated$y)
(cubeImp <- varImp(modelcubist))
## Overall
## V1 50
## V2 50
## V4 50
## V5 50
## duplicate1 50
## V3 0
## V6 0
## V7 0
## V8 0
## V9 0
## V10 0
The importance rankings continue to show V6 – V10 as the least influential predictors, while V1 – V5 and ‘duplicate1’ remain the most important. The exact ordering among the top-ranked predictors, however, varies from model to model.
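To make the comparison easier, the four sets of scores can be collected into one table (a sketch; the scales differ across model types, so only the rankings are directly comparable):
# Combine the importance scores computed above, matched by predictor name
imp_all <- data.frame(rf      = rfImp2$Overall,
                      cforest = cfImp3[rownames(rfImp2)],
                      gbm     = gbmsum[rownames(rfImp2), "rel.inf"],
                      cubist  = cubeImp[rownames(rfImp2), "Overall"],
                      row.names = rownames(rfImp2))
imp_all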
Use a simulation to show tree bias with different granularities.
In the context of tree models, “granularity” refers to how finely a predictor is measured, i.e., how many distinct values it takes and therefore how many candidate split points it offers. Lower granularity means the tree makes decisions based on coarser information, while higher granularity means it can split on more finely resolved information. With low granularity the tree may over-generalize, while with high granularity it has many more possible splits and can overfit the training data, capturing noise that does not generalize; this is also why trees tend to be biased toward selecting high-granularity predictors.
set.seed(100)
n <- 100                           # each x value is replicated n times
granularities <- c(0.1, 1, 2, 5)   # step sizes used to generate x

# Helper: training-set RMSE for a fitted tree
calculate_rmse <- function(model, data) {
  predictions <- predict(model, data)
  sqrt(mean((predictions - data$y)^2))
}

results <- data.frame()
for (granularity in granularities) {
  x <- seq(0, 10, by = granularity)
  y <- 2 * x + rnorm(length(x), sd = 0.1)   # linear signal plus noise
  data <- data.frame(x = rep(x, each = n), y = rep(y, each = n))
  model <- rpart(y ~ x, data = data)
  results <- rbind(results, data.frame(granularity = granularity, rmse = calculate_rmse(model, data)))
}
ggplot(results, aes(x = granularity, y = rmse)) +
  geom_line() +
  geom_point() +
  labs(title = "Effect of Tree Bias with Different Granularities",
       x = "Granularity",
       y = "Root Mean Squared Error") +
  theme_minimal()
A low RMSE indicates that the tree model is performing well at that granularity, with predictions closely aligned with the actual values. A high RMSE suggests poor performance at that granularity, indicating significant deviation between predicted and actual values.
An RMSE of zero suggests that the model has perfectly captured the relationship between predictor variables and the target variable, without any error on the training data. However, such a model is likely overfitted and may perform poorly on unseen data due to its excessive reliance on training data specifics.
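A complementary sketch of the granularity bias itself (my own illustration, assuming “granularity” is read as the number of distinct values a predictor takes): give a tree two uninformative predictors, one with many distinct values and one with only a few, and record which one is chosen for the first split. The high-granularity predictor is selected far more often even though neither carries any signal.
set.seed(101)
first_split <- replicate(200, {
  x_fine   <- runif(200)                         # roughly 200 distinct values
  x_coarse <- sample(1:4, 200, replace = TRUE)   # only 4 distinct values
  y        <- rnorm(200)                         # pure noise response
  fit <- rpart(y ~ x_fine + x_coarse,
               control = rpart.control(maxdepth = 1, cp = -1))
  as.character(fit$frame$var[1])                 # variable used at the root split
})
table(first_split)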
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The bagging fraction and learning rate in stochastic gradient boosting control how each tree is constructed and how much it contributes to the ensemble. When both parameters are set to high values (0.9), as in the right-hand plot, each tree is fit to almost all of the training data and its contribution is added with very little shrinkage. Successive trees therefore look similar and keep reinforcing the strongest predictors, so the importance is concentrated on just a few of them.
When both parameters are set to low values (0.1), as in the left-hand plot, each tree sees only a small random subset of the data and contributes only a small, shrunken step toward the final model, so many more trees are needed to capture the structure. Different subsets bring different predictors into play, which spreads the importance across more predictors.
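This can be checked on the simulated data from 8.1 (a sketch only; the book’s figure uses the solubility data): fit gbm with both parameters at 0.1 and again with both at 0.9, then compare how concentrated the relative influence is.
set.seed(200)
gbm_low  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
summary(gbm_low,  plotit = FALSE)   # influence spread across more predictors
summary(gbm_high, plotit = FALSE)   # influence concentrated in the top few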
(b) Which model do you think would be more predictive of other samples?
The model on the left, with low values for both the bagging fraction and the learning rate, builds many small, heavily shrunk updates from different random subsets of the data. This typically generalizes better and is more robust to overfitting, so I would expect it to be more predictive of other samples.
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth would likely amplify the differences in predictor importance between the two models.
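A rough way to probe this (again on the simulated data, as a sketch rather than a re-creation of Fig. 8.24) is to refit with a much larger interaction.depth and look at how the relative influence changes:
set.seed(200)
gbm_deep <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.5,
                interaction.depth = 10)
summary(gbm_deep, plotit = FALSE)   # compare the spread of influence against the shallower fits above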
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
data(ChemicalManufacturingProcess)
set.seed(300)
yield_index <- which(names(ChemicalManufacturingProcess) == "Yield")
imputed <- preProcess(ChemicalManufacturingProcess[, -yield_index], method = c("bagImpute"))
preProcess_data <- predict(imputed , ChemicalManufacturingProcess)
low_values <- nearZeroVar(preProcess_data)
chem_data <- preProcess_data[,-low_values]
set.seed(300)
train_index <- createDataPartition(chem_data$Yield, p = 0.80, list = FALSE)
train_cmp <- chem_data[train_index, ]
test_cmp <- chem_data[-train_index, ]
train <- train_cmp[-c(1)]   # drop the Yield column (column 1), keeping predictors only
test <- test_cmp[-c(1)]
Simple Tree
set.seed(300)
model_st <- rpart(Yield ~ ., data = train_cmp)
stPred <- predict(model_st, newdata = test_cmp)
postResample(pred = stPred, obs = test_cmp$Yield)
## RMSE Rsquared MAE
## 1.5531424 0.3771933 1.3252130
Boosted Trees
set.seed(300)
model_gbm <- gbm.fit(train, train_cmp$Yield, distribution = "gaussian")
gbmPred <- predict(model_gbm, newdata = test_cmp)
postResample(pred = gbmPred, obs = test_cmp$Yield)
## RMSE Rsquared MAE
## 1.6841877 0.2440884 1.3033518
Cubist
set.seed(300)
model_cubist <- cubist(train, train_cmp$Yield)
cubistPred <- predict(model_cubist , newdata = test_cmp)
postResample(pred = cubistPred, obs = test_cmp$Yield)
## RMSE Rsquared MAE
## 1.3417320 0.4570655 1.0894852
Random Forest
set.seed(300)
model_rf <- randomForest(train, train_cmp$Yield, importance = TRUE, ntree = 1000)
rfPred <- predict(model_rf, newdata = test_cmp)
postResample(pred = rfPred, obs = test_cmp$Yield)
## RMSE Rsquared MAE
## 1.0150538 0.6411125 0.7783584
(a) Which tree-based regression model gives the optimal resampling and test set performance?
The random forest model, with a test-set RMSE of about 1.02 and the highest R-squared (about 0.64), clearly outperforms the single tree, boosted tree, and Cubist models on the held-out data, making it the preferable choice.
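The models above were compared only on the held-out test split; a resampling estimate could be used to confirm the ranking. A minimal sketch for the random forest alone, assuming 10-fold cross-validation via caret::train:
set.seed(300)
rf_cv <- train(x = train, y = train_cmp$Yield, method = "rf",
               trControl = trainControl(method = "cv", number = 10))
rf_cv$results[rf_cv$results$mtry == rf_cv$bestTune$mtry, ]   # CV RMSE/Rsquared at the chosen mtry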
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
varImpPlot(model_rf)
The process variables still dominate the list even in the tree-based model.
When comparing the linear and nonlinear models, I observed that in the linear model ManufacturingProcess37 emerged as the top predictor, whereas in the nonlinear model ManufacturingProcess32 took precedence. This discrepancy suggests that the linear model may not adequately capture the underlying relationships in the data, or that it misses nonlinear patterns that the nonlinear model is able to capture.
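For an explicit top-10 list to set against the earlier linear and nonlinear fits, the random forest importances can be pulled out directly (a sketch; the earlier models are not re-run here):
rf_imp <- varImp(model_rf, scale = FALSE)
head(rf_imp[order(-rf_imp$Overall), , drop = FALSE], 10)   # top 10 predictors by importance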
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
This view of the data does provide additional insight into the predictors and their relationship with yield: it shows that, for the predictors used in the splits, higher values tend to correspond to higher yields in the terminal nodes.
#produce tree plot
simple_tree <- as.party(model_st)
plot(simple_tree, ip_args = list(abbreviate = 4), gp = gpar(fontsize = 8))
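To read the terminal nodes numerically, the mean yield in each node can be computed on the training data the tree was fit on (a small sketch):
node_id <- predict(simple_tree, newdata = train_cmp, type = "node")   # terminal node assignments
tapply(train_cmp$Yield, node_id, mean)                                # mean yield per terminal node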