Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.

Exercise 8.1.

Recreate the simulated data from Exercise 7.2:

7.2. Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: y = 10 sin(πx1x2) + 20(x3 − 0.5)^2 + 10x4 + 5x5 + N(0, σ^2), where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

set.seed(200) 
simulated <- mlbench.friedman1(200, sd = 1) 
simulated <- cbind(simulated$x, simulated$y) 
simulated <- as.data.frame(simulated) # Convert simulated to a data frame
colnames(simulated)[ncol(simulated)] <- "y" # Rename the last column to "y"
head(simulated)
##          V1        V2         V3         V4         V5         V6        V7
## 1 0.5337724 0.6478064 0.85078526 0.18159957 0.92903976 0.36179060 0.8266609
## 2 0.5837650 0.4381528 0.67272659 0.66924914 0.16379784 0.45305931 0.6489601
## 3 0.5895783 0.5879065 0.40967108 0.33812728 0.89409334 0.02681911 0.1785614
## 4 0.6910399 0.2259548 0.03335447 0.06691274 0.63744519 0.52500637 0.5133614
## 5 0.6673315 0.8188985 0.71676079 0.80324287 0.08306864 0.22344157 0.6644906
## 6 0.8392937 0.3862983 0.64618857 0.86105431 0.63038947 0.43703891 0.3360117
##          V8         V9       V10        y
## 1 0.4214081 0.59111440 0.5886216 18.46398
## 2 0.8446239 0.92819306 0.7584008 16.09836
## 3 0.3495908 0.01759542 0.4441185 17.76165
## 4 0.7970260 0.68986918 0.4450716 13.78730
## 5 0.9038919 0.39696995 0.5500808 18.42984
## 6 0.6489177 0.53116033 0.9066182 20.85817

Part A:

Fit a random forest model to all of the predictors, then estimate the variable importance scores. Did the random forest model significantly use the uninformative predictors (V6–V10)?

# Fit random forest
set.seed(133)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
model1 
## 
## Call:
##  randomForest(formula = y ~ ., data = simulated, importance = TRUE,      ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 6.610691
##                     % Var explained: 72.89
rf_importance <- importance(model1)
# Variable importance
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##          Overall
## V1   8.464954033
## V2   6.673866886
## V3   0.746333091
## V4   7.740006778
## V5   2.099035186
## V6   0.236937500
## V7   0.017752344
## V8  -0.108004000
## V9  -0.002389678
## V10 -0.065000790

The uninformative predictors V6–V10 all have importance scores essentially at zero (ranging from about −0.11 to 0.24), while V1, V2, V4, and V5 score far higher (V3 is the one informative predictor that receives a low score), so the random forest did not make meaningful use of V6–V10.

Visualization

 # Convert to data frame for easier manipulation
importance_df <- as.data.frame(rfImp1)
# Plot
ggplot(importance_df, aes(x = rownames(importance_df), y = Overall)) +
  geom_bar(stat = "identity") +
  xlab("Predictors") +
  ylab("Importance") +
  ggtitle("Feature Importance")

Part B:

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

Fit another random forest model to these data. Did the importance score for V1 change?

# Fit another random forest model with the same data
set.seed(200)
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)

# Extract variable importance scores
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##          Overall
## V1   8.605365900
## V2   6.831259165
## V3   0.741534943
## V4   7.883384091
## V5   2.244750293
## V6   0.136054182
## V7   0.055950944
## V8  -0.068195812
## V9   0.003196175
## V10 -0.054705900

Refitting the same ten-predictor data with a different seed moves V1's importance only slightly (from about 8.46 to 8.61). This reflects the stochastic nature of random forest modeling (bootstrap samples and random feature selection at each split) rather than any change in the underlying data.
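To make that point concrete, here is a small sketch (not part of the original run) that refits the forest under a few different seeds and tracks V1's permutation importance; the modest spread it produces is the same seed-to-seed variation seen between model1 and model2 above.

# Sketch: V1's permutation importance (%IncMSE) across a few random seeds.
# Uses the same 10-predictor data frame as above (V1_new has not been added yet).
library(randomForest)
v1_scores <- sapply(c(133, 200, 300, 400, 500), function(s) {
  set.seed(s)
  fit <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
  randomForest::importance(fit)["V1", "%IncMSE"]
})
round(v1_scores, 2)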

What happens when you add another predictor that is also highly correlated with V1?

# Create a new predictor highly correlated with V1
set.seed(13)
simulated$V1_new <- simulated$V1 + rnorm(nrow(simulated), mean = 0, sd = 0.1)
# Fit a random forest model with the new predictor
set.seed(200)
model_with_correlated <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
# Extract variable importance
importance_with_correlated <- varImp(model_with_correlated, scale = FALSE)
importance_with_correlated
##             Overall
## V1      5.416517555
## V2      5.942905608
## V3      0.507376567
## V4      6.547254987
## V5      1.909883318
## V6      0.209414138
## V7      0.017144201
## V8     -0.157541951
## V9      0.058982831
## V10    -0.007035424
## V1_new  4.541258799
# Extract variable importance for the original predictors only
original_predictors <- rownames(rfImp1) 
original_predictors
##  [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10"
new_importance <- importance_with_correlated[rownames(importance_with_correlated) %in% original_predictors, , drop = FALSE]
# Create a comparison table
comparison <- data.frame(
  Predictor = original_predictors,
  Original_Importance = rfImp1$Overall,
  With_Correlated_Importance = new_importance$Overall
)

comparison
##    Predictor Original_Importance With_Correlated_Importance
## 1         V1         8.464954033                5.416517555
## 2         V2         6.673866886                5.942905608
## 3         V3         0.746333091                0.507376567
## 4         V4         7.740006778                6.547254987
## 5         V5         2.099035186                1.909883318
## 6         V6         0.236937500                0.209414138
## 7         V7         0.017752344                0.017144201
## 8         V8        -0.108004000               -0.157541951
## 9         V9        -0.002389678                0.058982831
## 10       V10        -0.065000790               -0.007035424

When a predictor that is highly correlated with V1 is included, the two variables become redundant: because each split considers only a random subset of predictors, either variable can stand in for the other, so they compete to explain the same variance in the response. Consequently, the importance formerly credited to V1 alone (about 8.5) is split between V1 (5.4) and V1_new (4.5), and both scores are lower than V1's original value.
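As a quick check (not run above), the strength of the V1/V1_new relationship can be verified directly; with sd = 0.1 noise added to a Uniform(0, 1) variable, the correlation should be in the neighborhood of 0.95.

# How strongly does the added predictor track V1?
cor(simulated$V1, simulated$V1_new)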

Part C:

Use the cforest function in the party package to fit a random forest model using conditional inference trees.

library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## 
## Attaching package: 'party'
## The following object is masked from 'package:dplyr':
## 
##     where
# Fit a cforest model
set.seed(133)
cforest_model <- cforest(y ~ ., data = simulated, controls = cforest_unbiased(ntree = 1000, mtry = 3))

The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007).

Do these importances show the same pattern as the traditional random forest model?

# Calculate traditional variable importance
traditional_importance <- varimp(cforest_model)
traditional_importance
##            V1            V2            V3            V4            V5 
##  4.5581513597  5.1755314675  0.0521720106  5.7608891485  1.5430552833 
##            V6            V7            V8            V9           V10 
##  0.0133078449  0.0004880347 -0.0607737834  0.0189750261 -0.0230174705 
##        V1_new 
##  4.8450967740
# Calculate conditional variable importance
conditional_importance <- varimp(cforest_model, conditional = TRUE)
conditional_importance 
##          V1          V2          V3          V4          V5          V6 
##  1.72071381  3.70456147  0.06432950  4.08955479  0.91003577  0.04116603 
##          V7          V8          V9         V10      V1_new 
##  0.04433420 -0.01252938  0.02354509 -0.02566334  1.65356169
# Combine results for comparison
importance_cforest <- data.frame(
  Predictor = names(traditional_importance),
  Traditional = traditional_importance,
  Conditional = conditional_importance
)


importance_cforest
##        Predictor   Traditional Conditional
## V1            V1  4.5581513597  1.72071381
## V2            V2  5.1755314675  3.70456147
## V3            V3  0.0521720106  0.06432950
## V4            V4  5.7608891485  4.08955479
## V5            V5  1.5430552833  0.91003577
## V6            V6  0.0133078449  0.04116603
## V7            V7  0.0004880347  0.04433420
## V8            V8 -0.0607737834 -0.01252938
## V9            V9  0.0189750261  0.02354509
## V10          V10 -0.0230174705 -0.02566334
## V1_new    V1_new  4.8450967740  1.65356169

Traditional variable importance scores show that V4, V2, V1_new, and V1 are the most influential predictors (about 5.76, 5.18, 4.85, and 4.56, respectively), while V6 through V10 exhibit negligible or negative contributions, indicating their irrelevance. Conditional importance, which adjusts for correlations among predictors, redistributes these scores: V2 and V4 retain relatively high values (3.70 and 4.09), while V1 drops from 4.56 to 1.72 and V1_new from 4.85 to 1.65, reflecting the predictive power the two correlated predictors share. The low-importance predictors (V6 to V10) remain consistently uninformative under both measures.

This adjustment in conditional importance demonstrates its ability to clarify the interpretation of predictor contributions. By correcting for correlated predictors, it highlights V2 and V4 as stronger independent contributors and lowers V1’s inflated traditional score. Both methods consistently recognize V6 to V10 as irrelevant. Overall, conditional importance offers a clearer and more accurate depiction of the predictors’ true impact on the model.

Part D:

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted Tree Model

library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# Fit a boosted tree model
set.seed(133)
boosted_model <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 1000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)

# Extract variable importance
boosted_importance <- summary(boosted_model, plotit = FALSE)


boosted_importance
##           var    rel.inf
## V4         V4 27.1747729
## V2         V2 22.0589859
## V1         V1 17.1876086
## V5         V5 11.5805270
## V1_new V1_new 10.8284757
## V3         V3  7.2263704
## V6         V6  1.0988312
## V7         V7  0.9552244
## V9         V9  0.6895060
## V8         V8  0.6776883
## V10       V10  0.5220096

Cubist Model

library(Cubist)
# Fit a Cubist model
cubist_model <- cubist(x = simulated[, -ncol(simulated)], y = simulated$y, committees = 10)

# Print the Cubist model summary (rule structure and attribute usage; this is not an importance measure)
cubist_importance <- summary(cubist_model)

cubist_importance
## 
## Call:
## cubist.default(x = simulated[, -ncol(simulated)], y = simulated$y, committees
##  = 10)
## 
## 
## Cubist [Release 2.07 GPL Edition]  Tue Nov 19 23:22:57 2024
## ---------------------------------
## 
##     Target attribute `outcome'
## 
## Read 200 cases (12 attributes) from undefined.data
## 
## Model 1:
## 
##   Rule 1/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## Model 2:
## 
##   Rule 2/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## Model 3:
## 
##   Rule 3/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## Model 4:
## 
##   Rule 4/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## Model 5:
## 
##   Rule 5/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## Model 6:
## 
##   Rule 6/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## Model 7:
## 
##   Rule 7/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## Model 8:
## 
##   Rule 8/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## Model 9:
## 
##   Rule 9/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## Model 10:
## 
##   Rule 10/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 0.000000]
## 
##  outcome = 0 + 1 y
## 
## 
## Evaluation on training data (200 cases):
## 
##     Average  |error|           0.000000
##     Relative |error|               0.00
##     Correlation coefficient        1.00
## 
## 
##  Attribute usage:
##    Conds  Model
## 
##           100%    y
## 
## 
## Time: 0.0 secs
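The rules above collapse to outcome = 0 + 1 y with a perfect training correlation, which signals that the response itself slipped in as a predictor: after V1_new was appended, y is no longer the last column, so simulated[, -ncol(simulated)] drops V1_new rather than y. Below is a hedged sketch of a corrected fit that drops the response by name, followed by a usage-based importance extraction (my understanding is that caret supplies a varImp() method for Cubist objects).

library(Cubist)
library(caret)

# Drop the response by name so y cannot appear among the predictors
cubist_fixed <- cubist(x = subset(simulated, select = -y),
                       y = simulated$y, committees = 10)

# Usage-based importance scores for the corrected Cubist fit
varImp(cubist_fixed)

With this corrected fit, the Cubist rankings can be compared against the boosted-tree table above, where V4, V2, and V1 dominate, V1_new siphons off part of V1's influence, and V6–V10 contribute little, which is the same broad pattern as the random forest.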

Exercise 8.2:

Use a simulation to show tree bias with different granularities.

# Generate Data
set.seed(13)
n <- 1000
x1 <- runif(n, 0, 10)   # Continuous predictor
x2 <- runif(n, 0, 10)   # Another continuous predictor
noise <- rnorm(n, 0, 0.5)  # Add some noise
y <- 2*x1 - 3*x2 + noise  # Target variable
# Introduce Granularity
granularities <- c(0, 1, 2, 3)  # Number of decimal places
datasets <- lapply(granularities, function(g) {
  data <- data.frame(
    x1 = round(x1, g),
    x2 = round(x2, g),
    y = y
  )
  data
})
library(rpart)
library(randomForest)
library(dplyr)

results_list <- list()  # Initialize an empty list to store results

for (i in seq_along(granularities)) {
  data <- datasets[[i]]
  
  # Decision Tree
  tree_model <- rpart(y ~ ., data = data)
  tree_pred <- predict(tree_model, data)
  tree_rmse <- sqrt(mean((data$y - tree_pred)^2))
  
  # Random Forest
  rf_model <- randomForest(y ~ ., data = data, ntree = 100)
  rf_pred <- predict(rf_model, data)
  rf_rmse <- sqrt(mean((data$y - rf_pred)^2))
  
  # Record Results
  results_list[[length(results_list) + 1]] <- data.frame(
    Granularity = paste0("Decimal Places: ", granularities[i]),
    Model = "Decision Tree",
    RMSE = tree_rmse
  )
  
  results_list[[length(results_list) + 1]] <- data.frame(
    Granularity = paste0("Decimal Places: ", granularities[i]),
    Model = "Random Forest",
    RMSE = rf_rmse
  )
}

results <- bind_rows(results_list)  # Combine the list into a data frame
results
##         Granularity         Model      RMSE
## 1 Decimal Places: 0 Decision Tree 3.3891654
## 2 Decimal Places: 0 Random Forest 1.4580070
## 3 Decimal Places: 1 Decision Tree 3.1792126
## 4 Decimal Places: 1 Random Forest 0.4885186
## 5 Decimal Places: 2 Decision Tree 3.1555931
## 6 Decimal Places: 2 Random Forest 0.4002423
## 7 Decimal Places: 3 Decision Tree 3.1540609
## 8 Decimal Places: 3 Random Forest 0.3999103
# Step 4: Visualize Results
ggplot(results, aes(x = Granularity, y = RMSE, color = Model, group = Model)) +
  geom_line() +
  geom_point() +
  labs(title = "Impact of Granularity on Model Bias",
       x = "Predictor Granularity",
       y = "Root Mean Squared Error (RMSE)") +
  theme_minimal()

library(ggplot2)

ggplot(results, aes(x = Granularity, y = RMSE, color = Model, group = Model)) +
  geom_line(linewidth = 1) +  # Thicker lines for visibility (linewidth replaces the deprecated size aesthetic)
  geom_point(size = 3) +      # Larger points
  labs(title = "Impact of Granularity on Model Bias",
       x = "Predictor Granularity",
       y = "Root Mean Squared Error (RMSE)") +
  theme_minimal() +
  theme(
    legend.position = "top",                           # Move legend to the top
    axis.text.x = element_text(angle = 45, hjust = 1)  # Rotate x-axis labels for readability
  )
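The in-sample RMSE table above shows both models fitting the coarsely rounded predictors worst, with error shrinking as more decimal places are retained, and the random forest benefiting most from the finer granularity. The selection bias the exercise alludes to can also be shown more directly; the sketch below (an assumption-laden illustration, separate from the RMSE comparison) gives a tree two equally uninformative predictors that differ only in granularity and records which one the root split chooses.

# Sketch: root-split choice when neither predictor is informative
library(rpart)
set.seed(42)
first_split <- replicate(200, {
  n <- 200
  fine   <- runif(n)                        # effectively continuous: ~200 distinct values
  coarse <- sample(1:4, n, replace = TRUE)  # only four distinct values
  y      <- rnorm(n)                        # response unrelated to either predictor
  stump  <- rpart(y ~ fine + coarse,
                  control = rpart.control(maxdepth = 1, cp = 0))
  as.character(stump$frame$var[1])
})
table(first_split)  # the finer-grained predictor is typically chosen more often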

Exercise 8.3:

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9: Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors? Which model do you think would be more predictive of other samples? How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

knitr::include_graphics("C:/Users/dkbs0/OneDrive/Desktop/Data 624/HW9/hw9 screen shot.png")

Part A:

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

This analysis explains why the model on the right (with a high bagging fraction and a learning rate of 0.9) concentrates its importance on a limited number of key predictors, whereas the model on the left (with a low bagging fraction and a learning rate of 0.1) distributes importance across a wider range of predictors.

The learning rate dictates the contribution of each new tree to the final model. A lower learning rate (as seen on the left) results in each tree having a smaller influence, enabling the model to gradually understand the relationships within the data. This slower learning process tends to engage a broader array of predictors over the sequence of trees, as the model develops its understanding more evenly.

Conversely, a high learning rate (as in the model on the right) allows each tree to significantly influence the model’s predictions. Consequently, the model quickly focuses on a few strong predictors that enhance predictive accuracy right away. This often leads to an emphasis on these primary predictors.

The bagging fraction (or subsample rate) determines the portion of training data utilized to construct each tree. A lower bagging fraction (as seen on the left) introduces greater randomness, prompting the model to depend on a diverse set of predictors since each tree is fitted on a unique data subset. This promotes the exploration of different predictors, resulting in a more balanced importance profile.

In contrast, a high bagging fraction (as shown in the right plot) means that each tree utilizes a substantial portion of the data, allowing the model to consistently focus on the most predictive features across trees. This leads to the model converging on a few dominant predictors, creating a more concentrated emphasis on those variables.

When both the learning rate and bagging fraction are high (as illustrated in the right plot), the model learns rapidly from a consistent data subset, reinforcing the importance of the strongest predictors in each iteration. This swift convergence prioritizes a few predictors that significantly enhance model performance, resulting in a skewed importance distribution.

In contrast, when both parameters are set low (as in the left plot), the model takes a more cautious approach, exploring a wider array of predictors due to the randomness in sampling and the slower learning rate. This results in a more dispersed variable importance, as more predictors are assessed over time.

Thus, the model on the right emphasizes only a select few predictors due to its high bagging fraction and learning rate, focusing on the most predictive variables early in the boosting process. Meanwhile, the model on the left, with lower settings for both parameters, adopts a more gradual and exploratory strategy, attributing importance to a broader range of predictors throughout the boosting iterations. This illustrates how tuning parameters in stochastic gradient boosting affects the diversity and concentration of variable importance.
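A hedged sketch of this effect follows, using the Friedman simulation data from Exercise 8.1 rather than the solubility data behind Fig. 8.24, so the numbers are illustrative only. The expectation, following the discussion above, is that the right-hand settings (shrinkage and bag.fraction both 0.9) leave a noticeably larger share of the total importance in the top few predictors.

library(gbm)
set.seed(100)
gbm_left  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 100, interaction.depth = 3,
                 shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbm_right <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 100, interaction.depth = 3,
                 shrinkage = 0.9, bag.fraction = 0.9)

# rel.inf sums to 100, so this is the share of importance held by the top three predictors
c(left  = sum(head(summary(gbm_left,  plotit = FALSE)$rel.inf, 3)) / 100,
  right = sum(head(summary(gbm_right, plotit = FALSE)$rel.inf, 3)) / 100)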

Part B:

Which model do you think would be more predictive of other samples?

The model on the left, characterized by a lower learning rate and bagging fraction, is likely to demonstrate better predictive performance on other samples for several reasons:

Lower learning rates combined with a lower bagging fraction introduce greater randomness and a slower learning process, enabling the model to evaluate a wider array of features. This diversity helps the model generalize more effectively to unseen data, as it avoids becoming overly dependent on a few dominant predictors that are specific to the training dataset.

In contrast, a model with high learning rates and high bagging fractions (as seen in the model on the right) tends to risk overfitting to the specific patterns present in the training data, particularly if it quickly converges on a small group of strong predictors. While this model might excel on the training data, it could struggle with new samples that exhibit different feature distributions or noise.

The left model, with its more balanced importance across predictors, is better equipped to capture a variety of signals, making it more resilient to variations in other samples. This robustness is particularly advantageous when future samples are anticipated to differ somewhat from the current training set.

Although the model on the right may achieve higher accuracy on the existing dataset, the model on the left is more likely to deliver consistent performance across diverse samples, making it a more dependable option for broader predictive tasks.
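To check this expectation empirically, here is a hedged sketch on the solubility data themselves (assuming the solTrainXtrans / solTestXtrans objects from the AppliedPredictiveModeling package, as used throughout the book). Based on the reasoning above, the left-hand settings are expected, though not guaranteed, to give the lower test RMSE.

library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
sol_train <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
fit_left  <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 3,
                 shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
fit_right <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 3,
                 shrinkage = 0.9, bag.fraction = 0.9)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
c(left  = rmse(solTestY, predict(fit_left,  solTestXtrans, n.trees = 1000)),
  right = rmse(solTestY, predict(fit_right, solTestXtrans, n.trees = 1000)))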

Part C:

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth in stochastic gradient boosting influences predictor importance in both models, but in distinct ways due to their tuning parameters (bagging fraction and learning rate). For the model on the left, characterized by a low learning rate and bagging fraction, higher interaction depth allows for the exploration of more complex relationships among predictors. This could lead to a modest increase in the importance of key predictors while still maintaining a broader distribution of importance across many variables. In contrast, the model on the right, with a high learning rate and bagging fraction, would likely sharpen its focus on a few top predictors, resulting in a steeper slope of importance. This approach could quickly converge on strong interactions, amplifying the risk of overfitting and making the model less generalizable to new data. Overall, while both models would place greater emphasis on predictor interactions, the left model would maintain a more balanced importance distribution, whereas the right model would concentrate its focus, potentially sacrificing generalizability for improved predictive accuracy on the training data.
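One way to look at this empirically (again a sketch on the Exercise 8.1 simulation data, not the solubility data in the figure) is to track how much of the total importance the top three predictors hold as interaction.depth grows; a flatter importance profile corresponds to a smaller top-three share, a steeper one to a larger share.

library(gbm)
depth_share <- sapply(c(1, 5, 10), function(d) {
  set.seed(100)
  fit <- gbm(y ~ ., data = simulated, distribution = "gaussian",
             n.trees = 500, interaction.depth = d, shrinkage = 0.1)
  sum(head(summary(fit, plotit = FALSE)$rel.inf, 3)) / 100  # top-three share
})
names(depth_share) <- paste0("depth_", c(1, 5, 10))
round(depth_share, 3)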

Exercise 8.7:

Refer to Exercises 6.3 and 7.5, which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models.

Part A:

Which tree-based regression model gives the optimal resampling and test set performance?

# Load the data
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# The data frame holds the response (Yield) plus 57 biological and manufacturing process predictors

# Check for total missing values before imputation
total_missing_before <- sum(is.na(ChemicalManufacturingProcess))
total_missing_before
## [1] 106

Imputation

# Impute Missing Values using an imputation loop
for(i in seq_along(ChemicalManufacturingProcess)) {
  if (is.numeric(ChemicalManufacturingProcess[[i]])) {
    ChemicalManufacturingProcess[[i]][is.na(ChemicalManufacturingProcess[[i]])] <- mean(ChemicalManufacturingProcess[[i]], na.rm = TRUE)
  }
}

# Check for total missing values after imputation
total_missing_after <- sum(is.na(ChemicalManufacturingProcess))
total_missing_after
## [1] 0
# Perform KNN imputation
library(caret)

imputed.knn <- preProcess(ChemicalManufacturingProcess,
                           method = "knnImpute",
                           k = sqrt(nrow(ChemicalManufacturingProcess)))

# Impute the data
imputed_data <- predict(imputed.knn, ChemicalManufacturingProcess)

# Check for total missing values after imputation
total_missing_after <- sum(is.na(imputed_data))
total_missing_after
## [1] 0
imputed <- predict(preProcess(ChemicalManufacturingProcess, method = 'bagImpute'), ChemicalManufacturingProcess) # Imputation using bagImpute

total_missing_after <- sum(is.na(imputed)) # Check for total missing values after bagImpute
total_missing_after
## [1] 0
set.seed(13)
# Apply Box-Cox transformation to numeric columns
transformed_data <- ChemicalManufacturingProcess %>%
  dplyr::mutate(across(where(is.numeric), ~ predict(BoxCoxTrans(.), .)))

# Split the transformed data into training and test sets
trainIndex <- createDataPartition(transformed_data$Yield, p = 0.75, list = FALSE)
training_set <- transformed_data[trainIndex, ]
test_set <- transformed_data[-trainIndex, ]

Train Tree-Based Models: CART (Classification and Regression Trees)

cart_model <- rpart(Yield ~ ., data = training_set) # Train a CART model

# Evaluate the model
cart_predictions <- predict(cart_model, test_set)
cart_rmse <- sqrt(mean((cart_predictions - test_set$Yield)^2))
cart_rmse
## [1] 0.0001138218
# Get variable importance in CART model
importance_scores <- cart_model$variable.importance

# Convert the scores to a data frame and sort
importance_df <- data.frame(Variable = names(importance_scores), Importance = importance_scores)
importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE), ]

# Select the top 15 predictors
top_15_importance <- importance_df[1:15, ]

# Plot the top 15 variable importance scores
ggplot(top_15_importance, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 15 Predictors (CART Model)",
       x = "Predictor Variables",
       y = "Importance Score") +
  theme_minimal()

library(partykit)
## Loading required package: libcoin
## 
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp
set.seed(13)
# Train a Conditional Inference Tree (CTree)
ctree_model <- ctree(Yield ~ ., data = training_set)

# Evaluate the model
ctree_predictions <- predict(ctree_model, test_set)
ctree_rmse <- sqrt(mean((ctree_predictions - test_set$Yield)^2))
ctree_rmse
## [1] 0.0001634405
set.seed(13)
rf_model <- randomForest::randomForest(Yield ~ ., data = training_set, importance = TRUE)

# Evaluate model performance
rf_predictions <- predict(rf_model, test_set)
rf_rmse <- sqrt(mean((rf_predictions - test_set$Yield)^2))
cat("Random Forest RMSE:", rf_rmse, "\n")
## Random Forest RMSE: 9.710031e-05

Part B:

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

# Variable importance from the random forest fit to the chemical manufacturing data
rf_importance <- randomForest::importance(rf_model)

# Convert to data frame and sort by importance
rf_importance_df <- data.frame(Variable = rownames(rf_importance), 
                               Importance = rf_importance[, "IncNodePurity"])
rf_importance_df <- rf_importance_df[order(-rf_importance_df$Importance), ]

# Display top 10 important predictors
head(rf_importance_df, 10)

In the importance rankings for the chemical manufacturing data (see the boosted-tree table later in this part), the manufacturing process variables dominate the top of the list: ManufacturingProcess32 leads by a wide margin, and process variables take seven of the top ten positions. Several biological materials (BiologicalMaterial12, BiologicalMaterial03, and BiologicalMaterial08) also rank highly, so raw-material quality still contributes meaningfully to predicting yield, but the process variables carry most of the predictive weight.

# Visualization
ggplot(rf_importance_df, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  xlab("Variables") +
  ylab("Importance (IncNodePurity)") +
  ggtitle("Variable Importance from Random Forest Model") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))

# Sort the data frame by importance and select top 15 predictors
top_15_rf_importance_df <- rf_importance_df[order(rf_importance_df$Importance, decreasing = TRUE), ][1:15, ]

# Visualization
ggplot(top_15_rf_importance_df, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  xlab("Variables") +
  ylab("Importance (IncNodePurity)") +
  ggtitle("Top 15 predictors from Random Forest Model") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))

# For Random Forest model (rf_model)
importance_rf <- varImp(rf_model, scale = FALSE)
# Plot the variable importance
plot(importance_rf, main = "Random Forest Variable Importance")

set.seed(13)
# Train a Boosted Trees model
gbm_model <- gbm(Yield ~ ., data = training_set, distribution = "gaussian", n.trees = 100, interaction.depth = 3, shrinkage = 0.01)
# Evaluate GBM model
gbm_predictions <- predict(gbm_model, test_set, n.trees = 100)
gbm_rmse <- sqrt(mean((gbm_predictions - test_set$Yield)^2))
cat("Boosted Trees RMSE:", gbm_rmse, "\n")
## Boosted Trees RMSE: 0.0001243436
# Visualization: top 15 predictors by relative influence from the GBM model
gbm_imp <- summary(gbm_model, plotit = FALSE)
top_15_gbm <- head(gbm_imp, 15)

ggplot(top_15_gbm, aes(x = reorder(var, rel.inf), y = rel.inf)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  xlab("Variables") +
  ylab("Importance (Relative Influence)") +
  ggtitle("Top 15 Predictors from Boosted Trees (GBM)") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))


performance_table <- data.frame(
  Model = c("CART", "Conditional Inference Tree", "Random Forest", "Boosted Trees"),
  RMSE = c(cart_rmse, ctree_rmse, rf_rmse, gbm_rmse)
)

performance_table
##                        Model         RMSE
## 1                       CART 1.138218e-04
## 2 Conditional Inference Tree 1.634405e-04
## 3              Random Forest 9.710031e-05
## 4              Boosted Trees 1.243436e-04

Lower RMSE values indicate better model performance, as RMSE represents the average error between predicted and actual values. On the Box-Cox-transformed yield scale used here, Random Forest has the lowest test-set RMSE (9.71e-05, versus 1.14e-04 for CART, 1.24e-04 for Boosted Trees, and 1.63e-04 for the Conditional Inference Tree), so it provides the most accurate predictions and the optimal performance among the four trained models.
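The exercise also asks about resampling performance, which single test-set RMSEs do not measure. A hedged sketch of how the four models could be compared under 10-fold cross-validation on the same transformed training set with caret::train follows (the method strings are caret's standard ones, and tuning grids are left at their defaults, so the resampled numbers would differ somewhat from the hand-fit models above).

library(caret)
ctrl <- trainControl(method = "cv", number = 10)

set.seed(13); cv_cart  <- train(Yield ~ ., data = training_set, method = "rpart", trControl = ctrl)
set.seed(13); cv_ctree <- train(Yield ~ ., data = training_set, method = "ctree", trControl = ctrl)
set.seed(13); cv_rf    <- train(Yield ~ ., data = training_set, method = "rf",    trControl = ctrl)
set.seed(13); cv_gbm   <- train(Yield ~ ., data = training_set, method = "gbm",
                                trControl = ctrl, verbose = FALSE)

# Resampled RMSE and R-squared distributions, side by side
summary(resamples(list(CART = cv_cart, CTree = cv_ctree, RF = cv_rf, GBM = cv_gbm)))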

How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Linear Model (Ridge): For Ridge Regression (using glmnet::cv.glmnet), I will extract the coefficients and rank them by absolute magnitude to identify the most important predictors:
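A minimal sketch of that extraction, assuming a ridge fit on the same Box-Cox-transformed training set (the cv.glmnet object from Exercise 6.3 is not carried over into this file, so one is refit here purely for the ranking):

library(glmnet)

x_train <- as.matrix(training_set[, setdiff(names(training_set), "Yield")])
set.seed(13)
ridge_cv <- cv.glmnet(x_train, training_set$Yield, alpha = 0)   # alpha = 0 gives ridge

ridge_coef <- coef(ridge_cv, s = "lambda.min")[-1, 1]     # drop the intercept
head(sort(abs(ridge_coef), decreasing = TRUE), 10)        # top 10 by absolute coefficient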

Nonlinear Model (Boosted Trees - GBM): For Boosted Trees (GBM), I will use the gbm package to extract feature importance:

# Fit a GBM model
library(gbm)
gbm_model <- gbm(Yield ~ ., data = training_set, distribution = "gaussian", 
                 n.trees = 100, interaction.depth = 3, shrinkage = 0.01)

# Extract importance
gbm_importance <- summary(gbm_model, cBars = 10)  # cBars = 10 limits the importance plot to the 10 most influential predictors

For comparison, the full GBM variable importance ranking:

# Extract variable importance from the GBM model
gbm_importance <- summary(gbm_model, plotit = FALSE)  # Use plotit = FALSE to suppress the plot and get the data
gbm_importance
##                                           var     rel.inf
## ManufacturingProcess32 ManufacturingProcess32 36.25779062
## BiologicalMaterial12     BiologicalMaterial12 12.96097998
## ManufacturingProcess09 ManufacturingProcess09 10.40558885
## BiologicalMaterial03     BiologicalMaterial03  9.66288292
## ManufacturingProcess13 ManufacturingProcess13  3.65885692
## ManufacturingProcess31 ManufacturingProcess31  2.78330253
## BiologicalMaterial08     BiologicalMaterial08  2.76597385
## ManufacturingProcess43 ManufacturingProcess43  1.93836333
## ManufacturingProcess17 ManufacturingProcess17  1.73489563
## ManufacturingProcess06 ManufacturingProcess06  1.34137148
## ManufacturingProcess30 ManufacturingProcess30  1.32624849
## BiologicalMaterial11     BiologicalMaterial11  1.23519434
## ManufacturingProcess18 ManufacturingProcess18  1.21857599
## ManufacturingProcess11 ManufacturingProcess11  1.16152143
## ManufacturingProcess24 ManufacturingProcess24  1.09179232
## BiologicalMaterial05     BiologicalMaterial05  1.00776415
## ManufacturingProcess04 ManufacturingProcess04  0.91200258
## BiologicalMaterial06     BiologicalMaterial06  0.73639184
## ManufacturingProcess22 ManufacturingProcess22  0.65804587
## BiologicalMaterial10     BiologicalMaterial10  0.47979146
## ManufacturingProcess01 ManufacturingProcess01  0.44743872
## BiologicalMaterial02     BiologicalMaterial02  0.44485096
## ManufacturingProcess20 ManufacturingProcess20  0.42481420
## BiologicalMaterial09     BiologicalMaterial09  0.40455320
## ManufacturingProcess26 ManufacturingProcess26  0.38112080
## ManufacturingProcess05 ManufacturingProcess05  0.35721418
## ManufacturingProcess16 ManufacturingProcess16  0.34982993
## ManufacturingProcess15 ManufacturingProcess15  0.34819332
## ManufacturingProcess38 ManufacturingProcess38  0.32863602
## ManufacturingProcess14 ManufacturingProcess14  0.32752327
## ManufacturingProcess33 ManufacturingProcess33  0.31982993
## BiologicalMaterial04     BiologicalMaterial04  0.31052926
## ManufacturingProcess28 ManufacturingProcess28  0.29282202
## ManufacturingProcess25 ManufacturingProcess25  0.25859539
## ManufacturingProcess36 ManufacturingProcess36  0.24637447
## BiologicalMaterial01     BiologicalMaterial01  0.24236635
## ManufacturingProcess29 ManufacturingProcess29  0.22044450
## ManufacturingProcess19 ManufacturingProcess19  0.21049932
## ManufacturingProcess39 ManufacturingProcess39  0.19230827
## ManufacturingProcess45 ManufacturingProcess45  0.12692048
## ManufacturingProcess44 ManufacturingProcess44  0.12453451
## ManufacturingProcess42 ManufacturingProcess42  0.10823329
## ManufacturingProcess37 ManufacturingProcess37  0.10692317
## ManufacturingProcess08 ManufacturingProcess08  0.08810986
## BiologicalMaterial07     BiologicalMaterial07  0.00000000
## ManufacturingProcess02 ManufacturingProcess02  0.00000000
## ManufacturingProcess03 ManufacturingProcess03  0.00000000
## ManufacturingProcess07 ManufacturingProcess07  0.00000000
## ManufacturingProcess10 ManufacturingProcess10  0.00000000
## ManufacturingProcess12 ManufacturingProcess12  0.00000000
## ManufacturingProcess21 ManufacturingProcess21  0.00000000
## ManufacturingProcess23 ManufacturingProcess23  0.00000000
## ManufacturingProcess27 ManufacturingProcess27  0.00000000
## ManufacturingProcess34 ManufacturingProcess34  0.00000000
## ManufacturingProcess35 ManufacturingProcess35  0.00000000
## ManufacturingProcess40 ManufacturingProcess40  0.00000000
## ManufacturingProcess41 ManufacturingProcess41  0.00000000
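To make the biological-versus-process comparison explicit, a small sketch using the table just printed (summary() already returns the predictors sorted by relative influence):

top10 <- head(gbm_importance, 10)
table(ifelse(grepl("^Biological", top10$var), "Biological", "Process"))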

The GBM model tends to emphasize complex interactions and non-linear relationships, while Ridge regression weights the predictors with the strongest linear associations; Random Forest sits somewhere in between, capturing both linear and non-linear structure. In the GBM ranking above, importance is concentrated in a handful of predictors: ManufacturingProcess32 alone accounts for roughly 36% of the relative influence, followed by BiologicalMaterial12, ManufacturingProcess09, and BiologicalMaterial03. The linear (Ridge) model from Exercise 6.3 instead placed its greatest weight on biological materials such as BiologicalMaterial03, reflecting their strong linear relationship with yield, while the random forest spreads moderate importance across a mix of process and biological predictors. The tree-based models' ability to capture interactions and non-linearities explains why their rankings differ from Ridge's.

Part C:

Plot the optimal single tree with the distribution of yield in the terminal nodes.


CART Tree Visualization

library(rpart)
library(rpart.plot)

# Train a CART model
set.seed(13)
cart_model <- rpart(Yield ~ ., data = training_set, method = "anova")

# Plot the CART tree
rpart.plot(cart_model, type = 2, extra = 101, under = TRUE, tweak = 1.2,
           fallen.leaves = TRUE, main = "Optimal CART Tree for Yield Prediction")

# Yield distribution within each terminal node of the CART tree
training_set$TerminalNode <- factor(cart_model$where)  # leaf membership for each training row
ggplot(training_set, aes(x = TerminalNode, y = Yield)) +
  geom_boxplot() + labs(x = "Terminal node", y = "Yield (transformed)")

Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Yes, the tree clearly demonstrates that manufacturing processes have a significantly larger impact on yield compared to biological materials. The interpretability of a single tree model is certainly advantageous in this context.