Question 8.2:
library(partykit)  # provides cforest() and varimp() as used below
set.seed(123)
V1 <- runif(1000, 1, 1000)
V2 <- runif(1000, 20, 500)
V3 <- rnorm(1000, 100, 15)
y <- 0.3*V1 + 0.7*V2 + rnorm(1000, 0, 20)
data <- data.frame(V1, V2, V3, y)
tree_model <- cforest(y ~ ., data = data, ntree = 10)
var_imp <- varimp(tree_model, conditional = FALSE)
print(var_imp)
## V1 V2 V3
## 10938.2479 12455.4838 13.9709
barplot(var_imp, main = "Variable Importance (Unconditional)", col = "lightblue", las = 2, horiz = TRUE, cex.names = 0.8)
The importance scores above indicate that the response y is driven mainly by V2 and V1, with V2 having the largest impact, while V3 contributes essentially nothing. This matches the way the response was simulated: y = 0.3*V1 + 0.7*V2 plus noise, with no V3 term.
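As a quick sanity check (not part of the original output), a plain linear fit on the same simulated data should recover coefficients near the simulated values of 0.3 for V1 and 0.7 for V2, with V3 close to zero:
# Coefficients should be close to the simulated values: 0.3 (V1), 0.7 (V2), ~0 (V3)
summary(lm(y ~ V1 + V2 + V3, data = data))$coefficients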

Question 8.3: (a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The model on the right has both the bagging fraction and the learning rate set to high values (0.9). It uses most of the data for each tree and learns quickly, making strong commitments early on. As a result, it quickly pinpoints a few key predictors and relies heavily on them when making predictions.
By comparison, the model on the left has both parameters set to low values (0.1). It uses less data per tree and learns more slowly, so it considers a broader range of predictors and spreads importance more evenly across them.
The model on the left, with its lower bagging fraction and learning rate, is likely to be more predictive of new samples. Because it draws on a wider range of predictors, it should generalize better to new data. The model on the right may narrow in on too few predictors, making it more prone to overfitting and less capable of handling new, unseen data.
Increasing the interaction depth would allow the model to capture more complex relationships between predictors. For both models, this would likely cause the importance scores to become more spread out. In the left model, it would enhance the spread even further, while in the right model, it could potentially reduce the focus on the top few predictors by recognizing more nuanced interactions among a larger set of predictors.
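A minimal sketch of how two such fits could be reproduced with the gbm package, assuming the solubility data (solTrainXtrans, solTrainY) from AppliedPredictiveModeling, which is the data behind the figure in the text; only shrinkage (the learning rate) and bag.fraction (the bagging fraction) differ between the two fits:
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # provides solTrainXtrans and solTrainY

sol_train <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
gbm_left <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                n.trees = 100, interaction.depth = 1,
                shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbm_right <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                 n.trees = 100, interaction.depth = 1,
                 shrinkage = 0.9, bag.fraction = 0.9)

# Relative influence: the 0.9/0.9 fit should concentrate importance on a few
# predictors, while the 0.1/0.1 fit spreads it across many
head(summary(gbm_left, plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)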
Question 8.7:
data("ChemicalManufacturingProcess")
impute_model <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
pharm_data <- predict(impute_model, ChemicalManufacturingProcess)
pharm_data <- pharm_data[, -nearZeroVar(pharm_data)]
set.seed(5678)
train_indices <- createDataPartition(pharm_data$Yield, p = .8, list = FALSE)
train_features <- pharm_data[train_indices, -1]
train_target <- pharm_data[train_indices, 1]
test_features <- pharm_data[-train_indices, -1]
test_target <- pharm_data[-train_indices, 1]
CART Model
set.seed(12)
cart_model <- train(train_features, train_target, method = "rpart", tuneLength = 10, trControl = trainControl(method = "cv"))
cart_predictions <- predict(cart_model, test_features)
cart_performance <- postResample(cart_predictions, test_target)
Bagged Trees
set.seed(12)
bagged_model <- ipredbagg(train_target, train_features)
bagged_predictions <- predict(bagged_model, test_features)
bagged_performance <- postResample(bagged_predictions, test_target)
Random Forest
set.seed(12)
rf_model <- randomForest(train_features, train_target, importance = TRUE, ntree = 1000)
rf_predictions <- predict(rf_model, test_features)
rf_performance <- postResample(rf_predictions, test_target)
Boosted Trees
gbm_grid <- expand.grid(interaction.depth = c(1, 3, 5),
n.trees = c(100, 300, 500),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
# Reduce the number of cross-validation folds
set.seed(12)
gbm_model <- train(train_features, train_target, method = "gbm", tuneGrid = gbm_grid, verbose = FALSE, trControl = trainControl(method = "cv", number = 3))
gbm_predictions <- predict(gbm_model, test_features)
gbm_performance <- postResample(gbm_predictions, test_target)
Cubist
set.seed(12)
cubist_model <- train(train_features, train_target, method = "cubist")
cubist_predictions <- predict(cubist_model, test_features)
cubist_performance <- postResample(cubist_predictions, test_target)
Performance
model_performance <- data.frame(
Model = c("CART", "Bagged Trees", "Random Forest", "Boosted Trees", "Cubist"),
RMSE = c(cart_performance["RMSE"], bagged_performance["RMSE"], rf_performance["RMSE"], gbm_performance["RMSE"], cubist_performance["RMSE"]),
Rsquared = c(cart_performance["Rsquared"], bagged_performance["Rsquared"], rf_performance["Rsquared"], gbm_performance["Rsquared"], cubist_performance["Rsquared"]),
MAE = c(cart_performance["MAE"], bagged_performance["MAE"], rf_performance["MAE"], gbm_performance["MAE"], cubist_performance["MAE"])
)
print(model_performance)
## Model RMSE Rsquared MAE
## 1 CART 1.2594245 0.5469529 1.0226858
## 2 Bagged Trees 0.7816314 0.7609045 0.6183692
## 3 Random Forest 0.8020656 0.7566018 0.6352124
## 4 Boosted Trees 0.9161200 0.7321844 0.6890313
## 5 Cubist 0.8429716 0.7325779 0.6207981
On the test set, the bagged trees model performs best, with the lowest RMSE and the lowest MAE. The Cubist model is a close second, with the second-lowest MAE and an RMSE comparable to the random forest, and it is the model examined further below.
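The table above reflects test-set performance only. To also compare resampling performance directly, one option (a sketch, not part of the original workflow) is to refit the caret-based models on a shared set of CV folds and compare them with resamples(); the bagged tree and random forest fits above were built outside of train(), so they are omitted here:
set.seed(12)
cv_folds <- createFolds(train_target, k = 10, returnTrain = TRUE)  # shared folds so resamples line up
shared_ctrl <- trainControl(method = "cv", index = cv_folds)
cv_models <- list(
  CART   = train(train_features, train_target, method = "rpart", tuneLength = 10, trControl = shared_ctrl),
  GBM    = train(train_features, train_target, method = "gbm", tuneGrid = gbm_grid, verbose = FALSE, trControl = shared_ctrl),
  Cubist = train(train_features, train_target, method = "cubist", trControl = shared_ctrl)
)
summary(resamples(cv_models))  # per-fold RMSE, R-squared, and MAE for each model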
cubist_importance <- varImp(cubist_model, scale = FALSE)
print(cubist_importance)
## cubist variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 45.5
## BiologicalMaterial06 27.5
## ManufacturingProcess17 27.0
## ManufacturingProcess09 21.0
## ManufacturingProcess04 16.5
## ManufacturingProcess13 14.5
## ManufacturingProcess31 13.5
## BiologicalMaterial12 12.5
## ManufacturingProcess39 11.5
## ManufacturingProcess28 11.5
## BiologicalMaterial03 10.5
## ManufacturingProcess33 10.5
## ManufacturingProcess29 10.5
## BiologicalMaterial02 9.0
## ManufacturingProcess25 8.5
## BiologicalMaterial05 8.5
## ManufacturingProcess26 7.0
## BiologicalMaterial11 7.0
## BiologicalMaterial01 6.5
## BiologicalMaterial08 6.0
plot(cubist_importance, top = 20, main = "Top 20 Important Variables - Cubist Model")
The importance plot above shows that ManufacturingProcess32 is by far the most important predictor, followed by BiologicalMaterial06 and ManufacturingProcess17. Manufacturing process variables dominate the list, accounting for 8 of the top 10 and 12 of the top 20 predictors.
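The top-20 count can be verified directly from the importance object (a small supplementary check, not in the original output):
# Tally Biological vs. Manufacturing predictors among the 20 most important Cubist variables
imp_sorted <- cubist_importance$importance[order(-cubist_importance$importance$Overall), , drop = FALSE]
top20_names <- rownames(head(imp_sorted, 20))
table(ifelse(grepl("^Manufacturing", top20_names), "Process", "Biological"))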
set.seed(1234)
linear_model <- train(train_features, train_target, method = "lm")
linear_importance <- varImp(linear_model, scale = FALSE)
set.seed(1234)
nonlinear_model <- train(train_features, train_target, method = "nnet", linout = TRUE, trace = FALSE)
nonlinear_importance <- varImp(nonlinear_model, scale = FALSE)
top10_cubist <- head(cubist_importance$importance[order(-cubist_importance$importance$Overall), , drop = FALSE], 10)
top10_linear <- head(linear_importance$importance[order(-linear_importance$importance$Overall), , drop = FALSE], 10)
top10_nonlinear <- head(nonlinear_importance$importance[order(-nonlinear_importance$importance$Overall), , drop = FALSE], 10)
print("Top 10 Predictors - Cubist Model:")
## [1] "Top 10 Predictors - Cubist Model:"
print(top10_cubist)
## Overall
## ManufacturingProcess32 45.5
## BiologicalMaterial06 27.5
## ManufacturingProcess17 27.0
## ManufacturingProcess09 21.0
## ManufacturingProcess04 16.5
## ManufacturingProcess13 14.5
## ManufacturingProcess31 13.5
## BiologicalMaterial12 12.5
## ManufacturingProcess39 11.5
## ManufacturingProcess28 11.5
print("Top 10 Predictors - Linear Model:")
## [1] "Top 10 Predictors - Linear Model:"
print(top10_linear)
## Overall
## ManufacturingProcess32 5.023733
## ManufacturingProcess33 3.652286
## ManufacturingProcess28 2.607675
## ManufacturingProcess45 2.096808
## BiologicalMaterial05 2.009901
## ManufacturingProcess37 1.926264
## ManufacturingProcess04 1.732094
## ManufacturingProcess43 1.704196
## ManufacturingProcess34 1.529390
## BiologicalMaterial11 1.514399
print("Top 10 Predictors - Nonlinear Model:")
## [1] "Top 10 Predictors - Nonlinear Model:"
print(top10_nonlinear)
## Overall
## BiologicalMaterial08 2.545897
## ManufacturingProcess02 2.507906
## ManufacturingProcess27 2.497457
## BiologicalMaterial06 2.412733
## ManufacturingProcess29 2.388649
## ManufacturingProcess05 2.377326
## ManufacturingProcess07 2.324988
## ManufacturingProcess45 2.324800
## BiologicalMaterial03 2.315194
## ManufacturingProcess12 2.299852
count_variable_type <- function(variable_names) {
bio_count <- sum(grepl("Biological", variable_names))
proc_count <- sum(grepl("Manufacturing", variable_names))
return(list(Biological = bio_count, Process = proc_count))
}
cubist_variable_counts <- count_variable_type(rownames(top10_cubist))
print("Cubist Model - Variable Counts:")
## [1] "Cubist Model - Variable Counts:"
print(cubist_variable_counts)
## $Biological
## [1] 2
##
## $Process
## [1] 8
linear_variable_counts <- count_variable_type(rownames(top10_linear))
print("Linear Model - Variable Counts:")
## [1] "Linear Model - Variable Counts:"
print(linear_variable_counts)
## $Biological
## [1] 2
##
## $Process
## [1] 8
nonlinear_variable_counts <- count_variable_type(rownames(top10_nonlinear))
print("Nonlinear Model - Variable Counts:")
## [1] "Nonlinear Model - Variable Counts:"
print(nonlinear_variable_counts)
## $Biological
## [1] 3
##
## $Process
## [1] 7
The comparison above further confirms that manufacturing process variables are the dominant factors. In particular, ManufacturingProcess32 is the single most important predictor, ranking first in two of the three models.
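Since ManufacturingProcess32 is so consistently dominant, a simple scatterplot of that predictor against Yield (a supplementary look, not part of the original analysis) makes the relationship concrete:
# Visualize the relationship between the top predictor and the response
plot(pharm_data$ManufacturingProcess32, pharm_data$Yield,
     xlab = "ManufacturingProcess32", ylab = "Yield",
     main = "Yield vs. ManufacturingProcess32")
abline(lm(Yield ~ ManufacturingProcess32, data = pharm_data), col = "red")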
The view provided by the single rpart tree below does offer additional insight into the predictors and their relationship with the yield. As expected, it confirms that ManufacturingProcess32 is the most important predictor, but it also highlights the roles of ManufacturingProcess06 and BiologicalMaterial12. Overall, the tree gives a clearer picture of how the remaining variables contribute to yield.
set.seed(5678)
index <- createDataPartition(pharm_data$Yield, p = .8, list = FALSE)
train_data <- pharm_data[index, ]
test_data <- pharm_data[-index, ]
set.seed(1234)
rpartTree <- rpart(Yield ~ ., data = train_data)
plot(as.party(rpartTree), ip_args = list(abbreviate = 4), gp = gpar(fontsize = 7))