8.1

A

# Libraries used in this exercise
library(mlbench)        # mlbench.friedman1()
library(randomForest)   # randomForest()
library(caret)          # varImp(), train()
library(dplyr)          # arrange()

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

model1 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

rfImp1 |>
  arrange(desc(Overall))
##          Overall
## V1   8.732235404
## V4   7.615118809
## V2   6.415369387
## V5   2.023524577
## V3   0.763591825
## V6   0.165111172
## V7  -0.005961659
## V10 -0.074944788
## V9  -0.095292651
## V8  -0.166362581


The random forest model did not significantly use the uninformative predictors (V6–V10), as evidenced by their low or negative importance scores. Specifically, V6 has a very low importance score (0.17), while V7, V8, V9, and V10 have negative values, indicating they may not contribute meaningful information and could even detract from the model’s predictive power. 
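As a sanity check, the same unscaled permutation importance should be available directly from the randomForest object (a small sketch; `type = 1` requests the permutation-based measure that varImp() reports for regression forests):

importance(model1, type = 1, scale = FALSE)  # should match the varImp() values above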

B

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
model2 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

rfImp2 |>
  arrange(desc(Overall))
##                Overall
## V4          7.04752238
## V2          6.06896061
## V1          5.69119973
## duplicate1  4.28331581
## V5          1.87238438
## V3          0.62970218
## V6          0.13569065
## V10         0.02894814
## V9          0.00840438
## V7         -0.01345645
## V8         -0.04370565

Adding duplicate1, which is highly correlated with V1 (r ≈ 0.95), slightly reduces V1’s importance, since the predictive influence is now shared between the two. However, the model still relies primarily on the top predictors (V4, V2, V1, and duplicate1), and the uninformative predictors (V6–V10) continue to have low or negative importance.

set.seed(250)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9406686
model3 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)

rfImp3 |>
  arrange(desc(Overall))
##                 Overall
## V4          7.550747425
## V2          6.782146292
## V1          4.869421064
## duplicate1  3.528919211
## V5          1.843119536
## duplicate2  1.814699771
## V3          0.503819790
## V6          0.088674783
## V10         0.035235767
## V9          0.017386769
## V7         -0.002709714
## V8         -0.056839681

As we add more predictors that are highly correlated with V1, the importance weight of V1 is gradually reduced and distributed among the new, correlated predictors. This distribution occurs because random forests tend to allocate importance across similar variables, diluting the contribution of any single predictor when multiple predictors provide overlapping information.

While V1 remains influential, its unique predictive power is diminished as other highly correlated predictors absorb some of its importance.
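To make the dilution concrete, V1’s unscaled importance can be pulled from the three fits side by side (a small sketch using the objects above):

# V1's permutation importance shrinks as correlated copies are added
data.frame(
  fit = c("no duplicates", "one duplicate", "two duplicates"),
  V1_importance = c(rfImp1["V1", "Overall"],
                    rfImp2["V1", "Overall"],
                    rfImp3["V1", "Overall"])
)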

C

library(party)   # cforest(), varimp()
modelcf <- cforest(y ~ ., data = simulated)
# Rank predictors by conditional importance (returns column positions)
order(-varimp(modelcf, conditional = TRUE))
##  [1]  4  2  1 11  5 12  3  9  6  7 10  8

The stability of V4, V2, and V1 as top predictors in both traditional and conditional importance analyses suggests they have strong, independent predictive power. Their consistent ranking indicates they provide unique information to the model, unaffected by redundancy with other variables.
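Because order() only returns column positions, here is a small sketch that keeps the predictor names attached (varimp() returns a named vector, so sorting preserves the names; the values vary slightly from run to run):

# Conditional importance with predictor names rather than column positions
cf_imp <- varimp(modelcf, conditional = TRUE)
sort(cf_imp, decreasing = TRUE)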

D

library(Cubist)
library(gbm)

# Cubist on the original ten predictors (columns 1:11 = V1-V10 plus y)
cubistTuned <- train(y ~ ., data = simulated[1:11], method = "cubist")
modelcubist <- varImp(cubistTuned$finalModel, scale = FALSE)
# Boosted trees on the full data, including the two duplicate predictors
model_boostedTrees <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary.gbm(model_boostedTrees)

##                   var    rel.inf
## V4                 V4 29.2271590
## V2                 V2 24.2145179
## V1                 V1 13.1968498
## V5                 V5 11.2538797
## duplicate1 duplicate1  9.2039472
## V3                 V3  7.5523877
## duplicate2 duplicate2  4.8990389
## V6                 V6  0.4522197
## V7                 V7  0.0000000
## V8                 V8  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000
cat("Cubist Variables of Importance\n")
## Cubist Variables of Importance
modelcubist |>
  arrange(desc(Overall)) |>
  head(10) |>
  print()
##     Overall
## V1     72.0
## V2     54.5
## V4     49.0
## V3     42.0
## V5     40.0
## V6     11.0
## V7      0.0
## V8      0.0
## V9      0.0
## V10     0.0

No, the pattern differs in both models: Cubist ranks V1 highest, while the boosted-tree (GBM) model ranks V4 and V2 well above V1 and also spreads some importance onto the duplicate predictors (which were excluded from the Cubist fit). The uninformative predictors (V6–V10), however, receive little or no importance in either model.
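To put the two rankings side by side, a sketch that merges the importance tables computed above (summary.gbm() with plotit = FALSE returns the table without redrawing the plot; predictors absent from the Cubist fit show up as NA):

# Merge GBM and Cubist importance for a direct comparison
gbm_imp <- summary.gbm(model_boostedTrees, plotit = FALSE)
cubist_imp <- data.frame(var = rownames(modelcubist), cubist = modelcubist$Overall)
merge(gbm_imp, cubist_imp, by = "var", all.x = TRUE) |>
  arrange(desc(rel.inf))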


8.2

library(rpart)      # rpart()
library(partykit)   # as.party()
library(grid)       # gpar()

# Simulate five predictors on [0, 1] with increasing granularity
# (10, 100, ..., 100000 distinct values); each contributes equally to the target
feature_small <- sample(1:10 / 10, 500, replace = TRUE)
feature_medium <- sample(1:100 / 100, 500, replace = TRUE)
feature_large <- sample(1:1000 / 1000, 500, replace = TRUE)
feature_xlarge <- sample(1:10000 / 10000, 500, replace = TRUE)
feature_xxlarge <- sample(1:100000 / 100000, 500, replace = TRUE)

target <- feature_small + feature_medium + feature_large + feature_xlarge + feature_xxlarge

simData <- data.frame(
  feature_small,
  feature_medium,
  feature_large,
  feature_xlarge,
  feature_xxlarge,
  target
)

# Build and plot the decision tree model
tree_model <- rpart(target ~ ., data = simData)
plot(as.party(tree_model), gp = gpar(fontsize = 7))

The tree clearly favors the high-granularity features, feature_xxlarge and feature_xlarge, placing them at the top of the tree even though every feature contributes equally to the target. Lower-granularity features such as feature_small appear less often and only deeper in the tree. This illustrates how predictors with more distinct values tend to dominate the splits, which can skew the model’s interpretability and performance.
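A quick way to quantify this from the fitted object (a small sketch): count how often each feature is chosen for a split and inspect rpart’s built-in importance measure.

# How often is each feature chosen for a split?
splits <- as.character(tree_model$frame$var)
table(splits[splits != "<leaf>"])

# rpart's built-in variable importance tells a similar story
tree_model$variable.importance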

8.3

A

The model on the right, with high bagging fraction and learning rate (0.9), converges quickly and focuses on a few dominant predictors due to reduced randomness and larger updates. In contrast, the model on the left, with both parameters set to low values (0.1), spreads importance across more predictors by making smaller updates and incorporating more randomness.
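The figure in the text is built on the solubility data, which is not loaded here; purely as an illustrative sketch, the same contrast in bagging fraction and learning rate can be reproduced on the simulated data from Exercise 8.1:

# Left panel: small learning rate and bagging fraction
gbm_left <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1,
                n.minobsinnode = 5)

# Right panel: large learning rate and bagging fraction
gbm_right <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9)

# Compare how concentrated the importance is in each fit
summary(gbm_left, plotit = FALSE)   # importance spread across more predictors
summary(gbm_right, plotit = FALSE)  # importance concentrated in a few predictors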

B

The model on the left, with lower parameter values, is likely more predictive on new samples as it prevents overfitting by considering a broader range of predictors. The model on the right may overfit by focusing too heavily on a few predictors, reducing its generalizability.

C

Interaction depth controls the maximum number of splits in each tree, enabling the model to capture complex relationships between predictors. As interaction depth increases, the importance of dominant predictors grows, leading to a steeper importance slope.

In our two models, increasing interaction depth would amplify the importance of top predictors in both. For the model on the right (with high bagging fraction and learning rate), this would further emphasize dominant predictors, increasing the risk of overfitting. In contrast, the model on the left (with lower parameter values) would maintain a more balanced spread of importance across predictors, even with a steeper slope.
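Continuing the sketch above, refitting the right-hand settings with deeper trees illustrates the steeper importance profile (again purely illustrative):

# Right-panel settings refit with deeper trees
gbm_right_deep <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                      n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9,
                      interaction.depth = 10)
summary(gbm_right_deep, plotit = FALSE)  # top predictors dominate even more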

8.7

A

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")

# Impute missing values with k-nearest neighbors
# (caret also centers and scales the data as part of knn imputation)
ccc <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
Chemical <- predict(ccc, ChemicalManufacturingProcess)

# 70/30 train/test split on the yield
index <- createDataPartition(Chemical$Yield, p = .7, list = FALSE)

# Drop near-zero-variance predictors
Chemical <- Chemical[, -nearZeroVar(Chemical)]
dim(Chemical)
## [1] 176  57
# Train and test sets
train_chem <- Chemical[index, ]
test_chem <- Chemical[-index, ]

# Bootstrap resampling (25 reps) for model tuning
ctrl <- trainControl(method = "boot", number = 25)

rpartGrid <- expand.grid(maxdepth = seq(1, 10, by = 1))

# Train the decision tree model with caret
rpartChemTune <- train(
  Yield ~ ., data = train_chem, 
  method = "rpart2",
  metric = "Rsquared",
  tuneGrid = rpartGrid,
  trControl = ctrl
)

# Decision tree with default cp tuning (used in the comparison below)
tree_model <- train(
  Yield ~ ., data = train_chem, method = "rpart",
  trControl = ctrl, metric = "RMSE"
)

# Random Forest with bootstrap resampling
rf_model <- train(
  Yield ~ ., data = train_chem, method = "rf",
  trControl = ctrl, metric = "RMSE", importance = TRUE, ntree = 500
)

# Gradient Boosting with bootstrap resampling (single tuning point)
gbm_model <- train(
  Yield ~ ., data = train_chem, method = "gbm",
  trControl = ctrl, metric = "RMSE", verbose = FALSE,
  tuneGrid = expand.grid(n.trees = 100, interaction.depth = 3, shrinkage = 0.1, n.minobsinnode = 10)
)

# Cubist model (caret's default resampling and tuning grid)
cubist_model <- train(
  Yield ~ ., 
  data = train_chem, 
  method = "cubist"
)
# Make predictions on the test set
tree_preds <- predict(tree_model, newdata = test_chem)
rf_preds <- predict(rf_model, newdata = test_chem)
gbm_preds <- predict(gbm_model, newdata = test_chem)
cubist_preds <- predict(cubist_model, newdata = test_chem)

# Calculate RMSE and R-squared for each model using postResample
tree_perf <- postResample(tree_preds, test_chem$Yield)
rf_perf <- postResample(rf_preds, test_chem$Yield)
gbm_perf <- postResample(gbm_preds, test_chem$Yield)
cubist_perf <- postResample(cubist_preds, test_chem$Yield)
final_results <- data.frame(
  Model = c("Decision Tree", "Random Forest", "Gradient Boosting", "Cubist"),
  RMSE = c(tree_perf["RMSE"], rf_perf["RMSE"], gbm_perf["RMSE"], cubist_perf["RMSE"]),
  R_squared = c(tree_perf["Rsquared"], rf_perf["Rsquared"], gbm_perf["Rsquared"], cubist_perf["Rsquared"])
)

print(final_results)
##               Model      RMSE R_squared
## 1     Decision Tree 0.8590615 0.2553974
## 2     Random Forest 0.6615125 0.5379487
## 3 Gradient Boosting 0.5876138 0.6396685
## 4            Cubist 0.5266000 0.7081406
The Cubist model performs best here by both RMSE and R-squared.
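The tuning parameters caret selected for the winning Cubist model can be pulled from the train object (a small sketch):

# Committees / neighbors chosen during resampling
cubist_model$bestTune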

B

# Extract and plot variable importance for the Cubist model

cubistmod <- varImp(cubist_model$finalModel, scale = FALSE)
plot(varImp(cubist_model), top = 10)

The model is dominated by manufacturing predictors, with ManufacturingProcess32 the most important, followed by BiologicalMaterial02 and ManufacturingProcess17. The Cubist model’s top predictors overlap with the important variables from both the linear and nonlinear models, indicating that it leverages information from each to achieve an improved fit.
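The same top-ten list can also be printed as a table of the unscaled importance scores (a sketch mirroring the pattern used in 8.1):

# Top ten predictors by unscaled Cubist importance
cubistmod |>
  arrange(desc(Overall)) |>
  head(10)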

C