Questions 8.1, 8.2, 8.3, and 8.7 from the Kuhn & Johnson book
library(caret)
library(corrplot)
library(Cubist)
library(dplyr)
library(data.table)
library(e1071)
library(earth)
library(fable)
library(forecast)
library(fpp2)
library(fpp3)
library(ggplot2)
library(GGally)
library(gbm)
library(knitr)
library(kernlab)
library(latex2exp)
library(mlbench)
library(MASS)
library(purrr)
library(party)
library(pls)
library(patchwork)
library(rpart)
library(rpart.plot)
library(RANN)
library(randomForest)
library(stats)
library(tsibble)
library(tidyr)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
# Variable Importance
# Convert importance scores to a data frame
rf_imp_df <- data.frame(Variables = rownames(rfImp1), Importance = rfImp1$Overall)
# Order by importance
rf_imp_df <- rf_imp_df[order(rf_imp_df$Importance, decreasing = TRUE), ]
# Plot
ggplot(rf_imp_df, aes(x = reorder(Variables, Importance), y = Importance)) +
geom_col(fill = "#228B22") +
coord_flip() +
labs(title = "Random Forest Variable Importance Score")
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No. The importance scores for V6-V10 are essentially zero while V1-V5 dominate. Because each split in a random forest tree is chosen to maximize the reduction in error, the uninformative predictors are rarely selected, so they contribute almost nothing to the aggregated importance and the model effectively concentrates on the predictors that carry real information.
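As a quick sanity check, the scores for the uninformative predictors can be compared against V1-V5 directly. This is a minimal sketch that reuses the rfImp1 object from the chunk above:
# Compare importance of the informative vs. uninformative predictors
# (assumes rfImp1 from the varImp() call above)
rfImp1[paste0("V", 1:5), "Overall", drop = FALSE]   # informative predictors
rfImp1[paste0("V", 6:10), "Overall", drop = FALSE]  # uninformative predictors
# The V6-V10 scores should be near zero relative to V1-V5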
# conditional inference trees (Pg. 182)
model3 <- cforest(y ~ ., data = simulated)
rfImp3 <- varimp(model3, conditional = TRUE)
# Convert to data frame and clean up names
rf_imp_df3 <- data.frame(Variables = names(rfImp3), Importance = as.numeric(rfImp3))
# Plot
ggplot(rf_imp_df3, aes(x = reorder(Variables, Importance), y = Importance)) +
geom_col(fill = "#228B22") +
coord_flip() +
labs(title = "Random Forest 3 Variable Importance Score")
Yes, the conditional inference forest shows the same overall pattern as the traditional random forest: the informative predictors V1-V5 receive high importance while V6-V10 sit near zero. Because the conditional importance adjusts for correlations among predictors, the magnitudes differ somewhat, but the ranking is essentially the same.
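To see how much the correlation adjustment matters, here is a minimal sketch that reuses model3 and rfImp3 from above and compares the conditional and unconditional cforest importances side by side:
# Conditional vs. unconditional importance for the cforest model
# (assumes model3 and rfImp3 from the chunk above)
rfImp3_uncond <- varimp(model3, conditional = FALSE)
data.frame(Variable      = names(rfImp3),
           Conditional   = as.numeric(rfImp3),
           Unconditional = as.numeric(rfImp3_uncond[names(rfImp3)]))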
model_cubist <- cubist(simulated[, colnames(simulated)[colnames(simulated) != 'y']], simulated$y)
cubist_imp <- varImp(model_cubist, scale = FALSE)
cubist_imp %>%
mutate (var = rownames(cubist_imp)) %>%
ggplot(aes(Overall, reorder(var, Overall, sum))) +
geom_col(fill = '#228B22') +
labs(title = 'Cubist Variable Importance Score ' )
As expected, Cubist also drops the uninformative predictors: V6-V10 receive no importance, and the scores of the selected predictors are roughly comparable. Because Cubist fits a linear model in each rule's terminal node, its importance measure counts how often each variable is used in the rule conditions and in those linear models, aggregated across all rules and committee members and then scaled. In this case only these four predictors registered meaningful usage.
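The usage counts that drive those scores can be inspected on the fitted object itself. A minimal sketch, assuming the model_cubist object from above (the Cubist package stores per-variable usage percentages that varImp aggregates):
# Inspect how often each predictor appears in rule conditions and in the
# terminal linear models (assumes model_cubist from the chunk above)
summary(model_cubist)   # prints the rules and their linear models
model_cubist$usage      # percent usage in conditions vs. in the linear models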
####################################
set.seed(12345)
simulated2 <- mlbench.friedman2(150, sd = 2)
simulated2 <- cbind(simulated2$x, simulated2$y)
simulated2 <- as.data.frame(simulated2)
colnames(simulated2)[ncol(simulated2)] <- "y"
cor(simulated2)
## V1 V2 V3 V4 y
## V1 1.00000000 -0.00444964 0.09962107 -0.17208090 0.08398245
## V2 -0.00444964 1.00000000 -0.11637790 -0.05896942 0.57424051
## V3 0.09962107 -0.11637790 1.00000000 -0.01241795 0.65472592
## V4 -0.17208090 -0.05896942 -0.01241795 1.00000000 -0.08072369
## y 0.08398245 0.57424051 0.65472592 -0.08072369 1.00000000
# Fit the model
model4 <- cforest(y ~ ., data = simulated2)
# Get variable importance
rfImp4 <- varimp(model4, conditional = FALSE) %>% as.data.frame()
# VIS Plot
rfImp4 %>%
rename(Overall = '.') %>%
mutate (var = rownames(rfImp4)) %>%
ggplot(aes(Overall, reorder(var, Overall, sum))) +
geom_col(fill = '#228B22') +
labs(title = 'Newly Simulated Variable Importance' )
####################################
# Random
V1 <- rnorm(200, 2, 10)
V2 <- V1 + rnorm(200, 0, 2)
V3 <- V2 * runif(200, 0.75, 1.25)
y <- V1 + V2
random_df <- data.frame(V1, V2, V3, y)
cor(random_df)
## V1 V2 V3 y
## V1 1.0000000 0.9831294 0.9757395 0.9956457
## V2 0.9831294 1.0000000 0.9902429 0.9958993
## V3 0.9757395 0.9902429 1.0000000 0.9872716
## y 0.9956457 0.9958993 0.9872716 1.0000000
# Fit the model
controls <- cforest_unbiased(mtry = 2, ntree = 500)
model5 <- cforest(y ~ ., data = random_df, controls = controls)
# Get variable importance
rfImp5 <- varimp(model5, conditional = FALSE) %>% as.data.frame()
# Rename column
rfImp5 <- rfImp5 %>%
rename(Importance = ".") %>%
mutate(Variable = rownames(rfImp5))
# VIS Plot
ggplot(rfImp5, aes(x = Importance, y = reorder(Variable, Importance))) +
geom_col(fill = '#228B22') +
labs(title = 'Newly Simulated Variable Importance 2') +
theme_minimal()
The bagging fraction and learning rate are both tuning parameters with a strong effect on the model. With a bagging fraction of 0.1, each tree trains on only 10% of the data, so each tree is fit to a small random subset and the variance across trees increases. As the fraction is raised, the trees see most of the data, the ensemble begins to overfit, and the importance concentrates on the strongest predictors. The learning rate controls how much each tree contributes to the final model, so a high rate means each tree's contribution is weighted more heavily and the model commits quickly to the dominant predictors. With both parameters set high, the model identifies numCarbon, MolWeight, and the SurfaceArea predictors as by far the most important.
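As a rough illustration of that contrast, the following hedged sketch fits two small gbm models with the two extreme settings on the solubility data from the AppliedPredictiveModeling package (the data behind the figure in this exercise); the exact numbers will differ from the book's tuned models:
# Hedged sketch: contrast low vs. high bagging fraction / learning rate
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
sol_df <- data.frame(solTrainXtrans, Solubility = solTrainY)
set.seed(100)
gbm_low <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbm_high <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
# The high-parameter fit should concentrate relative influence on far fewer predictors
head(summary(gbm_low, plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)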
The left model is likely the more predictive of the two. The low bagging fraction gives each tree a different small random slice of the data, and the low learning rate spreads the contribution across many trees, so importance is distributed over more predictors and the ensemble is less tied to the idiosyncrasies of the training set. As a result, it should generalize better to new observations.
Interaction depth sets the maximum depth of each tree: the deeper the trees, the more variable interactions they can capture and the more complex the model becomes. This affects which predictors each tree uses and how much each tree adds to the ensemble. In the left model, with 0.1 for both parameters, the fit is conservative: the top predictors stay important, but the shallow trees built on small subsets bring in a wider variety of other variables, so the importance slope is relatively gradual and the weaker predictors still receive some weight. In the right model, with the higher settings and deeper trees, the high learning rate makes the model lock onto the important predictors quickly, so the importance slope is steeper: the crucial predictors score very high and the unimportant ones score near zero.
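To see the effect of interaction depth on the importance profile, a similar hedged sketch rebuilds the same solubility data frame and compares a shallow and a deep boosted model; the depths of 1 and 10 and the other settings are arbitrary choices for illustration:
# Hedged sketch: effect of interaction.depth on how quickly importance drops off
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
sol_df <- data.frame(solTrainXtrans, Solubility = solTrainY)
set.seed(100)
gbm_shallow <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
                   n.trees = 100, shrinkage = 0.1, bag.fraction = 0.5,
                   interaction.depth = 1)
set.seed(100)
gbm_deep <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.5,
                interaction.depth = 10)
# Compare the slope of relative influence across the ranked predictors
head(summary(gbm_shallow, plotit = FALSE), 10)
head(summary(gbm_deep, plotit = FALSE), 10)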
Load chemical data
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
################################################################################
# Impute missing values
CMP_impute <- ChemicalManufacturingProcess %>%
preProcess("knnImpute")
CMP_complete <- predict(CMP_impute, ChemicalManufacturingProcess)
################################################################################
# Remove near zero Variance predictors
CMP_complete <- CMP_complete[, -nearZeroVar(CMP_complete)]
################################################################################
# Train Test Split
set.seed(12345)
index <- createDataPartition(CMP_complete$Yield, p = .8, list = FALSE)
# Train
train_predictors <- CMP_complete[index, -1]
train_yield <- CMP_complete[index, 1]
# Test
test_predictors <- CMP_complete[-index, -1]
test_yield <- CMP_complete[-index, 1]
Random Forest
set.seed(12345)
rf_model <- randomForest(x = train_predictors,
y = train_yield,
importance = TRUE,
ntree = 500)
plot(rf_model)
################################################################################
# Variable Importance
varImpPlot(rf_model, main = "Random Forest Variable Importance")
################################################################################
# Predict on Test Set
rf_predictions <- predict(rf_model, test_predictors)
################################################################################
# Evaluate Performance
rf_results <- postResample(pred = rf_predictions, obs = test_yield)
print(rf_results)
## RMSE Rsquared MAE
## 0.5030486 0.7814039 0.4239256
Cubist
set.seed(12345)
cubist_model <- cubist(x = train_predictors,
y = train_yield,
# Number of committee models (boosting-like iterations)
committees = 10)
################################################################################
# Predict on Test Set
cubist_predictions <- predict(cubist_model, test_predictors)
################################################################################
# Evaluate Performance
cubist_results <- postResample(pred = cubist_predictions, obs = test_yield)
print(cubist_results)
## RMSE Rsquared MAE
## 0.7271022 0.5338601 0.5225497
################################################################################
# Variable Importance
cubist_imp <- varImp(cubist_model, scale = FALSE)
# Transform data for easier plotting
cubist_imp_df <- data.frame(Variable = rownames(cubist_imp), Importance = cubist_imp$Overall)
ggplot(cubist_imp_df, aes(x = reorder(Variable, Importance), y = Importance)) +
geom_col(fill = '#228B22') +
coord_flip() +
labs(title = "Cubist Variable Importance")
Regression Tree (CART)
set.seed(12345)
rpart_model <- rpart(train_yield ~ .,
data = train_predictors,
method = "anova",
control = rpart.control(minsplit = 20, cp = 0.01))
################################################################################
# Plot
rpart.plot(rpart_model, main = "Regression Tree", type = 3, extra = 1)
################################################################################
# Variable Importance
rpart_imp <- varImp(rpart_model, scale = FALSE)
# Convert to data frame for easier plotting
rpart_imp_df <- data.frame(Variable = rownames(rpart_imp),Importance = rpart_imp$Overall)
ggplot(rpart_imp_df, aes(x = reorder(Variable, Importance), y = Importance)) +
geom_col(fill = '#228B22') +
coord_flip() +
labs(title = "Regression Tree Variable Importance")
################################################################################
# Predict on Test Set
rpart_predictions <- predict(rpart_model, newdata = test_predictors)
################################################################################
# Evaluate Performance
rpart_results <- postResample(pred = rpart_predictions, obs = test_yield)
print(rpart_results)
## RMSE Rsquared MAE
## 0.7944217 0.4296448 0.6061872
Gradient Boosted Trees
set.seed(12345)
gbm_model <- gbm(
formula = train_yield ~ ., # Target variable ~ predictors
data = data.frame(train_yield, train_predictors), # Combine target and predictors
distribution = "gaussian", # Regression model (for numeric outcomes)
n.trees = 500, # Number of boosting iterations
interaction.depth = 4, # Maximum depth
shrinkage = 0.01, # Learning rate
cv.folds = 5, # Cross-validation folds
n.minobsinnode = 10 # Minimum number of observations in each leaf node
)
################################################################################
# Variable Importance
summary(gbm_model, cBars = 10, main = "Gradient Boosted Tree Variable Importance")
## var rel.inf
## ManufacturingProcess32 ManufacturingProcess32 26.88852835
## ManufacturingProcess13 ManufacturingProcess13 7.36140368
## ManufacturingProcess17 ManufacturingProcess17 6.58807977
## ManufacturingProcess09 ManufacturingProcess09 4.73966960
## ManufacturingProcess06 ManufacturingProcess06 4.68255796
## BiologicalMaterial12 BiologicalMaterial12 4.37093010
## BiologicalMaterial11 BiologicalMaterial11 3.33597023
## BiologicalMaterial03 BiologicalMaterial03 3.19574098
## ManufacturingProcess31 ManufacturingProcess31 2.29546401
## BiologicalMaterial05 BiologicalMaterial05 1.85146803
## BiologicalMaterial09 BiologicalMaterial09 1.81915688
## ManufacturingProcess20 ManufacturingProcess20 1.69770405
## ManufacturingProcess05 ManufacturingProcess05 1.65776671
## BiologicalMaterial06 BiologicalMaterial06 1.46363313
## BiologicalMaterial04 BiologicalMaterial04 1.38528019
## BiologicalMaterial08 BiologicalMaterial08 1.37260374
## ManufacturingProcess04 ManufacturingProcess04 1.33963576
## ManufacturingProcess01 ManufacturingProcess01 1.31274887
## ManufacturingProcess27 ManufacturingProcess27 1.26442216
## ManufacturingProcess19 ManufacturingProcess19 1.18391546
## ManufacturingProcess25 ManufacturingProcess25 1.08642647
## ManufacturingProcess39 ManufacturingProcess39 1.05584652
## ManufacturingProcess29 ManufacturingProcess29 0.94160760
## ManufacturingProcess18 ManufacturingProcess18 0.92274638
## BiologicalMaterial02 BiologicalMaterial02 0.91959488
## ManufacturingProcess15 ManufacturingProcess15 0.90234025
## ManufacturingProcess37 ManufacturingProcess37 0.88977417
## ManufacturingProcess11 ManufacturingProcess11 0.87523234
## ManufacturingProcess24 ManufacturingProcess24 0.87387221
## ManufacturingProcess28 ManufacturingProcess28 0.83042229
## ManufacturingProcess43 ManufacturingProcess43 0.78584359
## ManufacturingProcess30 ManufacturingProcess30 0.78557415
## ManufacturingProcess34 ManufacturingProcess34 0.77490011
## ManufacturingProcess26 ManufacturingProcess26 0.74368945
## ManufacturingProcess33 ManufacturingProcess33 0.63775581
## ManufacturingProcess21 ManufacturingProcess21 0.62688217
## ManufacturingProcess14 ManufacturingProcess14 0.61900641
## ManufacturingProcess02 ManufacturingProcess02 0.60488709
## ManufacturingProcess22 ManufacturingProcess22 0.53547902
## BiologicalMaterial10 BiologicalMaterial10 0.53437187
## BiologicalMaterial01 BiologicalMaterial01 0.53057577
## ManufacturingProcess35 ManufacturingProcess35 0.49067058
## ManufacturingProcess36 ManufacturingProcess36 0.48798781
## ManufacturingProcess03 ManufacturingProcess03 0.47316357
## ManufacturingProcess16 ManufacturingProcess16 0.38967425
## ManufacturingProcess23 ManufacturingProcess23 0.36786401
## ManufacturingProcess45 ManufacturingProcess45 0.35306249
## ManufacturingProcess07 ManufacturingProcess07 0.30702903
## ManufacturingProcess10 ManufacturingProcess10 0.22333866
## ManufacturingProcess44 ManufacturingProcess44 0.17306208
## ManufacturingProcess42 ManufacturingProcess42 0.16807405
## ManufacturingProcess08 ManufacturingProcess08 0.11183145
## ManufacturingProcess38 ManufacturingProcess38 0.10941490
## ManufacturingProcess40 ManufacturingProcess40 0.06131891
## ManufacturingProcess12 ManufacturingProcess12 0.00000000
## ManufacturingProcess41 ManufacturingProcess41 0.00000000
################################################################################
# Predict on Test Set
gbm_predictions <- predict(gbm_model, newdata = test_predictors, n.trees = gbm_model$n.trees)
################################################################################
# Evaluate Performance
gbm_results <- postResample(pred = gbm_predictions, obs = test_yield)
print(gbm_results)
## RMSE Rsquared MAE
## 0.5037441 0.7513914 0.3915591
See results
Results <- data.frame(
Model = c("Random Forest", "Cubist", "Regression Tree (CART)", "Gradient Boosted Trees"),
RMSE = c(rf_results["RMSE"], cubist_results["RMSE"], rpart_results["RMSE"], gbm_results["RMSE"]),
MAE = c(rf_results["MAE"], cubist_results["MAE"], rpart_results["MAE"], gbm_results["MAE"]),
Rsquared = c(rf_results["Rsquared"], cubist_results["Rsquared"], rpart_results["Rsquared"], gbm_results["Rsquared"])
)
print(Results)
## Model RMSE MAE Rsquared
## 1 Random Forest 0.5030486 0.4239256 0.7814039
## 2 Cubist 0.7271022 0.5225497 0.5338601
## 3 Regression Tree (CART) 0.7944217 0.6061872 0.4296448
## 4 Gradient Boosted Trees 0.5037441 0.3915591 0.7513914
The optimal model is the Random Forest, followed closely by the Gradient Boosted Trees: it has the lowest RMSE (0.5030486) and the highest R^2 (0.7814039), slightly ahead of the gradient boosted model's RMSE of 0.5037441 and R^2 of 0.7513914.
# Variable Importance
varImpPlot(rf_model, main = "Random Forest Variable Importance")
The most important predictors are MP32 (ManufacturingProcess32), MP13, BM03 (BiologicalMaterial03), BM11, MP17, MP09, BM12, BM06, and so on. This is very similar to the variable importance ranking obtained with the optimal SVM model in the prior assignment. The random forest importance is skewed toward the manufacturing process predictors, but eight biological material predictors still appear in the top 18, albeit with lower importance.
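To put numbers on that ranking, here is a minimal sketch that pulls the importance scores from the fitted rf_model above into a sorted table instead of reading them off the plot:
# Top predictors by %IncMSE from the fitted random forest
# (assumes rf_model from the Random Forest section above)
rf_imp_mat <- importance(rf_model)
rf_top <- rf_imp_mat[order(rf_imp_mat[, "%IncMSE"], decreasing = TRUE), ]
head(rf_top, 10)
# Count how many of the top 10 are biological vs. manufacturing predictors
table(grepl("^Biological", rownames(head(rf_top, 10))))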
Prune to find optimal tree
library(rattle)
library(rpart.plot)
# Find optimal complexity parameter
best_cp <- rpart_model$cptable[which.min(rpart_model$cptable[,"xerror"]), "CP"]
# Prune the tree
optimal_tree <- prune(rpart_model, cp = best_cp)
################################################################################
# Plot optimal tree
fancyRpartPlot(optimal_tree,
sub = "Optimal Regression Tree",
palettes = c("Greens"),
caption = NULL)
# Save fancyRpartPlot to PNG
png("optimal_tree_fancy.png", width = 1200, height = 900)
fancyRpartPlot(optimal_tree, main = "Optimal Regression Tree with Yield Distribution")
dev.off()
## png
## 2
The most important predictor is MP32 (ManufacturingProcess32) at the root, followed by BM11 and MP17 in the second-level nodes; the third-level splits use BM05, MP25, and MP09, with less important predictors appearing further down. The tree visualization shows the threshold at which each predictor is split (whether an observation falls above or below a specific value) as well as the distribution of Yield values in each node.
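As a complementary view, the pruned tree's splits can also be written out as plain-text rules, which makes the thresholds easier to read than the diagram. This is a minimal sketch using rpart.plot's rule printer on the optimal_tree object above (note the values are on the centered and scaled knnImpute scale):
# Print the pruned tree's splits as readable rules
# (assumes optimal_tree from the pruning chunk above)
library(rpart.plot)
rpart.rules(optimal_tree, roundint = FALSE)
# Each row shows the predicted Yield and the chain of predictor thresholds
# leading to that terminal node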