All exercises can be found in “Applied Predictive Modeling” by Max Kuhn and Kjell Johnson.
The following packages are required to rerun this .rmd
file:
seed_num <- 200
library(tidyverse)
library(AppliedPredictiveModeling)
library(mlbench)
library(caret)
library(doParallel)
library(kernlab)
library(earth)
The cell below sets up a parallel backend to speed up the process of training the models used in this assignment.
# Detect the number of available cores
cores <- parallel::detectCores()
# Create a cluster
cl <- makeCluster(cores - 1) # Use one less core to avoid overloading
# Register the cluster for parallel processing
registerDoParallel(cl)
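# (Note: the cluster should be released with stopCluster(cl) once all models
# have been trained; see the cleanup step at the end of this file.)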
Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
\[ y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2) \]
where the \(x\) values are random variables uniformly distributed on [0, 1] (the simulation also creates five additional non-informative predictors). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
set.seed(seed_num)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a very large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Tune several models on this data. For example:
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = 'knn',
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
Which models appear to give the best performance? Does MARS select
the informative predictors (those named
X1-X5)?
Since the exercise description already implements and evaluates a k-Nearest Neighbors (KNN) model, its performance can serve as a benchmark for the more complex non-linear regression models below. First, the cell below tunes an averaged neural network by comparing models with different hidden-layer sizes and weight-decay values:
set.seed(seed_num)
train_control <- trainControl(method = "cv", number=5, allowParallel = TRUE)
nnet_grid <- expand.grid(.decay = c(0, 0.01, 0.1), # weight decay (regularization)
.size = c(1:10), # size of hidden layer
.bag = FALSE)
nnet <- train(trainingData$x, trainingData$y,
method = 'avNNet',
tuneGrid = nnet_grid,
trControl = train_control,
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
maxit = 500)
# Info on setting MaxNWts: the MaxNWts argument specifies the maximum
# allowable number of weights in the neural network model.
# MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1
# 10: the maximum number of hidden units in the tuning grid
# ncol(trainingData$x): the number of input features
# +1: the bias term feeding the hidden layer
# +10: the connections from the hidden layer to the output unit
# +1: the bias term on the output unit
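# Quick sanity check of the weight-count arithmetic (illustrative addition):
# a single hidden layer with `size` units and p inputs uses
# size * (p + 1) + size + 1 weights in total.
n_weights <- function(p, size) size * (p + 1) + size + 1
n_weights(p = ncol(trainingData$x), size = 10) # 121, matching MaxNWts above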
nnet
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 160, 160, 160, 160, 160
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.688639 0.7433601 2.106279
## 0.00 2 2.471951 0.7529743 1.944502
## 0.00 3 2.509642 0.7473229 1.954774
## 0.00 4 2.185160 0.8063577 1.715441
## 0.00 5 2.294365 0.7825379 1.800988
## 0.00 6 2.403238 0.7678322 1.862642
## 0.00 7 2.300114 0.7958077 1.871598
## 0.00 8 2.409539 0.7675302 1.886257
## 0.00 9 2.652394 0.7209181 2.108428
## 0.00 10 2.610223 0.7307329 2.088606
## 0.01 1 2.453946 0.7563625 1.899324
## 0.01 2 2.460287 0.7540060 1.939264
## 0.01 3 2.384403 0.7722541 1.863563
## 0.01 4 2.076132 0.8169480 1.643802
## 0.01 5 2.266288 0.7869724 1.776219
## 0.01 6 2.318958 0.7791690 1.809970
## 0.01 7 2.276704 0.7890390 1.840748
## 0.01 8 2.379298 0.7728557 1.920556
## 0.01 9 2.623797 0.7365057 2.076443
## 0.01 10 2.605356 0.7322066 2.039109
## 0.10 1 2.463917 0.7555615 1.906016
## 0.10 2 2.503459 0.7484796 1.918135
## 0.10 3 2.431379 0.7602079 1.870222
## 0.10 4 2.154322 0.8092339 1.655268
## 0.10 5 2.055931 0.8214570 1.596998
## 0.10 6 2.270502 0.7917765 1.803107
## 0.10 7 2.329103 0.7745899 1.812457
## 0.10 8 2.228139 0.8024162 1.784956
## 0.10 9 2.522098 0.7495327 1.941349
## 0.10 10 2.354224 0.7649269 1.888273
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.1 and bag = FALSE.
The summary above reveals that a neural network with 5 nodes in its hidden layer and a weight decay of 0.1 achieved the best performance. The cell below also plots the tuning results, showing visually that this model achieved the lowest RMSE:
plot(nnet)
The cell below uses the tuned neural network to predict the test set and evaluates its performance:
nnetPred <- predict(nnet, newdata = testData$x)
postResample(pred = nnetPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.2479456 0.8015616 1.6970604
The process outlined above for the neural network is repeated below using a Support Vector Machine (SVM) regression model. Implementing an SVM requires choosing a kernel to map the features into a higher-dimensional space, and the choice of kernel determines which parameters can be tuned. In this case, we use a radial basis function (RBF) kernel and evaluate models over a grid of \(C\) and \(\sigma\) values.
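For reference, kernlab (the engine behind caret's svmRadial method) parameterizes the radial basis kernel as
\[ K(x, x') = \exp\left(-\sigma \lVert x - x' \rVert^2\right) \]
so larger values of \(\sigma\) produce a more localized, more flexible fit, while \(C\) controls the penalty applied to training errors. The tuning grid and training call are below: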
set.seed(seed_num)
svm_grid <- expand.grid(
sigma = c(0.01, 0.1, 1), # radial kernel parameter
C = c(0.1, 1, 10) # cost parameter
)
svm <- train(trainingData$x, trainingData$y,
method = 'svmRadial',
preProc = c("center", "scale"),
tuneGrid = svm_grid,
trControl = train_control)
svm
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 160, 160, 160, 160, 160
## Resampling results across tuning parameters:
##
## sigma C RMSE Rsquared MAE
## 0.01 0.1 4.137241 0.7307907 3.371430
## 0.01 1.0 2.491346 0.7630923 2.006860
## 0.01 10.0 2.244822 0.7914765 1.751592
## 0.10 0.1 3.498956 0.7576037 2.811992
## 0.10 1.0 2.257224 0.7979240 1.748175
## 0.10 10.0 2.189555 0.8021775 1.726239
## 1.00 0.1 4.913436 0.1469356 4.032097
## 1.00 1.0 4.845679 0.1619464 3.974399
## 1.00 10.0 4.826472 0.1959491 3.960697
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.1 and C = 10.
plot(svm)
In this case, the best SVM model had a \(C\) (cost) value of 10 and a \(\sigma\) value of 0.1. Applying this model to the test set results in the following performance:
svmPred <- predict(svm, newdata = testData$x)
postResample(pred = svmPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.2076835 0.8062778 1.6784911
The tuned neural network and SVM models both achieved test-set \(R^2\) values of approximately 0.80, a considerable improvement over the tuned KNN model from the exercise description. This is reassuring, given that these models (especially the neural network) are more mathematically, and by extension more computationally, complex than KNN. That said, KNN still offers reasonable predictions quickly and serves as a strong benchmark for evaluating these more complex models.
Lastly, the cell below fits a Multivariate Adaptive Regression Splines (MARS) model. While MARS models can be used to make predictions, they are also useful in that they can identify what they believe are the most informative predictors. This latter ability is what is tested below:
mars <- earth(trainingData$x, trainingData$y)
mars
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556 RSS 397.9654 GRSq 0.8968524 RSq 0.9183982
A MARS model only retains the predictors it deems most informative, which, according to the output above, are X1-X6. Based on the formula used to simulate the data, only X1-X5 are truly informative; although X6 is technically retained by this MARS model, it received the lowest importance score among the selected predictors. We can therefore conclude that MARS is quite effective at identifying the informative predictors.
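As a quick supplementary check (not part of the original output), the earth package's evimp function can rank the predictors the MARS model retained:
# Rank the retained predictors by the nsubsets, GCV, and RSS criteria
evimp(mars)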
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several non-linear regression models.
Which non-linear regression model gives the optimal re-sampling and test set performance?
First, the cell below recreates the dataset used when solving Exercise 6.3 in HW7:
set.seed(19)
# pull in data
data(ChemicalManufacturingProcess)
# impute missing values with the median of each column
imputation <- ChemicalManufacturingProcess %>%
preProcess("medianImpute")
df <- predict(imputation, ChemicalManufacturingProcess)
# split into training and testing data
train_ratio <- 0.7
n <- nrow(df)
n_train <- round(train_ratio * n)
train_ind <- sample(1:n, n_train)
train <- data.frame(df[train_ind,])
test <- data.frame(df[-train_ind,])
X_train <- select(train, -Yield)
y_train <- train$Yield
X_test <- select(test, -Yield)
y_test <- test$Yield
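As an aside, caret's createDataPartition function could be used in place of sample to produce a split stratified on the response; a minimal sketch (not used for the results that follow) is shown below:
# Alternative: stratified split on Yield (illustrative only)
strat_ind <- createDataPartition(df$Yield, p = train_ratio, list = FALSE)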
Next, the cell below trains tuned versions of KNN, SVM, and NN regression models and assesses their performance on the testing and training data:
set.seed(seed_num)
r2_vals <- data.frame(matrix(ncol=3))
colnames(r2_vals) <- c('type', 'model', 'r2')
# KNN Model -------------------------------------------------------------------
knn <- train(X_train, y_train,
method = 'knn',
preProc = c("center", "scale"),
tuneLength = 10)
knn_r2_resampling <- max(knn$results$Rsquared)
knnPred <- predict(knn, newdata = X_train)
knn_r2_train <- unname(postResample(pred = knnPred, obs = y_train)[2])
knnPred <- predict(knn, newdata = X_test)
knn_r2_test <- unname(postResample(pred = knnPred, obs = y_test)[2])
r2_vals <- rbind(r2_vals,
c('resampling', 'KNN', knn_r2_resampling),
c('training', 'KNN', knn_r2_train),
c('testing', 'KNN', knn_r2_test))
# SVM Model -------------------------------------------------------------------
svm_grid <- expand.grid(
sigma = c(0.01, 0.1, 1), # radial kernel parameter
C = c(0.1, 1, 10) # cost parameter
)
svm <- train(X_train, y_train,
method = 'svmRadial',
preProc = c("center", "scale"),
tuneGrid = svm_grid,
trControl = train_control)
svm_r2_resampling <- max(svm$results$Rsquared, na.rm = TRUE)
svmPred <- predict(svm, newdata = X_train)
svm_r2_train <- unname(postResample(pred = svmPred, obs = y_train)[2])
svmPred <- predict(svm, newdata = X_test)
svm_r2_test <- unname(postResample(pred = svmPred, obs = y_test)[2])
r2_vals <- rbind(r2_vals,
c('resampling', 'SVM', svm_r2_resampling),
c('training', 'SVM', svm_r2_train),
c('testing', 'SVM', svm_r2_test))
# Neural Network Model --------------------------------------------------------
nnet_grid <- expand.grid(.decay = c(0, 0.01, 0.1, 0.2), # weight decay (regularization)
.size = c(1:5), # size of hidden layer
.bag = FALSE)
nnet <- train(X_train, y_train,
method = 'avNNet',
tuneGrid = nnet_grid,
trControl = train_control,
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 5 * (ncol(X_train) + 1) + 5 + 1, # sized for this data's predictors and a max hidden size of 5
maxit = 500)
nnet_r2_resampling <- max(nnet$results$Rsquared, na.rm = TRUE)
nnetPred <- predict(nnet, newdata = X_train)
nnet_r2_train <- unname(postResample(pred = nnetPred, obs = y_train)[2])
nnetPred <- predict(nnet, newdata = X_test)
nnet_r2_test <- unname(postResample(pred = nnetPred, obs = y_test)[2])
r2_vals <- rbind(r2_vals,
c('resampling', 'NN', nnet_r2_resampling),
c('training', 'NN', nnet_r2_train),
c('testing', 'NN', nnet_r2_test))
# -----------------------------------------------------------------------------
r2_vals <- na.omit(r2_vals)
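# Note: rbind() with character vectors coerces every column of r2_vals to
# character, which is why as.numeric(r2) is used in the plot below.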
The \(R^2\) values for all models are stored in the r2_vals data frame and are plotted below:
ggplot(r2_vals, aes(x = model, y = as.numeric(r2), fill = type)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "R2 Values for Training and Testing",
x = "Model",
y = "R2 Value",
fill = "Type") +
theme_minimal()
Based on the results above, the SVM and NN models had the best (and nearly equal) performance on the test set, achieving \(R^2\) values close to 0.65. These results are a great deal more impressive than those obtained with the linear regression methods, which never achieved \(R^2\) values greater than 0.5. Even the simplest non-linear regression model tested (KNN) achieved better results than the best linear regression model.
Which predictors are the most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten important predictors from the optimal linear model?
The SVM and neural network models from part (a) performed nearly identically on the testing data; the feature importances below are computed from the tuned neural network. The varImp function is used to determine its top 20 predictors, which are then plotted:
var_imps <- data.frame(varImp(nnet)$importance)
var_imps$predictor_name <- rownames(var_imps)
rownames(var_imps) <- NULL
top_imps <- var_imps %>%
arrange(desc(Overall)) %>%
slice_head(n = 20)
ggplot(top_imps, aes(x = reorder(predictor_name, -Overall), y = Overall)) +
geom_bar(stat = "identity") +
coord_flip() + # Flip axes for better readability
labs(
title = "Variable Importance",
x = "Variables",
y = "Importance Score"
) +
theme_minimal()
The plot reveals that both manufacturing and biological predictors rank among the most important, with neither category dominating. This is similar to the results seen when this analysis was completed using linear models; in fact, the top three predictors for this non-linear model match those of the best linear model from HW7 (albeit in a slightly different order).
However, if we look at the average feature importance across all predictors in the dataset:
man_imps <- var_imps %>%
filter(startsWith(predictor_name, "Manufacturing"))
bio_imps <- var_imps %>%
filter(startsWith(predictor_name, "Biological"))
avg_man_imp <- round(mean(man_imps$Overall), 2)
avg_bio_imp <- round(mean(bio_imps$Overall), 2)
print1 <- paste("Manufacturing average feature importance:", avg_man_imp)
print2 <- paste("Biological average feature importance:", avg_bio_imp)
cat(paste(print1, print2, sep='\n'))
## Manufacturing average feature importance: 25.3
## Biological average feature importance: 50.86
we find that the biological features have an average importance more than twice that of the manufacturing features.
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal non-linear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
Similar to what was done in Homework 7, the cell below determines the correlation of the top 20 predictors with yield, and plots them:
top_names <- top_imps$predictor_name
# Compute the Pearson correlation of each top predictor with Yield
rs <- sapply(top_names, function(column) cor(df[[column]], df[["Yield"]]))
rs <- data.frame(predictor_name = top_names, r_value = unname(rs))
ggplot(rs, aes(x = predictor_name, y = r_value)) +
geom_bar(stat = "identity") +
labs(
title = "Top Predictors Correlation with Yield",
x = "Variables",
y = "Pearson-r Correlation"
) +
coord_flip() +
theme_minimal()
Once again, the plot reveals that all of the top biological predictors are positively correlated with yield, suggesting that larger values of these predictors are associated with increased yield. The behavior of the top manufacturing predictors is less clear, with both positive and negative correlations visible. Even so, these correlations help paint a picture of how yield might respond should the values of these predictors change.
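Finally, since a parallel backend was registered at the start of this file, it is good practice to release those resources once all models have been trained. A minimal cleanup step (using the cl object created earlier) is:
# Stop the parallel cluster and return to sequential processing
stopCluster(cl)
registerDoSEQ()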