All exercises can be found in “Applied Predictive Modeling” by Max Kuhn and Kjell Johnson.

Setup

The following packages are required to rerun this .Rmd file:

seed_num <- 200
library(tidyverse)
library(AppliedPredictiveModeling)
library(mlbench)
library(caret)
library(doParallel)
library(kernlab)
library(earth)

The cell below sets up a parallel backend to speed up the process of training the models used in this assignment.

# Detect the number of available cores
cores <- parallel::detectCores()

# Create a cluster
cl <- makeCluster(cores - 1) # Use one less core to avoid overloading

# Register the cluster for parallel processing
registerDoParallel(cl)
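
For completeness, the worker processes should be released once all models have been trained. A minimal sketch (left commented out so the parallel backend stays active for the rest of this document):

# Run at the very end of the .Rmd to shut down the cluster
# stopCluster(cl)
# registerDoSEQ()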

Exercise 7.2

Description

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\[ y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2) \]

where the \(x\) values are random variables uniformly distributed between [0,1] (there are also 5 other non-informative variables created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

set.seed(seed_num)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using 
featurePlot(trainingData$x, trainingData$y)

## or other methods

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a very large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

Tune several models on this data. For example:

knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = 'knn',
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1-X5)?

Solution

Since the exercise description already implemented and tested a K-Nearest Neighbors (KNN) model, we can use its performance as a benchmark for evaluating the more complicated non-linear regression models. First, the cell below tunes a neural network by comparing models with different hidden layer sizes and weight decay values:

set.seed(seed_num)

train_control <- trainControl(method = "cv", number=5, allowParallel = TRUE)
nnet_grid <- expand.grid(.decay = c(0, 0.01, 0.1), # weight decay
                         .size = c(1:10), # size of hidden layer
                         .bag = FALSE)

nnet <- train(trainingData$x, trainingData$y,
              method = 'avNNet',
              tuneGrid = nnet_grid,
              trControl = train_control,
              preProc = c("center", "scale"),
              linout = TRUE,
              trace = FALSE,
              MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
              maxit = 500)

# Info on setting MaxNWts: the MaxNWts argument specifies the maximum
# allowable number of weights in the neural network model.

# MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1
# 10 * (ncol(trainingData$x) + 1): weights connecting the inputs (plus the
#   input bias term) to up to 10 hidden units (the largest size in the grid)
# + 10: weights connecting the hidden units to the output unit
# + 1: bias term for the output unit

nnet
## Model Averaged Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 160, 160, 160, 160, 160 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    2.688639  0.7433601  2.106279
##   0.00    2    2.471951  0.7529743  1.944502
##   0.00    3    2.509642  0.7473229  1.954774
##   0.00    4    2.185160  0.8063577  1.715441
##   0.00    5    2.294365  0.7825379  1.800988
##   0.00    6    2.403238  0.7678322  1.862642
##   0.00    7    2.300114  0.7958077  1.871598
##   0.00    8    2.409539  0.7675302  1.886257
##   0.00    9    2.652394  0.7209181  2.108428
##   0.00   10    2.610223  0.7307329  2.088606
##   0.01    1    2.453946  0.7563625  1.899324
##   0.01    2    2.460287  0.7540060  1.939264
##   0.01    3    2.384403  0.7722541  1.863563
##   0.01    4    2.076132  0.8169480  1.643802
##   0.01    5    2.266288  0.7869724  1.776219
##   0.01    6    2.318958  0.7791690  1.809970
##   0.01    7    2.276704  0.7890390  1.840748
##   0.01    8    2.379298  0.7728557  1.920556
##   0.01    9    2.623797  0.7365057  2.076443
##   0.01   10    2.605356  0.7322066  2.039109
##   0.10    1    2.463917  0.7555615  1.906016
##   0.10    2    2.503459  0.7484796  1.918135
##   0.10    3    2.431379  0.7602079  1.870222
##   0.10    4    2.154322  0.8092339  1.655268
##   0.10    5    2.055931  0.8214570  1.596998
##   0.10    6    2.270502  0.7917765  1.803107
##   0.10    7    2.329103  0.7745899  1.812457
##   0.10    8    2.228139  0.8024162  1.784956
##   0.10    9    2.522098  0.7495327  1.941349
##   0.10   10    2.354224  0.7649269  1.888273
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.1 and bag = FALSE.

The summary above reveals that a neural network with 5 nodes in its hidden layer and a weight decay of 0.1 achieved the top performance. The cell below also plots the data from the summary, visually confirming that this model achieved the lowest RMSE:

plot(nnet)

The cell below uses the tuned neural network to predict the test set and evaluates its performance:

nnetPred <- predict(nnet, newdata = testData$x)
postResample(pred = nnetPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.2479456 0.8015616 1.6970604

The process outlined above for neural networks is repeated below using Support Vector Machine (SVM) regression models. Implementing SVM requires choosing a kernel to map features into a higher-dimensional space, and the choice of kernel determines which parameters can be tuned. In this case, we use a radial basis function kernel and evaluate models with different values of \(C\) and \(\sigma\):

set.seed(seed_num)

svm_grid <- expand.grid(
  sigma = c(0.01, 0.1, 1),    # radial kernel parameter
  C = c(0.1, 1, 10)           # cost parameter
)

svm <- train(trainingData$x, trainingData$y,
             method = 'svmRadial',
             preProc = c("center", "scale"),
             tuneGrid = svm_grid,
             trControl = train_control)

svm
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 160, 160, 160, 160, 160 
## Resampling results across tuning parameters:
## 
##   sigma  C     RMSE      Rsquared   MAE     
##   0.01    0.1  4.137241  0.7307907  3.371430
##   0.01    1.0  2.491346  0.7630923  2.006860
##   0.01   10.0  2.244822  0.7914765  1.751592
##   0.10    0.1  3.498956  0.7576037  2.811992
##   0.10    1.0  2.257224  0.7979240  1.748175
##   0.10   10.0  2.189555  0.8021775  1.726239
##   1.00    0.1  4.913436  0.1469356  4.032097
##   1.00    1.0  4.845679  0.1619464  3.974399
##   1.00   10.0  4.826472  0.1959491  3.960697
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.1 and C = 10.
plot(svm)

In this case, the best SVM model had a \(C\) (cost) value of 10 and a \(\sigma\) value of 0.1. Using the model on the test set results in the following performance:

svmPred <- predict(svm, newdata = testData$x)
postResample(pred = svmPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.2076835 0.8062778 1.6784911

The tuned versions of the neural network and SVM models achieved \(R^2\) values of \(\approx\) 0.81 on the test data. As such, these models exhibit a considerable improvement over the tuned KNN model created in the exercise description. This is reassuring, given that these models (especially the neural network) are more mathematically, and by extension computationally, complex than the KNN model. That being said, KNN can still offer reasonable predictions quickly and acts as a strong benchmark for evaluating these more complex models.
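
To make the comparison concrete, the sketch below gathers the three test-set results into a single table (assuming the knnPred, nnetPred, and svmPred vectors computed above are still in the workspace):

rbind(KNN  = postResample(pred = knnPred,  obs = testData$y),
      NNet = postResample(pred = nnetPred, obs = testData$y),
      SVM  = postResample(pred = svmPred,  obs = testData$y))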

Lastly, the cell below fits a Multivariate Adaptive Regression Splines (MARS) model. While MARS models can be used to make predictions, they are also useful because they perform feature selection, identifying the predictors they consider most informative. This latter ability is tested below:

mars <- earth(trainingData$x, trainingData$y)
mars
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982

A MARS model only retains the predictors it considers important, which, according to the output above, are X1-X6. Based on the formula used to simulate the data, only X1-X5 are relevant, and while X6 is technically retained by this MARS model, it received the lowest importance score of the selected predictors. As such, we can conclude that MARS’s ability to identify the informative predictors is quite strong.
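
Although the exercise focuses on whether MARS selects the informative predictors, its predictive performance can be checked as well. The sketch below tunes a MARS model through caret, using the same train_control as above and a tuning grid over degree and nprune chosen here purely for illustration, then evaluates it on the test set:

mars_grid <- expand.grid(degree = 1:2, nprune = 2:20)

mars_tuned <- train(trainingData$x, trainingData$y,
                    method = 'earth',
                    tuneGrid = mars_grid,
                    trControl = train_control)

marsPred <- predict(mars_tuned, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)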

Exercise 7.5

Description

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several non-linear regression models.

Part (a)

Description

Which non-linear regression model gives the optimal re-sampling and test set performance?

Solution

First, the cell below recreates the dataset used when solving Exercise 6.3 in HW7:

set.seed(19)

# pull in data
data(ChemicalManufacturingProcess)

# impute missing values with the median of each column
imputation <- ChemicalManufacturingProcess %>%
  preProcess("medianImpute")

df <- predict(imputation, ChemicalManufacturingProcess)

# split into training and testing data
train_ratio <- 0.7
n <- nrow(df)
n_train <- round(train_ratio * n)

train_ind <- sample(1:n, n_train)

train <- data.frame(df[train_ind,])
test <- data.frame(df[-train_ind,])

X_train <- select(train, -Yield)
y_train <- train$Yield

X_test <- select(test, -Yield)
y_test <- test$Yield

Next, the cell below trains tuned versions of KNN, SVM, and NN regression models and assesses their performance on the testing and training data:

set.seed(seed_num)

r2_vals <- data.frame(matrix(ncol=3))
colnames(r2_vals) <- c('type', 'model', 'r2')


# KNN Model -------------------------------------------------------------------
knn <- train(X_train, y_train,
             method = 'knn',
             preProc = c("center", "scale"),
             tuneLength = 10)

knn_r2_resampling <- max(knn$results$Rsquared)
knnPred <- predict(knn, newdata = X_train)
knn_r2_train <- unname(postResample(pred = knnPred, obs = y_train)[2])
knnPred <- predict(knn, newdata = X_test)
knn_r2_test <- unname(postResample(pred = knnPred, obs = y_test)[2])

r2_vals <- rbind(r2_vals,
                 c('resampling', 'KNN', knn_r2_resampling),
                 c('training', 'KNN', knn_r2_train),
                 c('testing', 'KNN', knn_r2_test))


# SVM Model -------------------------------------------------------------------
svm_grid <- expand.grid(
  sigma = c(0.01, 0.1, 1),    # radial kernel parameter
  C = c(0.1, 1, 10)           # cost parameter
)

svm <- train(X_train, y_train,
             method = 'svmRadial',
             preProc = c("center", "scale"),
             tuneGrid = svm_grid,
             trControl = train_control)

svm_r2_resampling <- max(svm$results$Rsquared, na.rm = TRUE)
svmPred <- predict(svm, newdata = X_train)
svm_r2_train <- unname(postResample(pred = svmPred, obs = y_train)[2])
svmPred <- predict(svm, newdata = X_test)
svm_r2_test <- unname(postResample(pred = svmPred, obs = y_test)[2])

r2_vals <- rbind(r2_vals,
                 c('resampling', 'SVM', svm_r2_resampling),
                 c('training', 'SVM', svm_r2_train),
                 c('testing', 'SVM', svm_r2_test))


# Neural Network Model --------------------------------------------------------
nnet_grid <- expand.grid(.decay = c(0, 0.01, 0.1, 0.2), # weight decay
                         .size = c(1:5), # size of hidden layer
                         .bag = FALSE)

nnet <- train(X_train, y_train,
              method = 'avNNet',
              tuneGrid = nnet_grid,
              trControl = train_control,
              preProc = c("center", "scale"),
              linout = TRUE,
              trace = FALSE,
              MaxNWts = 5 * (ncol(X_train) + 1) + 5 + 1, # sized for this data and the grid's largest hidden layer (5)
              maxit = 500)

nnet_r2_resampling <- max(nnet$results$Rsquared, na.rm = TRUE)
nnetPred <- predict(nnet, newdata = X_train)
nnet_r2_train <- unname(postResample(pred = nnetPred, obs = y_train)[2])
nnetPred <- predict(nnet, newdata = X_test)
nnet_r2_test <- unname(postResample(pred = nnetPred, obs = y_test)[2])

r2_vals <- rbind(r2_vals,
                 c('resampling', 'NN', nnet_r2_resampling),
                 c('training', 'NN', nnet_r2_train),
                 c('testing', 'NN', nnet_r2_test))

# -----------------------------------------------------------------------------
  
r2_vals <- na.omit(r2_vals)

The \(R^2\) values for all models are stored in the r2_vals data frame and are plotted below:

ggplot(r2_vals, aes(x = model, y = as.numeric(r2), fill = type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "R2 Values for Training and Testing",
       x = "Model",
       y = "R2 Value",
       fill = "Type") +
  theme_minimal()

Based on the results above, the SVM and NN models had the best (and near-equal) performance on the test set, achieving \(R^2\) values close to 0.65. These performance metrics are a great deal more impressive than those achieved by the linear regression methods, which never reached \(R^2\) values greater than 0.5. Even the simplest non-linear regression model tested (KNN) was able to achieve better results than the best linear regression model.
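
The exact test-set numbers behind the plot can also be pulled directly from the r2_vals data frame (a quick check, assuming the object created in the previous cell):

r2_vals %>%
  filter(type == 'testing') %>%
  mutate(r2 = round(as.numeric(r2), 3)) %>%
  arrange(desc(r2))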

Part (b)

Description

Which predictors are the most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten important predictors from the optimal linear model?

Solution

The SVM and NN models from part (a) performed nearly identically on the testing data; the NN model is used below to evaluate feature importance. The varImp function is applied to the tuned NN model to determine the top 20 predictors, which are then plotted:

var_imps <- data.frame(varImp(nnet)$importance)
var_imps$predictor_name <- rownames(var_imps)
rownames(var_imps) <- NULL

top_imps <- var_imps %>%
  arrange(desc(Overall)) %>%
    slice_head(n = 20)

ggplot(top_imps, aes(x = reorder(predictor_name, -Overall), y = Overall)) +
  geom_bar(stat = "identity") +
  coord_flip() +  # Flip axes for better readability
  labs(
    title = "Variable Importance",
    x = "Variables",
    y = "Importance Score"
  ) +
  theme_minimal()

The plot reveals that both manufacturing and biological predictors were among the most important, with neither category dominating. This is similar to the results seen when this analysis was completed using linear models. In fact, the top three predictors are the same for this NN model and the best linear model from HW7 (albeit in slightly different order).
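
For a direct side-by-side comparison with the linear model from HW7, the top ten predictor names can be listed explicitly (the HW7 model object itself is not reloaded here):

head(top_imps$predictor_name, 10)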

However, if we look at the average feature importance across all predictors in the dataset:

man_imps <- var_imps %>%
  filter(startsWith(predictor_name, "Manufacturing"))

bio_imps <- var_imps %>%
  filter(startsWith(predictor_name, "Biological"))

avg_man_imp <- round(mean(man_imps$Overall), 2)
avg_bio_imp <- round(mean(bio_imps$Overall), 2)

print1 <- paste("Manufacturing average feature importance:", avg_man_imp)
print2 <- paste("Biological average feature importance:", avg_bio_imp)

cat(paste(print1, print2, sep='\n'))
## Manufacturing average feature importance: 25.3
## Biological average feature importance: 50.86

it reveals that the biological features have an average importance more than twice that of the manufacturing features.
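
The same comparison can be expressed as a single grouped summary (an equivalent sketch, assuming the var_imps data frame created above):

var_imps %>%
  mutate(category = ifelse(startsWith(predictor_name, "Biological"),
                           "Biological", "Manufacturing")) %>%
  group_by(category) %>%
  summarise(avg_importance = round(mean(Overall), 2))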

Part (c)

Description

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal non-linear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Solution

Similar to what was done in Homework 7, the cell below computes the correlation of each of the top 20 predictors with yield and plots them:

top_names <- top_imps$predictor_name
# Compute the Pearson correlation of each top predictor with Yield
rs <- sapply(top_names, function(column) cor(df[[column]], df[["Yield"]]))

rs <- data.frame(predictor_name = top_names, r_value = unname(rs))

ggplot(rs, aes(x = predictor_name, y = r_value)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Top Predictors Correlation with Yield",
    x = "Variables",
    y = "Pearson-r Correlation"
  ) +
  coord_flip() +
  theme_minimal()

Once again, the plot reveals that all of the top biological predictors have a positive correlation with yield, so larger values of these predictors are associated with increased yield. The behavior of the top manufacturing predictors is less clear, with both positive and negative correlations visible. Even so, these correlations offer a sense of how yield may respond when those predictor values change.
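
Correlations only capture linear association, so it can also help to look at the raw relationships. The sketch below uses caret's featurePlot with the top_names vector from above (only the first six predictors are shown to keep the panels readable) to plot each predictor directly against yield:

featurePlot(x = df[, head(top_names, 6)],
            y = df$Yield,
            plot = "scatter",
            type = c("p", "smooth"))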