if (!require("pls")) install.packages("pls")
if (!require("nnet")) install.packages("nnet")
if (!require("caret")) install.packages("caret")
if (!require("earth")) install.packages("earth")
if (!require("mlbench")) install.packages("mlbench")
if (!require("AppliedPredictiveModeling")) install.packages("AppliedPredictiveModeling")
library(pls)
library(nnet)
library(earth)
library(caret)
library(mlbench)
library(AppliedPredictiveModeling)

Note: Model descriptions are from Applied Predictive Modeling by Max Kuhn and Kjell Johnson. Solutions are modifications of those posted by Max Kuhn on his public GitHub page. Function descriptions are from the RDocumentation website.

Artificial Neural Networks (ANN)

Neural networks are powerful nonlinear regression techniques inspired by theories about how the brain works. Like Partial Least squares (PLS), the outcome is modeled by an intermediary set of unobserved variables (called hidden variables, hidden units, or hidden neurons). These hidden neurons are linear combinations of the original predictors. Unlike PLS models however, there are no constraints that help define these linear combinations and the linear combinations are not estimated in a hierarchical fashion. See the exercise on this page for more details on and application of this technique on character recognition of handwritten digits.

Multivariate Adaptive Regression Splines (MARS)

Like Neural Networks and Partial Least squares, MARS uses surrogate features instead of the original predictors. However, whereas PLS and Neural Networks are based on linear combinations of the predictors, MARS creates two contrasted versions of a predictor to enter the model. Also, the surrogate features in MARS are usually a function of only one or two predictors at a time. The nature of the MARS features breaks the predictor into two groups and models linear relationships between the predictor and the outcome in each group. In effect, this scheme creates a piecewise linear model where each new feature models an isolated portion of the original data. Each data point for each predictor is evaluated as a candidate cut point (hinge) by creating a linear regression model with the candidate features, and the corresponding model error is calculated. The predictor-hinge combination that achieves the smallest error is then used for the model.

Support Vector Machines (SVM)

SVMs are a class of powerful, highly flexible modeling techniques. The theory behind SVMs was originally developed in the context of classification models. This description of the SVM method is provides a lot of insight into the technique.

Support Vector Machine (SVM) is currently one of the most well-known and most commonly applied classifiers in supervised learning. The SVM employs a spatial representation of the data. SVMs fit vectors between the document features that best separate the documents into the various groups. Specifically, vectors are selected in a way that they maximize the space between the groups. After the estimation new documents are classified by checking on which sides of the vectors the features of unlabeled documents come to lie.

There are several flavors of Support Vector Regression. The focus here is on \(\varepsilon\)-insensitive regression which is a robust regression technique that seeks to minimize the effect of outliers on the regression equations. Recall that linear regression seeks to find parameter estimates that minimize SSE. One drawback of minimizing SSE is that the parameter estimates can be influenced by just one observation that falls far from the overall trend in the data. When data may contain influential observations, an alternative minimization metric that is less sensitive can be used to find the best parameter estimates. The Huber function uses the squared residuals when the residuals are “small” and uses the absolute residuals when the residuals are large. SVMs for regression use a function similar to the Huber function, with an important difference: Data points with residuals within a threshold set by the user (denoted as \(\varepsilon\)) do not contribute to the regression fit, data points with an absolute difference greater than the threshold contribute a linear-scale amount.

\(K\)-Nearest Neighbors (KNN)

The KNN approach predicts a new sample using the \(K\)-closest samples from the training set. Unlike the other methods, KNN cannot be cleanly summarized by a model. Instead, its construction is solely based on the individual samples from the training data. To predict a new sample for regression, KNN identifies the \(K\)-Nearest Neighbors in the predictor space of that specific sample. The predicted response for the new sample is then the mean of the \(K\) neighbors’ responses. Other summary statistics, such as the median, can also be used in place of the mean to predict the new sample.

Exercise 7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: \(y=10\sin { \left( \pi x_{ 1 }x_{ 2 } \right) } +20\left( x_{ 3 }-0.5 \right) ^{ 2 }+10x_{ 4 }+5x_{ 5 }+N(0,\sigma ^{ 2 })\) where the \(x\) values are random variables uniformly distributed between \([0,1]\) (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data. [The function] creates a list with a vector \(y\) and a matrix of predictors \(x\). The \(x\) data [are converted] from a matrix to a data frame [to, amongst other reasons,] give the columns names.

set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y, layout=c(5,2))

A large test set [is simulated] to estimate the true error rate with good precision: [Then,] several models [are tuned] on these data.

testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
set.seed(921)
(knnModel <- train(x = trainingData$x, y = trainingData$y,
  method = "knn", preProc = c("center", "scale"), tuneLength = 10))

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.610314  0.4862525  2.947015
##    7  3.453175  0.5308199  2.824261
##    9  3.400881  0.5524056  2.760033
##   11  3.337854  0.5823352  2.701619
##   13  3.289472  0.6095509  2.666158
##   15  3.265693  0.6277036  2.625566
##   17  3.276995  0.6370013  2.640152
##   19  3.275630  0.6468982  2.634895
##   21  3.294059  0.6516466  2.656775
##   23  3.303875  0.6587058  2.670830
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.

The function postResample [is] used to get the test set perforamnce [sic] values.

knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 3.1537867 0.6682955 2.5396694

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1-X5)?

Approach

In addition to the KNN models tuned on these data in the question setup, multiple MARS models are tuned the answer the question specific to MARS models. The expand.grid() function create a data frame consisting of all combinations (Cartesian product) of the supplied vectors or factors. The train() function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling based performance measure. The tuneGrid parameter of the train() function is fed a data frame with values in columns named after the parameters being tuned. The predict() function applies a fitted model to data consisting of predictor variables. The postResample() function takes two vectors as its inputs and computes the RMSE, \(R^2\), and MAE performance metrics. The varImp() function calculates the variable importance for a given model.

Results

set.seed(624)
tg <- expand.grid(degree = 1:2, nprune = seq(2,14,by=2))
(tune <- train(x = trainingData$x, y = trainingData$y, method = "earth",
  preProcess = c("center", "scale"), tuneGrid = tg))

## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.445398  0.2023004  3.649244
##   1        4      2.730495  0.6996039  2.189087
##   1        6      2.330585  0.7772941  1.846253
##   1        8      1.849901  0.8583698  1.448238
##   1       10      1.773989  0.8701136  1.385177
##   1       12      1.774982  0.8700869  1.376936
##   1       14      1.786500  0.8689010  1.391064
##   2        2      4.419910  0.2085702  3.624634
##   2        4      2.793223  0.6824988  2.239029
##   2        6      2.442604  0.7535913  1.922559
##   2        8      1.891985  0.8488207  1.471901
##   2       10      1.589932  0.8936622  1.252208
##   2       12      1.462782  0.9100411  1.147125
##   2       14      1.448615  0.9130046  1.130506
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.

plot(tune)

fcast <- predict(tune, newdata = testData$x)
postResample(pred = fcast, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 1.2736629 0.9345834 1.0070477

varImp(tune)

## earth variable importance
## 
##     Overall
## X1   100.00
## X4    84.98
## X2    68.87
## X5    48.55
## X3    38.96
## X9     0.00
## X10    0.00
## X8     0.00
## X6     0.00
## X7     0.00

Interpretation

The MARS model outperforms the KNN model tuned in the question setup. This is true by every performance metric. The MARS model has a lower RMSE, higher \(R^2\), and lower MAE. MARS does well at selecting the informative predictors X1-X5 and at assigning zero importance to the non-informative predictors X6-X10. A second-degree MARS model with 14 terms (including the intercept) provides the most appropriate fit for these data. The second degree implies that a product of two variables is included as one or more of the terms in the model.

Exercise 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

This data set contains information about a chemical manufacturing process… Raw material in this process is put through a sequence of 27 steps to generate the final pharmaceutical product. The starting material is generated from a biological unit and has a range of quality and characteristics… The data set consisted of 177 samples of biological material for which 57 characteristics were measured. Of the 57 characteristics, there were 12 measurements of the biological starting material, and 45 measurements of the manufacturing process. The process variables included measurements such as temperature, drying time, washing time, and concentrations of by-products at various steps. Some of the process measurements can be controlled, while others are observed. Samples are not independent because sets of samples come from the same batch of biological starting material.

CMP <- get(data(ChemicalManufacturingProcess))
rm(ChemicalManufacturingProcess) # list not needed
set.seed(624)
rows_train <- createDataPartition(CMP$Yield, p=0.75, list=F)
CMP_train <- CMP[rows_train, ]
CMP_test <- CMP[-rows_train, ]
prepro <- preProcess(subset(CMP_train, select = - Yield),
  method=c("BoxCox", "center", "scale", "knnImpute"))
CMP_train_X <- predict(prepro, CMP_train[ , -1])
nzv <- nearZeroVar(CMP_train_X)
mcl <- findCorrelation(cor(CMP_train_X))
CMP_train_X <- CMP_train_X[ ,-c(nzv, mcl)] 
set.seed(624)
ctrl <- trainControl(method = "boot", number = 25)
tune1 <- train(x = CMP_train_X, y = CMP_train$Yield,
  method="pls", tuneLength=15, trControl=ctrl)

Artificial Neural Networks (ANN)

The avNNet function uses model averaging unlike the Nnet function which creates a single model. The .bag = FALSE option makes the model use different random seeds instead of bootstrap aggregation (bagging). For regression, the linear relationship between the hidden neurons and the prediction can be used by choosing the linout = TRUE option. To reduce the amount of printed output use trace = FALSE. The number of neurons used by the model is specified in the MaxNWts parameter. The parameter maxit = 500 expands the number of iterations to find neuron estimates.

set.seed(624)
tg2 <- expand.grid(.decay = c(0, 0.01, .1), .size = c(1:10), .bag = F)
tune2 <- train(x = CMP_train_X, y = CMP_train$Yield,
  method = "avNNet", tuneGrid = tg2, trControl = ctrl, linout = T, 
  trace = F, MaxNWts = 10 * (ncol(CMP_train_X) + 1) + 10 + 1, maxit = 500)

Multivariate Adaptive Regression Splines (MARS)

MARS models are [available] in several packages, but the most extensive implementation is in the earth package.

set.seed(624)
tg3 <- expand.grid(degree = c(1:2), nprune = c(2:10))
tune3 <- train(x = CMP_train_X, y = CMP_train$Yield,
  method = "earth", tuneGrid = tg3, trControl = ctrl)

Support Vector Machines (SVM)

The kernlab package has comprehensive implementation of SVM models for regression. Its ksvm function is available for regression models and uses the Guassian radial basis (rbfdot) function as the default kernel function. Other kernel functions can be used, including polynomial (polydot) and linear (vanilladot). If the appropriate values of the cost and kernel parameters are unknown, they can be estimated through resampling. In the train() function, the svmRadial, svmLinear, and svmPoly methods fit different kernels.

set.seed(614)
tg4 <- expand.grid(C=c(0.01,0.05,0.1), degree=c(1,2), scale=c(0.25,0.5,1))
tune4 <- train(x = CMP_train_X, y = CMP_train$Yield,
  method = "svmPoly",  tuneGrid = tg4,  trControl = ctrl)

\(K\)-Nearest Neighbors (KNN)

The knnreg function in the caret package fits the KNN regression model; train tunes the model over \(K\).

set.seed(624)
tg5 <- data.frame(.k = 1:20)
tune5 <- train(x = CMP_train_X, y = CMP_train$Yield,
  method = "knn", tuneGrid = tg5, trControl = trainControl(method = "cv"))

Exercise 7.5.a

Which nonlinear regression model gives the optimal resampling and test set performance?

Approach

The pre-processing steps applied to the training set are applied to the training set. Then the same non-zero variance and highly correlated columns removed from the training set are removed from the test set. A slight issue arises during forecasting due to a value of \(-\infty\) remaining in the pre-processed dataset. This infinite value hinders forecasting with the Neural Network and \(K\)-Nearest Neighbor models but does not affect the Multivariate Adaptive Regression Splines and Support Vector Machine models. For the sake of completeness in model comparison, the value of \(-\infty\) is replaced with the minimum finite value of the variable to which the infinite value belongs. The predict() function applies a fitted model to data consisting of predictor variables. The postResample() function takes two vectors as its inputs and computes the RMSE, \(R^2\), and MAE performance metrics.

Results

tune1$modelInfo$label
data.frame(model="PLS", tune1$bestTune, RMSE=min(tune1$results$RMSE), row.names="")
plot(tune1)

## [1] "Partial Least Squares"
##  model ncomp     RMSE
##    PLS     3 1.276354

tune2$modelInfo$label
data.frame(model="ANN", tune2$bestTune, RMSE=min(tune2$results$RMSE), row.names="")
plot(tune2)

## [1] "Model Averaged Neural Network"
##  model size decay   bag     RMSE
##    ANN    1     0 FALSE 1.611466

tune3$modelInfo$label
data.frame(model="MARS", tune3$bestTune, RMSE=min(tune3$results$RMSE), row.names="")
plot(tune3)

## [1] "Multivariate Adaptive Regression Spline"
##  model nprune degree     RMSE
##   MARS      4      1 1.285954

tune4$modelInfo$label
data.frame(model="SVM", tune4$bestTune, RMSE=min(tune4$results$RMSE), row.names="")
plot(tune4)

## [1] "Support Vector Machines with Polynomial Kernel"
##  model degree scale    C     RMSE
##    SVM      1  0.25 0.01 1.386585

tune5$modelInfo$label
data.frame(model="KNN", tune5$bestTune, RMSE=min(tune5$results$RMSE), row.names="")
plot(tune5)

## [1] "k-Nearest Neighbors"
##  model k     RMSE
##    KNN 3 1.263295

Training Set Resampling

metrics <- function(tune) {
  RMSE = min(tune$results$RMSE)
  Rsquared = max(tune$results$Rsquared)
  MAE = min(tune$results$MAE)
  return(cbind(RMSE, Rsquared, MAE)) }
data.frame(rbind(metrics(tune1), metrics(tune2), 
  metrics(tune3), metrics(tune4), metrics(tune5)),
  row.names = c("PLS","ANN","MARS","SVM","KNN"))

##          RMSE  Rsquared      MAE
## PLS  1.276354 0.5037055 1.055460
## ANN  1.611466 0.3331216 1.270380
## MARS 1.285954 0.4953417 1.011400
## SVM  1.386585 0.4190542 1.134055
## KNN  1.263295 0.5285458 1.032431

Validation on Test Set

CMP_test_X <- predict(prepro, CMP_test[ , -1])
CMP_test_X <- CMP_test_X[ ,-c(nzv, mcl)] 
CMP_test_X[CMP_test_X[,35] == -Inf, 35] <- min(CMP_test_X[is.finite(CMP_test_X[,35]), 35])
fcast1 <- predict(tune1, newdata = CMP_test_X)
fcast2 <- predict(tune2, newdata = CMP_test_X)
fcast3 <- predict(tune3, newdata = CMP_test_X)
fcast4 <- predict(tune4, newdata = CMP_test_X)
fcast5 <- predict(tune5, newdata = CMP_test_X)
data.frame(rbind(postResample(pred = fcast1, obs = CMP_test$Yield), 
  postResample(pred = fcast2, obs = CMP_test$Yield), 
  postResample(pred = fcast3, obs = CMP_test$Yield),
  postResample(pred = fcast4, obs = CMP_test$Yield),
  postResample(pred = fcast5, obs = CMP_test$Yield)), 
  row.names = c("PLS","ANN","MARS","SVM","KNN"))

##          RMSE  Rsquared      MAE
## PLS  1.761001 0.3990817 1.110442
## ANN  1.771717 0.3433766 1.401622
## MARS 3.007599 0.1090839 1.590109
## SVM  1.504663 0.4674955 1.146040
## KNN  1.368251 0.5493334 1.056591

Interpretation

The optimal RMSE, \(R^2\), and MAE resampling performance metrics are associated with the KNN model followed by the MARS, PLS, SVM, and ANN models in that order. The minimum RMSE has a range of 0.35 between the KNN resampled model and the ANN resampled model. The RMSE of the KNN, MARS, and PLS resampled models cluster around the low end of the range. The RMSE of the SVM resampled model sits around the midpoint of the range. At the high end of the range rests the RMSE of the ANN resampled model. The optimal RMSE, \(R^2\), and MAE test set performance metrics are also associated with the KNN model. The range of the RMSE between the best and worst performing models is much larger for the test set because performance metrics from resampled training sets are highly optimistic as a result of the repeated sampling. These results are interesting because the author states:

\(K\)-Nearest Neighbors models have better performance when the underlying relationship between predictors and the response relies is dependent on samples’ proximity in the predictor space. Geographic information is not part of the data generation scheme for this particular data set. Hence, we would expect another type of model to perform better then KNN.

Exercise 7.5.b

Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

Approach

The dotPlot() function create a dotplot of variable importance values. The varImp() function calculates the variable importance for a given model.

Results

dotPlot(varImp(tune5), top=15)

imp <- data.frame(varImp(tune5)$importance, varImp(tune1)$importance)
colnames(imp) <- c("KNN","PLS")
head(imp[order(imp$KNN,decreasing = T), ], 15)

##                              KNN       PLS
## ManufacturingProcess32 100.00000 100.00000
## BiologicalMaterial06    74.50577  61.51722
## ManufacturingProcess31  60.51397  47.18914
## ManufacturingProcess13  59.23711  62.53050
## ManufacturingProcess36  56.57169  77.47365
## BiologicalMaterial03    49.52172  54.80568
## ManufacturingProcess33  47.94111  65.94903
## BiologicalMaterial04    42.61298  52.82129
## ManufacturingProcess02  42.04596  39.82596
## ManufacturingProcess29  41.94919  44.86556
## ManufacturingProcess27  40.15378  19.82705
## ManufacturingProcess17  39.92681  56.99485
## BiologicalMaterial01    35.84409  50.30322
## ManufacturingProcess09  35.40062  60.26707
## ManufacturingProcess06  32.93945  53.03103

Interpretation

The KNN model which was found to be the optimal nonlinear regression model and the PLS model which was found to be the optimal linear model agree that the most important predictors are process variables and that ManufacturingProcess32 is the most important predictor. The ranking of the other predictors is very different however. When looking at the top 15 most important variables, the KNN model lists fewer biological process variables, but those fewer biological processes are of more importance in the KNN model than the PLS model.

Exercise 7.5.c

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Approach

The featurePlot() function produces lattice graphs containing scatter plots of the specified predictor variables against the target variable. Correlation measures the relationship between two variables. The most commonly used measure of correlation is Pearson correlation (\(r\)). Pearson correlation measures the relationship between normal linear homoskedastic variables. Measuring relationships between variables that are not normal, linear, or homoskedastic (inherently or through transformation) with the Pearson correlation formula produces misleading results. Spearman correlation (\(\rho\)) and Kendall correlation (\(\tau\)) are non-parametric measures of correlation based on monotonic rank. Spearman correlation is more computationally efficient than Kendall correlation, but less robust. For this analysis, the accuracy of Kendall correlation is being prioritized over the speed of Spearman correlation.

\[r = \frac{\sum{(x-m_x)(y-m_y)}}{\sqrt{\sum{(x-m_x)^2}\sum{(y-m_y)^2}}} \tag{Pearson}\] \[\rho = \frac{\sum(x' - m_{x'})(y'_i - m_{y'})}{\sqrt{\sum(x' - m_{x'})^2 \sum(y' - m_{y'})^2}} \tag{Spearman}\] \[\tau = \frac{n_c - n_d}{\frac{1}{2}n(n-1)} \tag{Kendall}\]

Results

important <- c('ManufacturingProcess32',
               'ManufacturingProcess06',
               'ManufacturingProcess31',
               'ManufacturingProcess13')
featurePlot(CMP_train_X[, important], CMP_train$Yield)

cor(CMP_train_X[, important], CMP_train$Yield, method="kendall")

##                              [,1]
## ManufacturingProcess32  0.4927701
## ManufacturingProcess06  0.2458783
## ManufacturingProcess31 -0.2494487
## ManufacturingProcess13 -0.2154288

Interpretation

The top four predictors show low to moderate correlation with the response variable. Some are negatively correlated and others are positively correlated. The intuition that is potentially revealed about the predictors and their relationship with yield is how each predictor plays a small part in the yield. This would explain the need for a 27-step process. It also highlights the importance of selecting the appropriate biological material for the pharmaceutical product.