# Import required R libraries
library(AppliedPredictiveModeling)
library(caret)
library(tidyverse)
library(corrplot)
library(earth)
library(kernlab)
library(mlbench)
library(kableExtra)

# Set seed once for entire file
set.seed(200)

Exercise 7.2

Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data: \[y = 10 sin(\pi x_1x_2) + 20(x_3 − 0.5)^2 +10x_4 +5x_5 +N(0,\sigma^2)\] where the \(x\) values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

# n = number of patterns to create
# sd = standard deviation of noise
trainingData <- mlbench.friedman1(200, sd = 1)

## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)

## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods.

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

Tune several models on these data. For example:

K-Nearest Neighbors Model

knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.

ggplot(knnModel) + labs(title="nNN Model With Tuning")

varImp(knnModel)

## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

knnPred <-predict(knnModel, newdata = testData$x)

## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

The above output of the kNN model from the book text results in an RMSE of 3.2040595 and \(R^2\) of 0.6819919. The top 5 predictor variables of importance are X1 - X5, not in that order though.

Let’s see if I can do better with the other model approaches from the textbook.

Neural Networks

# The findCorrelation takes a correlation matrix and determines the
# column numbers that should be removed to keep all pair-wise
# correlations below a threshold
tooHigh <- findCorrelation(cor(trainingData$x), cutoff=0.75)
#trainXnnet <- trainingData$x[, -tooHigh]
#testXnnet <- testData$x[, -tooHigh]
tooHigh

## integer(0)

The findCorrelation function results in no column numbers to be removed.

ctrl <- trainControl(method = "cv", number = 10)

# Create a specific candidate set of models to evaluate:
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10),
                        # The next option is to use bagging (see the
                        # next chapter) instead of different random
                        # seeds
                        .bag=FALSE)

nnetTune <- train(trainingData$x, trainingData$y,
                  method="avNNet",
                  tuneGrid=nnetGrid,
                  trControl = ctrl,
                  ## Automatically standardize data prior to modeling
                  # and prediction
                  preProc = c("center", "scale"),
                  linout=TRUE,
                  trace=FALSE,
                  MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
                  maxit=500)

# Output tuned model
nnetTune

## Model Averaged Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    2.461393  0.7600116  1.938705
##   0.00    2    2.500361  0.7550256  1.996446
##   0.00    3    2.037909  0.8423482  1.607043
##   0.00    4    1.909649  0.8572080  1.536214
##   0.00    5    2.088743  0.8291840  1.606762
##   0.00    6    2.809956  0.7170857  2.023726
##   0.00    7    3.480605  0.6465869  2.487827
##   0.00    8    4.123930  0.5767445  2.759803
##   0.00    9    4.467039  0.4899385  2.904843
##   0.00   10    3.605718  0.6760246  2.475416
##   0.01    1    2.437184  0.7689762  1.934924
##   0.01    2    2.510985  0.7596191  1.988260
##   0.01    3    2.002978  0.8414854  1.557508
##   0.01    4    2.003354  0.8445293  1.549719
##   0.01    5    2.117566  0.8272396  1.672930
##   0.01    6    2.303155  0.8023191  1.844344
##   0.01    7    2.332790  0.8063403  1.863003
##   0.01    8    2.346956  0.7937476  1.881861
##   0.01    9    2.244813  0.8169741  1.778934
##   0.01   10    2.402267  0.7780340  1.965674
##   0.10    1    2.450909  0.7652283  1.942964
##   0.10    2    2.489401  0.7606440  1.997060
##   0.10    3    2.200693  0.8155496  1.786599
##   0.10    4    2.059317  0.8432349  1.651707
##   0.10    5    2.172667  0.8172341  1.718829
##   0.10    6    2.227873  0.8088523  1.766776
##   0.10    7    2.221611  0.8218031  1.793130
##   0.10    8    2.358921  0.7945830  1.831457
##   0.10    9    2.307354  0.7903244  1.872002
##   0.10   10    2.230084  0.8150670  1.789885
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0 and bag = FALSE.

# Plot trained/tuned model
ggplot(nnetTune) + labs(title="Neural Networks Model With Tuning")

# Make predictions on Test set
nnetPred <-predict(nnetTune, newdata = testData$x)
# Output prediction performance
postResample(pred = nnetPred, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 2.4788220 0.7870229 1.6803645

varImp(nnetTune)

## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

Result: The Neural Networks model results in an RMSE of 2.4788220 and \(R^2\) of 0.7870229. Certainly an improvement in the \(R^2\) compared to the initial kNN model. Again, the top 5 predictor variables of importance are X1 - X5, not in that order though.

MARS

# Tune MARS model
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2,
                        .nprune = 2:38)

# Fix the seed so that the results can be reproduced
marsTuned <- train(trainingData$x, 
                   trainingData$y,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = ctrl)
# Output model
marsTuned

## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.334325  0.2599883  3.607719
##   1        3      3.599334  0.4805557  2.888987
##   1        4      2.637145  0.7290848  2.087677
##   1        5      2.283872  0.7939684  1.817343
##   1        6      2.125875  0.8183677  1.647491
##   1        7      1.766013  0.8733619  1.410328
##   1        8      1.671282  0.8842102  1.324258
##   1        9      1.645406  0.8867947  1.322041
##   1       10      1.597968  0.8926582  1.297518
##   1       11      1.540109  0.8996361  1.237949
##   1       12      1.545349  0.8992979  1.243771
##   1       13      1.535169  0.9010122  1.233571
##   1       14      1.529405  0.9018457  1.223874
##   1       15      1.529405  0.9018457  1.223874
##   1       16      1.529405  0.9018457  1.223874
##   1       17      1.529405  0.9018457  1.223874
##   1       18      1.529405  0.9018457  1.223874
##   1       19      1.529405  0.9018457  1.223874
##   1       20      1.529405  0.9018457  1.223874
##   1       21      1.529405  0.9018457  1.223874
##   1       22      1.529405  0.9018457  1.223874
##   1       23      1.529405  0.9018457  1.223874
##   1       24      1.529405  0.9018457  1.223874
##   1       25      1.529405  0.9018457  1.223874
##   1       26      1.529405  0.9018457  1.223874
##   1       27      1.529405  0.9018457  1.223874
##   1       28      1.529405  0.9018457  1.223874
##   1       29      1.529405  0.9018457  1.223874
##   1       30      1.529405  0.9018457  1.223874
##   1       31      1.529405  0.9018457  1.223874
##   1       32      1.529405  0.9018457  1.223874
##   1       33      1.529405  0.9018457  1.223874
##   1       34      1.529405  0.9018457  1.223874
##   1       35      1.529405  0.9018457  1.223874
##   1       36      1.529405  0.9018457  1.223874
##   1       37      1.529405  0.9018457  1.223874
##   1       38      1.529405  0.9018457  1.223874
##   2        2      4.334325  0.2599883  3.607719
##   2        3      3.599334  0.4805557  2.888987
##   2        4      2.637145  0.7290848  2.087677
##   2        5      2.271844  0.7927888  1.823675
##   2        6      2.114868  0.8200184  1.659485
##   2        7      1.780140  0.8733216  1.429346
##   2        8      1.663164  0.8891928  1.294968
##   2        9      1.460976  0.9122520  1.180387
##   2       10      1.399692  0.9175376  1.122526
##   2       11      1.380002  0.9216251  1.110556
##   2       12      1.312883  0.9284253  1.063321
##   2       13      1.285612  0.9343029  1.014216
##   2       14      1.328520  0.9286650  1.052185
##   2       15      1.322954  0.9298515  1.045527
##   2       16      1.341454  0.9283961  1.053190
##   2       17      1.344590  0.9280972  1.054209
##   2       18      1.340821  0.9285264  1.050274
##   2       19      1.340821  0.9285264  1.050274
##   2       20      1.340821  0.9285264  1.050274
##   2       21      1.340821  0.9285264  1.050274
##   2       22      1.340821  0.9285264  1.050274
##   2       23      1.340821  0.9285264  1.050274
##   2       24      1.340821  0.9285264  1.050274
##   2       25      1.340821  0.9285264  1.050274
##   2       26      1.340821  0.9285264  1.050274
##   2       27      1.340821  0.9285264  1.050274
##   2       28      1.340821  0.9285264  1.050274
##   2       29      1.340821  0.9285264  1.050274
##   2       30      1.340821  0.9285264  1.050274
##   2       31      1.340821  0.9285264  1.050274
##   2       32      1.340821  0.9285264  1.050274
##   2       33      1.340821  0.9285264  1.050274
##   2       34      1.340821  0.9285264  1.050274
##   2       35      1.340821  0.9285264  1.050274
##   2       36      1.340821  0.9285264  1.050274
##   2       37      1.340821  0.9285264  1.050274
##   2       38      1.340821  0.9285264  1.050274
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 13 and degree = 2.

# Plot model
ggplot(marsTuned) + labs(title="MARS Model With Tuning")

# Make predictions on Test set
marsPred <-predict(marsTuned, newdata = testData$x)
# Output prediction performance
postResample(pred = marsPred, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 1.2803060 0.9335241 1.0168673

# Variable importance of MARS model
varImp(marsTuned)

## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.33
## X2   48.88
## X5   15.63
## X3    0.00

Result: The MARS model results in an RMSE of 1.2803060 and \(R^2\) of 0.9335241 on the test data. Quite an improvement in the \(R^2\) compared to the initial kNN model and the Neural Networks model.

To answer the textbook question … yes, the MARS model does select the informative predictors (those named X1–X5). That being said, the variable X3 provides an overall importance of 0.0.

Support Vector Machines

svmRTuned <- train(trainingData$x, trainingData$y,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength=14,
                   trControl = ctrl)
# Output model
svmRTuned

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.504105  0.7940789  1.987142
##      0.50  2.219946  0.8148914  1.750249
##      1.00  2.028115  0.8388693  1.590383
##      2.00  1.899331  0.8561464  1.486326
##      4.00  1.815659  0.8669649  1.424259
##      8.00  1.798345  0.8702827  1.427710
##     16.00  1.797071  0.8702865  1.431137
##     32.00  1.795181  0.8705334  1.429146
##     64.00  1.795181  0.8705334  1.429146
##    128.00  1.795181  0.8705334  1.429146
##    256.00  1.795181  0.8705334  1.429146
##    512.00  1.795181  0.8705334  1.429146
##   1024.00  1.795181  0.8705334  1.429146
##   2048.00  1.795181  0.8705334  1.429146
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06104815
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06104815 and C = 32.

# Output final model
svmRTuned$finalModel

## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 32 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0610481539802731 
## 
## Number of Support Vectors : 152 
## 
## Objective Function Value : -76.9951 
## Training error : 0.008468

# Plot of SVM model not valuable
#ggplot(svmRTuned) + labs(title="SVM Model With Tuning")

#head(predict(svmRTuned, testData$x))

# Make predictions on Test set
svmPred <-predict(svmRTuned, newdata = testData$x)
# Output prediction performance
postResample(pred = svmPred, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 2.0693470 0.8263551 1.5718998

varImp(svmRTuned)

## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

Result: The Support Vector Machines model results in an RMSE of 2.0693470 and \(R^2\) of 0.8263551 on the test set. Certainly an improvement in the \(R^2\) compared to the initial kNN model and neural networks but not quite as good as the MARS model on the test set. Again, the top 5 predictor variables of importance are X1 - X5, not in that order though.

Overall, the best performing model on the test set is the MARS model with a quite impressive \(R^2\) of 0.9335241 on the test data. This result indicates the model is able to account for over 93% of the variance in the test data set. Then again, this data is simulated using the same approach for both the training and the test data, so it shouldn’t be too surprising that the models do perform well on the data. I believe that by increasing the count of observations in the test set, the evaluation of the models is fair to gauge overall performance.

Exercise 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

Data Setup

The data setup below is appropriately borrowed from my work in the previous assignment. The set.seed function is reset to match the value from the previous assignment

# Reset seed from previous assignment
set.seed(8675309) # Jenny, I got your number

# From Exercise 6.3
data(ChemicalManufacturingProcess)
cmp_data <- as.data.frame(ChemicalManufacturingProcess)

# Let's try to impute using preprocess function
# And make sure not to transform the 'Yield' column which is the result
cmp_preprocess_data <- preProcess(cmp_data[, -c(1)], method="knnImpute")

cmp_full_data <- predict(cmp_preprocess_data, cmp_data[, -c(1)])
cmp_full_data$Yield <- cmp_data$Yield

# Identify near zero variance columns for removal
nzv_cols <- nearZeroVar(cmp_full_data)
length(nzv_cols)

## [1] 1

# From: https://stackoverflow.com/questions/28043393/nearzerovar-function-in-caret
if(length(nzv_cols) > 0) cmp_full_data <- cmp_full_data[, -nzv_cols]
dim(cmp_full_data)

## [1] 176  57

trainingRows <- createDataPartition(cmp_full_data$Yield, p = .80, list=FALSE)

# Training set
training_data <- cmp_full_data[trainingRows, ]

# Test set
test_data <- cmp_full_data[-trainingRows, ]

# Based on book example
ctrl <- trainControl(method = "cv", number = 10)

With the goal of assessing the nonlinear regression models on the chemical manufacturing process data, I have fit models of type neural networks, MARS, SVM, and kNN, as I did in the first exercise of this assignment.

Model: Neural Networks

# The findCorrelation takes a correlation matrix and determines the
# column numbers that should be removed to keep all pair-wise
## correlations below a threshold
training_data_minus_y <- subset(training_data, select=-c(Yield))

tooHigh <- findCorrelation(cor(training_data_minus_y), cutoff=0.75)
trainXnnet <- training_data[, -tooHigh]
testXnnet <- test_data[, -tooHigh]

ctrl <- trainControl(method = "cv", number = 10)

# Create a specific candidate set of models to evaluate:
nnetGrid <- expand.grid(decay = c(0, 0.01, .1),
                        size = c(1:10),
                        bag=FALSE)

nnetTune_cmp <- train(Yield ~ .,
                  data = trainXnnet,
                  method="avNNet",
                  tuneGrid=nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout=TRUE,
                  trace=FALSE,
                  MaxNWts = 10 * (ncol(trainXnnet) + 1) + 10 + 1,
                  maxit=500)

# Output tuned model
nnetTune_cmp

## Model Averaged Neural Network 
## 
## 144 samples
##  37 predictor
## 
## Pre-processing: centered (37), scaled (37) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 129, 130, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    1.431184  0.4791997  1.165326
##   0.00    2    1.490655  0.3715362  1.261593
##   0.00    3    2.084346  0.3038504  1.678679
##   0.00    4    2.160302  0.2599626  1.738254
##   0.00    5    1.761031  0.3853397  1.430423
##   0.00    6    2.182373  0.2890228  1.752620
##   0.00    7    2.738784  0.3163743  2.098215
##   0.00    8    3.970852  0.2155929  2.582643
##   0.00    9    4.364210  0.2818610  2.901452
##   0.00   10    7.242408  0.1774099  4.248413
##   0.01    1    1.369164  0.4570740  1.099515
##   0.01    2    1.389743  0.4812962  1.122951
##   0.01    3    2.059833  0.2912552  1.574798
##   0.01    4    1.837103  0.3380343  1.371654
##   0.01    5    1.836712  0.3203407  1.426279
##   0.01    6    1.637615  0.4151693  1.342843
##   0.01    7    1.657243  0.4054110  1.285770
##   0.01    8    1.508695  0.4901474  1.194654
##   0.01    9    1.960660  0.3476448  1.520401
##   0.01   10    2.260141  0.3566402  1.651608
##   0.10    1    1.755761  0.4386429  1.258617
##   0.10    2    1.636166  0.4234651  1.210309
##   0.10    3    2.082464  0.3192228  1.541803
##   0.10    4    2.160683  0.3257065  1.600388
##   0.10    5    2.119334  0.3186189  1.483646
##   0.10    6    2.010512  0.3271826  1.478006
##   0.10    7    2.083727  0.3132577  1.495065
##   0.10    8    1.944911  0.3413724  1.511686
##   0.10    9    1.939889  0.3276584  1.414940
##   0.10   10    1.892039  0.3204802  1.404408
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1, decay = 0.01 and bag = FALSE.

# Plot trained/tuned model
ggplot(nnetTune_cmp) + labs(title="Neural Networks Model With Tuning")

# Make predictions on Test set
nnetPred <-predict(nnetTune_cmp, newdata = test_data)
# Output prediction performance
nnet_test_perf <- postResample(pred = nnetPred, obs = test_data$Yield)
nnet_test_perf

##      RMSE  Rsquared       MAE 
## 1.6747770 0.2824588 1.2292282

nnet_var_imp <- varImp(nnetTune_cmp)
nnet_var_imp

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 37)
## 
##                        Overall
## ManufacturingProcess36  100.00
## BiologicalMaterial03     94.59
## ManufacturingProcess17   94.11
## ManufacturingProcess06   85.45
## ManufacturingProcess11   75.91
## BiologicalMaterial11     61.87
## ManufacturingProcess33   55.29
## ManufacturingProcess30   47.14
## ManufacturingProcess25   46.77
## BiologicalMaterial09     44.85
## ManufacturingProcess12   30.04
## ManufacturingProcess01   29.66
## ManufacturingProcess20   29.30
## BiologicalMaterial10     25.47
## ManufacturingProcess35   24.91
## ManufacturingProcess21   19.93
## ManufacturingProcess10   16.84
## ManufacturingProcess24   16.31
## ManufacturingProcess16   15.46
## ManufacturingProcess34   15.27

Result: The Neural Networks model results in RMSE of 1.6747770 and \(R^2\) of 0.2824588 on the test set. First reaction, not a good \(R^2\) value at all.

Model: MARS

# Tune MARS model
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2,
                        .nprune = 2:38)
# Fix the seed so that the results can be reproduced
marsTuned_cmp <- train(Yield ~ ., 
                   data = training_data,
                   method = "earth",
                   # Explicitly declare the candidate models to test
                   tuneGrid = marsGrid,
                   trControl = ctrl)
# Output model
marsTuned_cmp

## Multivariate Adaptive Regression Spline 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 131, 131, 129, 131, 130, 128, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      1.426926  0.4382366  1.1434246
##   1        3      1.273534  0.5332807  1.0660261
##   1        4      1.236491  0.5519561  1.0193896
##   1        5      1.229455  0.5637839  0.9888648
##   1        6      1.277487  0.5472579  1.0267552
##   1        7      1.357491  0.4921570  1.0823859
##   1        8      1.379472  0.4923700  1.1001221
##   1        9      1.342980  0.5039311  1.0791890
##   1       10      1.384097  0.4916274  1.1083406
##   1       11      1.392302  0.4947727  1.1245497
##   1       12      1.372559  0.5155146  1.1052152
##   1       13      1.378398  0.5138871  1.1059596
##   1       14      1.374664  0.5248953  1.1103730
##   1       15      1.388739  0.5157009  1.1196297
##   1       16      1.388455  0.5158043  1.1199276
##   1       17      1.388455  0.5158043  1.1199276
##   1       18      1.388455  0.5158043  1.1199276
##   1       19      1.388455  0.5158043  1.1199276
##   1       20      1.388455  0.5158043  1.1199276
##   1       21      1.388455  0.5158043  1.1199276
##   1       22      1.388455  0.5158043  1.1199276
##   1       23      1.388455  0.5158043  1.1199276
##   1       24      1.388455  0.5158043  1.1199276
##   1       25      1.388455  0.5158043  1.1199276
##   1       26      1.388455  0.5158043  1.1199276
##   1       27      1.388455  0.5158043  1.1199276
##   1       28      1.388455  0.5158043  1.1199276
##   1       29      1.388455  0.5158043  1.1199276
##   1       30      1.388455  0.5158043  1.1199276
##   1       31      1.388455  0.5158043  1.1199276
##   1       32      1.388455  0.5158043  1.1199276
##   1       33      1.388455  0.5158043  1.1199276
##   1       34      1.388455  0.5158043  1.1199276
##   1       35      1.388455  0.5158043  1.1199276
##   1       36      1.388455  0.5158043  1.1199276
##   1       37      1.388455  0.5158043  1.1199276
##   1       38      1.388455  0.5158043  1.1199276
##   2        2      1.426926  0.4382366  1.1434246
##   2        3      1.290167  0.5213695  1.0616799
##   2        4      1.280474  0.5395705  1.0565108
##   2        5      1.267973  0.5318890  1.0543190
##   2        6      1.406288  0.5081843  1.1465652
##   2        7      1.407606  0.5126693  1.1314956
##   2        8      1.287724  0.5627628  1.0636373
##   2        9      1.303450  0.5317632  1.0639438
##   2       10      1.326812  0.5400931  1.0466505
##   2       11      1.388873  0.5262956  1.0917281
##   2       12      1.398588  0.5465208  1.0831373
##   2       13      1.487736  0.5210738  1.1264031
##   2       14      1.520211  0.5098230  1.1442476
##   2       15      1.476407  0.5322346  1.1229999
##   2       16      1.497239  0.5242669  1.1161376
##   2       17      1.466236  0.5467224  1.1079794
##   2       18      1.469903  0.5449608  1.1139425
##   2       19      1.520273  0.5167608  1.1405750
##   2       20      1.530656  0.5140442  1.1481545
##   2       21      1.513713  0.5215437  1.1493869
##   2       22      1.513713  0.5215437  1.1493869
##   2       23      1.513805  0.5214995  1.1533265
##   2       24      1.524664  0.5168372  1.1608298
##   2       25      1.527408  0.5155528  1.1624783
##   2       26      1.527408  0.5155528  1.1624783
##   2       27      1.527408  0.5155528  1.1624783
##   2       28      1.527408  0.5155528  1.1624783
##   2       29      1.527408  0.5155528  1.1624783
##   2       30      1.527408  0.5155528  1.1624783
##   2       31      1.527408  0.5155528  1.1624783
##   2       32      1.527408  0.5155528  1.1624783
##   2       33      1.527408  0.5155528  1.1624783
##   2       34      1.527408  0.5155528  1.1624783
##   2       35      1.527408  0.5155528  1.1624783
##   2       36      1.527408  0.5155528  1.1624783
##   2       37      1.527408  0.5155528  1.1624783
##   2       38      1.527408  0.5155528  1.1624783
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 5 and degree = 1.

# Plot model
ggplot(marsTuned_cmp) + labs(title="MARS Model With Tuning")

# Make predictions on Test set
marsPred <-predict(marsTuned_cmp, newdata = test_data)
# Output prediction performance
mars_test_perf <- postResample(pred = marsPred, obs = test_data$Yield)
mars_test_perf

##      RMSE  Rsquared       MAE 
## 1.2057906 0.6232327 0.9256242

# Variable importance of MARS model
mars_var_imp <- varImp(marsTuned_cmp)
mars_var_imp

## earth variable importance
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   49.05
## ManufacturingProcess17    0.00

Result: The MARS model results in RMSE of 1.2057906 and \(R^2\) of 0.6232327 on the test set. Clearly, a much better \(R^2\) compared to the Neural Networks model.

Model: SVM

svmRTuned_cmp <- train(Yield ~ .,
                   data = training_data,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength=14,
                   trControl = ctrl)
# Output model
svmRTuned_cmp

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 129, 128, 130, 129, 131, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE      
##      0.25  1.400064  0.5551365  1.1573287
##      0.50  1.268365  0.6087157  1.0539021
##      1.00  1.154556  0.6574527  0.9529924
##      2.00  1.116321  0.6657979  0.9099986
##      4.00  1.089719  0.6749753  0.8921569
##      8.00  1.099778  0.6678435  0.9033695
##     16.00  1.097377  0.6695076  0.9003905
##     32.00  1.097377  0.6695076  0.9003905
##     64.00  1.097377  0.6695076  0.9003905
##    128.00  1.097377  0.6695076  0.9003905
##    256.00  1.097377  0.6695076  0.9003905
##    512.00  1.097377  0.6695076  0.9003905
##   1024.00  1.097377  0.6695076  0.9003905
##   2048.00  1.097377  0.6695076  0.9003905
## 
## Tuning parameter 'sigma' was held constant at a value of 0.0130225
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0130225 and C = 4.

# Output final model
svmRTuned_cmp$finalModel

## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 4 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0130224956000477 
## 
## Number of Support Vectors : 124 
## 
## Objective Function Value : -91.4615 
## Training error : 0.031966

# Again, no need to plot SVM model
# ggplot(svmRTuned) + labs(title="SVM Model With Tuning")

# Make predictions on Test set
svmPred <-predict(svmRTuned_cmp, newdata = test_data)
# Output prediction performance
svm_test_perf <- postResample(pred = svmPred, obs = test_data$Yield)
svm_test_perf

##      RMSE  Rsquared       MAE 
## 1.2001067 0.6453839 0.9268041

svm_var_imp <- varImp(svmRTuned_cmp)
svm_var_imp

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     85.65
## ManufacturingProcess36   81.77
## ManufacturingProcess09   81.57
## ManufacturingProcess13   79.95
## BiologicalMaterial03     77.35
## ManufacturingProcess17   76.96
## ManufacturingProcess06   69.87
## BiologicalMaterial12     63.49
## ManufacturingProcess11   62.07
## ManufacturingProcess31   61.21
## BiologicalMaterial02     58.82
## BiologicalMaterial11     50.59
## ManufacturingProcess29   50.08
## BiologicalMaterial04     45.29
## ManufacturingProcess33   45.21
## BiologicalMaterial08     40.36
## ManufacturingProcess30   38.55
## ManufacturingProcess25   38.25
## ManufacturingProcess18   38.03

Result: The SVM model results in RMSE of 1.2001067 and \(R^2\) of 0.6453839 on the test set. A much better \(R^2\) compared to the Neural Networks model, and even a small improvement over the MARS model.

Model: kNN

knnModel_cmp <- train(Yield ~ .,
                  data = training_data,
                  method = "knn",
                  preProc = c("center", "scale"),
                  trContorl = ctrl)
knnModel_cmp

## k-Nearest Neighbors 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared   MAE     
##   5  1.462848  0.3646681  1.174720
##   7  1.442976  0.3795977  1.164743
##   9  1.430208  0.3924194  1.159082
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.

ggplot(knnModel_cmp) + labs(title="kNN Model With Tuning")

knnPred <-predict(knnModel_cmp, newdata = test_data)
knn_test_perf <- postResample(pred = knnPred, obs = test_data$Yield)
knn_test_perf

##      RMSE  Rsquared       MAE 
## 1.4025385 0.5479653 1.1086111

knn_var_imp <- varImp(knnModel_cmp)
knn_var_imp

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     85.65
## ManufacturingProcess36   81.77
## ManufacturingProcess09   81.57
## ManufacturingProcess13   79.95
## BiologicalMaterial03     77.35
## ManufacturingProcess17   76.96
## ManufacturingProcess06   69.87
## BiologicalMaterial12     63.49
## ManufacturingProcess11   62.07
## ManufacturingProcess31   61.21
## BiologicalMaterial02     58.82
## BiologicalMaterial11     50.59
## ManufacturingProcess29   50.08
## BiologicalMaterial04     45.29
## ManufacturingProcess33   45.21
## BiologicalMaterial08     40.36
## ManufacturingProcess30   38.55
## ManufacturingProcess25   38.25
## ManufacturingProcess18   38.03

Result: The kNN model results in RMSE of 1.4025385 and \(R^2\) of 0.5479653 on the test set. A better \(R^2\) compared to the Neural Networks model, but under-performed compared to the MARS model and SVM model.

Section a

Which nonlinear regression model gives the optimal re-sampling and test set performance?

perf_results <- data.frame(NNet=nnet_test_perf, MARS=mars_test_perf, SVM=svm_test_perf, kNN=knn_test_perf)

perf_results %>% t() %>% kable(caption="Comparison of Model Performance", digits=4) %>%
  kable_styling(bootstrap_options = c("hover", "striped"))

Comparison of Model Performance
	RMSE	Rsquared	MAE
NNet	1.6748	0.2825	1.2292
MARS	1.2058	0.6232	0.9256
SVM	1.2001	0.6454	0.9268
kNN	1.4025	0.5480	1.1086

Answer: As the above table indicates, the SVM model attained the lowest RMSE and the highest \(R^2\).

Section b

Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

svm_var_imp

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     85.65
## ManufacturingProcess36   81.77
## ManufacturingProcess09   81.57
## ManufacturingProcess13   79.95
## BiologicalMaterial03     77.35
## ManufacturingProcess17   76.96
## ManufacturingProcess06   69.87
## BiologicalMaterial12     63.49
## ManufacturingProcess11   62.07
## ManufacturingProcess31   61.21
## BiologicalMaterial02     58.82
## BiologicalMaterial11     50.59
## ManufacturingProcess29   50.08
## BiologicalMaterial04     45.29
## ManufacturingProcess33   45.21
## BiologicalMaterial08     40.36
## ManufacturingProcess30   38.55
## ManufacturingProcess25   38.25
## ManufacturingProcess18   38.03

Answer: From the SVM model, the optimal nonlinear regression model, the top 20 most important predictors resulted in 13 manufacturing process variables and 7 biological material variables as seen above.

Overall, I believe the manufacturing variables provide more importance than the biological variables in the nonlinear regression model, but I don’t believe the manufacturing variables dominate in the same way they did in the linear regression model.

Compared to the results from the linear regression model, the nonlinear regression model for SVM results in two biological variables in the top 10, BiologicalMaterial06 in 2nd and BiologicalMaterial03 in 6th. Also, 4 of the biological variables scored above 53.9%, which was the highest score of a biological predictor from the linear model. These results support the assessment that the manufacturing variables don’t dominate the model as much in the nonlinear model as previously seen in the linear model.

Looking at the top 10 lists of the optimal nonlinear and linear models, the top 6 manufacturing predictors from the linear model are also in the top 10 of the SVM nonlinear model. But, interestingly, the 3 biological predictors in the top 10 of the non-linear model are not in the top 10 of the linear model, so the biological predictors don’t behave the same between the optimal linear and nonlinear models.

Section c

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

# From the identified predictors unique to the nonlinear regression model.

top_vars <- c("BiologicalMaterial06",
              "BiologicalMaterial03",
              "BiologicalMaterial12",
              "ManufacturingProcess31",
              "ManufacturingProcess29",
              "ManufacturingProcess30",
              "ManufacturingProcess25",
              "ManufacturingProcess18")

featurePlot(cmp_full_data[,top_vars], cmp_full_data$Yield)

The above plot shows some outliers in the manufacturing variables which make it quite difficult to evaluate the variable plots. I will remove the outliers with the goal of improving the plots for readability.

top_vars_and_yield <- c(top_vars, "Yield")
#dim(cmp_full_data[,top_vars_and_yield])

data_for_plot <- cmp_full_data[,top_vars_and_yield] %>%
  filter(ManufacturingProcess31 > -5 & ManufacturingProcess29 > -5 &
           ManufacturingProcess30 > -5 & ManufacturingProcess25 > -5 &
           ManufacturingProcess18 > -5)

# Appear to have removed 2 rows if '> -5' and 7 rows removed if '> -1.5'
#dim(data_for_plot)

featurePlot(data_for_plot[,top_vars], data_for_plot$Yield)

Turns out 2 observations were excluded to remove the extreme outliers from the manufacturing variables. The biological variables appear to have a distinct positive linear relationship with the Yield predictor variable.

Answer: By removing the outliers from the plots of the manufacturing variables, we see the manufacturing variables do not have a clear linear relationship with the predictor variable Yield. The visual evaluation of the two sets of feature plots supports the argument for the value of the SVM model as applied to the chemical manufacturing process data. The SVM model uses a “framework of robust regression in order to minimize the effect of outliers on the regression equations,” according to the textbook. As the textbook also indicates, “when data may contain influential observations, an alternative minimization metric that is less sensitive … can be used to find the best parameter estimates.”

So when considering the optimal nonlinear regression model, the SVM model outperforms the best linear regression model through identification of several important predictors containing outliers otherwise dismissed by the linear regression model. The underlying algorithm of the SVM model performs exactly as it should by minimizing the effect of the outliers. Yes, the optimal linear and nonlinear regression models share many important variables, but the SVM model does identify several unique variables with high importance by minimizing the effect of the extreme outliers.

corr <- cmp_full_data %>% 
  select(top_vars_and_yield) %>% cor()

corrplot(corr, method="number")

The numeric correlation plot above does not provide much more insight than already identified by the feature plots. Unsurprisingly, the 3 biological variables are highly correlated with each other and generally the manufacturing variables are highly correlated with each other.

DATA 624 Assignment 8

CUNY Fall 2021

Philip Tanofsky

14 November 2021