# Import required R libraries
library(AppliedPredictiveModeling)
library(caret)
library(tidyverse)
library(corrplot)
library(earth)
library(kernlab)
library(mlbench)
library(kableExtra)
# Set seed once for entire file
set.seed(200)
Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: \[y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\] where the \(x\) values are random variables uniformly distributed on [0, 1] (the simulation also creates five other non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
# n = number of patterns to create
# sd = standard deviation of noise
trainingData <- mlbench.friedman1(200, sd = 1)
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'.
## We convert the 'x' data from a matrix to a data frame.
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## Also simulate a large test set to estimate the true
## error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
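To connect the equation to the simulated data, here is a minimal hand-rolled sketch of the same generating process (my own illustration of the formula above, not the mlbench source):
# Illustrative only: simulate Friedman #1 by hand to mirror the equation
n <- 200
x <- matrix(runif(n * 10), ncol = 10)  # 10 uniform predictors; X6-X10 are non-informative
y <- 10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
  10 * x[, 4] + 5 * x[, 5] + rnorm(n, sd = 1)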
Tune several models on these data. For example:
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
ggplot(knnModel) + labs(title="kNN Model With Tuning")
varImp(knnModel)
## loess r-squared variable importance
##
## Overall
## X4 100.0000
## X1 95.5047
## X2 89.6186
## X5 45.2170
## X3 29.9330
## X9 6.3299
## X10 5.5182
## X8 3.2527
## X6 0.8884
## X7 0.0000
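Because kNN has no built-in importance measure, varImp falls back to a model-free filter: each predictor is related to the outcome with a loess smoother and scored by the resulting \(R^2\), which is why the header above reads "loess r-squared variable importance." A sketch of what I believe is the equivalent direct call in caret:
# Model-free importance: loess R^2 of each predictor against the outcome
filterVarImp(x = trainingData$x, y = trainingData$y, nonpara = TRUE)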
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
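For reference, postResample simply computes these three statistics from the paired vectors; a minimal sketch of the same arithmetic (note that caret reports \(R^2\) as the squared correlation):
# The same three test-set statistics computed by hand
rmse <- sqrt(mean((testData$y - knnPred)^2))
rsq  <- cor(testData$y, knnPred)^2
mae  <- mean(abs(testData$y - knnPred))
c(RMSE = rmse, Rsquared = rsq, MAE = mae)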
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
The above output of the kNN model from the book text results in an RMSE of 3.2040595 and \(R^2\) of 0.6819919. The top 5 predictors by importance are X1–X5, though not in that order.
Let’s see if I can do better with the other model approaches from the textbook.
# The findCorrelation function takes a correlation matrix and determines
# the column numbers that should be removed to keep all pair-wise
# correlations below a threshold
tooHigh <- findCorrelation(cor(trainingData$x), cutoff=0.75)
#trainXnnet <- trainingData$x[, -tooHigh]
#testXnnet <- testData$x[, -tooHigh]
tooHigh
## integer(0)
The findCorrelation function returns no column numbers to remove.
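That result makes sense here: the ten predictors are simulated independently, so no pairwise correlation approaches the 0.75 cutoff. On collinear data the function does flag columns; a toy sketch:
# Toy example: x2 is nearly a copy of x1, so one of the pair is flagged
toy <- data.frame(x1 = runif(100))
toy$x2 <- toy$x1 + rnorm(100, sd = 0.05)
toy$x3 <- runif(100)
findCorrelation(cor(toy), cutoff = 0.75)  # returns the index of x1 or x2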
ctrl <- trainControl(method = "cv", number = 10)
# Create a specific candidate set of models to evaluate:
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10),
                        # The next option is to use bagging (see the
                        # next chapter) instead of different random seeds
                        .bag = FALSE)
nnetTune <- train(trainingData$x, trainingData$y,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  # Automatically standardize data prior to modeling
                  # and prediction
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
                  maxit = 500)
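The MaxNWts bound mirrors the weight count of a single-hidden-layer network: with \(p\) inputs and \(h\) hidden units there are \(h(p+1)\) hidden-layer weights plus \(h+1\) output weights. A quick check for the largest candidate in the grid:
# Weights in a single-hidden-layer net: h*(p+1) hidden plus (h+1) output
p <- ncol(trainingData$x)  # 10 predictors
h <- 10                    # largest hidden-layer size in nnetGrid
h * (p + 1) + h + 1        # 121, matching the MaxNWts value above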
# Output tuned model
nnetTune
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.461393 0.7600116 1.938705
## 0.00 2 2.500361 0.7550256 1.996446
## 0.00 3 2.037909 0.8423482 1.607043
## 0.00 4 1.909649 0.8572080 1.536214
## 0.00 5 2.088743 0.8291840 1.606762
## 0.00 6 2.809956 0.7170857 2.023726
## 0.00 7 3.480605 0.6465869 2.487827
## 0.00 8 4.123930 0.5767445 2.759803
## 0.00 9 4.467039 0.4899385 2.904843
## 0.00 10 3.605718 0.6760246 2.475416
## 0.01 1 2.437184 0.7689762 1.934924
## 0.01 2 2.510985 0.7596191 1.988260
## 0.01 3 2.002978 0.8414854 1.557508
## 0.01 4 2.003354 0.8445293 1.549719
## 0.01 5 2.117566 0.8272396 1.672930
## 0.01 6 2.303155 0.8023191 1.844344
## 0.01 7 2.332790 0.8063403 1.863003
## 0.01 8 2.346956 0.7937476 1.881861
## 0.01 9 2.244813 0.8169741 1.778934
## 0.01 10 2.402267 0.7780340 1.965674
## 0.10 1 2.450909 0.7652283 1.942964
## 0.10 2 2.489401 0.7606440 1.997060
## 0.10 3 2.200693 0.8155496 1.786599
## 0.10 4 2.059317 0.8432349 1.651707
## 0.10 5 2.172667 0.8172341 1.718829
## 0.10 6 2.227873 0.8088523 1.766776
## 0.10 7 2.221611 0.8218031 1.793130
## 0.10 8 2.358921 0.7945830 1.831457
## 0.10 9 2.307354 0.7903244 1.872002
## 0.10 10 2.230084 0.8150670 1.789885
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0 and bag = FALSE.
# Plot trained/tuned model
ggplot(nnetTune) + labs(title="Neural Networks Model With Tuning")
# Make predictions on Test set
nnetPred <- predict(nnetTune, newdata = testData$x)
# Output prediction performance
postResample(pred = nnetPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.4788220 0.7870229 1.6803645
varImp(nnetTune)
## loess r-squared variable importance
##
## Overall
## X4 100.0000
## X1 95.5047
## X2 89.6186
## X5 45.2170
## X3 29.9330
## X9 6.3299
## X10 5.5182
## X8 3.2527
## X6 0.8884
## X7 0.0000
Result: The Neural Networks model results in an RMSE of 2.4788220 and \(R^2\) of 0.7870229, certainly an improvement in \(R^2\) over the initial kNN model. Again, the top 5 predictors by importance are X1–X5, though not in that order.
# Tune a MARS model
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2,
                        .nprune = 2:38)
# (seed already set once at the top of the file for reproducibility)
marsTuned <- train(trainingData$x,
                   trainingData$y,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = ctrl)
# Output model
marsTuned
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.334325 0.2599883 3.607719
## 1 3 3.599334 0.4805557 2.888987
## 1 4 2.637145 0.7290848 2.087677
## 1 5 2.283872 0.7939684 1.817343
## 1 6 2.125875 0.8183677 1.647491
## 1 7 1.766013 0.8733619 1.410328
## 1 8 1.671282 0.8842102 1.324258
## 1 9 1.645406 0.8867947 1.322041
## 1 10 1.597968 0.8926582 1.297518
## 1 11 1.540109 0.8996361 1.237949
## 1 12 1.545349 0.8992979 1.243771
## 1 13 1.535169 0.9010122 1.233571
## 1 14 1.529405 0.9018457 1.223874
## 1 15 1.529405 0.9018457 1.223874
## 1 16 1.529405 0.9018457 1.223874
## 1 17 1.529405 0.9018457 1.223874
## 1 18 1.529405 0.9018457 1.223874
## 1 19 1.529405 0.9018457 1.223874
## 1 20 1.529405 0.9018457 1.223874
## 1 21 1.529405 0.9018457 1.223874
## 1 22 1.529405 0.9018457 1.223874
## 1 23 1.529405 0.9018457 1.223874
## 1 24 1.529405 0.9018457 1.223874
## 1 25 1.529405 0.9018457 1.223874
## 1 26 1.529405 0.9018457 1.223874
## 1 27 1.529405 0.9018457 1.223874
## 1 28 1.529405 0.9018457 1.223874
## 1 29 1.529405 0.9018457 1.223874
## 1 30 1.529405 0.9018457 1.223874
## 1 31 1.529405 0.9018457 1.223874
## 1 32 1.529405 0.9018457 1.223874
## 1 33 1.529405 0.9018457 1.223874
## 1 34 1.529405 0.9018457 1.223874
## 1 35 1.529405 0.9018457 1.223874
## 1 36 1.529405 0.9018457 1.223874
## 1 37 1.529405 0.9018457 1.223874
## 1 38 1.529405 0.9018457 1.223874
## 2 2 4.334325 0.2599883 3.607719
## 2 3 3.599334 0.4805557 2.888987
## 2 4 2.637145 0.7290848 2.087677
## 2 5 2.271844 0.7927888 1.823675
## 2 6 2.114868 0.8200184 1.659485
## 2 7 1.780140 0.8733216 1.429346
## 2 8 1.663164 0.8891928 1.294968
## 2 9 1.460976 0.9122520 1.180387
## 2 10 1.399692 0.9175376 1.122526
## 2 11 1.380002 0.9216251 1.110556
## 2 12 1.312883 0.9284253 1.063321
## 2 13 1.285612 0.9343029 1.014216
## 2 14 1.328520 0.9286650 1.052185
## 2 15 1.322954 0.9298515 1.045527
## 2 16 1.341454 0.9283961 1.053190
## 2 17 1.344590 0.9280972 1.054209
## 2 18 1.340821 0.9285264 1.050274
## 2 19 1.340821 0.9285264 1.050274
## 2 20 1.340821 0.9285264 1.050274
## 2 21 1.340821 0.9285264 1.050274
## 2 22 1.340821 0.9285264 1.050274
## 2 23 1.340821 0.9285264 1.050274
## 2 24 1.340821 0.9285264 1.050274
## 2 25 1.340821 0.9285264 1.050274
## 2 26 1.340821 0.9285264 1.050274
## 2 27 1.340821 0.9285264 1.050274
## 2 28 1.340821 0.9285264 1.050274
## 2 29 1.340821 0.9285264 1.050274
## 2 30 1.340821 0.9285264 1.050274
## 2 31 1.340821 0.9285264 1.050274
## 2 32 1.340821 0.9285264 1.050274
## 2 33 1.340821 0.9285264 1.050274
## 2 34 1.340821 0.9285264 1.050274
## 2 35 1.340821 0.9285264 1.050274
## 2 36 1.340821 0.9285264 1.050274
## 2 37 1.340821 0.9285264 1.050274
## 2 38 1.340821 0.9285264 1.050274
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 13 and degree = 2.
# Plot model
ggplot(marsTuned) + labs(title="MARS Model With Tuning")
# Make predictions on Test set
marsPred <- predict(marsTuned, newdata = testData$x)
# Output prediction performance
postResample(pred = marsPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.2803060 0.9335241 1.0168673
# Variable importance of MARS model
varImp(marsTuned)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.33
## X2 48.88
## X5 15.63
## X3 0.00
Result: The MARS model results in an RMSE of 1.2803060 and \(R^2\) of 0.9335241 on the test data, quite an improvement in \(R^2\) over both the initial kNN model and the Neural Networks model.
To answer the textbook question: yes, the MARS model does select the informative predictors (those named X1–X5). That said, the variable X3 receives an overall importance of 0.0, so it contributes essentially nothing to the final model.
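To confirm which predictors survive the pruning, the final earth fit can be inspected directly; a sketch using earth's evimp, the importance measures varImp draws on for this model type:
# Predictors retained by the pruned MARS model, with earth's
# importance criteria (nsubsets, GCV, RSS)
evimp(marsTuned$finalModel)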
svmRTuned <- train(trainingData$x, trainingData$y,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength = 14,
                   trControl = ctrl)
# Output model
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.504105 0.7940789 1.987142
## 0.50 2.219946 0.8148914 1.750249
## 1.00 2.028115 0.8388693 1.590383
## 2.00 1.899331 0.8561464 1.486326
## 4.00 1.815659 0.8669649 1.424259
## 8.00 1.798345 0.8702827 1.427710
## 16.00 1.797071 0.8702865 1.431137
## 32.00 1.795181 0.8705334 1.429146
## 64.00 1.795181 0.8705334 1.429146
## 128.00 1.795181 0.8705334 1.429146
## 256.00 1.795181 0.8705334 1.429146
## 512.00 1.795181 0.8705334 1.429146
## 1024.00 1.795181 0.8705334 1.429146
## 2048.00 1.795181 0.8705334 1.429146
##
## Tuning parameter 'sigma' was held constant at a value of 0.06104815
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06104815 and C = 32.
# Output final model
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 32
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0610481539802731
##
## Number of Support Vectors : 152
##
## Objective Function Value : -76.9951
## Training error : 0.008468
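The constant sigma reported above is not part of the tuning grid; caret estimates it once from the training data using kernlab's sigest heuristic. A sketch of what I believe is the equivalent calculation (caret appears to average the two outer quantile estimates that sigest returns):
# sigest returns 10%, 50%, and 90% quantile estimates for sigma
sigmas <- kernlab::sigest(as.matrix(trainingData$x), scaled = TRUE)
mean(sigmas[-2])  # roughly the 0.061 value reported above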
# Plot of SVM model not valuable
#ggplot(svmRTuned) + labs(title="SVM Model With Tuning")
#head(predict(svmRTuned, testData$x))
# Make predictions on Test set
svmPred <- predict(svmRTuned, newdata = testData$x)
# Output prediction performance
postResample(pred = svmPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.0693470 0.8263551 1.5718998
varImp(svmRTuned)
## loess r-squared variable importance
##
## Overall
## X4 100.0000
## X1 95.5047
## X2 89.6186
## X5 45.2170
## X3 29.9330
## X9 6.3299
## X10 5.5182
## X8 3.2527
## X6 0.8884
## X7 0.0000
Result: The Support Vector Machines model results in an RMSE of 2.0693470 and \(R^2\) of 0.8263551 on the test set, certainly an improvement in \(R^2\) over the initial kNN model and the Neural Networks model, though not quite as good as the MARS model. Again, the top 5 predictors by importance are X1–X5, though not in that order.
Overall, the best performing model on the test set is the MARS model, with an impressive \(R^2\) of 0.9335241 on the test data, meaning the model accounts for over 93% of the variance in the test set. Then again, the training and test data were simulated with the same generating process, so it should not be too surprising that the models perform well. I believe the large test set (5,000 observations) makes this a fair evaluation of overall performance.
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
The data setup below is borrowed from my work in the previous assignment. The set.seed function is reset to match the value from that assignment.
# Reset seed from previous assignment
set.seed(8675309) # Jenny, I got your number
# From Exercise 6.3
data(ChemicalManufacturingProcess)
cmp_data <- as.data.frame(ChemicalManufacturingProcess)
# Impute missing values using the preProcess function with knnImpute,
# making sure not to transform the 'Yield' column, which is the response
cmp_preprocess_data <- preProcess(cmp_data[, -c(1)], method="knnImpute")
cmp_full_data <- predict(cmp_preprocess_data, cmp_data[, -c(1)])
cmp_full_data$Yield <- cmp_data$Yield
# Identify near zero variance columns for removal
nzv_cols <- nearZeroVar(cmp_full_data)
length(nzv_cols)
## [1] 1
# From: https://stackoverflow.com/questions/28043393/nearzerovar-function-in-caret
if(length(nzv_cols) > 0) cmp_full_data <- cmp_full_data[, -nzv_cols]
dim(cmp_full_data)
## [1] 176 57
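A quick sanity check (my own addition) that the imputation filled every gap, plus a look at which column was dropped as near-zero variance:
# knnImpute should leave no missing values behind
sum(is.na(cmp_full_data))            # expect 0
# Name of the near-zero-variance column identified above
names(cmp_data[, -c(1)])[nzv_cols]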
trainingRows <- createDataPartition(cmp_full_data$Yield, p = .80, list=FALSE)
# Training set
training_data <- cmp_full_data[trainingRows, ]
# Test set
test_data <- cmp_full_data[-trainingRows, ]
# Based on book example
ctrl <- trainControl(method = "cv", number = 10)
With the goal of assessing the nonlinear regression models on the chemical manufacturing process data, I have fit models of type neural networks, MARS, SVM, and kNN, as I did in the first exercise of this assignment.
# The findCorrelation function takes a correlation matrix and determines
# the column numbers that should be removed to keep all pair-wise
# correlations below a threshold
training_data_minus_y <- subset(training_data, select=-c(Yield))
tooHigh <- findCorrelation(cor(training_data_minus_y), cutoff=0.75)
trainXnnet <- training_data[, -tooHigh]
testXnnet <- test_data[, -tooHigh]
ctrl <- trainControl(method = "cv", number = 10)
# Create a specific candidate set of models to evaluate:
nnetGrid <- expand.grid(decay = c(0, 0.01, .1),
                        size = c(1:10),
                        bag = FALSE)
nnetTune_cmp <- train(Yield ~ .,
                      data = trainXnnet,
                      method = "avNNet",
                      tuneGrid = nnetGrid,
                      trControl = ctrl,
                      preProc = c("center", "scale"),
                      linout = TRUE,
                      trace = FALSE,
                      MaxNWts = 10 * (ncol(trainXnnet) + 1) + 10 + 1,
                      maxit = 500)
# Output tuned model
nnetTune_cmp
## Model Averaged Neural Network
##
## 144 samples
## 37 predictor
##
## Pre-processing: centered (37), scaled (37)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 129, 130, 130, 130, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 1.431184 0.4791997 1.165326
## 0.00 2 1.490655 0.3715362 1.261593
## 0.00 3 2.084346 0.3038504 1.678679
## 0.00 4 2.160302 0.2599626 1.738254
## 0.00 5 1.761031 0.3853397 1.430423
## 0.00 6 2.182373 0.2890228 1.752620
## 0.00 7 2.738784 0.3163743 2.098215
## 0.00 8 3.970852 0.2155929 2.582643
## 0.00 9 4.364210 0.2818610 2.901452
## 0.00 10 7.242408 0.1774099 4.248413
## 0.01 1 1.369164 0.4570740 1.099515
## 0.01 2 1.389743 0.4812962 1.122951
## 0.01 3 2.059833 0.2912552 1.574798
## 0.01 4 1.837103 0.3380343 1.371654
## 0.01 5 1.836712 0.3203407 1.426279
## 0.01 6 1.637615 0.4151693 1.342843
## 0.01 7 1.657243 0.4054110 1.285770
## 0.01 8 1.508695 0.4901474 1.194654
## 0.01 9 1.960660 0.3476448 1.520401
## 0.01 10 2.260141 0.3566402 1.651608
## 0.10 1 1.755761 0.4386429 1.258617
## 0.10 2 1.636166 0.4234651 1.210309
## 0.10 3 2.082464 0.3192228 1.541803
## 0.10 4 2.160683 0.3257065 1.600388
## 0.10 5 2.119334 0.3186189 1.483646
## 0.10 6 2.010512 0.3271826 1.478006
## 0.10 7 2.083727 0.3132577 1.495065
## 0.10 8 1.944911 0.3413724 1.511686
## 0.10 9 1.939889 0.3276584 1.414940
## 0.10 10 1.892039 0.3204802 1.404408
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1, decay = 0.01 and bag = FALSE.
# Plot trained/tuned model
ggplot(nnetTune_cmp) + labs(title="Neural Networks Model With Tuning")
# Make predictions on Test set
nnetPred <- predict(nnetTune_cmp, newdata = test_data)
# Output prediction performance
nnet_test_perf <- postResample(pred = nnetPred, obs = test_data$Yield)
nnet_test_perf
## RMSE Rsquared MAE
## 1.6747770 0.2824588 1.2292282
nnet_var_imp <- varImp(nnetTune_cmp)
nnet_var_imp
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 37)
##
## Overall
## ManufacturingProcess36 100.00
## BiologicalMaterial03 94.59
## ManufacturingProcess17 94.11
## ManufacturingProcess06 85.45
## ManufacturingProcess11 75.91
## BiologicalMaterial11 61.87
## ManufacturingProcess33 55.29
## ManufacturingProcess30 47.14
## ManufacturingProcess25 46.77
## BiologicalMaterial09 44.85
## ManufacturingProcess12 30.04
## ManufacturingProcess01 29.66
## ManufacturingProcess20 29.30
## BiologicalMaterial10 25.47
## ManufacturingProcess35 24.91
## ManufacturingProcess21 19.93
## ManufacturingProcess10 16.84
## ManufacturingProcess24 16.31
## ManufacturingProcess16 15.46
## ManufacturingProcess34 15.27
Result: The Neural Networks model results in an RMSE of 1.6747770 and \(R^2\) of 0.2824588 on the test set. My first reaction: not a good \(R^2\) value at all.
# Tune a MARS model on the chemical manufacturing data
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2,
                        .nprune = 2:38)
# (seed already set once at the top of this section for reproducibility)
marsTuned_cmp <- train(Yield ~ .,
                       data = training_data,
                       method = "earth",
                       # Explicitly declare the candidate models to test
                       tuneGrid = marsGrid,
                       trControl = ctrl)
# Output model
marsTuned_cmp
## Multivariate Adaptive Regression Spline
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 131, 131, 129, 131, 130, 128, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 1.426926 0.4382366 1.1434246
## 1 3 1.273534 0.5332807 1.0660261
## 1 4 1.236491 0.5519561 1.0193896
## 1 5 1.229455 0.5637839 0.9888648
## 1 6 1.277487 0.5472579 1.0267552
## 1 7 1.357491 0.4921570 1.0823859
## 1 8 1.379472 0.4923700 1.1001221
## 1 9 1.342980 0.5039311 1.0791890
## 1 10 1.384097 0.4916274 1.1083406
## 1 11 1.392302 0.4947727 1.1245497
## 1 12 1.372559 0.5155146 1.1052152
## 1 13 1.378398 0.5138871 1.1059596
## 1 14 1.374664 0.5248953 1.1103730
## 1 15 1.388739 0.5157009 1.1196297
## 1 16 1.388455 0.5158043 1.1199276
## 1 17 1.388455 0.5158043 1.1199276
## 1 18 1.388455 0.5158043 1.1199276
## 1 19 1.388455 0.5158043 1.1199276
## 1 20 1.388455 0.5158043 1.1199276
## 1 21 1.388455 0.5158043 1.1199276
## 1 22 1.388455 0.5158043 1.1199276
## 1 23 1.388455 0.5158043 1.1199276
## 1 24 1.388455 0.5158043 1.1199276
## 1 25 1.388455 0.5158043 1.1199276
## 1 26 1.388455 0.5158043 1.1199276
## 1 27 1.388455 0.5158043 1.1199276
## 1 28 1.388455 0.5158043 1.1199276
## 1 29 1.388455 0.5158043 1.1199276
## 1 30 1.388455 0.5158043 1.1199276
## 1 31 1.388455 0.5158043 1.1199276
## 1 32 1.388455 0.5158043 1.1199276
## 1 33 1.388455 0.5158043 1.1199276
## 1 34 1.388455 0.5158043 1.1199276
## 1 35 1.388455 0.5158043 1.1199276
## 1 36 1.388455 0.5158043 1.1199276
## 1 37 1.388455 0.5158043 1.1199276
## 1 38 1.388455 0.5158043 1.1199276
## 2 2 1.426926 0.4382366 1.1434246
## 2 3 1.290167 0.5213695 1.0616799
## 2 4 1.280474 0.5395705 1.0565108
## 2 5 1.267973 0.5318890 1.0543190
## 2 6 1.406288 0.5081843 1.1465652
## 2 7 1.407606 0.5126693 1.1314956
## 2 8 1.287724 0.5627628 1.0636373
## 2 9 1.303450 0.5317632 1.0639438
## 2 10 1.326812 0.5400931 1.0466505
## 2 11 1.388873 0.5262956 1.0917281
## 2 12 1.398588 0.5465208 1.0831373
## 2 13 1.487736 0.5210738 1.1264031
## 2 14 1.520211 0.5098230 1.1442476
## 2 15 1.476407 0.5322346 1.1229999
## 2 16 1.497239 0.5242669 1.1161376
## 2 17 1.466236 0.5467224 1.1079794
## 2 18 1.469903 0.5449608 1.1139425
## 2 19 1.520273 0.5167608 1.1405750
## 2 20 1.530656 0.5140442 1.1481545
## 2 21 1.513713 0.5215437 1.1493869
## 2 22 1.513713 0.5215437 1.1493869
## 2 23 1.513805 0.5214995 1.1533265
## 2 24 1.524664 0.5168372 1.1608298
## 2 25 1.527408 0.5155528 1.1624783
## 2 26 1.527408 0.5155528 1.1624783
## 2 27 1.527408 0.5155528 1.1624783
## 2 28 1.527408 0.5155528 1.1624783
## 2 29 1.527408 0.5155528 1.1624783
## 2 30 1.527408 0.5155528 1.1624783
## 2 31 1.527408 0.5155528 1.1624783
## 2 32 1.527408 0.5155528 1.1624783
## 2 33 1.527408 0.5155528 1.1624783
## 2 34 1.527408 0.5155528 1.1624783
## 2 35 1.527408 0.5155528 1.1624783
## 2 36 1.527408 0.5155528 1.1624783
## 2 37 1.527408 0.5155528 1.1624783
## 2 38 1.527408 0.5155528 1.1624783
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 5 and degree = 1.
# Plot model
ggplot(marsTuned_cmp) + labs(title="MARS Model With Tuning")
# Make predictions on Test set
marsPred <- predict(marsTuned_cmp, newdata = test_data)
# Output prediction performance
mars_test_perf <- postResample(pred = marsPred, obs = test_data$Yield)
mars_test_perf
## RMSE Rsquared MAE
## 1.2057906 0.6232327 0.9256242
# Variable importance of MARS model
mars_var_imp <- varImp(marsTuned_cmp)
mars_var_imp
## earth variable importance
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 49.05
## ManufacturingProcess17 0.00
Result: The MARS model results in an RMSE of 1.2057906 and \(R^2\) of 0.6232327 on the test set. Clearly a much better \(R^2\) than the Neural Networks model.
svmRTuned_cmp <- train(Yield ~ .,
                       data = training_data,
                       method = "svmRadial",
                       preProc = c("center", "scale"),
                       tuneLength = 14,
                       trControl = ctrl)
# Output model
svmRTuned_cmp
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 129, 128, 130, 129, 131, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.400064 0.5551365 1.1573287
## 0.50 1.268365 0.6087157 1.0539021
## 1.00 1.154556 0.6574527 0.9529924
## 2.00 1.116321 0.6657979 0.9099986
## 4.00 1.089719 0.6749753 0.8921569
## 8.00 1.099778 0.6678435 0.9033695
## 16.00 1.097377 0.6695076 0.9003905
## 32.00 1.097377 0.6695076 0.9003905
## 64.00 1.097377 0.6695076 0.9003905
## 128.00 1.097377 0.6695076 0.9003905
## 256.00 1.097377 0.6695076 0.9003905
## 512.00 1.097377 0.6695076 0.9003905
## 1024.00 1.097377 0.6695076 0.9003905
## 2048.00 1.097377 0.6695076 0.9003905
##
## Tuning parameter 'sigma' was held constant at a value of 0.0130225
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0130225 and C = 4.
# Output final model
svmRTuned_cmp$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 4
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0130224956000477
##
## Number of Support Vectors : 124
##
## Objective Function Value : -91.4615
## Training error : 0.031966
# Again, no need to plot SVM model
# ggplot(svmRTuned) + labs(title="SVM Model With Tuning")
# Make predictions on Test set
svmPred <- predict(svmRTuned_cmp, newdata = test_data)
# Output prediction performance
svm_test_perf <- postResample(pred = svmPred, obs = test_data$Yield)
svm_test_perf
## RMSE Rsquared MAE
## 1.2001067 0.6453839 0.9268041
svm_var_imp <- varImp(svmRTuned_cmp)
svm_var_imp
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 85.65
## ManufacturingProcess36 81.77
## ManufacturingProcess09 81.57
## ManufacturingProcess13 79.95
## BiologicalMaterial03 77.35
## ManufacturingProcess17 76.96
## ManufacturingProcess06 69.87
## BiologicalMaterial12 63.49
## ManufacturingProcess11 62.07
## ManufacturingProcess31 61.21
## BiologicalMaterial02 58.82
## BiologicalMaterial11 50.59
## ManufacturingProcess29 50.08
## BiologicalMaterial04 45.29
## ManufacturingProcess33 45.21
## BiologicalMaterial08 40.36
## ManufacturingProcess30 38.55
## ManufacturingProcess25 38.25
## ManufacturingProcess18 38.03
Result: The SVM model results in an RMSE of 1.2001067 and \(R^2\) of 0.6453839 on the test set, a much better \(R^2\) than the Neural Networks model and even a small improvement over the MARS model.
knnModel_cmp <- train(Yield ~ .,
                      data = training_data,
                      method = "knn",
                      preProc = c("center", "scale"),
                      trControl = ctrl)
# Note: the original run misspelled this argument as 'trContorl', so caret
# silently ignored it and fell back to its default bootstrap resampling,
# which is why the output below reports bootstrapping rather than 10-fold CV.
knnModel_cmp
## k-Nearest Neighbors
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.462848 0.3646681 1.174720
## 7 1.442976 0.3795977 1.164743
## 9 1.430208 0.3924194 1.159082
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
ggplot(knnModel_cmp) + labs(title="kNN Model With Tuning")
knnPred <- predict(knnModel_cmp, newdata = test_data)
knn_test_perf <- postResample(pred = knnPred, obs = test_data$Yield)
knn_test_perf
## RMSE Rsquared MAE
## 1.4025385 0.5479653 1.1086111
knn_var_imp <- varImp(knnModel_cmp)
knn_var_imp
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 85.65
## ManufacturingProcess36 81.77
## ManufacturingProcess09 81.57
## ManufacturingProcess13 79.95
## BiologicalMaterial03 77.35
## ManufacturingProcess17 76.96
## ManufacturingProcess06 69.87
## BiologicalMaterial12 63.49
## ManufacturingProcess11 62.07
## ManufacturingProcess31 61.21
## BiologicalMaterial02 58.82
## BiologicalMaterial11 50.59
## ManufacturingProcess29 50.08
## BiologicalMaterial04 45.29
## ManufacturingProcess33 45.21
## BiologicalMaterial08 40.36
## ManufacturingProcess30 38.55
## ManufacturingProcess25 38.25
## ManufacturingProcess18 38.03
Result: The kNN model results in an RMSE of 1.4025385 and \(R^2\) of 0.5479653 on the test set, a better \(R^2\) than the Neural Networks model, but it under-performed the MARS and SVM models.
Which nonlinear regression model gives the optimal re-sampling and test set performance?
perf_results <- data.frame(NNet=nnet_test_perf, MARS=mars_test_perf, SVM=svm_test_perf, kNN=knn_test_perf)
perf_results %>% t() %>% kable(caption="Comparison of Model Performance", digits=4) %>%
kable_styling(bootstrap_options = c("hover", "striped"))
| Model | RMSE   | Rsquared | MAE    |
|-------|-------:|---------:|-------:|
| NNet  | 1.6748 | 0.2825   | 1.2292 |
| MARS  | 1.2058 | 0.6232   | 0.9256 |
| SVM   | 1.2001 | 0.6454   | 0.9268 |
| kNN   | 1.4025 | 0.5480   | 1.1086 |

Table: Comparison of Model Performance
Answer: As the above table indicates, the SVM model attained the lowest RMSE and the highest \(R^2\).
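The question also asks about re-sampling performance. The test-set table can be complemented with caret's resamples, which collects the cross-validation metrics from each fit; a sketch, assuming all four models were trained with the same 10-fold CV trainControl:
# Compare resampling (10-fold CV) performance across the four tuned fits
cv_results <- resamples(list(NNet = nnetTune_cmp,
                             MARS = marsTuned_cmp,
                             SVM  = svmRTuned_cmp,
                             kNN  = knnModel_cmp))
summary(cv_results)
Judging from the resampling outputs shown earlier, the rankings agree: the best SVM fit reached a CV RMSE of 1.0897 (C = 4) versus 1.2295 for the best MARS fit, so SVM is optimal on both re-sampling and test-set performance.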
Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
svm_var_imp
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 85.65
## ManufacturingProcess36 81.77
## ManufacturingProcess09 81.57
## ManufacturingProcess13 79.95
## BiologicalMaterial03 77.35
## ManufacturingProcess17 76.96
## ManufacturingProcess06 69.87
## BiologicalMaterial12 63.49
## ManufacturingProcess11 62.07
## ManufacturingProcess31 61.21
## BiologicalMaterial02 58.82
## BiologicalMaterial11 50.59
## ManufacturingProcess29 50.08
## BiologicalMaterial04 45.29
## ManufacturingProcess33 45.21
## BiologicalMaterial08 40.36
## ManufacturingProcess30 38.55
## ManufacturingProcess25 38.25
## ManufacturingProcess18 38.03
Answer: From the SVM model, the optimal nonlinear regression model, the top 20 most important predictors include 13 manufacturing process variables and 7 biological material variables, as seen above.
Overall, I believe the manufacturing variables carry more importance than the biological variables in the nonlinear regression model, but I don't believe they dominate in the same way they did in the linear regression model.
Compared to the results from the linear regression model, the SVM nonlinear model places three biological variables in the top 10: BiologicalMaterial06 in 2nd, BiologicalMaterial03 in 6th, and BiologicalMaterial12 in 9th. Also, 4 of the biological variables scored above 53.9%, which was the highest score of a biological predictor from the linear model. These results support the assessment that the manufacturing variables don't dominate the nonlinear model as much as they did the linear model.
Looking at the top 10 lists of the optimal nonlinear and linear models, the top 6 manufacturing predictors from the linear model are also in the top 10 of the SVM nonlinear model. Interestingly, though, the three biological predictors in the top 10 of the nonlinear model are not in the top 10 of the linear model, so the biological predictors do not behave the same between the optimal linear and nonlinear models.
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
# Predictors identified as unique to the optimal nonlinear regression model
top_vars <- c("BiologicalMaterial06",
              "BiologicalMaterial03",
              "BiologicalMaterial12",
              "ManufacturingProcess31",
              "ManufacturingProcess29",
              "ManufacturingProcess30",
              "ManufacturingProcess25",
              "ManufacturingProcess18")
featurePlot(cmp_full_data[, top_vars], cmp_full_data$Yield)
The above plot shows some outliers in the manufacturing variables that make the panels quite difficult to read. I will remove the outliers to improve the plots' readability.
top_vars_and_yield <- c(top_vars, "Yield")
#dim(cmp_full_data[, top_vars_and_yield])
data_for_plot <- cmp_full_data[, top_vars_and_yield] %>%
  filter(ManufacturingProcess31 > -5 & ManufacturingProcess29 > -5 &
         ManufacturingProcess30 > -5 & ManufacturingProcess25 > -5 &
         ManufacturingProcess18 > -5)
# Filtering at '> -5' removes 2 rows; a stricter '> -1.5' would remove 7
#dim(data_for_plot)
featurePlot(data_for_plot[, top_vars], data_for_plot$Yield)
It turns out 2 observations were excluded to remove the extreme outliers from the manufacturing variables. The biological variables appear to have a distinct positive linear relationship with the Yield response variable.
Answer: With the outliers removed from the plots of the manufacturing variables, we see that the manufacturing variables do not have a clear linear relationship with the response variable Yield. The visual evaluation of the two sets of feature plots supports the case for the SVM model as applied to the chemical manufacturing process data. The SVM model uses a "framework of robust regression in order to minimize the effect of outliers on the regression equations," according to the textbook. As the textbook also indicates, "when data may contain influential observations, an alternative minimization metric that is less sensitive … can be used to find the best parameter estimates."
So when considering the optimal nonlinear regression model, the SVM model outperforms the best linear regression model by identifying several important predictors containing outliers that the linear regression model otherwise discounted. The underlying algorithm of the SVM model performs exactly as it should by minimizing the effect of the outliers. Yes, the optimal linear and nonlinear regression models share many important variables, but the SVM model does identify several unique variables with high importance by damping the effect of the extreme outliers.
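Concretely, the robustness comes from the \(\epsilon\)-insensitive loss that the eps-svr fit above minimizes: residuals smaller than \(\epsilon\) cost nothing, and larger residuals grow only linearly rather than quadratically, so a handful of extreme outliers cannot dominate the parameter estimates. \[L_\epsilon(y, \hat{y}) = \max\left(0, \; |y - \hat{y}| - \epsilon\right)\] With the default \(\epsilon = 0.1\) shown in the model output, observations within 0.1 of the fitted value contribute no loss at all.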
corr <- cmp_full_data %>%
  select(all_of(top_vars_and_yield)) %>%
  cor()
corrplot(corr, method="number")
The numeric correlation plot above does not provide much insight beyond what the feature plots already revealed. Unsurprisingly, the three biological variables are highly correlated with each other, and the manufacturing variables are generally highly correlated with each other.