Applied Predictive
Modeling
Chapter 7 Nonlinear Regression Models
Friedman (1991) introduced several benchmark data sets create by
simulation. One of these simulations used the following nonlinear
equation to create data:
y = 10 sin(πx1x2) + 20(x3 − 0.5)2 + 10x4 + 5x5 + N(0, σ2)
where the x values are random variables uniformly distributed between
[0, 1] (there are also 5 other non-informative variables also created in
the simulation). The package mlbench contains a function
called mlbench.friedman1 that simulates these data:
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data
featurePlot(trainingData$x, trainingData$y)
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Tune several models on these data.
knnModel <- train(
x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
# The function 'postResample' can be used to get the test set perforamnce values
knn_results = postResample(pred = knnPred, obs = testData$y)
knn_results
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
The optimal number of neighbors in the KNN model is 19 based on having the lowest RMSE value.
Neural Networks are often adversely affected by high correlation among the predictor variables due to the use of gradients to optimize the model parameters.
Before fitting the model, predictors associated with high correlations are identified and filtered out.
# identify predictors to remove to ensure that the maximum absolute pairwise
# correlation between the predictors is less than 0.75.
tooHigh <- findCorrelation(cor(trainingData$x), cutoff = .75)
tooHigh
## integer(0)
None of the predictors are significantly correlated; thus, no predictors need to be removed from the data.
## Create a specific candidate set of models to evaluate:
nnetGrid <- expand.grid(
.decay = c(0, 0.01, .1),.size = c(1:10),
.bag = FALSE)
ctrl <- trainControl(method = "cv", number = 10)
set.seed(100)
# mimic approach of choosing the number of hidden units and the amount of weight decay via resampling
nnetTune <- train(trainingData$x, trainingData$y,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = ctrl,
# Automatically standardize data prior to modeling and prediction
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
maxit = 500)
nnetTune
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.399958 0.7613127 1.893172
## 0.00 2 2.417296 0.7555717 1.913551
## 0.00 3 2.044313 0.8222590 1.631380
## 0.00 4 2.346892 0.8036428 1.791547
## 0.00 5 2.558811 0.7550840 1.856888
## 0.00 6 3.319656 0.6865631 2.213605
## 0.00 7 3.797724 0.6325088 2.639918
## 0.00 8 6.233372 0.5076975 3.538537
## 0.00 9 4.349457 0.5878049 2.841743
## 0.00 10 3.585916 0.6419600 2.532514
## 0.01 1 2.385355 0.7603048 1.887716
## 0.01 2 2.417179 0.7524496 1.930389
## 0.01 3 2.151198 0.8016034 1.701939
## 0.01 4 2.091926 0.8154384 1.676653
## 0.01 5 2.192652 0.7945339 1.755759
## 0.01 6 2.254642 0.8059197 1.808248
## 0.01 7 2.362394 0.7788018 1.894616
## 0.01 8 2.465421 0.7657129 1.952369
## 0.01 9 2.398000 0.7650013 1.948351
## 0.01 10 2.480909 0.7437663 1.983851
## 0.10 1 2.393966 0.7596430 1.894195
## 0.10 2 2.431003 0.7520281 1.929497
## 0.10 3 2.169934 0.7982368 1.726892
## 0.10 4 2.059080 0.8224159 1.648607
## 0.10 5 1.991051 0.8346028 1.599298
## 0.10 6 2.143273 0.8101609 1.684721
## 0.10 7 2.121592 0.8190956 1.649157
## 0.10 8 2.302167 0.7881632 1.817938
## 0.10 9 2.348140 0.7747370 1.817227
## 0.10 10 2.340229 0.7677416 1.858249
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.1 and bag = FALSE.
# review performance metrics
nnetTunePred <- predict(nnetTune, newdata = testData$x)
nnet_results = postResample(pred = nnetTunePred, obs = testData$y)
nnet_results
## RMSE Rsquared MAE
## 1.9823587 0.8458179 1.4858719
The optimal neural network model based on the lowest RMSE value used 5 hidden units and a decay of 0.1
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
# Fix the seed so that the results can be reproduced
set.seed(100)
marsTuned <- train(trainingData$x, trainingData$y,
method = "earth",
# Explicitly declare the candidate models to test
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsTuned
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.327937 0.2544880 3.600474
## 1 3 3.572450 0.4912720 2.895811
## 1 4 2.596841 0.7183600 2.106341
## 1 5 2.370161 0.7659777 1.918669
## 1 6 2.276141 0.7881481 1.810001
## 1 7 1.766728 0.8751831 1.390215
## 1 8 1.780946 0.8723243 1.401345
## 1 9 1.665091 0.8819775 1.325515
## 1 10 1.663804 0.8821283 1.327657
## 1 11 1.657738 0.8822967 1.331730
## 1 12 1.653784 0.8827903 1.331504
## 1 13 1.648496 0.8823663 1.316407
## 1 14 1.639073 0.8841742 1.312833
## 1 15 1.639073 0.8841742 1.312833
## 1 16 1.639073 0.8841742 1.312833
## 1 17 1.639073 0.8841742 1.312833
## 1 18 1.639073 0.8841742 1.312833
## 1 19 1.639073 0.8841742 1.312833
## 1 20 1.639073 0.8841742 1.312833
## 1 21 1.639073 0.8841742 1.312833
## 1 22 1.639073 0.8841742 1.312833
## 1 23 1.639073 0.8841742 1.312833
## 1 24 1.639073 0.8841742 1.312833
## 1 25 1.639073 0.8841742 1.312833
## 1 26 1.639073 0.8841742 1.312833
## 1 27 1.639073 0.8841742 1.312833
## 1 28 1.639073 0.8841742 1.312833
## 1 29 1.639073 0.8841742 1.312833
## 1 30 1.639073 0.8841742 1.312833
## 1 31 1.639073 0.8841742 1.312833
## 1 32 1.639073 0.8841742 1.312833
## 1 33 1.639073 0.8841742 1.312833
## 1 34 1.639073 0.8841742 1.312833
## 1 35 1.639073 0.8841742 1.312833
## 1 36 1.639073 0.8841742 1.312833
## 1 37 1.639073 0.8841742 1.312833
## 1 38 1.639073 0.8841742 1.312833
## 2 2 4.327937 0.2544880 3.600474
## 2 3 3.572450 0.4912720 2.895811
## 2 4 2.661826 0.7070510 2.173471
## 2 5 2.404015 0.7578971 1.975387
## 2 6 2.243927 0.7914805 1.783072
## 2 7 1.856336 0.8605482 1.435682
## 2 8 1.754607 0.8763186 1.396841
## 2 9 1.603578 0.8938666 1.261361
## 2 10 1.492421 0.9084998 1.168700
## 2 11 1.317350 0.9292504 1.033926
## 2 12 1.304327 0.9320133 1.019108
## 2 13 1.277510 0.9323681 1.002927
## 2 14 1.269626 0.9350024 1.003346
## 2 15 1.266217 0.9359400 1.013893
## 2 16 1.268470 0.9354868 1.011414
## 2 17 1.268470 0.9354868 1.011414
## 2 18 1.268470 0.9354868 1.011414
## 2 19 1.268470 0.9354868 1.011414
## 2 20 1.268470 0.9354868 1.011414
## 2 21 1.268470 0.9354868 1.011414
## 2 22 1.268470 0.9354868 1.011414
## 2 23 1.268470 0.9354868 1.011414
## 2 24 1.268470 0.9354868 1.011414
## 2 25 1.268470 0.9354868 1.011414
## 2 26 1.268470 0.9354868 1.011414
## 2 27 1.268470 0.9354868 1.011414
## 2 28 1.268470 0.9354868 1.011414
## 2 29 1.268470 0.9354868 1.011414
## 2 30 1.268470 0.9354868 1.011414
## 2 31 1.268470 0.9354868 1.011414
## 2 32 1.268470 0.9354868 1.011414
## 2 33 1.268470 0.9354868 1.011414
## 2 34 1.268470 0.9354868 1.011414
## 2 35 1.268470 0.9354868 1.011414
## 2 36 1.268470 0.9354868 1.011414
## 2 37 1.268470 0.9354868 1.011414
## 2 38 1.268470 0.9354868 1.011414
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.
# review performance metrics
marsTunedPred <- predict(marsTuned, newdata = testData$x)
mars_results = postResample(pred = marsTunedPred, obs = testData$y)
mars_results
## RMSE Rsquared MAE
## 1.1589948 0.9460418 0.9250230
The most optimal MARS model based on the lowest RMSE value is a second-degree model with 15 terms.
svmRTuned <- train(trainingData$x, trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 32
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0631548298520486
##
## Number of Support Vectors : 150
##
## Objective Function Value : -73.2538
## Training error : 0.008502
# review performance metrics
svmRTunedPred <- predict(svmRTuned, newdata = testData$x)
svm_results = postResample(pred = svmRTunedPred, obs = testData$y)
svm_results
## RMSE Rsquared MAE
## 2.0741633 0.8255819 1.5755127
Cross-validation results for the SVM model fit a model with a scale factor of 0.1 and a cost value of 32.
Which models appear to give the best performance?
# compile all model performance results
final_results = rbind(
knn_results,
nnet_results,
mars_results,
svm_results
)
# add model name column
final_results = data.frame(cbind(Model = c('KNN', 'NNET', 'MARS', 'SVM'), final_results))
rownames(final_results) <- NULL
# output results
knit_table(final_results |> arrange(RMSE), 'Model Performance Metrics')
| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| MARS | 1.15899475879587 | 0.946041768997385 | 0.925023038062675 |
| NNET | 1.98235867606548 | 0.845817911406657 | 1.48587190378097 |
| SVM | 2.07416325523646 | 0.825581870904828 | 1.57551266804197 |
| KNN | 3.20405949374908 | 0.681991908733613 | 2.56834607044141 |
The MARS model appears to have best performance due to having the lowest RMSE.
Does MARS select the informative predictors (those named X1–X5)?
# estimate the importance of each predictor
mars_importance = varImp(marsTuned)[1]$importance
mars_importance = rownames_to_column(mars_importance, var = "Predictor")
# output results
knit_table(mars_importance, 'MARS Predictor Estimated Importance')
| Predictor | Overall |
|---|---|
| X1 | 100.00000 |
| X4 | 75.23592 |
| X2 | 48.72974 |
| X5 | 15.51884 |
| X3 | 0.00000 |
plot(varImp(marsTuned), main = "MARS Predictor Importance - Top 10")
The MARS model did select the informative predictors (those named X1–X5) based on the estimated importance of each predictor. X1 appears to be the most significant followed by X4, X2, X5, and X3 respectively. X3 appears to have have little to no significance due to have an importance overall importance of 0.
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
Impute, Transform, and Scale the Data
BiologicalMaterial07 was identified to
have near zero variance due to having very few unique values relative to
the number of samples. This columns was removed from the data.data(ChemicalManufacturingProcess)
# impute, transform, and standardize the data
data_prep_process <- preProcess(
ChemicalManufacturingProcess,
method = c("knnImpute", "BoxCox", "center", "scale"))
prep_data <- predict(data_prep_process, ChemicalManufacturingProcess)
# identify predictors with zero or near zero variance
colnames(prep_data)[nearZeroVar(prep_data)]
## [1] "BiologicalMaterial07"
# remove columns with zero or near zero variance
prep_data = prep_data[,-nearZeroVar(prep_data)]
Split the Data
# Set the random number seed so we can reproduce the results
set.seed(20)
trainingRows <- createDataPartition(prep_data$Yield, p = .80, list= FALSE)
#split data into train and test sets
train_chem <- prep_data[trainingRows, ]
test_chem <- prep_data[-trainingRows, ]
knnModel_chem <- train(Yield ~., data = train_chem,
method = "knn",
preProcess = c("center", "scale"),
tuneLength = 10
)
knnModel_chem
## k-Nearest Neighbors
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.8015017 0.3823492 0.6458311
## 7 0.8012521 0.3820255 0.6478658
## 9 0.7948816 0.3885799 0.6431372
## 11 0.7946423 0.3899859 0.6414679
## 13 0.7965878 0.3892959 0.6433097
## 15 0.7994008 0.3858255 0.6479619
## 17 0.7985998 0.3912364 0.6470940
## 19 0.7971403 0.3980982 0.6451050
## 21 0.8006246 0.3959270 0.6485058
## 23 0.8053417 0.3901015 0.6514357
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
knnPred_chem <- predict(knnModel_chem, newdata = test_chem)
knn_results_chem = postResample(pred = knnPred_chem, obs = test_chem$Yield)
knn_results_chem
## RMSE Rsquared MAE
## 0.6492212 0.4874072 0.4683866
The optimal number of neighbors in the KNN model is 11 based on having the lowest RMSE value.
Neural Networks are often adversely affected by high correlation among the predictor variables due to the use of gradients to optimize the model parameters.
Before fitting the model, predictors associated with high correlations are identified and filtered out.
# identify predictors to remove to ensure that the maximum absolute pairwise
# correlation between the predictors is less than 0.75.
tooHigh <- findCorrelation(cor(train_chem), cutoff = .75)
knit_table(tibble(`Highly Correlated Predictors` = colnames(train_chem)[tooHigh]))
| Highly Correlated Predictors |
|---|
| BiologicalMaterial02 |
| BiologicalMaterial06 |
| BiologicalMaterial01 |
| BiologicalMaterial08 |
| BiologicalMaterial04 |
| BiologicalMaterial12 |
| ManufacturingProcess32 |
| ManufacturingProcess15 |
| ManufacturingProcess29 |
| ManufacturingProcess13 |
| ManufacturingProcess10 |
| ManufacturingProcess42 |
| ManufacturingProcess45 |
| ManufacturingProcess26 |
| ManufacturingProcess39 |
| ManufacturingProcess27 |
| ManufacturingProcess31 |
| ManufacturingProcess18 |
| ManufacturingProcess40 |
# Remove column with high correlations to keep all pair-wise correlations below a 0.75 threshold
train_nnet_chem <- train_chem[, -tooHigh]
test_nnet_chem <- test_chem[, -tooHigh]
19 predictors were identified to have pair-wise correlations greater than 0.75. These columns were removed from the data before fitting the neural network model.
# mimic approach of choosing the number of hidden units and the amount of weight decay via resampling
nnetTune_chem <- train(Yield ~ ., data = train_nnet_chem,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(train_nnet_chem) + 1) + 10 + 1,
maxit = 500)
nnetTune_chem
## Model Averaged Neural Network
##
## 144 samples
## 37 predictor
##
## Pre-processing: centered (37), scaled (37)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 129, 130, 129, 131, 130, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 0.9215431 0.4085737 0.6936327
## 0.00 2 0.7530974 0.4798629 0.6166415
## 0.00 3 0.8347486 0.4546008 0.6615904
## 0.00 4 0.7986780 0.4759001 0.6187075
## 0.00 5 0.8240804 0.4415768 0.6695649
## 0.00 6 0.8709329 0.3934973 0.7096894
## 0.00 7 0.8296750 0.4654343 0.6773977
## 0.00 8 0.7559354 0.5047882 0.6215966
## 0.00 9 0.7590205 0.4899559 0.6037898
## 0.00 10 0.7598074 0.4991927 0.6129294
## 0.01 1 0.7964384 0.4556596 0.6335519
## 0.01 2 0.8240593 0.4510977 0.6801288
## 0.01 3 0.7587551 0.4874756 0.6200164
## 0.01 4 0.7838466 0.4572082 0.6262609
## 0.01 5 0.7062712 0.5628523 0.5675729
## 0.01 6 0.7124832 0.5505332 0.5641237
## 0.01 7 0.6879704 0.5772658 0.5401702
## 0.01 8 0.6791100 0.5891584 0.5206540
## 0.01 9 0.7181114 0.5493945 0.5868633
## 0.01 10 0.7282227 0.5512819 0.5789078
## 0.10 1 0.8200261 0.4562204 0.6530313
## 0.10 2 0.7654329 0.4771717 0.6208142
## 0.10 3 0.7548395 0.4814655 0.6071775
## 0.10 4 0.7594984 0.5231600 0.6168862
## 0.10 5 0.6882829 0.5729120 0.5397624
## 0.10 6 0.6805450 0.5765358 0.5365855
## 0.10 7 0.6745016 0.5840402 0.5553380
## 0.10 8 0.7126941 0.5357889 0.5719314
## 0.10 9 0.7431749 0.5330285 0.5987037
## 0.10 10 0.7215311 0.5379258 0.5893741
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 7, decay = 0.1 and bag = FALSE.
# review performance metrics
nnetTunePred_chem <- predict(nnetTune_chem, newdata = test_nnet_chem)
nnet_results_chem = postResample(pred = nnetTunePred_chem, obs = test_nnet_chem$Yield)
nnet_results_chem
## RMSE Rsquared MAE
## 0.5496806 0.6633652 0.4538905
The optimal neural network model based on the lowest RMSE value used 7 hidden units and a decay of 0.1
# Fix the seed so that the results can be reproduced
set.seed(100)
marsTuned_chem <- train(Yield ~ ., data = train_chem,
method = "earth",
# Explicitly declare the candidate models to test
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsTuned_chem
## Multivariate Adaptive Regression Spline
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.7740598 0.4733556 0.6087136
## 1 3 0.6996593 0.5417369 0.5490306
## 1 4 0.6605515 0.5834685 0.5378210
## 1 5 0.6814633 0.5435487 0.5672462
## 1 6 0.7024723 0.5376756 0.5669634
## 1 7 0.6743175 0.5767315 0.5401251
## 1 8 0.6764564 0.5843267 0.5427274
## 1 9 0.6758093 0.5853744 0.5395543
## 1 10 0.7000308 0.5631781 0.5453968
## 1 11 0.6827439 0.5894983 0.5211233
## 1 12 0.6712193 0.6039004 0.5194857
## 1 13 0.6831724 0.5943378 0.5325571
## 1 14 0.6792247 0.6016117 0.5241257
## 1 15 0.6724057 0.6059972 0.5228352
## 1 16 0.6754366 0.6027430 0.5231209
## 1 17 0.6737388 0.6048366 0.5192630
## 1 18 0.6737388 0.6048366 0.5192630
## 1 19 0.6737388 0.6048366 0.5192630
## 1 20 0.6737388 0.6048366 0.5192630
## 1 21 0.6737388 0.6048366 0.5192630
## 1 22 0.6737388 0.6048366 0.5192630
## 1 23 0.6737388 0.6048366 0.5192630
## 1 24 0.6737388 0.6048366 0.5192630
## 1 25 0.6737388 0.6048366 0.5192630
## 1 26 0.6737388 0.6048366 0.5192630
## 1 27 0.6737388 0.6048366 0.5192630
## 1 28 0.6737388 0.6048366 0.5192630
## 1 29 0.6737388 0.6048366 0.5192630
## 1 30 0.6737388 0.6048366 0.5192630
## 1 31 0.6737388 0.6048366 0.5192630
## 1 32 0.6737388 0.6048366 0.5192630
## 1 33 0.6737388 0.6048366 0.5192630
## 1 34 0.6737388 0.6048366 0.5192630
## 1 35 0.6737388 0.6048366 0.5192630
## 1 36 0.6737388 0.6048366 0.5192630
## 1 37 0.6737388 0.6048366 0.5192630
## 1 38 0.6737388 0.6048366 0.5192630
## 2 2 0.7740598 0.4733556 0.6087136
## 2 3 0.6865599 0.5550566 0.5484936
## 2 4 0.6485393 0.6060205 0.5202082
## 2 5 0.6583152 0.5847700 0.5180857
## 2 6 0.6619458 0.5748836 0.5411086
## 2 7 0.6593264 0.5849328 0.5448480
## 2 8 0.7455957 0.5147119 0.5922504
## 2 9 0.7443781 0.5214152 0.5801832
## 2 10 0.7295405 0.5459154 0.5654232
## 2 11 0.7604953 0.5071400 0.5721723
## 2 12 0.7566631 0.5213203 0.5707833
## 2 13 0.7541511 0.5290848 0.5728947
## 2 14 0.7634419 0.5177030 0.5804917
## 2 15 0.7975741 0.5027785 0.5971431
## 2 16 0.8589678 0.4888981 0.6209762
## 2 17 0.9017096 0.4842544 0.6289030
## 2 18 0.9260934 0.4680455 0.6453277
## 2 19 1.5936833 0.4330744 0.8559668
## 2 20 1.6512573 0.4213563 0.8863791
## 2 21 1.6478361 0.4222369 0.8801275
## 2 22 1.6473637 0.4213893 0.8821900
## 2 23 1.6482950 0.4214995 0.8840967
## 2 24 1.6516864 0.4178207 0.8862218
## 2 25 1.6516864 0.4178207 0.8862218
## 2 26 1.6516864 0.4178207 0.8862218
## 2 27 1.6516864 0.4178207 0.8862218
## 2 28 1.6516864 0.4178207 0.8862218
## 2 29 1.6516864 0.4178207 0.8862218
## 2 30 1.6516864 0.4178207 0.8862218
## 2 31 1.6516864 0.4178207 0.8862218
## 2 32 1.6516864 0.4178207 0.8862218
## 2 33 1.6516864 0.4178207 0.8862218
## 2 34 1.6516864 0.4178207 0.8862218
## 2 35 1.6516864 0.4178207 0.8862218
## 2 36 1.6516864 0.4178207 0.8862218
## 2 37 1.6516864 0.4178207 0.8862218
## 2 38 1.6516864 0.4178207 0.8862218
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 4 and degree = 2.
# review performance metrics
marsTunedPred_chem <- predict(marsTuned_chem, newdata = test_chem)
mars_results_chem = postResample(pred = marsTunedPred_chem, obs = test_chem$Yield)
mars_results_chem
## RMSE Rsquared MAE
## 0.5580107 0.6202870 0.4588720
The most optimal MARS model based on the lowest RMSE value is a second-degree model with 4 terms.
svmRTuned_chem <- train(Yield ~ ., data = train_chem,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTuned_chem
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 129, 130, 131, 130, 128, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7571299 0.5092540 0.6254305
## 0.50 0.6990910 0.5537348 0.5747492
## 1.00 0.6639040 0.5901893 0.5412185
## 2.00 0.6380184 0.6177798 0.5112582
## 4.00 0.6296507 0.6267349 0.5091241
## 8.00 0.6218346 0.6345227 0.5035030
## 16.00 0.6201751 0.6352947 0.5025944
## 32.00 0.6201751 0.6352947 0.5025944
## 64.00 0.6201751 0.6352947 0.5025944
## 128.00 0.6201751 0.6352947 0.5025944
## 256.00 0.6201751 0.6352947 0.5025944
## 512.00 0.6201751 0.6352947 0.5025944
## 1024.00 0.6201751 0.6352947 0.5025944
## 2048.00 0.6201751 0.6352947 0.5025944
##
## Tuning parameter 'sigma' was held constant at a value of 0.0138412
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0138412 and C = 16.
svmRTuned_chem$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 16
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0138411996697324
##
## Number of Support Vectors : 119
##
## Objective Function Value : -93.3045
## Training error : 0.008903
# review performance metrics
svmRTunedPred_chem <- predict(svmRTuned_chem, newdata = test_chem)
svm_results_chem = postResample(pred = svmRTunedPred_chem, obs = test_chem$Yield)
svm_results_chem
## RMSE Rsquared MAE
## 0.5156197 0.6943808 0.4294884
Cross-validation results for the SVM model fit a model with a scale factor of 0.1 and a cost value of 16.
(a) Which nonlinear regression model gives the optimal resampling and test set performance?
# compile all model performance results
final_results_chem = rbind(
knn_results_chem,
nnet_results_chem,
mars_results_chem,
svm_results_chem
)
# add model name column
final_results_chem = data.frame(cbind(Model = c('KNN', 'NNET', 'MARS', 'SVM'), final_results_chem))
rownames(final_results_chem) <- NULL
# output results
knit_table(final_results_chem |> arrange(RMSE), 'Model Performance Metrics')
| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| SVM | 0.515619707472275 | 0.694380777283164 | 0.429488407257436 |
| NNET | 0.549680579464381 | 0.663365169364669 | 0.453890505095549 |
| MARS | 0.558010711328633 | 0.620287015285582 | 0.458871959004251 |
| KNN | 0.649221205563854 | 0.487407206700223 | 0.468386557728143 |
The SVM model appears to have best overall nonlinear regression performance due to having the lowest RMSE.
(b)
plot(varImp(svmRTuned_chem), top = 10,
main = "SVM Model Important Predictors - Top 10")
Process variables dominate the top 10 most important predictors of the SVM model. Process variables make up 60% of the most important predictors in both the top 10 and the top 5 important predictors. This same pattern is seen within the top 10 most important predictors of the optimal linear model that was fit to this data in the previous assignment.
(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
svm_chem_importance = varImp(svmRTuned_chem)$importance
svm_chem_importance = rownames_to_column(svm_chem_importance, var = "Predictor")
svm_chem_importance = svm_chem_importance |>
arrange(desc(Overall))|>
head(10)
top_chem_var = svm_chem_importance$Predictor
svm_chem_importance_long = prep_data |>
select(all_of(top_chem_var), Yield) |>
pivot_longer(cols = -Yield, names_to = "predictor", values_to = "value")
# top predictors vs yield
svm_chem_importance_long |>
ggplot(aes(x = value, y = Yield)) +
geom_point() +
geom_smooth(method = "lm", formula = 'y ~ x', color = "red") +
facet_wrap(~predictor, scales = "free_x", nrow = 3) +
theme_minimal() +
labs(title = "Top Predictors vs Yield")
# pair plots
top_pred_df = prep_data |>
select(all_of(top_chem_var), Yield)
colnames(top_pred_df) = gsub("Manufacturing|Material", "", colnames(top_pred_df))
ggpairs(
top_pred_df,
title="Pair Plot - Top Predictors & Yield",
progress = FALSE)
ggpairs(
top_pred_df |> select(matches('Biological|Yield')),
lower.panel=panel.smooth,
title="Pair Plot - Top Biological Predictors & Yield",
progress = FALSE,)
## Warning: The `...` argument of `ggpais()` is deprecated as of GGally 2.3.0.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
process_df = top_pred_df |>
select(matches('Process|Yield'))
colnames(process_df) = gsub("Manufacturing", "", colnames(process_df))
ggpairs(
process_df,
title="Pair Plot - Top Process Predictors & Yield",
progress = FALSE,)