7.2. Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: y = 10 sin(πx1x2) + 20(x3 − 0.5)^2 + 10x4 + 5x5 + N(0, σ^2), where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates 5 other non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame. One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix of predictors 'x'. Also simulate a large test set to estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
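As a sanity check, the data-generating equation above can be coded directly. This is only a sketch (mlbench.friedman1's internals may differ in details such as column naming), but it makes the five informative and five noise predictors explicit:
# Hand-rolled Friedman #1 generator mirroring the equation above (sketch only)
friedman1_manual <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), ncol = 10)  # 10 uniform predictors; columns 6-10 are noise
  y <- 10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] + 5 * x[, 5] + rnorm(n, sd = sd)  # plus N(0, sd^2) noise
  list(x = x, y = y)
}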
Tune several models on these data. For example:
kNN Model: Cross-validation selected k = 11, with a CV RMSE of 3.086639. On the test set the RMSE was 3.1222641, the R2 was 0.6690472 (66.9% of the variance in y explained by the kNN model), and the MAE, the average absolute difference between predictions and actual values, was 2.496365. From these numbers we can say the model is decent, but not outstanding, at modeling these data.
set.seed(200)
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.238598 0.5836232 2.705822
## 7 3.117335 0.6295372 2.561052
## 9 3.100423 0.6590940 2.524483
## 11 3.086639 0.6822198 2.506584
## 13 3.094904 0.6902613 2.504433
## 15 3.116059 0.7045172 2.516131
## 17 3.129874 0.7133067 2.529370
## 19 3.151840 0.7183283 2.546422
## 21 3.175787 0.7209301 2.574113
## 23 3.208213 0.7146199 2.611285
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.1222641 0.6690472 2.4963650
SVM Model: Cross-validation selected sigma = 0.06299324 and C = 8, with a CV RMSE of 1.915665. On the test set the RMSE was 2.0541197, the R2 was 0.8290353 (82.9% of the variance in y explained by the SVM model), and the MAE was 1.5586411. From these numbers we can say the model performed much better than the kNN model, with a lower RMSE and a higher R2.
set.seed(200)
svmRTuned <- train(x = trainingData$x,
y = trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.525164 0.7810576 2.010680
## 0.50 2.270567 0.7944850 1.794902
## 1.00 2.099356 0.8155574 1.659376
## 2.00 2.005858 0.8302852 1.578799
## 4.00 1.934650 0.8435677 1.528373
## 8.00 1.915665 0.8475605 1.528648
## 16.00 1.923914 0.8463074 1.535991
## 32.00 1.923914 0.8463074 1.535991
## 64.00 1.923914 0.8463074 1.535991
## 128.00 1.923914 0.8463074 1.535991
##
## Tuning parameter 'sigma' was held constant at a value of 0.06299324
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06299324 and C = 8.
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 8
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0629932410345396
##
## Number of Support Vectors : 152
##
## Objective Function Value : -72.63
## Training error : 0.009177
svmPred <- predict(svmRTuned, newdata = testData$x)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = svmPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.0541197 0.8290353 1.5586411
MARS: Cross-validation selected nprune = 14 and degree = 2, with a CV RMSE of 1.261797. On the test set the RMSE was 1.1722635, the R2 was 0.9448890 (about 94.5% of the variance in y explained by the MARS model), and the MAE was 0.9324923. From these numbers we can say the model performed better than both the kNN model and the SVM model, with an even lower RMSE and an even higher R2.
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
# Fix the seed so that the results can be reproduced
set.seed(200)
marsTuned <- train(x = trainingData$x,
y = trainingData$y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsTuned
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.188280 0.3042527 3.460689
## 1 3 3.551182 0.4999832 2.837116
## 1 4 2.653143 0.7167280 2.128222
## 1 5 2.405769 0.7562160 1.948161
## 1 6 2.295006 0.7754603 1.853199
## 1 7 1.771950 0.8611767 1.391357
## 1 8 1.647182 0.8774867 1.299564
## 1 9 1.609816 0.8837307 1.299705
## 1 10 1.635035 0.8798236 1.309436
## 1 11 1.571915 0.8896147 1.260711
## 1 12 1.571561 0.8898750 1.253077
## 1 13 1.567577 0.8906927 1.250795
## 1 14 1.571673 0.8909652 1.245508
## 1 15 1.571673 0.8909652 1.245508
## 1 16 1.571673 0.8909652 1.245508
## 1 17 1.571673 0.8909652 1.245508
## 1 18 1.571673 0.8909652 1.245508
## 1 19 1.571673 0.8909652 1.245508
## 1 20 1.571673 0.8909652 1.245508
## 1 21 1.571673 0.8909652 1.245508
## 1 22 1.571673 0.8909652 1.245508
## 1 23 1.571673 0.8909652 1.245508
## 1 24 1.571673 0.8909652 1.245508
## 1 25 1.571673 0.8909652 1.245508
## 1 26 1.571673 0.8909652 1.245508
## 1 27 1.571673 0.8909652 1.245508
## 1 28 1.571673 0.8909652 1.245508
## 1 29 1.571673 0.8909652 1.245508
## 1 30 1.571673 0.8909652 1.245508
## 1 31 1.571673 0.8909652 1.245508
## 1 32 1.571673 0.8909652 1.245508
## 1 33 1.571673 0.8909652 1.245508
## 1 34 1.571673 0.8909652 1.245508
## 1 35 1.571673 0.8909652 1.245508
## 1 36 1.571673 0.8909652 1.245508
## 1 37 1.571673 0.8909652 1.245508
## 1 38 1.571673 0.8909652 1.245508
## 2 2 4.188280 0.3042527 3.460689
## 2 3 3.551182 0.4999832 2.837116
## 2 4 2.615256 0.7216809 2.128763
## 2 5 2.344223 0.7683855 1.890080
## 2 6 2.275048 0.7762472 1.807779
## 2 7 1.841464 0.8418935 1.457945
## 2 8 1.641647 0.8839822 1.288520
## 2 9 1.535119 0.9002991 1.214772
## 2 10 1.473254 0.9101555 1.158761
## 2 11 1.379476 0.9207735 1.080991
## 2 12 1.285380 0.9283193 1.033426
## 2 13 1.267261 0.9328905 1.014726
## 2 14 1.261797 0.9327541 1.009821
## 2 15 1.266663 0.9320714 1.005751
## 2 16 1.270858 0.9322465 1.009757
## 2 17 1.263778 0.9327687 1.007653
## 2 18 1.263778 0.9327687 1.007653
## 2 19 1.263778 0.9327687 1.007653
## 2 20 1.263778 0.9327687 1.007653
## 2 21 1.263778 0.9327687 1.007653
## 2 22 1.263778 0.9327687 1.007653
## 2 23 1.263778 0.9327687 1.007653
## 2 24 1.263778 0.9327687 1.007653
## 2 25 1.263778 0.9327687 1.007653
## 2 26 1.263778 0.9327687 1.007653
## 2 27 1.263778 0.9327687 1.007653
## 2 28 1.263778 0.9327687 1.007653
## 2 29 1.263778 0.9327687 1.007653
## 2 30 1.263778 0.9327687 1.007653
## 2 31 1.263778 0.9327687 1.007653
## 2 32 1.263778 0.9327687 1.007653
## 2 33 1.263778 0.9327687 1.007653
## 2 34 1.263778 0.9327687 1.007653
## 2 35 1.263778 0.9327687 1.007653
## 2 36 1.263778 0.9327687 1.007653
## 2 37 1.263778 0.9327687 1.007653
## 2 38 1.263778 0.9327687 1.007653
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.
marsPred <- predict(marsTuned, newdata = testData$x)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = marsPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.1722635 0.9448890 0.9324923
Neural Networks: Cross-validation selected size = 4, decay = 0.01, and bag = FALSE, with a CV RMSE of 1.926277. On the test set the RMSE was 2.0603901, the R2 was 0.8320669 (83.2% of the variance in y explained by the averaged neural network), and the MAE was 1.5289876. From these numbers we can say the model performed better than the kNN model and very close to the SVM model, but underperformed the MARS model.
# Check for highly correlated predictors (the result is not used to filter columns here)
tooHigh <- findCorrelation(cor(trainingData$x), cutoff = .75)
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1), .size = 1:10, .bag = FALSE)
set.seed(200)
nnetTune <- train(x = trainingData$x,
y = trainingData$y,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = trainControl(method = "cv", number = 10),
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
maxit = 500)
## Warning: executing %dopar% sequentially: no parallel backend registered
nnetTune
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.409283 0.7621183 1.899438
## 0.00 2 2.422970 0.7596059 1.940174
## 0.00 3 2.050680 0.8164736 1.639116
## 0.00 4 1.945570 0.8359932 1.553962
## 0.00 5 2.459608 0.7843926 1.852817
## 0.00 6 3.619609 0.6518455 2.367746
## 0.00 7 3.732346 0.5465485 2.505487
## 0.00 8 5.341921 0.4925666 3.027228
## 0.00 9 4.448181 0.5628776 2.637284
## 0.00 10 3.736545 0.6189292 2.529286
## 0.01 1 2.380838 0.7641924 1.871203
## 0.01 2 2.456920 0.7487966 1.925584
## 0.01 3 2.152614 0.8037282 1.690702
## 0.01 4 1.926277 0.8453345 1.547266
## 0.01 5 2.129094 0.8103702 1.698914
## 0.01 6 2.140650 0.8117168 1.698805
## 0.01 7 2.414589 0.7646608 1.911319
## 0.01 8 2.366556 0.7741779 1.873354
## 0.01 9 2.368192 0.7641654 1.781225
## 0.01 10 2.336267 0.7855705 1.857180
## 0.10 1 2.392300 0.7614548 1.873846
## 0.10 2 2.437038 0.7557138 1.918843
## 0.10 3 2.136582 0.8043180 1.702665
## 0.10 4 2.009698 0.8245209 1.574401
## 0.10 5 2.015296 0.8345946 1.586740
## 0.10 6 2.038841 0.8283220 1.586032
## 0.10 7 2.129954 0.8133844 1.707177
## 0.10 8 2.148353 0.8099883 1.690601
## 0.10 9 2.254190 0.7942883 1.759231
## 0.10 10 2.359693 0.7719593 1.873586
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0.01 and bag = FALSE.
nnetPred <- predict(nnetTune, newdata = testData$x)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = nnetPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.0603901 0.8320669 1.5289876
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
The MARS (Multivariate Adaptive Regression Splines) model gave the best performance, with the lowest test RMSE (1.1722635) and MAE (0.9324923) and the highest R2 (explaining ~94.5% of the variance). This tells us the MARS model predicted the actual values accurately, since the R2 was very close to 1, and the low MAE tells us the differences between actual and predicted values were small.
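This ranking can be cross-checked against the resampling profiles: each model was tuned with set.seed(200) and 10-fold cross-validation, so caret's resamples can collect their CV results side by side. A minimal sketch using the train objects above (output not shown):
# Compare cross-validated performance of the four tuned models (sketch)
cvResults <- resamples(list(kNN = knnModel, SVM = svmRTuned,
                            MARS = marsTuned, avNNet = nnetTune))
summary(cvResults)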
From the final model output below, we can see that MARS did select the informative predictors (those named X1–X5). This shows that MARS performed feature selection, excluding the noise predictors (X6–X10).
marsTuned$finalModel
## Selected 14 of 18 terms, and 5 of 10 predictors (nprune=14)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 9 4
## GCV 1.62945 RSS 225.8601 GRSq 0.9338437 RSq 0.953688
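Variable importance tells the same story; as a quick check (output omitted), only X1–X5 should receive nonzero importance:
# Importance ranking for the tuned MARS model; the noise predictors
# X6-X10 should contribute nothing
varImp(marsTuned)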
7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
data(ChemicalManufacturingProcess)
impute <- preProcess(ChemicalManufacturingProcess, "knnImpute")
bio <- predict(impute, ChemicalManufacturingProcess)
filtered_bio <- bio[, -nearZeroVar(bio)]
set.seed(1000)
splits2 <- createDataPartition(filtered_bio$Yield, p = .80, times = 1, list = FALSE)
training <- filtered_bio[splits2, ]
testing <- filtered_bio[-splits2, ]
# Separate predictors and outcome
training_x <- training[, names(training) != "Yield"]
training_y <- training$Yield
test_x <- testing[, names(testing) != "Yield"]
test_y <- testing$Yield
kNN Model: Cross-validation selected k = 7, with a CV RMSE of 0.7315242. On the test set the RMSE was 0.707341, the R2 was 0.439959 (44.0% of the variance in y explained by the kNN model), and the MAE was 0.597708. From these numbers we can say this is probably not the best model for predicting this response.
set.seed(200)
knnModel2 <- train(x = training_x,
y = training_y,
method = "knn",
preProc = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10)
knnModel2
## k-Nearest Neighbors
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.7323182 0.5025668 0.5986016
## 7 0.7315242 0.5243653 0.5820750
## 9 0.7339834 0.5156104 0.5870805
## 11 0.7454995 0.5164099 0.6068725
## 13 0.7481804 0.5179680 0.6128097
## 15 0.7558579 0.5084821 0.6188031
## 17 0.7590126 0.5077806 0.6160498
## 19 0.7691120 0.4923293 0.6231535
## 21 0.7743640 0.4994307 0.6290775
## 23 0.7843231 0.4926474 0.6354539
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
knnPred2 <- predict(knnModel2, newdata = test_x)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = knnPred2, obs = test_y)
## RMSE Rsquared MAE
## 0.707341 0.439959 0.597708
Neural Networks: Cross-validation selected size = 6, decay = 0.01, and bag = FALSE, with a CV RMSE of 0.6297313. On the test set the RMSE was 0.9090523, the R2 was 0.2950409 (only 29.5% of the variance in y explained by the averaged neural network), and the MAE was 0.6834901. The test RMSE is much worse than both its own CV RMSE and the kNN test RMSE, suggesting the network generalizes poorly here - overall, not good at predicting the actual values.
tooHigh <- findCorrelation(cor(training_x), cutoff = .75)
train_x_nnet <- training_x[, -tooHigh]
test_x_nnet <- test_x[, -tooHigh]
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1), .size = 1:10, .bag = FALSE)
set.seed(200)
nnetTune2 <- train(x = train_x_nnet,
y = training_y,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = trainControl(method = "cv", number = 10),
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(train_x_nnet) + 1) + 10 + 1,
maxit = 500)
nnetTune2
## Model Averaged Neural Network
##
## 144 samples
## 34 predictor
##
## Pre-processing: centered (34), scaled (34)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 0.8102564 0.4207664 0.6456538
## 0.00 2 0.7521505 0.5237831 0.5982847
## 0.00 3 0.8127144 0.4671907 0.6529255
## 0.00 4 0.7426528 0.5466168 0.5990348
## 0.00 5 0.7526093 0.5564751 0.6030258
## 0.00 6 0.8217485 0.4643396 0.6693577
## 0.00 7 0.7481465 0.5319397 0.6242967
## 0.00 8 0.7077327 0.5665477 0.5666518
## 0.00 9 0.7014477 0.5776377 0.5820450
## 0.00 10 0.7032134 0.5589602 0.5634196
## 0.01 1 0.8006842 0.4430332 0.6422199
## 0.01 2 0.6982785 0.5887453 0.5513743
## 0.01 3 0.7343272 0.5454770 0.5945528
## 0.01 4 0.7574609 0.5002943 0.6176227
## 0.01 5 0.6733725 0.6117187 0.5538290
## 0.01 6 0.6297313 0.6457762 0.5171542
## 0.01 7 0.6617782 0.6179006 0.5332240
## 0.01 8 0.6787046 0.5982176 0.5505471
## 0.01 9 0.6733624 0.6075216 0.5464198
## 0.01 10 0.6790164 0.6060510 0.5616365
## 0.10 1 0.8042218 0.4596355 0.6580454
## 0.10 2 0.6911396 0.5825341 0.5555202
## 0.10 3 0.7390385 0.5473842 0.5904390
## 0.10 4 0.6830727 0.6272789 0.5499197
## 0.10 5 0.6885004 0.6025470 0.5720166
## 0.10 6 0.6760863 0.6090577 0.5508936
## 0.10 7 0.6835250 0.6062134 0.5475977
## 0.10 8 0.6730756 0.6008292 0.5648896
## 0.10 9 0.6489555 0.6273300 0.5321331
## 0.10 10 0.6834932 0.5914035 0.5600145
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 6, decay = 0.01 and bag = FALSE.
nnetPred2 <- predict(nnetTune2, newdata = test_x_nnet)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = nnetPred2, obs = test_y)
## RMSE Rsquared MAE
## 0.9090523 0.2950409 0.6834901
MARS: Cross-validation selected nprune = 3 and degree = 2, with a CV RMSE of 0.6034844. On the test set the RMSE was 0.7284024, the R2 was 0.4554852 (45.5% of the variance in y explained by the MARS model), and the MAE was 0.5794772. From these numbers we can say the model clearly outperformed the neural network and had a higher R2 than the kNN model, though its test RMSE was slightly higher than kNN's - overall, still not great at predicting the actual values.
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
# Fix the seed so that the results can be reproduced
set.seed(200)
marsTuned2 <- train(x = training_x,
y = training_y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsTuned2
## Multivariate Adaptive Regression Spline
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.7351587 0.4816841 0.5881771
## 1 3 0.6645810 0.5759239 0.5427819
## 1 4 0.6554259 0.5858700 0.5459351
## 1 5 0.6438461 0.6011463 0.5399213
## 1 6 0.6894082 0.5576078 0.5676534
## 1 7 0.7406958 0.5318539 0.6035565
## 1 8 0.7767046 0.4956401 0.6383589
## 1 9 0.7756256 0.4982759 0.6222046
## 1 10 0.7464739 0.5462954 0.5793487
## 1 11 0.7701961 0.5202253 0.5946750
## 1 12 0.8004951 0.4842799 0.6199716
## 1 13 0.8142147 0.4803955 0.6254273
## 1 14 0.8388045 0.4802877 0.6328067
## 1 15 0.8453222 0.4751444 0.6326985
## 1 16 0.8414692 0.4800565 0.6301831
## 1 17 0.8414692 0.4800565 0.6301831
## 1 18 0.8378011 0.4872107 0.6305787
## 1 19 0.8312784 0.4913901 0.6281800
## 1 20 0.8312784 0.4913901 0.6281800
## 1 21 0.8312784 0.4913901 0.6281800
## 1 22 0.8312784 0.4913901 0.6281800
## 1 23 0.8312784 0.4913901 0.6281800
## 1 24 0.8312784 0.4913901 0.6281800
## 1 25 0.8312784 0.4913901 0.6281800
## 1 26 0.8312784 0.4913901 0.6281800
## 1 27 0.8312784 0.4913901 0.6281800
## 1 28 0.8312784 0.4913901 0.6281800
## 1 29 0.8312784 0.4913901 0.6281800
## 1 30 0.8312784 0.4913901 0.6281800
## 1 31 0.8312784 0.4913901 0.6281800
## 1 32 0.8312784 0.4913901 0.6281800
## 1 33 0.8312784 0.4913901 0.6281800
## 1 34 0.8312784 0.4913901 0.6281800
## 1 35 0.8312784 0.4913901 0.6281800
## 1 36 0.8312784 0.4913901 0.6281800
## 1 37 0.8312784 0.4913901 0.6281800
## 1 38 0.8312784 0.4913901 0.6281800
## 2 2 0.7351587 0.4816841 0.5881771
## 2 3 0.6034844 0.6550300 0.4948902
## 2 4 0.6256130 0.6304572 0.5143866
## 2 5 0.6778739 0.5913431 0.5477765
## 2 6 0.6726250 0.5905856 0.5411828
## 2 7 0.6909968 0.5598967 0.5686644
## 2 8 0.7556908 0.4943548 0.5923824
## 2 9 0.7571614 0.5012385 0.5898781
## 2 10 0.7275396 0.5369153 0.5643234
## 2 11 0.7498740 0.5294244 0.5786155
## 2 12 0.7411371 0.5387883 0.5658056
## 2 13 0.7691828 0.5233715 0.5777534
## 2 14 0.7870945 0.5093099 0.5873846
## 2 15 0.8030363 0.4914380 0.6045851
## 2 16 0.8131474 0.4946378 0.6150707
## 2 17 0.8233712 0.4879887 0.6202039
## 2 18 0.8355710 0.4763152 0.6278927
## 2 19 0.8485194 0.4692869 0.6305074
## 2 20 0.8526728 0.4658218 0.6316808
## 2 21 0.8749133 0.4510601 0.6392113
## 2 22 0.8749133 0.4510601 0.6392113
## 2 23 0.8749133 0.4510601 0.6392113
## 2 24 0.8749133 0.4510601 0.6392113
## 2 25 0.8749133 0.4510601 0.6392113
## 2 26 0.8749133 0.4510601 0.6392113
## 2 27 0.8786266 0.4505728 0.6435548
## 2 28 0.8786266 0.4505728 0.6435548
## 2 29 0.8786266 0.4505728 0.6435548
## 2 30 0.8786266 0.4505728 0.6435548
## 2 31 0.8786266 0.4505728 0.6435548
## 2 32 0.8786266 0.4505728 0.6435548
## 2 33 0.8786266 0.4505728 0.6435548
## 2 34 0.8786266 0.4505728 0.6435548
## 2 35 0.8786266 0.4505728 0.6435548
## 2 36 0.8786266 0.4505728 0.6435548
## 2 37 0.8786266 0.4505728 0.6435548
## 2 38 0.8786266 0.4505728 0.6435548
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 3 and degree = 2.
marsPred2 <- predict(marsTuned2, newdata = test_x)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = marsPred2, obs = test_y)
## RMSE Rsquared MAE
## 0.7284024 0.4554852 0.5794772
SVM Model: Cross-validation selected sigma = 0.01399323 and C = 4, with a CV RMSE of 0.5964908. On the test set the RMSE was 0.6802913, the R2 was 0.5225403 (52.3% of the variance in y explained by the SVM model), and the MAE was 0.5200695. From these numbers we can say the model performed the best of all the models, with the lowest RMSE and the highest R2.
set.seed(200)
svmRTuned2 <- train(x = training_x,
y = training_y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTuned2
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7580621 0.5492036 0.6176256
## 0.50 0.6939783 0.5981826 0.5693265
## 1.00 0.6457835 0.6530946 0.5296596
## 2.00 0.6134673 0.6836176 0.5011625
## 4.00 0.5964908 0.6925414 0.4835798
## 8.00 0.6016593 0.6770286 0.4868337
## 16.00 0.5989691 0.6802140 0.4863216
## 32.00 0.5989691 0.6802140 0.4863216
## 64.00 0.5989691 0.6802140 0.4863216
## 128.00 0.5989691 0.6802140 0.4863216
## 256.00 0.5989691 0.6802140 0.4863216
## 512.00 0.5989691 0.6802140 0.4863216
## 1024.00 0.5989691 0.6802140 0.4863216
## 2048.00 0.5989691 0.6802140 0.4863216
##
## Tuning parameter 'sigma' was held constant at a value of 0.01399323
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01399323 and C = 4.
svmRTuned2$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 4
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0139932256245438
##
## Number of Support Vectors : 128
##
## Objective Function Value : -77.2058
## Training error : 0.026496
svmPred2 <- predict(svmRTuned2, newdata = test_x)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = svmPred2, obs = test_y)
## RMSE Rsquared MAE
## 0.6802913 0.5225403 0.5200695
(a) Which nonlinear regression model gives the optimal resampling and test set performance? Based on the resampling and test set results across the nonlinear regression models, the SVM was the best-performing model overall, with the lowest RMSE (0.6802913) and MAE (0.5200695) and the highest R2 (explaining ~52.3% of the variance). This tells us that although the SVM model performed best of the four, it was still not very accurate at predicting the actual values: an R2 around 0.5 leaves a large amount of variance unexplained.
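The resampling profiles support the same conclusion; a minimal sketch collecting the cross-validated results of the four train objects above (output not shown):
# Side-by-side CV comparison for the chemical manufacturing models (sketch)
chemResults <- resamples(list(kNN = knnModel2, avNNet = nnetTune2,
                              MARS = marsTuned2, SVM = svmRTuned2))
summary(chemResults)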
(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
The most important predictors in the nonlinear SVM model trained above are:
ManufacturingProcess32
BiologicalMaterial06
ManufacturingProcess36
ManufacturingProcess13
BiologicalMaterial03
BiologicalMaterial12
ManufacturingProcess17
BiologicalMaterial02
ManufacturingProcess31
ManufacturingProcess09
The most important predictors from the optimal linear model (Exercise 6.3) were:
ManufacturingProcess32
ManufacturingProcess36
ManufacturingProcess13
ManufacturingProcess09
ManufacturingProcess17
ManufacturingProcess06
BiologicalMaterial02
BiologicalMaterial06
ManufacturingProcess11
ManufacturingProcess33
The process predictors dominate the list over the biological predictors for both the nonlinear and linear models, although the linear model's top 10 contains more process predictors than the nonlinear model's (8 vs. 6). The top predictor for both models was ManufacturingProcess32. The two models also share several other top predictors, just not in the same order of importance (see the sketch after this list for how both rankings were retrieved); these include:
ManufacturingProcess36
ManufacturingProcess13
ManufacturingProcess09
ManufacturingProcess17
BiologicalMaterial06
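The nonlinear ranking below was obtained with varImp on the tuned SVM. The linear ranking was pulled the same way from the Exercise 6.3 model; a sketch, where linearTuned is a hypothetical placeholder for that exercise's train object:
# varImp on the optimal linear model from Exercise 6.3
# ('linearTuned' is a placeholder; substitute that exercise's actual object)
varImp(linearTuned)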
varImp(svmRTuned2)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 87.83
## ManufacturingProcess36 84.94
## ManufacturingProcess13 80.51
## BiologicalMaterial03 73.41
## BiologicalMaterial12 68.92
## ManufacturingProcess17 68.72
## BiologicalMaterial02 66.62
## ManufacturingProcess31 64.95
## ManufacturingProcess09 63.74
## ManufacturingProcess06 63.35
## ManufacturingProcess11 49.94
## BiologicalMaterial04 49.41
## ManufacturingProcess33 46.77
## ManufacturingProcess29 45.64
## BiologicalMaterial11 42.08
## BiologicalMaterial09 39.23
## BiologicalMaterial08 39.20
## BiologicalMaterial01 35.96
## ManufacturingProcess30 33.76
ggplot(varImp(svmRTuned2), top = 15)
(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
We can see from the correlation results that most of the relationships between the top predictors and the response are moderate or weak: ManufacturingProcess32 (corr = 0.6343301), BiologicalMaterial06 (corr = 0.4990380), ManufacturingProcess36 (corr = -0.5846408), ManufacturingProcess13 (corr = -0.4871624), BiologicalMaterial03 (corr = 0.4518573), and BiologicalMaterial12 (corr = 0.3565016).
The relationship between ManufacturingProcess32 (corr = 0.6343301) and the response is positive and moderately strong. The relationships for BiologicalMaterial06 (corr = 0.4990380), BiologicalMaterial03 (corr = 0.4518573), and BiologicalMaterial12 (corr = 0.3565016) are positive and moderate to weak. The relationships for ManufacturingProcess36 (corr = -0.5846408) and ManufacturingProcess13 (corr = -0.4871624) are negative and moderate. A positive association generally means increasing the value of the predictor would increase the yield, while a negative association means increasing the predictor would decrease the yield. However, since these relationships are only moderate, this will not hold uniformly across predictors and observations.
These plots do reveal some intuition about the biological and process predictors and their relationship with yield: knowing which predictors correlate with product yield can help improve production. The manufacturer may want to raise the settings of the positively correlated process predictors, and reduce or monitor those with negative correlations, to push yield higher.
top_predictors <- c("ManufacturingProcess32", "BiologicalMaterial06", "ManufacturingProcess36",
                    "ManufacturingProcess13", "BiologicalMaterial03", "BiologicalMaterial12")
cor_results <- cor(training[, top_predictors], training$Yield)
cor_results
## [,1]
## ManufacturingProcess32 0.6343301
## BiologicalMaterial06 0.4990380
## ManufacturingProcess36 -0.5846408
## ManufacturingProcess13 -0.4871624
## BiologicalMaterial03 0.4518573
## BiologicalMaterial12 0.3565016
plot_data <- training[, c(top_predictors, "Yield")]
plot_data_long <- pivot_longer(plot_data, cols = all_of(top_predictors),
                               names_to = "Predictor", values_to = "Value")
ggplot(plot_data_long, aes(x = Value, y = Yield)) +
geom_point() +
geom_smooth(method = "lm", color = "blue", se = FALSE) +
facet_wrap(~ Predictor, scales = "free_x") +
labs(title = "Relationship Between Top Predictors and Yield",
x = "Predictor Value",
y = "Yield")
## `geom_smooth()` using formula = 'y ~ x'