Friedman (1991) introduced several benchmark data sets created by
simulation. The package mlbench contains a function
called mlbench.friedman1 that simulates these data.
library(mlbench)
library(caret)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
# We convert the 'x' data from a matrix to a data frame
# One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
# Look at the data using
featurePlot(trainingData$x, trainingData$y)
# or other methods.
# mlbench.friedman1 returns a list with a vector 'y' and a matrix
# of predictors 'x'. Also simulate a large test set to
# estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
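For reference, the simulation behind mlbench.friedman1 draws ten independent uniform predictors on [0, 1] and builds the response from only the first five, which is why X6 through X10 are pure noise. The base R sketch below mirrors that documented formula. The helper `simulate_friedman1` is a hypothetical name written for illustration and is not the package's own implementation.

```r
# Hypothetical re-implementation of the Friedman 1 benchmark, for illustration.
# y = 10 sin(pi * x1 * x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + N(0, sd^2)
simulate_friedman1 <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), ncol = 10)   # ten U(0, 1) predictors
  colnames(x) <- paste0("X", 1:10)
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]                            # X6..X10 never enter the signal
  list(x = x, y = signal + rnorm(n, sd = sd))
}

set.seed(200)
sim <- simulate_friedman1(200, sd = 1)
str(sim)
```

This sketch is not guaranteed to reproduce trainingData draw for draw, since the order of random number generation inside mlbench may differ.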
Tune several models on these data.
We will fit several nonlinear models to the data: k-nearest neighbors, a neural network, MARS, and a radial basis function SVM.
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
# The function 'postResample' can be used to get the test set
# performance values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
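The three values returned by postResample can be computed by hand. To our understanding, caret reports Rsquared as the squared correlation between predictions and observations rather than 1 - SSE/SST. The sketch below works under that assumption and uses made-up numbers, not the Friedman data.

```r
# Sketch of the three metrics on made-up values (not the Friedman data).
obs  <- c(3.0, 5.0, 2.5, 7.0)
pred <- c(2.5, 5.0, 4.0, 8.0)

rmse <- sqrt(mean((pred - obs)^2))   # root mean squared error
rsq  <- cor(pred, obs)^2             # squared correlation of pred and obs
mae  <- mean(abs(pred - obs))        # mean absolute error

c(RMSE = rmse, Rsquared = rsq, MAE = mae)
```

Because Rsquared here is a squared correlation, a model whose predictions are biased but well ordered can still score highly on it, so RMSE and MAE are worth reading alongside it.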
Before fitting a neural network, we check for highly correlated predictors, since strong collinearity can adversely affect the model.
Let’s see if there are any highly correlated variables.
findCorrelation(cor(trainingData$x), cutoff = 0.7)
## integer(0)
There are no highly correlated variables, so we can proceed with a neural network model.
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10))
set.seed(613)
nnetTune <- train(trainingData$x,
trainingData$y,
method = "nnet",
tuneGrid = nnetGrid,
trControl = trainControl(method = "cv"),
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
maxit = 500)
nnetTune
## Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.403507 0.7743189 1.901555
## 0.00 2 2.566132 0.7370821 2.138302
## 0.00 3 2.200392 0.8013902 1.711597
## 0.00 4 2.284752 0.7974063 1.797241
## 0.00 5 3.054918 0.7091331 2.157627
## 0.00 6 3.616891 0.6452446 2.377875
## 0.00 7 6.828992 0.5669780 4.131037
## 0.00 8 7.215473 0.4390305 4.255543
## 0.00 9 7.868435 0.5330447 4.460909
## 0.00 10 4.687039 0.5024569 3.358780
## 0.01 1 2.399650 0.7747890 1.895779
## 0.01 2 2.672432 0.7215392 2.143019
## 0.01 3 2.336309 0.7830089 1.925250
## 0.01 4 2.407845 0.7723048 1.953429
## 0.01 5 2.478010 0.7735236 1.965944
## 0.01 6 2.819053 0.7098835 2.283037
## 0.01 7 3.184249 0.6535647 2.503848
## 0.01 8 3.329401 0.6476834 2.608900
## 0.01 9 3.424461 0.5984221 2.655507
## 0.01 10 3.549240 0.6102028 2.713519
## 0.10 1 2.409116 0.7733430 1.899935
## 0.10 2 2.641610 0.7293005 2.085931
## 0.10 3 2.549721 0.7415061 2.070464
## 0.10 4 2.437229 0.7735266 1.953236
## 0.10 5 2.465427 0.7769614 1.955106
## 0.10 6 2.802721 0.6975629 2.251406
## 0.10 7 2.979128 0.6799093 2.367295
## 0.10 8 2.902694 0.6891414 2.372789
## 0.10 9 3.148114 0.6404661 2.460830
## 0.10 10 2.983838 0.6820981 2.504837
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3 and decay = 0.
nnetPred <- predict(nnetTune, testData$x)
postResample(nnetPred, testData$y)
## RMSE Rsquared MAE
## 1.7867916 0.8738675 1.3735092
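The MaxNWts value passed to train above is the weight count of the largest network in the grid. A minimal sketch of that arithmetic, assuming a single hidden layer and one linear output unit (`nnet_weight_count` is a hypothetical helper, not part of nnet or caret):

```r
# Weights in a single-hidden-layer network with one output unit:
# each hidden unit has p input weights plus a bias, and the output unit
# has one weight per hidden unit plus a bias.
nnet_weight_count <- function(size, p) {
  size * (p + 1) + size + 1
}

# Largest model in nnetGrid: 10 hidden units, 10 predictors
nnet_weight_count(size = 10, p = 10)
```

For 10 hidden units and 10 predictors this gives 121, so every model in the grid fits within the limit.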
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(613)
marsTuned <- train(trainingData$x,
trainingData$y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsTuned
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.322580 0.2543399 3.610030
## 1 3 3.748215 0.4252332 3.036391
## 1 4 2.645509 0.7223745 2.108078
## 1 5 2.402542 0.7679191 1.888667
## 1 6 2.378116 0.7741220 1.845874
## 1 7 1.822505 0.8667803 1.417201
## 1 8 1.706965 0.8790185 1.342523
## 1 9 1.658683 0.8840241 1.287723
## 1 10 1.623801 0.8910466 1.275374
## 1 11 1.610206 0.8913385 1.251271
## 1 12 1.627807 0.8899860 1.265812
## 1 13 1.627496 0.8904753 1.268445
## 1 14 1.642775 0.8881009 1.285885
## 1 15 1.644248 0.8879273 1.285336
## 1 16 1.644248 0.8879273 1.285336
## 1 17 1.644248 0.8879273 1.285336
## 1 18 1.644248 0.8879273 1.285336
## 1 19 1.644248 0.8879273 1.285336
## 1 20 1.644248 0.8879273 1.285336
## 1 21 1.644248 0.8879273 1.285336
## 1 22 1.644248 0.8879273 1.285336
## 1 23 1.644248 0.8879273 1.285336
## 1 24 1.644248 0.8879273 1.285336
## 1 25 1.644248 0.8879273 1.285336
## 1 26 1.644248 0.8879273 1.285336
## 1 27 1.644248 0.8879273 1.285336
## 1 28 1.644248 0.8879273 1.285336
## 1 29 1.644248 0.8879273 1.285336
## 1 30 1.644248 0.8879273 1.285336
## 1 31 1.644248 0.8879273 1.285336
## 1 32 1.644248 0.8879273 1.285336
## 1 33 1.644248 0.8879273 1.285336
## 1 34 1.644248 0.8879273 1.285336
## 1 35 1.644248 0.8879273 1.285336
## 1 36 1.644248 0.8879273 1.285336
## 1 37 1.644248 0.8879273 1.285336
## 1 38 1.644248 0.8879273 1.285336
## 2 2 4.322580 0.2543399 3.610030
## 2 3 3.748215 0.4252332 3.036391
## 2 4 2.645509 0.7223745 2.108078
## 2 5 2.410179 0.7670986 1.889079
## 2 6 2.299175 0.7890546 1.789842
## 2 7 1.816732 0.8673148 1.404578
## 2 8 1.604893 0.8983251 1.244678
## 2 9 1.470507 0.9134121 1.170182
## 2 10 1.416355 0.9184502 1.145227
## 2 11 1.330617 0.9263783 1.086287
## 2 12 1.270846 0.9339981 1.014435
## 2 13 1.290552 0.9338261 1.030310
## 2 14 1.282980 0.9325176 1.030299
## 2 15 1.259247 0.9360780 1.000936
## 2 16 1.284517 0.9332954 1.026027
## 2 17 1.269574 0.9353742 1.015216
## 2 18 1.268393 0.9354958 1.014549
## 2 19 1.268393 0.9354958 1.014549
## 2 20 1.268393 0.9354958 1.014549
## 2 21 1.268393 0.9354958 1.014549
## 2 22 1.268393 0.9354958 1.014549
## 2 23 1.268393 0.9354958 1.014549
## 2 24 1.268393 0.9354958 1.014549
## 2 25 1.268393 0.9354958 1.014549
## 2 26 1.268393 0.9354958 1.014549
## 2 27 1.268393 0.9354958 1.014549
## 2 28 1.268393 0.9354958 1.014549
## 2 29 1.268393 0.9354958 1.014549
## 2 30 1.268393 0.9354958 1.014549
## 2 31 1.268393 0.9354958 1.014549
## 2 32 1.268393 0.9354958 1.014549
## 2 33 1.268393 0.9354958 1.014549
## 2 34 1.268393 0.9354958 1.014549
## 2 35 1.268393 0.9354958 1.014549
## 2 36 1.268393 0.9354958 1.014549
## 2 37 1.268393 0.9354958 1.014549
## 2 38 1.268393 0.9354958 1.014549
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.
marsPred <- predict(marsTuned, testData$x)
postResample(marsPred, testData$y)
## RMSE Rsquared MAE
## 1.1589948 0.9460418 0.9250230
svmRTuned <- train(trainingData$x,
trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.499441 0.8077423 1.989067
## 0.50 2.227383 0.8291063 1.755788
## 1.00 2.032303 0.8520541 1.602510
## 2.00 1.950671 0.8626296 1.531516
## 4.00 1.873928 0.8715331 1.473022
## 8.00 1.855371 0.8729816 1.485854
## 16.00 1.851335 0.8734116 1.487928
## 32.00 1.851335 0.8734116 1.487928
## 64.00 1.851335 0.8734116 1.487928
## 128.00 1.851335 0.8734116 1.487928
## 256.00 1.851335 0.8734116 1.487928
## 512.00 1.851335 0.8734116 1.487928
## 1024.00 1.851335 0.8734116 1.487928
## 2048.00 1.851335 0.8734116 1.487928
##
## Tuning parameter 'sigma' was held constant at a value of 0.06613742
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06613742 and C = 16.
svmPred <- predict(svmRTuned, testData$x)
postResample(svmPred, testData$y)
## RMSE Rsquared MAE
## 2.0815349 0.8244315 1.5814080
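The cost values in the SVM table double at each step, from 0.25 up to 2048. The short sketch below reproduces that sequence, assuming the grid is built as powers of two starting at 2^-2, as the printed values suggest.

```r
# Cost values matching the 14 rows printed above: 2^-2, 2^-1, ..., 2^11
tune_length <- 14
cost_grid <- 2^(seq_len(tune_length) - 3)
print(cost_grid)
```

Since the resampled RMSE is flat from C = 16 onward, a shorter grid would likely have reached the same choice of C.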
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
Let’s compare the evaluation metrics of each model.
rbind(knnMod = postResample(knnPred, testData$y),
nnetMod = postResample(nnetPred, testData$y),
marsMod = postResample(marsPred, testData$y),
svmMod = postResample(svmPred, testData$y))
## RMSE Rsquared MAE
## knnMod 3.204059 0.6819919 2.568346
## nnetMod 1.786792 0.8738675 1.373509
## marsMod 1.158995 0.9460418 0.925023
## svmMod 2.081535 0.8244315 1.581408
The MARS model has the best performance with the highest \(R^2\) of 0.95 and the lowest \(RMSE\) of 1.16.
varImp(marsTuned)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.24
## X2 48.73
## X5 15.52
## X3 0.00
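The importance list contains only informative predictors: X1, X4, X2, and X5 have nonzero importance, and none of the noise predictors (X6 through X10) appear. However, X3 has an overall importance of 0, so the model does not credit all five informative predictors.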
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
# load data
data(ChemicalManufacturingProcess)
# apply same imputations, data splitting, and pre-processing as done in HW7
# impute
imputations <- preProcess(ChemicalManufacturingProcess,
method = c("knnImpute"),
k=5)
chem_man_imputed <- predict(imputations, ChemicalManufacturingProcess)
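# remove near-zero-variance predictors (guard against an empty index,
# since x[, -integer(0)] would drop every column)
nzv <- nearZeroVar(chem_man_imputed)
chem_man_filtered <- if (length(nzv) > 0) chem_man_imputed[, -nzv] else chem_man_imputed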
set.seed(613) # for reproducibility
# split into training and testing
train_indices <- sample(nrow(chem_man_filtered), floor(nrow(chem_man_filtered) * 0.8), replace = FALSE)
trainChem <- chem_man_filtered[train_indices,]
testChem <- chem_man_filtered[-train_indices,]
(a) Which nonlinear regression model gives the optimal resampling and test set performance?
knnModel <- train(Yield ~ .,
data=trainChem,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 140 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 140, 140, 140, 140, 140, 140, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.7904113 0.4030058 0.6188164
## 7 0.7660230 0.4275085 0.6036355
## 9 0.7580353 0.4398730 0.6072703
## 11 0.7544116 0.4465885 0.6074600
## 13 0.7533996 0.4477019 0.6078428
## 15 0.7529933 0.4482641 0.6103877
## 17 0.7539529 0.4495622 0.6125312
## 19 0.7541291 0.4507737 0.6123520
## 21 0.7562679 0.4490719 0.6156502
## 23 0.7589265 0.4510659 0.6185059
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
knnPred <- predict(knnModel, testChem)
trainChem_x <- trainChem |>
dplyr::select(-Yield)
trainChem_y <- trainChem |>
dplyr::select(Yield)
testChem_x <- testChem |>
dplyr::select(-Yield)
testChem_y <- testChem |>
dplyr::select(Yield)
corr_indices <- findCorrelation(cor(trainChem_x), cutoff = 0.7)
trainChemFiltered <- trainChem_x[, -corr_indices]
testChemFiltered <- testChem_x[, -corr_indices]
trainChemFiltered$Yield <- trainChem_y$Yield
testChemFiltered$Yield <- testChem_y$Yield
Removing predictors with pairwise correlations above 0.7 leaves 34 of the original 56. We can proceed with a neural network model.
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10))
set.seed(613)
nnetTune <- train(Yield ~ .,
data=trainChemFiltered,
method = "nnet",
tuneGrid = nnetGrid,
trControl = trainControl(method = "cv"),
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(trainChemFiltered)) + 10 + 1,
maxit = 500)
nnetTune
## Neural Network
##
## 140 samples
## 34 predictor
##
## Pre-processing: centered (34), scaled (34)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 126, 127, 126, 124, 126, 126, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 0.9023743 0.3606740 0.7138930
## 0.00 2 1.1531904 0.3055392 0.8629739
## 0.00 3 1.1092196 0.3408580 0.8634495
## 0.00 4 1.3169755 0.2837269 1.0059900
## 0.00 5 1.3545693 0.2307622 1.1082244
## 0.00 6 1.1423440 0.1982726 0.9209122
## 0.00 7 1.0979439 0.3388930 0.9254208
## 0.00 8 1.0222017 0.3915784 0.8079884
## 0.00 9 1.1264186 0.3072208 0.9113648
## 0.00 10 1.0145311 0.3390948 0.7789101
## 0.01 1 0.8989321 0.3783182 0.7167324
## 0.01 2 0.9999304 0.3839547 0.7940897
## 0.01 3 1.2912675 0.3080647 1.0005925
## 0.01 4 1.1982446 0.2797407 0.9193959
## 0.01 5 1.0552648 0.3356105 0.8564671
## 0.01 6 1.0255680 0.3534374 0.8273259
## 0.01 7 0.9132168 0.4486564 0.7156832
## 0.01 8 0.7902489 0.5151836 0.6463621
## 0.01 9 0.7863965 0.4788332 0.6458413
## 0.01 10 0.8033223 0.5436836 0.6398243
## 0.10 1 0.7422871 0.5569085 0.5959207
## 0.10 2 0.8681757 0.4567303 0.7085806
## 0.10 3 1.0457898 0.3434175 0.8517325
## 0.10 4 1.0309420 0.3171089 0.8004476
## 0.10 5 0.8977346 0.4462025 0.6993282
## 0.10 6 0.7829058 0.5234373 0.6300609
## 0.10 7 0.7962628 0.5722234 0.6048363
## 0.10 8 0.8089790 0.4854038 0.6471713
## 0.10 9 0.8618077 0.4806067 0.6683053
## 0.10 10 0.7789757 0.5558081 0.6239207
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.1.
nnetPred <- predict(nnetTune, testChemFiltered)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(613)
marsTuned <- train(Yield ~ .,
data=trainChem,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsTuned
## Multivariate Adaptive Regression Spline
##
## 140 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 126, 127, 126, 124, 126, 126, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.7640112 0.4730040 0.6054886
## 1 3 0.6905839 0.5703117 0.5425640
## 1 4 0.6201914 0.6456619 0.4999825
## 1 5 0.6174550 0.6412863 0.5041031
## 1 6 0.6268730 0.6354610 0.5104723
## 1 7 0.6263290 0.6324963 0.4989311
## 1 8 0.6251756 0.6350417 0.4877998
## 1 9 0.6511419 0.6181760 0.5100114
## 1 10 0.6611002 0.6074284 0.5190580
## 1 11 0.6716547 0.5968301 0.5270800
## 1 12 0.6755107 0.5993407 0.5308035
## 1 13 0.6668321 0.6071475 0.5254225
## 1 14 0.6732291 0.6034138 0.5313733
## 1 15 0.6789386 0.6001463 0.5390155
## 1 16 0.6789386 0.6001463 0.5390155
## 1 17 0.6849114 0.5955717 0.5420275
## 1 18 0.6950174 0.5944802 0.5471197
## 1 19 0.6947070 0.5948302 0.5460859
## 1 20 0.6981732 0.5928353 0.5467338
## 1 21 0.6981732 0.5928353 0.5467338
## 1 22 0.6981732 0.5928353 0.5467338
## 1 23 0.6981732 0.5928353 0.5467338
## 1 24 0.6981732 0.5928353 0.5467338
## 1 25 0.6981732 0.5928353 0.5467338
## 1 26 0.6981732 0.5928353 0.5467338
## 1 27 0.6981732 0.5928353 0.5467338
## 1 28 0.6981732 0.5928353 0.5467338
## 1 29 0.6981732 0.5928353 0.5467338
## 1 30 0.6981732 0.5928353 0.5467338
## 1 31 0.6981732 0.5928353 0.5467338
## 1 32 0.6981732 0.5928353 0.5467338
## 1 33 0.6981732 0.5928353 0.5467338
## 1 34 0.6981732 0.5928353 0.5467338
## 1 35 0.6981732 0.5928353 0.5467338
## 1 36 0.6981732 0.5928353 0.5467338
## 1 37 0.6981732 0.5928353 0.5467338
## 1 38 0.6981732 0.5928353 0.5467338
## 2 2 0.7640112 0.4730040 0.6054886
## 2 3 0.7168840 0.5486186 0.5754676
## 2 4 0.6937957 0.5788373 0.5547862
## 2 5 0.7078381 0.5342889 0.5516909
## 2 6 0.6893286 0.5506447 0.5411029
## 2 7 0.7157568 0.5461100 0.5584418
## 2 8 0.7282620 0.5433741 0.5625683
## 2 9 0.7114619 0.5600860 0.5482608
## 2 10 0.7531098 0.5073165 0.5671176
## 2 11 0.7146329 0.5436995 0.5557421
## 2 12 0.7380431 0.5283689 0.5672111
## 2 13 0.7394269 0.5240770 0.5733455
## 2 14 0.7603234 0.5134955 0.5860571
## 2 15 0.7720708 0.4953263 0.5934611
## 2 16 0.7878424 0.4858962 0.6061875
## 2 17 0.8242476 0.4652130 0.6061352
## 2 18 0.8271907 0.4735686 0.6047289
## 2 19 0.8227314 0.4749919 0.6031580
## 2 20 0.8153485 0.4769102 0.5968650
## 2 21 0.8085394 0.4778036 0.5971506
## 2 22 0.8176210 0.4775053 0.6051469
## 2 23 0.8089366 0.4843386 0.6014091
## 2 24 0.8097640 0.4780909 0.5999898
## 2 25 0.8157439 0.4685095 0.6068908
## 2 26 0.8176699 0.4659336 0.6097518
## 2 27 0.8176699 0.4659336 0.6097518
## 2 28 0.8189274 0.4621674 0.6103464
## 2 29 0.8172696 0.4638202 0.6084594
## 2 30 0.8172696 0.4638202 0.6084594
## 2 31 0.8172696 0.4638202 0.6084594
## 2 32 0.8172696 0.4638202 0.6084594
## 2 33 0.8172696 0.4638202 0.6084594
## 2 34 0.8172696 0.4638202 0.6084594
## 2 35 0.8172696 0.4638202 0.6084594
## 2 36 0.8172696 0.4638202 0.6084594
## 2 37 0.8172696 0.4638202 0.6084594
## 2 38 0.8172696 0.4638202 0.6084594
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 5 and degree = 1.
marsPred <- predict(marsTuned, testChem)
svmRTuned <- train(Yield ~ .,
data=trainChem,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 140 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 126, 124, 128, 126, 127, 125, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7283012 0.5592173 0.5939106
## 0.50 0.6633791 0.6180803 0.5369524
## 1.00 0.6191929 0.6697225 0.5000055
## 2.00 0.5787403 0.7120096 0.4638110
## 4.00 0.5637067 0.7157505 0.4448834
## 8.00 0.5488654 0.7283117 0.4395413
## 16.00 0.5480276 0.7290105 0.4391489
## 32.00 0.5480276 0.7290105 0.4391489
## 64.00 0.5480276 0.7290105 0.4391489
## 128.00 0.5480276 0.7290105 0.4391489
## 256.00 0.5480276 0.7290105 0.4391489
## 512.00 0.5480276 0.7290105 0.4391489
## 1024.00 0.5480276 0.7290105 0.4391489
## 2048.00 0.5480276 0.7290105 0.4391489
##
## Tuning parameter 'sigma' was held constant at a value of 0.01414407
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01414407 and C = 16.
svmPred <- predict(svmRTuned, testChem)
Let’s compare the test set performance metrics for each model.
rbind(knnMod = postResample(knnPred, testChem$Yield),
nnetMod = postResample(nnetPred, testChemFiltered$Yield),
marsMod = postResample(marsPred, testChem$Yield),
svmMod = postResample(svmPred, testChem$Yield))
## RMSE Rsquared MAE
## knnMod 0.8357339 0.1900885 0.6474526
## nnetMod 0.6769329 0.4697276 0.5464301
## marsMod 0.6301207 0.5449365 0.4941829
## svmMod 0.5850169 0.5842473 0.4863866
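The SVM model has the best test set performance, with the highest \(R^2\) of 0.58 and the lowest \(RMSE\) of 0.59. It also has the lowest resampled \(RMSE\) (0.55, versus 0.62 for MARS, 0.74 for the neural network, and 0.75 for KNN), so it is the best-performing model on both criteria.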
(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
Let’s take a look at the top 10 most important predictors for the SVM model.
plot(varImp(svmRTuned), 10)
The process variables dominate the most important predictors. However, there are more biological variables that are important in the SVM model than there were in the linear model from HW7.
lasso_mod <- train(Yield ~ .,
data=trainChem,
method = "glmnet",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv"),
tuneGrid = expand.grid(.alpha = 1, .lambda = seq(0, 1, 0.05)))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
plot(varImp(lasso_mod), 10)
The linear model trained for HW 7 had only seven variables of real importance. ManufacturingProcess32 is the most important variable in both models.
The other predictors that rank as important in both models are ManufacturingProcess13, ManufacturingProcess09, ManufacturingProcess17, ManufacturingProcess36, ManufacturingProcess06, and BiologicalMaterial02.
(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
library(corrplot)
chem_man_filtered[, c("Yield", "BiologicalMaterial06", "ManufacturingProcess31", "BiologicalMaterial03", "BiologicalMaterial12")] |>
cor() |>
corrplot(method="color",
diag=FALSE,
type="lower",
addCoef.col = "black",
number.cex=0.5)
If we take a look down the first column,
BiologicalMaterial06 has the highest positive correlation
with Yield, followed by BiologicalMaterial03
and BiologicalMaterial12.
ManufacturingProcess31 has only a slight negative
correlation with Yield, so it is unclear why it ranks as
such an important variable in the optimal model. Further testing would
be needed to see whether this variable is highly correlated
with other variables in the dataset.
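One way to start that follow-up is to rank the other predictors by their absolute correlation with ManufacturingProcess31. This is a sketch on the raw data, assuming the AppliedPredictiveModeling package is installed. It handles missing values with pairwise complete observations rather than the KNN imputation used earlier, so the numbers may differ slightly from those based on chem_man_filtered.

```r
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

target <- "ManufacturingProcess31"
others <- setdiff(names(ChemicalManufacturingProcess), c("Yield", target))

# Correlation of the target predictor with every other predictor
cors <- cor(ChemicalManufacturingProcess[[target]],
            ChemicalManufacturingProcess[, others],
            use = "pairwise.complete.obs")

# Five predictors most strongly related to the target, by absolute correlation
top_related <- head(sort(abs(cors[1, ]), decreasing = TRUE), 5)
top_related
```

If several of these correlations are large, the importance of ManufacturingProcess31 may partly reflect information it shares with those predictors.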