Infrared (IR) spectroscopy is used to determine the chemical makeup of a substance. The theory of IR spectroscopy holds that unique molecular structures absorb IR frequencies differently. In practice, a spectrometer fires a series of IR frequencies into a sample material, and the device measures the absorbance of the sample at each individual frequency. This series of measurements creates a spectrum profile, which can then be used to determine the chemical makeup of the sample material. A Tecator Infratec Food and Feed Analyzer instrument was used to analyze 215 samples of meat across 100 frequencies. A sample of these frequency profiles is displayed in Fig. 6.20. In addition to an IR profile, analytical chemistry determined the percent content of water, fat, and protein for each sample. If we can establish a predictive relationship between IR spectrum and fat content, then food scientists could predict a sample’s fat content with IR instead of using analytical chemistry. This would provide cost savings, since analytical chemistry is a more expensive, time-consuming process.
library(caret)
## Warning: package 'caret' was built under R version 4.5.2
## Loading required package: ggplot2
## Loading required package: lattice
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.5.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.5.2
## Loading required package: Matrix
## Loaded glmnet 4.1-10
data(tecator)
str(absorp)
## num [1:215, 1:100] 2.62 2.83 2.58 2.82 2.79 ...
str(endpoints)
## num [1:215, 1:3] 60.5 46 71 72.8 58.3 44 44 69.3 61.4 61.4 ...
The matrix absorp contains the 100 absorbance values for the 215 samples, while the matrix endpoints contains the percent of moisture, fat, and protein in columns 1–3, respectively. Each response is extracted below; the models that follow use protein as the response.
moisture <- endpoints[, 1]
fat <- endpoints[, 2]
protein <- endpoints[, 3]
set.seed(123)
trainIndex <- createDataPartition(protein, p = 0.8, list = FALSE)
xTrain <- absorp[trainIndex, ]
xTest <- absorp[-trainIndex, ]
yTrain <- protein[trainIndex]
yTest <- protein[-trainIndex]
xTrain <- as.data.frame(xTrain)
xTest <- as.data.frame(xTest)
colnames(xTrain) <- paste0("X", 1:ncol(xTrain))
colnames(xTest) <- paste0("X", 1:ncol(xTest))
preProc <- preProcess(xTrain, method = c("center", "scale"))
xTrainPP <- predict(preProc, xTrain)
xTestPP <- predict(preProc, xTest)
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3)
olsModel <- train(xTrainPP, yTrain,
method = "lm",
trControl = ctrl)
print(olsModel)
## Linear Regression
##
## 174 samples
## 100 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 157, 155, 157, 155, 158, 156, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.117079 0.8769285 0.751686
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
pcrModel <- train(xTrainPP, yTrain,
method = "pcr",
tuneLength = 20,
trControl = ctrl)
print(pcrModel)
## Principal Component Analysis
##
## 174 samples
## 100 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 156, 155, 156, 157, 156, 157, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 2.9901392 0.1036154 2.5829502
## 2 2.6691267 0.2636436 2.1897169
## 3 2.2708709 0.4765607 1.8071305
## 4 1.7845684 0.6750643 1.3552680
## 5 1.3456269 0.8117595 1.0584717
## 6 1.1356199 0.8671517 0.9219539
## 7 1.1540602 0.8618650 0.9298584
## 8 1.1574524 0.8610975 0.9349763
## 9 1.1300358 0.8697604 0.9221245
## 10 1.0088042 0.8970313 0.8196008
## 11 0.9444093 0.9092578 0.7717222
## 12 0.8999303 0.9187049 0.7295890
## 13 0.7582463 0.9429959 0.5911553
## 14 0.7123571 0.9494063 0.5542515
## 15 0.6720993 0.9553850 0.5293108
## 16 0.6607770 0.9570576 0.5229481
## 17 0.6518348 0.9586442 0.5172783
## 18 0.6625307 0.9571303 0.5202673
## 19 0.6198191 0.9622897 0.4872179
## 20 0.6029205 0.9643313 0.4779011
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 20.
pcrModel$bestTune
## ncomp
## 20 20
plsModel <- train(xTrainPP, yTrain,
method = "pls",
tuneLength = 20,
trControl = ctrl)
print(plsModel)
## Partial Least Squares
##
## 174 samples
## 100 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 156, 156, 157, 157, 156, 157, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 2.9801533 0.08889222 2.5674248
## 2 2.3178640 0.43247165 1.8619224
## 3 1.8290788 0.64493409 1.3647547
## 4 1.6667105 0.70258508 1.2919152
## 5 1.1992338 0.86278034 0.9633271
## 6 1.1442514 0.87814819 0.9220121
## 7 1.0961146 0.88499669 0.8738019
## 8 0.9656962 0.91140080 0.7832121
## 9 0.9310960 0.91841989 0.7502324
## 10 0.8775449 0.92854621 0.7103302
## 11 0.7930296 0.93836259 0.6336363
## 12 0.7150815 0.95260212 0.5568171
## 13 0.6921816 0.95566411 0.5385290
## 14 0.6263154 0.96152533 0.4932441
## 15 0.6266531 0.96076045 0.4919418
## 16 0.6373419 0.95899020 0.4946603
## 17 0.6525812 0.95736991 0.5033797
## 18 0.6756133 0.95426585 0.5108220
## 19 0.6820005 0.95329483 0.5129135
## 20 0.7095159 0.94959341 0.5158419
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 14.
plsModel$bestTune
## ncomp
## 14 14
ridgeModel <- train(xTrainPP, yTrain,
method = "glmnet",
tuneGrid = expand.grid(alpha = 0,
lambda = seq(0.0001, 1, length = 50)),
trControl = ctrl)
print(ridgeModel)
## glmnet
##
## 174 samples
## 100 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 157, 157, 158, 156, 156, 156, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00010000 1.670741 0.7123142 1.298334
## 0.02050612 1.670741 0.7123142 1.298334
## 0.04091224 1.670741 0.7123142 1.298334
## 0.06131837 1.670741 0.7123142 1.298334
## 0.08172449 1.676051 0.7110949 1.302565
## 0.10213061 1.693497 0.7066988 1.315875
## 0.12253673 1.705256 0.7038702 1.324356
## 0.14294286 1.729200 0.6974846 1.346251
## 0.16334898 1.755243 0.6908104 1.370825
## 0.18375510 1.779263 0.6835301 1.394733
## 0.20416122 1.801308 0.6779394 1.415595
## 0.22456735 1.819189 0.6730506 1.433781
## 0.24497347 1.839509 0.6676355 1.454712
## 0.26537959 1.857449 0.6635130 1.473034
## 0.28578571 1.875538 0.6594865 1.491105
## 0.30619184 1.891545 0.6547795 1.509346
## 0.32659796 1.906108 0.6511605 1.524559
## 0.34700408 1.920522 0.6473683 1.538356
## 0.36741020 1.936688 0.6435648 1.554327
## 0.38781633 1.949469 0.6405395 1.568929
## 0.40822245 1.963642 0.6366942 1.584465
## 0.42862857 1.976224 0.6338766 1.597682
## 0.44903469 1.986125 0.6314804 1.608238
## 0.46944082 1.997441 0.6284177 1.619835
## 0.48984694 2.010669 0.6249500 1.632881
## 0.51025306 2.023242 0.6220365 1.646143
## 0.53065918 2.033580 0.6199770 1.658203
## 0.55106531 2.043620 0.6173507 1.669004
## 0.57147143 2.054329 0.6142015 1.679866
## 0.59187755 2.064578 0.6109072 1.689768
## 0.61228367 2.074446 0.6080548 1.700059
## 0.63268980 2.084494 0.6052765 1.710461
## 0.65309592 2.094211 0.6027231 1.720527
## 0.67350204 2.102722 0.6005444 1.729045
## 0.69390816 2.110486 0.5986974 1.737237
## 0.71431429 2.118323 0.5965717 1.745146
## 0.73472041 2.126428 0.5943324 1.753284
## 0.75512653 2.134772 0.5918797 1.761372
## 0.77553265 2.142347 0.5898147 1.768969
## 0.79593878 2.149812 0.5879033 1.776636
## 0.81634490 2.157009 0.5858597 1.783989
## 0.83675102 2.163976 0.5839473 1.791159
## 0.85715714 2.171057 0.5818797 1.798193
## 0.87756327 2.178073 0.5797669 1.805059
## 0.89796939 2.184958 0.5778681 1.811910
## 0.91837551 2.191765 0.5759173 1.818693
## 0.93878163 2.198196 0.5741264 1.825279
## 0.95918776 2.204465 0.5724401 1.831746
## 0.97959388 2.210787 0.5704820 1.838163
## 1.00000000 2.216855 0.5686198 1.844358
##
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 0.06131837.
ridgeModel$bestTune
## alpha lambda
## 4 0 0.06131837
enetModel <- train(xTrainPP, yTrain,
method = "glmnet",
tuneLength = 10,
trControl = ctrl)
## Warning: from glmnet C++ code (error code -91); Convergence for 91th lambda
## value not reached after maxit=100000 iterations; solutions for larger lambdas
## returned
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
print(enetModel)
## glmnet
##
## 174 samples
## 100 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 157, 156, 156, 156, 157, 157, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.1 0.0008688831 1.132864 0.87263625 0.9257707
## 0.1 0.0020072326 1.149729 0.86962454 0.9390245
## 0.1 0.0046369677 1.195122 0.85999787 0.9680117
## 0.1 0.0107119968 1.288128 0.83680503 1.0247196
## 0.1 0.0247461020 1.436767 0.79486168 1.1260556
## 0.1 0.0571667052 1.610740 0.74397941 1.2428241
## 0.1 0.1320625036 1.764029 0.70961406 1.3802521
## 0.1 0.3050815119 1.994094 0.66606446 1.6230704
## 0.1 0.7047778616 2.360095 0.56797876 1.9995968
## 0.2 0.0008688831 1.139212 0.87141319 0.9273377
## 0.2 0.0020072326 1.157161 0.86821204 0.9428813
## 0.2 0.0046369677 1.200661 0.85875671 0.9705810
## 0.2 0.0107119968 1.297409 0.83424687 1.0302459
## 0.2 0.0247461020 1.464836 0.78600971 1.1439402
## 0.2 0.0571667052 1.645049 0.73465119 1.2681371
## 0.2 0.1320625036 1.804141 0.70674878 1.4247617
## 0.2 0.3050815119 2.110947 0.65428922 1.7559443
## 0.2 0.7047778616 2.638272 0.41684848 2.2606890
## 0.3 0.0008688831 1.153743 0.86934307 0.9405393
## 0.3 0.0020072326 1.159689 0.86776962 0.9456368
## 0.3 0.0046369677 1.203003 0.85805146 0.9717054
## 0.3 0.0107119968 1.305288 0.83194466 1.0351921
## 0.3 0.0247461020 1.491737 0.77761545 1.1614205
## 0.3 0.0571667052 1.663307 0.73156985 1.2834328
## 0.3 0.1320625036 1.847669 0.70408041 1.4716676
## 0.3 0.3050815119 2.242870 0.63251195 1.8954295
## 0.3 0.7047778616 2.912845 0.12845160 2.5129449
## 0.4 0.0008688831 1.158099 0.86739159 0.9390232
## 0.4 0.0020072326 1.161862 0.86681112 0.9436677
## 0.4 0.0046369677 1.206700 0.85701481 0.9735639
## 0.4 0.0107119968 1.315788 0.82896624 1.0417616
## 0.4 0.0247461020 1.521496 0.76814782 1.1810494
## 0.4 0.0571667052 1.679311 0.72930699 1.2977282
## 0.4 0.1320625036 1.893167 0.70164851 1.5261134
## 0.4 0.3050815119 2.387832 0.58980034 2.0354402
## 0.4 0.7047778616 2.956442 0.09118037 2.5552920
## 0.5 0.0008688831 1.144923 0.86980627 0.9303323
## 0.5 0.0020072326 1.154589 0.86858836 0.9399166
## 0.5 0.0046369677 1.209476 0.85638926 0.9745527
## 0.5 0.0107119968 1.324696 0.82635833 1.0481666
## 0.5 0.0247461020 1.548273 0.75959363 1.1989859
## 0.5 0.0571667052 1.693483 0.72785705 1.3118398
## 0.5 0.1320625036 1.940852 0.69886090 1.5829894
## 0.5 0.3050815119 2.543746 0.50634209 2.1806695
## 0.5 0.7047778616 2.963903 0.09068894 2.5629450
## 0.6 0.0008688831 1.136844 0.87174602 0.9257628
## 0.6 0.0020072326 1.153510 0.86867056 0.9395491
## 0.6 0.0046369677 1.210837 0.85607561 0.9757516
## 0.6 0.0107119968 1.335082 0.82320909 1.0554429
## 0.6 0.0247461020 1.572266 0.75212665 1.2150271
## 0.6 0.0571667052 1.707107 0.72690708 1.3261399
## 0.6 0.1320625036 1.991203 0.69536370 1.6408638
## 0.6 0.3050815119 2.708592 0.35581848 2.3301357
## 0.6 0.7047778616 2.972790 0.09025289 2.5706559
## 0.7 0.0008688831 1.127760 0.87378445 0.9177281
## 0.7 0.0020072326 1.152911 0.86875088 0.9388379
## 0.7 0.0046369677 1.211202 0.85598019 0.9753019
## 0.7 0.0107119968 1.343669 0.82065529 1.0612765
## 0.7 0.0247461020 1.589122 0.74665735 1.2259516
## 0.7 0.0571667052 1.718704 0.72661236 1.3389614
## 0.7 0.1320625036 2.043558 0.69084719 1.6995049
## 0.7 0.3050815119 2.870991 0.17068567 2.4770313
## 0.7 0.7047778616 2.982939 0.09001173 2.5788824
## 0.8 0.0008688831 1.133870 0.87224764 0.9235944
## 0.8 0.0020072326 1.146987 0.86986029 0.9347434
## 0.8 0.0046369677 1.208530 0.85629021 0.9734144
## 0.8 0.0107119968 1.352576 0.81765078 1.0672421
## 0.8 0.0247461020 1.599600 0.74374380 1.2336469
## 0.8 0.0571667052 1.729804 0.72650260 1.3519093
## 0.8 0.1320625036 2.097844 0.68478957 1.7576029
## 0.8 0.3050815119 2.950960 0.09312074 2.5495587
## 0.8 0.7047778616 2.994423 0.08994152 2.5880428
## 0.9 0.0008688831 1.121002 0.87520629 0.9106409
## 0.9 0.0020072326 1.142602 0.87106683 0.9298709
## 0.9 0.0046369677 1.205311 0.85704195 0.9709901
## 0.9 0.0107119968 1.358564 0.81540367 1.0707640
## 0.9 0.0247461020 1.602853 0.74312953 1.2360477
## 0.9 0.0571667052 1.738844 0.72666307 1.3632705
## 0.9 0.1320625036 2.149982 0.67803272 1.8123374
## 0.9 0.3050815119 2.955401 0.09105170 2.5548886
## 0.9 0.7047778616 3.006879 0.07700828 2.5972123
## 1.0 0.0008688831 1.131704 0.87424192 0.9192826
## 1.0 0.0020072326 1.137802 0.87257445 0.9244939
## 1.0 0.0046369677 1.183879 0.86216177 0.9536058
## 1.0 0.0107119968 1.353357 0.81613436 1.0657189
## 1.0 0.0247461020 1.608448 0.74067224 1.2393797
## 1.0 0.0571667052 1.742688 0.72539100 1.3655449
## 1.0 0.1320625036 2.185428 0.67162491 1.8485802
## 1.0 0.3050815119 2.957913 0.09103786 2.5582106
## 1.0 0.7047778616 3.018973 0.06858504 2.6046184
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.9 and lambda = 0.0008688831.
enetModel$bestTune
## alpha lambda
## 73 0.9 0.0008688831
Several linear and regularized regression models were trained using repeated 10-fold cross-validation in order to determine the optimal tuning parameters and compare model performance.
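The resampling scheme can be sketched in base R. This is a minimal illustration on simulated data, not caret's implementation: the function name repeated_cv_rmse and the toy data are hypothetical, and folds are assigned purely at random here.

```r
# Minimal repeated k-fold cross-validation for a linear model (base R only).
set.seed(1)
n <- 120
x <- rnorm(n)
dat <- data.frame(x = x, y = 2 * x + rnorm(n, sd = 0.5))

repeated_cv_rmse <- function(dat, k = 10, repeats = 3) {
  rmse <- numeric(0)
  for (r in seq_len(repeats)) {
    # Randomly assign each row to one of k folds of near-equal size
    fold <- sample(rep(seq_len(k), length.out = nrow(dat)))
    for (i in seq_len(k)) {
      fit  <- lm(y ~ x, data = dat[fold != i, ])          # fit on k - 1 folds
      pred <- predict(fit, newdata = dat[fold == i, ])    # predict the held-out fold
      rmse <- c(rmse, sqrt(mean((dat$y[fold == i] - pred)^2)))
    }
  }
  rmse
}

cv_rmse <- repeated_cv_rmse(dat)
length(cv_rmse)  # 30 held-out estimates (10 folds x 3 repeats)
mean(cv_rmse)    # the averaged value is what a tuning table reports
```

Each row of the tuning tables above is an average of this kind over the 30 held-out folds.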
For the ordinary least squares (OLS) model, there are no tuning parameters. The cross-validated performance of the OLS model was an RMSE of approximately 1.117 and an R² of 0.877, indicating that the model explains about 87.7% of the variation in protein content.
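For reference, the three reported metrics can be computed by hand. The sketch below uses made-up observed and predicted values (hypothetical numbers, not taken from the Tecator data) and computes R² as the squared correlation between observed and predicted values, which is how caret's postResample() defines it.

```r
# Hypothetical observed and predicted values, for illustration only
obs  <- c(17.2, 18.5, 19.1, 20.3, 16.8)
pred <- c(17.0, 18.9, 18.7, 20.0, 17.5)

rmse <- sqrt(mean((obs - pred)^2))   # root mean squared error
mae  <- mean(abs(obs - pred))        # mean absolute error
r2   <- cor(obs, pred)^2             # squared correlation

round(c(RMSE = rmse, Rsquared = r2, MAE = mae), 3)
```

RMSE penalizes large errors more heavily than MAE, so the two can rank models differently when a few samples are predicted badly.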
For principal components regression (PCR), the tuning parameter is the number of principal components used in the model. Cross-validation selected 20 components, with an RMSE of about 0.603. Because 20 is the largest value in the grid (tuneLength = 20) and the RMSE was still decreasing at that point, a wider grid might select an even larger number of components.
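The idea behind PCR can be sketched in two steps: compute principal component scores from the predictors, then regress the response on the first few scores. The example below is a minimal illustration on simulated collinear data; the helper pcr_fit and all variable names are hypothetical, and it reports an in-sample fit rather than a cross-validated one.

```r
set.seed(2)
n <- 100
z <- rnorm(n)                                            # one latent signal
X <- sapply(1:10, function(j) z + rnorm(n, sd = 0.1))    # 10 highly correlated predictors
y <- 3 * z + rnorm(n, sd = 0.5)

# Step 1: principal components of the centered and scaled predictors
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Step 2: ordinary regression of y on the first ncomp component scores
pcr_fit <- function(ncomp) {
  scores <- pca$x[, seq_len(ncomp), drop = FALSE]
  lm(y ~ scores)
}

fit1 <- pcr_fit(1)
summary(fit1)$r.squared   # in-sample R-squared using a single component
```

The components are chosen without looking at y, which is why PCR can need more components than PLS to reach the same error.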
For partial least squares (PLS) regression, the tuning parameter is also the number of components. The optimal value selected through cross-validation was 14 components, which minimized the cross-validated RMSE at about 0.626.
For ridge regression, alpha is fixed at 0 (which makes glmnet fit a pure ridge penalty), so lambda is the only parameter tuned. caret reported an optimal lambda of approximately 0.0613 with a cross-validated RMSE of about 1.671; the three smaller lambda values in the grid gave the same RMSE.
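The shrinkage effect of the ridge penalty can be seen from its closed-form solution. This is a minimal base-R sketch on simulated data with a hypothetical helper ridge_coef; glmnet scales its objective differently, so the lambda values used here are not interchangeable with those in the table above.

```r
set.seed(3)
n <- 80; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))            # centered and scaled predictors
y <- drop(X %*% c(2, -1, 0, 0, 1) + rnorm(n))
yc <- y - mean(y)                                 # centered response, so no intercept is needed

# Closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

b_ols   <- ridge_coef(X, yc, lambda = 0)    # lambda = 0 reproduces least squares
b_ridge <- ridge_coef(X, yc, lambda = 50)

sum(b_ridge^2) < sum(b_ols^2)  # TRUE: the penalty shrinks the coefficient vector
```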
For the elastic net model, both alpha and lambda are tuned. The optimal parameter values were alpha = 0.9 and lambda ≈ 0.0008689, the smallest lambda in the grid, with a cross-validated RMSE of about 1.121. This combination provided the best predictive performance among the elastic net models tested.
Overall, among the linear models, PCR (cross-validated RMSE ≈ 0.603) and PLS (≈ 0.626) performed the best, producing lower prediction errors than OLS (≈ 1.117), elastic net (≈ 1.121), and ridge regression (≈ 1.671).
set.seed(123)
svmModel <- train(
xTrain, yTrain,
method = "svmRadial",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = ctrl
)
svmModel$bestTune
## sigma C
## 7 0.1923092 16
set.seed(123)
nnetModel <- train(
xTrain[,1:20], yTrain,
method = "nnet",
preProcess = c("center", "scale"),
tuneLength = 3,
trControl = ctrl,
linout = TRUE,
trace = FALSE,
MaxNWts = 5000
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
print(nnetModel)
## Neural Network
##
## 174 samples
## 20 predictor
##
## Pre-processing: centered (20), scaled (20)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 155, 156, 157, 158, 158, 156, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0e+00 1.691188 0.7353854 1.3468138
## 1 1e-04 1.436758 0.8002830 1.1164499
## 1 1e-01 1.508905 0.7592012 1.2004840
## 3 0e+00 1.172000 0.8526632 0.9200893
## 3 1e-04 1.192154 0.8526737 0.9362190
## 3 1e-01 1.312175 0.8330825 1.0653557
## 5 0e+00 1.174654 0.8621791 0.9118245
## 5 1e-04 1.277993 0.8637518 0.9321798
## 5 1e-01 1.238331 0.8524738 1.0110993
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3 and decay = 0.
plot(nnetModel)
nnetModel$bestTune
## size decay
## 4 3 0
set.seed(123)
nnetModelPCA <- train(
xTrain, yTrain,
method = "nnet",
preProcess = c("center", "scale", "pca"),
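# thresh is not an argument of train(); the PCA cutoff is set through
# trainControl(preProcOptions = list(thresh = ...)), whose default is 0.95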
tuneLength = 10,
trControl = ctrl,
linout = TRUE,
trace = FALSE
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
plot(nnetModelPCA)
nnetModelPCA$bestTune
## size decay
## 10 1 0.1
preProc_pca <- preProcess(xTrain,
method = c("center","scale","pca"),
thresh = 0.95)
xTrainPCA <- predict(preProc_pca, xTrain)
xTestPCA <- predict(preProc_pca, xTest)
nnetModelPCA <- train(xTrainPCA, yTrain,
method = "nnet",
tuneLength = 10,
trControl = ctrl,
linout = TRUE,
trace = FALSE)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
print(nnetModelPCA)
## Neural Network
##
## 174 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 156, 156, 158, 156, 157, 157, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0.0000000000 2.939610 0.1490690 2.480775
## 1 0.0001000000 2.967434 0.1143311 2.515494
## 1 0.0002371374 2.961758 0.1170152 2.515184
## 1 0.0005623413 2.911365 0.1364376 2.462972
## 1 0.0013335214 2.943014 0.1176256 2.462667
## 1 0.0031622777 2.942867 0.1270070 2.467086
## 1 0.0074989421 2.831295 0.1613386 2.370223
## 1 0.0177827941 2.873677 0.1538602 2.380287
## 1 0.0421696503 2.732948 0.2183242 2.223266
## 1 0.1000000000 2.681266 0.2405818 2.177847
## 3 0.0000000000 2.876196 0.1863730 2.326274
## 3 0.0001000000 2.735734 0.2462361 2.224755
## 3 0.0002371374 2.863205 0.2012773 2.315444
## 3 0.0005623413 2.844877 0.1939910 2.295863
## 3 0.0013335214 2.808860 0.1965707 2.288847
## 3 0.0031622777 2.795907 0.2067317 2.270304
## 3 0.0074989421 2.736621 0.2352326 2.210460
## 3 0.0177827941 2.749625 0.2417794 2.224794
## 3 0.0421696503 2.753359 0.2201660 2.244387
## 3 0.1000000000 2.736227 0.2286963 2.257246
## 5 0.0000000000 3.044560 0.2245292 2.360498
## 5 0.0001000000 3.203075 0.1805473 2.429635
## 5 0.0002371374 3.133315 0.1979409 2.432573
## 5 0.0005623413 2.935513 0.1994660 2.350788
## 5 0.0013335214 2.840674 0.2304277 2.266794
## 5 0.0031622777 2.784213 0.2120429 2.254797
## 5 0.0074989421 2.769215 0.2316305 2.263464
## 5 0.0177827941 2.738943 0.2340686 2.227455
## 5 0.0421696503 2.823984 0.2008214 2.300417
## 5 0.1000000000 2.825713 0.1961081 2.291556
## 7 0.0000000000 4.174699 0.2098005 2.664306
## 7 0.0001000000 2.988379 0.2177363 2.325330
## 7 0.0002371374 3.110022 0.1943095 2.437304
## 7 0.0005623413 2.984143 0.1970117 2.330411
## 7 0.0013335214 2.985076 0.1942810 2.365482
## 7 0.0031622777 2.915052 0.1922104 2.362055
## 7 0.0074989421 2.864562 0.2180772 2.326982
## 7 0.0177827941 2.944994 0.1725525 2.375837
## 7 0.0421696503 2.840651 0.2119458 2.321386
## 7 0.1000000000 2.873263 0.1906863 2.307895
## 9 0.0000000000 3.116953 0.1910331 2.395715
## 9 0.0001000000 3.126657 0.1966819 2.389838
## 9 0.0002371374 3.127962 0.1779046 2.421191
## 9 0.0005623413 3.107145 0.1895626 2.430823
## 9 0.0013335214 3.176595 0.1737725 2.461648
## 9 0.0031622777 2.948226 0.1951105 2.346934
## 9 0.0074989421 3.024679 0.1954591 2.373059
## 9 0.0177827941 2.949504 0.1886555 2.359724
## 9 0.0421696503 2.824906 0.2105526 2.323047
## 9 0.1000000000 2.923371 0.1840563 2.367401
## 11 0.0000000000 3.655300 0.1672209 2.604659
## 11 0.0001000000 3.422088 0.2133081 2.468645
## 11 0.0002371374 3.446724 0.1685289 2.551577
## 11 0.0005623413 3.130085 0.2203829 2.406570
## 11 0.0013335214 3.203476 0.1984398 2.510455
## 11 0.0031622777 3.263917 0.1627587 2.531744
## 11 0.0074989421 2.839644 0.2501677 2.277531
## 11 0.0177827941 3.042395 0.1801750 2.421804
## 11 0.0421696503 2.928057 0.2026704 2.359920
## 11 0.1000000000 2.889287 0.2071551 2.303071
## 13 0.0000000000 3.527032 0.2137494 2.589298
## 13 0.0001000000 3.813028 0.1737313 2.689774
## 13 0.0002371374 3.671053 0.1818015 2.652719
## 13 0.0005623413 3.029418 0.2313069 2.413932
## 13 0.0013335214 3.433354 0.1785621 2.632108
## 13 0.0031622777 3.161889 0.1721130 2.450763
## 13 0.0074989421 3.098891 0.2202784 2.409686
## 13 0.0177827941 2.990834 0.2220976 2.338248
## 13 0.0421696503 3.029193 0.1874675 2.424691
## 13 0.1000000000 2.840790 0.2105632 2.296599
## 15 0.0000000000 3.573523 0.1686685 2.655748
## 15 0.0001000000 3.189311 0.2091657 2.431901
## 15 0.0002371374 3.417035 0.2012613 2.578170
## 15 0.0005623413 3.563977 0.1741389 2.639988
## 15 0.0013335214 3.486980 0.1624015 2.597678
## 15 0.0031622777 3.289315 0.1948260 2.475566
## 15 0.0074989421 3.251891 0.1777561 2.487665
## 15 0.0177827941 3.247078 0.1551035 2.549808
## 15 0.0421696503 2.999743 0.2295141 2.371116
## 15 0.1000000000 2.835981 0.2234702 2.287148
## 17 0.0000000000 3.893586 0.1674957 2.721335
## 17 0.0001000000 3.546441 0.1699010 2.674751
## 17 0.0002371374 3.664207 0.1981669 2.652771
## 17 0.0005623413 3.808506 0.1696660 2.740982
## 17 0.0013335214 3.254444 0.2272210 2.527033
## 17 0.0031622777 3.266977 0.1766668 2.539712
## 17 0.0074989421 3.184344 0.1937495 2.483705
## 17 0.0177827941 3.236112 0.2053791 2.437137
## 17 0.0421696503 2.952532 0.2277163 2.347316
## 17 0.1000000000 2.930079 0.2186223 2.336776
## 19 0.0000000000 3.836917 0.1826461 2.739651
## 19 0.0001000000 4.020703 0.1781642 2.809847
## 19 0.0002371374 3.839409 0.1866567 2.770110
## 19 0.0005623413 3.612012 0.1536347 2.721001
## 19 0.0013335214 3.701829 0.1862551 2.664894
## 19 0.0031622777 3.730880 0.1704762 2.660394
## 19 0.0074989421 3.111604 0.2322848 2.434617
## 19 0.0177827941 3.100717 0.2317718 2.404876
## 19 0.0421696503 2.956090 0.2487289 2.309050
## 19 0.1000000000 2.857762 0.2505672 2.278374
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.1.
marsModel <- train(xTrainPP, yTrain,
method = "earth",
tuneLength = 10,
trControl = ctrl)
## Loading required package: earth
## Warning: package 'earth' was built under R version 4.5.2
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.5.2
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 4.5.2
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
print(marsModel)
## Multivariate Adaptive Regression Spline
##
## 174 samples
## 100 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 157, 156, 157, 156, 157, 156, ...
## Resampling results across tuning parameters:
##
## nprune RMSE Rsquared MAE
## 2 2.9810180 0.07580469 2.5512099
## 4 2.3421668 0.41094966 1.8466776
## 7 1.2086132 0.84426271 0.9918164
## 10 1.0010828 0.89799055 0.8304056
## 12 0.9621764 0.90339883 0.7732088
## 15 0.9466728 0.90726637 0.7545232
## 18 0.9397485 0.90927092 0.7416214
## 20 0.9877486 0.90143283 0.7585493
## 23 0.9886761 0.90138064 0.7606024
## 26 0.9887581 0.90131790 0.7613587
##
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 18 and degree = 1.
marsModel$bestTune
## nprune degree
## 7 18 1
knnModel <- train(xTrainPP, yTrain,
method = "knn",
tuneLength = 20,
trControl = ctrl)
print(knnModel)
## k-Nearest Neighbors
##
## 174 samples
## 100 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 155, 156, 157, 157, 157, 157, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 2.474668 0.3690866 2.057197
## 7 2.382824 0.4358612 2.022620
## 9 2.507344 0.3692912 2.136867
## 11 2.601757 0.3212522 2.223514
## 13 2.656564 0.2887736 2.258285
## 15 2.689449 0.2820429 2.296097
## 17 2.705965 0.2780392 2.315008
## 19 2.755093 0.2608397 2.355096
## 21 2.793495 0.2482882 2.387564
## 23 2.811914 0.2390692 2.402781
## 25 2.839308 0.2073361 2.419849
## 27 2.852970 0.1961659 2.433550
## 29 2.898042 0.1526846 2.470833
## 31 2.917017 0.1348000 2.483757
## 33 2.934026 0.1312955 2.503398
## 35 2.940928 0.1340529 2.510566
## 37 2.949119 0.1224412 2.522709
## 39 2.950559 0.1314187 2.532524
## 41 2.959944 0.1257058 2.546609
## 43 2.956398 0.1372955 2.544466
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
knnModel$bestTune
## k
## 2 7
Several nonlinear models were also trained and tuned using cross-validation.
For the support vector machine (SVM) model with a radial kernel, the optimal parameters were sigma = 0.1923 and C = 16. These parameters were selected because they minimized the cross-validated prediction error (mean RMSE ≈ 1.79).
For the neural network model without PCA preprocessing, which was fitted on only the first 20 absorbance columns (xTrain[, 1:20]), the optimal tuning parameters were size = 3 and decay = 0, with a cross-validated RMSE of about 1.172. This configuration produced the best cross-validated performance among the neural network models tested without dimensionality reduction.
A second neural network model was trained using PCA preprocessing, where the predictors were reduced to the two principal components that explained 95% of the variance in the data. For this model, the optimal parameters were size = 1 and decay = 0.1. However, the predictive performance of this model (cross-validated RMSE ≈ 2.68) was worse than that of the neural network model without PCA (≈ 1.17), suggesting that reducing the predictors to only two components removed useful predictive information.
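Why a 95% variance threshold can leave so few components is easier to see on simulated data. The sketch below (hypothetical variables, base R only) builds 30 predictors from two shared signals, much like neighbouring absorbance channels, and counts the components needed to reach the threshold.

```r
set.seed(4)
n <- 150
base1 <- rnorm(n)
base2 <- rnorm(n)

# 30 predictors, each a mix of the two shared signals plus a little noise
X <- sapply(1:30, function(j) {
  cos(j / 10) * base1 + sin(j / 10) * base2 + rnorm(n, sd = 0.05)
})

pca <- prcomp(X, center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)   # cumulative proportion of variance

n_keep <- which(cum_var >= 0.95)[1]   # smallest number of components reaching 95%
n_keep
```

Variance in the predictors is not the same as relevance to the response: the comparison with PCR above, which needed many more components, suggests that the later, low-variance components still carry signal about protein.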
For the MARS (Multivariate Adaptive Regression Splines) model, the optimal tuning parameters were nprune = 18, which controls the number of terms in the model, and degree = 1, which restricts the model to additive effects without interactions.
For the k-nearest neighbors (KNN) model, the optimal number of neighbors was k = 7, which produced the lowest cross-validated prediction error among the tested values.
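The KNN prediction rule itself is short enough to write out. This is a minimal base-R sketch with a hypothetical helper knn_predict and toy data; it averages the responses of the k training rows closest in Euclidean distance.

```r
knn_predict <- function(x_train, y_train, x_new, k) {
  apply(x_new, 1, function(row) {
    d <- sqrt(colSums((t(x_train) - row)^2))   # Euclidean distance to every training row
    mean(y_train[order(d)[seq_len(k)]])        # average the k nearest responses
  })
}

x_train <- matrix(c(0, 1, 2, 3, 10), ncol = 1)
y_train <- c(0, 1, 2, 3, 10)
x_new   <- matrix(c(1.1, 9), ncol = 1)

knn_predict(x_train, y_train, x_new, k = 3)   # 1 5
```

The second prediction (5 for a point near 10) shows how averaging over neighbors pulls predictions toward the bulk of the data. With 100 correlated predictors, distances are also dominated by overall spectrum level, which may explain part of KNN's weak results here.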
e. Which model from parts c) and d) has the best predictive ability? Is any model significantly better or worse than the others?
resamps_nonlinear <- resamples(list(
SVM = svmModel,
NN_PCA = nnetModelPCA,
MARS = marsModel,
KNN = knnModel
))
summary(resamps_nonlinear)
##
## Call:
## summary.resamples(object = resamps_nonlinear)
##
## Models: SVM, NN_PCA, MARS, KNN
## Number of resamples: 30
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM 0.6421776 1.0905910 1.2882014 1.2755859 1.4392327 2.071659 0
## NN_PCA 1.6332001 1.9666440 2.1869262 2.1778474 2.3906434 2.659430 0
## MARS 0.4754447 0.6499909 0.7330634 0.7416214 0.8296624 1.039628 0
## KNN 1.5529135 1.7767609 2.0124504 2.0226197 2.2095063 2.887290 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM 0.8293109 1.5597191 1.8370109 1.7912305 2.067236 2.800037 0
## NN_PCA 1.9213469 2.4445287 2.6908034 2.6812664 2.821700 3.363098 0
## MARS 0.6175776 0.8110667 0.9440813 0.9397485 1.051180 1.290648 0
## KNN 1.9145792 2.2279792 2.3487674 2.3828244 2.541442 3.271495 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM 0.31712738 0.6045432 0.6659600 0.6696664 0.7546262 0.9363666 0
## NN_PCA 0.01382998 0.1359898 0.2343597 0.2405818 0.3255537 0.5313923 0
## MARS 0.84910287 0.8877218 0.9117355 0.9092709 0.9328852 0.9722443 0
## KNN 0.08720124 0.3411169 0.4571343 0.4358612 0.5228570 0.8241962 0
bwplot(resamps_nonlinear)
models_list <- list(
OLS = olsModel,
PCR = pcrModel,
PLS = plsModel,
Ridge = ridgeModel,
ENET = enetModel,
SVM = svmModel,
MARS = marsModel,
KNN = knnModel
)
test_results <- lapply(models_list, function(mod) {
pred <- predict(mod, xTestPP)
postResample(pred, yTest)
})
test_results
## $OLS
## RMSE Rsquared MAE
## 1.1510006 0.8744041 0.7897741
##
## $PCR
## RMSE Rsquared MAE
## 0.7734525 0.9397420 0.5703391
##
## $PLS
## RMSE Rsquared MAE
## 0.8071728 0.9332151 0.6135718
##
## $Ridge
## RMSE Rsquared MAE
## 1.4363106 0.8180923 1.1094682
##
## $ENET
## RMSE Rsquared MAE
## 1.0378016 0.8825287 0.8235833
##
## $SVM
## RMSE Rsquared MAE
## 3.0255739 0.1192418 2.6193183
##
## $MARS
## RMSE Rsquared MAE
## 0.8633784 0.9100796 0.6825461
##
## $KNN
## RMSE Rsquared MAE
## 2.1325705 0.4868149 1.7132404
The predictive performance of the models was compared using the test set results. The principal components regression (PCR) model produced the best predictive performance, with a test RMSE of approximately 0.773 and an R² of about 0.94. This indicates that the model explains approximately 94% of the variability in protein content on the test data.
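The PLS (test RMSE ≈ 0.807) and MARS (≈ 0.863) models also performed well, but their prediction errors were slightly higher than that of PCR. In contrast, the SVM (RMSE ≈ 3.03, R² ≈ 0.12) and KNN (RMSE ≈ 2.13, R² ≈ 0.49) models performed poorly relative to the other models. The SVM figure should be read with caution: svmModel was trained on xTrain with centering and scaling applied inside train(), so passing the already standardized xTestPP to predict() standardizes the inputs twice. Its cross-validated RMSE (≈ 1.79) is a fairer indication of its performance, although that is still well behind MARS, PCR, and PLS.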
The superior performance of PCR is likely due to the high number of predictors and the strong correlation among them. Dimension-reduction methods such as PCR are particularly effective in this situation because they transform the predictors into a smaller set of uncorrelated components that capture most of the variation in the data.
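One caveat about the test-set loop: svmModel was fitted with preProcess inside train() on the raw xTrain, while the loop passes the already standardized xTestPP to every model. A minimal base-R sketch on simulated data (all names hypothetical) shows what applying the same centering and scaling twice does to the inputs.

```r
set.seed(5)
train <- matrix(rnorm(200, mean = 3, sd = 0.5), ncol = 2)
test  <- matrix(rnorm(40,  mean = 3, sd = 0.5), ncol = 2)

# Centering and scaling constants are estimated on the training data only
ctr <- colMeans(train)
scl <- apply(train, 2, sd)
standardize <- function(m) sweep(sweep(m, 2, ctr, "-"), 2, scl, "/")

once  <- standardize(test)                # inputs on the scale the model was trained on
twice <- standardize(standardize(test))   # pre-standardized data standardized again

round(colMeans(once), 2)    # close to 0
round(colMeans(twice), 2)   # far from 0: shifted by roughly -mean/sd
```

A kernel method such as the radial SVM is sensitive to this shift because every test point lands far from all training points. Scoring svmModel on xTest instead of xTestPP would avoid the double transformation.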
2. Developing a model to predict permeability (see Sect. 1.4 of the textbook) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.5.2
data(permeability)
str(fingerprints)
## num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...
str(permeability)
## num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr "permeability"
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response for each compound.
nzv <- nearZeroVar(fingerprints)
fpFiltered <- fingerprints[, -nzv]
dim(fpFiltered)
## [1] 165 388
The nearZeroVar() function was used to identify and remove predictors with very little variation across observations. Predictors with near-zero variance provide little useful information for prediction and can negatively affect model performance.
After applying this filter, the number of predictors was reduced from 1,107 to 388, leaving only the predictors that contained sufficient variability for modeling.
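The filter rests on two statistics per column: the ratio of the most common value's frequency to the second most common, and the percentage of distinct values. The sketch below is a simplified base-R version with a hypothetical helper near_zero_var, using cutoffs that match caret's documented defaults (freqCut = 95/5, uniqueCut = 10); caret's handling of missing values and other edge cases is not reproduced.

```r
near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  flagged <- apply(x, 2, function(col) {
    counts <- sort(table(col), decreasing = TRUE)
    if (length(counts) == 1) return(TRUE)             # constant column: zero variance
    freq_ratio     <- counts[[1]] / counts[[2]]       # most common vs. second most common
    percent_unique <- 100 * length(counts) / length(col)
    freq_ratio > freq_cut && percent_unique <= unique_cut
  })
  which(flagged)
}

# Toy binary fingerprints: one rare bit, one balanced bit, one constant bit
fp <- cbind(
  rare     = c(rep(0, 98), rep(1, 2)),   # 98:2 split, frequency ratio 49
  balanced = rep(c(0, 1), 50),           # 50:50 split, frequency ratio 1
  constant = rep(0, 100)
)
near_zero_var(fp)   # flags columns 1 (rare) and 3 (constant)
```

For binary fingerprints the percent-unique condition is almost always met, so in practice the frequency ratio decides which bits are dropped.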
trainIndex2 <- createDataPartition(permeability, p = 0.8, list = FALSE)
xTrain2 <- fpFiltered[trainIndex2, ]
xTest2 <- fpFiltered[-trainIndex2, ]
yTrain2 <- permeability[trainIndex2]
yTest2 <- permeability[-trainIndex2]
plsModel_perm <- train(xTrain2, yTrain2,
method = "pls",
tuneLength = 20,
trControl = ctrl)
print(plsModel_perm)
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 120, 120, 121, 119, 120, 121, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.43369 0.2839322 10.139471
## 2 11.62232 0.4573371 8.277318
## 3 11.38310 0.4838853 8.729371
## 4 11.45801 0.4778862 8.902867
## 5 11.36482 0.4914225 8.843947
## 6 11.46275 0.4804169 8.847717
## 7 11.57633 0.4695458 9.034510
## 8 11.58755 0.4758710 8.921608
## 9 11.79378 0.4677705 9.003738
## 10 12.10930 0.4558820 9.340398
## 11 12.24127 0.4510499 9.356184
## 12 12.57262 0.4283822 9.566781
## 13 12.66535 0.4282678 9.704305
## 14 12.98611 0.4152786 9.933925
## 15 13.41303 0.3944506 10.186739
## 16 13.72884 0.3826145 10.379597
## 17 14.13929 0.3596421 10.658207
## 18 14.30782 0.3535477 10.722201
## 19 14.37893 0.3524029 10.829782
## 20 14.43191 0.3545621 10.884327
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 5.
plsModel_perm$bestTune
## ncomp
## 5 5
A partial least squares (PLS) model was trained using repeated 10-fold cross-validation to determine the optimal number of latent variables. The resampled RMSE was smallest at five components (RMSE of about 11.36), so ncomp = 5 was selected.
The cross-validated R² value for this model was approximately 0.491, indicating that the model explains about 49% of the variation in permeability during cross-validation.
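As a small check on how the number of components is chosen, the sketch below copies the first eight rows of the resampling table printed above into a data frame and picks the row with the smallest RMSE. The data frame name cv is an assumption made for illustration; with the fitted object available, the same table is stored in plsModel_perm$results.

```r
# First eight rows of the PLS resampling table, copied from the printed output
cv <- data.frame(
  ncomp = 1:8,
  RMSE = c(13.43369, 11.62232, 11.38310, 11.45801,
           11.36482, 11.46275, 11.57633, 11.58755),
  Rsquared = c(0.2839322, 0.4573371, 0.4838853, 0.4778862,
               0.4914225, 0.4804169, 0.4695458, 0.4758710)
)

# Row with the smallest cross-validated RMSE (the default selection rule)
best <- cv[which.min(cv$RMSE), ]
print(best)

# How much worse is the simpler three-component model?
gap <- cv$RMSE[cv$ncomp == 3] - best$RMSE
print(gap)
```

The gap between three and five components is only about 0.02 RMSE units, so a more parsimonious three-component model would likely perform almost identically.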
pred_pls_perm <- predict(plsModel_perm, xTest2)
postResample(pred_pls_perm, yTest2)
## RMSE Rsquared MAE
## 11.9169369 0.5123017 7.7494084
The PLS model was then evaluated using the test data to assess its predictive performance on unseen observations. The model achieved a test RMSE of approximately 11.92, a mean absolute error (MAE) of about 7.75, and an R² value of approximately 0.512.
These results indicate that the model explains about 51% of the variability in permeability on the test set and demonstrates moderate predictive accuracy.
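For readers who want to see what the three test-set numbers mean, the sketch below computes them by hand on a made-up pair of vectors (obs and pred are illustrative values, not the permeability data). caret's postResample() reports R² as the squared correlation between predictions and observations, which is not the same as 1 - SSE/SST, so the example computes both.

```r
# Made-up observed and predicted values, for illustration only
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 2, 2.5, 5)

rmse <- sqrt(mean((pred - obs)^2))  # root mean squared error
mae  <- mean(abs(pred - obs))       # mean absolute error
r2   <- cor(pred, obs)^2            # squared correlation, as caret reports it

# Alternative definition based on sums of squares
r2_sse <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

print(c(RMSE = rmse, Rsquared = r2, MAE = mae, R2_from_SSE = r2_sse))
```

Because the squared correlation ignores bias and scale errors in the predictions, it can look more favorable than the sum-of-squares version; RMSE and MAE are therefore the safer numbers to compare across models.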
enetModel_perm <- train(xTrain2, yTrain2,
method = "glmnet",
tuneLength = 10,
trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
print(enetModel_perm)
## glmnet
##
## 133 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 119, 120, 121, 120, 120, 120, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.1 0.2783191 12.04807 0.43562614 8.913157
## 0.1 0.4230203 12.04807 0.43562614 8.913157
## 0.1 0.6429533 12.04807 0.43562614 8.913157
## 0.1 0.9772319 11.98926 0.43765660 8.867830
## 0.1 1.4853056 11.69428 0.44701885 8.706850
## 0.1 2.2575324 11.38101 0.46298459 8.494550
## 0.1 3.4312485 11.15798 0.48069966 8.263376
## 0.1 5.2151926 11.17055 0.48218357 8.201340
## 0.1 7.9266290 11.26752 0.48040970 8.190025
## 0.1 12.0477713 11.42023 0.48057103 8.375959
## 0.2 0.2783191 12.34574 0.42264663 9.122306
## 0.2 0.4230203 12.34562 0.42264627 9.122341
## 0.2 0.6429533 12.07901 0.42977104 8.930584
## 0.2 0.9772319 11.72144 0.44294772 8.731360
## 0.2 1.4853056 11.40838 0.46104985 8.490328
## 0.2 2.2575324 11.27157 0.47207371 8.328463
## 0.2 3.4312485 11.32997 0.47063575 8.246667
## 0.2 5.2151926 11.36875 0.47530186 8.212969
## 0.2 7.9266290 11.61143 0.46937108 8.576495
## 0.2 12.0477713 11.99641 0.46289213 9.056244
## 0.3 0.2783191 12.49341 0.41630164 9.216367
## 0.3 0.4230203 12.21509 0.42473181 9.018874
## 0.3 0.6429533 11.85397 0.43630431 8.828103
## 0.3 0.9772319 11.50740 0.45468891 8.576213
## 0.3 1.4853056 11.33855 0.46637062 8.384764
## 0.3 2.2575324 11.39280 0.46486756 8.313494
## 0.3 3.4312485 11.39822 0.47087599 8.210771
## 0.3 5.2151926 11.59161 0.46702925 8.510895
## 0.3 7.9266290 11.93537 0.46126369 8.953082
## 0.3 12.0477713 12.42909 0.47268327 9.530473
## 0.4 0.2783191 12.41296 0.41759050 9.149863
## 0.4 0.4230203 12.04579 0.42868928 8.924948
## 0.4 0.6429533 11.66044 0.44605758 8.712287
## 0.4 0.9772319 11.38083 0.46350338 8.436131
## 0.4 1.4853056 11.41404 0.46170215 8.388107
## 0.4 2.2575324 11.41948 0.46615090 8.236604
## 0.4 3.4312485 11.50765 0.46743676 8.351968
## 0.4 5.2151926 11.80840 0.45927158 8.775791
## 0.4 7.9266290 12.18510 0.46946808 9.219501
## 0.4 12.0477713 13.02895 0.46415153 10.160891
## 0.5 0.2783191 12.28818 0.42062737 9.062548
## 0.5 0.4230203 11.89701 0.43387460 8.879482
## 0.5 0.6429533 11.50205 0.45556278 8.575951
## 0.5 0.9772319 11.38868 0.46255794 8.420177
## 0.5 1.4853056 11.46598 0.45948318 8.344318
## 0.5 2.2575324 11.43307 0.46830419 8.220860
## 0.5 3.4312485 11.65259 0.46154454 8.546417
## 0.5 5.2151926 11.96865 0.46015346 8.944097
## 0.5 7.9266290 12.50502 0.47355113 9.591350
## 0.5 12.0477713 13.72681 0.43199257 10.815593
## 0.6 0.2783191 12.16024 0.42374138 8.995089
## 0.6 0.4230203 11.74000 0.44181906 8.782382
## 0.6 0.6429533 11.41948 0.46143069 8.472898
## 0.6 0.9772319 11.46331 0.45739332 8.440642
## 0.6 1.4853056 11.45922 0.46183672 8.262820
## 0.6 2.2575324 11.49920 0.46608878 8.307652
## 0.6 3.4312485 11.79330 0.45489851 8.712912
## 0.6 5.2151926 12.12776 0.46567427 9.123451
## 0.6 7.9266290 12.93280 0.46324591 10.044428
## 0.6 12.0477713 14.28041 0.43765217 11.295744
## 0.7 0.2783191 12.04313 0.42733450 8.968411
## 0.7 0.4230203 11.60620 0.44984417 8.689380
## 0.7 0.6429533 11.41479 0.46114723 8.452001
## 0.7 0.9772319 11.51516 0.45447892 8.430018
## 0.7 1.4853056 11.44221 0.46543987 8.212444
## 0.7 2.2575324 11.59458 0.46145854 8.430835
## 0.7 3.4312485 11.91347 0.45157070 8.841445
## 0.7 5.2151926 12.31495 0.46999387 9.335206
## 0.7 7.9266290 13.42332 0.43942653 10.523087
## 0.7 12.0477713 14.90536 0.42740522 11.774879
## 0.8 0.2783191 11.93756 0.43106414 8.927062
## 0.8 0.4230203 11.49759 0.45670457 8.584301
## 0.8 0.6429533 11.43449 0.45903046 8.448151
## 0.8 0.9772319 11.50150 0.45641815 8.352344
## 0.8 1.4853056 11.44929 0.46676136 8.222501
## 0.8 2.2575324 11.70115 0.45489347 8.562004
## 0.8 3.4312485 12.00750 0.45444116 8.943730
## 0.8 5.2151926 12.56650 0.46672630 9.625228
## 0.8 7.9266290 13.84727 0.42612767 10.926880
## 0.8 12.0477713 15.19926 0.05738661 12.001256
## 0.9 0.2783191 11.83238 0.43662032 8.868193
## 0.9 0.4230203 11.45595 0.45968028 8.534099
## 0.9 0.6429533 11.48665 0.45506410 8.463426
## 0.9 0.9772319 11.49680 0.45802113 8.300999
## 0.9 1.4853056 11.50337 0.46424358 8.284490
## 0.9 2.2575324 11.80747 0.44755547 8.682974
## 0.9 3.4312485 12.10371 0.45817949 9.055609
## 0.9 5.2151926 12.86725 0.45694995 9.949599
## 0.9 7.9266290 14.20735 0.43546606 11.236527
## 0.9 12.0477713 15.20393 NaN 12.006556
## 1.0 0.2783191 11.76534 0.44012808 8.844369
## 1.0 0.4230203 11.47148 0.45798980 8.527689
## 1.0 0.6429533 11.58633 0.44817908 8.529920
## 1.0 0.9772319 11.51223 0.45812286 8.296667
## 1.0 1.4853056 11.58006 0.45951265 8.376757
## 1.0 2.2575324 11.89895 0.44254509 8.780974
## 1.0 3.4312485 12.21572 0.46097897 9.190557
## 1.0 5.2151926 13.19874 0.44028709 10.286147
## 1.0 7.9266290 14.61519 0.44110906 11.557657
## 1.0 12.0477713 15.20393 NaN 12.006556
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 3.431248.
svmModel_perm <- train(xTrain2, yTrain2,
method = "svmRadial",
tuneLength = 10,
trControl = ctrl)
print(svmModel_perm)
## Support Vector Machines with Radial Basis Function Kernel
##
## 133 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 120, 119, 119, 118, 120, 121, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 12.79518 0.4785723 8.560556
## 0.50 12.18757 0.4791612 8.305070
## 1.00 12.02112 0.4832495 8.337828
## 2.00 11.57092 0.4959173 8.302288
## 4.00 11.09229 0.5081174 8.097883
## 8.00 10.74225 0.5139127 7.792229
## 16.00 10.89714 0.5036871 7.887486
## 32.00 10.91031 0.5081135 7.863876
## 64.00 10.91897 0.5095263 7.857432
## 128.00 10.91897 0.5095263 7.857432
##
## Tuning parameter 'sigma' was held constant at a value of 0.002002188
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.002002188 and C = 8.
rfModel_perm <- train(xTrain2, yTrain2,
method = "rf",
tuneLength = 5,
trControl = ctrl)
print(rfModel_perm)
## Random Forest
##
## 133 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 120, 119, 119, 119, 120, 121, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 11.98238 0.4925930 9.121663
## 98 10.54305 0.5518491 7.614558
## 195 10.58876 0.5344975 7.632992
## 291 10.64806 0.5273525 7.652785
## 388 10.71126 0.5170188 7.725487
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 98.
summary(resamples(list(
PLS = plsModel_perm,
ENET = enetModel_perm,
SVM = svmModel_perm,
RF = rfModel_perm
)))
##
## Call:
## summary.resamples(object = resamples(list(PLS = plsModel_perm, ENET
## = enetModel_perm, SVM = svmModel_perm, RF = rfModel_perm)))
##
## Models: PLS, ENET, SVM, RF
## Number of resamples: 30
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 5.904825 7.489775 9.092406 8.843947 10.111226 12.69210 0
## ENET 5.933777 7.309821 8.176713 8.263376 9.192128 11.47369 0
## SVM 4.458559 6.398946 7.424103 7.792229 8.762227 13.20407 0
## RF 4.881802 6.028758 7.240696 7.614558 8.946272 13.00423 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 7.223958 9.860628 11.70589 11.36482 13.00202 16.23431 0
## ENET 7.665538 9.636457 11.05676 11.15798 11.98735 15.69767 0
## SVM 5.810461 8.970648 10.47891 10.74225 12.65723 17.31522 0
## RF 6.425133 8.566387 10.53833 10.54305 12.00286 16.48250 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.13500502 0.3831548 0.4992627 0.4914225 0.6214863 0.8474081 0
## ENET 0.11035463 0.3351115 0.4770298 0.4806997 0.6211014 0.8347710 0
## SVM 0.01510229 0.3307317 0.4849951 0.5139127 0.7164793 0.9323679 0
## RF 0.06455571 0.3866247 0.5934978 0.5518491 0.7288998 0.8613324 0
An elastic net regression model was also trained as an alternative modeling approach. The optimal tuning parameters selected through cross-validation were alpha = 0.1 and lambda = 3.4312.
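To interpret the two tuning parameters, it helps to write out the objective that glmnet minimizes for a Gaussian response, as given in the glmnet documentation. Here N is the number of training samples, \beta_0 the intercept, and \beta the coefficient vector:

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\;
\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]
```

With alpha = 0.1 the bracketed term is 0.45 times the squared L2 norm plus 0.1 times the L1 norm, so the selected model behaves much more like ridge regression than like the lasso, while lambda sets the overall strength of the shrinkage.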
Although the elastic net model performed reasonably well (mean cross-validated RMSE of about 11.16, compared with 11.36 for PLS), the difference is small relative to the spread across resamples, so it did not clearly outperform the PLS model. Support vector machine and random forest models were also evaluated; they lowered the mean cross-validated RMSE to about 10.74 and 10.54, respectively, which is a modest improvement.
Overall, PLS remained one of the most appropriate linear models for this dataset because it handles many highly correlated predictors well, which is the situation presented by the fingerprint predictors.
No, I would not recommend replacing the permeability laboratory experiment with the predictive models. Although the models show moderate predictive ability, their errors are still relatively high and they only explain about half of the variability in permeability. The models could be useful as a screening tool to estimate permeability, but laboratory experiments are still needed for accurate measurements.
nnetModel_perm <- train(
xTrain2[,1:50], yTrain2,
method = "nnet",
preProcess = c("center","scale"),
tuneLength = 5,
trControl = ctrl,
linout = TRUE,
trace = FALSE,
MaxNWts = 10000
)
print(nnetModel_perm)
## Neural Network
##
## 133 samples
## 50 predictor
##
## Pre-processing: centered (50), scaled (50)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 119, 120, 120, 121, 119, 120, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0e+00 13.39633 0.3172666 9.614060
## 1 1e-04 13.53469 0.2703603 9.923776
## 1 1e-03 13.66557 0.2586329 9.736962
## 1 1e-02 14.05278 0.2675524 9.808760
## 1 1e-01 13.16533 0.3488161 8.839840
## 3 0e+00 12.55551 0.4077590 8.574861
## 3 1e-04 13.06205 0.3946204 8.916273
## 3 1e-03 12.46313 0.4217854 8.471527
## 3 1e-02 12.04062 0.4390640 8.244009
## 3 1e-01 11.50126 0.4786743 7.926274
## 5 0e+00 12.54714 0.4176760 8.222149
## 5 1e-04 11.73660 0.4504477 8.012825
## 5 1e-03 12.49175 0.4073186 8.325884
## 5 1e-02 11.75191 0.4666561 7.984996
## 5 1e-01 12.14790 0.4466491 8.252431
## 7 0e+00 12.63274 0.4312654 8.154874
## 7 1e-04 11.76814 0.4705318 7.793298
## 7 1e-03 12.10087 0.4540046 7.951134
## 7 1e-02 12.42192 0.4542463 8.008970
## 7 1e-01 12.05247 0.4609740 7.861290
## 9 0e+00 11.34506 0.5186476 7.409169
## 9 1e-04 12.07327 0.4655568 7.964807
## 9 1e-03 11.70273 0.4812974 7.668151
## 9 1e-02 12.63742 0.4433036 8.179783
## 9 1e-01 11.65088 0.4956935 7.651756
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 9 and decay = 0.
marsModel_perm <- train(xTrain2, yTrain2,
method = "earth",
tuneLength = 10,
trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
print(marsModel_perm)
## Multivariate Adaptive Regression Spline
##
## 133 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 121, 120, 119, 119, 118, 119, ...
## Resampling results across tuning parameters:
##
## nprune RMSE Rsquared MAE
## 2 12.21259 0.4474004 8.974198
## 7 12.60517 0.4112809 9.196422
## 12 13.79806 0.3999115 10.041943
## 17 13.85971 0.3772740 10.342860
## 22 14.69439 0.3531494 10.889372
## 27 14.72844 0.3535936 10.859103
## 32 14.72844 0.3535936 10.859103
## 37 14.72844 0.3535936 10.859103
## 42 14.72844 0.3535936 10.859103
## 47 14.72844 0.3535936 10.859103
##
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 2 and degree = 1.
knnModel_perm <- train(xTrain2, yTrain2,
method = "knn",
tuneLength = 20,
trControl = ctrl)
print(knnModel_perm)
## k-Nearest Neighbors
##
## 133 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 119, 119, 119, 120, 118, 121, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 12.33879 0.4123702 8.696996
## 7 12.47985 0.4013714 8.713302
## 9 12.45061 0.4061786 8.745097
## 11 12.47094 0.4047003 8.791097
## 13 12.53071 0.3891443 8.918170
## 15 12.49320 0.3858117 9.074485
## 17 12.30274 0.3954831 8.936527
## 19 12.38332 0.3882364 9.103898
## 21 12.64086 0.3652323 9.332445
## 23 12.92798 0.3427873 9.554527
## 25 13.10410 0.3342477 9.750695
## 27 13.20072 0.3301541 9.938453
## 29 13.25099 0.3360636 10.043345
## 31 13.31140 0.3406705 10.084077
## 33 13.35559 0.3455751 10.109728
## 35 13.39580 0.3459777 10.132634
## 37 13.44403 0.3472873 10.149274
## 39 13.52954 0.3377203 10.191125
## 41 13.65597 0.3259436 10.273314
## 43 13.77959 0.3081548 10.365268
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
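Among the nonlinear models, the random forest produced the best overall resampling performance, with the lowest mean RMSE (about 10.54) and the highest mean R² (about 0.55). The support vector machine (SVM) was close behind (RMSE of about 10.74, R² of about 0.51), and the neural network, which was fit on only the first 50 predictors, had the lowest mean MAE (about 7.41) but a higher mean RMSE (about 11.35).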
summary(resamples(list(
SVM = svmModel_perm,
RF = rfModel_perm,
NN = nnetModel_perm,
MARS = marsModel_perm,
KNN = knnModel_perm
)))
##
## Call:
## summary.resamples(object = resamples(list(SVM = svmModel_perm, RF
## = rfModel_perm, NN = nnetModel_perm, MARS = marsModel_perm, KNN
## = knnModel_perm)))
##
## Models: SVM, RF, NN, MARS, KNN
## Number of resamples: 30
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM 4.458559 6.398946 7.424103 7.792229 8.762227 13.20407 0
## RF 4.881802 6.028758 7.240696 7.614558 8.946272 13.00423 0
## NN 3.765766 5.441152 7.089517 7.409169 9.033746 11.71748 0
## MARS 6.190312 8.031326 8.604318 8.974198 9.839319 13.47006 0
## KNN 6.607000 7.633063 8.787448 8.936527 9.815409 12.44485 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM 5.810461 8.970648 10.47891 10.74225 12.65723 17.31522 0
## RF 6.425133 8.566387 10.53833 10.54305 12.00286 16.48250 0
## NN 5.291322 8.549497 10.20685 11.34506 14.00131 19.01679 0
## MARS 6.720286 10.634415 12.00540 12.21259 13.98973 18.12959 0
## KNN 8.082813 10.442524 12.48860 12.30274 14.33742 16.95742 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM 1.510229e-02 0.3307317 0.4849951 0.5139127 0.7164793 0.9323679 0
## RF 6.455571e-02 0.3866247 0.5934978 0.5518491 0.7288998 0.8613324 0
## NN 7.402073e-04 0.3663694 0.5125473 0.5186476 0.7759943 0.8767135 0
## MARS 1.191082e-07 0.2502028 0.4199699 0.4474004 0.5785091 0.9393602 2
## KNN 4.924955e-06 0.2249503 0.3487428 0.3954831 0.5400027 0.9045544 0
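The "Summary of sample sizes" lines printed for the individual models differ, which suggests the models were not all resampled on identical folds. A paired comparison is only meaningful when they are, so the sketch below shows one way to share folds and test differences in caret. It runs on simulated data, and the names x, y, folds, ctrl_shared, lm_fit, and knn_fit are assumptions introduced for illustration rather than objects from the analysis above.

```r
library(caret)

# Simulated regression data, for illustration only
set.seed(1)
n <- 120
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- 2 * x$x1 + sin(3 * x$x2) + rnorm(n, sd = 0.5)

# One fixed set of training indices: 10 folds repeated 3 times
folds <- createMultiFolds(y, k = 10, times = 3)
ctrl_shared <- trainControl(method = "repeatedcv", number = 10,
                            repeats = 3, index = folds)

# Both models are resampled on exactly the same folds
lm_fit  <- train(x, y, method = "lm", trControl = ctrl_shared)
knn_fit <- train(x, y, method = "knn", tuneLength = 5,
                 trControl = ctrl_shared)

# Paired differences in RMSE, MAE, and R-squared across the 30 resamples
rs <- resamples(list(LM = lm_fit, KNN = knn_fit))
summary(diff(rs))
```

Applied to the permeability models, passing the same index list to every train() call would make it possible to judge whether gaps such as 10.54 versus 10.74 in mean RMSE are larger than the resampling noise.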
bwplot(resamples(list(
SVM = svmModel_perm,
RF = rfModel_perm,
NN = nnetModel_perm,
MARS = marsModel_perm,
KNN = knnModel_perm
)))
pred_nn_perm <- predict(nnetModel_perm, xTest2[, 1:50])
pred_mars_perm <- predict(marsModel_perm, xTest2)
pred_knn_perm <- predict(knnModel_perm, xTest2)
postResample(pred_nn_perm, yTest2)
## RMSE Rsquared MAE
## 13.3902014 0.5487011 9.2164802
postResample(pred_mars_perm, yTest2)
## RMSE Rsquared MAE
## 11.371257 0.537835 8.512762
postResample(pred_knn_perm, yTest2)
## RMSE Rsquared MAE
## 11.8192992 0.4975314 8.3279898
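Yes, some nonlinear models performed better than the optimal linear model. In resampling, the random forest (mean RMSE of about 10.54) and the SVM (about 10.74) both improved on the best linear model, the elastic net (about 11.16), whereas MARS and KNN did not. This suggests that the relationship between the predictors and permeability is at least partly nonlinear, and that more flexible models such as SVM or random forests can capture these patterns better than linear models.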
No, the models should not replace the permeability laboratory experiment. While some models provide reasonable predictions, their accuracy is not high enough to fully replace experimental measurements. They are better used as supporting tools to estimate permeability before conducting lab tests.