Infrared (IR) spectroscopy technology is used to determine the chemical makeup of a substance. The theory of IR spectroscopy holds that unique molecular structures absorb IR frequencies differently. In practice a spectrometer fires a series of IR frequencies into a sample material, and the device measures the absorbance of the sample at each individual frequency. This series of measurements creates a spectrum profile which can then be used to determine the chemical makeup of the sample material. A Tecator Infratec Food and Feed Analyzer instrument was used to analyze 215 samples of meat across 100 frequencies. A sample of these frequency profiles is displayed in Fig. 6.20. In addition to an IR profile, analytical chemistry determined the percent content of water, fat, and protein for each sample. If we can establish a predictive relationship between IR spectrum and fat content, then food scientists could predict a sample’s fat content with IR instead of using analytical chemistry. This would provide cost savings, since analytical chemistry is a more expensive, time-consuming process.

library(caret)
## Warning: package 'caret' was built under R version 4.5.2
## Loading required package: ggplot2
## Loading required package: lattice
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.5.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.5.2
## Loading required package: Matrix
## Loaded glmnet 4.1-10
data(tecator)



str(absorp)
##  num [1:215, 1:100] 2.62 2.83 2.58 2.82 2.79 ...
str(endpoints)
##  num [1:215, 1:3] 60.5 46 71 72.8 58.3 44 44 69.3 61.4 61.4 ...

The matrix absorp contains the 100 absorbance values for the 215 samples, while the matrix endpoints contains the percent of moisture, fat, and protein in columns 1–3, respectively. To be more specific:

moisture <- endpoints[, 1]
fat      <- endpoints[, 2]
protein <- endpoints[, 3]
  1. Split the data into a training and a test set using the percentage of protein as the response, and pre-process the data as appropriate.
set.seed(123)

trainIndex <- createDataPartition(protein, p = 0.8, list = FALSE)

xTrain <- absorp[trainIndex, ]
xTest  <- absorp[-trainIndex, ]

yTrain <- protein[trainIndex]
yTest  <- protein[-trainIndex]

xTrain <- as.data.frame(xTrain)
xTest  <- as.data.frame(xTest)


colnames(xTrain) <- paste0("X", 1:ncol(xTrain))
colnames(xTest)  <- paste0("X", 1:ncol(xTest))
preProc <- preProcess(xTrain, method = c("center", "scale"))

xTrainPP <- predict(preProc, xTrain)
xTestPP  <- predict(preProc, xTest)
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 3)
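
The centering and scaling step above estimates its constants from the training rows and reuses them on the test rows. A minimal base R sketch of that idea follows; it uses simulated data rather than the tecator matrices, and the object names (x_train_pp, x_test_pp) are illustrative only.

```r
# Simulated stand-in for the absorbance matrix (hypothetical data, not tecator)
set.seed(1)
x <- matrix(rnorm(200 * 5, mean = 3, sd = 0.4), ncol = 5)
train_rows <- 1:160
x_train <- x[train_rows, ]
x_test  <- x[-train_rows, ]

# Estimate centering and scaling constants from the training rows only
train_mean <- colMeans(x_train)
train_sd   <- apply(x_train, 2, sd)

# Apply the same constants to both sets
x_train_pp <- sweep(sweep(x_train, 2, train_mean, "-"), 2, train_sd, "/")
x_test_pp  <- sweep(sweep(x_test,  2, train_mean, "-"), 2, train_sd, "/")

# Training columns are exactly standardized; test columns only approximately
round(colMeans(x_train_pp), 10)
round(apply(x_train_pp, 2, sd), 10)
```

Reusing the training constants keeps information from the test rows out of the pre-processing step.
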
  1. Build at least three models described in Chapter 6: ordinary least squares, PCR, PLS, Ridge, and ENET. For those models with tuning parameters, what are the optimal values of the tuning parameter(s)?
olsModel <- train(xTrainPP, yTrain,
                  method = "lm",
                  trControl = ctrl)
print(olsModel)
## Linear Regression 
## 
## 174 samples
## 100 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 157, 155, 157, 155, 158, 156, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   1.117079  0.8769285  0.751686
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
pcrModel <- train(xTrainPP, yTrain,
                  method = "pcr",
                  tuneLength = 20,
                  trControl = ctrl)
print(pcrModel)
## Principal Component Analysis 
## 
## 174 samples
## 100 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 156, 155, 156, 157, 156, 157, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     2.9901392  0.1036154  2.5829502
##    2     2.6691267  0.2636436  2.1897169
##    3     2.2708709  0.4765607  1.8071305
##    4     1.7845684  0.6750643  1.3552680
##    5     1.3456269  0.8117595  1.0584717
##    6     1.1356199  0.8671517  0.9219539
##    7     1.1540602  0.8618650  0.9298584
##    8     1.1574524  0.8610975  0.9349763
##    9     1.1300358  0.8697604  0.9221245
##   10     1.0088042  0.8970313  0.8196008
##   11     0.9444093  0.9092578  0.7717222
##   12     0.8999303  0.9187049  0.7295890
##   13     0.7582463  0.9429959  0.5911553
##   14     0.7123571  0.9494063  0.5542515
##   15     0.6720993  0.9553850  0.5293108
##   16     0.6607770  0.9570576  0.5229481
##   17     0.6518348  0.9586442  0.5172783
##   18     0.6625307  0.9571303  0.5202673
##   19     0.6198191  0.9622897  0.4872179
##   20     0.6029205  0.9643313  0.4779011
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 20.
pcrModel$bestTune
##    ncomp
## 20    20
plsModel <- train(xTrainPP, yTrain,
                  method = "pls",
                  tuneLength = 20,
                  trControl = ctrl)
print(plsModel)
## Partial Least Squares 
## 
## 174 samples
## 100 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 156, 156, 157, 157, 156, 157, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared    MAE      
##    1     2.9801533  0.08889222  2.5674248
##    2     2.3178640  0.43247165  1.8619224
##    3     1.8290788  0.64493409  1.3647547
##    4     1.6667105  0.70258508  1.2919152
##    5     1.1992338  0.86278034  0.9633271
##    6     1.1442514  0.87814819  0.9220121
##    7     1.0961146  0.88499669  0.8738019
##    8     0.9656962  0.91140080  0.7832121
##    9     0.9310960  0.91841989  0.7502324
##   10     0.8775449  0.92854621  0.7103302
##   11     0.7930296  0.93836259  0.6336363
##   12     0.7150815  0.95260212  0.5568171
##   13     0.6921816  0.95566411  0.5385290
##   14     0.6263154  0.96152533  0.4932441
##   15     0.6266531  0.96076045  0.4919418
##   16     0.6373419  0.95899020  0.4946603
##   17     0.6525812  0.95736991  0.5033797
##   18     0.6756133  0.95426585  0.5108220
##   19     0.6820005  0.95329483  0.5129135
##   20     0.7095159  0.94959341  0.5158419
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 14.
plsModel$bestTune
##    ncomp
## 14    14
ridgeModel <- train(xTrainPP, yTrain,
                    method = "glmnet",
                    tuneGrid = expand.grid(alpha = 0,
                                           lambda = seq(0.0001, 1, length = 50)),
                    trControl = ctrl)
print(ridgeModel)
## glmnet 
## 
## 174 samples
## 100 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 157, 157, 158, 156, 156, 156, ... 
## Resampling results across tuning parameters:
## 
##   lambda      RMSE      Rsquared   MAE     
##   0.00010000  1.670741  0.7123142  1.298334
##   0.02050612  1.670741  0.7123142  1.298334
##   0.04091224  1.670741  0.7123142  1.298334
##   0.06131837  1.670741  0.7123142  1.298334
##   0.08172449  1.676051  0.7110949  1.302565
##   0.10213061  1.693497  0.7066988  1.315875
##   0.12253673  1.705256  0.7038702  1.324356
##   0.14294286  1.729200  0.6974846  1.346251
##   0.16334898  1.755243  0.6908104  1.370825
##   0.18375510  1.779263  0.6835301  1.394733
##   0.20416122  1.801308  0.6779394  1.415595
##   0.22456735  1.819189  0.6730506  1.433781
##   0.24497347  1.839509  0.6676355  1.454712
##   0.26537959  1.857449  0.6635130  1.473034
##   0.28578571  1.875538  0.6594865  1.491105
##   0.30619184  1.891545  0.6547795  1.509346
##   0.32659796  1.906108  0.6511605  1.524559
##   0.34700408  1.920522  0.6473683  1.538356
##   0.36741020  1.936688  0.6435648  1.554327
##   0.38781633  1.949469  0.6405395  1.568929
##   0.40822245  1.963642  0.6366942  1.584465
##   0.42862857  1.976224  0.6338766  1.597682
##   0.44903469  1.986125  0.6314804  1.608238
##   0.46944082  1.997441  0.6284177  1.619835
##   0.48984694  2.010669  0.6249500  1.632881
##   0.51025306  2.023242  0.6220365  1.646143
##   0.53065918  2.033580  0.6199770  1.658203
##   0.55106531  2.043620  0.6173507  1.669004
##   0.57147143  2.054329  0.6142015  1.679866
##   0.59187755  2.064578  0.6109072  1.689768
##   0.61228367  2.074446  0.6080548  1.700059
##   0.63268980  2.084494  0.6052765  1.710461
##   0.65309592  2.094211  0.6027231  1.720527
##   0.67350204  2.102722  0.6005444  1.729045
##   0.69390816  2.110486  0.5986974  1.737237
##   0.71431429  2.118323  0.5965717  1.745146
##   0.73472041  2.126428  0.5943324  1.753284
##   0.75512653  2.134772  0.5918797  1.761372
##   0.77553265  2.142347  0.5898147  1.768969
##   0.79593878  2.149812  0.5879033  1.776636
##   0.81634490  2.157009  0.5858597  1.783989
##   0.83675102  2.163976  0.5839473  1.791159
##   0.85715714  2.171057  0.5818797  1.798193
##   0.87756327  2.178073  0.5797669  1.805059
##   0.89796939  2.184958  0.5778681  1.811910
##   0.91837551  2.191765  0.5759173  1.818693
##   0.93878163  2.198196  0.5741264  1.825279
##   0.95918776  2.204465  0.5724401  1.831746
##   0.97959388  2.210787  0.5704820  1.838163
##   1.00000000  2.216855  0.5686198  1.844358
## 
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 0.06131837.
ridgeModel$bestTune
##   alpha     lambda
## 4     0 0.06131837
enetModel <- train(xTrainPP, yTrain,
                   method = "glmnet",
                   tuneLength = 10,
                   trControl = ctrl)
## Warning: from glmnet C++ code (error code -91); Convergence for 91th lambda
## value not reached after maxit=100000 iterations; solutions for larger lambdas
## returned
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
print(enetModel)
## glmnet 
## 
## 174 samples
## 100 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 157, 156, 156, 156, 157, 157, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE      Rsquared    MAE      
##   0.1    0.0008688831  1.132864  0.87263625  0.9257707
##   0.1    0.0020072326  1.149729  0.86962454  0.9390245
##   0.1    0.0046369677  1.195122  0.85999787  0.9680117
##   0.1    0.0107119968  1.288128  0.83680503  1.0247196
##   0.1    0.0247461020  1.436767  0.79486168  1.1260556
##   0.1    0.0571667052  1.610740  0.74397941  1.2428241
##   0.1    0.1320625036  1.764029  0.70961406  1.3802521
##   0.1    0.3050815119  1.994094  0.66606446  1.6230704
##   0.1    0.7047778616  2.360095  0.56797876  1.9995968
##   0.2    0.0008688831  1.139212  0.87141319  0.9273377
##   0.2    0.0020072326  1.157161  0.86821204  0.9428813
##   0.2    0.0046369677  1.200661  0.85875671  0.9705810
##   0.2    0.0107119968  1.297409  0.83424687  1.0302459
##   0.2    0.0247461020  1.464836  0.78600971  1.1439402
##   0.2    0.0571667052  1.645049  0.73465119  1.2681371
##   0.2    0.1320625036  1.804141  0.70674878  1.4247617
##   0.2    0.3050815119  2.110947  0.65428922  1.7559443
##   0.2    0.7047778616  2.638272  0.41684848  2.2606890
##   0.3    0.0008688831  1.153743  0.86934307  0.9405393
##   0.3    0.0020072326  1.159689  0.86776962  0.9456368
##   0.3    0.0046369677  1.203003  0.85805146  0.9717054
##   0.3    0.0107119968  1.305288  0.83194466  1.0351921
##   0.3    0.0247461020  1.491737  0.77761545  1.1614205
##   0.3    0.0571667052  1.663307  0.73156985  1.2834328
##   0.3    0.1320625036  1.847669  0.70408041  1.4716676
##   0.3    0.3050815119  2.242870  0.63251195  1.8954295
##   0.3    0.7047778616  2.912845  0.12845160  2.5129449
##   0.4    0.0008688831  1.158099  0.86739159  0.9390232
##   0.4    0.0020072326  1.161862  0.86681112  0.9436677
##   0.4    0.0046369677  1.206700  0.85701481  0.9735639
##   0.4    0.0107119968  1.315788  0.82896624  1.0417616
##   0.4    0.0247461020  1.521496  0.76814782  1.1810494
##   0.4    0.0571667052  1.679311  0.72930699  1.2977282
##   0.4    0.1320625036  1.893167  0.70164851  1.5261134
##   0.4    0.3050815119  2.387832  0.58980034  2.0354402
##   0.4    0.7047778616  2.956442  0.09118037  2.5552920
##   0.5    0.0008688831  1.144923  0.86980627  0.9303323
##   0.5    0.0020072326  1.154589  0.86858836  0.9399166
##   0.5    0.0046369677  1.209476  0.85638926  0.9745527
##   0.5    0.0107119968  1.324696  0.82635833  1.0481666
##   0.5    0.0247461020  1.548273  0.75959363  1.1989859
##   0.5    0.0571667052  1.693483  0.72785705  1.3118398
##   0.5    0.1320625036  1.940852  0.69886090  1.5829894
##   0.5    0.3050815119  2.543746  0.50634209  2.1806695
##   0.5    0.7047778616  2.963903  0.09068894  2.5629450
##   0.6    0.0008688831  1.136844  0.87174602  0.9257628
##   0.6    0.0020072326  1.153510  0.86867056  0.9395491
##   0.6    0.0046369677  1.210837  0.85607561  0.9757516
##   0.6    0.0107119968  1.335082  0.82320909  1.0554429
##   0.6    0.0247461020  1.572266  0.75212665  1.2150271
##   0.6    0.0571667052  1.707107  0.72690708  1.3261399
##   0.6    0.1320625036  1.991203  0.69536370  1.6408638
##   0.6    0.3050815119  2.708592  0.35581848  2.3301357
##   0.6    0.7047778616  2.972790  0.09025289  2.5706559
##   0.7    0.0008688831  1.127760  0.87378445  0.9177281
##   0.7    0.0020072326  1.152911  0.86875088  0.9388379
##   0.7    0.0046369677  1.211202  0.85598019  0.9753019
##   0.7    0.0107119968  1.343669  0.82065529  1.0612765
##   0.7    0.0247461020  1.589122  0.74665735  1.2259516
##   0.7    0.0571667052  1.718704  0.72661236  1.3389614
##   0.7    0.1320625036  2.043558  0.69084719  1.6995049
##   0.7    0.3050815119  2.870991  0.17068567  2.4770313
##   0.7    0.7047778616  2.982939  0.09001173  2.5788824
##   0.8    0.0008688831  1.133870  0.87224764  0.9235944
##   0.8    0.0020072326  1.146987  0.86986029  0.9347434
##   0.8    0.0046369677  1.208530  0.85629021  0.9734144
##   0.8    0.0107119968  1.352576  0.81765078  1.0672421
##   0.8    0.0247461020  1.599600  0.74374380  1.2336469
##   0.8    0.0571667052  1.729804  0.72650260  1.3519093
##   0.8    0.1320625036  2.097844  0.68478957  1.7576029
##   0.8    0.3050815119  2.950960  0.09312074  2.5495587
##   0.8    0.7047778616  2.994423  0.08994152  2.5880428
##   0.9    0.0008688831  1.121002  0.87520629  0.9106409
##   0.9    0.0020072326  1.142602  0.87106683  0.9298709
##   0.9    0.0046369677  1.205311  0.85704195  0.9709901
##   0.9    0.0107119968  1.358564  0.81540367  1.0707640
##   0.9    0.0247461020  1.602853  0.74312953  1.2360477
##   0.9    0.0571667052  1.738844  0.72666307  1.3632705
##   0.9    0.1320625036  2.149982  0.67803272  1.8123374
##   0.9    0.3050815119  2.955401  0.09105170  2.5548886
##   0.9    0.7047778616  3.006879  0.07700828  2.5972123
##   1.0    0.0008688831  1.131704  0.87424192  0.9192826
##   1.0    0.0020072326  1.137802  0.87257445  0.9244939
##   1.0    0.0046369677  1.183879  0.86216177  0.9536058
##   1.0    0.0107119968  1.353357  0.81613436  1.0657189
##   1.0    0.0247461020  1.608448  0.74067224  1.2393797
##   1.0    0.0571667052  1.742688  0.72539100  1.3655449
##   1.0    0.1320625036  2.185428  0.67162491  1.8485802
##   1.0    0.3050815119  2.957913  0.09103786  2.5582106
##   1.0    0.7047778616  3.018973  0.06858504  2.6046184
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.9 and lambda = 0.0008688831.
enetModel$bestTune
##    alpha       lambda
## 73   0.9 0.0008688831

Several linear and regularized regression models were trained using repeated 10-fold cross-validation in order to determine the optimal tuning parameters and compare model performance.

For the ordinary least squares (OLS) model, there are no tuning parameters. The cross-validated performance of the OLS model produced an RMSE of approximately 1.117 and an R2 of 0.877, indicating that the model explains about 87.7% of the variation in the response variable.

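For principal components regression (PCR), the tuning parameter is the number of principal components used in the model. Cross-validation selected 20 components, which produced the lowest RMSE (about 0.603) among the candidate PCR models. Because 20 is the largest value in the grid (tuneLength = 20) and the RMSE was still decreasing there, a wider grid might select an even larger number of components.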
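
The two steps behind PCR (principal components first, then least squares on the leading scores) can be sketched in base R. This is an illustrative sketch on simulated, highly correlated predictors, not the caret "pcr" fit used above; the helper pcr_fit is a hypothetical name.

```r
set.seed(2)
n <- 100
# Hypothetical predictors: 10 noisy copies of 2 latent signals (highly correlated)
z <- matrix(rnorm(n * 2), ncol = 2)
x <- z[, rep(1:2, each = 5)] + matrix(rnorm(n * 10, sd = 0.1), ncol = 10)
y <- 2 * z[, 1] - z[, 2] + rnorm(n, sd = 0.2)

# Step 1: principal components of the centered, scaled predictors
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Step 2: ordinary least squares on the first k component scores
pcr_fit <- function(k) lm(y ~ pca$x[, 1:k, drop = FALSE])

# In-sample RMSE for k = 1..5; cross-validation would be used to pick k in practice
rmse_by_k <- sapply(1:5, function(k) sqrt(mean(residuals(pcr_fit(k))^2)))
rmse_by_k
```

In-sample error can only go down as components are added, which is why the number of components is chosen by resampling rather than by training error.
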

For partial least squares (PLS) regression, the tuning parameter is also the number of components. The optimal value selected through cross-validation was 14 components, which minimized the cross-validated RMSE (about 0.626).

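For ridge regression, the tuning parameters are alpha and lambda, where alpha is fixed at 0 to indicate ridge regression. The reported optimal value of lambda was approximately 0.0613, with a cross-validated RMSE of about 1.671. Note that the four smallest lambda values in the grid (0.0001 through 0.0613) gave identical resampling results, so the selection among them is a tie; this suggests that these values lie below the smallest penalty on the path glmnet actually fitted.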
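
The shrinkage effect of the ridge penalty can be seen directly from its closed-form solution. The sketch below uses simulated data and a plain (X'X + lambda I) formulation; glmnet parameterizes its penalty differently, so the lambda values here are not comparable with those in the output above. The function ridge_coef is a hypothetical helper.

```r
set.seed(3)
n <- 50; p <- 8
x <- scale(matrix(rnorm(n * p), ncol = p))   # standardized hypothetical predictors
y <- as.numeric(x %*% c(3, -2, rep(0, p - 2)) + rnorm(n))
y <- y - mean(y)                              # centered response, so no intercept is needed

# Ridge estimate: solve (X'X + lambda * I) beta = X'y
ridge_coef <- function(lambda) {
  solve(crossprod(x) + lambda * diag(p), crossprod(x, y))
}

# The coefficient norm shrinks as the penalty grows
lambdas <- c(0, 1, 10, 100)
coef_norm <- sapply(lambdas, function(l) sqrt(sum(ridge_coef(l)^2)))
coef_norm
```

With lambda = 0 the estimate coincides with ordinary least squares; larger values pull the coefficients toward zero.
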

For the elastic net model, both alpha and lambda are tuned. The optimal parameter values were alpha = 0.9 and lambda ≈ 0.0008689, with a cross-validated RMSE of about 1.121. This lambda is the smallest value in the tuning grid, so the best elastic net model applies very little penalty.

Overall, among the linear models, PCR and PLS performed the best, with cross-validated RMSE of about 0.60 and 0.63, compared with about 1.12 for OLS and elastic net and about 1.67 for ridge regression.

  1. Build nonlinear models in Chapter 7: SVM, neural network, MARS, and KNN models. Since neural networks are especially sensitive to highly correlated predictors, does pre-processing using PCA help the model? For those models with tuning parameters, what are the optimal values of the tuning parameter(s)?
set.seed(123)

svmModel <- train(
  xTrain, yTrain,
  method = "svmRadial",
  preProcess = c("center", "scale"),
  tuneLength = 10,
  trControl = ctrl
)


svmModel$bestTune
##       sigma  C
## 7 0.1923092 16
set.seed(123)

nnetModel <- train(
  xTrain[,1:20], yTrain,
  method = "nnet",
  preProcess = c("center", "scale"),
  tuneLength = 3,
  trControl = ctrl,
  linout = TRUE,
  trace = FALSE,
  MaxNWts = 5000
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
print(nnetModel)
## Neural Network 
## 
## 174 samples
##  20 predictor
## 
## Pre-processing: centered (20), scaled (20) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 155, 156, 157, 158, 158, 156, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE      Rsquared   MAE      
##   1     0e+00  1.691188  0.7353854  1.3468138
##   1     1e-04  1.436758  0.8002830  1.1164499
##   1     1e-01  1.508905  0.7592012  1.2004840
##   3     0e+00  1.172000  0.8526632  0.9200893
##   3     1e-04  1.192154  0.8526737  0.9362190
##   3     1e-01  1.312175  0.8330825  1.0653557
##   5     0e+00  1.174654  0.8621791  0.9118245
##   5     1e-04  1.277993  0.8637518  0.9321798
##   5     1e-01  1.238331  0.8524738  1.0110993
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3 and decay = 0.
plot(nnetModel)

nnetModel$bestTune
##   size decay
## 4    3     0
set.seed(123)

nnetModelPCA <- train(
  xTrain, yTrain,
  method = "nnet",
  preProcess = c("center", "scale", "pca"),
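  # Note: caret reads the PCA variance threshold from
  # trainControl(preProcOptions = list(thresh = ...)), whose default is 0.95;
  # an argument passed here is forwarded to the model function instead.
  thresh = 0.95,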
  tuneLength = 10,
  trControl = ctrl,
  linout = TRUE,
  trace = FALSE
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
plot(nnetModelPCA)

nnetModelPCA$bestTune
##    size decay
## 10    1   0.1
preProc_pca <- preProcess(xTrain,
                          method = c("center","scale","pca"),
                          thresh = 0.95)

xTrainPCA <- predict(preProc_pca, xTrain)
xTestPCA  <- predict(preProc_pca, xTest)

nnetModelPCA <- train(xTrainPCA, yTrain,
                      method = "nnet",
                      tuneLength = 10,
                      trControl = ctrl,
                      linout = TRUE,
                      trace = FALSE)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
print(nnetModelPCA)
## Neural Network 
## 
## 174 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 156, 156, 158, 156, 157, 157, ... 
## Resampling results across tuning parameters:
## 
##   size  decay         RMSE      Rsquared   MAE     
##    1    0.0000000000  2.939610  0.1490690  2.480775
##    1    0.0001000000  2.967434  0.1143311  2.515494
##    1    0.0002371374  2.961758  0.1170152  2.515184
##    1    0.0005623413  2.911365  0.1364376  2.462972
##    1    0.0013335214  2.943014  0.1176256  2.462667
##    1    0.0031622777  2.942867  0.1270070  2.467086
##    1    0.0074989421  2.831295  0.1613386  2.370223
##    1    0.0177827941  2.873677  0.1538602  2.380287
##    1    0.0421696503  2.732948  0.2183242  2.223266
##    1    0.1000000000  2.681266  0.2405818  2.177847
##    3    0.0000000000  2.876196  0.1863730  2.326274
##    3    0.0001000000  2.735734  0.2462361  2.224755
##    3    0.0002371374  2.863205  0.2012773  2.315444
##    3    0.0005623413  2.844877  0.1939910  2.295863
##    3    0.0013335214  2.808860  0.1965707  2.288847
##    3    0.0031622777  2.795907  0.2067317  2.270304
##    3    0.0074989421  2.736621  0.2352326  2.210460
##    3    0.0177827941  2.749625  0.2417794  2.224794
##    3    0.0421696503  2.753359  0.2201660  2.244387
##    3    0.1000000000  2.736227  0.2286963  2.257246
##    5    0.0000000000  3.044560  0.2245292  2.360498
##    5    0.0001000000  3.203075  0.1805473  2.429635
##    5    0.0002371374  3.133315  0.1979409  2.432573
##    5    0.0005623413  2.935513  0.1994660  2.350788
##    5    0.0013335214  2.840674  0.2304277  2.266794
##    5    0.0031622777  2.784213  0.2120429  2.254797
##    5    0.0074989421  2.769215  0.2316305  2.263464
##    5    0.0177827941  2.738943  0.2340686  2.227455
##    5    0.0421696503  2.823984  0.2008214  2.300417
##    5    0.1000000000  2.825713  0.1961081  2.291556
##    7    0.0000000000  4.174699  0.2098005  2.664306
##    7    0.0001000000  2.988379  0.2177363  2.325330
##    7    0.0002371374  3.110022  0.1943095  2.437304
##    7    0.0005623413  2.984143  0.1970117  2.330411
##    7    0.0013335214  2.985076  0.1942810  2.365482
##    7    0.0031622777  2.915052  0.1922104  2.362055
##    7    0.0074989421  2.864562  0.2180772  2.326982
##    7    0.0177827941  2.944994  0.1725525  2.375837
##    7    0.0421696503  2.840651  0.2119458  2.321386
##    7    0.1000000000  2.873263  0.1906863  2.307895
##    9    0.0000000000  3.116953  0.1910331  2.395715
##    9    0.0001000000  3.126657  0.1966819  2.389838
##    9    0.0002371374  3.127962  0.1779046  2.421191
##    9    0.0005623413  3.107145  0.1895626  2.430823
##    9    0.0013335214  3.176595  0.1737725  2.461648
##    9    0.0031622777  2.948226  0.1951105  2.346934
##    9    0.0074989421  3.024679  0.1954591  2.373059
##    9    0.0177827941  2.949504  0.1886555  2.359724
##    9    0.0421696503  2.824906  0.2105526  2.323047
##    9    0.1000000000  2.923371  0.1840563  2.367401
##   11    0.0000000000  3.655300  0.1672209  2.604659
##   11    0.0001000000  3.422088  0.2133081  2.468645
##   11    0.0002371374  3.446724  0.1685289  2.551577
##   11    0.0005623413  3.130085  0.2203829  2.406570
##   11    0.0013335214  3.203476  0.1984398  2.510455
##   11    0.0031622777  3.263917  0.1627587  2.531744
##   11    0.0074989421  2.839644  0.2501677  2.277531
##   11    0.0177827941  3.042395  0.1801750  2.421804
##   11    0.0421696503  2.928057  0.2026704  2.359920
##   11    0.1000000000  2.889287  0.2071551  2.303071
##   13    0.0000000000  3.527032  0.2137494  2.589298
##   13    0.0001000000  3.813028  0.1737313  2.689774
##   13    0.0002371374  3.671053  0.1818015  2.652719
##   13    0.0005623413  3.029418  0.2313069  2.413932
##   13    0.0013335214  3.433354  0.1785621  2.632108
##   13    0.0031622777  3.161889  0.1721130  2.450763
##   13    0.0074989421  3.098891  0.2202784  2.409686
##   13    0.0177827941  2.990834  0.2220976  2.338248
##   13    0.0421696503  3.029193  0.1874675  2.424691
##   13    0.1000000000  2.840790  0.2105632  2.296599
##   15    0.0000000000  3.573523  0.1686685  2.655748
##   15    0.0001000000  3.189311  0.2091657  2.431901
##   15    0.0002371374  3.417035  0.2012613  2.578170
##   15    0.0005623413  3.563977  0.1741389  2.639988
##   15    0.0013335214  3.486980  0.1624015  2.597678
##   15    0.0031622777  3.289315  0.1948260  2.475566
##   15    0.0074989421  3.251891  0.1777561  2.487665
##   15    0.0177827941  3.247078  0.1551035  2.549808
##   15    0.0421696503  2.999743  0.2295141  2.371116
##   15    0.1000000000  2.835981  0.2234702  2.287148
##   17    0.0000000000  3.893586  0.1674957  2.721335
##   17    0.0001000000  3.546441  0.1699010  2.674751
##   17    0.0002371374  3.664207  0.1981669  2.652771
##   17    0.0005623413  3.808506  0.1696660  2.740982
##   17    0.0013335214  3.254444  0.2272210  2.527033
##   17    0.0031622777  3.266977  0.1766668  2.539712
##   17    0.0074989421  3.184344  0.1937495  2.483705
##   17    0.0177827941  3.236112  0.2053791  2.437137
##   17    0.0421696503  2.952532  0.2277163  2.347316
##   17    0.1000000000  2.930079  0.2186223  2.336776
##   19    0.0000000000  3.836917  0.1826461  2.739651
##   19    0.0001000000  4.020703  0.1781642  2.809847
##   19    0.0002371374  3.839409  0.1866567  2.770110
##   19    0.0005623413  3.612012  0.1536347  2.721001
##   19    0.0013335214  3.701829  0.1862551  2.664894
##   19    0.0031622777  3.730880  0.1704762  2.660394
##   19    0.0074989421  3.111604  0.2322848  2.434617
##   19    0.0177827941  3.100717  0.2317718  2.404876
##   19    0.0421696503  2.956090  0.2487289  2.309050
##   19    0.1000000000  2.857762  0.2505672  2.278374
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.1.
marsModel <- train(xTrainPP, yTrain,
                   method = "earth",
                   tuneLength = 10,
                   trControl = ctrl)
## Loading required package: earth
## Warning: package 'earth' was built under R version 4.5.2
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.5.2
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 4.5.2
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
print(marsModel)
## Multivariate Adaptive Regression Spline 
## 
## 174 samples
## 100 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 157, 156, 157, 156, 157, 156, ... 
## Resampling results across tuning parameters:
## 
##   nprune  RMSE       Rsquared    MAE      
##    2      2.9810180  0.07580469  2.5512099
##    4      2.3421668  0.41094966  1.8466776
##    7      1.2086132  0.84426271  0.9918164
##   10      1.0010828  0.89799055  0.8304056
##   12      0.9621764  0.90339883  0.7732088
##   15      0.9466728  0.90726637  0.7545232
##   18      0.9397485  0.90927092  0.7416214
##   20      0.9877486  0.90143283  0.7585493
##   23      0.9886761  0.90138064  0.7606024
##   26      0.9887581  0.90131790  0.7613587
## 
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 18 and degree = 1.
marsModel$bestTune
##   nprune degree
## 7     18      1
knnModel <- train(xTrainPP, yTrain,
                  method = "knn",
                  tuneLength = 20,
                  trControl = ctrl)
print(knnModel)
## k-Nearest Neighbors 
## 
## 174 samples
## 100 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 155, 156, 157, 157, 157, 157, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  2.474668  0.3690866  2.057197
##    7  2.382824  0.4358612  2.022620
##    9  2.507344  0.3692912  2.136867
##   11  2.601757  0.3212522  2.223514
##   13  2.656564  0.2887736  2.258285
##   15  2.689449  0.2820429  2.296097
##   17  2.705965  0.2780392  2.315008
##   19  2.755093  0.2608397  2.355096
##   21  2.793495  0.2482882  2.387564
##   23  2.811914  0.2390692  2.402781
##   25  2.839308  0.2073361  2.419849
##   27  2.852970  0.1961659  2.433550
##   29  2.898042  0.1526846  2.470833
##   31  2.917017  0.1348000  2.483757
##   33  2.934026  0.1312955  2.503398
##   35  2.940928  0.1340529  2.510566
##   37  2.949119  0.1224412  2.522709
##   39  2.950559  0.1314187  2.532524
##   41  2.959944  0.1257058  2.546609
##   43  2.956398  0.1372955  2.544466
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
knnModel$bestTune
##   k
## 2 7

Several nonlinear models were also trained and tuned using cross-validation.

For the support vector machine (SVM) model, the optimal parameters were sigma = 0.1923 and C = 16. These parameters were selected because they minimized the cross-validated prediction error (mean RMSE of about 1.79).

For the neural network model without PCA preprocessing, which was trained on only the first 20 absorbance columns, the optimal tuning parameters were size = 3 and decay = 0. This configuration produced the best cross-validated performance (RMSE of about 1.17) among the neural network models tested without dimensionality reduction.

A second neural network model was trained using PCA preprocessing, where the predictors were reduced to two principal components that explained 95% of the variance in the data. For this model, the optimal parameters were size = 1 and decay = 0.1. However, the predictive performance of this model (cross-validated RMSE of about 2.68) was worse than that of the neural network model without PCA (about 1.17), suggesting that reducing the predictors to only two components removed useful predictive information.
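
How a 95% variance threshold translates into a number of retained components can be sketched with prcomp. The data below are simulated to mimic strongly correlated spectra and are not the tecator predictors; n_comp is an illustrative name.

```r
set.seed(4)
n <- 150
# Hypothetical spectra-like data: 30 columns driven mostly by one latent factor
latent <- rnorm(n)
x <- outer(latent, seq(0.8, 1.2, length.out = 30)) +
  matrix(rnorm(n * 30, sd = 0.05), ncol = 30)

pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the leading components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components reaching the 95% threshold
n_comp <- which(cum_var >= 0.95)[1]
n_comp
```

When the predictors are this strongly correlated, very few components reach the threshold, yet the discarded low-variance components may still carry information about the response.
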

For the MARS (Multivariate Adaptive Regression Splines) model, the optimal tuning parameters were nprune = 18, which controls the number of terms in the model, and degree = 1, which restricts the model to additive effects without interactions.
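
The building block of a MARS model is the hinge function, and a degree = 1 model is a sum of such terms. The following sketch fits a mirrored pair of hinges with lm on simulated data; the knot is fixed by assumption here, whereas the earth package searches for knots itself.

```r
# Hinge function used as a MARS basis term: max(0, x - knot)
hinge <- function(x, knot) pmax(0, x - knot)

set.seed(5)
x <- seq(-2, 2, length.out = 200)
# Hypothetical response with a change in slope at x = 0.5
y <- 1 + 2 * hinge(x, 0.5) + rnorm(200, sd = 0.1)

# Additive (degree = 1) fit with a mirrored pair of hinges at an assumed knot
fit <- lm(y ~ hinge(x, 0.5) + hinge(-x, -0.5))
coef(fit)
```

The nprune parameter then limits how many such terms survive the pruning pass.
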

For the k-nearest neighbors (KNN) model, the optimal number of neighbors was k = 7, which produced the lowest cross-validated prediction error among the tested values.
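
The prediction rule that k controls can be written in a few lines of base R. This is a sketch on simulated data with a hypothetical helper, knn_predict, and is not the implementation caret calls.

```r
# Predict each query row as the mean response of its k nearest training rows
knn_predict <- function(x_train, y_train, x_new, k) {
  apply(x_new, 1, function(row) {
    d <- sqrt(colSums((t(x_train) - row)^2))   # Euclidean distance to every training row
    mean(y_train[order(d)[1:k]])
  })
}

set.seed(6)
x_train <- matrix(rnorm(100 * 3), ncol = 3)    # hypothetical, already standardized
y_train <- x_train[, 1] + rnorm(100, sd = 0.1)
x_new   <- matrix(rnorm(5 * 3), ncol = 3)

knn_predict(x_train, y_train, x_new, k = 7)
```

Because the rule depends on Euclidean distance over all columns, many highly correlated predictors can dominate the distance calculation.
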

e. Which model from parts c) and d) has the best predictive ability? Is any model significantly better or worse than the others?

resamps_nonlinear <- resamples(list(
  SVM  = svmModel,
  NN_PCA = nnetModelPCA,
  MARS = marsModel,
  KNN  = knnModel
))

summary(resamps_nonlinear)
## 
## Call:
## summary.resamples(object = resamps_nonlinear)
## 
## Models: SVM, NN_PCA, MARS, KNN 
## Number of resamples: 30 
## 
## MAE 
##             Min.   1st Qu.    Median      Mean   3rd Qu.     Max. NA's
## SVM    0.6421776 1.0905910 1.2882014 1.2755859 1.4392327 2.071659    0
## NN_PCA 1.6332001 1.9666440 2.1869262 2.1778474 2.3906434 2.659430    0
## MARS   0.4754447 0.6499909 0.7330634 0.7416214 0.8296624 1.039628    0
## KNN    1.5529135 1.7767609 2.0124504 2.0226197 2.2095063 2.887290    0
## 
## RMSE 
##             Min.   1st Qu.    Median      Mean  3rd Qu.     Max. NA's
## SVM    0.8293109 1.5597191 1.8370109 1.7912305 2.067236 2.800037    0
## NN_PCA 1.9213469 2.4445287 2.6908034 2.6812664 2.821700 3.363098    0
## MARS   0.6175776 0.8110667 0.9440813 0.9397485 1.051180 1.290648    0
## KNN    1.9145792 2.2279792 2.3487674 2.3828244 2.541442 3.271495    0
## 
## Rsquared 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## SVM    0.31712738 0.6045432 0.6659600 0.6696664 0.7546262 0.9363666    0
## NN_PCA 0.01382998 0.1359898 0.2343597 0.2405818 0.3255537 0.5313923    0
## MARS   0.84910287 0.8877218 0.9117355 0.9092709 0.9328852 0.9722443    0
## KNN    0.08720124 0.3411169 0.4571343 0.4358612 0.5228570 0.8241962    0
bwplot(resamps_nonlinear)

models_list <- list(
  OLS = olsModel,
  PCR = pcrModel,
  PLS = plsModel,
  Ridge = ridgeModel,
  ENET = enetModel,
  SVM = svmModel,
  MARS = marsModel,
  KNN = knnModel
)

test_results <- lapply(models_list, function(mod) {
  pred <- predict(mod, xTestPP)
  postResample(pred, yTest)
})

test_results
## $OLS
##      RMSE  Rsquared       MAE 
## 1.1510006 0.8744041 0.7897741 
## 
## $PCR
##      RMSE  Rsquared       MAE 
## 0.7734525 0.9397420 0.5703391 
## 
## $PLS
##      RMSE  Rsquared       MAE 
## 0.8071728 0.9332151 0.6135718 
## 
## $Ridge
##      RMSE  Rsquared       MAE 
## 1.4363106 0.8180923 1.1094682 
## 
## $ENET
##      RMSE  Rsquared       MAE 
## 1.0378016 0.8825287 0.8235833 
## 
## $SVM
##      RMSE  Rsquared       MAE 
## 3.0255739 0.1192418 2.6193183 
## 
## $MARS
##      RMSE  Rsquared       MAE 
## 0.8633784 0.9100796 0.6825461 
## 
## $KNN
##      RMSE  Rsquared       MAE 
## 2.1325705 0.4868149 1.7132404

The predictive performance of the models was compared using the test set results. The principal components regression (PCR) model produced the best predictive performance, with a test RMSE of approximately 0.773 and an R² of about 0.94. This indicates that the model explains approximately 94% of the variability in the response variable on the test data.

The PLS (test RMSE of about 0.807) and MARS (about 0.863) models also performed well, but their prediction errors were slightly higher than that of PCR. In contrast, the SVM (RMSE of about 3.03, R² of about 0.12) and KNN (RMSE of about 2.13, R² of about 0.49) models performed poorly relative to the other models.
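
Whether one model is significantly better than another can be checked with a paired comparison of per-resample metrics. The sketch below uses base R on simulated RMSE values; the numbers and variable names are hypothetical and are not taken from this analysis. It assumes both models were scored on the same resamples, which requires shared fold indices or the same seed before each `train()` call. caret also provides a `diff()` method for `resamples` objects that runs this kind of pairwise test.

```r
set.seed(7)

# Hypothetical per-resample RMSE values for two models scored on the SAME
# 30 resamples (simulated, not taken from the analysis above).
n_resamples <- 30
fold_effect <- rnorm(n_resamples, sd = 0.15)  # shared fold difficulty
rmse_model_a <- 0.95 + fold_effect + rnorm(n_resamples, sd = 0.05)
rmse_model_b <- 1.80 + fold_effect + rnorm(n_resamples, sd = 0.30)

# Paired test: is the mean within-resample difference different from zero?
cmp <- t.test(rmse_model_a, rmse_model_b, paired = TRUE)

print(cmp$estimate)  # mean of the paired differences (a minus b)
print(cmp$conf.int)  # 95% confidence interval for that difference
print(cmp$p.value)
```

If the confidence interval for the difference excludes zero, the two models differ by more than resampling noise would suggest.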

The superior performance of PCR is likely due to the high number of predictors and the strong correlation among them. Dimension-reduction methods such as PCR are particularly effective in this situation because they transform the predictors into a smaller set of uncorrelated components that capture most of the variation in the data.
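
The mechanics of PCR can be sketched in base R as PCA followed by least squares on the component scores. The data below are simulated and the choice of three components is an assumption for illustration; in the analysis above the number of components was tuned by resampling inside `train()`.

```r
set.seed(123)

# Simulated correlated predictors: 100 samples, 30 columns from 3 latent factors
n <- 100
latent <- matrix(rnorm(n * 3), nrow = n)
x <- latent %*% matrix(rnorm(3 * 30), nrow = 3) +
  matrix(rnorm(n * 30, sd = 0.1), nrow = n)
colnames(x) <- paste0("w", 1:30)
y <- as.numeric(latent %*% c(2, -1, 0.5) + rnorm(n, sd = 0.2))

train_id <- 1:80
test_id  <- 81:100

# Step 1: PCA on the training predictors only
pca <- prcomp(x[train_id, ], center = TRUE, scale. = TRUE)
k <- 3  # number of components kept (an assumption; normally tuned)

# Step 2: ordinary least squares on the first k component scores
train_scores <- as.data.frame(pca$x[, 1:k])
fit <- lm(y ~ ., data = cbind(y = y[train_id], train_scores))

# Step 3: project the test predictors with the training rotation, then predict
test_scores <- as.data.frame(predict(pca, newdata = x[test_id, ])[, 1:k])
pred <- predict(fit, newdata = test_scores)

# The component scores are uncorrelated by construction
print(round(cor(train_scores), 3))
rmse <- sqrt(mean((y[test_id] - pred)^2))
print(rmse)  # test RMSE
```

Fitting the PCA on the training rows only, and reusing its rotation for the test rows, keeps test information out of the preprocessing step.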

2. Developing a model to predict permeability (see Sect. 1.4 of the textbook) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.5.2
data(permeability)

str(fingerprints)
##  num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...
str(permeability)
##  num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr "permeability"

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

  1. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
nzv <- nearZeroVar(fingerprints)

fpFiltered <- fingerprints[, -nzv]

dim(fpFiltered)
## [1] 165 388

The nearZeroVar() function was used to identify and remove predictors with very little variation across observations. Predictors with near-zero variance provide little useful information for prediction and can negatively affect model performance.

After applying this filter, the number of predictors was reduced from 1107 to 388, leaving only the predictors that contained sufficient variability for modeling.
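
The filtering rule can be illustrated with a base-R sketch. The function name and the toy matrix are hypothetical; the two cutoffs mirror the documented defaults of `nearZeroVar()` (`freqCut = 95/5`, `uniqueCut = 10`), but this is an approximation of the idea and not caret's implementation.

```r
# Flag columns that are constant, or that are both dominated by one value
# (frequency ratio above freq_cut) and have few distinct values
# (percent unique below unique_cut).
near_zero_var_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  flagged <- apply(x, 2, function(col) {
    counts <- sort(table(col), decreasing = TRUE)
    if (length(counts) == 1) return(TRUE)            # zero variance
    freq_ratio <- as.numeric(counts[1] / counts[2])  # most common vs. runner-up
    pct_unique <- 100 * length(counts) / length(col)
    freq_ratio > freq_cut && pct_unique < unique_cut
  })
  unname(which(flagged))
}

# Toy binary fingerprint matrix: 100 compounds, 3 substructures
toy <- cbind(
  balanced = rep(c(0, 1), each = 50),  # 50/50 split: kept
  rare     = c(rep(0, 99), 1),         # 99:1 split: flagged
  constant = rep(0, 100)               # never present: flagged
)

drop_cols <- near_zero_var_sketch(toy)
print(drop_cols)

# Guard against the empty case: x[, -integer(0)] would drop every column
filtered <- if (length(drop_cols) > 0) toy[, -drop_cols, drop = FALSE] else toy
print(colnames(filtered))
```

The guard in the last step matters whenever no column is flagged, because negative indexing with an empty vector returns a matrix with zero columns.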

  1. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R²?
trainIndex2 <- createDataPartition(permeability, p = 0.8, list = FALSE)

xTrain2 <- fpFiltered[trainIndex2, ]
xTest2  <- fpFiltered[-trainIndex2, ]

yTrain2 <- permeability[trainIndex2]
yTest2  <- permeability[-trainIndex2]
plsModel_perm <- train(xTrain2, yTrain2,
                       method = "pls",
                       tuneLength = 20,
                       trControl = ctrl)
print(plsModel_perm)
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 120, 120, 121, 119, 120, 121, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.43369  0.2839322  10.139471
##    2     11.62232  0.4573371   8.277318
##    3     11.38310  0.4838853   8.729371
##    4     11.45801  0.4778862   8.902867
##    5     11.36482  0.4914225   8.843947
##    6     11.46275  0.4804169   8.847717
##    7     11.57633  0.4695458   9.034510
##    8     11.58755  0.4758710   8.921608
##    9     11.79378  0.4677705   9.003738
##   10     12.10930  0.4558820   9.340398
##   11     12.24127  0.4510499   9.356184
##   12     12.57262  0.4283822   9.566781
##   13     12.66535  0.4282678   9.704305
##   14     12.98611  0.4152786   9.933925
##   15     13.41303  0.3944506  10.186739
##   16     13.72884  0.3826145  10.379597
##   17     14.13929  0.3596421  10.658207
##   18     14.30782  0.3535477  10.722201
##   19     14.37893  0.3524029  10.829782
##   20     14.43191  0.3545621  10.884327
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 5.
plsModel_perm$bestTune
##   ncomp
## 5     5

A partial least squares (PLS) model was trained using repeated 10-fold cross-validation to determine the optimal number of latent variables. The results showed that the optimal number of components was five, which produced the lowest cross-validated RMSE (about 11.36).

The cross-validated R² value for this model was approximately 0.491, indicating that the model explains about 49.1% of the variation in permeability during cross-validation.
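
The selection can be reproduced from the printed tuning table. The sketch below copies the resampled RMSE and R² values from the output above, applies the smallest-RMSE rule, and then shows an alternative rule that prefers the simplest model within a tolerance of the best. The 2% tolerance and the variable names are assumptions for illustration only.

```r
# Resampled RMSE and R-squared copied from the PLS tuning table above
pls_cv <- data.frame(
  ncomp = 1:20,
  RMSE = c(13.43369, 11.62232, 11.38310, 11.45801, 11.36482, 11.46275,
           11.57633, 11.58755, 11.79378, 12.10930, 12.24127, 12.57262,
           12.66535, 12.98611, 13.41303, 13.72884, 14.13929, 14.30782,
           14.37893, 14.43191),
  Rsquared = c(0.2839322, 0.4573371, 0.4838853, 0.4778862, 0.4914225,
               0.4804169, 0.4695458, 0.4758710, 0.4677705, 0.4558820,
               0.4510499, 0.4283822, 0.4282678, 0.4152786, 0.3944506,
               0.3826145, 0.3596421, 0.3535477, 0.3524029, 0.3545621)
)

# Rule used in the output: smallest resampled RMSE
best <- pls_cv[which.min(pls_cv$RMSE), ]
print(best)

# Alternative: simplest model within tol_pct percent of the best RMSE
tol_pct <- 2  # hypothetical tolerance, chosen only for illustration
within_tol <- pls_cv$RMSE <= best$RMSE * (1 + tol_pct / 100)
simplest <- pls_cv[which(within_tol)[1], ]
print(simplest)
```

With these values the smallest-RMSE rule picks five components, while the 2% tolerance rule would settle for three, which shows how flat the RMSE curve is between three and eight components.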

  1. Predict the response for the test set. What is the test set estimate of R²?
pred_pls_perm <- predict(plsModel_perm, xTest2)
postResample(pred_pls_perm, yTest2)
##       RMSE   Rsquared        MAE 
## 11.9169369  0.5123017  7.7494084

The PLS model was then evaluated using the test data to assess its predictive performance on unseen observations. The model achieved a test RMSE of approximately 11.92, a mean absolute error (MAE) of about 7.75, and an R² value of approximately 0.512.

These results indicate that the model explains about 51% of the variability in permeability on the test set and demonstrates moderate predictive accuracy, in line with the cross-validated estimate.
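
For reference, the three test metrics can be computed by hand. The sketch below is a base-R approximation of what `postResample()` reports, on the understanding that its R² is the squared correlation between observed and predicted values; the function name and toy vectors are hypothetical.

```r
# Base-R version of the three metrics reported by postResample()
metrics_sketch <- function(pred, obs) {
  c(
    RMSE = sqrt(mean((obs - pred)^2)),
    Rsquared = cor(obs, pred)^2,  # squared correlation, not 1 - SSE/SST
    MAE = mean(abs(obs - pred))
  )
}

# Toy values (hypothetical, for illustration only)
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 2.5, 2.5, 4.5)
m <- metrics_sketch(pred, obs)
print(m)

# Adding a constant bias leaves the squared correlation unchanged
# but inflates RMSE and MAE
m_biased <- metrics_sketch(pred + 10, obs)
print(m_biased)
```

Because this R² ignores bias and scale, it is safer to read it alongside RMSE than on its own when ranking models.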

  1. Try building other models discussed in this chapter. Do any have better predictive performance?
enetModel_perm <- train(xTrain2, yTrain2,
                        method = "glmnet",
                        tuneLength = 10,
                        trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
print(enetModel_perm)
## glmnet 
## 
## 133 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 119, 120, 121, 120, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda      RMSE      Rsquared    MAE      
##   0.1     0.2783191  12.04807  0.43562614   8.913157
##   0.1     0.4230203  12.04807  0.43562614   8.913157
##   0.1     0.6429533  12.04807  0.43562614   8.913157
##   0.1     0.9772319  11.98926  0.43765660   8.867830
##   0.1     1.4853056  11.69428  0.44701885   8.706850
##   0.1     2.2575324  11.38101  0.46298459   8.494550
##   0.1     3.4312485  11.15798  0.48069966   8.263376
##   0.1     5.2151926  11.17055  0.48218357   8.201340
##   0.1     7.9266290  11.26752  0.48040970   8.190025
##   0.1    12.0477713  11.42023  0.48057103   8.375959
##   0.2     0.2783191  12.34574  0.42264663   9.122306
##   0.2     0.4230203  12.34562  0.42264627   9.122341
##   0.2     0.6429533  12.07901  0.42977104   8.930584
##   0.2     0.9772319  11.72144  0.44294772   8.731360
##   0.2     1.4853056  11.40838  0.46104985   8.490328
##   0.2     2.2575324  11.27157  0.47207371   8.328463
##   0.2     3.4312485  11.32997  0.47063575   8.246667
##   0.2     5.2151926  11.36875  0.47530186   8.212969
##   0.2     7.9266290  11.61143  0.46937108   8.576495
##   0.2    12.0477713  11.99641  0.46289213   9.056244
##   0.3     0.2783191  12.49341  0.41630164   9.216367
##   0.3     0.4230203  12.21509  0.42473181   9.018874
##   0.3     0.6429533  11.85397  0.43630431   8.828103
##   0.3     0.9772319  11.50740  0.45468891   8.576213
##   0.3     1.4853056  11.33855  0.46637062   8.384764
##   0.3     2.2575324  11.39280  0.46486756   8.313494
##   0.3     3.4312485  11.39822  0.47087599   8.210771
##   0.3     5.2151926  11.59161  0.46702925   8.510895
##   0.3     7.9266290  11.93537  0.46126369   8.953082
##   0.3    12.0477713  12.42909  0.47268327   9.530473
##   0.4     0.2783191  12.41296  0.41759050   9.149863
##   0.4     0.4230203  12.04579  0.42868928   8.924948
##   0.4     0.6429533  11.66044  0.44605758   8.712287
##   0.4     0.9772319  11.38083  0.46350338   8.436131
##   0.4     1.4853056  11.41404  0.46170215   8.388107
##   0.4     2.2575324  11.41948  0.46615090   8.236604
##   0.4     3.4312485  11.50765  0.46743676   8.351968
##   0.4     5.2151926  11.80840  0.45927158   8.775791
##   0.4     7.9266290  12.18510  0.46946808   9.219501
##   0.4    12.0477713  13.02895  0.46415153  10.160891
##   0.5     0.2783191  12.28818  0.42062737   9.062548
##   0.5     0.4230203  11.89701  0.43387460   8.879482
##   0.5     0.6429533  11.50205  0.45556278   8.575951
##   0.5     0.9772319  11.38868  0.46255794   8.420177
##   0.5     1.4853056  11.46598  0.45948318   8.344318
##   0.5     2.2575324  11.43307  0.46830419   8.220860
##   0.5     3.4312485  11.65259  0.46154454   8.546417
##   0.5     5.2151926  11.96865  0.46015346   8.944097
##   0.5     7.9266290  12.50502  0.47355113   9.591350
##   0.5    12.0477713  13.72681  0.43199257  10.815593
##   0.6     0.2783191  12.16024  0.42374138   8.995089
##   0.6     0.4230203  11.74000  0.44181906   8.782382
##   0.6     0.6429533  11.41948  0.46143069   8.472898
##   0.6     0.9772319  11.46331  0.45739332   8.440642
##   0.6     1.4853056  11.45922  0.46183672   8.262820
##   0.6     2.2575324  11.49920  0.46608878   8.307652
##   0.6     3.4312485  11.79330  0.45489851   8.712912
##   0.6     5.2151926  12.12776  0.46567427   9.123451
##   0.6     7.9266290  12.93280  0.46324591  10.044428
##   0.6    12.0477713  14.28041  0.43765217  11.295744
##   0.7     0.2783191  12.04313  0.42733450   8.968411
##   0.7     0.4230203  11.60620  0.44984417   8.689380
##   0.7     0.6429533  11.41479  0.46114723   8.452001
##   0.7     0.9772319  11.51516  0.45447892   8.430018
##   0.7     1.4853056  11.44221  0.46543987   8.212444
##   0.7     2.2575324  11.59458  0.46145854   8.430835
##   0.7     3.4312485  11.91347  0.45157070   8.841445
##   0.7     5.2151926  12.31495  0.46999387   9.335206
##   0.7     7.9266290  13.42332  0.43942653  10.523087
##   0.7    12.0477713  14.90536  0.42740522  11.774879
##   0.8     0.2783191  11.93756  0.43106414   8.927062
##   0.8     0.4230203  11.49759  0.45670457   8.584301
##   0.8     0.6429533  11.43449  0.45903046   8.448151
##   0.8     0.9772319  11.50150  0.45641815   8.352344
##   0.8     1.4853056  11.44929  0.46676136   8.222501
##   0.8     2.2575324  11.70115  0.45489347   8.562004
##   0.8     3.4312485  12.00750  0.45444116   8.943730
##   0.8     5.2151926  12.56650  0.46672630   9.625228
##   0.8     7.9266290  13.84727  0.42612767  10.926880
##   0.8    12.0477713  15.19926  0.05738661  12.001256
##   0.9     0.2783191  11.83238  0.43662032   8.868193
##   0.9     0.4230203  11.45595  0.45968028   8.534099
##   0.9     0.6429533  11.48665  0.45506410   8.463426
##   0.9     0.9772319  11.49680  0.45802113   8.300999
##   0.9     1.4853056  11.50337  0.46424358   8.284490
##   0.9     2.2575324  11.80747  0.44755547   8.682974
##   0.9     3.4312485  12.10371  0.45817949   9.055609
##   0.9     5.2151926  12.86725  0.45694995   9.949599
##   0.9     7.9266290  14.20735  0.43546606  11.236527
##   0.9    12.0477713  15.20393         NaN  12.006556
##   1.0     0.2783191  11.76534  0.44012808   8.844369
##   1.0     0.4230203  11.47148  0.45798980   8.527689
##   1.0     0.6429533  11.58633  0.44817908   8.529920
##   1.0     0.9772319  11.51223  0.45812286   8.296667
##   1.0     1.4853056  11.58006  0.45951265   8.376757
##   1.0     2.2575324  11.89895  0.44254509   8.780974
##   1.0     3.4312485  12.21572  0.46097897   9.190557
##   1.0     5.2151926  13.19874  0.44028709  10.286147
##   1.0     7.9266290  14.61519  0.44110906  11.557657
##   1.0    12.0477713  15.20393         NaN  12.006556
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 3.431248.
svmModel_perm <- train(xTrain2, yTrain2,
                       method = "svmRadial",
                       tuneLength = 10,
                       trControl = ctrl)
print(svmModel_perm)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 133 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 120, 119, 119, 118, 120, 121, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE     
##     0.25  12.79518  0.4785723  8.560556
##     0.50  12.18757  0.4791612  8.305070
##     1.00  12.02112  0.4832495  8.337828
##     2.00  11.57092  0.4959173  8.302288
##     4.00  11.09229  0.5081174  8.097883
##     8.00  10.74225  0.5139127  7.792229
##    16.00  10.89714  0.5036871  7.887486
##    32.00  10.91031  0.5081135  7.863876
##    64.00  10.91897  0.5095263  7.857432
##   128.00  10.91897  0.5095263  7.857432
## 
## Tuning parameter 'sigma' was held constant at a value of 0.002002188
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.002002188 and C = 8.
rfModel_perm <- train(xTrain2, yTrain2,
                      method = "rf",
                      tuneLength = 5,
                      trControl = ctrl)
print(rfModel_perm)
## Random Forest 
## 
## 133 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 120, 119, 119, 119, 120, 121, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##     2   11.98238  0.4925930  9.121663
##    98   10.54305  0.5518491  7.614558
##   195   10.58876  0.5344975  7.632992
##   291   10.64806  0.5273525  7.652785
##   388   10.71126  0.5170188  7.725487
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 98.
summary(resamples(list(
  PLS = plsModel_perm,
  ENET = enetModel_perm,
  SVM = svmModel_perm,
  RF = rfModel_perm
)))
## 
## Call:
## summary.resamples(object = resamples(list(PLS = plsModel_perm, ENET
##  = enetModel_perm, SVM = svmModel_perm, RF = rfModel_perm)))
## 
## Models: PLS, ENET, SVM, RF 
## Number of resamples: 30 
## 
## MAE 
##          Min.  1st Qu.   Median     Mean   3rd Qu.     Max. NA's
## PLS  5.904825 7.489775 9.092406 8.843947 10.111226 12.69210    0
## ENET 5.933777 7.309821 8.176713 8.263376  9.192128 11.47369    0
## SVM  4.458559 6.398946 7.424103 7.792229  8.762227 13.20407    0
## RF   4.881802 6.028758 7.240696 7.614558  8.946272 13.00423    0
## 
## RMSE 
##          Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## PLS  7.223958 9.860628 11.70589 11.36482 13.00202 16.23431    0
## ENET 7.665538 9.636457 11.05676 11.15798 11.98735 15.69767    0
## SVM  5.810461 8.970648 10.47891 10.74225 12.65723 17.31522    0
## RF   6.425133 8.566387 10.53833 10.54305 12.00286 16.48250    0
## 
## Rsquared 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS  0.13500502 0.3831548 0.4992627 0.4914225 0.6214863 0.8474081    0
## ENET 0.11035463 0.3351115 0.4770298 0.4806997 0.6211014 0.8347710    0
## SVM  0.01510229 0.3307317 0.4849951 0.5139127 0.7164793 0.9323679    0
## RF   0.06455571 0.3866247 0.5934978 0.5518491 0.7288998 0.8613324    0

An elastic net regression model was also trained as an alternative modeling approach. The optimal tuning parameters selected through cross-validation were alpha = 0.1 and lambda = 3.431.

The elastic net model performed about as well as PLS in resampling (mean RMSE of 11.16 versus 11.36, mean R² of 0.481 versus 0.491), so it did not meaningfully outperform the PLS model. The radial SVM and random forest models had somewhat lower mean resampled RMSE (10.74 and 10.54, respectively), and the random forest had the highest mean R² (0.552), but these improvements are small relative to the spread across the 30 resamples.

Overall, PLS remained one of the most appropriate linear models for this dataset because it effectively handles situations with many highly correlated predictors, as is the case for the permeability fingerprints.

  1. Would you recommend any of your models to replace the permeability laboratory experiment?

No, I would not recommend replacing the permeability laboratory experiment with the predictive models. Although the models show moderate predictive ability, their errors are still relatively high and they only explain about half of the variability in permeability. The models could be useful as a screening tool to estimate permeability, but laboratory experiments are still needed for accurate measurements.

  1. Return to the permeability problem outlined in Problem 2. Train several nonlinear regression models and evaluate the resampling and test set performance.
  1. Which nonlinear regression model that we learned in Chapter 7 gives the optimal resampling and test set performance?
nnetModel_perm <- train(
  xTrain2[,1:50], yTrain2,
  method = "nnet",
  preProcess = c("center","scale"),
  tuneLength = 5,
  trControl = ctrl,
  linout = TRUE,
  trace = FALSE,
  MaxNWts = 10000
)

print(nnetModel_perm)
## Neural Network 
## 
## 133 samples
##  50 predictor
## 
## Pre-processing: centered (50), scaled (50) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 119, 120, 120, 121, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE      Rsquared   MAE     
##   1     0e+00  13.39633  0.3172666  9.614060
##   1     1e-04  13.53469  0.2703603  9.923776
##   1     1e-03  13.66557  0.2586329  9.736962
##   1     1e-02  14.05278  0.2675524  9.808760
##   1     1e-01  13.16533  0.3488161  8.839840
##   3     0e+00  12.55551  0.4077590  8.574861
##   3     1e-04  13.06205  0.3946204  8.916273
##   3     1e-03  12.46313  0.4217854  8.471527
##   3     1e-02  12.04062  0.4390640  8.244009
##   3     1e-01  11.50126  0.4786743  7.926274
##   5     0e+00  12.54714  0.4176760  8.222149
##   5     1e-04  11.73660  0.4504477  8.012825
##   5     1e-03  12.49175  0.4073186  8.325884
##   5     1e-02  11.75191  0.4666561  7.984996
##   5     1e-01  12.14790  0.4466491  8.252431
##   7     0e+00  12.63274  0.4312654  8.154874
##   7     1e-04  11.76814  0.4705318  7.793298
##   7     1e-03  12.10087  0.4540046  7.951134
##   7     1e-02  12.42192  0.4542463  8.008970
##   7     1e-01  12.05247  0.4609740  7.861290
##   9     0e+00  11.34506  0.5186476  7.409169
##   9     1e-04  12.07327  0.4655568  7.964807
##   9     1e-03  11.70273  0.4812974  7.668151
##   9     1e-02  12.63742  0.4433036  8.179783
##   9     1e-01  11.65088  0.4956935  7.651756
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 9 and decay = 0.
marsModel_perm <- train(xTrain2, yTrain2,
                        method = "earth",
                        tuneLength = 10,
                        trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
print(marsModel_perm)
## Multivariate Adaptive Regression Spline 
## 
## 133 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 121, 120, 119, 119, 118, 119, ... 
## Resampling results across tuning parameters:
## 
##   nprune  RMSE      Rsquared   MAE      
##    2      12.21259  0.4474004   8.974198
##    7      12.60517  0.4112809   9.196422
##   12      13.79806  0.3999115  10.041943
##   17      13.85971  0.3772740  10.342860
##   22      14.69439  0.3531494  10.889372
##   27      14.72844  0.3535936  10.859103
##   32      14.72844  0.3535936  10.859103
##   37      14.72844  0.3535936  10.859103
##   42      14.72844  0.3535936  10.859103
##   47      14.72844  0.3535936  10.859103
## 
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 2 and degree = 1.
knnModel_perm <- train(xTrain2, yTrain2,
                       method = "knn",
                       tuneLength = 20,
                       trControl = ctrl)
print(knnModel_perm)
## k-Nearest Neighbors 
## 
## 133 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 119, 119, 119, 120, 118, 121, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE      
##    5  12.33879  0.4123702   8.696996
##    7  12.47985  0.4013714   8.713302
##    9  12.45061  0.4061786   8.745097
##   11  12.47094  0.4047003   8.791097
##   13  12.53071  0.3891443   8.918170
##   15  12.49320  0.3858117   9.074485
##   17  12.30274  0.3954831   8.936527
##   19  12.38332  0.3882364   9.103898
##   21  12.64086  0.3652323   9.332445
##   23  12.92798  0.3427873   9.554527
##   25  13.10410  0.3342477   9.750695
##   27  13.20072  0.3301541   9.938453
##   29  13.25099  0.3360636  10.043345
##   31  13.31140  0.3406705  10.084077
##   33  13.35559  0.3455751  10.109728
##   35  13.39580  0.3459777  10.132634
##   37  13.44403  0.3472873  10.149274
##   39  13.52954  0.3377203  10.191125
##   41  13.65597  0.3259436  10.273314
##   43  13.77959  0.3081548  10.365268
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.

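Among the Chapter 7 models, the support vector machine (SVM) produced the best overall resampling performance, with the lowest mean RMSE (about 10.74) and a mean R² of about 0.51. The neural network had a similar mean R² (about 0.52) but a higher mean RMSE (about 11.35), while MARS and KNN were clearly worse (mean RMSE of about 12.21 and 12.30). Of the three models scored on the test set in the next part, MARS had the lowest test RMSE (about 11.37), followed by KNN (about 11.82) and the neural network (about 13.39).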

  1. Do any of the nonlinear models outperform the optimal linear model you previously developed in Problem 2? If so, what might this tell you about the underlying relationship between the predictors and the response?
summary(resamples(list(
  SVM = svmModel_perm,
  RF = rfModel_perm,
  NN = nnetModel_perm,
  MARS = marsModel_perm,
  KNN = knnModel_perm
)))
## 
## Call:
## summary.resamples(object = resamples(list(SVM = svmModel_perm, RF
##  = rfModel_perm, NN = nnetModel_perm, MARS = marsModel_perm, KNN
##  = knnModel_perm)))
## 
## Models: SVM, RF, NN, MARS, KNN 
## Number of resamples: 30 
## 
## MAE 
##          Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## SVM  4.458559 6.398946 7.424103 7.792229 8.762227 13.20407    0
## RF   4.881802 6.028758 7.240696 7.614558 8.946272 13.00423    0
## NN   3.765766 5.441152 7.089517 7.409169 9.033746 11.71748    0
## MARS 6.190312 8.031326 8.604318 8.974198 9.839319 13.47006    0
## KNN  6.607000 7.633063 8.787448 8.936527 9.815409 12.44485    0
## 
## RMSE 
##          Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## SVM  5.810461  8.970648 10.47891 10.74225 12.65723 17.31522    0
## RF   6.425133  8.566387 10.53833 10.54305 12.00286 16.48250    0
## NN   5.291322  8.549497 10.20685 11.34506 14.00131 19.01679    0
## MARS 6.720286 10.634415 12.00540 12.21259 13.98973 18.12959    0
## KNN  8.082813 10.442524 12.48860 12.30274 14.33742 16.95742    0
## 
## Rsquared 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## SVM  1.510229e-02 0.3307317 0.4849951 0.5139127 0.7164793 0.9323679    0
## RF   6.455571e-02 0.3866247 0.5934978 0.5518491 0.7288998 0.8613324    0
## NN   7.402073e-04 0.3663694 0.5125473 0.5186476 0.7759943 0.8767135    0
## MARS 1.191082e-07 0.2502028 0.4199699 0.4474004 0.5785091 0.9393602    2
## KNN  4.924955e-06 0.2249503 0.3487428 0.3954831 0.5400027 0.9045544    0
bwplot(resamples(list(
  SVM = svmModel_perm,
  RF = rfModel_perm,
  NN = nnetModel_perm,
  MARS = marsModel_perm,
  KNN = knnModel_perm
)))

pred_nn_perm <- predict(nnetModel_perm, xTest2)
pred_mars_perm <- predict(marsModel_perm, xTest2)
pred_knn_perm <- predict(knnModel_perm, xTest2)

postResample(pred_nn_perm, yTest2)
##       RMSE   Rsquared        MAE 
## 13.3902014  0.5487011  9.2164802
postResample(pred_mars_perm, yTest2)
##      RMSE  Rsquared       MAE 
## 11.371257  0.537835  8.512762
postResample(pred_knn_perm, yTest2)
##       RMSE   Rsquared        MAE 
## 11.8192992  0.4975314  8.3279898

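Yes, some nonlinear models performed better than the optimal linear model. In resampling, the radial SVM (mean RMSE of about 10.74, R² of about 0.51) and the random forest (about 10.54 and 0.55) outperformed PLS (about 11.36 and 0.49) and the elastic net (about 11.16 and 0.48), and MARS had a slightly lower test RMSE than PLS (11.37 versus 11.92). This suggests that the relationship between the predictors and permeability is likely nonlinear to some degree, and that more flexible models such as SVM or random forests can capture these patterns better than linear models, although the improvement is modest.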

  1. Would you recommend any of the models you have developed to replace the permeability laboratory experiment?

No, the models should not replace the permeability laboratory experiment. While some models provide reasonable predictions, their accuracy is not high enough to fully replace experimental measurements. They are better used as supporting tools to estimate permeability before conducting lab tests.