1a

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
data(tecator)

str(absorp)
##  num [1:215, 1:100] 2.62 2.83 2.58 2.82 2.79 ...
str(endpoints)
##  num [1:215, 1:3] 60.5 46 71 72.8 58.3 44 44 69.3 61.4 61.4 ...
moisture <- endpoints[, 1]
fat <- endpoints[, 2]
protein <- endpoints[, 3]

tecator_df <- as.data.frame(absorp)
tecator_df$protein <- protein


dim(tecator_df)
## [1] 215 101
sum(is.na(tecator_df))
## [1] 0
set.seed(123)
train_index <- createDataPartition(tecator_df$protein, p = 0.8, list = FALSE)

train_data <- tecator_df[train_index, ]
test_data  <- tecator_df[-train_index, ]

dim(train_data)
## [1] 174 101
dim(test_data)
## [1]  41 101
summary(train_data$protein)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   15.25   18.70   17.66   20.10   21.80
summary(test_data$protein)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   15.60   18.80   17.77   20.10   21.60

The absorp object holds 100 numeric predictors for 215 meat samples. These predictors are absorbance values measured at different wavelengths in the near-infrared spectrum. The endpoints object contains three response variables for the same 215 samples: moisture, fat, and protein percentages. In this problem, the response variable used for modeling is the percentage of protein.

1b

protein <- endpoints[, 3]

tecator_df <- as.data.frame(absorp)
tecator_df$protein <- protein

dim(tecator_df)
## [1] 215 101
sum(is.na(tecator_df))
## [1] 0
set.seed(123)
train_index <- createDataPartition(tecator_df$protein, p = 0.8, list = FALSE)

train_data <- tecator_df[train_index, ]
test_data  <- tecator_df[-train_index, ]

dim(train_data)
## [1] 174 101
dim(test_data)
## [1]  41 101
summary(train_data$protein)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   15.25   18.70   17.66   20.10   21.80
summary(test_data$protein)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   15.60   18.80   17.77   20.10   21.60

I created the modeling data set by combining the 100 absorbance predictors with the protein column of endpoints, which gives 215 observations and 101 variables with no missing values. I then split the data 80/20 with createDataPartition, which puts 174 observations in the training set and 41 in the test set. The protein distributions in the two sets are similar (medians 18.70 and 18.80, means 17.66 and 17.77), so the split appears reasonable. For preprocessing, I centered and scaled the predictors for every model. For the neural network with PCA, I additionally applied principal component analysis as part of preprocessing.
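
To make the centering and scaling step concrete, here is a minimal base-R sketch of the idea: the means and standard deviations are estimated on the training rows only and then reused for the test rows. The matrix `x`, the split `train_rows`, and the other names are hypothetical stand-ins on synthetic data, not the tecator objects, and this is only an illustration of the concept rather than caret's internal code.

```r
# Synthetic stand-in for an absorbance matrix (NOT the tecator data).
set.seed(1)
x <- matrix(rnorm(50 * 3, mean = 3, sd = 0.5), nrow = 50,
            dimnames = list(NULL, paste0("V", 1:3)))
train_rows <- 1:40            # hypothetical split, for illustration only
x_train <- x[train_rows, ]
x_test  <- x[-train_rows, ]

# Estimate the centering and scaling constants on the training rows only.
train_center <- colMeans(x_train)
train_scale  <- apply(x_train, 2, sd)

# Apply the same constants to both sets.
x_train_std <- scale(x_train, center = train_center, scale = train_scale)
x_test_std  <- scale(x_test,  center = train_center, scale = train_scale)

round(colMeans(x_train_std), 10)   # training columns are centered at 0
apply(x_train_std, 2, sd)          # and have unit standard deviation
colMeans(x_test_std)               # test columns are close to, not exactly, 0
```

Reusing the training constants keeps information from the test set out of the preprocessing step.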

1c

set.seed(123)

ctrl <- trainControl(
  method = "cv",
  number = 10
)
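
The control object above requests 10-fold cross-validation. As a rough sketch of what that resampling computes, the base-R code below fits a linear model on nine folds, scores it on the held-out fold, and averages the ten RMSE values. The data frame `toy` and the fold assignment are hypothetical and synthetic. The folds here are assigned purely at random, which is a simplification and not necessarily how caret builds its folds.

```r
# Synthetic regression data (an assumption for illustration, not tecator).
set.seed(1)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(n, sd = 0.5)

# Assign each row to one of k folds at random.
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other folds, predict the held-out fold, compute RMSE.
fold_rmse <- sapply(seq_len(k), function(i) {
  fit  <- lm(y ~ ., data = toy[fold_id != i, ])
  pred <- predict(fit, newdata = toy[fold_id == i, ])
  sqrt(mean((toy$y[fold_id == i] - pred)^2))
})

mean(fold_rmse)   # the cross-validated RMSE estimate
```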

OLS

set.seed(123)
ols_model <- train(
  protein ~ .,
  data = train_data,
  method = "lm",
  trControl = ctrl,
  preProcess = c("center", "scale")
)

PCR

set.seed(123)
pcr_grid <- expand.grid(ncomp = 1:30)

pcr_model <- train(
  protein ~ .,
  data = train_data,
  method = "pcr",
  trControl = ctrl,
  tuneGrid = pcr_grid,
  preProcess = c("center", "scale")
)
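
The ncomp grid above tunes how many principal components feed the regression. A minimal base-R sketch of that idea follows: compute principal components of the standardized predictors, keep the first `ncomp` score columns, and regress the response on them. The data are synthetic and all object names are hypothetical. This illustrates the technique only and is not the implementation behind method = "pcr".

```r
# Synthetic, strongly correlated predictors (an assumption, not tecator).
set.seed(1)
n <- 60
base_signal <- rnorm(n)
x <- sapply(1:5, function(j) base_signal + rnorm(n, sd = 0.1))
colnames(x) <- paste0("V", 1:5)
y <- 3 * base_signal + rnorm(n, sd = 0.2)

# Step 1: principal components of the centered and scaled predictors.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Step 2: keep the first ncomp score columns and regress y on them.
ncomp <- 2
scores <- as.data.frame(pca$x[, seq_len(ncomp), drop = FALSE])
scores$y <- y
pcr_fit <- lm(y ~ ., data = scores)

# Step 3: new rows are projected with the same rotation before predicting.
new_scores <- as.data.frame(
  predict(pca, newdata = x[1:5, , drop = FALSE])[, seq_len(ncomp), drop = FALSE]
)
predict(pcr_fit, newdata = new_scores)
```

Because the components are chosen without looking at the response, the number kept has to be tuned by resampling, which is what the grid over ncomp does.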

PLS

set.seed(123)
pls_grid <- expand.grid(ncomp = 1:30)

pls_model <- train(
  protein ~ .,
  data = train_data,
  method = "pls",
  trControl = ctrl,
  tuneGrid = pls_grid,
  preProcess = c("center", "scale")
)

Ridge

set.seed(123)
ridge_grid <- expand.grid(
  alpha = 0,
  lambda = seq(0.0001, 1, length.out = 50)
)

ridge_model <- train(
  protein ~ .,
  data = train_data,
  method = "glmnet",
  trControl = ctrl,
  tuneGrid = ridge_grid,
  preProcess = c("center", "scale")
)

Elastic Net

set.seed(123)
enet_grid <- expand.grid(
  alpha = seq(0, 1, by = 0.1),
  lambda = seq(0.0001, 1, length.out = 30)
)

enet_model <- train(
  protein ~ .,
  data = train_data,
  method = "glmnet",
  trControl = ctrl,
  tuneGrid = enet_grid,
  preProcess = c("center", "scale")
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
ols_model
## Linear Regression 
## 
## 174 samples
## 100 predictors
## 
## Pre-processing: centered (100), scaled (100) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 155, 156, 157, 158, 158, 156, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE      
##   1.138413  0.8615363  0.7803177
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
pcr_model
## Principal Component Analysis 
## 
## 174 samples
## 100 predictors
## 
## Pre-processing: centered (100), scaled (100) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 155, 156, 157, 158, 158, 156, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared    MAE      
##    1     2.9876366  0.06668801  2.5722701
##    2     2.6698774  0.24631136  2.1870159
##    3     2.2683011  0.46093673  1.8045825
##    4     1.7716311  0.66926708  1.3503536
##    5     1.3304824  0.83053138  1.0547325
##    6     1.1342060  0.87908695  0.9202439
##    7     1.1448648  0.87663141  0.9238714
##    8     1.1484247  0.87488770  0.9262312
##    9     1.1218411  0.88140528  0.9067982
##   10     1.0047657  0.90859721  0.8128093
##   11     0.9344906  0.91684229  0.7623180
##   12     0.8931422  0.92458219  0.7267819
##   13     0.7780215  0.94436772  0.5952670
##   14     0.7295296  0.95143872  0.5646718
##   15     0.6925265  0.95652758  0.5365408
##   16     0.6746817  0.95829271  0.5236343
##   17     0.6736075  0.95845119  0.5227128
##   18     0.6857422  0.95855110  0.5241364
##   19     0.6183256  0.96449632  0.4829720
##   20     0.5981827  0.96581467  0.4659080
##   21     0.6031565  0.96533032  0.4699201
##   22     0.6219143  0.96334478  0.4785197
##   23     0.6364964  0.96045499  0.4890731
##   24     0.6380436  0.96036786  0.4886656
##   25     0.6242135  0.96215708  0.4813794
##   26     0.6201498  0.96195645  0.4712542
##   27     0.6214427  0.96185212  0.4745361
##   28     0.6077482  0.96385012  0.4662808
##   29     0.6238763  0.96191593  0.4699408
##   30     0.6072830  0.96428949  0.4607085
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 20.
pls_model
## Partial Least Squares 
## 
## 174 samples
## 100 predictors
## 
## Pre-processing: centered (100), scaled (100) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 155, 156, 157, 158, 158, 156, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared    MAE      
##    1     2.9795263  0.07249389  2.5621976
##    2     2.3197129  0.44055459  1.8695658
##    3     1.8491700  0.63867287  1.3803345
##    4     1.6782982  0.70271733  1.2912059
##    5     1.1867493  0.86819786  0.9562263
##    6     1.1289054  0.88011995  0.9145243
##    7     1.0983415  0.88627175  0.8725466
##    8     0.9345048  0.91927376  0.7539510
##    9     0.9114904  0.92331292  0.7335407
##   10     0.8595929  0.93153221  0.6965721
##   11     0.7614413  0.94283846  0.6113109
##   12     0.7027750  0.95387565  0.5432915
##   13     0.6840968  0.95752966  0.5251180
##   14     0.6285124  0.96319068  0.4880807
##   15     0.6127603  0.96446143  0.4766054
##   16     0.6145679  0.96340201  0.4756950
##   17     0.6227682  0.96212424  0.4791824
##   18     0.6532416  0.95881638  0.4857910
##   19     0.6877777  0.95503858  0.5067198
##   20     0.7429902  0.94743812  0.5242525
##   21     0.7656573  0.94331614  0.5333422
##   22     0.8180435  0.93430272  0.5613121
##   23     0.9287112  0.91480243  0.6214008
##   24     0.9610835  0.91158640  0.6326250
##   25     0.9720434  0.91040971  0.6286592
##   26     0.9773493  0.91032973  0.6260162
##   27     0.9497919  0.91473732  0.6057266
##   28     0.9188307  0.91936009  0.5863544
##   29     0.9280125  0.91774400  0.5843153
##   30     0.9054861  0.91986090  0.5843395
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 15.
ridge_model
## glmnet 
## 
## 174 samples
## 100 predictors
## 
## Pre-processing: centered (100), scaled (100) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 155, 156, 157, 158, 158, 156, ... 
## Resampling results across tuning parameters:
## 
##   lambda      RMSE      Rsquared   MAE     
##   0.00010000  1.659658  0.7158218  1.291954
##   0.02050612  1.659658  0.7158218  1.291954
##   0.04091224  1.659658  0.7158218  1.291954
##   0.06131837  1.659658  0.7158218  1.291954
##   0.08172449  1.664941  0.7155915  1.296075
##   0.10213061  1.687953  0.7118066  1.312181
##   0.12253673  1.700752  0.7104086  1.323613
##   0.14294286  1.724764  0.7055937  1.347213
##   0.16334898  1.753502  0.6984111  1.374848
##   0.18375510  1.771947  0.6940747  1.393501
##   0.20416122  1.802417  0.6858172  1.423789
##   0.22456735  1.819665  0.6822046  1.440474
##   0.24497347  1.832859  0.6790502  1.455233
##   0.26537959  1.859682  0.6717668  1.481134
##   0.28578571  1.872719  0.6689538  1.494428
##   0.30619184  1.888483  0.6655957  1.509850
##   0.32659796  1.900648  0.6631272  1.524522
##   0.34700408  1.920676  0.6577068  1.544963
##   0.36741020  1.933081  0.6548576  1.558555
##   0.38781633  1.944796  0.6518740  1.570858
##   0.40822245  1.957731  0.6486168  1.583467
##   0.42862857  1.968955  0.6459500  1.596549
##   0.44903469  1.982956  0.6422985  1.611156
##   0.46944082  1.995650  0.6393739  1.622740
##   0.48984694  2.003579  0.6379998  1.631602
##   0.51025306  2.012133  0.6359931  1.641573
##   0.53065918  2.024126  0.6327831  1.654175
##   0.55106531  2.039183  0.6285863  1.669419
##   0.57147143  2.051092  0.6253813  1.682060
##   0.59187755  2.060711  0.6226803  1.691989
##   0.61228367  2.070851  0.6196520  1.702044
##   0.63268980  2.081290  0.6165453  1.711743
##   0.65309592  2.091688  0.6135885  1.722547
##   0.67350204  2.101081  0.6107686  1.732685
##   0.69390816  2.109009  0.6085264  1.741505
##   0.71431429  2.116331  0.6065120  1.748986
##   0.73472041  2.123249  0.6048515  1.755678
##   0.75512653  2.129937  0.6032479  1.762029
##   0.77553265  2.136432  0.6018597  1.768539
##   0.79593878  2.142884  0.6004277  1.775016
##   0.81634490  2.150121  0.5984356  1.782322
##   0.83675102  2.157929  0.5961970  1.789918
##   0.85715714  2.166181  0.5934578  1.797761
##   0.87756327  2.174181  0.5908770  1.805436
##   0.89796939  2.181511  0.5886465  1.812446
##   0.91837551  2.187967  0.5867253  1.818751
##   0.93878163  2.194040  0.5850817  1.824715
##   0.95918776  2.200047  0.5833956  1.830463
##   0.97959388  2.206305  0.5814833  1.836356
##   1.00000000  2.212470  0.5795629  1.842083
## 
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 0.06131837.
enet_model
## glmnet 
## 
## 174 samples
## 100 predictors
## 
## Pre-processing: centered (100), scaled (100) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 155, 156, 157, 158, 158, 156, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda      RMSE      Rsquared     MAE      
##   0.0    0.00010000  1.659658  0.715821840  1.2919543
##   0.0    0.03457931  1.659658  0.715821840  1.2919543
##   0.0    0.06905862  1.659658  0.715821840  1.2919543
##   0.0    0.10353793  1.689288  0.711553026  1.3130625
##   0.0    0.13801724  1.718508  0.707065632  1.3407804
##   0.0    0.17249655  1.764649  0.696171909  1.3850545
##   0.0    0.20697586  1.804238  0.685324883  1.4255801
##   0.0    0.24145517  1.831015  0.679432525  1.4523924
##   0.0    0.27593448  1.867485  0.669786953  1.4881251
##   0.0    0.31041379  1.891495  0.664972932  1.5129243
##   0.0    0.34489310  1.918835  0.658145118  1.5431980
##   0.0    0.37937241  1.939481  0.653299232  1.5655474
##   0.0    0.41385172  1.960798  0.647914006  1.5866666
##   0.0    0.44833103  1.982480  0.642423067  1.6106874
##   0.0    0.48281034  2.001431  0.638300999  1.6289289
##   0.0    0.51728966  2.015808  0.635137491  1.6455758
##   0.0    0.55176897  2.039628  0.628467264  1.6698453
##   0.0    0.58624828  2.058196  0.623475582  1.6895761
##   0.0    0.62072759  2.075155  0.618400635  1.7060264
##   0.0    0.65520690  2.092612  0.613328594  1.7235277
##   0.0    0.68968621  2.107482  0.608938386  1.7398764
##   0.0    0.72416552  2.119858  0.605589623  1.7525010
##   0.0    0.75864483  2.131080  0.602982165  1.7631609
##   0.0    0.79312414  2.141936  0.600670569  1.7740794
##   0.0    0.82760345  2.154398  0.597271950  1.7865316
##   0.0    0.86208276  2.168240  0.592746604  1.7997110
##   0.0    0.89656207  2.181062  0.588783820  1.8120107
##   0.0    0.93104138  2.191828  0.585653400  1.8225579
##   0.0    0.96552069  2.201997  0.582801510  1.8323035
##   0.0    1.00000000  2.212470  0.579562864  1.8420830
##   0.1    0.00010000  1.125772  0.880485418  0.9221312
##   0.1    0.03457931  1.501731  0.771352901  1.1706462
##   0.1    0.06905862  1.640397  0.728766049  1.2736482
##   0.1    0.10353793  1.709578  0.712428889  1.3330759
##   0.1    0.13801724  1.766053  0.700968312  1.3906625
##   0.1    0.17249655  1.817896  0.690899856  1.4442248
##   0.1    0.20697586  1.866236  0.681458140  1.4933100
##   0.1    0.24145517  1.911455  0.672729524  1.5415490
##   0.1    0.27593448  1.954415  0.664159574  1.5872745
##   0.1    0.31041379  1.994943  0.655938435  1.6297498
##   0.1    0.34489310  2.033346  0.648085482  1.6708535
##   0.1    0.37937241  2.070140  0.640210194  1.7092306
##   0.1    0.41385172  2.105339  0.632356888  1.7451146
##   0.1    0.44833103  2.139047  0.624525930  1.7800796
##   0.1    0.48281034  2.171233  0.616684256  1.8138331
##   0.1    0.51728966  2.202109  0.608787436  1.8456567
##   0.1    0.55176897  2.232076  0.600744840  1.8760201
##   0.1    0.58624828  2.261367  0.592380775  1.9052651
##   0.1    0.62072759  2.289793  0.583799414  1.9335204
##   0.1    0.65520690  2.317547  0.574935199  1.9606887
##   0.1    0.68968621  2.344072  0.565928654  1.9863380
##   0.1    0.72416552  2.369634  0.556796080  2.0106681
##   0.1    0.75864483  2.394312  0.547468841  2.0338812
##   0.1    0.79312414  2.418245  0.537973390  2.0561363
##   0.1    0.82760345  2.441361  0.528316091  2.0774786
##   0.1    0.86208276  2.463828  0.518430500  2.0979893
##   0.1    0.89656207  2.485156  0.508499057  2.1173380
##   0.1    0.93104138  2.506052  0.498186777  2.1361808
##   0.1    0.96552069  2.525995  0.487838054  2.1544397
##   0.1    1.00000000  2.545617  0.477006909  2.1722442
##   0.2    0.00010000  1.126904  0.880578845  0.9210240
##   0.2    0.03457931  1.535941  0.759977521  1.1932457
##   0.2    0.06905862  1.666601  0.723442709  1.2946535
##   0.2    0.10353793  1.741279  0.709165289  1.3666303
##   0.2    0.13801724  1.810678  0.697398146  1.4387496
##   0.2    0.17249655  1.876468  0.686411006  1.5072117
##   0.2    0.20697586  1.938533  0.676482874  1.5748503
##   0.2    0.24145517  1.999284  0.666002589  1.6396374
##   0.2    0.27593448  2.057661  0.655003894  1.7021803
##   0.2    0.31041379  2.113979  0.644019756  1.7629029
##   0.2    0.34489310  2.168562  0.632002138  1.8193651
##   0.2    0.37937241  2.221644  0.618923100  1.8729116
##   0.2    0.41385172  2.272708  0.604808133  1.9232404
##   0.2    0.44833103  2.321649  0.589675101  1.9703558
##   0.2    0.48281034  2.369264  0.573002730  2.0150924
##   0.2    0.51728966  2.415497  0.554846923  2.0575966
##   0.2    0.55176897  2.458750  0.535959390  2.0969985
##   0.2    0.58624828  2.500703  0.515434446  2.1352397
##   0.2    0.62072759  2.541724  0.492674437  2.1732671
##   0.2    0.65520690  2.580724  0.468107177  2.2092272
##   0.2    0.68968621  2.618972  0.440854530  2.2439800
##   0.2    0.72416552  2.655835  0.411411924  2.2771156
##   0.2    0.75864483  2.691593  0.379754723  2.3088981
##   0.2    0.79312414  2.725965  0.346178312  2.3391449
##   0.2    0.82760345  2.759093  0.311149900  2.3684444
##   0.2    0.86208276  2.790116  0.276405777  2.3965761
##   0.2    0.89656207  2.820089  0.241501292  2.4239428
##   0.2    0.93104138  2.847617  0.209001291  2.4493344
##   0.2    0.96552069  2.873720  0.178219723  2.4732239
##   0.2    1.00000000  2.897732  0.150701000  2.4950078
##   0.3    0.00010000  1.131614  0.879525007  0.9307496
##   0.3    0.03457931  1.571176  0.748066244  1.2175295
##   0.3    0.06905862  1.688219  0.720446445  1.3139460
##   0.3    0.10353793  1.774554  0.706521010  1.4027377
##   0.3    0.13801724  1.856560  0.694778969  1.4884311
##   0.3    0.17249655  1.937423  0.683244452  1.5762505
##   0.3    0.20697586  2.017427  0.670596583  1.6640772
##   0.3    0.24145517  2.096292  0.656602159  1.7496298
##   0.3    0.27593448  2.173584  0.640617423  1.8290522
##   0.3    0.31041379  2.248272  0.622143088  1.9031092
##   0.3    0.34489310  2.320824  0.600462748  1.9724782
##   0.3    0.37937241  2.391672  0.574856885  2.0379259
##   0.3    0.41385172  2.458970  0.545508295  2.0999370
##   0.3    0.44833103  2.523379  0.511158510  2.1595075
##   0.3    0.48281034  2.585059  0.471508593  2.2157912
##   0.3    0.51728966  2.644562  0.425280542  2.2689962
##   0.3    0.55176897  2.701764  0.372684704  2.3192763
##   0.3    0.58624828  2.756472  0.315090106  2.3677919
##   0.3    0.62072759  2.808441  0.255315206  2.4147909
##   0.3    0.65520690  2.856686  0.197660092  2.4585964
##   0.3    0.68968621  2.898076  0.149567730  2.4960376
##   0.3    0.72416552  2.931340  0.112836221  2.5256414
##   0.3    0.75864483  2.945832  0.101098909  2.5388888
##   0.3    0.79312414  2.952297  0.096355368  2.5446494
##   0.3    0.82760345  2.953775  0.095798265  2.5463146
##   0.3    0.86208276  2.954779  0.095712591  2.5475950
##   0.3    0.89656207  2.955635  0.095701216  2.5486747
##   0.3    0.93104138  2.956537  0.095690059  2.5497678
##   0.3    0.96552069  2.957478  0.095680883  2.5508685
##   0.3    1.00000000  2.958453  0.095674822  2.5519678
##   0.4    0.00010000  1.150505  0.874128794  0.9365800
##   0.4    0.03457931  1.593803  0.741393668  1.2343912
##   0.4    0.06905862  1.708868  0.718330953  1.3346107
##   0.4    0.10353793  1.806131  0.705217974  1.4369465
##   0.4    0.13801724  1.904405  0.692530219  1.5426512
##   0.4    0.17249655  2.003948  0.678401360  1.6532075
##   0.4    0.20697586  2.103585  0.661878365  1.7610526
##   0.4    0.24145517  2.203604  0.640934301  1.8619320
##   0.4    0.27593448  2.301310  0.614221600  1.9556836
##   0.4    0.31041379  2.396661  0.580044691  2.0449631
##   0.4    0.34489310  2.487971  0.537046356  2.1294894
##   0.4    0.37937241  2.574801  0.483377323  2.2085781
##   0.4    0.41385172  2.657494  0.417064978  2.2819482
##   0.4    0.44833103  2.736080  0.338451522  2.3512567
##   0.4    0.48281034  2.810944  0.252335083  2.4185150
##   0.4    0.51728966  2.879721  0.170059191  2.4803647
##   0.4    0.55176897  2.929948  0.114707910  2.5248972
##   0.4    0.58624828  2.949064  0.098888674  2.5421436
##   0.4    0.62072759  2.953746  0.095552815  2.5465435
##   0.4    0.65520690  2.955187  0.095395514  2.5483526
##   0.4    0.68968621  2.956306  0.095397099  2.5497642
##   0.4    0.72416552  2.957496  0.095397831  2.5511801
##   0.4    0.75864483  2.958765  0.095384339  2.5526116
##   0.4    0.79312414  2.960099  0.095371938  2.5540442
##   0.4    0.82760345  2.961494  0.095356725  2.5554777
##   0.4    0.86208276  2.962961  0.095335325  2.5569175
##   0.4    0.89656207  2.964496  0.095308042  2.5583589
##   0.4    0.93104138  2.966098  0.095281389  2.5598074
##   0.4    0.96552069  2.967755  0.095258917  2.5612524
##   0.4    1.00000000  2.969473  0.095231520  2.5626956
##   0.5    0.00010000  1.137503  0.878127434  0.9273956
##   0.5    0.03457931  1.608873  0.737577645  1.2462129
##   0.5    0.06905862  1.726319  0.717480614  1.3531381
##   0.5    0.10353793  1.838550  0.703820476  1.4719512
##   0.5    0.13801724  1.955810  0.689396924  1.6018615
##   0.5    0.17249655  2.076115  0.671610638  1.7342710
##   0.5    0.20697586  2.198159  0.647665462  1.8584840
##   0.5    0.24145517  2.321002  0.613913000  1.9762401
##   0.5    0.27593448  2.440847  0.565929647  2.0879889
##   0.5    0.31041379  2.555403  0.500152845  2.1926747
##   0.5    0.34489310  2.662999  0.414013650  2.2880178
##   0.5    0.37937241  2.767370  0.303799133  2.3808503
##   0.5    0.41385172  2.863205  0.189228919  2.4664277
##   0.5    0.44833103  2.930554  0.115680772  2.5261619
##   0.5    0.48281034  2.952613  0.095681747  2.5455513
##   0.5    0.51728966  2.955266  0.094968441  2.5487023
##   0.5    0.55176897  2.956628  0.094965153  2.5504450
##   0.5    0.58624828  2.958098  0.094956335  2.5521963
##   0.5    0.62072759  2.959679  0.094934954  2.5539639
##   0.5    0.65520690  2.961363  0.094912221  2.5557338
##   0.5    0.68968621  2.963153  0.094881563  2.5575091
##   0.5    0.72416552  2.965044  0.094843455  2.5592887
##   0.5    0.75864483  2.967043  0.094796935  2.5610778
##   0.5    0.79312414  2.969139  0.094747281  2.5628651
##   0.5    0.82760345  2.971335  0.094692478  2.5647083
##   0.5    0.86208276  2.973630  0.094627083  2.5667930
##   0.5    0.89656207  2.976019  0.094563772  2.5691731
##   0.5    0.93104138  2.978495  0.094504064  2.5715470
##   0.5    0.96552069  2.981065  0.094429367  2.5739189
##   0.5    1.00000000  2.983736  0.094341095  2.5762933
##   0.6    0.00010000  1.129340  0.879258089  0.9170077
##   0.6    0.03457931  1.621463  0.734592427  1.2570742
##   0.6    0.06905862  1.743903  0.716922065  1.3726485
##   0.6    0.10353793  1.872587  0.702537252  1.5105790
##   0.6    0.13801724  2.009448  0.685478409  1.6642602
##   0.6    0.17249655  2.152035  0.661703059  1.8140207
##   0.6    0.20697586  2.299290  0.625617328  1.9577276
##   0.6    0.24145517  2.445795  0.568184505  2.0945636
##   0.6    0.27593448  2.587926  0.479565954  2.2228823
##   0.6    0.31041379  2.723503  0.353833460  2.3431337
##   0.6    0.34489310  2.847965  0.207098266  2.4537113
##   0.6    0.37937241  2.933171  0.113783881  2.5290537
##   0.6    0.41385172  2.954130  0.094909440  2.5474781
##   0.6    0.44833103  2.956458  0.094558505  2.5503893
##   0.6    0.48281034  2.958163  0.094526756  2.5524957
##   0.6    0.51728966  2.960017  0.094489253  2.5546061
##   0.6    0.55176897  2.962015  0.094452627  2.5567134
##   0.6    0.58624828  2.964160  0.094411671  2.5588224
##   0.6    0.62072759  2.966452  0.094362955  2.5609373
##   0.6    0.65520690  2.968885  0.094310260  2.5630543
##   0.6    0.68968621  2.971461  0.094247746  2.5651857
##   0.6    0.72416552  2.974173  0.094185755  2.5676460
##   0.6    0.75864483  2.977026  0.094125172  2.5703997
##   0.6    0.79312414  2.980014  0.094047893  2.5732181
##   0.6    0.82760345  2.983143  0.093947149  2.5760351
##   0.6    0.86208276  2.986396  0.093867473  2.5788360
##   0.6    0.89656207  2.989785  0.093814701  2.5816362
##   0.6    0.93104138  2.993307  0.093760257  2.5844330
##   0.6    0.96552069  2.996956  0.093703391  2.5872121
##   0.6    1.00000000  3.000737  0.093612374  2.5899789
##   0.7    0.00010000  1.147522  0.874592605  0.9354348
##   0.7    0.03457931  1.628639  0.733651506  1.2636309
##   0.7    0.06905862  1.760312  0.716821326  1.3905159
##   0.7    0.10353793  1.906406  0.701326281  1.5505932
##   0.7    0.13801724  2.065815  0.680291729  1.7262305
##   0.7    0.17249655  2.232786  0.647338187  1.8951834
##   0.7    0.20697586  2.405745  0.589879711  2.0589632
##   0.7    0.24145517  2.578328  0.488926914  2.2156280
##   0.7    0.27593448  2.742513  0.332914308  2.3615517
##   0.7    0.31041379  2.891438  0.155561538  2.4926902
##   0.7    0.34489310  2.952172  0.095650200  2.5456418
##   0.7    0.37937241  2.956258  0.094438483  2.5504258
##   0.7    0.41385172  2.958231  0.094386938  2.5528943
##   0.7    0.44833103  2.960403  0.094325271  2.5553602
##   0.7    0.48281034  2.962771  0.094260929  2.5578283
##   0.7    0.51728966  2.965330  0.094196667  2.5602905
##   0.7    0.55176897  2.968083  0.094130664  2.5627474
##   0.7    0.58624828  2.971021  0.094078047  2.5651866
##   0.7    0.62072759  2.974141  0.094027907  2.5679444
##   0.7    0.65520690  2.977453  0.093966498  2.5710528
##   0.7    0.68968621  2.980956  0.093899136  2.5743181
##   0.7    0.72416552  2.984643  0.093840697  2.5775795
##   0.7    0.75864483  2.988512  0.093778925  2.5808363
##   0.7    0.79312414  2.992564  0.093705022  2.5840845
##   0.7    0.82760345  2.996789  0.093620993  2.5873063
##   0.7    0.86208276  3.001183  0.093538125  2.5905037
##   0.7    0.89656207  3.005726  0.093516976  2.5936282
##   0.7    0.93104138  3.010438  0.093537969  2.5967979
##   0.7    0.96552069  3.015239  0.093668892  2.6000964
##   0.7    1.00000000  3.020158  0.093927670  2.6033511
##   0.8    0.00010000  1.131266  0.879563395  0.9249437
##   0.8    0.03457931  1.634088  0.733328631  1.2691600
##   0.8    0.06905862  1.776683  0.716697251  1.4073730
##   0.8    0.10353793  1.941885  0.699426836  1.5913068
##   0.8    0.13801724  2.124365  0.673401158  1.7878231
##   0.8    0.17249655  2.319247  0.625473720  1.9791989
##   0.8    0.20697586  2.518949  0.532305726  2.1638257
##   0.8    0.24145517  2.716087  0.362305530  2.3396378
##   0.8    0.27593448  2.896932  0.150547723  2.4982403
##   0.8    0.31041379  2.953827  0.095022036  2.5477299
##   0.8    0.34489310  2.957044  0.094478223  2.5517224
##   0.8    0.37937241  2.959393  0.094416213  2.5545382
##   0.8    0.41385172  2.961996  0.094354843  2.5573484
##   0.8    0.44833103  2.964859  0.094291602  2.5601554
##   0.8    0.48281034  2.967966  0.094239984  2.5629484
##   0.8    0.51728966  2.971323  0.094180895  2.5657391
##   0.8    0.55176897  2.974935  0.094107041  2.5689440
##   0.8    0.58624828  2.978786  0.094046510  2.5725861
##   0.8    0.62072759  2.982873  0.093978961  2.5762793
##   0.8    0.65520690  2.987186  0.093914149  2.5799476
##   0.8    0.68968621  2.991741  0.093856990  2.5836024
##   0.8    0.72416552  2.996528  0.093801877  2.5872150
##   0.8    0.75864483  3.001543  0.093782615  2.5908141
##   0.8    0.79312414  3.006714  0.093881265  2.5943327
##   0.8    0.82760345  3.012115  0.093998519  2.5979975
##   0.8    0.86208276  3.017681  0.094200738  2.6017517
##   0.8    0.89656207  3.023462  0.094346432  2.6054818
##   0.8    0.93104138  3.028497  0.073914142  2.6086294
##   0.8    0.96552069  3.033114  0.060300643  2.6113550
##   0.8    1.00000000  3.036824  0.047857762  2.6131594
##   0.9    0.00010000  1.111846  0.884106719  0.9048338
##   0.9    0.03457931  1.637104  0.733453895  1.2721791
##   0.9    0.06905862  1.790015  0.717399978  1.4234740
##   0.9    0.10353793  1.973882  0.698693480  1.6291205
##   0.9    0.13801724  2.181300  0.665286110  1.8468425
##   0.9    0.17249655  2.403815  0.595692370  2.0603206
##   0.9    0.20697586  2.636074  0.442756765  2.2701892
##   0.9    0.24145517  2.868163  0.182236279  2.4737202
##   0.9    0.27593448  2.953315  0.095522881  2.5476230
##   0.9    0.31041379  2.957075  0.094858954  2.5522427
##   0.9    0.34489310  2.959761  0.094787782  2.5553802
##   0.9    0.37937241  2.962769  0.094712690  2.5585143
##   0.9    0.41385172  2.966102  0.094628388  2.5616502
##   0.9    0.44833103  2.969762  0.094534048  2.5648079
##   0.9    0.48281034  2.973718  0.094458967  2.5681662
##   0.9    0.51728966  2.977979  0.094389580  2.5720474
##   0.9    0.55176897  2.982554  0.094318194  2.5761529
##   0.9    0.58624828  2.987393  0.094270246  2.5802147
##   0.9    0.62072759  2.992510  0.094252868  2.5842484
##   0.9    0.65520690  2.997887  0.094315667  2.5882387
##   0.9    0.68968621  3.003553  0.094428538  2.5922187
##   0.9    0.72416552  3.009515  0.094587365  2.5962165
##   0.9    0.75864483  3.015771  0.094854578  2.6005066
##   0.9    0.79312414  3.022333  0.095108367  2.6047910
##   0.9    0.82760345  3.028228  0.074699193  2.6084916
##   0.9    0.86208276  3.033478  0.060375626  2.6115550
##   0.9    0.89656207  3.037426  0.037010058  2.6134275
##   0.9    0.93104138  3.038672  0.015264540  2.6134522
##   0.9    0.96552069  3.039217  0.009627915  2.6135659
##   0.9    1.00000000  3.039043  0.005176692  2.6138343
##   1.0    0.00010000  1.115822  0.884218412  0.9118328
##   1.0    0.03457931  1.642179  0.731543769  1.2750512
##   1.0    0.06905862  1.794584  0.717532037  1.4281020
##   1.0    0.10353793  1.990545  0.698054711  1.6473876
##   1.0    0.13801724  2.221486  0.658315739  1.8860426
##   1.0    0.17249655  2.478082  0.559939318  2.1299949
##   1.0    0.20697586  2.754589  0.317694012  2.3761323
##   1.0    0.24145517  2.948338  0.097958829  2.5428322
##   1.0    0.27593448  2.956696  0.094968464  2.5520535
##   1.0    0.31041379  2.959594  0.094904807  2.5554719
##   1.0    0.34489310  2.962893  0.094837550  2.5588881
##   1.0    0.37937241  2.966598  0.094762161  2.5623025
##   1.0    0.41385172  2.970707  0.094670949  2.5657192
##   1.0    0.44833103  2.975105  0.094711636  2.5694675
##   1.0    0.48281034  2.979860  0.094786641  2.5738570
##   1.0    0.51728966  2.985009  0.094881747  2.5783516
##   1.0    0.55176897  2.990559  0.095008918  2.5828468
##   1.0    0.58624828  2.996508  0.095164718  2.5873424
##   1.0    0.62072759  3.002866  0.095259064  2.5918356
##   1.0    0.65520690  3.009619  0.095326497  2.5963686
##   1.0    0.68968621  3.016767  0.095329484  2.6012252
##   1.0    0.72416552  3.024295  0.095334217  2.6060749
##   1.0    0.75864483  3.030454  0.074962929  2.6098999
##   1.0    0.79312414  3.035953  0.052468359  2.6128169
##   1.0    0.82760345  3.038388  0.015645467  2.6135242
##   1.0    0.86208276  3.039169  0.009614598  2.6134576
##   1.0    0.89656207  3.039066  0.005176692  2.6138017
##   1.0    0.93104138  3.038984          NaN  2.6139222
##   1.0    0.96552069  3.038984          NaN  2.6139222
##   1.0    1.00000000  3.038984          NaN  2.6139222
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.9 and lambda = 1e-04.
pcr_model$bestTune
##    ncomp
## 20    20
pls_model$bestTune
##    ncomp
## 15    15
ridge_model$bestTune
##   alpha     lambda
## 4     0 0.06131837
enet_model$bestTune
##     alpha lambda
## 271   0.9  1e-04
chapter6_results <- data.frame(
  Model = c("OLS", "PCR", "PLS", "Ridge", "Elastic Net"),
  RMSE = c(
    min(ols_model$results$RMSE),
    min(pcr_model$results$RMSE),
    min(pls_model$results$RMSE),
    min(ridge_model$results$RMSE),
    min(enet_model$results$RMSE, na.rm = TRUE)
  ),
  Rsquared = c(
    max(ols_model$results$Rsquared),
    max(pcr_model$results$Rsquared),
    max(pls_model$results$Rsquared),
    max(ridge_model$results$Rsquared),
    max(enet_model$results$Rsquared, na.rm = TRUE)
  ),
  MAE = c(
    min(ols_model$results$MAE),
    min(pcr_model$results$MAE),
    min(pls_model$results$MAE),
    min(ridge_model$results$MAE),
    min(enet_model$results$MAE, na.rm = TRUE)
  )
)

chapter6_results
##         Model      RMSE  Rsquared       MAE
## 1         OLS 1.1384127 0.8615363 0.7803177
## 2         PCR 0.5981827 0.9658147 0.4607085
## 3         PLS 0.6127603 0.9644614 0.4756950
## 4       Ridge 1.6596581 0.7158218 1.2919543
## 5 Elastic Net 1.1118457 0.8842184 0.9048338
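
One caveat about the summary table: it takes the best RMSE, R-squared, and MAE independently, so the three numbers in a row can come from different tuning values. For PCR, the reported MAE of 0.4607 belongs to ncomp = 30, while the selected model uses ncomp = 20. The sketch below reproduces this with three rows copied from the pcr_model output and shows one way to pull all metrics from the selected row instead. The names `pcr_results` and `pcr_best` are hypothetical stand-ins for pcr_model$results and pcr_model$bestTune.

```r
# Three rows copied from the pcr_model output above (ncomp = 19, 20, 30).
pcr_results <- data.frame(
  ncomp    = c(19, 20, 30),
  RMSE     = c(0.6183256, 0.5981827, 0.6072830),
  Rsquared = c(0.96449632, 0.96581467, 0.96428949),
  MAE      = c(0.4829720, 0.4659080, 0.4607085)
)
pcr_best <- data.frame(ncomp = 20)   # matches the reported bestTune

# Taking the minimum per column can pick a row other than the selected one.
min(pcr_results$MAE)                 # 0.4607085, which belongs to ncomp = 30

# Merging on the tuning parameter keeps every metric from the selected row.
best_row <- merge(pcr_results, pcr_best)
best_row$MAE                         # 0.4659080, the MAE at ncomp = 20
```

With the fitted objects, the same merge should work on pcr_model$results and pcr_model$bestTune, and caret's getTrainPerf() is meant to return the resampled metrics for the selected tuning values. The ranking of the five models does not change either way.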
chapter6_tuning <- list(
  PCR = pcr_model$bestTune,
  PLS = pls_model$bestTune,
  Ridge = ridge_model$bestTune,
  Elastic_Net = enet_model$bestTune
)

chapter6_tuning
## $PCR
##    ncomp
## 20    20
## 
## $PLS
##    ncomp
## 15    15
## 
## $Ridge
##   alpha     lambda
## 4     0 0.06131837
## 
## $Elastic_Net
##     alpha lambda
## 271   0.9  1e-04

I trained five Chapter 6 models: ordinary least squares (OLS), principal components regression (PCR), partial least squares (PLS), ridge regression, and elastic net. The optimal tuning values were:

PCR: ncomp = 20
PLS: ncomp = 15
Ridge: alpha = 0, lambda = 0.0613
Elastic Net: alpha = 0.9, lambda = 0.0001

The resampling results show that PCR performed best among the Chapter 6 models, with the lowest RMSE (0.598) and MAE and the highest R-squared (0.966). PLS was a very close second (RMSE 0.613). OLS (RMSE 1.138) and elastic net (RMSE 1.112) performed noticeably worse, and ridge regression was the weakest (RMSE 1.660). The elastic net selected the smallest lambda in its grid.

1d

set.seed(123)
svm_model <- train(
  protein ~ .,
  data = train_data,
  method = "svmRadial",
  trControl = ctrl,
  tuneLength = 10,
  preProcess = c("center", "scale")
)
set.seed(123)

nnet_grid <- expand.grid(
  size = c(1, 3, 5, 7),
  decay = c(0, 0.001, 0.01, 0.1)
)

nnet_model <- train(
  protein ~ .,
  data = train_data,
  method = "nnet",
  trControl = ctrl,
  tuneGrid = nnet_grid,
  preProcess = c("center", "scale"),
  linout = TRUE,
  trace = FALSE,
  maxit = 500,
  MaxNWts = 5000
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
set.seed(123)

nnet_pca_grid <- expand.grid(
  size = c(1, 3, 5, 7),
  decay = c(0, 0.001, 0.01, 0.1)
)

nnet_pca_model <- train(
  protein ~ .,
  data = train_data,
  method = "nnet",
  trControl = ctrl,
  tuneGrid = nnet_pca_grid,
  preProcess = c("center", "scale", "pca"),
  linout = TRUE,
  trace = FALSE,
  maxit = 500,
  MaxNWts = 5000
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
set.seed(123)
mars_model <- train(
  protein ~ .,
  data = train_data,
  method = "earth",
  trControl = ctrl,
  tuneLength = 10,
  preProcess = c("center", "scale")
)
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
set.seed(123)
knn_model <- train(
  protein ~ .,
  data = train_data,
  method = "knn",
  trControl = ctrl,
  tuneLength = 10,
  preProcess = c("center", "scale")
)
svm_model$bestTune
##       sigma  C
## 7 0.1923092 16
nnet_model$bestTune
##   size decay
## 2    1 0.001
nnet_pca_model$bestTune
##   size decay
## 3    1  0.01
mars_model$bestTune
##   nprune degree
## 6     15      1
knn_model$bestTune
##   k
## 2 7
chapter7_results <- data.frame(
  Model = c("SVM", "Neural Net", "Neural Net + PCA", "MARS", "KNN"),
  RMSE = c(
    min(svm_model$results$RMSE),
    min(nnet_model$results$RMSE),
    min(nnet_pca_model$results$RMSE),
    min(mars_model$results$RMSE),
    min(knn_model$results$RMSE)
  ),
  Rsquared = c(
    max(svm_model$results$Rsquared),
    max(nnet_model$results$Rsquared),
    max(nnet_pca_model$results$Rsquared),
    max(mars_model$results$Rsquared),
    max(knn_model$results$Rsquared)
  ),
  MAE = c(
    min(svm_model$results$MAE),
    min(nnet_model$results$MAE),
    min(nnet_pca_model$results$MAE),
    min(mars_model$results$MAE),
    min(knn_model$results$MAE)
  )
)

chapter7_results
##              Model      RMSE  Rsquared       MAE
## 1              SVM 1.8301993 0.6578423 1.3213403
## 2       Neural Net 0.5645032 0.9692766 0.4395879
## 3 Neural Net + PCA 2.7817828 0.2128713 2.2843683
## 4             MARS 0.9798572 0.9068899 0.7795716
## 5              KNN 2.4128713 0.4234022 2.0463251
problem1_all_results <- rbind(chapter6_results, chapter7_results)
problem1_all_results
##               Model      RMSE  Rsquared       MAE
## 1               OLS 1.1384127 0.8615363 0.7803177
## 2               PCR 0.5981827 0.9658147 0.4607085
## 3               PLS 0.6127603 0.9644614 0.4756950
## 4             Ridge 1.6596581 0.7158218 1.2919543
## 5       Elastic Net 1.1118457 0.8842184 0.9048338
## 6               SVM 1.8301993 0.6578423 1.3213403
## 7        Neural Net 0.5645032 0.9692766 0.4395879
## 8  Neural Net + PCA 2.7817828 0.2128713 2.2843683
## 9              MARS 0.9798572 0.9068899 0.7795716
## 10              KNN 2.4128713 0.4234022 2.0463251

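I trained five nonlinear model versions from Chapter 7: SVM, neural network, neural network with PCA, MARS, and KNN. The optimal tuning values were:

SVM: sigma = 0.1923, C = 16
Neural Net: size = 1, decay = 0.001
Neural Net + PCA: size = 1, decay = 0.01
MARS: nprune = 15, degree = 1
KNN: k = 7

Among the nonlinear models, the neural network without PCA performed best, with the lowest RMSE (0.5645), the lowest MAE (0.4396), and the highest R-squared (0.9693). MARS performed reasonably well (RMSE = 0.9799) but not as well as the neural network. SVM and KNN performed much worse. Both neural network fits produced a warning that some resampled performance measures were missing, so their summaries should be read with some caution.

PCA preprocessing did not help the neural network: RMSE rose from 0.5645 to 2.7818 and R-squared fell from 0.9693 to 0.2129. This suggests that the PCA step discarded information that is useful for predicting protein. That is plausible here because caret's "pca" preprocessing by default keeps only enough components to explain 95% of the predictor variance, which for highly correlated absorbance spectra can be very few, whereas PCR in Part (c) needed 20 components to perform well.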
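To illustrate why a variance-based PCA cutoff can keep very few components when predictors are highly correlated, here is a small sketch on simulated data. The data are made up and only mimic the strong column-to-column correlation of spectra. The 95% cutoff is the default I understand caret's preProcess to use for "pca", and this is not a rerun of the Tecator analysis.

```r
# Simulated stand-in for highly correlated spectra (not the Tecator data)
set.seed(1)
n <- 50
baseline <- rnorm(n)

# 20 columns that are scaled copies of one baseline signal plus small noise
X <- outer(baseline, seq(1, 2, length.out = 20)) +
  matrix(rnorm(n * 20, sd = 0.01), nrow = n)

pca <- prcomp(X, center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Number of components needed to reach 95% of the predictor variance
n_keep <- which(cum_var >= 0.95)[1]
n_keep
```

In this simulation a single component already passes the 95% threshold, so everything in the remaining directions would be dropped. If the response depends on those low-variance directions, a model fit to the retained components has little to work with.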

1e

problem1_all_results <- rbind(chapter6_results, chapter7_results)
problem1_all_results_sorted <- problem1_all_results[order(problem1_all_results$RMSE), ]
problem1_all_results_sorted
##               Model      RMSE  Rsquared       MAE
## 7        Neural Net 0.5645032 0.9692766 0.4395879
## 2               PCR 0.5981827 0.9658147 0.4607085
## 3               PLS 0.6127603 0.9644614 0.4756950
## 9              MARS 0.9798572 0.9068899 0.7795716
## 5       Elastic Net 1.1118457 0.8842184 0.9048338
## 1               OLS 1.1384127 0.8615363 0.7803177
## 4             Ridge 1.6596581 0.7158218 1.2919543
## 6               SVM 1.8301993 0.6578423 1.3213403
## 10              KNN 2.4128713 0.4234022 2.0463251
## 8  Neural Net + PCA 2.7817828 0.2128713 2.2843683

Comparing all models from Parts (c) and (d), the neural network without PCA had the best overall predictive ability. It achieved the lowest RMSE (0.5645), the highest R-squared (0.9693), and the lowest MAE (0.4396). PCR and PLS were the most competitive alternatives and also performed very well. MARS had acceptable performance but was clearly worse than the top three models. The weakest models were neural network with PCA, KNN, and SVM, based on their much larger RMSE values and lower R-squared values. Ridge regression also performed relatively poorly. Overall, the neural network without PCA appeared to be the best model for predicting protein percentage in the Tecator data.

2a

library(AppliedPredictiveModeling)
data(permeability)

str(fingerprints)
##  num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...
str(permeability)
##  num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr "permeability"
dim(fingerprints)
## [1]  165 1107
dim(permeability)
## [1] 165   1

The fingerprints matrix contains 165 compounds and 1,107 binary molecular predictors. The permeability object contains the permeability response for the same 165 compounds. Therefore, fingerprints is used as the predictor matrix and permeability is used as the response variable for modeling. This matches the data description in the assignment.

2b

library(caret)

nzv_cols <- nearZeroVar(fingerprints)
length(nzv_cols)
## [1] 719
fp_filtered <- fingerprints[, -nzv_cols]
dim(fp_filtered)
## [1] 165 388
ncol(fp_filtered)
## [1] 388

Because the fingerprint predictors are sparse, I used the nearZeroVar() function from the caret package to remove predictors with very low frequencies. A total of 719 predictors were removed, leaving 388 predictors for modeling. This directly addresses the assignment requirement to filter sparse predictors before modeling.
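As a rough illustration of what the filter checks, the sketch below reimplements the two criteria in base R: the ratio of the most common value's frequency to the second most common, and the percentage of unique values. The thresholds 95/5 and 10 are the defaults I understand nearZeroVar() to use. The function name, the exact inequalities, and the toy matrix are assumptions for illustration, so nearZeroVar() itself remains the reference.

```r
# Rough base-R sketch of a near-zero-variance screen (assumed thresholds: 95/5 and 10)
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) == 1) return(TRUE)            # constant column
  freq_ratio <- as.numeric(counts[1] / counts[2])  # most common vs. second most common
  pct_unique <- 100 * length(counts) / length(x)   # distinct values as % of samples
  freq_ratio > freq_cut && pct_unique <= unique_cut
}

# Toy binary fingerprints: 100 compounds, 3 predictors (made-up data)
toy_fp <- cbind(
  sparse   = c(rep(0, 98), rep(1, 2)),  # 98:2 split, frequency ratio 49
  balanced = rep(c(0, 1), 50),          # 50:50 split, frequency ratio 1
  constant = rep(0, 100)                # a single value
)

flagged <- apply(toy_fp, 2, is_near_zero_var)
flagged
```

The sparse and constant columns are flagged and the balanced one is kept, which mirrors why so many rarely-set fingerprint bits were dropped.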

2c

set.seed(123)

train_index <- createDataPartition(permeability[,1], p = 0.8, list = FALSE)

x_train <- fp_filtered[train_index, ]
x_test  <- fp_filtered[-train_index, ]

y_train <- permeability[train_index, 1]
y_test  <- permeability[-train_index, 1]

ctrl <- trainControl(method = "cv", number = 10)

pls_grid <- expand.grid(ncomp = 1:20)

pls_model <- train(
  x = x_train,
  y = y_train,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneGrid = pls_grid,
  trControl = ctrl,
  metric = "Rsquared"
)

pls_model
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 121, 121, 118, 119, 119, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.31894  0.3442124  10.254018
##    2     11.78898  0.4830504   8.534741
##    3     11.98818  0.4792649   9.219285
##    4     12.04349  0.4923322   9.448926
##    5     11.79823  0.5193195   9.049121
##    6     11.53275  0.5335956   8.658301
##    7     11.64053  0.5229621   8.878265
##    8     11.86459  0.5144801   9.265252
##    9     11.98385  0.5188205   9.218594
##   10     12.55634  0.4808614   9.610747
##   11     12.69674  0.4758068   9.702325
##   12     13.01534  0.4538906   9.956623
##   13     13.12637  0.4367362   9.878017
##   14     13.44865  0.4140715  10.065088
##   15     13.60135  0.4034269  10.188150
##   16     13.79361  0.3943904  10.247160
##   17     14.00756  0.3845119  10.412776
##   18     14.18113  0.3711378  10.587027
##   19     14.25674  0.3703610  10.575726
##   20     14.33121  0.3723176  10.679764
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 6.
pls_model$bestTune
##   ncomp
## 6     6
max(pls_model$results$Rsquared)
## [1] 0.5335956

I split the filtered data into a training set and a test set, then pre-processed the predictors by centering and scaling them before fitting a partial least squares (PLS) model. I used 10-fold cross-validation to tune the number of latent variables. The optimal number of latent variables was 6, and the corresponding resampled estimate of R^2 was 0.5336. This means the tuned PLS model explained about 53.36% of the variation in permeability under cross-validation.

2d

pls_pred <- predict(pls_model, newdata = x_test)

test_r2 <- cor(pls_pred, y_test)^2
test_r2
## [1] 0.3244542

I used the tuned PLS model to predict permeability for the test set. The test set estimate of R^2 was 0.3245, which means the model explained about 32.45% of the variation in permeability for unseen test observations. This answers the test-set performance question separately from the resampled R^2 in part (c).
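The test R^2 here is the squared correlation between predictions and observations, which, as far as I know, is also how caret computes the resampled R^2, so the two are comparable. It can differ from the "1 - SSE/SST" definition, which also penalizes systematic bias. The sketch below contrasts the two on made-up numbers; the function names are illustrative.

```r
# Squared-correlation R^2 (what cor(pred, obs)^2 computes)
r2_cor <- function(pred, obs) cor(pred, obs)^2

# "Traditional" R^2: 1 - SSE / SST, which also penalizes biased predictions
r2_traditional <- function(pred, obs) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Made-up example: predictions perfectly correlated with obs but shifted by 5
obs  <- c(1, 3, 5, 7, 9)
pred <- obs + 5

r2_cor(pred, obs)
r2_traditional(pred, obs)
```

The shifted predictions get a perfect squared correlation but a negative traditional R^2, so a squared-correlation R^2 is best read alongside RMSE or MAE on the test set.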

2e

set.seed(123)

ctrl <- trainControl(method = "cv", number = 10)

# PCR
pcr_grid <- expand.grid(ncomp = 1:20)

pcr_model <- train(
  x = x_train,
  y = y_train,
  method = "pcr",
  preProcess = c("center", "scale"),
  tuneGrid = pcr_grid,
  trControl = ctrl,
  metric = "Rsquared"
)

pcr_best <- pcr_model$bestTune
pcr_resample_r2 <- max(pcr_model$results$Rsquared)
pcr_pred <- predict(pcr_model, newdata = x_test)
pcr_test_r2 <- cor(pcr_pred, y_test)^2

# Ridge
ridge_grid <- expand.grid(
  alpha = 0,
  lambda = seq(0.0001, 1, length = 50)
)

ridge_model <- train(
  x = x_train,
  y = y_train,
  method = "glmnet",
  preProcess = c("center", "scale"),
  tuneGrid = ridge_grid,
  trControl = ctrl,
  metric = "Rsquared"
)

ridge_best <- ridge_model$bestTune
ridge_resample_r2 <- max(ridge_model$results$Rsquared)
ridge_pred <- predict(ridge_model, newdata = x_test)
ridge_test_r2 <- cor(ridge_pred, y_test)^2

# Elastic Net
enet_grid <- expand.grid(
  alpha = seq(0.1, 1, by = 0.1),
  lambda = seq(0.0001, 1, length = 50)
)

enet_model <- train(
  x = x_train,
  y = y_train,
  method = "glmnet",
  preProcess = c("center", "scale"),
  tuneGrid = enet_grid,
  trControl = ctrl,
  metric = "Rsquared"
)

enet_best <- enet_model$bestTune
enet_resample_r2 <- max(enet_model$results$Rsquared)
enet_pred <- predict(enet_model, newdata = x_test)
enet_test_r2 <- cor(enet_pred, y_test)^2

# Summary
comparison <- data.frame(
  Model = c("PLS", "PCR", "Ridge", "Elastic Net"),
  Resampled_R2 = c(
    max(pls_model$results$Rsquared),
    pcr_resample_r2,
    ridge_resample_r2,
    enet_resample_r2
  ),
  Test_R2 = c(
    test_r2,
    pcr_test_r2,
    ridge_test_r2,
    enet_test_r2
  )
)

comparison
##         Model Resampled_R2   Test_R2
## 1         PLS    0.5335956 0.3244542
## 2         PCR    0.5555671 0.2922108
## 3       Ridge    0.5600557 0.3266984
## 4 Elastic Net    0.5289587 0.4032235
pcr_best
##    ncomp
## 11    11
ridge_best
##    alpha lambda
## 50     0      1
enet_best
##     alpha lambda
## 150   0.3      1

I also fit PCR, Ridge, and Elastic Net models to compare their predictive performance with the PLS model, as required in this part. The optimal tuning values were ncomp = 11 for PCR, alpha = 0 and lambda = 1 for Ridge, and alpha = 0.3 and lambda = 1 for Elastic Net. For both Ridge and Elastic Net, lambda = 1 is the largest value in the tuning grid, so the true optimum may lie beyond the range searched. Based on the resampled R^2, Ridge performed best (0.5601), slightly ahead of PCR (0.5556), PLS (0.5336), and Elastic Net (0.5290). Based on the test set R^2, however, Elastic Net performed best (0.4032), ahead of Ridge (0.3267), PLS (0.3245), and PCR (0.2922). Therefore, yes, some of the other models had better predictive performance than PLS, and Elastic Net gave the strongest test-set performance in this comparison. Every model's test R^2 was well below its resampled R^2, so the cross-validation estimates appear optimistic for this split.
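Because the selected lambda sits at the edge of the grid, one possible follow-up is to search a wider, log-spaced range. The sketch below only builds such a grid. The name wide_grid and the range 0.001 to 100 are assumptions chosen for illustration, and no results are implied.

```r
# Hypothetical wider grid: lambda on a log scale from 0.001 to 100
wide_grid <- expand.grid(
  alpha  = seq(0, 1, by = 0.1),
  lambda = 10^seq(-3, 2, length.out = 30)
)

range(wide_grid$lambda)
nrow(wide_grid)
```

This grid could be passed as tuneGrid to the same train(..., method = "glmnet") call. If the selected lambda then falls in the interior of the range, the boundary concern is resolved.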

2f

I would not recommend fully replacing the permeability laboratory experiment with these models. Although some models, especially Elastic Net, showed useful predictive ability, the test set R^2 values were still only moderate, with the best result being about 0.4032. This suggests that the models can explain part of the variation in permeability, but they are probably not reliable enough to completely replace laboratory measurement. A more reasonable recommendation is to use the model as a screening or prioritization tool to identify promising compounds before conducting the full laboratory experiment.

3a

set.seed(123)

ctrl <- trainControl(method = "cv", number = 10)

# SVM Radial
svm_grid <- expand.grid(
  sigma = c(0.001, 0.01, 0.1),
  C = c(0.25, 0.5, 1, 2)
)

svm_model <- train(
  x = x_train,
  y = y_train,
  method = "svmRadial",
  preProcess = c("center", "scale"),
  tuneGrid = svm_grid,
  trControl = ctrl,
  metric = "Rsquared"
)

svm_best <- svm_model$bestTune
svm_resample_r2 <- max(svm_model$results$Rsquared)
svm_pred <- predict(svm_model, newdata = x_test)
svm_test_r2 <- cor(svm_pred, y_test)^2

# Neural Network
nnet_grid <- expand.grid(
  size = c(1, 3, 5, 7),
  decay = c(0, 0.01, 0.1)
)

nnet_model <- train(
  x = x_train,
  y = y_train,
  method = "nnet",
  linout = TRUE,
  trace = FALSE,
  maxit = 500,
  preProcess = c("center", "scale"),
  tuneGrid = nnet_grid,
  trControl = ctrl,
  metric = "Rsquared"
)
## Warning: model fit failed for Fold01: size=3, decay=0.00 Error in nnet.default(x, y, w, ...) : too many (1171) weights
## Warning: model fit failed for Fold01: size=5, decay=0.00 Error in nnet.default(x, y, w, ...) : too many (1951) weights
## Warning: model fit failed for Fold01: size=7, decay=0.00 Error in nnet.default(x, y, w, ...) : too many (2731) weights
## Warning: model fit failed for Fold01: size=3, decay=0.01 Error in nnet.default(x, y, w, ...) : too many (1171) weights
## Warning: model fit failed for Fold01: size=5, decay=0.01 Error in nnet.default(x, y, w, ...) : too many (1951) weights
## Warning: model fit failed for Fold01: size=7, decay=0.01 Error in nnet.default(x, y, w, ...) : too many (2731) weights
## Warning: model fit failed for Fold01: size=3, decay=0.10 Error in nnet.default(x, y, w, ...) : too many (1171) weights
## Warning: model fit failed for Fold01: size=5, decay=0.10 Error in nnet.default(x, y, w, ...) : too many (1951) weights
## Warning: model fit failed for Fold01: size=7, decay=0.10 Error in nnet.default(x, y, w, ...) : too many (2731) weights
## (The same nine warnings repeat for Fold02 through Fold10.)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
## Warning in train.default(x = x_train, y = y_train, method = "nnet", linout =
## TRUE, : missing values found in aggregated results
nnet_best <- nnet_model$bestTune
nnet_resample_r2 <- max(nnet_model$results$Rsquared)
nnet_pred <- predict(nnet_model, newdata = x_test)
nnet_test_r2 <- cor(nnet_pred, y_test)^2

# MARS
mars_grid <- expand.grid(
  degree = 1:2,
  nprune = seq(2, 20, by = 2)
)

mars_model <- train(
  x = x_train,
  y = y_train,
  method = "earth",
  preProcess = c("center", "scale"),
  tuneGrid = mars_grid,
  trControl = ctrl,
  metric = "Rsquared"
)

mars_best <- mars_model$bestTune
mars_resample_r2 <- max(mars_model$results$Rsquared)
mars_pred <- predict(mars_model, newdata = x_test)
mars_test_r2 <- cor(mars_pred, y_test)^2

# KNN
knn_grid <- expand.grid(k = seq(3, 25, by = 2))

knn_model <- train(
  x = x_train,
  y = y_train,
  method = "knn",
  preProcess = c("center", "scale"),
  tuneGrid = knn_grid,
  trControl = ctrl,
  metric = "Rsquared"
)

knn_best <- knn_model$bestTune
knn_resample_r2 <- max(knn_model$results$Rsquared)
knn_pred <- predict(knn_model, newdata = x_test)
knn_test_r2 <- cor(knn_pred, y_test)^2

# Comparison table for nonlinear models
nonlinear_comparison <- data.frame(
  Model = c("SVM", "Neural Network", "MARS", "KNN"),
  Resampled_R2 = c(
    svm_resample_r2,
    nnet_resample_r2,
    mars_resample_r2,
    knn_resample_r2
  ),
  Test_R2 = c(
    svm_test_r2,
    nnet_test_r2,
    mars_test_r2,
    knn_test_r2
  )
)

nonlinear_comparison
##            Model Resampled_R2   Test_R2
## 1            SVM    0.5651874 0.3563262
## 2 Neural Network          NaN 0.1938560
## 3           MARS    0.4763233 0.4037800
## 4            KNN    0.5093900 0.2436855
svm_best
##   sigma C
## 4 0.001 2
nnet_best
##   size decay
## 2    1  0.01
mars_best
##    nprune degree
## 12      4      2
knn_best
##   k
## 4 9

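I trained several nonlinear regression models for the permeability data, including SVM, neural network, MARS, and KNN, and compared both resampling and test set performance as required. Among these models, SVM gave the best resampled performance with a resampled R^2 of 0.5652, while MARS gave the best test set performance with a test R^2 of 0.4038. The neural network results are incomplete rather than unstable: every fit with size = 3, 5, or 7 failed with a "too many weights" error (1171, 1951, and 2731 weights) because MaxNWts was not raised above nnet's default limit of 1000 as it was in Problem 1, so only the size = 1 networks were actually fit. The missing values for the failed sizes also explain the NaN, since max() was applied to the Rsquared column without na.rm = TRUE. With a test R^2 of 0.1939, the size = 1 network was not competitive with the other nonlinear models. Therefore, SVM was best under resampling, but MARS was best on the test set.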
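The weight counts in the nnet warnings can be reproduced by hand, which shows how large MaxNWts would need to be for the larger networks to fit. This is a small sketch; the helper name n_weights is illustrative, and the formula assumes a single hidden layer with one linear output and no skip-layer connections.

```r
# Number of weights in a single-hidden-layer network with one output:
# (p + 1) weights into each hidden unit, plus (size + 1) weights into the output
n_weights <- function(p, size) size * (p + 1) + size + 1

p <- 388                       # predictors left after the nearZeroVar filter
sizes <- c(1, 3, 5, 7)
needed <- sapply(sizes, n_weights, p = p)

data.frame(size = sizes, weights = needed, fits_default = needed <= 1000)
```

Only size = 1 stays under the default limit of 1000. Passing a MaxNWts of at least max(needed) to train(), as was done with MaxNWts = 5000 in Problem 1, would let the other sizes fit. Using max(nnet_model$results$Rsquared, na.rm = TRUE) would avoid the NaN in the summary table.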

3b

Yes, but only marginally. In Problem 2, the best linear model on the test set was Elastic Net with a test R^2 of 0.4032. In Problem 3, the best nonlinear model on the test set was MARS with a test R^2 of 0.4038. A difference of 0.0006 on a test set of only 32 compounds is far too small to distinguish the two models. Under resampling the picture is the same: SVM (0.5652) and Ridge (0.5601) are nearly tied. This suggests that the underlying relationship between the molecular predictors and permeability is not strongly nonlinear, or that any nonlinear pattern is small relative to the noise in the data.
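To get a rough sense of how much uncertainty surrounds a test R^2 from so few compounds, the sketch below computes an approximate interval with the Fisher z-transform. It assumes R^2 is a squared (positive) correlation and that the usual large-sample approximation is acceptable at this sample size. The function name r2_interval is illustrative, and the interval is a back-of-the-envelope check rather than a formal model comparison.

```r
# Approximate interval for a test-set R^2 via the Fisher z-transform.
# Assumes R^2 = cor(pred, obs)^2 with a positive correlation, and n = test-set size.
r2_interval <- function(r2, n, level = 0.95) {
  r <- sqrt(r2)
  z <- atanh(r)
  half <- qnorm(1 - (1 - level) / 2) / sqrt(n - 3)
  lims <- tanh(c(z - half, z + half))
  # Square the limits; if the interval for r crosses zero, the lower R^2 bound is 0
  c(lower = if (lims[1] < 0) 0 else lims[1]^2, upper = lims[2]^2)
}

n_test <- 165 - 133   # 32 compounds held out in Part 2c
mars_ci <- r2_interval(0.4038, n_test)
enet_ci <- r2_interval(0.4032, n_test)
rbind(MARS = mars_ci, ElasticNet = enet_ci)
```

The two intervals are wide and overlap almost completely, which supports treating MARS and Elastic Net as indistinguishable on this test set.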

3c