6.1. Infrared (IR) spectroscopy technology is used to determine the chemical makeup of a substance. The theory of IR spectroscopy holds that unique molecular structures absorb IR frequencies differently. In practice a spectrometer fires a series of IR frequencies into a sample material, and the device measures the absorbance of the sample at each individual frequency. This series of measurements creates a spectrum profile which can then be used to determine the chemical makeup of the sample material. A Tecator Infratec Food and Feed Analyzer instrument was used to analyze 215 samples of meat across 100 frequencies. A sample of these frequency profiles is displayed in Fig. 6.20. In addition to an IR profile, analytical chemistry determined the percent content of water, fat, and protein for each sample. If we can establish a predictive relationship between IR spectrum and fat content, then food scientists could predict a sample’s fat content with IR instead of using analytical chemistry. This would provide costs savings, since analytical chemistry is a more expensive, time-consuming process:
library(caret) data(tecator) # use ?tecator to see more details
The matrix absorp contains the 100 absorbance values for the 215 samples, while matrix endpoints contains the percent of moisture, fat, and protein in columns 1–3, respectively.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
data(tecator)
?tecator
x <- absorp
pca <- prcomp(x, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(var_explained)[1:10]
## [1] 0.9862619 0.9959590 0.9987522 0.9998965 0.9999611 0.9999874 0.9999946
## [8] 0.9999984 0.9999991 0.9999996
PCA was applied to the 100 absorbance predictors to assess the dimensionality of the Tecator data. Because the frequency measurements are highly correlated, most of the variation is captured by only a few principal components. The first principal component explains approximately 98.6% of the variance, and the first two components explain about 99.6%. Therefore, the effective dimension of the data is approximately 2.
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.3
x <- as.data.frame(absorp)
y <- endpoints[,2]
set.seed(123)
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
x_train <- x[trainIndex, ]
x_test <- x[-trainIndex, ]
y_train <- y[trainIndex]
y_test <- y[-trainIndex]
ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 5
)
#Linear Regression Model
lm_fit <- train(
x_train, y_train,
method = "lm",
preProcess = c("center","scale"),
trControl = ctrl
)
#PCR
pcr_fit <- train(
x_train, y_train,
method = "pcr",
preProcess = c("center","scale"),
tuneLength = 20,
trControl = ctrl
)
pcr_fit$bestTune
## ncomp
## 18 18
#PLS
pls_fit <- train(
x_train, y_train,
method = "pls",
preProcess = c("center","scale"),
tuneLength = 20,
trControl = ctrl
)
pls_fit$bestTune
## ncomp
## 14 14
#Penalized Regression Model
grid <- expand.grid(
alpha = c(0, 0.5, 1),
lambda = 10^seq(-4, 1, length = 50)
)
glmnet_fit <- train(
x_train, y_train,
method = "glmnet",
preProcess = c("center","scale"),
tuneGrid = grid,
trControl = ctrl
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
glmnet_fit$bestTune
## alpha lambda
## 63 0.5 0.001676833
The optimal tuning parameters for each model were: Linear Regression: no tuning parameters PCR: optimal number of components = 18 PLS: optimal number of components = 14 Penalized Regression (glmnet): alpha = 0.5 (Elastic Net) lambda = 0.001676833
# Predictions
lm_pred <- predict(lm_fit, x_test)
pcr_pred <- predict(pcr_fit, x_test)
pls_pred <- predict(pls_fit, x_test)
glmnet_pred <- predict(glmnet_fit, x_test)
# Performance metrics
lm_perf <- postResample(lm_pred, y_test)
pcr_perf <- postResample(pcr_pred, y_test)
pls_perf <- postResample(pls_pred, y_test)
glmnet_perf <- postResample(glmnet_pred, y_test)
# Combine results
results <- rbind(
LM = lm_perf,
PCR = pcr_perf,
PLS = pls_perf,
GLMNET = glmnet_perf
)
results
## RMSE Rsquared MAE
## LM 1.747330 0.9779247 1.318518
## PCR 2.354502 0.9698380 2.034114
## PLS 2.079370 0.9760032 1.700468
## GLMNET 4.057059 0.9203825 3.656999
The linear regression model achieved the best performance, with the lowest RMSE (1.75) and the highest R2 (0.978), indicating superior predictive accuracy.The penalized regression model (elastic net) performed the worst, with a significantly higher RMSE (4.06) and lower R2 (0.92), indicating substantially poorer predictive ability.
Based on the test set performance, I would choose the linear regression model to predict fat content. This model achieved the lowest RMSE (1.75) and the highest R2 (0.978), indicating the most accurate predictions among all models considered
6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
library(AppliedPredictiveModeling) data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
library(AppliedPredictiveModeling)
data(permeability)
# predictors
x <- as.data.frame(fingerprints)
# identify near-zero variance predictors
nzv <- nearZeroVar(x)
x_filtered <- x[, -nzv]
# number of predictors left
ncol(x_filtered)
## [1] 388
# remove near-zero variance predictors
nzv <- nearZeroVar(fingerprints)
x <- fingerprints[, -nzv]
y <- permeability
# train/test split
set.seed(123)
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
x_train <- x[trainIndex, ]
x_test <- x[-trainIndex, ]
y_train <- y[trainIndex]
y_test <- y[-trainIndex]
# cross-validation
ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 5
)
# PLS model
set.seed(123)
pls_fit <- train(
x = x_train,
y = y_train,
method = "pls",
preProcess = c("center","scale"),
tuneLength = 20,
trControl = ctrl
)
pls_fit$bestTune
## ncomp
## 7 7
# resampled R^2
max(pls_fit$results$Rsquared)
## [1] 0.5190079
7 latent values are optimal, corresponding resampled estimate of R2 = 0.5190079
# predictions on test set
pls_pred <- predict(pls_fit, x_test)
postResample(pls_pred, y_test)
## RMSE Rsquared MAE
## 11.9744490 0.3618213 8.3021291
The test set estimate of R2 is approximately 0.362.
# Linear Regression
lm_fit <- train(
x_train, y_train,
method = "lm",
preProcess = c("center","scale"),
trControl = ctrl
)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning: predictions failed for Fold10.Rep1: intercept=TRUE Error in qr.default(tR) : NA/NaN/Inf in foreign function call (arg 1)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning: predictions failed for Fold05.Rep4: intercept=TRUE Error in qr.default(tR) : NA/NaN/Inf in foreign function call (arg 1)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning: predictions failed for Fold03.Rep5: intercept=TRUE Error in qr.default(tR) : NA/NaN/Inf in foreign function call (arg 1)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
lm_pred <- predict(lm_fit, x_test)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
lm_perf <- postResample(lm_pred, y_test)
# PCR
pcr_fit <- train(
x_train, y_train,
method = "pcr",
preProcess = c("center","scale"),
tuneLength = 20,
trControl = ctrl
)
pcr_pred <- predict(pcr_fit, x_test)
pcr_perf <- postResample(pcr_pred, y_test)
#Penalized Regression Model
glmnet_fit <- train(
x_train, y_train,
method = "glmnet",
preProcess = c("center","scale"),
trControl = ctrl,
tuneLength = 10
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
glmnet_pred <- predict(glmnet_fit, x_test)
glmnet_perf <- postResample(glmnet_pred, y_test)
# Results
results <- rbind(
LM = lm_perf,
PCR = pcr_perf,
PLS = postResample(pls_pred, y_test),
GLMNET = glmnet_perf
)
results
## RMSE Rsquared MAE
## LM 29.90647 0.07853504 19.332111
## PCR 12.20768 0.29221078 8.225112
## PLS 11.97445 0.36182132 8.302129
## GLMNET 10.98607 0.38732280 7.125565
The Penalized Regression Model achieved the best predictive performance, with the lowest RMSE (10.99) and the highest R2 (0.387)
No, I would not recommend replacing the laboratory experiment. Although the glmnet model performed best, its test set R2 of about 0.387 indicates only modest predictive ability.