Infrared (IR) spectroscopy technology is used to determine the chemical makeup of a substance. The theory of IR spectroscopy holds that unique molecular structures absorb IR frequencies differently. In practice, a spectrometer fires a series of IR frequencies into a sample material, and the device measures the absorbance of the sample at each individual frequency. This series of measurements creates a spectrum profile which can then be used to determine the chemical makeup of the sample material. A Tecator Infratec Food and Feed Analyzer instrument was used to analyze 215 samples of meat across 100 frequencies. A sample of these frequency profiles is displayed in Fig. 6.20. In addition to an IR profile, analytical chemistry determined the percent content of water, fat, and protein for each sample. If we can establish a predictive relationship between IR spectrum and fat content, then food scientists could predict a sample’s fat content with IR instead of using analytical chemistry. This would provide cost savings, since analytical chemistry is a more expensive, time-consuming process.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
data(tecator)
str(absorp)
## num [1:215, 1:100] 2.62 2.83 2.58 2.82 2.79 ...
str(endpoints)
## num [1:215, 1:3] 60.5 46 71 72.8 58.3 44 44 69.3 61.4 61.4 ...
The matrix absorp contains the 100 absorbance values for the 215 samples, while the matrix endpoints contains the percent of moisture, fat, and protein in columns 1–3, respectively. To be more specific:
moisture <- endpoints[, 1]
fat <- endpoints[, 2]
protein <- endpoints[, 3]
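# Predictors: one column per measured frequency
X <- as.data.frame(absorp)
colnames(X) <- paste0("Freq_", seq_len(ncol(X)))
# PCA on centered and scaled absorbances
pca_mod <- prcomp(X, center = TRUE, scale. = TRUE)
# Cumulative proportion of variance explained
var_exp <- cumsum(pca_mod$sdev^2) / sum(pca_mod$sdev^2)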
# Smallest number of components whose cumulative variance reaches 95%
which(var_exp >= 0.95)[1]
## [1] 1
Because the 100 IR absorbance measurements are taken at closely spaced wavelengths, the predictors are highly correlated. PCA on the centered and scaled absorbance data was used to estimate their effective dimension. The first principal component alone explains at least 95% of the total variance, so the data lie close to a one-dimensional subspace, far below the original 100 predictors.
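The cumulative-variance calculation can be illustrated on simulated data. The sketch below is not the tecator data: it assumes a hypothetical matrix `X_sim` whose 100 columns are all driven by a single latent factor plus a little noise, which loosely mimics closely spaced wavelengths. All object names here are made up for the illustration.

```r
# Simulated stand-in for spectral data (an assumption, not the tecator set):
# 100 columns that are noisy multiples of one latent factor.
set.seed(1)
n <- 200
p <- 100
latent <- rnorm(n)
loadings <- seq(0.5, 1.5, length.out = p)
X_sim <- outer(latent, loadings) + matrix(rnorm(n * p, sd = 0.05), n, p)

pca_sim <- prcomp(X_sim, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the first k components
cum_var <- cumsum(pca_sim$sdev^2) / sum(pca_sim$sdev^2)

# Smallest k whose cumulative proportion reaches 95%
n_comp_95 <- which(cum_var >= 0.95)[1]
n_comp_95
```

Because `cum_var` never decreases, taking the first index that satisfies the threshold gives the smallest number of components; indexing any later position would overstate it.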
y <- moisture
set.seed(123)
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[trainIndex, ]
X_test <- X[-trainIndex, ]
y_train <- y[trainIndex]
y_test <- y[-trainIndex]
ctrl <- trainControl(method = "cv", number = 10)
pcr_mod <- train(
  X_train, y_train,
  method = "pcr",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = c("center", "scale")
)
pcr_mod$bestTune
## ncomp
## 17 17
pls_mod <- train(
  X_train, y_train,
  method = "pls",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = c("center", "scale")
)
pls_mod$bestTune
## ncomp
## 18 18
ridge_mod <- train(
  X_train, y_train,
  method = "ridge",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = c("center", "scale")
)
ridge_mod$bestTune
## lambda
## 2 1e-04
The data were split into an 80% training set and a 20% test set, with moisture percentage as the response variable. All predictors were centered and scaled prior to modeling. Three models were fit: principal component regression (PCR), partial least squares (PLS), and ridge regression, each tuned by 10-fold cross-validation. The optimal number of components was 17 for PCR and 18 for PLS, while ridge regression selected a small penalty value of λ = 0.0001, indicating mild regularization.
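To make the PCR step more concrete, here is a minimal base-R sketch of the idea: learn the centering, scaling, and rotation on the training rows, regress the response on the first k scores, then project the test rows with the training transformation. It runs on simulated data, fixes k = 2 instead of tuning it by cross-validation, and is not how caret implements `method = "pcr"` internally; every name in it is hypothetical.

```r
# Simulated data (an assumption): 20 predictors generated from 2 latent factors
set.seed(2)
n <- 120
p <- 20
latent <- matrix(rnorm(n * 2), n, 2)
W <- matrix(runif(2 * p, -1, 1), 2, p)
X_sim <- latent %*% W + matrix(rnorm(n * p, sd = 0.1), n, p)
y_sim <- 3 * latent[, 1] - 2 * latent[, 2] + rnorm(n, sd = 0.2)

# 80/20 split
train_id <- sample(n, size = round(0.8 * n))
X_tr <- X_sim[train_id, ]
X_te <- X_sim[-train_id, ]
y_tr <- y_sim[train_id]
y_te <- y_sim[-train_id]

# Centering, scaling, and rotation are estimated from the training rows only
pca_tr <- prcomp(X_tr, center = TRUE, scale. = TRUE)

# Regress the response on the first k principal component scores
k <- 2
train_df <- data.frame(y = y_tr, pca_tr$x[, 1:k, drop = FALSE])
fit <- lm(y ~ ., data = train_df)

# predict() on a prcomp object applies the training centering/scaling, then rotates
test_df <- data.frame(predict(pca_tr, newdata = X_te)[, 1:k, drop = FALSE])
pred_te <- predict(fit, newdata = test_df)

rmse_pcr <- sqrt(mean((y_te - pred_te)^2))
rmse_null <- sqrt(mean((y_te - mean(y_tr))^2))  # baseline: predict the training mean
c(PCR = rmse_pcr, Null = rmse_null)
```

Estimating the preprocessing on the training rows alone keeps test-set information out of the fitted model.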
models <- list(PCR = pcr_mod, PLS = pls_mod, Ridge = ridge_mod)
sapply(models, function(m) {
  RMSE(predict(m, X_test), y_test)
})
## PCR PLS Ridge
## 1.968903 1.490969 2.450888
Predictive performance was evaluated using test-set RMSE. PLS performed best, with an RMSE of approximately 1.49, followed by PCR (1.97) and ridge regression (2.45). On this split, the two component-based methods outperformed ridge regression on these highly correlated predictors.
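For reference, RMSE is the square root of the mean squared difference between predictions and observed values. The sketch below writes it out as a small helper; the function name `rmse` and the four toy values are made up for illustration and are not taken from the tecator results.

```r
# Root mean squared error: square root of the average squared prediction error
rmse <- function(pred, obs) {
  sqrt(mean((pred - obs)^2))
}

# Toy values (hypothetical): errors are 1, -1, 1, -2
obs <- c(60, 62, 65, 70)
pred <- c(61, 61, 66, 68)
rmse(pred, obs)  # sqrt((1 + 1 + 1 + 4) / 4) = sqrt(1.75), about 1.32
```

Since RMSE is on the same scale as the response, a value of 1.32 here would read as a typical error of about 1.3 percentage points.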
I would use Partial Least Squares (PLS) to predict the percentage of moisture in a meat sample. PLS produced the lowest test-set error and is well suited for highly correlated spectral data because it constructs components using both the predictors and the response. On this test set, that translated into more accurate predictions than PCR or ridge regression.
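The difference between how PCR and PLS choose their first component can be sketched in a few lines. The first PCA direction maximizes the variance of the scores, while the first PLS direction is proportional to the cross-product of the centered predictors and the response, which maximizes their covariance. The example below uses simulated, unscaled predictors (an assumption made so the contrast is visible) and hypothetical names; it covers only the first component, not a full PLS fit.

```r
# Simulated data (an assumption): column 1 has large variance but is unrelated
# to the response; the response depends on the low-variance column 2.
set.seed(3)
n <- 100
p <- 5
X_sim <- matrix(rnorm(n * p), n, p)
X_sim[, 1] <- 5 * X_sim[, 1]
y_sim <- 2 * X_sim[, 2] + rnorm(n, sd = 0.5)

Xc <- scale(X_sim, center = TRUE, scale = FALSE)
yc <- y_sim - mean(y_sim)

# First PCA direction: ignores the response entirely
w_pca <- prcomp(Xc, center = FALSE)$rotation[, 1]

# First PLS direction: proportional to t(Xc) %*% yc, normalized to unit length
w_pls <- drop(crossprod(Xc, yc))
w_pls <- w_pls / sqrt(sum(w_pls^2))

t_pca <- drop(Xc %*% w_pca)
t_pls <- drop(Xc %*% w_pls)

# Covariance of each first score with the response
c(PCA = abs(cov(t_pca, y_sim)), PLS = abs(cov(t_pls, y_sim)))
```

Among all unit-length directions, the PLS direction gives the largest absolute covariance with the response, so its value here can never fall below the PCA one.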