Exercise 3

Infrared (IR) spectroscopy technology is used to determine the chemical makeup of a substance. The theory of IR spectroscopy holds that unique molecular structures absorb IR frequencies differently. In practice a spectrometer fires a series of IR frequencies into a sample material, and the device measures the absorbance of the sample at each individual frequency. This series of measurements creates a spectrum profile which can then be used to determine the chemical makeup of the sample material. A Tecator Infratec Food and Feed Analyzer instrument was used to analyze 215 samples of meat across 100 frequencies. A sample of these frequency profiles is displayed in Fig. 6.20. In addition to an IR profile, analytical chemistry determined the percent content of water, fat, and protein for each sample. If we can establish a predictive relationship between IR spectrum and fat content, then food scientists could predict a sample’s fat content with IR instead of using analytical chemistry. This would provide cost savings, since analytical chemistry is a more expensive, time-consuming process.

library(caret)

## Warning: package 'caret' was built under R version 4.5.2

## Loading required package: ggplot2

## Loading required package: lattice

data(tecator)

str(absorp)

##  num [1:215, 1:100] 2.62 2.83 2.58 2.82 2.79 ...

str(endpoints)

##  num [1:215, 1:3] 60.5 46 71 72.8 58.3 44 44 69.3 61.4 61.4 ...

The matrix absorp contains the 100 absorbance values for the 215 samples, while the matrix endpoints contains the percent of moisture, fat, and protein in columns 1–3, respectively. To be more specific:

moisture <- endpoints[, 1]
fat      <- endpoints[, 2]
protein <- endpoints[, 3]

In this example the predictors are the measurements at the individual frequencies. Because the frequencies lie in a systematic order (850–1,050 nm), the predictors have a high degree of correlation. Hence, the data lie in a smaller dimension than the total number of predictors (100). Use PCA to determine the effective dimension of these data. What is the effective dimension?

X <- as.data.frame(absorp)
colnames(X) <- paste0("Freq_", 1:ncol(X))

pca_mod <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained
var_exp <- cumsum(pca_mod$sdev^2) / sum(pca_mod$sdev^2)
# First component at which the cumulative variance reaches 95%
which(var_exp >= 0.95)[1]

## [1] 1

Because the 100 IR absorbance measurements are taken at closely spaced wavelengths, the predictors are highly correlated. PCA was used to determine the effective dimension of the absorbance data. The cumulative variance explained reaches 95% at the first principal component, so by this criterion the effective dimension is 1, far lower than the original 100 predictors.
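As a self-contained illustration of the threshold rule, the sketch below simulates highly correlated curves (synthetic data, not the tecator spectra) and returns the first component at which the cumulative variance reaches a chosen threshold. The helper name effective_dim and the simulation settings are assumptions made for illustration only.

```r
# Hypothetical helper: smallest number of principal components whose
# cumulative share of variance reaches `threshold`.
effective_dim <- function(x, threshold = 0.95) {
  pca <- prcomp(x, center = TRUE, scale. = TRUE)
  cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  # which() returns every index meeting the condition; keep the first one
  which(cum_var >= threshold)[1]
}

set.seed(1)
n <- 50
p <- 20

# Synthetic "spectra": one shared latent factor plus a little noise,
# so every column is strongly correlated with every other column
latent   <- rnorm(n)
loadings <- seq(0.5, 1.5, length.out = p)
x_sim    <- outer(latent, loadings) + matrix(rnorm(n * p, sd = 0.05), n, p)

k_sim <- effective_dim(x_sim)
k_sim
```

Indexing the result of which() with [1] matters here: which() returns all components at or above the threshold, and only the first of them answers the question.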

Split the data into a training and a test set using the percentage of moisture as the response, pre-process the data, and build at least three of the models described in this chapter (e.g., ordinary least squares, PCR, PLS, Ridge, and ENET). For those models with tuning parameters, what are the optimal values of the tuning parameter(s)?

y <- moisture

set.seed(123)
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[trainIndex, ]
X_test  <- X[-trainIndex, ]
y_train <- y[trainIndex]
y_test  <- y[-trainIndex]

ctrl <- trainControl(method = "cv", number = 10)

pcr_mod <- train(
  X_train, y_train,
  method = "pcr",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = c("center", "scale")
)
pcr_mod$bestTune

##    ncomp
## 17    17

pls_mod <- train(
  X_train, y_train,
  method = "pls",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = c("center", "scale")
)
pls_mod$bestTune

##    ncomp
## 18    18

ridge_mod <- train(
  X_train, y_train,
  method = "ridge",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = c("center", "scale")
)
ridge_mod$bestTune

##   lambda
## 2  1e-04

The data were split into an 80% training set and a 20% test set, with moisture percentage as the response variable. All predictors were centered and scaled prior to modeling. Three models were fit: principal component regression (PCR), partial least squares (PLS), and ridge regression. The optimal number of components was 17 for PCR and 18 for PLS, while ridge regression selected a small penalty value of λ = 0.0001, indicating mild regularization.

Which model has the best predictive ability? Is any model significantly better or worse than the others?

models <- list(PCR = pcr_mod, PLS = pls_mod,
               Ridge = ridge_mod)

sapply(models, function(m) {
  RMSE(predict(m, X_test), y_test)
})

##      PCR      PLS    Ridge 
## 1.968903 1.490969 2.450888

Predictive performance was evaluated using test-set RMSE. PLS performed best, with an RMSE of approximately 1.49, followed by PCR (1.97) and ridge regression (2.45). On this split, the two component-based methods outperformed ridge regression. Because these figures come from a single held-out test set, they do not by themselves establish that any model is significantly better or worse than the others; that would require comparing the models across resamples.
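One way to address whether a difference is significant is to evaluate the models on identical resampling folds and test the paired fold-wise RMSE differences. The sketch below does this in base R on simulated data with two linear models; the data, the helper fold_rmse, and the two formulas are illustrative assumptions and are not the PCR, PLS, or ridge fits above. For caret train objects fit on shared folds, resamples() and diff() provide a comparable comparison.

```r
set.seed(42)
n <- 200

# Simulated data: y depends on x1 and x2; x3 is pure noise
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)

# Assign each row to one of 10 folds; both models reuse these folds
folds <- sample(rep(1:10, length.out = n))

# Hypothetical helper: held-out RMSE of a linear model for each fold.
# The response column is assumed to be named `y`.
fold_rmse <- function(formula, data, folds) {
  sapply(sort(unique(folds)), function(k) {
    fit  <- lm(formula, data = data[folds != k, ])
    pred <- predict(fit, newdata = data[folds == k, ])
    sqrt(mean((data$y[folds == k] - pred)^2))
  })
}

rmse_full    <- fold_rmse(y ~ x1 + x2 + x3, dat, folds)
rmse_reduced <- fold_rmse(y ~ x1, dat, folds)

# Paired t-test on the fold-wise RMSE differences
cmp <- t.test(rmse_full, rmse_reduced, paired = TRUE)
cmp$p.value
```

Pairing by fold removes fold-to-fold variation that both models share, so the test focuses on the difference between the models.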

Explain which model you would use for predicting the percentage of moisture of a sample.

I would use partial least squares (PLS) to predict the percentage of moisture in a meat sample. PLS produced the lowest test-set error and is well suited to highly correlated spectral data because it constructs components using both the predictors and the response. On this test set it gave more accurate predictions than PCR or ridge regression.

Jeneil Miller

2026-03-8