Exercise 3: Tecator IR Meat Sample Analysis

Part (a) – Load Data and Describe Structure

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
data(tecator)
?tecator
## starting httpd help server ...
##  done
str(absorp)
##  num [1:215, 1:100] 2.62 2.83 2.58 2.82 2.79 ...
str(endpoints)
##  num [1:215, 1:3] 60.5 46 71 72.8 58.3 44 44 69.3 61.4 61.4 ...
moisture <- endpoints[,1]
fat <- endpoints[,2]
protein <- endpoints[,3]

Analysis: The tecator dataset consists of 100 absorbance values from 250 observations, used to predict fat, protein, and moisture content.


Part (b) – PCA: Effective Dimensionality

pca_result <- prcomp(absorp, center = TRUE, scale. = TRUE)
explained <- cumsum(pca_result$sdev^2) / sum(pca_result$sdev^2)
which(explained >= 0.99)[1]
## [1] 2
which(explained >= 0.999)[1]
## [1] 4
plot(explained, type = "b", 
     xlab = "PC", 
     ylab = "Variance",
     main = "PCA on Absorbance Data")

Answer: Out of 100 original frequency measurements, PCA reveals that the first 2 PC’s capture 99% of the variance in the absorbance data, and 4 components explain 99.9%. The effective dimension of the data is approximately 2-4, reflecting the high degree of collinearity among adjacent IR frequencies.


Part (c) – Model Building

#Moisture
set.seed(123)
trainIndex <- createDataPartition(moisture, p = 0.75, list = FALSE)

x_train <- absorp[trainIndex, ]
x_test <- absorp[-trainIndex, ]
y_train <- moisture[trainIndex]
y_test <- moisture[-trainIndex]

colnames(x_train) <- paste0("V", 1:100)
colnames(x_test)  <- paste0("V", 1:100)

preProc <- preProcess(x_train, method = c("center", "scale", "pca"), thresh = 0.99)
x_train_pp <- predict(preProc, x_train)
x_test_pp <- predict(preProc, x_test)

train_df <- data.frame(moisture = y_train, x_train_pp)
test_df <- data.frame(moisture = y_test, x_test_pp)

ctrl <- trainControl(method = "cv", number = 10, seed = set.seed(123))

Model 1 - OLS on PCA Scores

set.seed(123)
ols_model <- train(moisture ~ ., data = train_df,
                   method = "lm", trControl = ctrl)

Answer: OLS has no tuning parameters to optimize. With an RMSE of 9.06, OLS assumes a perfect linear relationship and cannot handles any complexity or noise in the data.

Model 2 - PCR

set.seed(123)
pcr_model <- train(x = x_train, y = y_train,
                   method = "pcr",
                   tuneGrid = data.frame(ncomp = 1:20),
                   trControl = ctrl,
                   preProcess = c("center", "scale"))
pcr_model$bestTune
##    ncomp
## 20    20

Answer: PCR’s optimal value is 20. This is high considering 99% of variance was captured in 2 PC’s.

Model 3 - PLS

set.seed(123)
pls_model <- train(x = x_train, y = y_train,
                   method = "pls",
                   tuneGrid = data.frame(ncomp = 1:20),
                   trControl = ctrl,
                   preProcess = c("center","scale"))
pls_model$bestTune
##    ncomp
## 17    17

Answer: PLS’s optimal value is 17, extracting the moisture relevant signal in fewer steps than all other models.

Model 4 - Ridge Regression

set.seed(123)
ridge_model <- train(x = x_train, y = y_train,
                     method = "ridge",
                     tuneGrid = data.frame(lambda = 10^seq(-4, 2, length = 50)),
                     trControl = ctrl,
                     preProcess = c("center","scale"))
ridge_model$bestTune
##   lambda
## 1  1e-04

Answer: Optimal value is 0.0001. The model applies almost no shrinkage.


Part (d) – Comparing Predictve Perfrormance

models <- list(OLS = ols_model, PCR = pcr_model,
               PLS = pls_model, Ridge = ridge_model)

results <- sapply(models, function(m) {
  preds <- predict(m, newdata = if(m$method == "lm") test_df else x_test)
  RMSE(preds, y_test)
})
print(sort(results))
##      PLS    Ridge      PCR      OLS 
## 2.369104 2.586292 2.659928 9.056101
resamps <- resamples(models)
summary(resamps)
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: OLS, PCR, PLS, Ridge 
## Number of resamples: 10 
## 
## MAE 
##           Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## OLS   5.446855 5.658757 6.495583 6.512544 7.142210 7.886826    0
## PCR   1.066259 1.368132 1.553602 1.511833 1.661615 1.833062    0
## PLS   1.023076 1.228533 1.405257 1.360726 1.427443 1.663493    0
## Ridge 1.766060 1.963904 2.037103 2.100236 2.271645 2.561486    0
## 
## RMSE 
##           Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## OLS   6.428505 6.954599 8.645614 8.247932 9.430920 9.798508    0
## PCR   1.308971 1.809206 2.037026 2.088590 2.322411 2.974439    0
## PLS   1.368628 1.612490 1.795787 1.795749 1.852478 2.508862    0
## Ridge 1.921949 2.335287 2.549805 2.577315 2.953836 3.052742    0
## 
## Rsquared 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## OLS   0.1078058 0.2265206 0.3901654 0.3549907 0.4409990 0.5540652    0
## PCR   0.9378243 0.9539646 0.9708617 0.9648029 0.9757082 0.9830365    0
## PLS   0.9469358 0.9715995 0.9745968 0.9724449 0.9783105 0.9868762    0
## Ridge 0.9227000 0.9507495 0.9543645 0.9525356 0.9598810 0.9656493    0
dotplot(resamps)

Answer: PLS achieves the best predictive performance with a test RMSE of 2.369 and cross-validated R^2 of 0.972. PCR is comparable and not significantly different than PLS. Ridge performs worse than PLS and PCR but far better than OLS, which has a test RMSE of 9.056 and a mean R^2 of only 0.355. OLS cannot handle severe mutlicollinearity in these data set.