Applied Predictive Modeling - Chapter 6 Exercises: 6.1, 6.2

6.1. Infrared (IR) spectroscopy technology is used to determine the chemical makeup of a substance. The theory of IR spectroscopy holds that unique molecular structures absorb IR frequencies differently. In practice a spectrometer fires a series of IR frequencies into a sample material, and the device measures the absorbance of the sample at each individual frequency. This series of measurements creates a spectrum profile which can then be used to determine the chemical makeup of the sample material. A Tecator Infratec Food and Feed Analyzer instrument was used to analyze 215 samples of meat across 100 frequencies. A sample of these frequency profiles is displayed in Fig. 6.20. In addition to an IR profile, analytical chemistry determined the percent content of water, fat, and protein for each sample. If we can establish a predictive relationship between IR spectrum and fat content, then food scientists could predict a sample’s fat content with IR instead of using analytical chemistry. This would provide costs savings, since analytical chemistry is a more expensive, time-consuming process:

  1. Start R and use these commands to load the data:

library(caret) data(tecator) # use ?tecator to see more details

The matrix absorp contains the 100 absorbance values for the 215 samples, while matrix endpoints contains the percent of moisture, fat, and protein in columns 1–3, respectively.

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
data(tecator)
?tecator
  1. In this example the predictors are the measurements at the individual frequencies. Because the frequencies lie in a systematic order (850–1,050 nm), the predictors have a high degree of correlation. Hence, the data lie in a smaller dimension than the total number of predictors (215). Use PCA to determine the effective dimension of these data. What is the effective dimension?
x <- absorp
pca <- prcomp(x, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(var_explained)[1:10]
##  [1] 0.9862619 0.9959590 0.9987522 0.9998965 0.9999611 0.9999874 0.9999946
##  [8] 0.9999984 0.9999991 0.9999996

PCA was applied to the 100 absorbance predictors to assess the dimensionality of the Tecator data. Because the frequency measurements are highly correlated, most of the variation is captured by only a few principal components. The first principal component explains approximately 98.6% of the variance, and the first two components explain about 99.6%. Therefore, the effective dimension of the data is approximately 2.

  1. Split the data into a training and a test set, pre-process the data, and build each variety of models described in this chapter. For those models with tuning parameters, what are the optimal values of the tuning parameter(s)?
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.3
x <- as.data.frame(absorp)
y <- endpoints[,2]

set.seed(123)

trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)

x_train <- x[trainIndex, ]
x_test  <- x[-trainIndex, ]
y_train <- y[trainIndex]
y_test  <- y[-trainIndex]

ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5
)

#Linear Regression Model
lm_fit <- train(
  x_train, y_train,
  method = "lm",
  preProcess = c("center","scale"),
  trControl = ctrl
)

#PCR
pcr_fit <- train(
  x_train, y_train,
  method = "pcr",
  preProcess = c("center","scale"),
  tuneLength = 20,
  trControl = ctrl
)

pcr_fit$bestTune
##    ncomp
## 18    18
#PLS
pls_fit <- train(
  x_train, y_train,
  method = "pls",
  preProcess = c("center","scale"),
  tuneLength = 20,
  trControl = ctrl
)

pls_fit$bestTune
##    ncomp
## 14    14
#Penalized Regression Model
grid <- expand.grid(
  alpha = c(0, 0.5, 1),
  lambda = 10^seq(-4, 1, length = 50)
)

glmnet_fit <- train(
  x_train, y_train,
  method = "glmnet",
  preProcess = c("center","scale"),
  tuneGrid = grid,
  trControl = ctrl
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
glmnet_fit$bestTune
##    alpha      lambda
## 63   0.5 0.001676833

The optimal tuning parameters for each model were: Linear Regression: no tuning parameters PCR: optimal number of components = 18 PLS: optimal number of components = 14 Penalized Regression (glmnet): alpha = 0.5 (Elastic Net) lambda = 0.001676833

  1. Which model has the best predictive ability? Is any model significantly better or worse than the others?
# Predictions
lm_pred     <- predict(lm_fit, x_test)
pcr_pred    <- predict(pcr_fit, x_test)
pls_pred    <- predict(pls_fit, x_test)
glmnet_pred <- predict(glmnet_fit, x_test)

# Performance metrics
lm_perf     <- postResample(lm_pred, y_test)
pcr_perf    <- postResample(pcr_pred, y_test)
pls_perf    <- postResample(pls_pred, y_test)
glmnet_perf <- postResample(glmnet_pred, y_test)

# Combine results 
results <- rbind(
  LM = lm_perf,
  PCR = pcr_perf,
  PLS = pls_perf,
  GLMNET = glmnet_perf
)
results
##            RMSE  Rsquared      MAE
## LM     1.747330 0.9779247 1.318518
## PCR    2.354502 0.9698380 2.034114
## PLS    2.079370 0.9760032 1.700468
## GLMNET 4.057059 0.9203825 3.656999

The linear regression model achieved the best performance, with the lowest RMSE (1.75) and the highest R2 (0.978), indicating superior predictive accuracy.The penalized regression model (elastic net) performed the worst, with a significantly higher RMSE (4.06) and lower R2 (0.92), indicating substantially poorer predictive ability.

  1. Explain which model you would use for predicting the fat content of a sample.

Based on the test set performance, I would choose the linear regression model to predict fat content. This model achieved the lowest RMSE (1.75) and the highest R2 (0.978), indicating the most accurate predictions among all models considered

6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  1. Start R and use these commands to load the data:

library(AppliedPredictiveModeling) data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

library(AppliedPredictiveModeling)
data(permeability)
  1. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
# predictors
x <- as.data.frame(fingerprints)

# identify near-zero variance predictors
nzv <- nearZeroVar(x)
x_filtered <- x[, -nzv]

# number of predictors left
ncol(x_filtered)
## [1] 388
  1. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
# remove near-zero variance predictors
nzv <- nearZeroVar(fingerprints)
x <- fingerprints[, -nzv]
y <- permeability

# train/test split
set.seed(123)
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)

x_train <- x[trainIndex, ]
x_test  <- x[-trainIndex, ]
y_train <- y[trainIndex]
y_test  <- y[-trainIndex]

# cross-validation
ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5
)

# PLS model
set.seed(123)
pls_fit <- train(
  x = x_train,
  y = y_train,
  method = "pls",
  preProcess = c("center","scale"),
  tuneLength = 20,
  trControl = ctrl
)

pls_fit$bestTune
##   ncomp
## 7     7
# resampled R^2
max(pls_fit$results$Rsquared)
## [1] 0.5190079

7 latent values are optimal, corresponding resampled estimate of R2 = 0.5190079

  1. Predict the response for the test set. What is the test set estimate of R2?
# predictions on test set
pls_pred <- predict(pls_fit, x_test)
postResample(pls_pred, y_test)
##       RMSE   Rsquared        MAE 
## 11.9744490  0.3618213  8.3021291

The test set estimate of R2 is approximately 0.362.

  1. Try building other models discussed in this chapter. Do any have better predictive performance?
# Linear Regression
lm_fit <- train(
  x_train, y_train,
  method = "lm",
  preProcess = c("center","scale"),
  trControl = ctrl
)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning: predictions failed for Fold10.Rep1: intercept=TRUE Error in qr.default(tR) : NA/NaN/Inf in foreign function call (arg 1)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning: predictions failed for Fold05.Rep4: intercept=TRUE Error in qr.default(tR) : NA/NaN/Inf in foreign function call (arg 1)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning: predictions failed for Fold03.Rep5: intercept=TRUE Error in qr.default(tR) : NA/NaN/Inf in foreign function call (arg 1)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
lm_pred <- predict(lm_fit, x_test)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
lm_perf <- postResample(lm_pred, y_test)

# PCR
pcr_fit <- train(
  x_train, y_train,
  method = "pcr",
  preProcess = c("center","scale"),
  tuneLength = 20,
  trControl = ctrl
)

pcr_pred <- predict(pcr_fit, x_test)
pcr_perf <- postResample(pcr_pred, y_test)

#Penalized Regression Model
glmnet_fit <- train(
  x_train, y_train,
  method = "glmnet",
  preProcess = c("center","scale"),
  trControl = ctrl,
  tuneLength = 10
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
glmnet_pred <- predict(glmnet_fit, x_test)
glmnet_perf <- postResample(glmnet_pred, y_test)

# Results
results <- rbind(
  LM = lm_perf,
  PCR = pcr_perf,
  PLS = postResample(pls_pred, y_test),
  GLMNET = glmnet_perf
)

results
##            RMSE   Rsquared       MAE
## LM     29.90647 0.07853504 19.332111
## PCR    12.20768 0.29221078  8.225112
## PLS    11.97445 0.36182132  8.302129
## GLMNET 10.98607 0.38732280  7.125565

The Penalized Regression Model achieved the best predictive performance, with the lowest RMSE (10.99) and the highest R2 (0.387)

  1. Would you recommend any of your models to replace the permeability laboratory experiment?

No, I would not recommend replacing the laboratory experiment. Although the glmnet model performed best, its test set R2 of about 0.387 indicates only modest predictive ability.