1 Introduction

This analysis explores regression techniques for high-dimensional data across three case studies: near-infrared spectroscopy analysis, pharmaceutical compound permeability prediction, and chemical manufacturing process optimization.

The assignment compares linear methods (Principal Component Regression, Partial Least Squares, Ridge Regression, Lasso, and Elastic Net) with nonlinear approaches (K-Nearest Neighbors and Support Vector Machines) to identify optimal modeling strategies for different data structures characterized by multicollinearity and high predictor-to-sample ratios.

Key Questions Addressed:

  1. How do dimension reduction methods (PCR, PLS) compare to regularization (Ridge, Lasso)?
  2. Can predictive models reduce expensive laboratory experimentation?
  3. Which process variables drive manufacturing yield?

2 Question 6.1a: Load the data

library(caret)       # modeling framework; also provides data(tecator)
library(pls)         # PCR / PLS back end
library(elasticnet)  # ridge / lasso / elastic net back end

data(tecator)        # loads absorp (spectra) and endpoints (moisture, fat, protein)

3 Question 6.1b PCA Analysis

# The absorb matrix contains the spectroscopy data
# Perform PCA
pca_result <- prcomp(absorp, center = TRUE, scale. = TRUE)

# Calculate the proportion of variance explained by each PC
variance_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)

# Calculate cumulative variance explained
cumulative_variance <- cumsum(variance_explained)

# Find effective dimension at different thresholds
dim_95 <- which(cumulative_variance >= 0.95)[1]
dim_99 <- which(cumulative_variance >= 0.99)[1]

cat("Effective dimension (95% variance):", dim_95, "\n")
## Effective dimension (95% variance): 1
cat("Effective dimension (99% variance):", dim_99, "\n")
## Effective dimension (99% variance): 2
# Visualize with a scree plot
par(mfrow = c(1, 2))

# Scree plot
plot(1:20, variance_explained[1:20], type = "b", 
     xlab = "Principal Component", 
     ylab = "Proportion of Variance Explained",
     main = "Scree Plot")

# Cumulative variance plot
plot(1:20, cumulative_variance[1:20], type = "b",
     xlab = "Number of Components",
     ylab = "Cumulative Variance Explained",
     main = "Cumulative Variance")
abline(h = 0.95, col = "red", lty = 2)
abline(h = 0.99, col = "blue", lty = 2)
legend("bottomright", legend = c("95%", "99%"), 
       col = c("red", "blue"), lty = 2)

3.1 Interpretation

The PCA analysis reveals substantial dimensionality-reduction potential in the spectroscopy data. The original dataset contains 100 wavelength measurements, but 95% of the variance can be captured by a single principal component (a 99% reduction in dimensionality), and 99% of the variance by just 2 components (a 98% reduction).

This substantial redundancy in the spectroscopy measurements makes dimension reduction techniques like PCR and PLS highly effective for this dataset. The scree plot shows a rapid decline in variance explained after the first few components, indicating that most information is concentrated in a small number of dimensions.
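The reported effective dimension depends on standardizing the wavelengths (scale. = TRUE). As a quick robustness sketch (not run above), the same calculation on the unscaled absorbances checks that the low dimension is not an artifact of standardization:

# Sketch: repeat PCA without scaling as a robustness check
pca_raw <- prcomp(absorp, center = TRUE, scale. = FALSE)
cum_raw <- cumsum(pca_raw$sdev^2 / sum(pca_raw$sdev^2))
which(cum_raw >= 0.99)[1]  # effective dimension at 99% variance, unscaled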

4 Question 6.1c Explore the Dataset

# Check dimensions
cat("Dimensions of absorp matrix:\n")
## Dimensions of absorp matrix:
print(dim(absorp))
## [1] 215 100
cat("This means", nrow(absorp), "samples with", ncol(absorp), "wavelength measurements\n\n")
## This means 215 samples with 100 wavelength measurements
cat("Dimensions of endpoints matrix:\n")
## Dimensions of endpoints matrix:
print(dim(endpoints))
## [1] 215   3
cat("This means", nrow(endpoints), "samples with", ncol(endpoints), "chemical composition measurements\n\n")
## This means 215 samples with 3 chemical composition measurements
# View first few rows
cat("First few rows of absorp (spectroscopy data):\n")
## First few rows of absorp (spectroscopy data):
# Show only the first 9 of 100 wavelengths; the remaining columns follow the
# same smooth absorbance pattern
print(head(absorp[, 1:9], 5))
##         [,1]    [,2]    [,3]    [,4]    [,5]    [,6]    [,7]    [,8]    [,9]
## [1,] 2.61776 2.61814 2.61859 2.61912 2.61981 2.62071 2.62186 2.62334 2.62511
## [2,] 2.83454 2.83871 2.84283 2.84705 2.85138 2.85587 2.86060 2.86566 2.87093
## [3,] 2.58284 2.58458 2.58629 2.58808 2.58996 2.59192 2.59401 2.59627 2.59873
## [4,] 2.82286 2.82460 2.82630 2.82814 2.83001 2.83192 2.83392 2.83606 2.83842
## [5,] 2.78813 2.78989 2.79167 2.79350 2.79538 2.79746 2.79984 2.80254 2.80553
cat("\n")
cat("First few rows of endpoints (moisture, fat, protein):\n")
## First few rows of endpoints (moisture, fat, protein):
print(head(endpoints, 5))
##      [,1] [,2] [,3]
## [1,] 60.5 22.5 16.7
## [2,] 46.0 40.1 13.5
## [3,] 71.0  8.4 20.5
## [4,] 72.8  5.9 20.7
## [5,] 58.3 25.5 15.5
cat("\n")
# Look at the endpoints more closely
cat("Summary statistics for endpoints:\n")
## Summary statistics for endpoints:
cat("(Columns are: moisture, fat, protein)\n")
## (Columns are: moisture, fat, protein)
print(summary(endpoints))
##        V1              V2              V3       
##  Min.   :39.30   Min.   : 0.90   Min.   :11.00  
##  1st Qu.:55.55   1st Qu.: 7.30   1st Qu.:15.35  
##  Median :65.70   Median :14.00   Median :18.70  
##  Mean   :63.20   Mean   :18.14   Mean   :17.68  
##  3rd Qu.:71.80   3rd Qu.:28.00   3rd Qu.:20.10  
##  Max.   :76.60   Max.   :49.10   Max.   :21.80
cat("\n")
# Additional useful information
cat("Range of values for each endpoint:\n")
## Range of values for each endpoint:
cat("Moisture: ", range(endpoints[,1]), "\n")
## Moisture:  39.3 76.6
cat("Fat: ", range(endpoints[,2]), "\n")
## Fat:  0.9 49.1
cat("Protein: ", range(endpoints[,3]), "\n")
## Protein:  11 21.8

5 Question 6.1c(i) Develop Models

# Prepare the data with column names from the start
X <- as.data.frame(absorp)
colnames(X) <- paste0("V", 1:ncol(X))
y <- endpoints[, 1]  # moisture is the response for this part

# Create training and test sets (80/20 split)
set.seed(103025)
training_indices <- createDataPartition(y, p = 0.8, list = FALSE)

X_train <- X[training_indices, ]
y_train <- y[training_indices]
X_test <- X[-training_indices, ]
y_test <- y[-training_indices]

# Set up cross-validation
ctrl <- trainControl(method = "cv", number = 10)

# 1. PRINCIPAL COMPONENT REGRESSION (PCR)
pcr_model <- train(X_train, y_train,
                   method = "pcr",
                   trControl = ctrl,
                   preProcess = c("center", "scale"),
                   tuneLength = 20)

cat("Optimal number of components for PCR:", pcr_model$bestTune$ncomp, "\n")
## Optimal number of components for PCR: 17
plot(pcr_model, main = "PCR Model")

# 2. PARTIAL LEAST SQUARES (PLS)
pls_model <- train(X_train, y_train,
                   method = "pls",
                   trControl = ctrl,
                   preProcess = c("center", "scale"),
                   tuneLength = 20)

cat("Optimal number of components for PLS:", pls_model$bestTune$ncomp, "\n")
## Optimal number of components for PLS: 18
plot(pls_model, main = "PLS Model")

# 3. RIDGE REGRESSION
ridge_model <- train(X_train, y_train,
                     method = "ridge",
                     trControl = ctrl,
                     preProcess = c("center", "scale"),
                     tuneGrid = expand.grid(lambda = seq(0, 0.1, length = 20)))

cat("Optimal lambda for Ridge:", ridge_model$bestTune$lambda, "\n")
## Optimal lambda for Ridge: 0.005263158
plot(ridge_model, main = "Ridge Regression")

# 4. LASSO
lasso_model <- train(X_train, y_train,
                     method = "lasso",
                     trControl = ctrl,
                     preProcess = c("center", "scale"),
                     tuneGrid = expand.grid(fraction = seq(0.05, 1, length = 20)))

cat("Optimal fraction for Lasso:", lasso_model$bestTune$fraction, "\n")
## Optimal fraction for Lasso: 0.05
plot(lasso_model, main = "Lasso")

# 5. ELASTIC NET
enet_model <- train(X_train, y_train,
                    method = "enet",
                    trControl = ctrl,
                    preProcess = c("center", "scale"),
                    tuneGrid = expand.grid(
                      lambda = seq(0, 0.01, length = 10),
                      fraction = seq(0.05, 1, length = 10)))

cat("Optimal parameters for Elastic Net:\n")
## Optimal parameters for Elastic Net:
print(enet_model$bestTune)
##   fraction lambda
## 1     0.05      0
plot(enet_model, main = "Elastic Net")

# Compare all models
results <- resamples(list(
  PCR = pcr_model,
  PLS = pls_model,
  Ridge = ridge_model,
  Lasso = lasso_model,
  ElasticNet = enet_model
))

summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: PCR, PLS, Ridge, Lasso, ElasticNet 
## Number of resamples: 10 
## 
## MAE 
##                 Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## PCR        0.9907898 1.655433 1.750641 1.706735 1.894032 1.978650    0
## PLS        0.9950547 1.237408 1.525913 1.479060 1.694800 1.924520    0
## Ridge      1.8510309 2.082544 2.527221 2.426472 2.711722 3.042205    0
## Lasso      1.0783123 1.359775 1.513775 1.577272 1.750235 2.293659    0
## ElasticNet 0.9704679 1.239712 1.417740 1.530313 1.823455 2.231269    0
## 
## RMSE 
##                Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## PCR        1.131044 2.077892 2.260061 2.261940 2.400253 3.121051    0
## PLS        1.170039 1.801339 2.067001 2.158819 2.704045 3.102498    0
## Ridge      2.138349 2.405387 3.102519 2.943903 3.323940 3.869651    0
## Lasso      1.542057 1.729812 2.335204 2.388968 2.690489 4.055498    0
## ElasticNet 1.279048 1.591857 2.146012 2.201836 2.835241 3.309797    0
## 
## Rsquared 
##                 Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PCR        0.9108570 0.9472697 0.9562315 0.9558933 0.9647052 0.9871226    0
## PLS        0.8513291 0.9500963 0.9638373 0.9537026 0.9791718 0.9838614    0
## Ridge      0.8470151 0.9192795 0.9250380 0.9284689 0.9432627 0.9839045    0
## Lasso      0.8525545 0.9417869 0.9528760 0.9481196 0.9711296 0.9789370    0
## ElasticNet 0.8992603 0.9180518 0.9615258 0.9514398 0.9818839 0.9855794    0
dotplot(results)

# Test set performance
models <- list(
  PCR = pcr_model,
  PLS = pls_model,
  Ridge = ridge_model,
  Lasso = lasso_model,
  ElasticNet = enet_model
)

test_results <- data.frame(
  Model = names(models),
  RMSE = numeric(length(models)),
  Rsquared = numeric(length(models))
)

for (i in seq_along(models)) {
  pred <- predict(models[[i]], X_test)
  test_results$RMSE[i] <- sqrt(mean((pred - y_test)^2))
  test_results$Rsquared[i] <- cor(pred, y_test)^2
}

print("Test Set Performance:")
## [1] "Test Set Performance:"
print(test_results)
##        Model     RMSE  Rsquared
## 1        PCR 1.959721 0.9590600
## 2        PLS 1.514225 0.9752885
## 3      Ridge 2.689994 0.9215908
## 4      Lasso 1.456886 0.9772254
## 5 ElasticNet 1.456886 0.9772254
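For a visual sanity check of the strongest linear model, a small sketch (not run above) plots test-set predictions against observed moisture; points hugging the 45-degree line indicate good calibration:

# Sketch: observed vs. predicted moisture for the PLS model on the test set
plot(y_test, predict(pls_model, X_test),
     xlab = "Observed Moisture", ylab = "Predicted Moisture",
     main = "PLS: Test Set Predictions")
abline(0, 1, lty = 2)  # 45-degree reference line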

6 Question 6.1c(ii) Optimal Values of the Tuning Parameter

# After training all models, print optimal tuning parameters

cat("PCR (Principal Component Regression):\n")
## PCR (Principal Component Regression):
cat("  Optimal number of components:", pcr_model$bestTune$ncomp, "\n")
##   Optimal number of components: 17
cat("  Cross-validated RMSE:", min(pcr_model$results$RMSE), "\n\n")
##   Cross-validated RMSE: 2.26194
cat("PLS (Partial Least Squares):\n")
## PLS (Partial Least Squares):
cat("  Optimal number of components:", pls_model$bestTune$ncomp, "\n")
##   Optimal number of components: 18
cat("  Cross-validated RMSE:", min(pls_model$results$RMSE), "\n\n")
##   Cross-validated RMSE: 2.158819
cat("Ridge Regression:\n")
## Ridge Regression:
cat("  Optimal lambda:", ridge_model$bestTune$lambda, "\n")
##   Optimal lambda: 0.005263158
cat("  Cross-validated RMSE:", min(ridge_model$results$RMSE), "\n\n")
##   Cross-validated RMSE: 2.943903
cat("Lasso:\n")
## Lasso:
cat("  Optimal fraction:", lasso_model$bestTune$fraction, "\n")
##   Optimal fraction: 0.05
cat("  Cross-validated RMSE:", min(lasso_model$results$RMSE), "\n\n")
##   Cross-validated RMSE: 2.388968
cat("Elastic Net:\n")
## Elastic Net:
cat("  Optimal lambda:", enet_model$bestTune$lambda, "\n")
##   Optimal lambda: 0
cat("  Optimal fraction:", enet_model$bestTune$fraction, "\n")
##   Optimal fraction: 0.05
cat("  Cross-validated RMSE:", min(enet_model$results$RMSE), "\n\n")
##   Cross-validated RMSE: 2.201836
cat("----------------------------------------\n")
## ----------------------------------------
cat("BEST MODEL (by CV RMSE):\n")
## BEST MODEL (by CV RMSE):
cat("----------------------------------------\n")
## ----------------------------------------
# Find the best model
cv_rmse <- c(
  PCR = min(pcr_model$results$RMSE),
  PLS = min(pls_model$results$RMSE),
  Ridge = min(ridge_model$results$RMSE),
  Lasso = min(lasso_model$results$RMSE),
  ElasticNet = min(enet_model$results$RMSE)
)

best_model_name <- names(which.min(cv_rmse))
cat("The model considered the best model is the one with the lowest RMSE score:", best_model_name, "with a cross-validated RMSE of", min(cv_rmse), "\n")
## The model considered the best model is the one with the lowest RMSE score: PLS with a cross-validated RMSE of 2.158819

7 Question 6.1d Model Performance Analysis

cat("All five regression models were evaluated using 10-fold cross-validation\n")
## All five regression models were evaluated using 10-fold cross-validation
cat("to predict moisture content from spectroscopy data. The results show\n")
## to predict moisture content from spectroscopy data. The results show
cat("clear performance differences:\n\n")
## clear performance differences:
cat("Cross-Validated RMSE:\n")
## Cross-Validated RMSE:
cat("  1. PLS:         2.159  (BEST)\n")
##   1. PLS:         2.159  (BEST)
cat("  2. Elastic Net: 2.202  (+2.0%)\n")
##   2. Elastic Net: 2.202  (+2.0%)
cat("  3. PCR:         2.262  (+4.8%)\n")
##   3. PCR:         2.262  (+4.8%)
cat("  4. Lasso:       2.389  (+10.7%)\n")
##   4. Lasso:       2.389  (+10.7%)
cat("  5. Ridge:       2.944  (+36.4%)\n\n")
##   5. Ridge:       2.944  (+36.4%)
cat("Key Findings:\n\n")
## Key Findings:
cat("PLS emerged as the best model with the lowest prediction error (RMSE = 2.159).\n")
## PLS emerged as the best model with the lowest prediction error (RMSE = 2.159).
cat("It uses 18 components to capture the underlying patterns in the data.\n\n")
## It uses 18 components to capture the underlying patterns in the data.
cat("Elastic Net performed nearly as well (only 2% worse), making it a viable\n")
## Elastic Net performed nearly as well (only 2% worse), making it a viable
cat("alternative to PLS. The small difference suggests both methods effectively\n")
## alternative to PLS. The small difference suggests both methods effectively
cat("handle the high correlation in spectroscopy data.\n\n")
## handle the high correlation in spectroscopy data.
cat("PCR was competitive but slightly worse than PLS (4.8% higher RMSE). Both\n")
## PCR was competitive but slightly worse than PLS (4.8% higher RMSE). Both
cat("methods use similar numbers of components (17-18), but PLS has an advantage\n")
## methods use similar numbers of components (17-18), but PLS has an advantage
cat("because it considers the response variable when creating components, while\n")
## because it considers the response variable when creating components, while
cat("PCR does not.\n\n")
## PCR does not.
cat("Lasso performed moderately well but was 10.7% worse than PLS. Its aggressive\n")
## Lasso performed moderately well but was 10.7% worse than PLS. Its aggressive
cat("variable selection approach may have removed useful predictors.\n\n")
## variable selection approach may have removed useful predictors.
cat("Ridge Regression performed worst, with RMSE 36% higher than PLS. Ridge keeps\n")
## Ridge Regression performed worst, with RMSE 36% higher than PLS. Ridge keeps
cat("all 100 highly correlated predictors, which hurts performance compared to\n")
## all 100 highly correlated predictors, which hurts performance compared to
cat("dimension reduction methods like PLS and PCR that reduce to ~18 components.\n\n")
## dimension reduction methods like PLS and PCR that reduce to ~18 components.
cat("Recommendation: Use PLS with 18 components for predicting moisture content.\n")
## Recommendation: Use PLS with 18 components for predicting moisture content.
cat("It provides the best predictive performance and effectively handles the\n")
## It provides the best predictive performance and effectively handles the
cat("multicollinearity inherent in near-infrared spectroscopy data.\n")
## multicollinearity inherent in near-infrared spectroscopy data.

8 Question 6.1e Predicting Fat Content

cat("PREDICTING FAT CONTENT\n\n")
## PREDICTING FAT CONTENT
# Change the response variable to fat (column 2)
y_fat <- endpoints[, 2]  # fat instead of moisture

# Create training and test sets for fat
set.seed(103025)
training_indices <- createDataPartition(y_fat, p = 0.8, list = FALSE)

X_train <- X[training_indices, ]
y_fat_train <- y_fat[training_indices]
X_test <- X[-training_indices, ]
y_fat_test <- y_fat[-training_indices]

# Train all models with fat as response
ctrl <- trainControl(method = "cv", number = 10)

cat("Training models to predict fat content...\n\n")
## Training models to predict fat content...
# PCR
pcr_fat <- train(X_train, y_fat_train,
                 method = "pcr",
                 trControl = ctrl,
                 preProcess = c("center", "scale"),
                 tuneLength = 20)

# PLS
pls_fat <- train(X_train, y_fat_train,
                 method = "pls",
                 trControl = ctrl,
                 preProcess = c("center", "scale"),
                 tuneLength = 20)

# Ridge
ridge_fat <- train(X_train, y_fat_train,
                   method = "ridge",
                   trControl = ctrl,
                   preProcess = c("center", "scale"),
                   tuneGrid = expand.grid(lambda = seq(0, 0.1, length = 20)))

# Lasso
lasso_fat <- train(X_train, y_fat_train,
                   method = "lasso",
                   trControl = ctrl,
                   preProcess = c("center", "scale"),
                   tuneGrid = expand.grid(fraction = seq(0.05, 1, length = 20)))

# Elastic Net
enet_fat <- train(X_train, y_fat_train,
                  method = "enet",
                  trControl = ctrl,
                  preProcess = c("center", "scale"),
                  tuneGrid = expand.grid(
                    lambda = seq(0, 0.01, length = 10),
                    fraction = seq(0.05, 1, length = 10)))

# Compare results
cat("\n--- FAT PREDICTION RESULTS ---\n\n")
## 
## --- FAT PREDICTION RESULTS ---
cv_rmse_fat <- c(
  PCR = min(pcr_fat$results$RMSE),
  PLS = min(pls_fat$results$RMSE),
  Ridge = min(ridge_fat$results$RMSE),
  Lasso = min(lasso_fat$results$RMSE),
  ElasticNet = min(enet_fat$results$RMSE)
)

# Sort by RMSE
cv_rmse_fat_sorted <- sort(cv_rmse_fat)

cat("Cross-Validated RMSE (Best to Worst):\n")
## Cross-Validated RMSE (Best to Worst):
for(i in 1:length(cv_rmse_fat_sorted)) {
  pct_diff <- ((cv_rmse_fat_sorted[i] - cv_rmse_fat_sorted[1]) / cv_rmse_fat_sorted[1]) * 100
  marker <- if(i == 1) " <- BEST" else sprintf(" (+%.1f%%)", pct_diff)
  cat(sprintf("  %d. %-12s %.3f%s\n", i, names(cv_rmse_fat_sorted)[i], 
              cv_rmse_fat_sorted[i], marker))
}
##   1. Lasso        2.129 <- BEST
##   2. ElasticNet   2.308 (+8.4%)
##   3. PLS          2.322 (+9.1%)
##   4. PCR          2.365 (+11.1%)
##   5. Ridge        2.632 (+23.6%)
cat("\nOptimal Tuning Parameters:\n")
## 
## Optimal Tuning Parameters:
cat("  PCR components:", pcr_fat$bestTune$ncomp, "\n")
##   PCR components: 17
cat("  PLS components:", pls_fat$bestTune$ncomp, "\n")
##   PLS components: 14
cat("  Ridge lambda:", round(ridge_fat$bestTune$lambda, 4), "\n")
##   Ridge lambda: 0
cat("  Lasso fraction:", lasso_fat$bestTune$fraction, "\n")
##   Lasso fraction: 0.25
cat("  Elastic Net lambda:", enet_fat$bestTune$lambda, 
    ", fraction:", enet_fat$bestTune$fraction, "\n\n")
##   Elastic Net lambda: 0 , fraction: 0.2611111
# Test set performance
test_rmse_fat <- c(
  PCR = sqrt(mean((predict(pcr_fat, X_test) - y_fat_test)^2)),
  PLS = sqrt(mean((predict(pls_fat, X_test) - y_fat_test)^2)),
  Ridge = sqrt(mean((predict(ridge_fat, X_test) - y_fat_test)^2)),
  Lasso = sqrt(mean((predict(lasso_fat, X_test) - y_fat_test)^2)),
  ElasticNet = sqrt(mean((predict(enet_fat, X_test) - y_fat_test)^2))
)

cat("Test Set RMSE:\n")
## Test Set RMSE:
test_rmse_fat_sorted <- sort(test_rmse_fat)
for(i in 1:length(test_rmse_fat_sorted)) {
  marker <- if(i == 1) " <- BEST" else ""
  cat(sprintf("  %d. %-12s %.3f%s\n", i, names(test_rmse_fat_sorted)[i], 
              test_rmse_fat_sorted[i], marker))
}
##   1. PLS          2.197 <- BEST
##   2. PCR          2.326
##   3. Lasso        5.601
##   4. ElasticNet   5.669
##   5. Ridge        6.534
cat("\nRecommendation for Fat Prediction:\n")
## 
## Recommendation for Fat Prediction:
best_model_fat <- names(which.min(cv_rmse_fat))
cat("Use", best_model_fat, "for predicting fat content\n")
## Use Lasso for predicting fat content
cat("CV RMSE:", round(min(cv_rmse_fat), 3), "\n")
## CV RMSE: 2.129
cat("Test RMSE:", round(test_rmse_fat[best_model_fat], 3), "\n")
## Test RMSE: 5.601

9 Question 6.2 Develop a Model to Predict Permeability

Developing a model to predict permeability could save a pharmaceutical company significant resources while more rapidly identifying molecules with permeability sufficient for drug development. Molecular fingerprints are binary indicators of the presence (1) or absence (0) of specific chemical substructures in a molecule. They are commonly used in:

  • Drug discovery
  • Chemical similarity analysis
  • Quantitative structure-activity relationship (QSAR) modeling
  • Solubility prediction
library(AppliedPredictiveModeling)
data(permeability)
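Since the predictors are binary flags, a quick peek at one compound's fingerprint makes the encoding concrete (a sketch; output not shown):

# Sketch: one compound's fingerprint -- 0/1 flags for the first 12 substructures
fingerprints[1, 1:12]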

10 Question 6.2a EDA on Permeability Data

cat("EXPLORATORY DATA ANALYSIS\n")
## EXPLORATORY DATA ANALYSIS
cat("Permeability Dataset\n\n")
## Permeability Dataset
# 1. BASIC STRUCTURE
cat("--- DATASET DIMENSIONS ---\n")
## --- DATASET DIMENSIONS ---
cat("Fingerprints matrix dimensions:", dim(fingerprints), "\n")
## Fingerprints matrix dimensions: 165 1107
cat("  - Number of compounds:", nrow(fingerprints), "\n")
##   - Number of compounds: 165
cat("  - Number of molecular descriptors:", ncol(fingerprints), "\n\n")
##   - Number of molecular descriptors: 1107
cat("Permeability vector length:", length(permeability), "\n\n")
## Permeability vector length: 165
# 2. RESPONSE VARIABLE (PERMEABILITY) ANALYSIS
cat("--- PERMEABILITY (RESPONSE VARIABLE) ---\n")
## --- PERMEABILITY (RESPONSE VARIABLE) ---
cat("Summary statistics:\n")
## Summary statistics:
print(summary(permeability))
##   permeability  
##  Min.   : 0.06  
##  1st Qu.: 1.55  
##  Median : 4.91  
##  Mean   :12.24  
##  3rd Qu.:15.47  
##  Max.   :55.60
cat("\nStandard deviation:", round(sd(permeability), 3), "\n")
## 
## Standard deviation: 15.579
cat("Range:", round(diff(range(permeability)), 3), "\n\n")
## Range: 55.54
# Prepare the data
X <- fingerprints
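The summary above shows permeability is strongly right-skewed (mean 12.24 vs. median 4.91). A small sketch (output not shown) to visualize the skew and a possible log transform:

# Sketch: permeability is right-skewed; a log scale is often easier to model
par(mfrow = c(1, 2))
hist(permeability, breaks = 30, main = "Permeability", xlab = "Permeability")
hist(log10(permeability), breaks = 30, main = "log10(Permeability)",
     xlab = "log10(Permeability)")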

11 Question 6.2a(i) Explore the Dataset

# Check dimensions
cat("Dimensions of fingerprint matrix:\n")
## Dimensions of fingerprint matrix:
print(dim(fingerprints))
## [1]  165 1107
cat("This means", nrow(fingerprints), "samples with", ncol(fingerprints), "predictors\n")
## This means 165 samples with 1107 predictors

12 Question 6.2b Filter out Predictors

Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

Dataset Description:

  • Permeability data (AppliedPredictiveModeling package)
  • 165 compounds
  • 1,107 binary fingerprint predictors
  • Ratio: ~6.7 predictors per sample (a high-dimensional problem)

library(caret)
library(AppliedPredictiveModeling)
data(permeability)

nzv <- nearZeroVar(fingerprints)
fingerprints_filtered <- fingerprints[, -nzv]
ncol(fingerprints_filtered)  
## [1] 388
cat("The number of predictors left for modeling:", ncol(fingerprints_filtered))
## The number of predictors left for modeling: 388
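For reference, nearZeroVar flags predictors whose most frequent value dominates (high freqRatio) and that take few distinct values (low percentUnique). A sketch to inspect those diagnostics directly (output not shown):

# Sketch: inspect the metrics behind the near-zero-variance filter
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics[order(-nzv_metrics$freqRatio), ])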

13 Question 6.2c Prepare the Data for Modeling

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

library(caret)
library(AppliedPredictiveModeling)
library(pls)        # PCR / PLS back end
library(elasticnet) # ridge / lasso / elastic net back end
library(kknn)
## 
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
## 
##     contr.dummy
library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha

# Load data
data(permeability)

# Filter near-zero variance predictors
nzv <- nearZeroVar(fingerprints)
fingerprints_filtered <- fingerprints[, -nzv]

cat("Predictors after filtering:", ncol(fingerprints_filtered), "\n\n")
## Predictors after filtering: 388
# Split data into training and test sets (75/25)
set.seed(103125)
trainIndex <- createDataPartition(permeability, p = 0.75, list = FALSE)

trainX <- fingerprints_filtered[trainIndex, ]
trainY <- permeability[trainIndex]
testX <- fingerprints_filtered[-trainIndex, ]
testY <- permeability[-trainIndex]

cat("Training set:", nrow(trainX), "samples\n")
## Training set: 125 samples
cat("Test set:", nrow(testX), "samples\n\n")
## Test set: 40 samples
# Set up 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)

# Tune PLS model (preprocessing: center and scale)
set.seed(103025)
plsTune <- train(
  x = trainX,
  y = trainY,
  method = "pls",
  tuneLength = 20,
  trControl = ctrl,
  preProc = c("center", "scale")
)

# Get results
bestTune <- plsTune$bestTune
bestResults <- plsTune$results[plsTune$results$ncomp == bestTune$ncomp, ]

cat("1. Optimal number of latent variables:", bestTune$ncomp, "\n\n")
## 1. Optimal number of latent variables: 7
cat("2. Resampled R²:", round(bestResults$Rsquared, 4), "\n\n")
## 2. Resampled R²: 0.5212
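To see how RMSE changes with the number of latent variables (and how flat the profile is around the optimum), caret's plot method for train objects can be used; a sketch (output not shown):

# Sketch: RMSE profile across candidate numbers of latent variables
plot(plsTune)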

14 Question 6.2d Predict the Response for the Test Set

What is the test set estimate of R2?

# Predict the response for the test set
testPred <- predict(plsTune, newdata = testX)

# Calculate test set R²
testR2 <- cor(testPred, testY)^2

cat("Test set estimate of R²:", round(testR2, 4), "\n\n")
## Test set estimate of R²: 0.4724
cat("Interpretation:\n")
## Interpretation:
cat("The model explains", round(testR2*100, 1), 
    "% of variance in permeability.\n\n")
## The model explains 47.2 % of variance in permeability.
cat("Training vs Test Performance:\n")
## Training vs Test Performance:
cat("  Training (CV) R²:", round(bestResults$Rsquared, 4), "\n")
##   Training (CV) R²: 0.5212
cat("  Test R²:", round(testR2, 4), "\n")
##   Test R²: 0.4724
cat("  Difference:", round(bestResults$Rsquared - testR2, 4), "\n")
##   Difference: 0.0488
if(abs(bestResults$Rsquared - testR2) < 0.05) {
  cat("  → Good generalization (minimal overfitting)\n")
} else {
  cat("  → Some overfitting detected\n")
}
##   → Good generalization (minimal overfitting)
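As an alternative to the manual calculation, caret's postResample computes RMSE, R², and MAE in one call; note that caret's R² is the squared correlation between predictions and observations, matching the cor(testPred, testY)^2 used above. A sketch:

# Sketch: one-call test-set metrics (RMSE, Rsquared, MAE)
postResample(pred = testPred, obs = testY)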

15 Question 6.2e Showing Which Model Is Best

library(caret)
library(pls)        # needed by method = "pcr"
library(glmnet)     # ridge/lasso/elastic net via glmnet
## Loading required package: Matrix
## Loaded glmnet 4.1-10
library(kknn)
library(kernlab)

# Reuse trainX, trainY 
set.seed(103025)
ctrl <- trainControl(method = "cv", number = 10)

# PCR
pcr_model <- train(trainX, trainY,
                   method = "pcr",
                   trControl = ctrl,
                   preProcess = c("center", "scale"),
                   tuneLength = 20)

# Ridge (alpha = 0)
ridge_model <- train(trainX, trainY,
                     method = "glmnet",
                     trControl = ctrl,
                     preProcess = c("center", "scale"),
                     tuneGrid = expand.grid(alpha = 0,
                                            lambda = 10^seq(-4, 1, length = 50)))

# Lasso (alpha = 1)
lasso_model <- train(trainX, trainY,
                     method = "glmnet",
                     trControl = ctrl,
                     preProcess = c("center", "scale"),
                     tuneGrid = expand.grid(alpha = 1,
                                            lambda = 10^seq(-4, 1, length = 50)))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
# (This warning, here and after the elastic net fit below, typically means some
# lambda values shrank all coefficients to zero, making R-squared undefined in
# a resample; it does not affect the selected bestTune.)
# Elastic Net (search alpha in [0,1] and lambda)
enet_model <- train(trainX, trainY,
                    method = "glmnet",
                    trControl = ctrl,
                    preProcess = c("center", "scale"),
                    tuneLength = 20)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
# KNN
knn_model <- train(trainX, trainY,
                   method = "knn",
                   trControl = ctrl,
                   preProcess = c("center", "scale"),
                   tuneLength = 20)

# SVM (RBF)
svm_model <- train(trainX, trainY,
                   method = "svmRadial",
                   trControl = ctrl,
                   preProcess = c("center", "scale"),
                   tuneLength = 10)

# Compare with your existing PLS model object: plsTune
results <- resamples(list(
  PLS = plsTune,
  PCR = pcr_model,
  Ridge = ridge_model,
  Lasso = lasso_model,
  ElasticNet = enet_model,
  KNN = knn_model,
  SVM = svm_model
))

summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: PLS, PCR, Ridge, Lasso, ElasticNet, KNN, SVM 
## Number of resamples: 10 
## 
## MAE 
##                Min.  1st Qu.   Median     Mean   3rd Qu.      Max. NA's
## PLS        5.855166 7.671185 8.374878 8.819935 10.395264 12.232113    0
## PCR        6.478052 7.349715 8.886026 8.936486  9.847171 12.295670    0
## Ridge      6.067736 7.319335 8.336143 8.605020  9.342998 12.349564    0
## Lasso      4.168337 6.448881 7.736902 8.105491  9.835276 12.484209    0
## ElasticNet 6.077117 6.675149 7.634973 7.705983  8.601024  9.607781    0
## KNN        4.471095 6.082174 7.956981 8.385917 10.117267 13.761968    0
## SVM        5.538656 6.595833 7.517188 7.563386  8.156856 10.131174    0
## 
## RMSE 
##                Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## PLS        7.129567  9.433234 11.39016 11.29200 12.94590 16.54612    0
## PCR        8.037053 10.161534 11.06334 11.96948 14.23587 16.60493    0
## Ridge      8.283846  9.164638 12.18091 11.62923 12.77967 15.96915    0
## Lasso      5.554426 10.672216 11.81544 11.36815 13.14913 16.33687    0
## ElasticNet 8.530177  9.119431 11.03678 10.87942 12.14213 14.26285    0
## KNN        6.665764  7.582388 10.41034 11.73250 15.25848 18.84733    0
## SVM        7.233803  9.722672 10.72060 10.43967 11.16724 12.69882    0
## 
## Rsquared 
##                   Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS        0.080111817 0.4387388 0.5599021 0.5211719 0.6322325 0.8229828    0
## PCR        0.141356694 0.3518307 0.3820989 0.4562220 0.6234931 0.8204790    0
## Ridge      0.119162090 0.2991567 0.5790568 0.5017807 0.6534809 0.8146635    0
## Lasso      0.131695219 0.2994707 0.4844250 0.5243114 0.7702345 0.9220318    0
## ElasticNet 0.302124171 0.3836284 0.5328731 0.5723251 0.7386761 0.8775188    0
## KNN        0.009837283 0.1747413 0.4901745 0.4648761 0.7428649 0.9107852    0
## SVM        0.190436679 0.5147241 0.5918350 0.5851556 0.6655141 0.8661903    0
dotplot(results)

16 Question 6.2e Extract the Best Model from Comparison

# Extract best model from comparison
cv_results <- data.frame(
  Model = c("PLS", "PCR", "Ridge", "Lasso", "ElasticNet", "KNN", "SVM"),
  CV_RMSE = c(
    min(plsTune$results$RMSE),
    min(pcr_model$results$RMSE),
    min(ridge_model$results$RMSE),
    min(lasso_model$results$RMSE),
    min(enet_model$results$RMSE),
    min(knn_model$results$RMSE),
    min(svm_model$results$RMSE)
  )
)

cv_results <- cv_results[order(cv_results$CV_RMSE), ]

cat("\n--- MODEL RANKINGS (by Cross-Validated RMSE) ---\n\n")
## 
## --- MODEL RANKINGS (by Cross-Validated RMSE) ---
print(cv_results)
##        Model  CV_RMSE
## 7        SVM 10.43967
## 5 ElasticNet 10.87942
## 1        PLS 11.29200
## 4      Lasso 11.36815
## 3      Ridge 11.62923
## 6        KNN 11.73250
## 2        PCR 11.96948
cat("\nBest Model:", cv_results$Model[1], "\n")
## 
## Best Model: SVM
cat("Answer to (e): ", 
    ifelse(cv_results$Model[1] == "PLS", 
           "No other model performs better than PLS",
           paste(cv_results$Model[1], "performs better than PLS")), 
    "\n")
## Answer to (e):  SVM performs better than PLS
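Since SVM wins on cross-validated RMSE, it is worth scoring it on the held-out test set before preferring it over PLS; a sketch (results not shown):

# Sketch: evaluate the CV-winning SVM on the held-out test set
svm_pred <- predict(svm_model, testX)
c(RMSE = sqrt(mean((svm_pred - testY)^2)),
  Rsquared = cor(svm_pred, testY)^2)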

17 Question 6.2f Can the Model Replace Lab Experiments?

cat("--- CAN THIS MODEL REPLACE LAB EXPERIMENTS? ---\n\n")
## --- CAN THIS MODEL REPLACE LAB EXPERIMENTS? ---
cat("Model Performance:\n")
## Model Performance:
cat("- Test R² =", round(testR2, 4), "→ explains", 
    round(testR2*100,1), "% of variance\n")
## - Test R² = 0.4724 → explains 47.2 % of variance
cat("- Unexplained variance:", round((1-testR2)*100,1), "%\n")
## - Unexplained variance: 52.8 %
cat("- Test RMSE:", round(sqrt(mean((testPred - testY)^2)), 3), "\n\n")
## - Test RMSE: 11.27
if (testR2 >= 0.70) {
  cat("RECOMMENDATION: YES - Significant lab reduction possible\n\n")
  cat("Justification:\n")
  cat("- Strong predictive power (R² >= 0.70)\n")
  cat("- Can screen 80-90% of compounds computationally\n")
  cat("- Reserve lab testing for borderline cases\n")
  cat("- Estimated cost savings: 60-70% of testing budget\n")
} else if (testR2 >= 0.50) {
  cat("RECOMMENDATION: PARTIAL - Use for pre-screening only\n\n")
  cat("Justification:\n")
  cat("- Moderate predictive power (R² =", round(testR2, 2), ")\n")
  cat("- Too much uncertainty for full replacement\n")
  cat("- Strategy: Eliminate clearly poor candidates (bottom 30%)\n")
  cat("- Lab test remaining 70% of compounds\n")
  cat("- Estimated cost savings: 25-30% of testing budget\n")
} else {
  cat("RECOMMENDATION: NO - Cannot replace lab experiments\n\n")
  cat("Justification:\n")
  cat("- Weak predictive power (R² < 0.50)\n")
  cat("- High prediction error creates unacceptable risk\n")
  cat("- May still be useful for hypothesis generation\n")
}
## RECOMMENDATION: NO - Cannot replace lab experiments
## 
## Justification:
## - Weak predictive power (R² < 0.50)
## - High prediction error creates unacceptable risk
## - May still be useful for hypothesis generation
cat("\nAdditional Considerations:\n")
## 
## Additional Considerations:
cat("- Regulatory requirements may mandate physical testing\n")
## - Regulatory requirements may mandate physical testing
cat("- Model should be validated on new compounds periodically\n")
## - Model should be validated on new compounds periodically
cat("- Confidence intervals could guide selective lab testing\n")
## - Confidence intervals could guide selective lab testing

18 Question 6.3a Tasks: Load Libraries and Data

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following objects are masked from 'package:Matrix':
## 
##     expand, pack, unpack
library(ggplot2)

# View structure
str(ChemicalManufacturingProcess)
## 'data.frame':    176 obs. of  58 variables:
##  $ Yield                 : num  38 42.4 42 41.4 42.5 ...
##  $ BiologicalMaterial01  : num  6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
##  $ BiologicalMaterial02  : num  49.6 61 61 61 63.3 ...
##  $ BiologicalMaterial03  : num  57 67.5 67.5 67.5 72.2 ...
##  $ BiologicalMaterial04  : num  12.7 14.7 14.7 14.7 14 ...
##  $ BiologicalMaterial05  : num  19.5 19.4 19.4 19.4 17.9 ...
##  $ BiologicalMaterial06  : num  43.7 53.1 53.1 53.1 54.7 ...
##  $ BiologicalMaterial07  : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ BiologicalMaterial08  : num  16.7 19 19 19 18.2 ...
##  $ BiologicalMaterial09  : num  11.4 12.6 12.6 12.6 12.8 ...
##  $ BiologicalMaterial10  : num  3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
##  $ BiologicalMaterial11  : num  138 154 154 154 148 ...
##  $ BiologicalMaterial12  : num  18.8 21.1 21.1 21.1 21.1 ...
##  $ ManufacturingProcess01: num  NA 0 0 0 10.7 12 11.5 12 12 12 ...
##  $ ManufacturingProcess02: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess03: num  NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
##  $ ManufacturingProcess04: num  NA 917 912 911 918 924 933 929 928 938 ...
##  $ ManufacturingProcess05: num  NA 1032 1004 1015 1028 ...
##  $ ManufacturingProcess06: num  NA 210 207 213 206 ...
##  $ ManufacturingProcess07: num  NA 177 178 177 178 178 177 178 177 177 ...
##  $ ManufacturingProcess08: num  NA 178 178 177 178 178 178 178 177 177 ...
##  $ ManufacturingProcess09: num  43 46.6 45.1 44.9 45 ...
##  $ ManufacturingProcess10: num  NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
##  $ ManufacturingProcess11: num  NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
##  $ ManufacturingProcess12: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess13: num  35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
##  $ ManufacturingProcess14: num  4898 4869 4878 4897 4992 ...
##  $ ManufacturingProcess15: num  6108 6095 6087 6102 6233 ...
##  $ ManufacturingProcess16: num  4682 4617 4617 4635 4733 ...
##  $ ManufacturingProcess17: num  35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
##  $ ManufacturingProcess18: num  4865 4867 4877 4872 4886 ...
##  $ ManufacturingProcess19: num  6049 6097 6078 6073 6102 ...
##  $ ManufacturingProcess20: num  4665 4621 4621 4611 4659 ...
##  $ ManufacturingProcess21: num  0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
##  $ ManufacturingProcess22: num  NA 3 4 5 8 9 1 2 3 4 ...
##  $ ManufacturingProcess23: num  NA 0 1 2 4 1 1 2 3 1 ...
##  $ ManufacturingProcess24: num  NA 3 4 5 18 1 1 2 3 4 ...
##  $ ManufacturingProcess25: num  4873 4869 4897 4892 4930 ...
##  $ ManufacturingProcess26: num  6074 6107 6116 6111 6151 ...
##  $ ManufacturingProcess27: num  4685 4630 4637 4630 4684 ...
##  $ ManufacturingProcess28: num  10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
##  $ ManufacturingProcess29: num  21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
##  $ ManufacturingProcess30: num  9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
##  $ ManufacturingProcess31: num  69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
##  $ ManufacturingProcess32: num  156 169 173 171 171 173 159 161 160 164 ...
##  $ ManufacturingProcess33: num  66 66 66 68 70 70 65 65 65 66 ...
##  $ ManufacturingProcess34: num  2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ ManufacturingProcess35: num  486 508 509 496 468 490 475 478 491 488 ...
##  $ ManufacturingProcess36: num  0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
##  $ ManufacturingProcess37: num  0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
##  $ ManufacturingProcess38: num  3 2 2 2 2 2 2 2 3 3 ...
##  $ ManufacturingProcess39: num  7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
##  $ ManufacturingProcess40: num  NA 0.1 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess41: num  NA 0.15 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess42: num  11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
##  $ ManufacturingProcess43: num  3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
##  $ ManufacturingProcess44: num  1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
##  $ ManufacturingProcess45: num  2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...
# Check first few rows (first 7 of 58 columns shown; the full frame is wide)
head(ChemicalManufacturingProcess[, 1:7])
##   Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00                 6.25                49.58                56.97
## 2 42.44                 8.01                60.97                67.48
## 3 42.03                 8.01                60.97                67.48
## 4 41.42                 8.01                60.97                67.48
## 5 42.49                 7.47                63.33                72.25
## 6 43.57                 6.12                58.36                65.31
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1                12.74                19.51                43.73
## 2                14.65                19.36                53.14
## 3                14.65                19.36                53.14
## 4                14.65                19.36                53.14
## 5                14.02                17.91                54.66
## 6                15.17                21.79                51.23
# Example: Correlation between predictors and yield
cor(ChemicalManufacturingProcess, use = "pairwise.complete.obs")[, "Yield"]
##                  Yield   BiologicalMaterial01   BiologicalMaterial02 
##            1.000000000            0.358937974            0.481515794 
##   BiologicalMaterial03   BiologicalMaterial04   BiologicalMaterial05 
##            0.445085980            0.379840101            0.145750912 
##   BiologicalMaterial06   BiologicalMaterial07   BiologicalMaterial08 
##            0.478163422           -0.111850115            0.380940207 
##   BiologicalMaterial09   BiologicalMaterial10   BiologicalMaterial11 
##            0.092036495            0.200830542            0.354914346 
##   BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02 
##            0.367497636           -0.101342814           -0.206275938 
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 
##           -0.091006594           -0.263549597            0.110367601 
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 
##            0.389012972           -0.043327408            0.014695409 
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11 
##            0.503470507            0.209075007            0.334225285 
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14 
##            0.349687153           -0.503679718           -0.005947661 
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17 
##            0.216187972           -0.037346455           -0.425806872 
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20 
##           -0.058929250            0.134879045           -0.065511464 
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23 
##           -0.025849751            0.016972274           -0.096879839 
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 
##           -0.210199573            0.007275874            0.038831036 
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 
##            0.001659092            0.263985208            0.153804211 
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32 
##            0.232179592           -0.068904174            0.608332150 
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 
##            0.425621132            0.174553676           -0.174232504 
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38 
##           -0.528419637           -0.159314120           -0.086459252 
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41 
##            0.035865834           -0.046963558           -0.028798945 
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44 
##           -0.013496843            0.159268513            0.070497553 
## ManufacturingProcess45 
##            0.028816232

19 Question 6.3a (cont’d) Initial Exploratory Data Analysis

# 1) Load data (dplyr and ggplot2 are used below; attach them if not already loaded)
library(dplyr)
library(ggplot2)
data(ChemicalManufacturingProcess)
df <- ChemicalManufacturingProcess

# 2) Quick sanity check
stopifnot("Yield" %in% names(df))

# 3) Compute correlations (pairwise complete obs) with Yield
num_df <- df %>% select(where(is.numeric))
cors <- cor(num_df, use = "pairwise.complete.obs")

# 4) Convert to a tidy data frame, drop Yield itself, and rank by signed correlation
ranked <- data.frame(
  Variable = rownames(cors),
  Correlation_with_Yield = cors[, "Yield"],
  row.names = NULL
) %>%
  filter(Variable != "Yield") %>%
  arrange(desc(Correlation_with_Yield))

# 5) Show top 10 correlated predictors
top_k <- 10
top_table <- ranked %>% slice(1:top_k)
head(top_table, top_k)
##                  Variable Correlation_with_Yield
## 1  ManufacturingProcess32              0.6083321
## 2  ManufacturingProcess09              0.5034705
## 3    BiologicalMaterial02              0.4815158
## 4    BiologicalMaterial06              0.4781634
## 5    BiologicalMaterial03              0.4450860
## 6  ManufacturingProcess33              0.4256211
## 7  ManufacturingProcess06              0.3890130
## 8    BiologicalMaterial08              0.3809402
## 9    BiologicalMaterial04              0.3798401
## 10   BiologicalMaterial12              0.3674976
# ===== Optional: quick lollipop plot of top 15 =====
top_plot <- ranked %>%
  slice(1:15) %>%
  mutate(Variable = factor(Variable, levels = rev(Variable)))

ggplot(top_plot, aes(x = Correlation_with_Yield, y = Variable)) +
  geom_segment(aes(x = 0, xend = Correlation_with_Yield, y = Variable, yend = Variable)) +
  geom_point(size = 2.8) +
  labs(
    title = "Top Correlations with Yield",
    x = "Pearson Correlation",
    y = NULL
  ) +
  theme_minimal(base_size = 12)
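Because the table above ranks by signed correlation, strong negative predictors such as ManufacturingProcess36 (r ≈ −0.53) and ManufacturingProcess13 (r ≈ −0.50) sink to the bottom even though they are among the most informative variables. A minimal variant, reusing the ranked data frame from above, sorts by absolute correlation instead:

# Rank by |r| so strong negative predictors surface alongside the
# positive ones (sketch; assumes dplyr is attached)
ranked_abs <- ranked %>%
  mutate(AbsCorrelation = abs(Correlation_with_Yield)) %>%
  arrange(desc(AbsCorrelation))
head(ranked_abs, top_k)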

20 Question 6.3b Missing Values

# 1. Count total missing values
missing_total <- sum(is.na(ChemicalManufacturingProcess))
cat("The number of missing values in the entire dataset is:", missing_total, "\n\n")
## The number of missing values in the entire dataset is: 106
# 2. Show missing values per column (only those > 0)
missing_summary <- sort(colSums(is.na(ChemicalManufacturingProcess)), decreasing = TRUE)
cat("Columns with missing values (and their counts):\n")
## Columns with missing values (and their counts):
print(missing_summary[missing_summary > 0])
## ManufacturingProcess03 ManufacturingProcess11 ManufacturingProcess10 
##                     15                     10                      9 
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27 
##                      5                      5                      5 
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30 
##                      5                      5                      5 
## ManufacturingProcess31 ManufacturingProcess33 ManufacturingProcess34 
##                      5                      5                      5 
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess02 
##                      5                      5                      3 
## ManufacturingProcess06 ManufacturingProcess01 ManufacturingProcess04 
##                      2                      1                      1 
## ManufacturingProcess05 ManufacturingProcess07 ManufacturingProcess08 
##                      1                      1                      1 
## ManufacturingProcess12 ManufacturingProcess14 ManufacturingProcess22 
##                      1                      1                      1 
## ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess40 
##                      1                      1                      1 
## ManufacturingProcess41 
##                      1
cat("\n")
# 3. Calculate percent missing overall
total_values <- prod(dim(ChemicalManufacturingProcess))
missing_percent <- (missing_total / total_values) * 100
cat("Percentage of missing values:", round(missing_percent, 2), "%\n\n")
## Percentage of missing values: 1.04 %
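The missing-data burden is light (about 1% of cells, concentrated in the process variables), so simple imputation is defensible. As a minimal sketch of the median-imputation step applied during model training below, caret's preProcess can estimate column medians and fill the gaps directly:

# Estimate per-column medians and impute the NAs (sketch)
pp_impute <- preProcess(ChemicalManufacturingProcess, method = "medianImpute")
chem_imputed <- predict(pp_impute, ChemicalManufacturingProcess)
sum(is.na(chem_imputed))  # expected to be 0 after imputation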

21 Question 6.3c Split the Data into a Training and Test Set

library(AppliedPredictiveModeling)
library(dplyr)

set.seed(123)

# 1) Load + ensure no NA in outcome
data(ChemicalManufacturingProcess)
df <- ChemicalManufacturingProcess %>% filter(!is.na(Yield))

# 2) Split
idx   <- createDataPartition(df$Yield, p = 0.8, list = FALSE)
train <- df[idx, ]
test  <- df[-idx, ]

# 3) CV + preprocessing (impute + center/scale)
ctrl <- trainControl(method = "cv", number = 10)
pp   <- c("medianImpute", "center", "scale")

# 4) Models
# Note: with the formula interface, na.action = na.omit drops incomplete rows
# before preProcess runs, so medianImpute sees no NAs during training, and
# predict() will likewise drop incomplete test rows (see the warnings under
# postResample below).

# Linear regression
set.seed(123)
mod_lm <- train(
  Yield ~ ., data = train, method = "lm",
  trControl = ctrl, preProcess = pp, metric = "RMSE",
  na.action = na.omit
)

# Partial Least Squares
set.seed(123)
mod_pls <- train(
  Yield ~ ., data = train, method = "pls",
  trControl = ctrl, preProcess = pp, tuneLength = 20, metric = "RMSE",
  na.action = na.omit
)

# Penalized (glmnet)
set.seed(123)
mod_glmnet <- train(
  Yield ~ ., data = train, method = "glmnet",
  trControl = ctrl, preProcess = pp, tuneLength = 30, metric = "RMSE",
  na.action = na.omit
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
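Before comparing models, it is worth checking which penalty the resampling selected; bestTune is a standard caret field, so this is a safe one-liner:

# Inspect the alpha/lambda pair chosen by 10-fold CV for glmnet
mod_glmnet$bestTune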
# 5) Best CV RMSE
cv <- tibble(
  Model = c("Linear Regression","PLS","GLMNET"),
  RMSE  = c(min(mod_lm$results$RMSE),
            min(mod_pls$results$RMSE),
            min(mod_glmnet$results$RMSE))
) %>% arrange(RMSE)
print(cv)
## # A tibble: 3 × 2
##   Model              RMSE
##   <chr>             <dbl>
## 1 GLMNET             1.38
## 2 PLS                1.47
## 3 Linear Regression  7.48
cat("\nBest by CV RMSE:", cv$Model[1], "RMSE =", round(cv$RMSE[1], 4), "\n")
## 
## Best by CV RMSE: GLMNET RMSE = 1.3848
# 6) Test-set evaluation (predictions collected for all three models; the
# row-alignment issue flagged by the warnings below is fixed in the next section)
preds <- list(
  LM     = predict(mod_lm,     test),
  PLS    = predict(mod_pls,    test),
  GLMNET = predict(mod_glmnet, test)
)

postResample(predict(mod_glmnet, test), test$Yield)
## Warning in pred - obs: longer object length is not a multiple of shorter object
## length
## Warning in pred - obs: longer object length is not a multiple of shorter object
## length
##     RMSE Rsquared      MAE 
## 1.615587       NA 1.369952
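The warnings arise because predict() silently drops test rows with missing predictors, so the prediction vector is shorter than test$Yield and R recycles one vector against the other, which also produces the NA R-squared. A quick check of the mismatch (sketch):

# The two lengths differ because incomplete test rows are dropped
length(predict(mod_glmnet, test))
length(test$Yield)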

22 Question 6.3c Explanation of the Optimal Model Performance

# Predict only on rows where all predictors are available
predictor_names <- setdiff(names(test), "Yield")
keep_rows <- complete.cases(test[, predictor_names, drop = FALSE])

pred_glmnet <- predict(mod_glmnet, newdata = test[keep_rows, , drop = FALSE])

# Align observed values to the same rows and ensure no NAs
obs <- test$Yield[keep_rows]
valid <- !is.na(pred_glmnet) & !is.na(obs)

pred_glmnet <- pred_glmnet[valid]
obs         <- obs[valid]

# Compute metrics on matched vectors
glmnet_perf <- postResample(pred_glmnet, obs)

# Cross-validated (resampled) RMSE from training
train_cv_rmse <- min(mod_glmnet$results$RMSE)

# Report
cat("\n----------------------------------------\n")
## 
## ----------------------------------------
cat("Performance Comparison: Training vs. Test\n")
## Performance Comparison: Training vs. Test
cat("----------------------------------------\n")
## ----------------------------------------
cat("Cross-validated (training) RMSE :", round(train_cv_rmse, 4), "\n")
## Cross-validated (training) RMSE : 1.3848
cat("Test-set RMSE                   :", round(glmnet_perf["RMSE"], 4), "\n")
## Test-set RMSE                   : 1.2905
cat("Test-set R-squared              :", round(glmnet_perf["Rsquared"], 4), "\n")
## Test-set R-squared              : 0.5877
cat("Test-set MAE                    :", round(glmnet_perf["MAE"], 4), "\n")
## Test-set MAE                    : 1.0508
if (glmnet_perf["RMSE"] > train_cv_rmse) {
  cat("\nInterpretation:\n  Test RMSE > CV RMSE → mild overfitting but reasonable generalization.\n")
} else {
  cat("\nInterpretation:\n  Test RMSE <= CV RMSE → good generalization (sampling variation can cause this).\n")
}
## 
## Interpretation:
##   Test RMSE <= CV RMSE → good generalization (sampling variation can cause this).
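For completeness, the same row alignment can be applied to all three models for a side-by-side test comparison; a minimal sketch reusing keep_rows from above:

# Evaluate each model on the complete-case test rows
sapply(list(LM = mod_lm, PLS = mod_pls, GLMNET = mod_glmnet), function(m) {
  p <- predict(m, newdata = test[keep_rows, , drop = FALSE])
  postResample(p, test$Yield[keep_rows])
})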

23 Question 6.3d Determining the Most Important Predictors

# Predictor importance
importance_glmnet <- varImp(mod_glmnet, scale = TRUE)
print(importance_glmnet)
## glmnet variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess09  77.939
## ManufacturingProcess17  29.657
## ManufacturingProcess06   8.644
## BiologicalMaterial06     7.878
## ManufacturingProcess24   0.000
## ManufacturingProcess26   0.000
## ManufacturingProcess31   0.000
## ManufacturingProcess44   0.000
## BiologicalMaterial09     0.000
## BiologicalMaterial11     0.000
## ManufacturingProcess14   0.000
## ManufacturingProcess05   0.000
## ManufacturingProcess19   0.000
## ManufacturingProcess20   0.000
## ManufacturingProcess34   0.000
## ManufacturingProcess21   0.000
## ManufacturingProcess35   0.000
## ManufacturingProcess43   0.000
## ManufacturingProcess23   0.000
# Plot the top predictors
plot(importance_glmnet, top = 20, main = "Top 20 Important Predictors (GLMNET)")
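For glmnet, caret's varImp is based on the absolute coefficients at the selected lambda, so a complementary view is to pull the nonzero coefficients directly from the final fit (sketch; coefficients are on the centered/scaled predictor scale):

# Nonzero coefficients at the cross-validated lambda; variables shrunk
# to exactly zero have been dropped by the elastic-net penalty
coefs <- coef(mod_glmnet$finalModel, s = mod_glmnet$bestTune$lambda)
coefs[coefs[, 1] != 0, , drop = FALSE]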

24 Question 6.3e Determine Whether Biological or Manufacturing Predictors Dominate

# Identify predictor names
imp_df <- importance_glmnet$importance
imp_df$Predictor <- rownames(imp_df)

# Tag predictors by type
imp_df$Type <- ifelse(grepl("Biological", imp_df$Predictor), "Biological", "Process")

# Summarize importance by group: MeanImportance averages over all predictors
# in the group; Top10Share averages over the group's ten highest-ranked
# predictors (rank ties are included)
group_summary <- imp_df %>%
  group_by(Type) %>%
  summarize(MeanImportance = mean(Overall),
            Top10Share = mean(Overall[rank(-Overall) <= 10]))

print(group_summary)
## # A tibble: 2 × 3
##   Type       MeanImportance Top10Share
##   <chr>               <dbl>      <dbl>
## 1 Biological          0.656      0.656
## 2 Process             4.81      54.1

The GLMNET model identified both biological and process variables as influential, but manufacturing process predictors clearly dominate: their mean importance (4.81) far exceeds that of the biological predictors (0.66), and the highest-ranked variables (ManufacturingProcess32, 09, and 17) are all process parameters. Among the biological inputs, only BiologicalMaterial06 retained nonzero importance. This suggests that controllable process settings drive most of the predictable variation in yield, although biological material quality still contributes.
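A simple dominance check is to count how many predictors in each group survive the elastic-net shrinkage with nonzero importance (sketch, reusing imp_df from above):

# Count predictors with nonzero (scaled) importance per group
imp_df %>%
  filter(Overall > 0) %>%
  count(Type)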

25 Question 6.3f Analysis of the Top Predictors vs. Yield

library(ggplot2)

# Get top predictors from variable importance
top_vars <- rownames(importance_glmnet$importance)[order(-importance_glmnet$importance$Overall)][1:3]
top_vars
## [1] "ManufacturingProcess32" "ManufacturingProcess09" "ManufacturingProcess17"
# Plot each top predictor vs. Yield
for (var in top_vars) {
  # Use the .data pronoun instead of the deprecated aes_string()
  p <- ggplot(train, aes(x = .data[[var]], y = Yield)) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "lm", se = FALSE, color = "blue", linewidth = 1) +
    labs(title = paste("Relationship between", var, "and Yield"),
         x = var,
         y = "Yield") +
    theme_minimal()
  print(p)
}
## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

25.1 Analysis of Scatterplots

Analysis of the top predictors revealed clear trends between the leading process variables and manufacturing yield. ManufacturingProcess17 showed a strong negative linear relationship with yield, so higher settings are associated with lower yield, while ManufacturingProcess32 and ManufacturingProcess09 exhibited positive trends. Because these are observational associations rather than causal effects, they are best treated as hypotheses for controlled process experiments.

These insights can help engineers prioritize which controllable process parameters to adjust and where to improve consistency in biological inputs, potentially leading to higher yields in future production runs.
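A quick numeric check of the plotted relationships (sketch): the signed correlations of the three top predictors with Yield in the training data should match the directions of the fitted lines.

# Signed correlations back up the scatterplot directions
sapply(top_vars, function(v)
  cor(train[[v]], train$Yield, use = "pairwise.complete.obs"))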

26 Conclusions

Spectroscopy Analysis (6.1): PLS with 18 components achieved the best performance (RMSE = 2.159) for predicting moisture content, outperforming PCR, Ridge, Lasso, and Elastic Net. The supervised dimension reduction in PLS provides an advantage over unsupervised PCR by considering the response variable when constructing latent variables.

Permeability Prediction (6.2): The best model achieved R² = 0.4724 on the test set, demonstrating the potential for computational screening to reduce laboratory testing costs in pharmaceutical development.

Manufacturing Optimization (6.3): GLMNET identified key process and biological drivers of manufacturing yield, with process factors clearly dominating the importance rankings. ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess17 emerged as the top three predictors.

27 Methodological Insights

  1. High-dimensional data benefits from regularization: When predictors far outnumber samples, regularization techniques (Ridge, Lasso, Elastic Net) or dimension reduction (PCR, PLS) prevent overfitting (see the sketch after this list).

  2. PLS outperformed PCR in these case studies: When the response variable contains clear signal, supervised dimension reduction (PLS) leverages it more effectively than unsupervised approaches (PCR).

  3. Preprocessing is critical: Proper handling of missing data (median imputation), centering, and scaling are essential for model performance, especially with spectroscopy and manufacturing data.

  4. Cross-validation supports reliable model selection: 10-fold CV provides trustworthy estimates of model performance and helps select appropriate tuning parameters without contaminating the test set.
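As a concrete illustration of point 1, an explicit elastic-net grid makes the ridge-to-lasso continuum visible. This is a sketch of how such a grid could be passed to caret; the analysis above used tuneLength instead, so the grid values here are illustrative only:

# Hypothetical tuning grid mixing ridge (alpha = 0) and lasso (alpha = 1)
enet_grid <- expand.grid(alpha  = seq(0, 1, by = 0.25),
                         lambda = 10^seq(-3, 0, length.out = 20))
# set.seed(123)
# train(Yield ~ ., data = train, method = "glmnet",
#       trControl = ctrl, preProcess = pp,
#       tuneGrid = enet_grid, metric = "RMSE")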