This assignment explores predictive modeling strategies in two real-world scenarios: drug permeability and chemical manufacturing. Both problems involve high-dimensional, noisy data and require balancing model performance with interpretability.
In Exercise 6.2, I analyzed molecular fingerprint data to predict permeability—a key early signal for drug viability. In Exercise 6.3, I modeled product yield using a mix of biological and manufacturing predictors to understand which levers impact production outcomes the most.
Throughout this analysis, I chose a variety of models that represent different learning philosophies—linear, regularized, dimension-reducing, and robust. I didn’t just run models—I asked why each one performed the way it did and what we can actually learn from it. To reduce runtime without compromising insights, I used thoughtful tuning ranges, but kept all modeling logic reproducible and scalable.
In this exercise, the goal was to predict a molecule's permeability from binary structural fingerprints: a classic high-dimensional, sparse-predictor setup. To tackle it, I removed near-zero-variance predictors, trained multiple models, and evaluated test-set performance with RMSE and R-squared.
data(permeability)
filtered_fingerprints <- fingerprints[, -nearZeroVar(fingerprints)]
cat("Remaining predictors:", ncol(filtered_fingerprints), "\n")
## Remaining predictors: 388
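The later chunks reference trainX, trainY, testX, testY, and ctrl without showing how they were built. Here is a minimal sketch of the assumed setup: an 80/20 split and repeated 10-fold cross-validation (consistent with the Fold04.Rep3 warning further down, but still an assumption):
library(caret)                                  # presumably loaded in a setup chunk
set.seed(42)
perm <- permeability[, 1]                       # response as a plain numeric vector
train_idx <- createDataPartition(perm, p = 0.8, list = FALSE)
trainX <- filtered_fingerprints[train_idx, ]
testX  <- filtered_fingerprints[-train_idx, ]
trainY <- perm[train_idx]
testY  <- perm[-train_idx]
ctrl   <- trainControl(method = "repeatedcv", number = 10, repeats = 3)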
enetGrid <- expand.grid(.lambda = seq(0.01, 1, length = 20),
.fraction = seq(0.05, 1.0, length = 20))
set.seed(42)
enet_model <- train(trainX, trainY, method = "enet", preProcess = c("center", "scale"),
tuneGrid = enetGrid, trControl = ctrl)
enet_perf <- postResample(predict(enet_model, testX), testY)
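To confirm which penalty combination the resampling selected, the tuned object can be inspected directly:
enet_model$bestTune   # winning lambda / fraction pair from the CV grid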
lm_model <- train(trainX, trainY, method = "lm", preProcess = c("center", "scale"),
trControl = ctrl)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning: predictions failed for Fold04.Rep3: intercept=TRUE Error in qr.default(tR) : NA/NaN/Inf in foreign function call (arg 1)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
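The PLS and robust linear fits for this exercise, along with the test-set metrics used in the table below (pls_perf, lm_perf, rlm_perf), are not shown above. Here is a sketch of how they were presumably produced; the tuning length and PCA preprocessing are assumptions:
set.seed(42)
pls_model <- train(trainX, trainY, method = "pls",
                   preProcess = c("center", "scale"),
                   tuneLength = 20, trControl = ctrl)       # tuneLength is an assumed choice
set.seed(42)
rlm_model <- train(trainX, trainY, method = "rlm",
                   preProcess = c("pca"), trControl = ctrl)  # PCA preprocessing assumed, mirroring Exercise 6.3
pls_perf <- postResample(predict(pls_model, testX), testY)
lm_perf  <- postResample(predict(lm_model,  testX), testY)
rlm_perf <- postResample(predict(rlm_model, testX), testY)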
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
model_names <- c("PLS", "Elastic Net", "Linear", "Robust LM")
rmse_vals <- c(pls_perf["RMSE"], enet_perf["RMSE"], lm_perf["RMSE"], rlm_perf["RMSE"])
r2_vals <- c(pls_perf["Rsquared"], enet_perf["Rsquared"], lm_perf["Rsquared"], rlm_perf["Rsquared"])
results_6_2 <- data.frame(Model = model_names, RMSE = rmse_vals, Rsquared = r2_vals)
results_6_2 %>% gt() %>% tab_header(title = "Test Set Performance - Permeability Models")
Test Set Performance - Permeability Models

| Model | RMSE | Rsquared |
|---|---|---|
| PLS | 13.10651 | 0.35847272 |
| Elastic Net | 13.80379 | 0.28045481 |
| Linear | 30.18524 | 0.07371087 |
| Robust LM | 12.87864 | 0.36924451 |
Looking at the test-set table, the robust linear model actually posted the best RMSE and R², with PLS close behind; Elastic Net trailed slightly, even though regularization is a natural fit for sparse binary fingerprint data. I tweaked the lambda grid to avoid zeros that were causing model errors. PLS is still attractive if we want interpretability through dimension reduction. Plain linear regression ran but threw rank-deficiency warnings (almost certainly collinearity among the fingerprints) and performed far worse. RLM gave convergence warnings but still produced reasonable predictions.
Bottom line? Some form of regularized, projection-based, or robust fitting is essential here; Elastic Net remains a strong pick if the lab wants built-in feature selection, though the robust model narrowly won this particular split.
This problem focuses on modeling product yield based on 57 predictors from a chemical manufacturing process. Some of the predictors describe the biological input materials (which we can’t control), and others reflect the actual process (which we can optimize). The goal: boost yield, which translates directly to revenue.
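The modeling chunks below use trainX, trainY, testX, testY, and processPredictors objects that never appear on the page. Here is a minimal sketch of the assumed data preparation; median imputation and the 80/20 split are illustrative choices, not necessarily what was originally run:
library(AppliedPredictiveModeling)
library(caret)
data(ChemicalManufacturingProcess)
yield <- ChemicalManufacturingProcess$Yield
raw_predictors <- ChemicalManufacturingProcess[, names(ChemicalManufacturingProcess) != "Yield"]
# Impute the scattered missing predictor values; median imputation keeps raw units
processPredictors <- predict(preProcess(raw_predictors, method = "medianImpute"),
                             raw_predictors)
set.seed(123)
train_idx <- createDataPartition(yield, p = 0.8, list = FALSE)
trainX <- processPredictors[train_idx, ]
trainY <- yield[train_idx]
testX  <- processPredictors[-train_idx, ]
testY  <- yield[-train_idx]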
enetGrid <- expand.grid(.lambda = seq(0, 1, length = 20), .fraction = seq(0.05, 1.0, length = 20))
set.seed(123)
enet_model <- train(trainX, trainY, method = "enet", preProcess = c("center", "scale"),
tuneGrid = enetGrid, trControl = trainControl(method = "repeatedcv", repeats = 5))
enet_pred <- predict(enet_model, newdata = testX)
enet_perf <- postResample(enet_pred, testY)
set.seed(123)
rlm_model <- train(trainX, trainY, method = "rlm", preProcess = c("pca"),
trControl = trainControl(method = "repeatedcv", repeats = 5))
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
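The resampling comparison below also pulls in PLS, ridge, and ordinary least squares fits that are not shown; here is a sketch of how they might have been trained, with the tuning ranges as assumptions:
ctrl_cmp <- trainControl(method = "repeatedcv", repeats = 5)  # same control spec as the chunks above
set.seed(123)
pls_model <- train(trainX, trainY, method = "pls",
                   preProcess = c("center", "scale"),
                   tuneLength = 20, trControl = ctrl_cmp)
set.seed(123)
ridge_model <- train(trainX, trainY, method = "ridge",
                     preProcess = c("center", "scale"),
                     tuneGrid = data.frame(.lambda = seq(0, 0.2, length = 15)),
                     trControl = ctrl_cmp)
set.seed(123)
lm_model <- train(trainX, trainY, method = "lm",
                  preProcess = c("center", "scale"), trControl = ctrl_cmp)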
resamp <- resamples(list(PLS = pls_model, ENET = enet_model, RIDGE = ridge_model, LM = lm_model, RLM = rlm_model))
dotplot(resamp, metric = "RMSE")
summary(resamp)
##
## Call:
## summary.resamples(object = resamp)
##
## Models: PLS, ENET, RIDGE, LM, RLM
## Number of resamples: 50
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.4901194 0.8411935 0.9419645 1.0172239 1.078578 2.091812 0
## ENET 0.4578812 0.8028128 0.9160489 0.9419633 1.099414 2.086684 0
## RIDGE 0.5472793 0.9207303 1.0225874 1.3327350 1.312656 5.114303 0
## LM 0.6041532 0.9510952 1.2788013 3.3332350 1.592204 25.611101 0
## RLM 0.4044000 0.8424893 0.9581984 0.9848089 1.098869 1.933165 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.6194789 1.0142975 1.175663 1.324945 1.317171 3.557419 0
## ENET 0.5338541 0.9863876 1.115972 1.181103 1.292471 3.862453 0
## RIDGE 0.7241019 1.0940307 1.260507 2.240337 1.616717 13.374262 0
## LM 0.7763929 1.1933689 1.671472 8.972368 2.459775 83.140849 0
## RLM 0.6093405 1.0554153 1.189753 1.266609 1.335651 3.401527 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.118905372 0.4630853 0.6313479 0.5930538 0.7369333 0.8550181 0
## ENET 0.022608601 0.5392692 0.6145882 0.6137841 0.7215743 0.8869535 0
## RIDGE 0.011600264 0.4536225 0.5803032 0.5279555 0.6926405 0.8341017 0
## LM 0.002651769 0.2731695 0.4104689 0.4177589 0.6126618 0.8377447 0
## RLM 0.117136487 0.4860018 0.6125922 0.5896516 0.7197702 0.8562275 0
# Refit the elastic net with the elasticnet package; lambda and fraction
# presumably mirror the caret-selected tuning values (both sit on the grids above).
enet_base_model <- enet(x = as.matrix(trainX), y = trainY, lambda = 0.5263158, normalize = TRUE)
coef_values <- predict(enet_base_model, newx = as.matrix(testX), s = 0.35, mode = "fraction", type = "coefficients")
coef_values$coefficients[coef_values$coefficients != 0]   # predictors the penalty kept
## BiologicalMaterial02 BiologicalMaterial03 BiologicalMaterial06
## 2.667422e-02 2.229472e-02 5.184931e-02
## ManufacturingProcess06 ManufacturingProcess09 ManufacturingProcess11
## 6.993782e-02 1.979065e-01 9.730189e-02
## ManufacturingProcess13 ManufacturingProcess15 ManufacturingProcess17
## -2.166571e-01 5.696756e-04 -1.953166e-01
## ManufacturingProcess30 ManufacturingProcess32 ManufacturingProcess36
## 3.421015e-03 9.594831e-02 -3.213657e+02
## ManufacturingProcess37 ManufacturingProcess44
## -1.464612e-01 2.776336e-02
# Sensitivity check: hold every predictor at its mean and sweep
# ManufacturingProcess32 across its observed range, then predict yield.
p_range <- range(processPredictors$ManufacturingProcess32)
variation <- seq(from = p_range[1], to = p_range[2], length.out = 100)
mean_predictors <- apply(processPredictors, 2, mean)
newdata <- matrix(rep(mean_predictors, each = 100), nrow = 100)
colnames(newdata) <- colnames(processPredictors)
newdata <- as.data.frame(newdata)
newdata$ManufacturingProcess32 <- variation
y_hat <- predict(enet_base_model, newx = as.matrix(newdata), s = 0.35, mode = "fraction", type = "fit")
plot(variation, y_hat$fit, type = 'l', lwd = 2, col = "steelblue",
     xlab = "Variation in ManufacturingProcess32", ylab = "Predicted Yield")
grid()
The elastic net model offered the best mix of accuracy and interpretability in the resampling comparison. I also fit ridge regression to compare regularization styles: ridge shrinks every coefficient rather than sparsifying the way ENET does. Ridge held up reasonably well, but ENET pulled ahead with lower resampled RMSE and easier-to-interpret variable importance.
Most of the top predictors across models were process variables, which makes sense—those are what we can actually adjust. ManufacturingProcess32 consistently stood out as a key driver of yield, so optimizing that setting could lead to real revenue impact.
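The cross-model ranking described above presumably came from caret's variable importance on the tuned models (together with the elastic net coefficients shown earlier); a minimal sketch for the PLS fit, assuming the pls_model object sketched before the resampling comparison:
pls_imp <- varImp(pls_model)   # PLS-specific importance scores from caret
plot(pls_imp, top = 15)        # quick visual check of process vs. biological predictors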
This assignment deepened my understanding of how model choice interacts with data shape and context. For sparse, high-dimensional data like molecular fingerprints, I saw how regularized, projection-based, and robust approaches handily beat basic linear regression, and how elastic net only became competitive after I adjusted the lambda grid to avoid zero values.
In the chemical manufacturing problem, ENET again offered strong performance with interpretable coefficients. I also appreciated how ridge and robust linear models helped stabilize predictions in noisier conditions.
One thing I’d do differently with more time or compute: run deeper tuning with more cross-validation folds. But the insights I gained from this more targeted modeling process were still valuable—and applicable to real-world public health or pharma operations where fast, interpretable models can inform decision-making without overfitting.