This assignment explores predictive modeling strategies in two real-world scenarios: drug permeability and chemical manufacturing. Both problems involve high-dimensional, noisy data and require balancing model performance with interpretability.
In Exercise 6.2, I analyzed molecular fingerprint data to predict permeability—a key early signal for drug viability. In Exercise 6.3, I modeled product yield using a mix of biological and manufacturing predictors to understand which levers impact production outcomes the most.
Throughout this analysis, I chose a variety of models that represent different learning philosophies—linear, regularized, dimension-reducing, and robust. I didn’t just run models—I asked why each one performed the way it did and what we can actually learn from it. To reduce runtime without compromising insights, I used thoughtful tuning ranges, but kept all modeling logic reproducible and scalable.
In this exercise, the goal was to predict a molecule's permeability from binary structural fingerprints: a classic high-dimensional, sparse-predictor setup. To tackle it, I removed near-zero-variance predictors, trained multiple models, and evaluated test-set performance with RMSE and R-squared.
data(permeability)
filtered_fingerprints <- fingerprints[, -nearZeroVar(fingerprints)]
cat("Remaining predictors:", ncol(filtered_fingerprints), "\n")
## Remaining predictors: 388
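The later chunks reference trainX, trainY, testX, testY, and ctrl without showing how they were built. Here is a minimal sketch of the assumed setup: an 80/20 split and repeated 10-fold cross-validation (consistent with the Fold04.Rep3 warning further down, but still an assumption):
library(caret)                                  # presumably loaded in a setup chunk
set.seed(42)
perm <- permeability[, 1]                       # response as a plain numeric vector
train_idx <- createDataPartition(perm, p = 0.8, list = FALSE)
trainX <- filtered_fingerprints[train_idx, ]
testX  <- filtered_fingerprints[-train_idx, ]
trainY <- perm[train_idx]
testY  <- perm[-train_idx]
ctrl   <- trainControl(method = "repeatedcv", number = 10, repeats = 3)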
enetGrid <- expand.grid(.lambda = seq(0.01, 1, length = 20),
.fraction = seq(0.05, 1.0, length = 20))
set.seed(42)
enet_model <- train(trainX, trainY, method = "enet", preProcess = c("center", "scale"),
tuneGrid = enetGrid, trControl = ctrl)
enet_perf <- postResample(predict(enet_model, testX), testY)
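To confirm which penalty combination the resampling selected, the tuned object can be inspected directly:
enet_model$bestTune   # winning lambda / fraction pair from the CV grid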
lm_model <- train(trainX, trainY, method = "lm", preProcess = c("center", "scale"),
trControl = ctrl)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning: predictions failed for Fold04.Rep3: intercept=TRUE Error in qr.default(tR) : NA/NaN/Inf in foreign function call (arg 1)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
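The PLS and robust linear fits for this exercise, along with the test-set metrics used in the table below (pls_perf, lm_perf, rlm_perf), are not shown above. Here is a sketch of how they were presumably produced; the tuning length and PCA preprocessing are assumptions:
set.seed(42)
pls_model <- train(trainX, trainY, method = "pls",
                   preProcess = c("center", "scale"),
                   tuneLength = 20, trControl = ctrl)       # tuneLength is an assumed choice
set.seed(42)
rlm_model <- train(trainX, trainY, method = "rlm",
                   preProcess = c("pca"), trControl = ctrl)  # PCA preprocessing assumed, mirroring Exercise 6.3
pls_perf <- postResample(predict(pls_model, testX), testY)
lm_perf  <- postResample(predict(lm_model,  testX), testY)
rlm_perf <- postResample(predict(rlm_model, testX), testY)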
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
model_names <- c("PLS", "Elastic Net", "Linear", "Robust LM")
rmse_vals <- c(pls_perf["RMSE"], enet_perf["RMSE"], lm_perf["RMSE"], rlm_perf["RMSE"])
r2_vals <- c(pls_perf["Rsquared"], enet_perf["Rsquared"], lm_perf["Rsquared"], rlm_perf["Rsquared"])
results_6_2 <- data.frame(Model = model_names, RMSE = rmse_vals, Rsquared = r2_vals)
results_6_2 %>% gt() %>% tab_header(title = "Test Set Performance - Permeability Models")
Test Set Performance - Permeability Models

| Model | RMSE | Rsquared |
|---|---|---|
| PLS | 13.10651 | 0.35847272 |
| Elastic Net | 13.80379 | 0.28045481 |
| Linear | 30.18524 | 0.07371087 |
| Robust LM | 12.87864 | 0.36924451 |
Looking at the test-set table, the robust linear model actually posted the best RMSE and R², with PLS close behind; Elastic Net trailed slightly, even though regularization is a natural fit for sparse binary fingerprint data. I tweaked the lambda grid to avoid zeros that were causing model errors. PLS is still attractive if we want interpretability through dimension reduction. Plain linear regression ran but threw rank-deficiency warnings (almost certainly collinearity among the fingerprints) and performed far worse. RLM gave convergence warnings but still produced reasonable predictions.
Bottom line? Some form of regularized, projection-based, or robust fitting is essential here; Elastic Net remains a strong pick if the lab wants built-in feature selection, though the robust model narrowly won this particular split.
This problem focuses on modeling product yield based on 57 predictors from a chemical manufacturing process. Some of the predictors describe the biological input materials (which we can’t control), and others reflect the actual process (which we can optimize). The goal: boost yield, which translates directly to revenue.
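The modeling chunks below use trainX, trainY, testX, testY, and processPredictors objects that never appear on the page. Here is a minimal sketch of the assumed data preparation; median imputation and the 80/20 split are illustrative choices, not necessarily what was originally run:
library(AppliedPredictiveModeling)
library(caret)
data(ChemicalManufacturingProcess)
yield <- ChemicalManufacturingProcess$Yield
raw_predictors <- ChemicalManufacturingProcess[, names(ChemicalManufacturingProcess) != "Yield"]
# Impute the scattered missing predictor values; median imputation keeps raw units
processPredictors <- predict(preProcess(raw_predictors, method = "medianImpute"),
                             raw_predictors)
set.seed(123)
train_idx <- createDataPartition(yield, p = 0.8, list = FALSE)
trainX <- processPredictors[train_idx, ]
trainY <- yield[train_idx]
testX  <- processPredictors[-train_idx, ]
testY  <- yield[-train_idx]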
enetGrid <- expand.grid(.lambda = seq(0, 1, length = 20), .fraction = seq(0.05, 1.0, length = 20))
set.seed(123)
enet_model <- train(trainX, trainY, method = "enet", preProcess = c("center", "scale"),
tuneGrid = enetGrid, trControl = trainControl(method = "repeatedcv", repeats = 5))
enet_pred <- predict(enet_model, newdata = testX)
enet_perf <- postResample(enet_pred, testY)
set.seed(123)
rlm_model <- train(trainX, trainY, method = "rlm", preProcess = c("pca"),
trControl = trainControl(method = "repeatedcv", repeats = 5))
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
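The resampling comparison below also pulls in PLS, ridge, and ordinary least squares fits that are not shown; here is a sketch of how they might have been trained, with the tuning ranges as assumptions:
ctrl_cmp <- trainControl(method = "repeatedcv", repeats = 5)  # same control spec as the chunks above
set.seed(123)
pls_model <- train(trainX, trainY, method = "pls",
                   preProcess = c("center", "scale"),
                   tuneLength = 20, trControl = ctrl_cmp)
set.seed(123)
ridge_model <- train(trainX, trainY, method = "ridge",
                     preProcess = c("center", "scale"),
                     tuneGrid = data.frame(.lambda = seq(0, 0.2, length = 15)),
                     trControl = ctrl_cmp)
set.seed(123)
lm_model <- train(trainX, trainY, method = "lm",
                  preProcess = c("center", "scale"), trControl = ctrl_cmp)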
resamp <- resamples(list(PLS = pls_model, ENET = enet_model, RIDGE = ridge_model, LM = lm_model, RLM = rlm_model))
dotplot(resamp, metric = "RMSE")
summary(resamp)
##
## Call:
## summary.resamples(object = resamp)
##
## Models: PLS, ENET, RIDGE, LM, RLM
## Number of resamples: 50
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.4901194 0.8411935 0.9419645 1.0172239 1.078578 2.091812 0
## ENET 0.4578812 0.8028128 0.9160489 0.9419633 1.099414 2.086684 0
## RIDGE 0.5472793 0.9207303 1.0225874 1.3327350 1.312656 5.114303 0
## LM 0.6041532 0.9510952 1.2788013 3.3332350 1.592204 25.611101 0
## RLM 0.4044000 0.8424893 0.9581984 0.9848089 1.098869 1.933165 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.6194789 1.0142975 1.175663 1.324945 1.317171 3.557419 0
## ENET 0.5338541 0.9863876 1.115972 1.181103 1.292471 3.862453 0
## RIDGE 0.7241019 1.0940307 1.260507 2.240337 1.616717 13.374262 0
## LM 0.7763929 1.1933689 1.671472 8.972368 2.459775 83.140849 0
## RLM 0.6093405 1.0554153 1.189753 1.266609 1.335651 3.401527 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.118905372 0.4630853 0.6313479 0.5930538 0.7369333 0.8550181 0
## ENET 0.022608601 0.5392692 0.6145882 0.6137841 0.7215743 0.8869535 0
## RIDGE 0.011600264 0.4536225 0.5803032 0.5279555 0.6926405 0.8341017 0
## LM 0.002651769 0.2731695 0.4104689 0.4177589 0.6126618 0.8377447 0
## RLM 0.117136487 0.4860018 0.6125922 0.5896516 0.7197702 0.8562275 0
# Refit the elastic net with the elasticnet package; lambda and fraction
# presumably mirror the caret-selected tuning values (both sit on the grids above).
enet_base_model <- enet(x = as.matrix(trainX), y = trainY, lambda = 0.5263158, normalize = TRUE)
coef_values <- predict(enet_base_model, newx = as.matrix(testX), s = 0.35, mode = "fraction", type = "coefficients")
coef_values$coefficients[coef_values$coefficients != 0]   # predictors the penalty kept
## BiologicalMaterial02 BiologicalMaterial03 BiologicalMaterial06
## 2.667422e-02 2.229472e-02 5.184931e-02
## ManufacturingProcess06 ManufacturingProcess09 ManufacturingProcess11
## 6.993782e-02 1.979065e-01 9.730189e-02
## ManufacturingProcess13 ManufacturingProcess15 ManufacturingProcess17
## -2.166571e-01 5.696756e-04 -1.953166e-01
## ManufacturingProcess30 ManufacturingProcess32 ManufacturingProcess36
## 3.421015e-03 9.594831e-02 -3.213657e+02
## ManufacturingProcess37 ManufacturingProcess44
## -1.464612e-01 2.776336e-02
# Sensitivity check: hold every predictor at its mean and sweep
# ManufacturingProcess32 across its observed range, then predict yield.
p_range <- range(processPredictors$ManufacturingProcess32)
variation <- seq(from = p_range[1], to = p_range[2], length.out = 100)
mean_predictors <- apply(processPredictors, 2, mean)
newdata <- matrix(rep(mean_predictors, each = 100), nrow = 100)
colnames(newdata) <- colnames(processPredictors)
newdata <- as.data.frame(newdata)
newdata$ManufacturingProcess32 <- variation
y_hat <- predict(enet_base_model, newx = as.matrix(newdata), s = 0.35, mode = "fraction", type = "fit")
plot(variation, y_hat$fit, type = 'l', lwd = 2, col = "steelblue",
     xlab = "Variation in ManufacturingProcess32", ylab = "Predicted Yield")
grid()
The elastic net model offered the best mix of accuracy and interpretability in the resampling comparison. I also fit ridge regression to compare regularization styles: ridge shrinks every coefficient rather than sparsifying the way ENET does. Ridge held up reasonably well, but ENET pulled ahead with lower resampled RMSE and easier-to-interpret variable importance.
Most of the top predictors across models were process variables, which makes sense—those are what we can actually adjust. ManufacturingProcess32 consistently stood out as a key driver of yield, so optimizing that setting could lead to real revenue impact.
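The cross-model ranking described above presumably came from caret's variable importance on the tuned models (together with the elastic net coefficients shown earlier); a minimal sketch for the PLS fit, assuming the pls_model object sketched before the resampling comparison:
pls_imp <- varImp(pls_model)   # PLS-specific importance scores from caret
plot(pls_imp, top = 15)        # quick visual check of process vs. biological predictors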
This assignment deepened my understanding of how model choice interacts with data shape and context. For sparse, high-dimensional data like molecular fingerprints, I saw how regularized, projection-based, and robust approaches handily beat basic linear regression, and how elastic net only became competitive after I adjusted the lambda grid to avoid zero values.
In the chemical manufacturing problem, ENET again offered strong performance with interpretable coefficients. I also appreciated how ridge and robust linear models helped stabilize predictions in noisier conditions.
One thing I’d do differently with more time or compute: run deeper tuning with more cross-validation folds. But the insights I gained from this more targeted modeling process were still valuable—and applicable to real-world public health or pharma operations where fast, interpretable models can inform decision-making without overfitting.