DATA624 Assignment 7

6.2. Developing a model to predict permeability (see Sect.1.4) could save sig nificant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
library(caret)
library(glmnet)
library(pls)
library(dplyr)
library(ggplot2)
library(lattice)
library(RANN)
library(purrr)
library(e1071)
library(patchwork)
library(corrplot)
options(scipen = 999)

data(permeability)
data("ChemicalManufacturingProcess")

dim(fingerprints)

## [1]  165 1107

length(permeability)

## [1] 165

summary(permeability)

##   permeability  
##  Min.   : 0.06  
##  1st Qu.: 1.55  
##  Median : 4.91  
##  Mean   :12.24  
##  3rd Qu.:15.47  
##  Max.   :55.60

The predictor matrix has 165 rows and 1,107 columns, while the response has length 165. So each row is one compound, and each predictor is a binary indicator showing whether a certain molecular substructure is present.

hist(
  permeability,
  main = "Distribution of Permeability",
  xlab = "Permeability"
)

The response variable is clearly right-skewed. That means most compounds have relatively low permeability (0 - 10), and rest of observed outcomes has much larger premeability values. So the data are not symmetric at all.

That does not automatically mean I have to mathematically transform the response, but it is something worth to pay attention to because it can make large errors on high-permeability compounds more influential.

This plot shows the main issue with these predictors: a very large portion of them are barely present. As we see almost 700 predictors has frequency less or equal to 0.1. In other words, a lot of columns carry almost no information because they are nearly all zeros.

That is exactly why nearZeroVar() makes sense here.

(b) Remove low-frequency predictors using `nearZeroVar()`

nzv_full <- nearZeroVar(fingerprints)
length(nzv_full)

## [1] 719

predictors_left_full <- ncol(fingerprints[, -nzv_full])
predictors_left_full

## [1] 388

Strictly following the order of the assignment, 388 predictors are left after removing near-zero-variance predictors from the full fingerprint matrix.

Modeling version using training data only

Now I switch to the modeling workflow. Here I do the split first, then estimate the filter on the training set only, because that is the safer practice.

set.seed(100)
train_index <- createDataPartition(permeability, p = 0.80, list = FALSE)

train_x <- fingerprints[train_index, ]
test_x  <- fingerprints[-train_index, ]
train_y <- permeability[train_index]
test_y  <- permeability[-train_index]

nrow(train_x)

## [1] 133

nrow(test_x)

## [1] 32

nzv_train <- nearZeroVar(train_x)
#length(nzv_train)

train_x_nzv <- train_x[, -nzv_train]
test_x_nzv  <- test_x[, -nzv_train]

ncol(train_x_nzv)

## [1] 396

After applying nzv function on the training set only, 396 predictors remain for the actual modeling.

Cross-validation setup

set.seed(123)
ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5
)

I used repeated 10-fold cross-validation instead of just one round of 10-fold CV. Because the sample is small, and with only 165 observations the resampling results can repeat around a lot from one split to another. Right -skewed distribution is another reason - repeated 10-fold CV helps to mix it up between folds. Repeating CV gives a more stable estimate than relying on one single partitioning.

PLS Model

PLS helps by building latent components that try to keep the predictor information that is most useful for the response.

set.seed(123)
pls_fit <- train(
  x = train_x_nzv,
  y = train_y,
  method = "pls",
  tuneGrid = expand.grid(ncomp = 1:20),
  trControl = ctrl,
  preProcess = c("center", "scale")
)

pls_fit

## Partial Least Squares 
## 
## 133 samples
## 396 predictors
## 
## Pre-processing: centered (396), scaled (396) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 120, 119, 118, 120, 121, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.84059  0.3522603   9.656234
##    2     11.66566  0.4643995   8.106552
##    3     11.80324  0.4598323   8.787785
##    4     11.75169  0.4718315   8.874875
##    5     11.50464  0.4897680   8.637181
##    6     11.61033  0.4841033   8.654016
##    7     11.57332  0.4877868   8.867224
##    8     11.58504  0.4891956   8.900336
##    9     11.79853  0.4788735   9.037576
##   10     11.86291  0.4768328   9.097023
##   11     12.01244  0.4676747   9.106455
##   12     12.19512  0.4573812   9.293734
##   13     12.39931  0.4473301   9.425587
##   14     12.80076  0.4268150   9.751304
##   15     13.09983  0.4104290   9.965521
##   16     13.38718  0.3943337  10.123217
##   17     13.63205  0.3828407  10.251011
##   18     13.63477  0.3856468  10.307576
##   19     13.78128  0.3765195  10.360060
##   20     13.88779  0.3720827  10.378178
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 5.

plot(pls_fit)

pls_best_ncomp <- pls_fit$bestTune$ncomp
pls_cv_row <- pls_fit$results |> filter(ncomp == pls_best_ncomp)
pls_cv_row

pls_cv_row$Rsquared

## [1] 0.489768

Repeated CV for PLS model has:

optimal number of latent variables: 5
resampled estimate of \(R^2\): 0.49

\(R^2\) around 0.5 means the model explains a half of variability, and there is still a lot left unexplained.

d. Predict the response for the test set. What is the test set estimate of R2?

pls_pred <- predict(pls_fit, newdata = test_x_nzv)
pls_test_metrics <- postResample(pred = pls_pred, obs = test_y)
pls_test_metrics

##       RMSE   Rsquared        MAE 
## 11.9920208  0.4659888  9.0730362

pls_test_metrics["Rsquared"]

##  Rsquared 
## 0.4659888

The PLS model gets a test-set \(R^2\) of 0.466.

Test-set \(R^2\) is also low, and it is telling that PLS model is just explaining around half of variability.

plot_pls_test_pred <-  xyplot(test_y ~ pls_pred,
       type = c("p", "g"),
       xlab = "Predicted",
       ylab = "Observed",
       main = "Observed vs Predicted: PLS")

pls_resid <- test_y - pls_pred
plot_pls_resid_pred <-  xyplot(pls_resid ~ pls_pred,
       type = c("p", "g"),
       xlab = "Predicted",
       ylab = "Residuals",
       main = "Residuals vs Predicted: PLS")

print(plot_pls_test_pred, split = c(1,1,2,1), more = TRUE)
print(plot_pls_resid_pred, split = c(2,1,2,1))

The observed vs predicted plot shows some relationship, so the model is not guessing randomly. But it is also pretty scattered. Especially there are a lot of overpredicted values. That tells me the model has some predictive ability, just not enough to rely on predictions to make decisions.

The residual plot is also worth paying attention to. The spread does not look perfectly constant, and the errors seem to get more uneven for larger predicted values. To me that suggests the model overpredicting actual low value permeatabilities.

The code below collapses the whole training matrix into one long vector, then returns the unique values across all compounds and all predictors.

unique(as.vector(as.matrix(train_x_nzv)))

## [1] 0 1

After near-zero-variance filtering, all predictors still remain binary sparse fingerprints, so each variable still has limited information beyond presence or absence of a particular molecular fragment. As we see across the entire filtered training set, all predictor values are only 0 and 1. This makes the latent PLS components less capable of separating permeability strong enough. Because of that, the model captures only moderate variation in the response, which is why the \(R^2\) levels off around 0.5 for both train and test sets.

plsImpVars <- varImp(pls_fit)

plot(plsImpVars, top = 25)

The variable importance plot shows that “X6” predictor contibutes strongly to the model, while most of the remaining variables have much smaller and very similar importance. In my opinion, this means the prediction signal is concentrated in a small number of predictors instead of being spread evenly across all variables. This also helps explain why methods like PCR and PLS only improve up to a moderate \(R^2\) level and then start to level off.

Below I tried to remove “X6”

train_x_nzv_reduced <- as.data.frame(
  train_x_nzv[, !colnames(train_x_nzv) %in% "X6", drop = FALSE]
)

set.seed(123)
pls_fit_reduced <- train(
  x = train_x_nzv_reduced,
  y = train_y,
  method = "pls",
  tuneGrid = expand.grid(ncomp = 1:20),
  trControl = ctrl,
  preProcess = c("center", "scale")
)
pls_fit_reduced

## Partial Least Squares 
## 
## 133 samples
## 395 predictors
## 
## Pre-processing: centered (395), scaled (395) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 120, 119, 118, 120, 121, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.88284  0.3482764   9.691436
##    2     11.74746  0.4580154   8.191121
##    3     11.93164  0.4486950   8.902326
##    4     11.95298  0.4540722   9.015583
##    5     11.82080  0.4632357   8.921155
##    6     11.98935  0.4521655   8.924487
##    7     12.05868  0.4495213   9.179019
##    8     12.14314  0.4459191   9.334392
##    9     12.38088  0.4351459   9.455941
##   10     12.44215  0.4325274   9.555917
##   11     12.54461  0.4238030   9.503967
##   12     12.61845  0.4202059   9.507296
##   13     12.79600  0.4110140   9.641411
##   14     13.25941  0.3881722  10.053404
##   15     13.52470  0.3737405  10.252928
##   16     13.86880  0.3571612  10.506097
##   17     14.12372  0.3474299  10.687166
##   18     14.15501  0.3490414  10.734832
##   19     14.29293  0.3419871  10.808897
##   20     14.43904  0.3365980  10.890312
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.

After removing the most important predictor X6, the PLS model performance dropped from about \(R^2\) = 0.50 to \(R^2\) = 0.46. In my opinion, this confirms that X6 was carrying a large part of the predictive signal. The optimal number of components also dropped from 5 to 2, which suggests that once the strongest variable is removed, the remaining predictors contain much weaker shared information.

(e) Try other models from this chapter

Because this is a high-dimensional problem, I compared PLS with several other models:

PCR
linear regression after PCA preprocessing
ridge regression
lasso regression
elastic net

PCR

set.seed(124)
pcr_fit <- train(
  x = train_x_nzv,
  y = train_y,
  method = "pcr",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = c("center", "scale")
)

pcr_fit

## Principal Component Analysis 
## 
## 133 samples
## 396 predictors
## 
## Pre-processing: centered (396), scaled (396) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 120, 119, 119, 118, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     14.75991  0.1366024  11.471445
##    2     14.81556  0.1454097  11.491445
##    3     13.31486  0.3191248  10.113654
##    4     12.29331  0.3909504   9.080689
##    5     12.30541  0.3929584   9.073885
##    6     12.29685  0.4009028   8.998754
##    7     12.20436  0.4092524   8.854813
##    8     11.99252  0.4204403   8.527498
##    9     12.03173  0.4226557   8.611716
##   10     12.02888  0.4218802   8.823020
##   11     12.05373  0.4182559   8.820339
##   12     12.13574  0.4129682   8.921829
##   13     11.92096  0.4397459   8.692089
##   14     11.99718  0.4326545   8.761314
##   15     12.12534  0.4225632   8.874849
##   16     12.09480  0.4232677   8.852109
##   17     12.20311  0.4125943   9.005548
##   18     11.84678  0.4567672   8.875113
##   19     11.92404  0.4504026   8.927245
##   20     11.75685  0.4564399   8.802743
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 20.

pcr_best_ncomp <- pcr_fit$bestTune$ncomp 
pcr_cv_row <- pcr_fit$results |> filter(ncomp == pcr_best_ncomp) 
pcr_pred <- predict(pcr_fit, newdata = test_x_nzv) 
pcr_test_metrics <- postResample(pred = pcr_pred, obs = test_y)

Repeated CV for PCR model has:

optimal number of principal components: 20
resampled estimate of \(R^2\): 0.456

As we see PCR needed a large number of components (20) before performance stabilized, which suggests that permeatability signal is distributed across many weak orthogonal directions rather than a few dominant ones. Even with 20 components, the repeated CV \(R^2\) stayed below the PLS model, which makes sense because PCR uses PCA unsupervised dimension reduction and PLS uses supervised dimension reduction.

Linear regression with PCA preprocessing

set.seed(125)
lm_pca_fit <- train(
  x = train_x_nzv,
  y = train_y,
  method = "lm",
  preProcess = c("center", "scale", "pca"),
  trControl = ctrl
)

lm_pca_fit

## Linear Regression 
## 
## 133 samples
## 396 predictors
## 
## Pre-processing: centered (396), scaled (396), principal component
##  signal extraction (396) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 121, 119, 119, 119, 121, 120, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   11.74287  0.4664128  8.878695
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

lm_pca_cv_row <- lm_pca_fit$results[1, ]
lm_pca_pred <- predict(lm_pca_fit, newdata = test_x_nzv)
lm_pca_test_metrics <- postResample(pred = lm_pca_pred, obs = test_y)

This is basically a simpler dimensionality-reduction approach. It can work, but it is not as directly tailored to the response as PLS.

Ridge regression

set.seed(126)
ridge_grid <- expand.grid(
  alpha = 0,
  lambda = 10^seq(-4, 1, length = 30)
)

ridge_fit <- train(
  x = train_x_nzv,
  y = train_y,
  method = "glmnet",
  tuneGrid = ridge_grid,
  trControl = ctrl,
  preProcess = c("center", "scale")
)

ridge_fit

## glmnet 
## 
## 133 samples
## 396 predictors
## 
## Pre-processing: centered (396), scaled (396) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 120, 120, 119, 120, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda         RMSE     Rsquared   MAE    
##    0.0001000000  11.3053  0.5050116  8.23259
##    0.0001487352  11.3053  0.5050116  8.23259
##    0.0002212216  11.3053  0.5050116  8.23259
##    0.0003290345  11.3053  0.5050116  8.23259
##    0.0004893901  11.3053  0.5050116  8.23259
##    0.0007278954  11.3053  0.5050116  8.23259
##    0.0010826367  11.3053  0.5050116  8.23259
##    0.0016102620  11.3053  0.5050116  8.23259
##    0.0023950266  11.3053  0.5050116  8.23259
##    0.0035622479  11.3053  0.5050116  8.23259
##    0.0052983169  11.3053  0.5050116  8.23259
##    0.0078804628  11.3053  0.5050116  8.23259
##    0.0117210230  11.3053  0.5050116  8.23259
##    0.0174332882  11.3053  0.5050116  8.23259
##    0.0259294380  11.3053  0.5050116  8.23259
##    0.0385662042  11.3053  0.5050116  8.23259
##    0.0573615251  11.3053  0.5050116  8.23259
##    0.0853167852  11.3053  0.5050116  8.23259
##    0.1268961003  11.3053  0.5050116  8.23259
##    0.1887391822  11.3053  0.5050116  8.23259
##    0.2807216204  11.3053  0.5050116  8.23259
##    0.4175318937  11.3053  0.5050116  8.23259
##    0.6210169419  11.3053  0.5050116  8.23259
##    0.9236708572  11.3053  0.5050116  8.23259
##    1.3738237959  11.3053  0.5050116  8.23259
##    2.0433597179  11.3053  0.5050116  8.23259
##    3.0391953823  11.3053  0.5050116  8.23259
##    4.5203536564  11.3053  0.5050116  8.23259
##    6.7233575365  11.3053  0.5050116  8.23259
##   10.0000000000  11.3053  0.5050116  8.23259
## 
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 10.

Ridge selected the largest lambda in the tuning grid (10). I actually think this makes sense for this dataset because we still have 396 sparse binary predictors but only 133 training compounds. The fact that model metric’s stayed almost identical across the lambda grid suggests the model is very stable under ridge shrinkage, and even strong penalization does not remove the signal.

To me that means permeability information is likely spread across many weak correlated fingerprint variables, so shrinking everything heavily works better than trying to let individual coefficients become large.

ridge_cv_row <- ridge_fit$results |>
  filter(alpha == ridge_fit$bestTune$alpha,
         lambda == ridge_fit$bestTune$lambda)

ridge_pred <- predict(ridge_fit, newdata = test_x_nzv)
ridge_test_metrics <- postResample(pred = ridge_pred, obs = test_y)

Ridge keeps all predictors but shrinks their coefficients. That is often useful when there are many correlated predictors, which is probably the case here.

Lasso regression

set.seed(127)
lasso_grid <- expand.grid(
  alpha = 1,
  lambda = 10^seq(-4, 1, length = 30)
)

lasso_fit <- train(
  x = train_x_nzv,
  y = train_y,
  method = "glmnet",
  tuneGrid = lasso_grid,
  trControl = ctrl,
  preProcess = c("center", "scale")
)

lasso_fit

## glmnet 
## 
## 133 samples
## 396 predictors
## 
## Pre-processing: centered (396), scaled (396) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 119, 118, 120, 121, 121, 118, ... 
## Resampling results across tuning parameters:
## 
##   lambda         RMSE      Rsquared   MAE      
##    0.0001000000  12.52526  0.4332455   9.138503
##    0.0001487352  12.52526  0.4332455   9.138503
##    0.0002212216  12.52526  0.4332455   9.138503
##    0.0003290345  12.52526  0.4332455   9.138503
##    0.0004893901  12.52526  0.4332455   9.138503
##    0.0007278954  12.52526  0.4332455   9.138503
##    0.0010826367  12.52526  0.4332455   9.138503
##    0.0016102620  12.52526  0.4332455   9.138503
##    0.0023950266  12.52526  0.4332455   9.138503
##    0.0035622479  12.52526  0.4332455   9.138503
##    0.0052983169  12.52526  0.4332455   9.138503
##    0.0078804628  12.52526  0.4332455   9.138503
##    0.0117210230  12.52526  0.4332455   9.138503
##    0.0174332882  12.52526  0.4332455   9.138503
##    0.0259294380  12.52526  0.4332455   9.138503
##    0.0385662042  12.52526  0.4332455   9.138503
##    0.0573615251  12.52526  0.4332455   9.138503
##    0.0853167852  12.52521  0.4332488   9.138467
##    0.1268961003  12.07953  0.4560169   8.871398
##    0.1887391822  11.57633  0.4803641   8.574253
##    0.2807216204  11.22308  0.5022838   8.380654
##    0.4175318937  11.08616  0.5111008   8.233230
##    0.6210169419  11.06680  0.5140045   8.057704
##    0.9236708572  11.13524  0.5134621   7.977813
##    1.3738237959  11.28591  0.5107602   8.016846
##    2.0433597179  11.60674  0.4981745   8.481342
##    3.0391953823  12.04261  0.4883514   9.032464
##    4.5203536564  12.70386  0.4833776   9.829793
##    6.7233575365  13.78409  0.4580500  10.898601
##   10.0000000000  15.28518  0.1432231  12.166406
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.6210169.

lasso_cv_row <- lasso_fit$results |>
  filter(alpha == lasso_fit$bestTune$alpha,
         lambda == lasso_fit$bestTune$lambda)

lasso_pred <- predict(lasso_fit, newdata = test_x_nzv)
lasso_test_metrics <- postResample(pred = lasso_pred, obs = test_y)

Lasso selected lambda of 0.621, which tells me the model benefited from actual variable selection, but not from an overly aggressive penalty. At much larger lambda values, performance dropped sharply, which suggests lasso started removing predictors that still carried useful permeability signal. So this chosen lambda (0.621) seems like the best point where the model removes noise while still keeping the most informative molecular fragments.

Elastic net

set.seed(128)
enet_grid <- expand.grid(
  alpha = seq(0.1, 0.9, by = 0.2),
  lambda = 10^seq(-4, 1, length = 25)
)

enet_fit <- train(
  x = train_x_nzv,
  y = train_y,
  method = "glmnet",
  tuneGrid = enet_grid,
  trControl = ctrl,
  preProcess = c("center", "scale")
)

enet_cv_row <- enet_fit$results |>
  filter(alpha == enet_fit$bestTune$alpha,
         lambda == enet_fit$bestTune$lambda)

enet_pred <- predict(enet_fit, newdata = test_x_nzv)
enet_test_metrics <- postResample(pred = enet_pred, obs = test_y)

best_alpha  <- enet_fit$bestTune$alpha
best_lambda <- enet_fit$bestTune$lambda

near_best <- enet_fit$results |>
  filter(alpha == best_alpha) |>
  mutate(distance = abs(log10(lambda) - log10(best_lambda))) |>
  arrange(distance) |>
  slice(1:16)

near_best

enet_fit$bestTune

best_row <- enet_fit$results |>
  filter(alpha == enet_fit$bestTune$alpha,
         lambda == enet_fit$bestTune$lambda)

best_row

Elastic net fit selected alpha = 0.1 and lambda = 3.83, and it puts the model much closer to ridge (90%) than lasso (10%). I think this strongly suggests that the permeability signal is not concentrated in a tiny set of predictors. Instead, it seems distributed across groups of correlated variables, where several related molecular fragments all contribute weakly. The small lasso component still helps remove some redundant predictors, but the strong ridge part keeps correlated groups together.

The models above show that ridge and elastic net both did strong coefficients shrinkage, while lasso preferred only moderate sparsity. This can suggests permeability is affected by many weak structural fragments acting together, rather than a very small number of dominant fragments.

I think to make more accurate predictions we need more data explaining chemical structures of compounds.

Models comparison table

comparison <- tibble(
  Model = c("PLS", "PCR", "LM with PCA", "Ridge", "Lasso", "Elastic Net"),
  Tuning = c(
    paste0("ncomp = ", pls_best_ncomp),
    paste0("ncomp = ", pcr_best_ncomp),
    "PCA preprocessing",
    paste0("alpha = ", ridge_fit$bestTune$alpha, ", lambda = ", signif(ridge_fit$bestTune$lambda, 3)),
    paste0("alpha = ", lasso_fit$bestTune$alpha, ", lambda = ", signif(lasso_fit$bestTune$lambda, 3)),
    paste0("alpha = ", enet_fit$bestTune$alpha, ", lambda = ", signif(enet_fit$bestTune$lambda, 3))
  ),
  CV_RMSE = c(
    pls_cv_row$RMSE,
    pcr_cv_row$RMSE,
    lm_pca_cv_row$RMSE,
    ridge_cv_row$RMSE,
    lasso_cv_row$RMSE,
    enet_cv_row$RMSE
  ),
  CV_R2 = c(
    pls_cv_row$Rsquared,
    pcr_cv_row$Rsquared,
    lm_pca_cv_row$Rsquared,
    ridge_cv_row$Rsquared,
    lasso_cv_row$Rsquared,
    enet_cv_row$Rsquared
  ),
  Test_RMSE = c(
    unname(pls_test_metrics["RMSE"]),
    unname(pcr_test_metrics["RMSE"]),
    unname(lm_pca_test_metrics["RMSE"]),
    unname(ridge_test_metrics["RMSE"]),
    unname(lasso_test_metrics["RMSE"]),
    unname(enet_test_metrics["RMSE"])
  ),
  Test_R2 = c(
    unname(pls_test_metrics["Rsquared"]),
    unname(pcr_test_metrics["Rsquared"]),
    unname(lm_pca_test_metrics["Rsquared"]),
    unname(ridge_test_metrics["Rsquared"]),
    unname(lasso_test_metrics["Rsquared"]),
    unname(enet_test_metrics["Rsquared"])
  )
) |>
  arrange(desc(Test_R2))

comparison

best_model_name <- comparison$Model[1]
best_model_name

## [1] "LM with PCA"

The comparison table shows that LM with PCA performed best on this test split, with elastic net finishing very close behind. PLS, elastic net, lasso, and ridge all have a very similar metrics range, which suggests there is no single clearly dominant linear modeling approach for this dataset.

Optional check: Box-Cox transformation of the response

Because the response is right-skewed, I also checked to see if transforming the response helps.

bxcx_obj <- BoxCoxTrans(train_y)
bxcx_obj

## Box-Cox Transformation
## 
## 133 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.235   1.550   4.910  12.106  14.985  55.600 
## 
## Largest/Smallest: 237 
## Sample Skewness: 1.41 
## 
## Estimated Lambda: 0 
## With fudge factor, Lambda = 0 will be used for transformations

train_y_bxcx <- predict(bxcx_obj, train_y)

set.seed(123)
pls_fit_bxcx <- train(
  x = train_x_nzv,
  y = train_y_bxcx,
  method = "pls",
  tuneGrid = expand.grid(ncomp = 1:20),
  trControl = ctrl,
  preProcess = c("center", "scale")
)

pls_fit_bxcx

## Partial Least Squares 
## 
## 133 samples
## 396 predictors
## 
## Pre-processing: centered (396), scaled (396) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 120, 119, 118, 120, 121, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     1.2594573  0.2923041  1.0471830
##    2     1.1595521  0.4154859  0.9200698
##    3     1.1190671  0.4463232  0.8971201
##    4     1.1128837  0.4579955  0.9109995
##    5     1.0850724  0.4933878  0.8705014
##    6     1.0838479  0.4945933  0.8481236
##    7     1.0553894  0.5177276  0.8294601
##    8     1.0276255  0.5419267  0.8292663
##    9     0.9950824  0.5710377  0.8006815
##   10     0.9946062  0.5786583  0.8038927
##   11     0.9966091  0.5823463  0.8063206
##   12     1.0006916  0.5845509  0.8135320
##   13     1.0198226  0.5763524  0.8268684
##   14     1.0294385  0.5753100  0.8321296
##   15     1.0414950  0.5732547  0.8438280
##   16     1.0598655  0.5639492  0.8573918
##   17     1.0674877  0.5625044  0.8627286
##   18     1.0626379  0.5681585  0.8634473
##   19     1.0608449  0.5718966  0.8631657
##   20     1.0651788  0.5716072  0.8669718
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 10.

pred_bxcx <- predict(pls_fit_bxcx, newdata = test_x_nzv)
pred_bxcx_orig <- predict(bxcx_obj, pred_bxcx, inverse = TRUE)

bxcx_test_metrics_original <- postResample(pred = pred_bxcx_orig, obs = test_y)
bxcx_test_metrics_original

##       RMSE   Rsquared        MAE 
## 21.6934987  0.4211546 14.4127705

BoxCox transformation of the response did not help on the original scale of the test data. So even the response is skewed, the transformation is not an improvement.

f. Would I recommend replacing the lab experiment?

No, I would not recommend replacing the permeability laboratory experiment with any of these models. As we saw earlier, PLS model has a tendency to overfit actual low-value observations

Even the better models show only moderate predictive performance. That is maybe good enough to be useful, but not good enough for full replacement in a pharmaceutical setting where the cost of being wrong can be dangerous.

I would try to see if more data is available to improve models. Also, other families of models could be more effective here, such as regression trees

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors),measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used toassess the quality of the raw material before processing. On the other hand,manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

Start R and use these commands to load the data:

data("ChemicalManufacturingProcess")
dim(ChemicalManufacturingProcess)

## [1] 176  58

yield <- ChemicalManufacturingProcess[, 1]

processPredictors <- ChemicalManufacturingProcess[, -1]

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8)

sum(is.na(processPredictors))

## [1] 106

The first step was checking the predictor matrix for missing values. There were 106 missing cells across all measurements, which is still a relatively small amount compared with the full dataset (57 predictors and 176 observations) . To impute NA’s I choosed “bagged tree” imputation because it can better preserve nonlinear relationships between variables.

set.seed(123)
imputation <- preProcess(
  processPredictors,
  method = 'bagImpute'
)

processPredictors_imputed <- predict(imputation, processPredictors)
sum(is.na(processPredictors_imputed))

## [1] 0

#?map_dfr
summary_change <- bind_rows(lapply(names(processPredictors), function(v) {
  tibble(
    variable = v,
    mean_before = mean(processPredictors[[v]], na.rm = TRUE),
    mean_after = mean(processPredictors_imputed[[v]], na.rm = TRUE),
    median_before = median(processPredictors[[v]], na.rm = TRUE),
    median_after = median(processPredictors_imputed[[v]], na.rm = TRUE),
    sd_before = sd(processPredictors[[v]], na.rm = TRUE),
    sd_after = sd(processPredictors_imputed[[v]], na.rm = TRUE),
    missing_before = sum(is.na(processPredictors[[v]])),
    missing_after = sum(is.na(processPredictors_imputed[[v]]))
  )
})) |>
  mutate(
    mean_pct_change = ifelse(mean_before == 0, NA, 100 * (mean_after - mean_before) / mean_before),
    median_pct_change = ifelse(median_before == 0, NA, 100 * (median_after - median_before) / median_before),
    sd_pct_change = ifelse(sd_before == 0, NA, 100 * (sd_after - sd_before) / sd_before)
  ) |>
  filter(missing_before > 0) |>
  select(variable, missing_before, mean_pct_change, median_pct_change, sd_pct_change)

summary_change <- summary_change |>
  mutate(
    mean_pct_change = paste0(round(mean_pct_change, 3), "%")
  )

summary_change

After filling in the missing values, I compared the summary statistics before and after imputation. The means, medians, and standard deviations changed only slightly, which suggests the imputation process preserved the original structure of the data fairly well and did not introduce major distortions.

nzv_vars <- nearZeroVar(processPredictors_imputed)
nzv_vars

## [1] 7

processPredictors_imputed <- processPredictors_imputed[, -nzv_vars]

Near-zero variance filtering removed 1 predictors that contained almost no useful variation.

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

I split data into training and test sets using a 75/25 split.

set.seed(123)

train_index <- createDataPartition(yield, p= 0.75, list = FALSE)

train_procPred <- processPredictors_imputed[train_index,]
test_procPred <- processPredictors_imputed[-train_index,]

train_yield <- yield[train_index]
test_yield <- yield[-train_index]

skewness(train_yield)

## [1] 0.3346915

hist(train_yield, breaks = 20)

Before fitting the model, I checked the distribution of the response variable. The yield variable only showed mild right skew, with skewness equal to 0.335, and the histogram looked close to symmetric. Because of this, I kept the response on its original scale to make the final predictions easier to interpret.

cor_matrix <- cor(train_procPred, use = "pairwise.complete.obs")
corrplot(
  cor_matrix,
  order = 'hclust',
  tl.cex = 0.2
)

The correlation matrix shows several strong clusters among variables. Since elastic net is specifically designed to handle correlated predictors through shrinkage, this supports using it as the main modeling approach.

high_corr <- findCorrelation(cor_matrix, cutoff = 0.95)

length(high_corr)

## [1] 6

#colnames(train_procPred)[high_corr]

Using a 0.95 cutoff, 6 highly correlated predictors were identified as potential redundant variables.

set.seed(123)
preprocess <- preProcess(train_procPred, method = c('center', 'scale'))

train_procPred_pp <- predict(preprocess, train_procPred)
test_procPred_pp <- predict(preprocess, test_procPred)

skews <- sapply(train_procPred_pp, skewness, na.rm = TRUE)

skews[abs(skews) > 1]

##   BiologicalMaterial04   BiologicalMaterial10 ManufacturingProcess01 
##               1.980845               2.632838              -3.340772 
## ManufacturingProcess02 ManufacturingProcess05 ManufacturingProcess06 
##              -1.354178               1.490613               1.468871 
## ManufacturingProcess12 ManufacturingProcess16 ManufacturingProcess18 
##               1.567466             -10.817291             -11.013409 
## ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess25 
##             -10.949601               1.643909             -11.078430 
## ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess29 
##             -11.101530             -10.982786              -9.159683 
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess38 
##              -4.718153             -10.413862              -1.757712 
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41 
##              -4.256738               1.567466               2.069985 
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44 
##              -6.091542               8.587932              -5.454922 
## ManufacturingProcess45 
##              -4.430247

ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5
)

I selected elastic net as the main model because it combines regularization with variable selection, since it is optimal for correlated process measurements.
Also I will use Yeo-Johnson transformation to normalize skewed predictors from above.

set.seed(123)
 enetGrid <- expand.grid(.lambda = c(0, 0.01, .1),
                         .fraction = seq(.01, 1, length = 20))
enet_tune <- train(train_procPred_pp, train_yield,
                   method = 'enet',
                   tuneGrid = enetGrid,
                   trControl = ctrl,
                   preProcess = 'YeoJohnson'
                   )
plot(enet_tune)

enet_tune$results[row.names(enet_tune$bestTune), ]

In this elastic net model, the two tuning parameters represent the ridge and lasso parts of the regularization. The lambda parameter controls the ridge portion, which shrinks correlated coefficients toward each other and helps stabilize the model. Since the best value was lambda = 0, the final model did not need much ridge-style shrinkage.

The fraction parameter expalins lasso side, because it controls how far the model moves along the coefficient path as variables enter the model one by one. The selected value of fraction = 0.062 means the model stopped very early, keeping only the strongest predictors and filtering many weaker variables out.

So here, the final elastic net solution behaved much closer to lasso-penalized model with a small subset of important predictors are used to make yield predictions.

Repeated cross-validation for the final elastic net model has:

optimal lambda: 0
optimal fraction: 0.062
resampled estimate of RMSE: 1.171
resampled estimate of \(R^2\): 0.615

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

set.seed(123)
enet_pred <- predict(enet_tune, newdata = test_procPred_pp)

postResample(enet_pred, test_yield)

##      RMSE  Rsquared       MAE 
## 1.0996674 0.6045486 0.9147507

Final test set performance:

RMSE: 1.1
\(R^2\): 0.605
MAE: 0.915

diag_df <- data.frame(
  observed = test_yield,
  predicted = enet_pred,
  residuals = test_yield - enet_pred
)

p1 <- ggplot(diag_df, aes(predicted, observed)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  labs(title = "Predicted vs Observed")

p2 <- ggplot(diag_df, aes(predicted, residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(title = "Residuals vs Predicted")

p1 + p2

The predicted vs observed plot shows that most of the points stay pretty close to the diagonal line, so overall the model is predicting yield reasonably well. It seems to capture both lower and higher yield batches without any major systematic bias, even though there are still few observations where the predictions are a bit off.

The residuals are centered around zero and do not show any obvious shape or trend, which suggests the elastic net model is capturing most of the important signal in the data. The variance also stays fairly similar across the fitted values.

Overall these diagnostic plots support that the model is stable.

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

I explored variable importance to better understand which biological and manufacturing measurements are most strongly associated with final yield.

var_imp <- varImp(enet_tune, scale = TRUE)
plot(var_imp, top = 15)

var_imp$importance |>
  tibble::rownames_to_column("Predictor") |>
  arrange(desc(Overall)) |>
  slice(1:15)

top_vars <- var_imp$importance |>
  tibble::rownames_to_column("Predictor") |>
  arrange(desc(Overall)) |>
  slice(1:15)

table(grepl("^Biological", top_vars$Predictor))

## 
## FALSE  TRUE 
##    10     5

Among the top 15 most important predictors, 10 are manufacturing process variables and 5 are biological material variables. This suggests that both groups contribute to yield prediction, but the manufacturing measurements appear to dominate the model.

Overall, this suggest that biological material quality matters early, but the strongest optimization may come from controlling the manufacturing process settings more effectively.

top_names <- top_vars$Predictor[1:9]

plots <- lapply(top_names, function(v) {
  ggplot(
    data.frame(x = train_procPred_pp[[v]], y = train_yield),
    aes(x, y)
  ) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(title = v, x = v, y = "Yield")
})

patchwork::wrap_plots(plots)

The scatterplots of the top predictors against yield show several useful linear relationships. ManufacturingProcess32 and ManufacturingProcess09 both show clear positive relationships, meaning higher values of these predictors produces higher yield values. BiologicalMaterial02, BiologicalMaterial03, and BiologicalMaterial06 also show positive trends, which suggests that certain raw material properties may consistently lead to stronger batches.

At the same time, some process variables such as ManufacturingProcess17, ManufacturingProcess13, and ManufacturingProcess36 show negative relationships with yield. This means that when these settings increase, the yield has a tendency to decrease. It is possible that these predictors may point to less efficient production conditions or process settings that push the batch away from the ideal range.

This information can be very useful for improving future manufacturing runs. The biological predictors cannot be changed, so in my opinion they can be used before production starts to evaluate the incoming material quality and find lots that may produce weaker yield. The manufacturing predictors are even more valuable because they can be controled during production. Production personnel can use these relationships to adjust machine settings to get higher yield and adjust manufacture settings that deliver negative effects.

DATA624 Assignment 7

Farhod Ibragimov

2026-04-10

6.2. Developing a model to predict permeability (see Sect.1.4) could save sig nificant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(b) Remove low-frequency predictors using `nearZeroVar()`

Modeling version using training data only

Cross-validation setup

PLS Model

d. Predict the response for the test set. What is the test set estimate of R2?

(e) Try other models from this chapter

PCR

Linear regression with PCA preprocessing

Ridge regression

Lasso regression

Elastic net

Models comparison table

Optional check: Box-Cox transformation of the response

DATA624 Assignment 7

Farhod Ibragimov

2026-04-10

6.2. Developing a model to predict permeability (see Sect.1.4) could save sig nificant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(b) Remove low-frequency predictors using nearZeroVar()

Modeling version using training data only

Cross-validation setup

PLS Model

d. Predict the response for the test set. What is the test set estimate of R2?

(e) Try other models from this chapter

PCR

Linear regression with PCA preprocessing

Ridge regression

Lasso regression

Elastic net

Models comparison table

Optional check: Box-Cox transformation of the response

f. Would I recommend replacing the lab experiment?

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8)

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

(b) Remove low-frequency predictors using `nearZeroVar()`