Exercise 6.2

#(a) Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

#Start R and use these commands to load the data:

install.packages("AppliedPredictiveModeling")

## Installing package into 'C:/Users/zahid/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)

## package 'AppliedPredictiveModeling' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\zahid\AppData\Local\Temp\Rtmp8y0dHV\downloaded_packages

library(AppliedPredictiveModeling)
data(permeability)

#The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

#(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

data(permeability)

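# Filter near-zero variance predictors (guard against an empty index vector,
# since x[, -integer(0)] would drop every column)
nzv_cols <- nearZeroVar(fingerprints)
filtered_fingerprints <- if (length(nzv_cols) > 0) fingerprints[, -nzv_cols] else fingerprints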

# Check remaining predictors
ncol(filtered_fingerprints)

## [1] 388

#Comment: The original fingerprint matrix contained 1,107 predictors. After nearZeroVar removed the sparse, low-frequency predictors, 388 remained for modeling (719 were dropped). Removing predictors that are nearly constant across compounds reduces the chance that a resampling fold sees no variation in a column, which helps keep the fitted models stable.
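
#Sketch: To make the filter concrete, the following base-R example computes the two statistics that nearZeroVar is documented to use, the frequency ratio (count of the most common value over the count of the second most common) and the percentage of unique values, on a made-up binary matrix. The toy data, the helper names freq_ratio and percent_unique, and the exact comparison operators are illustrative assumptions; the cutoffs 95/5 and 10 are caret's documented defaults for freqCut and uniqueCut.

```r
# Toy binary "fingerprint" matrix (made-up data, not the permeability set)
n <- 100
toy <- cbind(
  common   = rep(c(0, 1), n / 2),          # substructure present in half the rows
  rare     = c(rep(1, 2), rep(0, n - 2)),  # present in only 2 of 100 rows
  constant = rep(0, n)                     # never present
)

# Ratio of the most frequent value's count to the second most frequent
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)      # zero-variance column
  as.numeric(counts[1] / counts[2])
}

# Percentage of distinct values relative to the number of rows
percent_unique <- function(x) 100 * length(unique(x)) / length(x)

ratios <- apply(toy, 2, freq_ratio)
uniq <- apply(toy, 2, percent_unique)

# Assumed rule: flag when the ratio exceeds 95/5 and few unique values exist
flagged <- ratios > 95 / 5 & uniq <= 10
print(data.frame(freq_ratio = ratios, percent_unique = uniq, flagged = flagged))
```

#In this toy example the rare and constant columns are flagged and the common column is kept, which mirrors how sparse fingerprint columns are filtered out.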

#(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R²?

set.seed(123)
# Split Data (80/20 split)
training_rows <- createDataPartition(permeability, p = 0.8, list = FALSE)

train_x <- filtered_fingerprints[training_rows, ]
train_y <- permeability[training_rows]
test_x <- filtered_fingerprints[-training_rows, ]
test_y <- permeability[-training_rows]

# Tune PLS using 10-fold Cross-Validation
ctrl <- trainControl(method = "cv", number = 10)
pls_fit <- train(train_x, train_y, 
                 method = "pls", 
                 tuneLength = 20, 
                 trControl = ctrl, 
                 preProc = c("center", "scale"))

print(pls_fit)

## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 121, 121, 118, 119, 119, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.31894  0.3442124  10.254018
##    2     11.78898  0.4830504   8.534741
##    3     11.98818  0.4792649   9.219285
##    4     12.04349  0.4923322   9.448926
##    5     11.79823  0.5193195   9.049121
##    6     11.53275  0.5335956   8.658301
##    7     11.64053  0.5229621   8.878265
##    8     11.86459  0.5144801   9.265252
##    9     11.98385  0.5188205   9.218594
##   10     12.55634  0.4808614   9.610747
##   11     12.69674  0.4758068   9.702325
##   12     13.01534  0.4538906   9.956623
##   13     13.12637  0.4367362   9.878017
##   14     13.44865  0.4140715  10.065088
##   15     13.60135  0.4034269  10.188150
##   16     13.79361  0.3943904  10.247160
##   17     14.00756  0.3845119  10.412776
##   18     14.18113  0.3711378  10.587027
##   19     14.25674  0.3703610  10.575726
##   20     14.33121  0.3723176  10.679764
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.

#Comment: The optimal number of latent variables for the PLS model was 6, selected based on the lowest cross-validated RMSE. The corresponding resampled R² was approximately 0.534, indicating that the model explains about 53% of the variability in permeability.
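
#Sketch: The selection rule reported in the output (smallest cross-validated RMSE) can be reproduced directly from the resampling table. The values below are copied from the printed results; the data frame name cv_results is only illustrative.

```r
# Resampled RMSE and R-squared for ncomp = 1..20, copied from print(pls_fit)
cv_results <- data.frame(
  ncomp = 1:20,
  RMSE = c(13.31894, 11.78898, 11.98818, 12.04349, 11.79823,
           11.53275, 11.64053, 11.86459, 11.98385, 12.55634,
           12.69674, 13.01534, 13.12637, 13.44865, 13.60135,
           13.79361, 14.00756, 14.18113, 14.25674, 14.33121),
  Rsquared = c(0.3442124, 0.4830504, 0.4792649, 0.4923322, 0.5193195,
               0.5335956, 0.5229621, 0.5144801, 0.5188205, 0.4808614,
               0.4758068, 0.4538906, 0.4367362, 0.4140715, 0.4034269,
               0.3943904, 0.3845119, 0.3711378, 0.3703610, 0.3723176)
)

# Pick the row with the smallest RMSE, matching the rule stated in the output
best <- cv_results[which.min(cv_results$RMSE), ]
print(best)

# Rows within 0.3 RMSE of the minimum (0.3 is an arbitrary illustrative
# margin), to show how flat the error curve is near the optimum
print(cv_results[cv_results$RMSE <= best$RMSE + 0.3, ])
```

#The second table is worth a look because several component counts sit close to the minimum, so the choice of 6 is not sharply determined by these folds.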

#(d) Predict the response for the test set. What is the test set estimate of R²?

pls_pred <- predict(pls_fit, test_x)
test_results <- postResample(pred = pls_pred, obs = test_y)
print(test_results)

##       RMSE   Rsquared        MAE 
## 12.3486900  0.3244542  8.2881075

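#Comment: The test set estimate of R² is approximately 0.324, indicating that the model explains about 32% of the variability in permeability on unseen data. This is noticeably lower than the resampled estimate of 0.534, although the test RMSE (12.35) is closer to the resampled RMSE (11.53). With only 32 compounds in the test set, the test estimate is itself quite variable.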
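
#Sketch: For reference, the three numbers printed by postResample can be computed by hand. caret's default R² for regression is understood here to be the squared correlation between observed and predicted values rather than 1 - SSE/SST; treat that as an assumption to check against the caret documentation. The vectors below are made-up values, not the permeability predictions.

```r
# Made-up observed and predicted values for illustration only
obs  <- c(1.2, 3.4, 2.2, 5.1, 4.0)
pred <- c(1.0, 3.0, 2.9, 4.6, 4.4)

# Root mean squared error
rmse <- sqrt(mean((obs - pred)^2))

# R-squared as the squared Pearson correlation (assumed caret default)
rsq <- cor(obs, pred)^2

# Mean absolute error
mae <- mean(abs(obs - pred))

print(c(RMSE = rmse, Rsquared = rsq, MAE = mae))
```

#Because the squared correlation ignores bias and scale errors in the predictions, it is useful to read R² together with RMSE rather than on its own.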

#(e) Try building other models discussed in this chapter. Do any have better predictive performance?

enet_grid <- expand.grid(lambda = seq(0, 0.1, length = 10), 
                         fraction = seq(0.1, 1, length = 10))

enet_fit <- train(train_x, train_y,
                  method = "enet",
                  tuneGrid = enet_grid,
                  trControl = ctrl,
                  preProc = c("center", "scale"))

# Compare PLS vs Enet
results <- resamples(list(PLS = pls_fit, ENet = enet_fit))
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: PLS, ENet 
## Number of resamples: 10 
## 
## MAE 
##          Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## PLS  5.940748 7.855195 8.101540 8.658301 8.587837 12.53980    0
## ENet 5.504259 7.317846 8.244954 8.067092 8.951861 11.20909    0
## 
## RMSE 
##          Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## PLS  7.479049 10.662269 11.18215 11.53275 11.66271 16.49145    0
## ENet 7.188166  9.663742 11.86149 11.35483 12.47250 15.18330    0
## 
## Rsquared 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS  0.2072535 0.3875424 0.5626977 0.5335956 0.6291758 0.8326590    0
## ENet 0.1557624 0.5202124 0.5401826 0.5579759 0.7192849 0.8020521    0

#Comment: Elastic Net was also applied and compared to the PLS model using the same 10-fold cross-validation. On average, Elastic Net performed slightly better, with a lower mean RMSE (11.35 vs. 11.53) and a higher mean R² (0.558 vs. 0.534). However, its median RMSE is higher (11.86 vs. 11.18) and the fold-to-fold ranges of the two models overlap heavily, so the improvement over PLS is modest and may not be meaningful. Elastic Net's regularization and built-in feature selection are a reasonable fit for this high-dimensional dataset, but these results do not show a clear winner.

#(f) Would you recommend any of your models to replace the permeability laboratory experiment?

#Answer: None of the models should replace the permeability laboratory experiment. The best resampled R² was only about 0.56 (Elastic Net), the PLS test set R² was 0.32, and RMSE values were roughly 11 to 12, which is not accurate enough to substitute for a measurement. However, these models can be valuable as screening tools to prioritize compounds and reduce the number of laboratory experiments.

#Exercise: 6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

#(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

#The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

#(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

install.packages("RANN")

## Installing package into 'C:/Users/zahid/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)

## package 'RANN' successfully unpacked and MD5 sums checked

## Warning: cannot remove prior installation of package 'RANN'

## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## C:\Users\zahid\AppData\Local\R\win-library\4.4\00LOCK\RANN\libs\x64\RANN.dll to
## C:\Users\zahid\AppData\Local\R\win-library\4.4\RANN\libs\x64\RANN.dll:
## Permission denied

## Warning: restored 'RANN'

## 
## The downloaded binary packages are in
##  C:\Users\zahid\AppData\Local\Temp\Rtmp8y0dHV\downloaded_packages

library(caret)
library(ggplot2)
library(lattice)
library(RANN)
data(ChemicalManufacturingProcess)


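# Separate the response (first column) from the 57 predictors
yield <- ChemicalManufacturingProcess[, 1]
predictors <- ChemicalManufacturingProcess[, -1]

# Impute missing values with k-nearest neighbors, then center and scale.
# Note: these steps are estimated on all 176 runs, before the train/test split.
preProcValues <- preProcess(predictors, method = c("knnImpute", "center", "scale"))
predictors_imputed <- predict(preProcValues, predictors)

# Drop near-zero variance predictors (guard against an empty index vector)
nzv <- nearZeroVar(predictors_imputed)
predictors_final <- if (length(nzv) > 0) predictors_imputed[, -nzv] else predictors_imputed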
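
#Sketch: The pre-processing for this exercise is estimated on all runs before the data are split, so the held-out runs contribute to the centering, scaling, and imputation. A common alternative is to estimate these quantities on the training rows only and then apply them to the test rows. The base-R example below illustrates that idea for centering and scaling on simulated data; the matrix x, the index train_id, and the other variable names are hypothetical, and the same principle would apply to imputation.

```r
# Simulated predictor matrix (made-up data, not the manufacturing data)
set.seed(123)
x <- matrix(rnorm(40, mean = 10, sd = 3), ncol = 2,
            dimnames = list(NULL, c("a", "b")))
train_id <- 1:15                       # hypothetical training rows

# Estimate the centering and scaling values from the training rows only
mu   <- colMeans(x[train_id, ])
sdev <- apply(x[train_id, ], 2, sd)

# Apply the training-set values to both partitions
train_scaled <- scale(x[train_id, ], center = mu, scale = sdev)
test_scaled  <- scale(x[-train_id, ], center = mu, scale = sdev)

# Training columns have mean 0 and sd 1; test columns generally do not
print(round(colMeans(train_scaled), 3))
print(round(colMeans(test_scaled), 3))
```

#With caret, one way to get this behavior is to pass the pre-processing steps to train() (as was done with preProc in Exercise 6.2), so that they are re-estimated within each resample.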

#(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

set.seed(123)
trainIndex <- createDataPartition(yield, p = 0.8, list = FALSE)

trainX <- predictors_final[trainIndex, ]
trainY <- yield[trainIndex]
testX  <- predictors_final[-trainIndex, ]
testY  <- yield[-trainIndex]

# 10-fold Cross-Validation
ctrl <- trainControl(method = "cv", number = 10)

# Tuning Elastic Net
enet_model <- train(trainX, trainY,
                    method = "enet",
                    tuneLength = 10,
                    trControl = ctrl)

print(enet_model)

## Elasticnet 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 131, 130, 130, 129, 131, 129, ... 
## Resampling results across tuning parameters:
## 
##   lambda        fraction   RMSE      Rsquared   MAE      
##   0.0000000000  0.0500000  1.162457  0.6302366  0.9659203
##   0.0000000000  0.1555556  1.624090  0.5740326  1.1016731
##   0.0000000000  0.2611111  2.172042  0.4681495  1.3134206
##   0.0000000000  0.3666667  2.673456  0.4586086  1.4625992
##   0.0000000000  0.4722222  2.904593  0.4796554  1.5144560
##   0.0000000000  0.5777778  3.169973  0.4665650  1.6038977
##   0.0000000000  0.6833333  3.513778  0.4484407  1.7162967
##   0.0000000000  0.7888889  4.119522  0.4236920  1.9140239
##   0.0000000000  0.8944444  5.406957  0.4020988  2.2829273
##   0.0000000000  1.0000000  6.623412  0.3859416  2.6342655
##   0.0001000000  0.0500000  1.197476  0.6201030  0.9937248
##   0.0001000000  0.1555556  1.541637  0.5974676  1.0485180
##   0.0001000000  0.2611111  2.277669  0.4695801  1.3358144
##   0.0001000000  0.3666667  2.611982  0.4542408  1.4437179
##   0.0001000000  0.4722222  2.895287  0.4672357  1.5253701
##   0.0001000000  0.5777778  3.060119  0.4764711  1.5654680
##   0.0001000000  0.6833333  3.366583  0.4609917  1.6610858
##   0.0001000000  0.7888889  3.633038  0.4494763  1.7443807
##   0.0001000000  0.8944444  4.769442  0.4268663  2.0777453
##   0.0001000000  1.0000000  5.876385  0.4068849  2.3977869
##   0.0002371374  0.0500000  1.224863  0.6150536  1.0138559
##   0.0002371374  0.1555556  1.448479  0.6076219  1.0127776
##   0.0002371374  0.2611111  2.126141  0.4799791  1.2815008
##   0.0002371374  0.3666667  2.512611  0.4569811  1.4118723
##   0.0002371374  0.4722222  2.974121  0.4476288  1.5480017
##   0.0002371374  0.5777778  3.059638  0.4626462  1.5769520
##   0.0002371374  0.6833333  3.287062  0.4619110  1.6470307
##   0.0002371374  0.7888889  3.493901  0.4572261  1.7075754
##   0.0002371374  0.8944444  4.275192  0.4471082  1.9081412
##   0.0002371374  1.0000000  5.261457  0.4321526  2.1950478
##   0.0005623413  0.0500000  1.278826  0.6061906  1.0501610
##   0.0005623413  0.1555556  1.276574  0.6106429  0.9735481
##   0.0005623413  0.2611111  1.808977  0.5393830  1.1710066
##   0.0005623413  0.3666667  2.352260  0.4629374  1.3589245
##   0.0005623413  0.4722222  2.880352  0.4449154  1.5204756
##   0.0005623413  0.5777778  3.176122  0.4384185  1.6132025
##   0.0005623413  0.6833333  3.297466  0.4403188  1.6546910
##   0.0005623413  0.7888889  3.455860  0.4388490  1.7082081
##   0.0005623413  0.8944444  3.920418  0.4354470  1.8445195
##   0.0005623413  1.0000000  4.650372  0.4337607  2.0408812
##   0.0013335214  0.0500000  1.349810  0.5992674  1.1003054
##   0.0013335214  0.1555556  1.110780  0.6299393  0.9171078
##   0.0013335214  0.2611111  1.546707  0.5909330  1.0612200
##   0.0013335214  0.3666667  1.927995  0.4974117  1.2222472
##   0.0013335214  0.4722222  2.621192  0.4528742  1.4370876
##   0.0013335214  0.5777778  3.018948  0.4395891  1.5642410
##   0.0013335214  0.6833333  3.276753  0.4311298  1.6468233
##   0.0013335214  0.7888889  3.462447  0.4271098  1.7097552
##   0.0013335214  0.8944444  3.686669  0.4233651  1.7819944
##   0.0013335214  1.0000000  4.229520  0.4194950  1.9380670
##   0.0031622777  0.0500000  1.426968  0.5901000  1.1625949
##   0.0031622777  0.1555556  1.114269  0.6330259  0.9196184
##   0.0031622777  0.2611111  1.491414  0.6079121  1.0269517
##   0.0031622777  0.3666667  1.573291  0.5927843  1.0718053
##   0.0031622777  0.4722222  1.962374  0.4916575  1.2335808
##   0.0031622777  0.5777778  2.703301  0.4524955  1.4570412
##   0.0031622777  0.6833333  3.104655  0.4371119  1.5851612
##   0.0031622777  0.7888889  3.273284  0.4299071  1.6436501
##   0.0031622777  0.8944444  3.452162  0.4237933  1.7068811
##   0.0031622777  1.0000000  3.916367  0.4175553  1.8448462
##   0.0074989421  0.0500000  1.499443  0.5753818  1.2198721
##   0.0074989421  0.1555556  1.156303  0.6217081  0.9594777
##   0.0074989421  0.2611111  1.217707  0.6129227  0.9564565
##   0.0074989421  0.3666667  1.464055  0.6083026  1.0157943
##   0.0074989421  0.4722222  1.585804  0.5856705  1.0840347
##   0.0074989421  0.5777778  1.974011  0.5022922  1.2319465
##   0.0074989421  0.6833333  2.661154  0.4610730  1.4369719
##   0.0074989421  0.7888889  3.045554  0.4440303  1.5574325
##   0.0074989421  0.8944444  3.249902  0.4345275  1.6256102
##   0.0074989421  1.0000000  3.568483  0.4268695  1.7247913
##   0.0177827941  0.0500000  1.563105  0.5516121  1.2691678
##   0.0177827941  0.1555556  1.207829  0.6124433  0.9957812
##   0.0177827941  0.2611111  1.115903  0.6299691  0.9196915
##   0.0177827941  0.3666667  1.351446  0.6096418  0.9925795
##   0.0177827941  0.4722222  1.447349  0.6105483  1.0125624
##   0.0177827941  0.5777778  1.622133  0.5867054  1.0908620
##   0.0177827941  0.6833333  2.000549  0.5214116  1.2286574
##   0.0177827941  0.7888889  2.525713  0.4789993  1.3868417
##   0.0177827941  0.8944444  2.853491  0.4604756  1.4910585
##   0.0177827941  1.0000000  3.102190  0.4491409  1.5696494
##   0.0421696503  0.0500000  1.614973  0.5199139  1.3104144
##   0.0421696503  0.1555556  1.279605  0.5979457  1.0500512
##   0.0421696503  0.2611111  1.150831  0.6170942  0.9506582
##   0.0421696503  0.3666667  1.179080  0.6167094  0.9403294
##   0.0421696503  0.4722222  1.406846  0.6136163  1.0031273
##   0.0421696503  0.5777778  1.513663  0.6075660  1.0332003
##   0.0421696503  0.6833333  1.721965  0.5880162  1.1148445
##   0.0421696503  0.7888889  2.036814  0.5378296  1.2271160
##   0.0421696503  0.8944444  2.328270  0.5051963  1.3207502
##   0.0421696503  1.0000000  2.567199  0.4873329  1.3974579
##   0.1000000000  0.0500000  1.650508  0.4883115  1.3385425
##   0.1000000000  0.1555556  1.349181  0.5837055  1.0981379
##   0.1000000000  0.2611111  1.190222  0.6018143  0.9790391
##   0.1000000000  0.3666667  1.130054  0.6219074  0.9295935
##   0.1000000000  0.4722222  1.285986  0.6127049  0.9735357
##   0.1000000000  0.5777778  1.396679  0.6192855  0.9958688
##   0.1000000000  0.6833333  1.558630  0.5991587  1.0439828
##   0.1000000000  0.7888889  1.746436  0.5852357  1.1182031
##   0.1000000000  0.8944444  1.974115  0.5548594  1.1973046
##   0.1000000000  1.0000000  2.079982  0.5337945  1.2402643
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.1555556 and lambda
##  = 0.001333521.

#(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

enet_pred <- predict(enet_model, testX)
postResample(pred = enet_pred, obs = testY)

##      RMSE  Rsquared       MAE 
## 1.2170467 0.5681174 1.0222292

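#Comment: The Elastic Net model achieved a test set R² of approximately 0.568, with an RMSE of 1.217. For the selected tuning values (fraction = 0.156, lambda = 0.0013), the resampled estimates on the training set were an R² of about 0.630 and an RMSE of 1.111, which is the optimal cross-validated performance from part (c). Test performance is therefore somewhat worse than the resampled estimate but of similar magnitude. For a test set of only 32 runs this gap is modest and gives no sign of severe overfitting.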

#(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

# Calculate and plot importance
importance <- varImp(enet_model)
plot(importance, top = 20)

#Comment: The most important predictors include both manufacturing process and biological variables. However, process variables (e.g., ManufacturingProcess32) appear slightly more influential overall, indicating a modest dominance of process predictors.

#(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

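#Comment: Examining each top predictor against yield shows the direction and rough strength of its relationship with the response, that is, which variables are associated with higher or lower yield. Because manufacturing process predictors can be changed, those with a clear positive or negative relationship are candidates for adjusting process settings in future runs. Biological predictors cannot be changed, but their relationships with yield can be used to screen raw material quality before processing. These relationships are associations from observed runs, so any proposed process change should be confirmed experimentally.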
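
#Sketch: One simple way to explore these relationships is to compute the correlation of each top predictor with yield and look at its sign and size. The example below uses simulated data with hypothetical column names (proc_a, proc_b, bio_a); it is not the ChemicalManufacturingProcess data, and the coefficients are invented for illustration.

```r
# Simulated runs with two hypothetical process predictors and one
# hypothetical biological predictor (made-up data)
set.seed(42)
n <- 200
toy <- data.frame(proc_a = rnorm(n), proc_b = rnorm(n), bio_a = rnorm(n))

# Invented relationship: yield rises with proc_a and falls with proc_b
toy$yield <- 40 + 1.5 * toy$proc_a - 1.0 * toy$proc_b + rnorm(n, sd = 0.5)

# Correlation of each predictor with the response
predictor_names <- c("proc_a", "proc_b", "bio_a")
cors <- sapply(toy[predictor_names], function(x) cor(x, toy$yield))

# Rank predictors by the absolute strength of the relationship
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))

# Scatter plot of the response against one predictor
plot(toy$proc_a, toy$yield, xlab = "proc_a", ylab = "yield")
```

#On the real data, the same steps could be applied to the top-ranked columns of the training predictors, with scatter plots used to check whether a relationship is roughly linear before acting on it.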

Data 624: Homework 7

Mohammad Zahid Chowdhury

2026-04-12
