Data624_Homework7_ahmed

6.2. Developing a model to predict permeability (see Sect. 1.4) could save sig- nificant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a su!cient permeability to become a drug:

Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(permeability) The matrix fingerprints contains the 1,107 binary molecular predic- tors for the 165 compounds, while permeability contains permeability response.

library(AppliedPredictiveModeling)
library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

data(permeability)

x <- fingerprints
y <- permeability

The fingerprint predictors indicate the presence or absence of substruc- tures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

nzv <- nearZeroVar(x)
x_filtered <- x[, -nzv]

ncol(x_filtered)

## [1] 388

388 predictors remain

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

set.seed(123)

trainIndex <- createDataPartition(y, p = .8, list = FALSE)

x_train <- x_filtered[trainIndex, ]
x_test  <- x_filtered[-trainIndex, ]

y_train <- y[trainIndex]
y_test  <- y[-trainIndex]

ctrl <- trainControl(method = "cv", number = 10)

pls_model <- train(
  x_train, y_train,
  method = "pls",
  preProcess = c("center","scale"),
  tuneLength = 20,
  trControl = ctrl
)

pls_model

## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 121, 121, 118, 119, 119, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.31894  0.3442124  10.254018
##    2     11.78898  0.4830504   8.534741
##    3     11.98818  0.4792649   9.219285
##    4     12.04349  0.4923322   9.448926
##    5     11.79823  0.5193195   9.049121
##    6     11.53275  0.5335956   8.658301
##    7     11.64053  0.5229621   8.878265
##    8     11.86459  0.5144801   9.265252
##    9     11.98385  0.5188205   9.218594
##   10     12.55634  0.4808614   9.610747
##   11     12.69674  0.4758068   9.702325
##   12     13.01534  0.4538906   9.956623
##   13     13.12637  0.4367362   9.878017
##   14     13.44865  0.4140715  10.065088
##   15     13.60135  0.4034269  10.188150
##   16     13.79361  0.3943904  10.247160
##   17     14.00756  0.3845119  10.412776
##   18     14.18113  0.3711378  10.587027
##   19     14.25674  0.3703610  10.575726
##   20     14.33121  0.3723176  10.679764
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.

The optimal number of latent variables is 6, selected based on the lowest RMSE.

The corresponding resampled estimate of R² = 0.5336 (approximately 0.534).

Predict the response for the test set. What is the test set estimate of R2?

pred <- predict(pls_model, x_test)

postResample(pred, y_test)

##       RMSE   Rsquared        MAE 
## 12.3486900  0.3244542  8.2881075

The response was predicted for the test set using the tuned PLS model.

The test set estimate of R² = 0.324.

Try building other models discussed in this chapter. Do any have better predictive performance?

library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 4.1-10

ridge_model <- train(
  x_train, y_train,
  method = "glmnet",
  preProcess = c("center", "scale"),
  tuneGrid = expand.grid(alpha = 0, lambda = seq(0.001, 1, length = 20)),
  trControl = ctrl
)

lasso_model <- train(
  x_train, y_train,
  method = "glmnet",
  preProcess = c("center", "scale"),
  tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 1, length = 20)),
  trControl = ctrl
)

ridge_model

## glmnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 118, 118, 121, 120, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda      RMSE      Rsquared   MAE    
##   0.00100000  11.33669  0.5731857  8.54889
##   0.05357895  11.33669  0.5731857  8.54889
##   0.10615789  11.33669  0.5731857  8.54889
##   0.15873684  11.33669  0.5731857  8.54889
##   0.21131579  11.33669  0.5731857  8.54889
##   0.26389474  11.33669  0.5731857  8.54889
##   0.31647368  11.33669  0.5731857  8.54889
##   0.36905263  11.33669  0.5731857  8.54889
##   0.42163158  11.33669  0.5731857  8.54889
##   0.47421053  11.33669  0.5731857  8.54889
##   0.52678947  11.33669  0.5731857  8.54889
##   0.57936842  11.33669  0.5731857  8.54889
##   0.63194737  11.33669  0.5731857  8.54889
##   0.68452632  11.33669  0.5731857  8.54889
##   0.73710526  11.33669  0.5731857  8.54889
##   0.78968421  11.33669  0.5731857  8.54889
##   0.84226316  11.33669  0.5731857  8.54889
##   0.89484211  11.33669  0.5731857  8.54889
##   0.94742105  11.33669  0.5731857  8.54889
##   1.00000000  11.33669  0.5731857  8.54889
## 
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 1.

lasso_model

## glmnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 120, 120, 119, 118, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda      RMSE      Rsquared   MAE     
##   0.00100000  12.74117  0.4723781  9.176338
##   0.05357895  12.74117  0.4723781  9.176338
##   0.10615789  12.69336  0.4741610  9.150030
##   0.15873684  12.09979  0.5050067  8.816790
##   0.21131579  11.76073  0.5248985  8.680782
##   0.26389474  11.59842  0.5329640  8.614591
##   0.31647368  11.52846  0.5358785  8.577002
##   0.36905263  11.48823  0.5381836  8.545427
##   0.42163158  11.45519  0.5399699  8.508475
##   0.47421053  11.41375  0.5428282  8.466997
##   0.52678947  11.36201  0.5462513  8.427929
##   0.57936842  11.36583  0.5459328  8.442941
##   0.63194737  11.39464  0.5439284  8.474881
##   0.68452632  11.43031  0.5417073  8.498274
##   0.73710526  11.47169  0.5388543  8.509875
##   0.78968421  11.50303  0.5370308  8.503900
##   0.84226316  11.52698  0.5356889  8.486556
##   0.89484211  11.53477  0.5347036  8.459214
##   0.94742105  11.52859  0.5347867  8.427965
##   1.00000000  11.52381  0.5349053  8.402687
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.5267895.

6.2 (e)

Additional models were built using Ridge and Lasso regression.

Results:

PLS Model
- R² (CV): 0.534
Ridge Regression
- R² (CV): 0.573
Lasso Regression
- R² (CV): 0.546

Answer:

Yes. Ridge regression performed better than the PLS model, with the highest
resampled R² = 0.573. Lasso also performed slightly better than PLS but worse than Ridge.

Would you recommend any of your models to replace the permeability laboratory experiment?

postResample(predict(ridge_model, x_test), y_test)

##       RMSE   Rsquared        MAE 
## 11.0203929  0.3266984  7.6511575

postResample(predict(lasso_model, x_test), y_test)

##       RMSE   Rsquared        MAE 
## 10.9899991  0.3902119  7.3778673

Test set performance:

Ridge Regression
- R² = 0.327
Lasso Regression
- R² = 0.390

Although the Lasso model performed best, the test set R² values are relatively low,
indicating limited predictive accuracy.

No. The predictive performance is not strong enough to replace the permeability
laboratory experiment. The models may be useful for screening compounds, but not
as a full replacement for laboratory measurements.

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the re- lationship between biological measurements of the raw materials (predictors), 6.5 Computing 139 measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing pro- cess. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch: (a) Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(chemicalManufacturing) The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

library(AppliedPredictiveModeling)
library(caret)

data(ChemicalManufacturingProcess)

x <- ChemicalManufacturingProcess[, -1]  
y <- ChemicalManufacturingProcess$Yield

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

preProc <- preProcess(x, method = "medianImpute")

x_imputed <- predict(preProc, x)

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

set.seed(123)

trainIndex <- createDataPartition(y, p = .8, list = FALSE)

x_train <- x_imputed[trainIndex, ]
x_test  <- x_imputed[-trainIndex, ]

y_train <- y[trainIndex]
y_test  <- y[-trainIndex]

ctrl <- trainControl(method = "cv", number = 10)

pls_model <- train(
  x_train, y_train,
  method = "pls",
  preProcess = c("center","scale"),
  tuneLength = 20,
  trControl = ctrl
)

pls_model

## Partial Least Squares 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 131, 130, 130, 129, 131, 129, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     1.469135  0.4413084  1.1554181
##    2     2.041088  0.5084869  1.2301350
##    3     1.267233  0.5914708  0.9759364
##    4     1.307502  0.6017702  1.0043410
##    5     1.751768  0.5622009  1.1346052
##    6     1.783366  0.5694847  1.1446377
##    7     1.906613  0.5578632  1.1838996
##    8     1.966814  0.5397329  1.2163865
##    9     2.161597  0.5137488  1.2932272
##   10     2.497343  0.4954145  1.3993903
##   11     2.831450  0.4626421  1.5298526
##   12     3.363236  0.4204766  1.6920564
##   13     3.767530  0.3965149  1.8190061
##   14     3.934273  0.3878164  1.8747674
##   15     4.039276  0.3867160  1.8974451
##   16     4.106236  0.3838261  1.9225634
##   17     4.104890  0.3833394  1.9193643
##   18     4.118901  0.3826933  1.9284710
##   19     4.124366  0.3823171  1.9367371
##   20     4.235183  0.3831376  1.9693438
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.

A Partial Least Squares (PLS) model was trained using 10-fold cross-validation.

The optimal number of components was 3, selected based on the lowest RMSE.

The corresponding performance metrics were:

Optimal latent variables: 3
RMSE: 1.267
R²: 0.591

The optimal value of the performance metric occurs at 3 latent variables, with a cross-validated R² = 0.591.

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

pred <- predict(pls_model, x_test)

postResample(pred, y_test)

##      RMSE  Rsquared       MAE 
## 1.3836482 0.4658466 1.1737971

The test set performance metrics are:

RMSE: 1.384
R²: 0.466

The test set performance is worse than the resampled training performance:

Training R²: 0.591
Test R²: 0.466

The test set R² is 0.466, which is lower than the resampled training R² of 0.591, indicating a decrease in predictive performance on new data.

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

print("The most important predictors are:")

## [1] "The most important predictors are:"

varImp(pls_model)

## 
## Attaching package: 'pls'

## The following object is masked from 'package:caret':
## 
##     R2

## The following object is masked from 'package:stats':
## 
##     loadings

## pls variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess17   88.56
## ManufacturingProcess13   88.43
## ManufacturingProcess09   88.21
## ManufacturingProcess36   87.39
## ManufacturingProcess06   70.42
## BiologicalMaterial06     62.98
## ManufacturingProcess33   62.79
## BiologicalMaterial03     61.77
## BiologicalMaterial02     61.50
## BiologicalMaterial08     60.52
## ManufacturingProcess11   57.13
## BiologicalMaterial11     55.73
## BiologicalMaterial12     55.64
## BiologicalMaterial01     53.52
## BiologicalMaterial04     51.96
## ManufacturingProcess28   49.54
## ManufacturingProcess12   46.99
## ManufacturingProcess37   45.75
## BiologicalMaterial10     42.94

Most of the top predictors are manufacturing process variables, with only a few biological material variables appearing in the top list.

The manufacturing process predictors dominate the list of important variables, suggesting that adjusting process conditions may have the greatest impact on improving yield.

Explore the relationships between each of the top predictors and the re- sponse. How could this information be helpful in improving yield in future runs of the manufacturing process?

top_vars <- varImp(pls_model)$importance
top_vars$Predictor <- rownames(top_vars)
top_vars <- top_vars[order(-top_vars$Overall), ]

# top 10 names
top_names <- top_vars$Predictor[1:10]
top_vars[1:10, ]

##                          Overall              Predictor
## ManufacturingProcess32 100.00000 ManufacturingProcess32
## ManufacturingProcess17  88.55696 ManufacturingProcess17
## ManufacturingProcess13  88.42528 ManufacturingProcess13
## ManufacturingProcess09  88.21006 ManufacturingProcess09
## ManufacturingProcess36  87.38600 ManufacturingProcess36
## ManufacturingProcess06  70.42444 ManufacturingProcess06
## BiologicalMaterial06    62.97962   BiologicalMaterial06
## ManufacturingProcess33  62.78626 ManufacturingProcess33
## BiologicalMaterial03    61.76596   BiologicalMaterial03
## BiologicalMaterial02    61.50390   BiologicalMaterial02

# scatter plots
par(mfrow = c(2, 5))
for (v in top_names) {
  plot(x_train[, v], y_train, xlab = v, ylab = "Yield", main = v)
  abline(lm(y_train ~ x_train[, v]), col = "red")
}

6.3 (f)

The plots show that several manufacturing process variables have clear linear relationships with yield. Some predictors such as ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess06 show positive relationships, indicating that increasing these values tends to increase yield. Others such as ManufacturingProcess17 and ManufacturingProcess13 show negative relationships, suggesting that higher values may decrease yield.

Biological material variables such as BiologicalMaterial06, BiologicalMaterial03, and BiologicalMaterial02 also show moderate positive relationships with yield, indicating that raw material quality affects final output.

This information can help improve yield by:

Adjusting manufacturing process variables that positively impact yield
Avoiding process settings that reduce yield
Screening biological raw materials before production
Optimizing manufacturing conditions for higher yield

Data624_Homework7_ahmed_hassan

Ahmed Hassan

2026-04-12

6.2 (e)

6.3 (f)