DATA 624 Homework 7

Author

Henock Montcho

Published

May 15, 2025

1-) Developing a model to predict permeability (see Sect. 1.4) could save sig-nificant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling) 
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

library(caret)
fingerprints_filtered <- fingerprints[, -nearZeroVar(fingerprints)]

dim(fingerprints_filtered)

[1] 165 388

Comment: Out of 1,107 predictors the matrix fingerprints originally has, now there are only 388 predictors left for modeling.

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

set.seed(0001)

# index for training
index <- createDataPartition(permeability, p = .8, list = FALSE)

# training data 
training_perm <- permeability[index, ]
training_fp <- fingerprints[index, ]

# test
test_perm <- permeability[-index, ]
test_fingerprints <- fingerprints [-index, ]

# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)

plsTune <- train(training_fp, training_perm, method = "pls", metric = "Rsquared",
             tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))


plot(plsTune)

plsTune

Partial Least Squares 

 133 samples
1107 predictors

Pre-processing: centered (1107), scaled (1107) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 120, 121, 119, 120, 120, 120, ... 
Resampling results across tuning parameters:

  ncomp  RMSE      Rsquared   MAE     
   1     12.12952  0.4011904  9.305278
   2     10.94876  0.5105796  7.679658
   3     11.00854  0.5290113  7.754137
   4     11.20289  0.5225587  7.893784
   5     11.24215  0.4983011  7.954592
   6     11.20096  0.4969518  8.044244
   7     11.19472  0.5018788  7.934196
   8     11.25831  0.4982870  8.022500
   9     11.10065  0.5071009  7.863240
  10     10.91092  0.5152740  7.929696
  11     10.85914  0.5196915  7.953547
  12     10.85129  0.5185189  7.980026
  13     11.02004  0.5057632  8.088382
  14     11.16722  0.4951365  8.259389
  15     11.21854  0.4901999  8.296766
  16     11.19973  0.4921694  8.268655
  17     11.15026  0.4970592  8.248743
  18     11.27166  0.4879151  8.291501
  19     11.33344  0.4805361  8.271034
  20     11.39877  0.4740136  8.247344

Rsquared was used to select the optimal model using the largest value.
The final value used for the model was ncomp = 3.

Comment: The optimal tuning had 3 components with a corresponding R² of 0.5290113.

(d) Predict the response for the test set. What is the test set estimate of R²?

fingerprints_predict <- predict(plsTune, test_fingerprints)

postResample(fingerprints_predict, test_perm)

      RMSE   Rsquared        MAE 
15.9087452  0.1281726 10.9640796

Comment: The test set estimate of R² is 0.1281726.

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

Elastic Net Regression

library(elasticnet)
library(caret)

set.seed(0000004)
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))

#Removing zero-variance predictors from training_fp

nzv <- nearZeroVar(training_fp)
training_fp_clean <- training_fp[, -nzv]


# tuning penalized regression model
enetTune <- train(training_fp_clean, training_perm, method = "enet",
                  tuneGrid = enetGrid, trControl = ctrl, preProc = c("center", "scale"))


plot(enetTune)

enetTune

Elasticnet 

133 samples
384 predictors

Pre-processing: centered (384), scaled (384) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 120, 120, 119, 118, 120, 120, ... 
Resampling results across tuning parameters:

  lambda  fraction  RMSE       Rsquared   MAE      
  0.00    0.05       12.33567  0.4676084   9.329956
  0.00    0.10       11.60878  0.4912699   8.481412
  0.00    0.15       11.45539  0.4999618   8.249311
  0.00    0.20       11.47138  0.4947464   8.387189
  0.00    0.25       11.42483  0.4915525   8.397655
  0.00    0.30       11.43015  0.4837780   8.391405
  0.00    0.35       11.51210  0.4727124   8.448593
  0.00    0.40       11.64681  0.4615121   8.475189
  0.00    0.45       11.73717  0.4579426   8.470615
  0.00    0.50       11.90697  0.4512992   8.503301
  0.00    0.55       12.03958  0.4494149   8.572351
  0.00    0.60       12.15125  0.4468859   8.636492
  0.00    0.65       12.28727  0.4435424   8.718164
  0.00    0.70       12.40138  0.4418729   8.797107
  0.00    0.75       12.52158  0.4406467   8.886665
  0.00    0.80       12.65177  0.4399997   8.988894
  0.00    0.85       12.76168  0.4415999   9.063262
  0.00    0.90       12.87882  0.4431133   9.150393
  0.00    0.95       13.03417  0.4434389   9.273398
  0.00    1.00       13.20537  0.4427072   9.402417
  0.01    0.05       18.32558  0.4803515  12.223604
  0.01    0.10       24.08812  0.4867543  15.524638
  0.01    0.15       29.68211  0.5190491  18.853631
  0.01    0.20       34.29551  0.5440964  20.887165
  0.01    0.25       38.52580  0.5552071  23.053455
  0.01    0.30       42.76244  0.5708897  25.241745
  0.01    0.35       47.21529  0.5829363  27.440069
  0.01    0.40       51.84910  0.5928855  30.671362
  0.01    0.45       56.62818  0.5993495  34.043848
  0.01    0.50       61.59247  0.6058822  37.475776
  0.01    0.55       66.61483  0.6136996  40.917738
  0.01    0.60       71.74613  0.6183458  44.407092
  0.01    0.65       76.98639  0.6178066  47.916930
  0.01    0.70       82.30906  0.6135315  51.441291
  0.01    0.75       87.66894  0.6074345  54.903148
  0.01    0.80       92.87197  0.6026133  58.194595
  0.01    0.85       98.08663  0.5975690  61.506843
  0.01    0.90      103.33599  0.5903806  64.839742
  0.01    0.95      108.51112  0.5857670  68.118289
  0.01    1.00      113.65206  0.5813243  71.366463
  0.10    0.05       11.87540  0.4999528   8.968315
  0.10    0.10       11.37970  0.4790711   8.129369
  0.10    0.15       11.28786  0.4828563   8.225508
  0.10    0.20       11.19156  0.4925086   8.174445
  0.10    0.25       11.00415  0.5120630   8.064748
  0.10    0.30       10.85888  0.5250280   7.939745
  0.10    0.35       10.80365  0.5341819   7.841756
  0.10    0.40       10.75136  0.5444253   7.689580
  0.10    0.45       10.67963  0.5561593   7.598058
  0.10    0.50       10.58480  0.5680503   7.511252
  0.10    0.55       10.50778  0.5772374   7.422204
  0.10    0.60       10.43172  0.5853835   7.338467
  0.10    0.65       10.35096  0.5932400   7.282791
  0.10    0.70       10.30909  0.5988273   7.253586
  0.10    0.75       10.27973  0.6033158   7.242195
  0.10    0.80       10.28814  0.6045892   7.265009
  0.10    0.85       10.31172  0.6042252   7.298486
  0.10    0.90       10.33246  0.6037145   7.324348
  0.10    0.95       10.34068  0.6042237   7.329113
  0.10    1.00       10.34134  0.6049299   7.324218

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were fraction = 0.75 and lambda = 0.1.

enet_predict <- predict(enetTune, test_fingerprints)

postResample(enet_predict, test_perm)

      RMSE   Rsquared        MAE 
13.9743213  0.3564037 10.6458697

Comment: Here, the R² (0.3564037) is higher. The RMSE (13.9743213) and the MAE (10.6458697) are both lower compared to the PLS Method. We want a smaller RMSE and a higher R².

The least angle regression

set.seed(0101)

larsTune <- train(training_fp, training_perm, method = "lars", metric = "Rsquared",
                    tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"), use.Gram = FALSE)

plot(larsTune)

larsTune

Least Angle Regression 

 133 samples
1107 predictors

Pre-processing: centered (1107), scaled (1107) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 120, 121, 120, 119, 120, 119, ... 
Resampling results across tuning parameters:

  fraction  RMSE      Rsquared   MAE      
  0.05      11.81749  0.5527987   9.175989
  0.10      11.17743  0.5249322   8.055655
  0.15      11.03644  0.5177277   7.638767
  0.20      11.04605  0.5205102   7.478941
  0.25      11.01684  0.5290439   7.552544
  0.30      10.92316  0.5405664   7.660368
  0.35      10.95592  0.5377995   7.811794
  0.40      11.16505  0.5208305   7.963949
  0.45      11.30381  0.5081337   8.103849
  0.50      11.48684  0.4959009   8.269503
  0.55      11.70571  0.4837794   8.425487
  0.60      12.04788  0.4653065   8.619035
  0.65      12.38947  0.4539916   8.811673
  0.70      12.63420  0.4497880   8.961094
  0.75      12.95054  0.4427984   9.155413
  0.80      13.41827  0.4243703   9.492906
  0.85      13.94331  0.4054214   9.904694
  0.90      14.50708  0.3892031  10.346220
  0.95      14.95799  0.3781455  10.718062
  1.00      15.51268  0.3625228  11.147956

Rsquared was used to select the optimal model using the largest value.
The final value used for the model was fraction = 0.05.

lars_predict <- predict(larsTune, test_fingerprints)

postResample(lars_predict, test_perm)

      RMSE   Rsquared        MAE 
12.2557777  0.5142676  9.4099644

Comment: In this case, the R² (0.5142676) is higher. The RMSE (12.2557777) and the MAE (9.4099644) are both lower compared to the PLS and the ENET Methods. We want a smaller RMSE and a higher R².

(f) Would you recommend any of your models to replace the permeability laboratory experiment?

I would recommend the

I would recommend the Least Angle Regression (LARS) Method as it demonstrated superior performance metrics by achieving higher R² and lower RMSE and MAE.

2-) A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the re-lationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing pro-cess. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

#Determining missing values
sum(is.na(ChemicalManufacturingProcess))

[1] 106

missing <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(missing, ChemicalManufacturingProcess)

sum(is.na(Chemical))

[1] 0

Comment: The ChemicalManufacturingProcess dataset contained 106 missing values, which were imputed using bagged trees. This method leverages all other available variables to predict and fill in the missing data.

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

The LARS Method will be used

# filtering low frequencies
Chemical <- Chemical[, -nearZeroVar(Chemical)]

set.seed(1122)

# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)

# train 
train_chemical <- Chemical[index, ]

# test
test_chemical <- Chemical[-index, ]


set.seed(9987)

larsTune <- train(Yield ~ ., Chemical , method = "lars", metric = "Rsquared",
                    tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))

plot(larsTune)

larsTune

Least Angle Regression 

176 samples
 56 predictor

Pre-processing: centered (56), scaled (56) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 158, 158, 158, 158, 158, 159, ... 
Resampling results across tuning parameters:

  fraction  RMSE      Rsquared   MAE      
  0.05      1.272531  0.6045198  1.0446041
  0.10      1.167676  0.5970666  0.9406112
  0.15      1.212814  0.5834683  0.9629513
  0.20      1.458800  0.5455864  1.0276118
  0.25      1.556449  0.5325864  1.0610097
  0.30      1.746726  0.5040319  1.1245147
  0.35      1.900979  0.5245469  1.1536580
  0.40      2.028951  0.5104204  1.1954196
  0.45      2.009926  0.5105183  1.1925722
  0.50      1.972459  0.5086036  1.1893247
  0.55      1.922096  0.5145047  1.1809228
  0.60      1.908505  0.5097956  1.1876830
  0.65      1.903838  0.5059597  1.1933662
  0.70      1.891772  0.5089519  1.1946916
  0.75      1.958601  0.5150667  1.2130396
  0.80      2.008774  0.5209597  1.2265797
  0.85      2.194891  0.5120148  1.2727253
  0.90      2.672036  0.4822419  1.4058865
  0.95      3.204179  0.4659763  1.5445333
  1.00      3.537950  0.4491486  1.6293659

Rsquared was used to select the optimal model using the largest value.
The final value used for the model was fraction = 0.05.

Comment: Rsquared was used to select the optimal model using the largest value. The final value used for the model was fraction = 0.1 for RMSE = 1.149218, R² = 0.6307070 and MAE = 0.9336483.

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

lars_predict <- predict(larsTune, test_chemical[ ,-1])

postResample(lars_predict, test_chemical[ ,1])

    RMSE Rsquared      MAE 
1.468911 0.686915 1.168253

Comment: The performance metrics (RMSE, MAE and R²) are higher.

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

varImp(larsTune)

loess r-squared variable importance

  only 20 most important variables shown (out of 56)

                       Overall
ManufacturingProcess32  100.00
ManufacturingProcess13   90.03
BiologicalMaterial06     84.57
ManufacturingProcess36   75.80
ManufacturingProcess17   74.89
BiologicalMaterial03     73.54
ManufacturingProcess09   70.38
BiologicalMaterial12     67.99
BiologicalMaterial02     65.34
ManufacturingProcess06   58.06
ManufacturingProcess33   49.82
BiologicalMaterial11     48.13
BiologicalMaterial04     47.15
BiologicalMaterial08     41.90
ManufacturingProcess11   41.54
BiologicalMaterial01     39.16
ManufacturingProcess31   35.59
ManufacturingProcess12   33.01
BiologicalMaterial09     32.44
ManufacturingProcess25   23.51

Comment: only 20 most important variables shown (out of 56). The process predictors dominated the list (11 over 20) with the highest overall importance.

(f) Explore the relationships between each of the top predictors and the re-sponse. How could this information be helpful in improving yield in future runs of the manufacturing process?

library(corrplot)
library(dplyr)
top10 <- varImp(larsTune)$importance  |>
  arrange(-Overall)  |>
  head(10)


Chemical  |>
  select(c("Yield", row.names(top10)))  |>
  cor()  |>
  corrplot()

Comment: The correlation plot indicates that ManufacturingProcess32 has the strongest positive correlation with Yield. Among the top ten influential variables, three show a negative correlation with Yield. These insights can guide future manufacturing runs, as they highlight key predictors that impact yield outcomes. To enhance or maximize yield, it may be beneficial to improve the precision of both manufacturing process measurements and the biological assessments of raw materials.