Question 6.2

Developing a model to predict permeability (see Sect.1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug.

Part A

Start R and use these commands to load the data:

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

Part B

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

Response

When using the nearZeroVar function, it reduces the predictors from 182655 to 705.

nearZeroVar(fingerprints)
##   [1]    7    8    9   10   13   14   17   18   19   22   23   24   30   31   32
##  [16]   33   34   45   77   81   82   83   84   85   89   90   91   92   95  100
##  [31]  104  105  106  107  109  110  112  113  114  115  116  117  119  120  122
##  [46]  123  124  128  131  132  134  135  136  137  139  140  144  145  147  148
##  [61]  149  151  155  160  161  164  165  166  216  217  218  219  220  222  243
##  [76]  252  259  273  275  277  282  283  287  288  289  292  346  347  348  349
##  [91]  350  351  352  353  354  363  364  365  369  375  379  384  391  393  397
## [106]  399  402  404  405  407  408  409  410  411  412  413  414  415  416  417
## [121]  418  419  420  421  422  423  424  425  426  427  428  429  430  431  432
## [136]  433  434  435  436  437  438  439  440  441  442  443  444  445  446  447
## [151]  448  449  450  451  452  453  454  455  456  457  458  459  460  461  462
## [166]  463  464  465  466  467  468  469  470  471  472  473  474  475  476  477
## [181]  478  479  480  481  482  483  484  485  486  487  488  489  490  491  492
## [196]  493  494  495  498  500  501  502  513  523  525  526  527  528  530  531
## [211]  532  533  534  535  536  537  538  539  540  541  542  543  544  545  546
## [226]  547  548  550  552  555  562  563  564  566  567  569  570  572  575  578
## [241]  579  580  581  582  583  584  585  586  587  588  589  596  605  606  607
## [256]  608  609  610  611  612  614  615  616  617  618  619  620  622  623  624
## [271]  625  626  627  628  629  630  631  632  633  634  635  636  637  638  639
## [286]  640  641  642  643  644  645  646  647  648  649  650  651  652  653  654
## [301]  655  656  657  658  659  660  661  662  663  664  665  666  667  668  669
## [316]  670  671  672  673  674  675  676  677  678  680  681  682  683  684  685
## [331]  686  687  688  689  690  691  692  693  694  695  696  697  706  707  708
## [346]  709  710  711  712  713  714  715  716  717  718  720  721  722  723  724
## [361]  725  726  727  728  729  730  731  734  735  736  737  738  739  740  741
## [376]  742  743  744  745  746  747  748  749  756  757  758  759  760  761  762
## [391]  763  764  765  766  767  768  769  770  771  772  777  778  779  781  783
## [406]  784  785  786  787  788  789  790  791  794  796  797  799  802  803  804
## [421]  807  808  809  810  811  814  815  816  817  818  819  820  821  822  823
## [436]  824  825  826  827  828  829  830  831  832  833  834  835  836  837  838
## [451]  839  840  841  842  843  844  845  846  847  848  849  850  851  852  853
## [466]  854  855  856  857  858  859  860  861  862  863  864  865  866  867  868
## [481]  869  870  871  872  873  874  875  876  877  878  879  880  881  882  883
## [496]  884  885  886  887  888  889  890  891  892  893  894  895  896  897  898
## [511]  899  900  901  902  903  904  905  906  907  908  909  910  911  912  913
## [526]  914  915  916  917  918  919  920  921  922  923  924  925  926  927  928
## [541]  929  930  931  932  933  934  935  936  937  938  939  940  941  942  943
## [556]  944  945  946  947  948  949  950  951  952  953  954  955  956  957  958
## [571]  959  960  961  962  963  964  965  966  967  968  969  970  971  972  973
## [586]  974  975  976  977  978  979  980  981  982  983  984  985  986  987  988
## [601]  989  990  991  992  993  994  995  996  997  998  999 1000 1001 1002 1003
## [616] 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018
## [631] 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033
## [646] 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048
## [661] 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063
## [676] 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078
## [691] 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093
## [706] 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107

Part C

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

Response

Since the values in our dataset are binary, comprised of 0s and 1s, centering or scaling are not necessary.

Based on the results of our PLS model, it determined that the optimal number of latent variables is 5. The corresponding R squared is 0.5311563

permeability_data <- data.frame(permeability = permeability, fingerprints)
nzv <- nearZeroVar(permeability_data)
permeability_data_filtered <- permeability_data[, -nzv]
set.seed(123)

trainIndex <- createDataPartition(permeability_data_filtered$permeability, p = 0.75, list = FALSE)
trainData <- permeability_data_filtered[trainIndex, ]
testData <- permeability_data_filtered[-trainIndex, ]
# Set up cross-validation 10-fold
ctrl <- trainControl(method = "cv", number = 10) # 10-fold 
set.seed(123)
# Train a PLS model on the training data
plsModel <- train(
  permeability ~ .,           
  data = trainData,           
  method = "pls",             
  tuneLength = 10,            
  trControl = ctrl
)

# Summarize the model
print(plsModel)
## Partial Least Squares 
## 
## 125 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 111, 112, 113, 113, 113, 113, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.87785  0.3223096  10.642327
##    2     11.93335  0.4605485   8.632442
##    3     11.57889  0.4697451   8.606836
##    4     11.58383  0.4890124   8.852139
##    5     11.34660  0.5199843   8.451817
##    6     11.16105  0.5509412   8.091074
##    7     11.32849  0.5514611   8.261755
##    8     11.62200  0.5392276   8.876021
##    9     12.10421  0.5174443   9.181724
##   10     12.58414  0.4854590   9.611658
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.

Part D

Predict the response for the test set. What is the test set estimate of R2?

Response

On the test set, the PLS model performs poorly, resulting in an R squared of 0.3593983.

# Evaluate the model on the test set
plsPred <- predict(plsModel, newdata = testData)
 
# Calculate performance metrics (for regression)
postResample(pred = plsPred, obs = testData$permeability)
##       RMSE   Rsquared        MAE 
## 11.6969547  0.3781227  8.0737779

Part E

Try building other models discussed in this chapter. Do any have better predictive performance?

Response

The OLR model perfomed the best across the RMSE, R squared and MAE. The R^2 squared was 0.378, meaning that it explains more variance in the target variable compared to PLS (0.359) and the PCR model (0.263). It also has the least prediction error.

# PCR model
set.seed(123)
pcrModel <- train(
  permeability ~ .,           
  data = trainData,           
  method = "pcr",             
  tuneLength = 10,            
  trControl = ctrl,           
)


print(pcrModel)
## Principal Component Analysis 
## 
## 125 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 111, 112, 113, 113, 113, 113, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     15.24104  0.1770015  11.825043
##    2     15.22293  0.1775454  11.920684
##    3     14.04574  0.3542879  10.794504
##    4     12.01014  0.4465492   8.892864
##    5     12.09482  0.4472377   8.859232
##    6     12.41015  0.4294419   9.092347
##    7     12.36056  0.4288184   9.047108
##    8     12.22602  0.4379938   9.053613
##    9     12.20281  0.4375291   9.004970
##   10     12.18693  0.4332150   9.020523
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 4.
# Train an OLR model 
set.seed(123)
olrModel <- plsr(
  permeability ~ .,          
  data = trainData,           
  ncomp = 10,                 
  validation = "CV",          
  method = "oscorespls"       
)

summary(olrModel)
## Data:    X dimension: 125 388 
##  Y dimension: 125 1
## Fit method: oscorespls
## Number of components considered: 10
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           16.09    14.01    12.09    11.66    11.98    11.83    11.62
## adjCV        16.09    13.99    12.06    11.61    11.86    11.68    11.42
##        7 comps  8 comps  9 comps  10 comps
## CV       11.74    12.03    12.68     12.94
## adjCV    11.54    11.78    12.37     12.60
## 
## TRAINING: % variance explained
##               1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X               32.64    44.55    51.16    57.20    65.26    67.47    70.38
## permeability    27.59    50.88    58.95    66.19    70.50    75.39    76.99
##               8 comps  9 comps  10 comps
## X               72.23    74.55     76.90
## permeability    79.24    81.01     82.47

Test PCR

pcrPred <- predict(pcrModel, newdata = testData)
pcrPerformance <- postResample(pred = pcrPred, obs = testData$permeability)


print(pcrPerformance)
##       RMSE   Rsquared        MAE 
## 12.2976342  0.2631962  8.6566083

Test OLR

olrPred <- predict(olrModel, newdata = testData, ncomp = 6)
olrPerformance <- postResample(pred = olrPred, obs = testData$permeability)

print(olrPerformance)
##       RMSE   Rsquared        MAE 
## 11.6969547  0.3781227  8.0737779

Part F

Would you recommend any of your models to replace the permeability laboratory experiment?

All of the models scored relative low Rsquared values. I wouldn’t replace the laboratory experiment with any of the models I used.

Question 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), 139 measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

Part A

Start R and use these commands to load the data:

data("ChemicalManufacturingProcess")

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for eachrun.

Part B

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect.3.8).

Response

cmp <- ChemicalManufacturingProcess %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

Part C

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

Response

set.seed(123) 

trainIndex <- createDataPartition(cmp$Yield, p = 0.75, list = FALSE)
trainData <- cmp[trainIndex, ]
testData <- cmp[-trainIndex, ]
# Train the PLS model 
set.seed(123)
plsModel <- train(
  Yield ~ .,                
  data = trainData,         
  method = "pls",           
  tuneLength = 10,          
  trControl = ctrl,         
  preProcess = c("center", "scale")
)


print(plsModel)
## Partial Least Squares 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 118, 117, 119, 120, 118, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.570635  0.4913847  1.190149
##    2     2.088226  0.4999404  1.260397
##    3     1.362611  0.5958961  1.026489
##    4     1.615482  0.5276703  1.124717
##    5     1.919107  0.5195987  1.170184
##    6     2.029507  0.5041811  1.206177
##    7     2.202229  0.4949182  1.271059
##    8     2.355408  0.4912501  1.332010
##    9     2.645476  0.4824542  1.430559
##   10     2.875658  0.4682531  1.507346
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.

Part D

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

# Predict on the test set using the optimal model
plsPred <- predict(plsModel, newdata = testData)

# Calculate performance metrics on the test set
testPerformance <- postResample(pred = plsPred, obs = testData$Yield)

print(testPerformance)
##      RMSE  Rsquared       MAE 
## 1.3062242 0.4794095 1.0894618

Part E

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

The most important variables tended to process predictors. Among the top five are the variables ManufacturingProcess 32, 09, 36, 17 and 13.

# Calculate variable importance
importance <- varImp(plsModel, scale = TRUE)

print(importance)
## pls variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   85.95
## ManufacturingProcess36   85.67
## ManufacturingProcess17   85.35
## ManufacturingProcess13   84.82
## ManufacturingProcess06   67.50
## BiologicalMaterial02     59.97
## ManufacturingProcess33   59.94
## BiologicalMaterial06     58.20
## BiologicalMaterial08     57.81
## BiologicalMaterial03     57.45
## ManufacturingProcess11   57.12
## BiologicalMaterial12     51.78
## BiologicalMaterial11     51.43
## BiologicalMaterial01     50.78
## BiologicalMaterial04     49.59
## ManufacturingProcess37   47.65
## ManufacturingProcess28   45.75
## ManufacturingProcess12   45.44
## ManufacturingProcess02   40.60

Part F

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Response

We can see that when using scatter plots to examine the relationship between top predictors and the Yield variable that there tends to be a positive or negative relationship. This can inform us on what part of manufacturing process are performing as expected and which aren’t.

ggplot(data = cmp, aes(x = Yield, y = ManufacturingProcess32)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) + # Linear trend line
  labs(title = "Scatterplot",
       x = "X Variable", y = "Y Variable") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

ggplot(data = cmp, aes(x = Yield, y = ManufacturingProcess09)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) + # Linear trend line
  labs(title = "Scatterplot",
       x = "X Variable", y = "Y Variable") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

ggplot(data = cmp, aes(x = Yield, y = ManufacturingProcess36)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) + # Linear trend line
  labs(title = "Scatterplot",
       x = "X Variable", y = "Y Variable") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

ggplot(data = cmp, aes(x = Yield, y = ManufacturingProcess17)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) + # Linear trend line
  labs(title = "Scatterplot",
       x = "X Variable", y = "Y Variable") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

ggplot(data = cmp, aes(x = Yield, y = ManufacturingProcess13)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) + # Linear trend line
  labs(title = "Scatterplot",
       x = "X Variable", y = "Y Variable") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'