library(AppliedPredictiveModeling)
library(caret)
library(caTools)
library(corrplot)
library(tidyverse)
library(pls)
library(arm)

Question 6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug.

A.

Start R and use these commands to load the data:

data(permeability)

B.

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

ncol(fingerprints)
## [1] 1107
nzv <- nearZeroVar(fingerprints)
fingerprints_filtered <- fingerprints[, -nzv]
num_predictors_left <- ncol(fingerprints_filtered)
num_predictors_left
## [1] 388

There were initially 1107 predictors, and 388 remain for modeling after filtering with nearZeroVar.
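To see why predictors are flagged, nearZeroVar can also return its screening metrics via caret's documented saveMetrics option; a quick sketch:

# Inspect the frequency-ratio and percent-unique metrics behind the filter
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics[order(-nzv_metrics$freqRatio), ])
# Count how many predictors were flagged as near-zero variance
sum(nzv_metrics$nzv)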

C.

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

set.seed(334)
index <- createDataPartition(permeability, p = .8, list = FALSE)

# Splitting Variables
train_fp <- fingerprints_filtered[index, ]
test_fp <- fingerprints_filtered[-index, ]
train_perm <- permeability[index]
test_perm <- permeability[-index]


# Cross Validation with 10 Fold
ctrl <- trainControl(method = "cv", number = 10)

plsmodel <- train(train_fp, train_perm, method = "pls",
                  metric = "Rsquared", tuneLength = 20, trControl = ctrl,
                  preProc = c("center", "scale"))

plot(plsmodel)

plsmodel
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 118, 121, 120, 121, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.99280  0.3070361  10.009069
##    2     12.16128  0.4339740   8.759500
##    3     11.91646  0.4313142   9.091931
##    4     11.71985  0.4403444   8.995034
##    5     11.52004  0.4849718   8.545651
##    6     11.22189  0.5120605   8.435937
##    7     11.12635  0.5352629   8.365518
##    8     11.15961  0.5243175   8.378692
##    9     11.33492  0.5060529   8.610821
##   10     11.29378  0.5167202   8.546452
##   11     11.25400  0.5231956   8.490934
##   12     11.29238  0.5252548   8.633259
##   13     11.36673  0.5138357   8.731267
##   14     11.42145  0.5074234   8.706800
##   15     11.67288  0.5035852   8.991768
##   16     12.14120  0.4803830   9.278428
##   17     12.49861  0.4605289   9.580962
##   18     12.77110  0.4494579   9.787425
##   19     12.81140  0.4481352   9.894780
##   20     13.24251  0.4282701  10.141927
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 7.

Judging from both the plot and the tuning table, the optimal number of latent variables is 7, with a corresponding resampled R-squared of 0.5353.
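The selected tuning value and its resampled metrics can also be pulled out programmatically rather than read off the table (bestTune and results are standard fields of a caret train object):

# Extract the selected number of components and its resampled metrics
plsmodel$bestTune
subset(plsmodel$results, ncomp == plsmodel$bestTune$ncomp)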

D.

Predict the response for the test set. What is the test set estimate of R2?

# Predicting the Value
predictions <- predict(plsmodel, test_fp)
# Combining the Observed and Prediction
pls1 <- data.frame(obs = test_perm, pred = predictions)
# Calculating prediction accuracy
defaultSummary(pls1)
##       RMSE   Rsquared        MAE 
## 12.0828371  0.4807966  9.5900950

The test set R-squared is 0.4808, somewhat lower than the resampled estimate of 0.5353.
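Equivalently, caret's postResample computes the same three statistics directly from the prediction and observation vectors:

# One-line alternative to building a data frame for defaultSummary
postResample(pred = predictions, obs = test_perm)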

E.

Try building other models discussed in this chapter. Do any have better predictive performance?

# LARS Model
set.seed(17)
larsmodel <- train(train_fp, train_perm, method = "lars", metric = "Rsquared",
                  tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))
plot(larsmodel)

larsmodel
## Least Angle Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 121, 120, 119, 119, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE      
##   0.05      11.89877  0.4596729   8.650788
##   0.10      11.57994  0.4676499   8.659622
##   0.15      11.49042  0.4660516   8.684607
##   0.20      11.79188  0.4556552   8.757311
##   0.25      12.00700  0.4480268   8.894531
##   0.30      12.20094  0.4307403   9.035129
##   0.35      12.39225  0.4231629   9.200676
##   0.40      12.81794  0.4049887   9.494958
##   0.45      13.36303  0.3879590   9.887695
##   0.50      13.88195  0.3741034  10.243365
##   0.55      14.29867  0.3672552  10.537459
##   0.60      14.73194  0.3560844  10.849888
##   0.65      15.12604  0.3521315  11.093740
##   0.70      15.70629  0.3482580  11.399051
##   0.75      16.32444  0.3448372  11.727394
##   0.80      16.75558  0.3440126  11.966841
##   0.85      17.32837  0.3354593  12.340370
##   0.90      18.08390  0.3248463  12.693740
##   0.95      18.86889  0.3122171  13.191606
##   1.00      19.69152  0.3021566  13.722607
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.1.
set.seed(18)
# Elastic Net Model
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))
enetmodel <- train(train_fp, train_perm, method = "enet",
                  tuneGrid = enetGrid, trControl = ctrl, preProc = c("center", "scale"))
plot(enetmodel)

enetmodel
## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 121, 119, 118, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.00    0.05      12.26077  0.4302238   9.068764
##   0.00    0.10      12.06994  0.4461138   8.710435
##   0.00    0.15      12.33234  0.4336162   9.037856
##   0.00    0.20      12.99665  0.4173619   9.596072
##   0.00    0.25      13.58559  0.4062614  10.104331
##   0.00    0.30      14.12676  0.4000653  10.474480
##   0.00    0.35      14.62540  0.3978212  10.870962
##   0.00    0.40      15.02091  0.3928948  11.235684
##   0.00    0.45      15.72952  0.3852346  11.744823
##   0.00    0.50      16.50410  0.3766445  12.313339
##   0.00    0.55      17.21365  0.3713162  12.801739
##   0.00    0.60      17.95253  0.3679022  13.261248
##   0.00    0.65      18.38448  0.3703363  13.563239
##   0.00    0.70      18.90872  0.3742150  13.855196
##   0.00    0.75      19.53553  0.3737620  14.108055
##   0.00    0.80      20.19605  0.3718859  14.397302
##   0.00    0.85      20.71126  0.3666779  14.514537
##   0.00    0.90      22.45772  0.3597702  15.061962
##   0.00    0.95      24.67305  0.3518519  15.646127
##   0.00    1.00      27.00508  0.3414503  16.498961
##   0.01    0.05      11.73387  0.4305112   8.383810
##   0.01    0.10      11.35687  0.4685121   8.118469
##   0.01    0.15      10.84106  0.4909361   7.763605
##   0.01    0.20      10.79683  0.4830988   7.779883
##   0.01    0.25      10.86258  0.4805777   7.764118
##   0.01    0.30      10.95105  0.4830138   7.798469
##   0.01    0.35      11.09094  0.4798167   7.935408
##   0.01    0.40      11.28013  0.4731556   8.093791
##   0.01    0.45      11.48214  0.4652692   8.259275
##   0.01    0.50      11.69816  0.4576207   8.454800
##   0.01    0.55      11.86757  0.4520983   8.600172
##   0.01    0.60      12.00163  0.4463469   8.706146
##   0.01    0.65      12.09158  0.4421529   8.812382
##   0.01    0.70      12.19029  0.4387677   8.946471
##   0.01    0.75      12.27026  0.4379888   9.095215
##   0.01    0.80      12.39657  0.4369456   9.264022
##   0.01    0.85      12.56291  0.4334777   9.415054
##   0.01    0.90      12.76677  0.4284080   9.586584
##   0.01    0.95      12.97475  0.4228666   9.745072
##   0.01    1.00      13.18892  0.4173654   9.898766
##   0.10    0.05      12.67053  0.3520824   9.588713
##   0.10    0.10      11.65835  0.4299693   8.303430
##   0.10    0.15      11.42173  0.4544492   8.092523
##   0.10    0.20      11.31504  0.4691684   8.070741
##   0.10    0.25      11.06593  0.4837197   7.914023
##   0.10    0.30      10.90718  0.4890784   7.819914
##   0.10    0.35      10.87891  0.4887349   7.786409
##   0.10    0.40      10.89849  0.4890789   7.779867
##   0.10    0.45      10.89071  0.4929433   7.785470
##   0.10    0.50      10.94018  0.4926868   7.838433
##   0.10    0.55      10.99183  0.4920893   7.917916
##   0.10    0.60      11.07382  0.4906018   8.015487
##   0.10    0.65      11.15629  0.4886977   8.093059
##   0.10    0.70      11.19853  0.4892970   8.145665
##   0.10    0.75      11.23936  0.4901289   8.196103
##   0.10    0.80      11.30112  0.4891838   8.268567
##   0.10    0.85      11.34713  0.4882780   8.328482
##   0.10    0.90      11.37836  0.4881447   8.367509
##   0.10    0.95      11.40881  0.4883520   8.403798
##   0.10    1.00      11.46064  0.4872150   8.456577
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.2 and lambda = 0.01.

Looking at the alternative models: LARS (least angle regression) selected fraction = 0.10, with a resampled R-squared of 0.4676. The elastic net model, which was tuned on RMSE, selected lambda = 0.01 and fraction = 0.20, where the resampled R-squared was 0.4831. Both models performed worse than the partial least squares model.
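For a side-by-side view, caret's resamples function can compare the cross-validated metrics of all three models. A strictly paired comparison would require fitting the models on identical folds (e.g., by passing the same index list to trainControl), which was not done here, so this is only a sketch:

# Compare resampled performance across the three fitted models
model_list <- resamples(list(PLS = plsmodel, LARS = larsmodel, ENet = enetmodel))
summary(model_list)
bwplot(model_list, metric = "Rsquared")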

F.

Would you recommend any of your models to replace the permeability laboratory experiment?

I wouldn’t recommend any of the models used here to replace the permeability laboratory experiment. That said, I would try XGBoost or a support vector machine to see whether those methods yield a higher R-squared and lower RMSE and MAE. Those methods tend to handle a large number of predictors more easily than the methods used here, and SVMs in particular can be effective when the number of predictors exceeds the number of samples.
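As a sketch of the SVM idea (not run here; assumes the kernlab package is installed), a radial-basis SVM could be tuned with the same resampling setup:

# Radial-basis SVM over a grid of cost values; sigma is estimated internally
set.seed(19)
svmmodel <- train(train_fp, train_perm, method = "svmRadial",
                  metric = "Rsquared", tuneLength = 10,
                  trControl = ctrl, preProc = c("center", "scale"))
svmmodel$bestTune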


Question 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

A.

Start R and use these commands to load the data:

data(ChemicalManufacturingProcess)
# Checking amount of missing data
sum(is.na(ChemicalManufacturingProcess))
## [1] 106

B.

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

# Imputing missing data by KNN
impute <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
imputed <- predict(impute, ChemicalManufacturingProcess)
# Checking how many missing data is left
sum(is.na(imputed))
## [1] 0

Using KNN imputation, the 106 missing values were replaced with imputed values. I went with KNN because biological data tends to exhibit patterns where observations with similar characteristics have similar outcomes, so the nearest neighbors of a sample should give reasonable estimates of its missing entries. I decided against dropping the incomplete rows because, given the number of samples affected, removing them risked discarding a real relationship in the data.
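One detail worth noting: caret's knnImpute centers and scales the data as a side effect (neighbor distances are computed on standardized values), and the number of neighbors can be set explicitly via the k argument (default 5); a sketch:

# knnImpute standardizes the data; k controls the number of neighbors
impute5 <- preProcess(ChemicalManufacturingProcess, method = "knnImpute", k = 5)
imputed5 <- predict(impute5, ChemicalManufacturingProcess)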

C.

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

imputed <- imputed[, -nearZeroVar(imputed)]
X <- dplyr::select(imputed, -Yield)
y <- imputed$Yield

set.seed(22)
index <- createDataPartition(y, p = .8, list = FALSE)

# Splitting Variables
train_X <- X[index, ] %>% as.matrix()
test_X <- X[-index, ] %>% as.matrix()
train_y <- y[index]
test_y <- y[-index]


# Cross Validation with 10 Fold
ctrl <- trainControl(method = "cv", number = 10)
plsmodel <- train(x = train_X, y = train_y, method = "pls", metric = "Rsquared",
                  tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))
plot(plsmodel)

plsmodel$results
##    ncomp      RMSE  Rsquared       MAE    RMSESD RsquaredSD     MAESD
## 1      1 0.7809845 0.4641117 0.6450193 0.2139658  0.2142161 0.1729941
## 2      2 0.7565412 0.5042178 0.6001322 0.2456486  0.2519112 0.1700493
## 3      3 0.7847969 0.5314297 0.6007152 0.3554721  0.1826585 0.1695716
## 4      4 0.9298371 0.5257252 0.6313315 0.7359768  0.2052627 0.2377050
## 5      5 1.0198305 0.5138693 0.6510734 0.9350179  0.1977708 0.2764945
## 6      6 1.1036741 0.5149142 0.6761522 1.1928133  0.1979130 0.3439004
## 7      7 1.1923739 0.5161161 0.6996351 1.4517427  0.2064039 0.4089646
## 8      8 1.3007643 0.5085149 0.7299349 1.6865050  0.2153006 0.4654808
## 9      9 1.3541116 0.5079805 0.7504194 1.8510684  0.2139640 0.5017224
## 10    10 1.3999438 0.4991521 0.7747183 1.9561621  0.2141850 0.5264882
## 11    11 1.4676867 0.4933880 0.7895968 2.1797237  0.2133892 0.5833295
## 12    12 1.5501788 0.4947442 0.8145576 2.4668646  0.2090101 0.6605158
## 13    13 1.5800876 0.5013401 0.8173735 2.6243889  0.2059180 0.6989431
## 14    14 1.5722512 0.5046182 0.8149677 2.6452301  0.2007664 0.7064636
## 15    15 1.5575107 0.5080047 0.8109084 2.6238350  0.2031070 0.7046450
## 16    16 1.5344163 0.5097083 0.8031382 2.5589561  0.2054466 0.6865275
## 17    17 1.5259804 0.5051724 0.8014791 2.5199227  0.2027532 0.6779517
## 18    18 1.5054252 0.5033078 0.7977358 2.4506908  0.2030902 0.6618426
## 19    19 1.4859956 0.5037595 0.7940358 2.3918544  0.2048032 0.6473744
## 20    20 1.4808569 0.4993669 0.7928088 2.3668462  0.2074360 0.6408500

The optimal number of components for PLS was 3, with an R-squared of 0.5314.
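Given how quickly the RMSE standard deviations grow past three components, a more conservative selection could apply caret's one-standard-error rule through the selectionFunction argument of trainControl; a sketch, not run here:

# Select the simplest candidate within one SE of the best performer
ctrl_1se <- trainControl(method = "cv", number = 10, selectionFunction = "oneSE")
plsmodel_1se <- train(x = train_X, y = train_y, method = "pls", metric = "Rsquared",
                      tuneLength = 20, trControl = ctrl_1se,
                      preProc = c("center", "scale"))
plsmodel_1se$bestTune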

D.

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

# Predicting the Value
prediction2 <- predict(plsmodel, test_X)
# Combining the Observed and Prediction
pls2 <- data.frame(obs = test_y, pred = prediction2)
# Calculating prediction accuracy
defaultSummary(pls2)
##      RMSE  Rsquared       MAE 
## 0.7758733 0.4299477 0.5180942

We went with PLS since it was the strongest performer in Question 6.2. The test set R-squared (0.4299) is lower than the resampled training estimate (0.5314), suggesting the model generalizes somewhat worse than the cross-validation results indicate and could benefit from further tuning.
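For a direct side-by-side view, the resampled and test set metrics can be collected into one table:

# Resampled (CV) metrics at the chosen ncomp versus held-out test metrics
cv_row <- subset(plsmodel$results, ncomp == plsmodel$bestTune$ncomp)
rbind(Resampled = unlist(cv_row[, c("RMSE", "Rsquared", "MAE")]),
      Test = defaultSummary(pls2))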

E.

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

# Top 10 Predictors in the Model
varImpPlot <- varImp(plsmodel, scale = FALSE)

Imp2 <- data.frame(
  Variable = rownames(varImpPlot$importance),
  Score = varImpPlot$importance$Overall
)

# Sort this data frame by Score in descending order
sorted_Imp2 <- Imp2[order(-Imp2$Score), ]

# Now, extracting the names of the top 10 predictors is straightforward
top10_pred <- head(sorted_Imp2$Variable, 10)
top10_pred
##  [1] "ManufacturingProcess32" "ManufacturingProcess13" "ManufacturingProcess36"
##  [4] "ManufacturingProcess09" "ManufacturingProcess17" "ManufacturingProcess33"
##  [7] "BiologicalMaterial02"   "BiologicalMaterial06"   "ManufacturingProcess06"
## [10] "BiologicalMaterial08"
top10_pred_df <- dplyr::select(X, all_of(top10_pred))
top10_pred_df$Yield <- y

Out of the top 10 predictors in the model, manufacturing process predictors dominate the list, accounting for 7 of the 10 versus 3 for biological materials.
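The split between predictor types can be tallied directly from the names, and caret's plot method for varImp objects draws the same ranking:

# Tally predictor types among the top 10 by name prefix
table(ifelse(grepl("^Manufacturing", top10_pred), "Process", "Biological"))
# Built-in importance plot for the top 10 predictors
plot(varImpPlot, top = 10)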

F.

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

# Correlation Data
corr_data <- cor(dplyr::select_if(top10_pred_df, is.numeric), use = "complete.obs")
# Correlation Plot
corrplot::corrplot(corr_data, method = 'number', type = 'lower', number.cex = 0.75, tl.cex= 0.75)

An interesting aspect of the correlations is that the three predictors negatively correlated with yield are all manufacturing process variables, with ManufacturingProcess36 showing the strongest negative correlation; on the positive side, ManufacturingProcess32 has the strongest positive correlation with yield at 0.61. All of the biological material predictors correlate positively with yield, so raw material quality appears to contribute directly to yield. Among the predictors themselves, the biological materials are all positively correlated with one another, while the strongest negative between-predictor correlation is -0.79, seen both between ManufacturingProcess36 and ManufacturingProcess32 and between ManufacturingProcess13 and ManufacturingProcess09.

This information could guide future runs: since the process predictors dominate the importance list and, unlike the biological predictors, can be changed, the settings that correlate positively with yield (such as ManufacturingProcess32) are natural candidates to increase and those that correlate negatively (such as ManufacturingProcess36) to reduce, with any change verified experimentally since correlation alone does not establish causation.
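The correlations with yield can also be read directly from the matrix rather than the plot; sorted, they make the direction of each top predictor explicit:

# Correlations of the top predictors with Yield, from most negative to most positive
yield_corr <- corr_data["Yield", colnames(corr_data) != "Yield"]
sort(yield_corr)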