6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data:

data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

fingerprint = fingerprints%>%data.frame()

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

388 predictors left after removing near zero variance predictors

fp_0var_removed = fingerprint[-nearZeroVar(fingerprints)]
fp_0var_removed%>%length()
## [1] 388

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

model automatically selected 102 features to keep

fp_0var_removed['permeability'] = c(permeability)

set.seed(12345)
train_index = sample(round(dim(fp_0var_removed)[1]*.70))
fp_train = fp_0var_removed[train_index,]
fp_test = fp_0var_removed[-train_index,]

pls_model = plsr(permeability~., data = fp_train, scale=TRUE, validation="CV")
pls_model$ncomp
## [1] 102

We are able to retain over 90% variation with only 25 comps

summary(pls_model)
## Data:    X dimension: 115 388 
##  Y dimension: 115 1
## Fit method: kernelpls
## Number of components considered: 102
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           13.59    12.60    12.13    12.63    13.02    12.91    12.96
## adjCV        13.59    12.57    12.05    12.47    12.74    12.61    12.67
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       12.71    12.81    13.17     13.47     13.54     14.04     14.37
## adjCV    12.39    12.49    12.81     13.04     13.11     13.56     13.89
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps  20 comps
## CV        14.59     14.78     14.96     15.33     15.41      15.7     16.01
## adjCV     14.09     14.26     14.42     14.76     14.82      15.1     15.37
##        21 comps  22 comps  23 comps  24 comps  25 comps  26 comps  27 comps
## CV        16.30     16.63     16.84     17.24     17.56     17.84     17.89
## adjCV     15.65     15.95     16.14     16.51     16.81     17.08     17.12
##        28 comps  29 comps  30 comps  31 comps  32 comps  33 comps  34 comps
## CV        18.09     18.56     18.88     19.11     19.27     19.46     19.76
## adjCV     17.30     17.75     18.06     18.27     18.43     18.60     18.88
##        35 comps  36 comps  37 comps  38 comps  39 comps  40 comps  41 comps
## CV        20.14     20.66     21.25     21.68     22.11     22.70     23.30
## adjCV     19.24     19.74     20.30     20.71     21.11     21.68     22.24
##        42 comps  43 comps  44 comps  45 comps  46 comps  47 comps  48 comps
## CV        23.97     24.50     24.97     25.56     26.04     26.64     27.18
## adjCV     22.88     23.38     23.83     24.39     24.84     25.41     25.93
##        49 comps  50 comps  51 comps  52 comps  53 comps  54 comps  55 comps
## CV        27.79     28.42     28.94     29.39     29.68     30.09     30.71
## adjCV     26.50     27.10     27.59     28.02     28.29     28.68     29.27
##        56 comps  57 comps  58 comps  59 comps  60 comps  61 comps  62 comps
## CV        31.36     32.12     32.72     33.31     33.80     34.14     34.54
## adjCV     29.88     30.60     31.16     31.72     32.19     32.50     32.88
##        63 comps  64 comps  65 comps  66 comps  67 comps  68 comps  69 comps
## CV        34.88     35.15     35.36     35.63     35.78     35.87     35.94
## adjCV     33.20     33.46     33.65     33.91     34.05     34.14     34.21
##        70 comps  71 comps  72 comps  73 comps  74 comps  75 comps  76 comps
## CV        35.96     36.01     36.06     36.08     36.08     36.10     36.10
## adjCV     34.23     34.27     34.32     34.33     34.34     34.35     34.36
##        77 comps  78 comps  79 comps  80 comps   81 comps   82 comps   83 comps
## CV     59884468  59882891  59882893  97051823  111189110  125373985  125373752
## adjCV  56674003  56672510  56672513  92121077  105587174  119093647  119093425
##         84 comps   85 comps   86 comps   87 comps   88 comps   89 comps
## CV     125374181  125374034  125374240  125374188  125373590  125374132
## adjCV  119093833  119093691  119093888  119093838  119093269  119093786
##         90 comps   91 comps   92 comps   93 comps   94 comps   95 comps
## CV     125374228  125373995  125374070  125374215  125374322  125374036
## adjCV  119093877  119093655  119093725  119093865  119093966  119093694
##         96 comps   97 comps   98 comps   99 comps  100 comps  101 comps
## CV     125374432  125374071  125374240  125374143  125374074  125374024
## adjCV  119094070  119093727  119093887  119093796  119093730  119093682
##        102 comps
## CV     125374354
## adjCV  119093997
## 
## TRAINING: % variance explained
##               1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X               18.46    32.36    37.85    42.26    50.21    57.91    61.27
## permeability    27.65    43.14    51.56    60.51    65.45    69.12    73.07
##               8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## X               66.21    68.77     70.45     73.28     75.32     78.84
## permeability    74.33    75.99     78.46     79.86     81.09     81.69
##               14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## X                80.81     82.62     83.93     85.71     87.06     88.96
## permeability     82.74     83.65     84.78     85.58     86.34     86.86
##               20 comps  21 comps  22 comps  23 comps  24 comps  25 comps
## X                90.05     91.31     91.99     92.62     93.00     93.38
## permeability     87.60     88.04     88.64     89.12     89.73     90.18
##               26 comps  27 comps  28 comps  29 comps  30 comps  31 comps
## X                93.85     94.11     94.45     94.94     95.20     95.49
## permeability     90.47     90.94     91.25     91.41     91.62     91.75
##               32 comps  33 comps  34 comps  35 comps  36 comps  37 comps
## X                95.81     96.11     96.31     96.60     96.82     97.08
## permeability     91.88     92.00     92.13     92.23     92.31     92.37
##               38 comps  39 comps  40 comps  41 comps  42 comps  43 comps
## X                97.30     97.53     97.67     97.83     98.00     98.11
## permeability     92.41     92.46     92.54     92.61     92.64     92.67
##               44 comps  45 comps  46 comps  47 comps  48 comps  49 comps
## X                98.23     98.34     98.44     98.54     98.68      98.8
## permeability     92.69     92.72     92.74     92.76     92.78      92.8
##               50 comps  51 comps  52 comps  53 comps  54 comps  55 comps
## X                98.85     98.93     98.99     99.06     99.15     99.25
## permeability     92.82     92.83     92.84     92.85     92.86     92.87
##               56 comps  57 comps  58 comps  59 comps  60 comps  61 comps
## X                99.35     99.41     99.47     99.54     99.60     99.63
## permeability     92.88     92.89     92.90     92.91     92.92     92.93
##               62 comps  63 comps  64 comps  65 comps  66 comps  67 comps
## X                99.66     99.69     99.71     99.73     99.76     99.80
## permeability     92.94     92.94     92.95     92.95     92.95     92.95
##               68 comps  69 comps  70 comps  71 comps  72 comps  73 comps
## X                99.83     99.86     99.87     99.89     99.90     99.92
## permeability     92.95     92.95     92.95     92.95     92.95     92.95
##               74 comps  75 comps  76 comps  77 comps  78 comps  79 comps
## X                99.94     99.95     99.95     99.97     99.97     99.98
## permeability     92.95     92.95     92.95     92.95     92.95     92.95
##               80 comps  81 comps  82 comps  83 comps  84 comps  85 comps
## X                99.98     99.98     99.99     99.99    100.00    100.05
## permeability     92.95     92.95     92.95     92.95     92.95     92.95
##               86 comps  87 comps  88 comps  89 comps  90 comps  91 comps
## X               100.10    100.15    100.20    100.25    100.31    100.36
## permeability     92.95     92.95     92.95     92.95     92.95     92.95
##               92 comps  93 comps  94 comps  95 comps  96 comps  97 comps
## X               100.41    100.46    100.51    100.56    100.61    100.66
## permeability     92.95     92.95     92.95     92.95     92.95     92.95
##               98 comps  99 comps  100 comps  101 comps  102 comps
## X               100.71    100.76     100.81     100.86     100.92
## permeability     92.95     92.95      92.95      92.95      92.95

training data has r^2 value of 0.9

train_pred = predict(pls_model,fp_train,ncomp=25)
cor(train_pred,fp_train$permeability)^2
## [1] 0.9018063

(d) Predict the response for the test set. What is the test set estimate of R2?

Despite a high r^2 value from the training set, the fit is much worse for the test set (0.48)

test_pred = predict(pls_model,fp_test,ncomp=25)
cor(test_pred,fp_test$permeability)^2
## [1] 0.4828464

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

Using PCA on the data, we can see that after 28 components we capture 90% of the variation

pca = prcomp(fingerprint, rank=28)
summary(pca)
## Importance of first k=28 (out of 165) components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     4.5542 2.6903 2.5000 1.94669 1.73476 1.65282 1.51822
## Proportion of Variance 0.2947 0.1028 0.0888 0.05384 0.04276 0.03881 0.03275
## Cumulative Proportion  0.2947 0.3975 0.4863 0.54013 0.58289 0.62170 0.65445
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.43449 1.32210 1.25774 1.17645 1.15103 1.03146 0.93443
## Proportion of Variance 0.02924 0.02483 0.02247 0.01966 0.01882 0.01512 0.01241
## Cumulative Proportion  0.68368 0.70852 0.73099 0.75066 0.76948 0.78460 0.79700
##                           PC15    PC16   PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.90183 0.88221 0.8474 0.77907 0.75733 0.74182 0.71650
## Proportion of Variance 0.01155 0.01106 0.0102 0.00862 0.00815 0.00782 0.00729
## Cumulative Proportion  0.80856 0.81961 0.8298 0.83844 0.84659 0.85441 0.86170
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.68195 0.66893 0.63600 0.61961 0.60665 0.60195 0.58285
## Proportion of Variance 0.00661 0.00636 0.00575 0.00545 0.00523 0.00515 0.00483
## Cumulative Proportion  0.86831 0.87466 0.88041 0.88587 0.89109 0.89624 0.90107

r^2 of the PCR model is 0.62 on the train data

fp_pca = data.frame(pca$x)
fp_pca['permeability'] = c(permeability)

fp_pca.train = fp_pca[train_index,]
fp_pca.test = fp_pca[-train_index,]

fp_lm = lm(permeability~., data = fp_pca)
summary(fp_lm)$r.squared
## [1] 0.6213252

r^2 of the PCR model is surprisingly higher for the test data, indicating that it is good for generalization (0.72)

pcr_test_pred = predict(fp_lm,fp_pca.test)
cor(pcr_test_pred,fp_pca.test$permeability)^2
## [1] 0.7242894

(f) Would you recommend any of your models to replace the permeability laboratory experiment?

I would choose the PCR model to replace the experiment as it performed better than the PLS. It is able to fit the data decently and generalize based on comparing training and test sets

6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data:

data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

chem_manuf = ChemicalManufacturingProcess

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

Data after knn imputation and scaling

knn = preProcess(chem_manuf, method = c('knnImpute'))
chem_imputed = predict(knn,chem_manuf)

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

Based on PLS model, only need 11 components to maintain 90% variance

set.seed(2022)

index = sample(round(dim(chem_imputed)[1]*.70))
chem_train = chem_imputed[index,]
chem_test = chem_imputed[-index,]

chem_pls = plsr(Yield~., data = chem_test, validation = 'CV')
summary(chem_pls)
## Data:    X dimension: 53 57 
##  Y dimension: 53 1
## Fit method: kernelpls
## Number of components considered: 46
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV          0.8652   0.8420   0.8053   0.7709   0.7468   0.7105   0.7092
## adjCV       0.8652   0.8414   0.7961   0.7593   0.7299   0.6923   0.6899
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV      0.7112   0.6870   0.6889    0.6595    0.6270    0.6044    0.5740
## adjCV   0.6857   0.6602   0.6627    0.6349    0.6049    0.5840    0.5555
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps  20 comps
## CV       0.5859    0.5988    0.6072    0.6119    0.6228    0.6496    0.6679
## adjCV    0.5663    0.5778    0.5852    0.5897    0.6001    0.6250    0.6417
##        21 comps  22 comps  23 comps  24 comps  25 comps  26 comps  27 comps
## CV       0.6857    0.7131    0.7199    0.7544    0.8026    0.8365    0.8663
## adjCV    0.6581    0.6837    0.6902    0.7230    0.7670    0.7994    0.8277
##        28 comps  29 comps  30 comps  31 comps  32 comps  33 comps  34 comps
## CV       0.9193    0.9533    0.9939     1.134     1.199     1.291      1.42
## adjCV    0.8775    0.9098    0.9482     1.080     1.142     1.229      1.35
##        35 comps  36 comps  37 comps  38 comps  39 comps  40 comps  41 comps
## CV        1.554     1.625     1.728     1.832     1.942     2.034     2.107
## adjCV     1.478     1.545     1.642     1.741     1.846     1.933     2.003
##        42 comps  43 comps  44 comps  45 comps  46 comps
## CV        2.162     2.216     2.280     2.312     2.334
## adjCV     2.054     2.105     2.165     2.196     2.216
## 
## TRAINING: % variance explained
##        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X        32.84    47.91    60.50    64.52    68.27    71.35    73.17    75.58
## Yield    20.09    46.69    57.94    71.20    77.11    80.16    84.68    87.33
##        9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X        79.09     81.61     84.36     86.72     88.53     90.13     91.24
## Yield    88.83     89.94     90.63     91.36     92.09     92.70     93.16
##        16 comps  17 comps  18 comps  19 comps  20 comps  21 comps  22 comps
## X         92.18     93.42     94.50     95.53     96.56     97.22     97.75
## Yield     93.59     93.85     94.06     94.36     94.65     94.90     95.08
##        23 comps  24 comps  25 comps  26 comps  27 comps  28 comps  29 comps
## X         98.11     98.55     98.65     98.92     99.21     99.39     99.54
## Yield     95.21     95.33     95.77     95.84     95.88     95.94     96.01
##        30 comps  31 comps  32 comps  33 comps  34 comps  35 comps  36 comps
## X         99.60     99.69     99.78     99.83     99.85     99.88     99.89
## Yield     96.15     96.25     96.34     96.46     96.74     96.91     97.29
##        37 comps  38 comps  39 comps  40 comps  41 comps  42 comps  43 comps
## X         99.90     99.93     99.94     99.96     99.96     99.97     99.99
## Yield     97.69     97.90     98.04     98.10     98.23     98.55     98.69
##        44 comps  45 comps  46 comps
## X         99.99     99.99    100.00
## Yield     98.99     99.35     99.56

RMSE of training set is 1.8

RMSE(predict(chem_pls,chem_train, ncomp = 11),chem_train$Yield)
## [1] 1.820533

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

RMSE of test set is even smaller with 0.26

RMSE(predict(chem_pls,chem_test, ncomp =11),chem_test$Yield)
## [1] 0.2598734

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

Manufacturing Process predictors seem to be more important than the Biological Materials

varImp(chem_pls)%>%
  arrange(-Overall)%>%
  head(10)
##                          Overall
## ManufacturingProcess31 0.8415415
## ManufacturingProcess28 0.6560867
## ManufacturingProcess27 0.5852799
## ManufacturingProcess29 0.5558900
## ManufacturingProcess20 0.5169115
## BiologicalMaterial06   0.4193666
## ManufacturingProcess16 0.3836831
## ManufacturingProcess26 0.3504278
## ManufacturingProcess02 0.3007938
## BiologicalMaterial03   0.2744944

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Since we are able to tell which predictors matter more than others, to increase yield we can improve upon said predictors. In this case, adjusting the manufacturing process seems to have a large impact on yield and thus that is one avenue to pursue to increasing yield. BiologicalMaterial06 and 03 seem to also be relatively important in the overall yield