Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
(a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
library(caret); library(pls); library(dplyr)
fingerprint <- fingerprints %>% data.frame()
(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
388 predictors remain after removing the near-zero-variance predictors:
# drop the sparse columns flagged by nearZeroVar
fp_0var_removed <- fingerprint[, -nearZeroVar(fingerprints)]
ncol(fp_0var_removed)
## [1] 388
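To see why columns are dropped, nearZeroVar can also return its diagnostics rather than just the column indices. A minimal sketch using caret's saveMetrics argument:
# freqRatio and percentUnique are the criteria behind the filtering above
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])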
(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
By default, plsr fits the maximum possible number of latent components, 102 here; the optimal number still has to be chosen from the cross-validation results below.
# attach the response, then make a 70/30 train/test split
fp_0var_removed['permeability'] <- c(permeability)
set.seed(12345)
train_index <- sample(nrow(fp_0var_removed), size = round(nrow(fp_0var_removed) * 0.70))
fp_train <- fp_0var_removed[train_index, ]
fp_test <- fp_0var_removed[-train_index, ]
pls_model <- plsr(permeability ~ ., data = fp_train, scale = TRUE, validation = "CV")
pls_model$ncomp
## [1] 102
With 25 components we retain over 90% of the training-set variance in permeability (see the table below), so ncomp = 25 is used for prediction:
summary(pls_model)
## Data: X dimension: 115 388
## Y dimension: 115 1
## Fit method: kernelpls
## Number of components considered: 102
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 13.59 12.60 12.13 12.63 13.02 12.91 12.96
## adjCV 13.59 12.57 12.05 12.47 12.74 12.61 12.67
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 12.71 12.81 13.17 13.47 13.54 14.04 14.37
## adjCV 12.39 12.49 12.81 13.04 13.11 13.56 13.89
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
## CV 14.59 14.78 14.96 15.33 15.41 15.7 16.01
## adjCV 14.09 14.26 14.42 14.76 14.82 15.1 15.37
## 21 comps 22 comps 23 comps 24 comps 25 comps 26 comps 27 comps
## CV 16.30 16.63 16.84 17.24 17.56 17.84 17.89
## adjCV 15.65 15.95 16.14 16.51 16.81 17.08 17.12
## 28 comps 29 comps 30 comps 31 comps 32 comps 33 comps 34 comps
## CV 18.09 18.56 18.88 19.11 19.27 19.46 19.76
## adjCV 17.30 17.75 18.06 18.27 18.43 18.60 18.88
## 35 comps 36 comps 37 comps 38 comps 39 comps 40 comps 41 comps
## CV 20.14 20.66 21.25 21.68 22.11 22.70 23.30
## adjCV 19.24 19.74 20.30 20.71 21.11 21.68 22.24
## 42 comps 43 comps 44 comps 45 comps 46 comps 47 comps 48 comps
## CV 23.97 24.50 24.97 25.56 26.04 26.64 27.18
## adjCV 22.88 23.38 23.83 24.39 24.84 25.41 25.93
## 49 comps 50 comps 51 comps 52 comps 53 comps 54 comps 55 comps
## CV 27.79 28.42 28.94 29.39 29.68 30.09 30.71
## adjCV 26.50 27.10 27.59 28.02 28.29 28.68 29.27
## 56 comps 57 comps 58 comps 59 comps 60 comps 61 comps 62 comps
## CV 31.36 32.12 32.72 33.31 33.80 34.14 34.54
## adjCV 29.88 30.60 31.16 31.72 32.19 32.50 32.88
## 63 comps 64 comps 65 comps 66 comps 67 comps 68 comps 69 comps
## CV 34.88 35.15 35.36 35.63 35.78 35.87 35.94
## adjCV 33.20 33.46 33.65 33.91 34.05 34.14 34.21
## 70 comps 71 comps 72 comps 73 comps 74 comps 75 comps 76 comps
## CV 35.96 36.01 36.06 36.08 36.08 36.10 36.10
## adjCV 34.23 34.27 34.32 34.33 34.34 34.35 34.36
## 77 comps 78 comps 79 comps 80 comps 81 comps 82 comps 83 comps
## CV 59884468 59882891 59882893 97051823 111189110 125373985 125373752
## adjCV 56674003 56672510 56672513 92121077 105587174 119093647 119093425
## 84 comps 85 comps 86 comps 87 comps 88 comps 89 comps
## CV 125374181 125374034 125374240 125374188 125373590 125374132
## adjCV 119093833 119093691 119093888 119093838 119093269 119093786
## 90 comps 91 comps 92 comps 93 comps 94 comps 95 comps
## CV 125374228 125373995 125374070 125374215 125374322 125374036
## adjCV 119093877 119093655 119093725 119093865 119093966 119093694
## 96 comps 97 comps 98 comps 99 comps 100 comps 101 comps
## CV 125374432 125374071 125374240 125374143 125374074 125374024
## adjCV 119094070 119093727 119093887 119093796 119093730 119093682
## 102 comps
## CV 125374354
## adjCV 119093997
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 18.46 32.36 37.85 42.26 50.21 57.91 61.27
## permeability 27.65 43.14 51.56 60.51 65.45 69.12 73.07
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 66.21 68.77 70.45 73.28 75.32 78.84
## permeability 74.33 75.99 78.46 79.86 81.09 81.69
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## X 80.81 82.62 83.93 85.71 87.06 88.96
## permeability 82.74 83.65 84.78 85.58 86.34 86.86
## 20 comps 21 comps 22 comps 23 comps 24 comps 25 comps
## X 90.05 91.31 91.99 92.62 93.00 93.38
## permeability 87.60 88.04 88.64 89.12 89.73 90.18
## 26 comps 27 comps 28 comps 29 comps 30 comps 31 comps
## X 93.85 94.11 94.45 94.94 95.20 95.49
## permeability 90.47 90.94 91.25 91.41 91.62 91.75
## 32 comps 33 comps 34 comps 35 comps 36 comps 37 comps
## X 95.81 96.11 96.31 96.60 96.82 97.08
## permeability 91.88 92.00 92.13 92.23 92.31 92.37
## 38 comps 39 comps 40 comps 41 comps 42 comps 43 comps
## X 97.30 97.53 97.67 97.83 98.00 98.11
## permeability 92.41 92.46 92.54 92.61 92.64 92.67
## 44 comps 45 comps 46 comps 47 comps 48 comps 49 comps
## X 98.23 98.34 98.44 98.54 98.68 98.8
## permeability 92.69 92.72 92.74 92.76 92.78 92.8
## 50 comps 51 comps 52 comps 53 comps 54 comps 55 comps
## X 98.85 98.93 98.99 99.06 99.15 99.25
## permeability 92.82 92.83 92.84 92.85 92.86 92.87
## 56 comps 57 comps 58 comps 59 comps 60 comps 61 comps
## X 99.35 99.41 99.47 99.54 99.60 99.63
## permeability 92.88 92.89 92.90 92.91 92.92 92.93
## 62 comps 63 comps 64 comps 65 comps 66 comps 67 comps
## X 99.66 99.69 99.71 99.73 99.76 99.80
## permeability 92.94 92.94 92.95 92.95 92.95 92.95
## 68 comps 69 comps 70 comps 71 comps 72 comps 73 comps
## X 99.83 99.86 99.87 99.89 99.90 99.92
## permeability 92.95 92.95 92.95 92.95 92.95 92.95
## 74 comps 75 comps 76 comps 77 comps 78 comps 79 comps
## X 99.94 99.95 99.95 99.97 99.97 99.98
## permeability 92.95 92.95 92.95 92.95 92.95 92.95
## 80 comps 81 comps 82 comps 83 comps 84 comps 85 comps
## X 99.98 99.98 99.99 99.99 100.00 100.05
## permeability 92.95 92.95 92.95 92.95 92.95 92.95
## 86 comps 87 comps 88 comps 89 comps 90 comps 91 comps
## X 100.10 100.15 100.20 100.25 100.31 100.36
## permeability 92.95 92.95 92.95 92.95 92.95 92.95
## 92 comps 93 comps 94 comps 95 comps 96 comps 97 comps
## X 100.41 100.46 100.51 100.56 100.61 100.66
## permeability 92.95 92.95 92.95 92.95 92.95 92.95
## 98 comps 99 comps 100 comps 101 comps 102 comps
## X 100.71 100.76 100.81 100.86 100.92
## permeability 92.95 92.95 92.95 92.95 92.95
The training-set R^2 at 25 components is about 0.90:
train_pred <- predict(pls_model, fp_train, ncomp = 25)
cor(train_pred, fp_train$permeability)^2
## [1] 0.9018063
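Note that the cross-validated RMSEP above is smallest at 2 components, so 25 is a judgment call based on variance explained. The pls package provides helpers to examine and automate this choice; a sketch (selectNcomp is available in recent versions of pls):
validationplot(pls_model, val.type = "RMSEP")  # plot the CV error curve
selectNcomp(pls_model, method = "onesigma")    # smallest ncomp within one SE of the CV minimum
R2(pls_model, estimate = "CV", ncomp = 25)     # resampled (CV) R^2 at the chosen ncomp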
(d) Predict the response for the test set. What is the test set estimate of R2?
Despite the high training R^2, the test-set R^2 is much lower (about 0.48), a sign that the 25-component model overfits:
test_pred <- predict(pls_model, fp_test, ncomp = 25)
cor(test_pred, fp_test$permeability)^2
## [1] 0.4828464
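R^2 alone hides the size of the errors in the original permeability units; caret's RMSE helper gives the corresponding test error (a small supplementary check):
RMSE(test_pred, fp_test$permeability)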
(e) Try building other models discussed in this chapter. Do any have better predictive performance?
Applying PCA to the full fingerprint matrix, the first 28 components capture about 90% of the predictor variance:
pca <- prcomp(fingerprint, rank. = 28)  # binary fingerprints, left unscaled
summary(pca)
## Importance of first k=28 (out of 165) components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 4.5542 2.6903 2.5000 1.94669 1.73476 1.65282 1.51822
## Proportion of Variance 0.2947 0.1028 0.0888 0.05384 0.04276 0.03881 0.03275
## Cumulative Proportion 0.2947 0.3975 0.4863 0.54013 0.58289 0.62170 0.65445
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.43449 1.32210 1.25774 1.17645 1.15103 1.03146 0.93443
## Proportion of Variance 0.02924 0.02483 0.02247 0.01966 0.01882 0.01512 0.01241
## Cumulative Proportion 0.68368 0.70852 0.73099 0.75066 0.76948 0.78460 0.79700
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.90183 0.88221 0.8474 0.77907 0.75733 0.74182 0.71650
## Proportion of Variance 0.01155 0.01106 0.0102 0.00862 0.00815 0.00782 0.00729
## Cumulative Proportion 0.80856 0.81961 0.8298 0.83844 0.84659 0.85441 0.86170
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.68195 0.66893 0.63600 0.61961 0.60665 0.60195 0.58285
## Proportion of Variance 0.00661 0.00636 0.00575 0.00545 0.00523 0.00515 0.00483
## Cumulative Proportion 0.86831 0.87466 0.88041 0.88587 0.89109 0.89624 0.90107
The R^2 of the PCR model is 0.62; note that this linear model is fit to the full PCA-scored data set (all 165 rows), not just the training split:
fp_pca <- data.frame(pca$x)
fp_pca['permeability'] <- c(permeability)
fp_pca.train <- fp_pca[train_index, ]
fp_pca.test <- fp_pca[-train_index, ]
fp_lm <- lm(permeability ~ ., data = fp_pca)  # fit on all rows, including the test rows
summary(fp_lm)$r.squared
## [1] 0.6213252
The R^2 on the test rows is higher (0.72); however, because these rows were included when fitting fp_lm, this is an optimistic estimate rather than a true out-of-sample result:
pcr_test_pred <- predict(fp_lm, fp_pca.test)
cor(pcr_test_pred, fp_pca.test$permeability)^2
## [1] 0.7242894
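Other linear models from this chapter, such as ridge regression or the elastic net, could also be tried. A sketch of an elastic net tuned by cross-validation with caret (the grid values below are illustrative, and method = "enet" requires the elasticnet package):
enet_fit <- train(
  x = fp_train[, setdiff(names(fp_train), "permeability")],
  y = fp_train$permeability,
  method = "enet",
  tuneGrid = expand.grid(lambda = c(0, 0.01, 0.1),
                         fraction = seq(0.05, 1, length = 20)),
  trControl = trainControl(method = "cv", number = 10)
)
cor(predict(enet_fit, fp_test), fp_test$permeability)^2  # test-set R^2 for comparison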
(f) Would you recommend any of your models to replace the permeability laboratory experiment?
I would choose the PCR model over the PLS model: it fits the data reasonably well, and its training and test R^2 values are closer together, suggesting better generalization. That said, the comparison should be confirmed by refitting the PCR model on the training rows only, given the note above about it being fit to the full data set.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
(a) Start R and use these commands to load the data:
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
chem_manuf <- ChemicalManufacturingProcess
(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
Impute the missing values with k-nearest neighbors; note that caret's knnImpute method also centers and scales the data:
knn <- preProcess(chem_manuf, method = "knnImpute")
chem_imputed <- predict(knn, chem_manuf)
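Before imputing, it is worth checking how much is actually missing (a small check, not part of the original output):
sum(is.na(chem_manuf))                                     # total missing cells
head(sort(colSums(is.na(chem_manuf)), decreasing = TRUE))  # worst-affected predictors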
(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Based on the PLS summary below, 11 components are enough to explain just over 90% of the variance in Yield:
set.seed(2022)
index <- sample(nrow(chem_imputed), size = round(nrow(chem_imputed) * 0.70))
chem_train <- chem_imputed[index, ]
chem_test <- chem_imputed[-index, ]
chem_pls <- plsr(Yield ~ ., data = chem_test, validation = "CV")  # NB: fit to chem_test (53 rows), as the summary below shows
summary(chem_pls)
## Data: X dimension: 53 57
## Y dimension: 53 1
## Fit method: kernelpls
## Number of components considered: 46
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 0.8652 0.8420 0.8053 0.7709 0.7468 0.7105 0.7092
## adjCV 0.8652 0.8414 0.7961 0.7593 0.7299 0.6923 0.6899
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 0.7112 0.6870 0.6889 0.6595 0.6270 0.6044 0.5740
## adjCV 0.6857 0.6602 0.6627 0.6349 0.6049 0.5840 0.5555
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
## CV 0.5859 0.5988 0.6072 0.6119 0.6228 0.6496 0.6679
## adjCV 0.5663 0.5778 0.5852 0.5897 0.6001 0.6250 0.6417
## 21 comps 22 comps 23 comps 24 comps 25 comps 26 comps 27 comps
## CV 0.6857 0.7131 0.7199 0.7544 0.8026 0.8365 0.8663
## adjCV 0.6581 0.6837 0.6902 0.7230 0.7670 0.7994 0.8277
## 28 comps 29 comps 30 comps 31 comps 32 comps 33 comps 34 comps
## CV 0.9193 0.9533 0.9939 1.134 1.199 1.291 1.42
## adjCV 0.8775 0.9098 0.9482 1.080 1.142 1.229 1.35
## 35 comps 36 comps 37 comps 38 comps 39 comps 40 comps 41 comps
## CV 1.554 1.625 1.728 1.832 1.942 2.034 2.107
## adjCV 1.478 1.545 1.642 1.741 1.846 1.933 2.003
## 42 comps 43 comps 44 comps 45 comps 46 comps
## CV 2.162 2.216 2.280 2.312 2.334
## adjCV 2.054 2.105 2.165 2.196 2.216
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.84 47.91 60.50 64.52 68.27 71.35 73.17 75.58
## Yield 20.09 46.69 57.94 71.20 77.11 80.16 84.68 87.33
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 79.09 81.61 84.36 86.72 88.53 90.13 91.24
## Yield 88.83 89.94 90.63 91.36 92.09 92.70 93.16
## 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps 22 comps
## X 92.18 93.42 94.50 95.53 96.56 97.22 97.75
## Yield 93.59 93.85 94.06 94.36 94.65 94.90 95.08
## 23 comps 24 comps 25 comps 26 comps 27 comps 28 comps 29 comps
## X 98.11 98.55 98.65 98.92 99.21 99.39 99.54
## Yield 95.21 95.33 95.77 95.84 95.88 95.94 96.01
## 30 comps 31 comps 32 comps 33 comps 34 comps 35 comps 36 comps
## X 99.60 99.69 99.78 99.83 99.85 99.88 99.89
## Yield 96.15 96.25 96.34 96.46 96.74 96.91 97.29
## 37 comps 38 comps 39 comps 40 comps 41 comps 42 comps 43 comps
## X 99.90 99.93 99.94 99.96 99.96 99.97 99.99
## Yield 97.69 97.90 98.04 98.10 98.23 98.55 98.69
## 44 comps 45 comps 46 comps
## X 99.99 99.99 100.00
## Yield 98.99 99.35 99.56
The RMSE on chem_train is about 1.82; because the model above was fit to chem_test, this is effectively an out-of-sample error:
RMSE(predict(chem_pls, chem_train, ncomp = 11), chem_train$Yield)
## [1] 1.820533
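For reference, a conventional workflow would tune the PLS model on chem_train with resampling; a sketch using caret (its results will not match the plsr output above, which was fit to chem_test):
set.seed(2022)
pls_tuned <- train(Yield ~ ., data = chem_train,
                   method = "pls", tuneLength = 20,
                   trControl = trainControl(method = "cv", number = 10))
pls_tuned$bestTune            # number of components chosen by CV RMSE
min(pls_tuned$results$RMSE)   # the corresponding resampled RMSE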
(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
The RMSE on chem_test is far smaller (about 0.26), but since the model was fit to these very rows, this is an in-sample error rather than an honest test estimate:
RMSE(predict(chem_pls, chem_test, ncomp = 11), chem_test$Yield)
## [1] 0.2598734
(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
Manufacturing process predictors dominate the list: eight of the top ten are process measurements, versus two biological materials.
varImp(chem_pls) %>%
  arrange(-Overall) %>%
  head(10)
## Overall
## ManufacturingProcess31 0.8415415
## ManufacturingProcess28 0.6560867
## ManufacturingProcess27 0.5852799
## ManufacturingProcess29 0.5558900
## ManufacturingProcess20 0.5169115
## BiologicalMaterial06 0.4193666
## ManufacturingProcess16 0.3836831
## ManufacturingProcess26 0.3504278
## ManufacturingProcess02 0.3007938
## BiologicalMaterial03 0.2744944
(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
Since we can tell which predictors matter most, future runs can focus on those variables. Here the manufacturing process predictors have the largest impact on yield, and because process settings (unlike the biological starting material) can be changed, adjusting them is the most practical avenue for increasing yield. BiologicalMaterial06 and BiologicalMaterial03 also appear relatively important, so they could be used to screen raw material quality before processing. A sketch of that exploration, looking at top predictors from the varImp table against yield via simple correlations and caret's featurePlot (variable names taken from the output above):
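top_vars <- c("ManufacturingProcess31", "ManufacturingProcess28",
              "ManufacturingProcess27", "BiologicalMaterial06")
cor(chem_imputed[, top_vars], chem_imputed$Yield)  # sign and strength of each relationship
featurePlot(x = chem_imputed[, top_vars], y = chem_imputed$Yield, plot = "scatter")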