Data 624 Homework 7

6.2.

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug.

(a)

Start R and use these commands to load the data.The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

##  num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...

(b)

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

## [1]  165 1107

## [1] 165   1

## [1] 165 388

After running nearZeroVar function there are 388 predictors left.

(c)

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

## Data:    X dimension: 115 388 
##  Y dimension: 115 1
## Fit method: kernelpls
## Number of components considered: 114
## TRAINING: % variance explained
##               1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X               28.66    43.95    50.20    54.22    63.43    66.99    69.60
## permeability    29.08    48.84    57.08    64.71    67.70    72.88    75.88
##               8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## X               71.93    74.37     76.68     79.06     80.68     82.37
## permeability    78.26    80.64     82.24     83.36     84.99     86.35
##               14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## X                83.84     85.98     87.28     88.11     89.59     90.16
## permeability     87.53     88.18     89.04     89.91     90.28     91.01
##               20 comps  21 comps  22 comps  23 comps  24 comps  25 comps
## X                90.90     91.55     92.41     92.90     93.41     94.03
## permeability     91.42     91.82     92.06     92.41     92.70     92.96
##               26 comps  27 comps  28 comps  29 comps  30 comps  31 comps
## X                94.43     94.80     95.02     95.30     95.52     95.78
## permeability     93.30     93.54     93.76     93.88     93.99     94.07
##               32 comps  33 comps  34 comps  35 comps  36 comps  37 comps
## X                96.03     96.27     96.43     96.61     96.89     97.07
## permeability     94.16     94.25     94.34     94.41     94.45     94.50
##               38 comps  39 comps  40 comps  41 comps  42 comps  43 comps
## X                97.26     97.42     97.60     97.74     97.85     98.01
## permeability     94.54     94.61     94.66     94.69     94.72     94.73
##               44 comps  45 comps  46 comps  47 comps  48 comps  49 comps
## X                98.20     98.33     98.44     98.52     98.61     98.68
## permeability     94.74     94.76     94.78     94.79     94.80     94.82
##               50 comps  51 comps  52 comps  53 comps  54 comps  55 comps
## X                98.78     98.85     98.93     99.05     99.10     99.16
## permeability     94.82     94.84     94.85     94.86     94.87     94.88
##               56 comps  57 comps  58 comps  59 comps  60 comps  61 comps
## X                99.23     99.29     99.35     99.41     99.44     99.47
## permeability     94.89     94.89     94.90     94.90     94.91     94.91
##               62 comps  63 comps  64 comps  65 comps  66 comps  67 comps
## X                99.52     99.56     99.61     99.64     99.67     99.70
## permeability     94.91     94.92     94.92     94.92     94.92     94.92
##               68 comps  69 comps  70 comps  71 comps  72 comps  73 comps
## X                99.73     99.75     99.78     99.80     99.83     99.84
## permeability     94.92     94.92     94.92     94.92     94.92     94.92
##               74 comps  75 comps  76 comps  77 comps  78 comps  79 comps
## X                99.86     99.88     99.89     99.90     99.92     99.93
## permeability     94.92     94.92     94.92     94.92     94.92     94.92
##               80 comps  81 comps  82 comps  83 comps  84 comps  85 comps
## X                99.95     99.96     99.97     99.97     99.98     99.99
## permeability     94.92     94.92     94.92     94.92     94.92     94.92
##               86 comps  87 comps  88 comps  89 comps  90 comps  91 comps
## X                99.99    100.00    100.00    100.29    100.59    100.88
## permeability     94.92     94.92     94.92     94.92     94.92     94.92
##               92 comps  93 comps  94 comps  95 comps  96 comps  97 comps
## X               101.18    101.47    101.76    102.06    102.35    102.65
## permeability     94.92     94.92     94.92     94.92     94.92     94.92
##               98 comps  99 comps  100 comps  101 comps  102 comps  103 comps
## X               102.94    103.23     103.53     103.82     104.12     104.41
## permeability     94.92     94.92      94.92      94.92      94.92      94.92
##               104 comps  105 comps  106 comps  107 comps  108 comps  109 comps
## X                104.71     105.00     105.29     105.59     105.88     106.18
## permeability      94.92      94.92      94.92      94.92      94.92      94.92
##               110 comps  111 comps  112 comps  113 comps  114 comps
## X                106.47     106.77     107.06     107.36     107.65
## permeability      94.92      94.92      94.92      94.92      94.92

(d)

Predict the response for the test set. What is the test set estimate of R2?

## [1] 25.12240 25.48017 26.87275 25.47519 26.53963 26.86371

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -1.178e+09  1.000e+00  1.000e+01  1.910e+06  3.100e+01  2.481e+09

(e)

Try building other models discussed in this chapter. Do any have better predictive performance?

##           Length Class  Mode     
## call           4 -none- call     
## actions      338 -none- list     
## allset       389 -none- numeric  
## beta.pure 131482 -none- numeric  
## vn           389 -none- character
## mu             1 -none- numeric  
## normx        389 -none- numeric  
## meanx        389 -none- numeric  
## lambda         1 -none- numeric  
## L1norm       338 -none- numeric  
## penalty      338 -none- numeric  
## df           338 -none- numeric  
## Cp           338 -none- numeric  
## sigma2         1 -none- numeric

##          Length Class  Mode     
## s         1     -none- numeric  
## fraction  1     -none- numeric  
## mode      1     -none- character
## fit      50     -none- numeric

##      116      117      118      119      120      121 
## 52.11614 51.44953 51.18513 29.97141 22.01058 28.79457

## Ridge Regression 
## 
## 115 samples
## 389 predictors
## 
## Pre-processing: centered (389), scaled (389) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 103, 104, 103, 103, 104, 104, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE          Rsquared   MAE         
##   0.000000000  4.663100e-14  1.0000000  3.544018e-14
##   0.007142857  2.313521e+03  0.6283895  1.213060e+03
##   0.014285714  1.760807e+01  0.8038951  1.400285e+01
##   0.021428571  7.068826e+00  0.7907931  4.969317e+00
##   0.028571429  9.999212e+00  0.8701261  7.474803e+00
##   0.035714286  3.516263e+00  0.9547176  2.586893e+00
##   0.042857143  3.658132e+00  0.9509928  2.706185e+00
##   0.050000000  3.772068e+00  0.9465127  2.789631e+00
##   0.057142857  3.967192e+00  0.9404803  2.940685e+00
##   0.064285714  4.119357e+00  0.9350103  3.052796e+00
##   0.071428571  4.268597e+00  0.9300001  3.172374e+00
##   0.078571429  4.438219e+00  0.9239049  3.306660e+00
##   0.085714286  4.568636e+00  0.9189027  3.413702e+00
##   0.092857143  4.685828e+00  0.9147105  3.512301e+00
##   0.100000000  4.807197e+00  0.9100387  3.611950e+00
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.

##             Length Class      Mode     
## call            4  -none-     call     
## actions        88  -none-     list     
## allset        124  -none-     numeric  
## beta.pure   10912  -none-     numeric  
## vn            389  -none-     character
## mu              1  -none-     numeric  
## normx         124  -none-     numeric  
## meanx         124  -none-     numeric  
## lambda          1  -none-     numeric  
## L1norm         88  -none-     numeric  
## penalty        88  -none-     numeric  
## df             88  -none-     numeric  
## Cp             88  -none-     numeric  
## sigma2          1  -none-     numeric  
## xNames        389  -none-     character
## problemType     1  -none-     character
## tuneValue       1  data.frame list     
## obsLevels       1  -none-     logical  
## param           0  -none-     list

##      68     129     162      43      14      51      85      21     106      74 
## 28.1000  8.5850  0.5250  2.4600  5.5600  1.7650 18.9150  3.8000  1.7050  5.3550 
##       7      73      79      37     105     110      34     157     126      89 
## 25.3650 47.0050  1.8350 40.8550  1.1950  4.4950 24.0150  1.1950  5.7650  2.0200 
##      33      84      70     156      42     111      20      44     121      87 
##  3.9200 18.5550  1.6950  4.1350 45.6800 42.0600  5.5100  4.0700 30.5700  0.1050 
##     143     137      40      25     119     122      39     160     141       6 
## 26.3000  0.8200  5.5650  0.8050 31.7075 48.5100  2.3700  0.7450 50.5100  0.5100 
##      24      32     161       2      45      18      22      78     102      65 
##  6.1200  8.2700  0.7050  1.1200  1.2250  1.6800 47.6650 13.2950 17.6100  2.7400 
##     115     104     135     136     108     113     114     103      75      81 
## 31.6700  1.7750  0.7750  0.2350  2.7300 42.0600 35.4300  4.3800  8.3100 40.8550 
##     100      13     133     146      48     117      23     144      29     109 
##  8.6200  0.0600  1.7400  0.5600  1.2100 51.4150 32.3750 37.8450  0.6400  4.0450 
##     131      93      28     101     145     134     158      31      17     154 
##  5.5250  8.9000  4.1950  0.4800  0.5600  0.5400  1.0850 13.6750  0.7250  0.3200 
##      83      92      64      60     128     149      10       1     163      59 
##  9.5100 49.7650 20.3400  1.7100  0.3750  1.7000  4.9100 12.5200  1.5450  8.3850 
##      26      15      58      97     125     127      98     164      71      53 
## 14.1500  3.8500  2.4900 11.3125 47.1050  0.8650  1.0800 39.5550  6.4275  8.9800 
##     140      91      19     107      35      77     118      72     123      95 
##  0.7550  2.5500  5.2750  0.9775 13.5325  1.1250 50.2100  1.9150 47.1050 12.4900 
##     147      94      52      41     152 
##  0.4550  5.7500  5.9850  8.1800  0.5300

(f)

Would you recommend any of your models to replace the permeability laboratory experiment?

6.3.

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

(a)

Start R and use these commands to load the data. The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

## [1] 176  58

(b)

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

## [1] FALSE

(c)

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

## Data:    X dimension: 152 57 
##  Y dimension: 152 1
## Fit method: kernelpls
## Number of components considered: 57
## TRAINING: % variance explained
##        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X        86.49    92.29    99.73    99.82    99.91    99.97    99.98    99.99
## Yield    15.34    23.87    30.52    41.94    45.03    46.59    56.13    59.58
##        9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X        99.99     99.99     99.99     99.99     100.0       100    100.00
## Yield    62.91     64.99     67.19     67.95      68.8        71     73.71
##        16 comps  17 comps  18 comps  19 comps  20 comps  21 comps  22 comps
## X        100.00    100.00    100.00    100.00    100.00    100.00    100.00
## Yield     75.04     75.38     75.59     75.93     76.29     76.68     76.88
##        23 comps  24 comps  25 comps  26 comps  27 comps  28 comps  29 comps
## X        100.00    100.00    100.00    100.00    100.00    100.00    100.00
## Yield     77.03     77.43     77.77     77.98     78.08     78.27     78.45
##        30 comps  31 comps  32 comps  33 comps  34 comps  35 comps  36 comps
## X        100.00    100.00    100.00    100.00    100.00    100.00    100.00
## Yield     78.57     78.69     78.81     78.85     78.88     78.94     78.97
##        37 comps  38 comps  39 comps  40 comps  41 comps  42 comps  43 comps
## X        100.00    100.00    100.00    100.00    100.00    100.00    100.00
## Yield     79.06     79.08     79.11     79.15     79.16     79.18     79.19
##        44 comps  45 comps  46 comps  47 comps  48 comps  49 comps  50 comps
## X        100.00    100.00     100.0    100.00    100.00    100.00    100.00
## Yield     79.22     79.27      79.3     79.35     79.36     79.38     79.39
##        51 comps  52 comps  53 comps  54 comps  55 comps  56 comps  57 comps
## X        100.00    100.00    100.00    100.00    100.00    100.00    100.00
## Yield     79.39     79.39     79.39     79.39     79.39     79.46     79.46

(d)

Predict the response for the test set.What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

## Support Vector Machines with Linear Kernel 
## 
## 152 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 137, 137, 136, 138, 137, 136, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   2.277014  0.4930476  1.342484
## 
## Tuning parameter 'C' was held constant at a value of 1

##      124      125      126      127      128      129      130      131 
## 42.17826 41.63769 41.04629 41.14536 39.47639 41.76125 40.98895 41.08199 
##      132      133      135      136      137      138      140      141 
## 41.00749 41.15955 38.04850 39.12700 37.93575 38.51842 38.85692 39.63958 
##      142      143      144      145      146      147      148      149 
## 40.61385 40.63043 40.60831 38.78862 37.94179 40.05694 39.00819 38.48481 
##      150      151      152      153      154      155      156      157 
## 38.95087 38.39020 39.87769 39.48349 39.88806 38.83271 37.18061 38.54268 
##      158      159      160      161      162      163      164      165 
## 38.76366 38.21768 38.16305 38.32175 40.04264 39.55040 40.57412 38.60783 
##      166      167      168      169      170      171 
## 37.43753 37.64187 37.22554 38.71159 39.43196 40.83218

(e)

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

(f)

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

APPENDIX

Code used in analysis

knitr::opts_chunk$set(
    echo = FALSE,
    message = FALSE,
    warning = FALSE
)
#knitr::opts_chunk$set(echo = TRUE)
require(knitr)
library(ggplot2)
library(tidyr)
library(MASS)
library(psych)
library(kableExtra)
library(dplyr)
library(faraway)
library(gridExtra)
library(reshape2)
library(leaps)
library(pROC)
library(caret)
library(naniar)
library(pander)
library(pROC)
library(mlbench)
library(e1071)
library(fpp2)
library(mlr)
library(AppliedPredictiveModeling)
data("permeability")
str( fingerprints)
#describe(fingerprints)
dim(fingerprints)
dim(permeability)
nzfinger<-nearZeroVar(fingerprints)
fingerprints <- fingerprints[,-nzfinger]
dim(fingerprints)

require(pls)
library(AppliedPredictiveModeling)

ctrl<-trainControl(method="cv", number=10)
finger.df <- as.data.frame(fingerprints)
finger.df$permeability <- as.vector(permeability)
#trainp<- createDataPartition(finger.df, p=0.70)

set.seed(1)
trainp <- sample(1:nrow(finger.df), 0.7*nrow(finger.df))
trainf <- finger.df[trainp,]
testf <- finger.df[(nrow(trainf)+1):nrow(finger.df),]

plsfit <- plsr(permeability ~ ., data = trainf)
summary(plsfit)
plot(plsfit)

plsTune<-caret::train(trainf ,trainf$permeability,
                    method="pls",
                    tuneLength=20,
                    trControl=ctrl,
                    preProc = c("center", "scale"))
plot(plsTune)

pt<-predict(plsfit, testf)
head(pt)
summary(pt)

require(MASS)
require(elasticnet)
require(caret)
require(lars)
enetfit <- enet(as.matrix(trainf), trainf$permeability, lambda = .01)
summary(enetfit)
plot(enetfit)
enetpred<-predict(enetfit, newx = testf, s=1, mode="fraction", type="fit")
summary(enetpred)
head(enetpred$fit)

enetGrid<-expand.grid(.lambda = c(0,.01, 1), .fraction = seq(.05, 1, length=20))
enetTune<-caret::train(trainf ,trainf$permeability,
                    method="enet",
                    tuneGrid=enetGrid,
                    trControl=ctrl,
                    preProc = c("center", "scale"))
plot(enetTune)

ridgeGrid<-data.frame(.lambda = seq(0, .1, length=15))
set.seed(100)
ridgeRegFit<- caret::train(trainf ,trainf$permeability,
                    method="ridge",
                    tuneGrid=ridgeGrid,
                    trControl=ctrl,
                    preProc = c("center", "scale"))
ridgeRegFit
summary(ridgeRegFit)
plot(ridgeRegFit)
ridgePred<-predict(ridgeRegFit, nex=testf)
ridgePred

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
cm<-ChemicalManufacturingProcess
dim(cm)


cm<-cm[complete.cases(cm),]
anyNA(cm)

set.seed(1)
trainp <- sample(1:nrow(cm), 0.7*nrow(cm))
trainf <- cm[trainp,]
testf <- cm[(nrow(trainf)+1):nrow(cm),]

plsfit <- plsr(Yield ~ ., data = cm)
summary(plsfit)
plot(plsfit)

library(caret)
trctrl<- trainControl(method="repeatedcv", number=10,repeats=3)
svm_lin<- caret::train(Yield~., data=cm, method="svmLinear", 
                trControl=trctrl, preProcess=c("center","scale"),
                tuneLength =10)
svm_lin
test_pred<- predict(svm_lin, newdata=testf)
test_pred

plot(test_pred)

Data 624 Homework 7

Anthony Pagan

4/5/2020

6.2.

(a)

(b)

(c)

(d)

(e)

(f)

6.3.

(a)

(b)

(c)

(d)

(e)

(f)

APPENDIX