Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(permeability)
summary(permeability)
## permeability
## Min. : 0.06
## 1st Qu.: 1.55
## Median : 4.91
## Mean :12.24
## 3rd Qu.:15.47
## Max. :55.60
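The summary already suggests that permeability is right-skewed (the mean of 12.24 sits well above the median of 4.91). As a quick sketch, a histogram confirms this:
hist(permeability, breaks = 30,
     main = "Distribution of permeability", xlab = "Permeability")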
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
dim(fingerprints)
## [1] 165 1107
library(caret)  # provides nearZeroVar, createDataPartition, train, and postResample
fingerprints2 <- fingerprints[, -nearZeroVar(fingerprints)]
dim(fingerprints2)
## [1] 165 388
Filtering out those predictors reduces the matrix from 165 x 1,107 to 165 x 388, leaving 388 predictors for modeling.
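To see which columns the filter removes and why, nearZeroVar can also return its diagnostics (a quick sketch on the same fingerprints matrix; saveMetrics is a standard argument of the function):
nzvStats <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzvStats)     # freqRatio, percentUnique, zeroVar and nzv flag for each predictor
sum(nzvStats$nzv)  # number of flagged predictors; should be 1107 - 388 = 719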
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
set.seed(300)
trainingRows <- createDataPartition(permeability, p = 0.8, list = FALSE)
trainPredictors <- fingerprints2[trainingRows, ]
trainClasses <- permeability[trainingRows]
testPredictors <- fingerprints2[-trainingRows, ]
testClasses <- permeability[-trainingRows]
set.seed(300)
plsTune <- train(trainPredictors, trainClasses,
method = "pls",
tuneLength = 10,
trControl = trainControl(method = "repeatedcv"),
preProc = c("center", "scale"))
plsTune$results
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 13.02635 0.3221317 9.912382 3.028684 0.2478879 2.608384
## 2 2 11.85397 0.4194745 8.273539 2.372950 0.2925410 1.634590
## 3 3 11.54114 0.4707489 8.636745 1.847878 0.2458678 1.625189
## 4 4 11.43967 0.4886680 8.615876 1.492251 0.2296421 1.522032
## 5 5 11.27615 0.5052312 8.171202 1.873386 0.2319873 1.440496
## 6 6 11.30242 0.4940627 8.276156 1.881103 0.2303962 1.345800
## 7 7 11.32204 0.4815905 8.579845 1.793302 0.2138101 1.407066
## 8 8 11.38629 0.4749483 8.776290 1.843288 0.2122199 1.409170
## 9 9 11.49191 0.4736676 8.697322 1.843241 0.1973815 1.485186
## 10 10 11.73679 0.4631460 8.759930 1.656584 0.1934454 1.523329
Both the lowest RMSE and the highest R^2 occur at ncomp = 5, so five latent variables are optimal, with a resampled R^2 of roughly 0.51.
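As a cross-check (a sketch using the plsTune object above; no new output shown), the selected value and the full resampling profile can be read directly from the train object:
plsTune$bestTune   # tuning value chosen by caret (ncomp = 5)
plot(plsTune)      # resampled RMSE profile across the candidate component counts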
Predict the response for the test set. What is the test set estimate of R2?
# predict.train automatically uses the tuned model (ncomp = 5); the explicit
# ncomp argument here is redundant but harmless.
output <- predict(plsTune, testPredictors, ncomp = 5)
postResample(pred = output, obs = testClasses)
## RMSE Rsquared MAE
## 13.6757568 0.3000775 9.8477466
The test set estimate of R^2 is about 0.30, noticeably lower than the resampled estimate of 0.51.
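As an optional diagnostic (a sketch, not part of the original output), plotting observed against predicted values for the test set shows where the PLS model struggles:
plot(testClasses, output,
     xlab = "Observed permeability", ylab = "Predicted permeability")
abline(0, 1, lty = 2)   # points on this line would be perfect predictions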
Try building other models discussed in this chapter. Do any have better predictive performance?
I will try several models discussed in the reading and use RMSE to identify the best-performing one.
df_train <- as.data.frame(cbind(trainPredictors,trainClasses))
df_test <- as.data.frame(cbind(testPredictors,testClasses))
set.seed(300)
lmFitAllPredictors <- lm(trainClasses ~ ., df_train)
predict.lm <- (predict(lmFitAllPredictors, df_test))
## Warning in predict.lm(lmFitAllPredictors, df_test): prediction from a rank-
## deficient fit may be misleading
postResample(pred = predict.lm, obs = df_test$testClasses)
## RMSE Rsquared MAE
## 35.74087092 0.02768807 20.39171249
The ordinary linear model performs much worse than the PLS model: its test RMSE is far higher (35.7 vs. 13.7) and its R^2 is near zero. With 388 predictors and only about 133 training samples, least squares produces a rank-deficient fit, which is exactly what the warning above indicates.
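A quick sketch confirms why: the training set has far fewer rows than columns, so the fit cannot be full rank.
dim(df_train)            # roughly 133 rows vs. 389 columns (388 predictors plus the response)
lmFitAllPredictors$rank  # rank of the fit, well below the number of coefficients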
set.seed(300)
suppressWarnings(
rlmpred <- train(trainPredictors, trainClasses,
method = "rlm",
trControl = trainControl(method = "repeatedcv"),
preProc = "pca")
)
rlmpred
## Robust Linear Model
##
## 133 samples
## 388 predictors
##
## Pre-processing: principal component signal extraction (388),
## centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 119, 119, 120, 120, 120, 121, ...
## Resampling results across tuning parameters:
##
## intercept psi RMSE Rsquared MAE
## FALSE psi.huber 16.72655 0.5061180 13.169903
## FALSE psi.hampel 16.67739 0.5067777 13.418069
## FALSE psi.bisquare 16.98977 0.4451603 13.293771
## TRUE psi.huber 11.32632 0.5126722 7.783743
## TRUE psi.hampel 12.21488 0.4391843 8.190154
## TRUE psi.bisquare 13.04293 0.4176581 7.883718
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi
## = psi.huber.
predict.rlm <- (predict(rlmpred, testPredictors))
postResample(pred = predict.rlm, obs = df_test$testClasses)
## RMSE Rsquared MAE
## 12.5415135 0.3427535 9.0255228
The robust linear model (fit on principal components) performs much better than the ordinary linear model and slightly better than the PLS model: its test RMSE of 12.54 is slightly lower and its R^2 of 0.34 is slightly higher.
enetpred <- train(trainPredictors, trainClasses,
method='enet',
metric='RMSE',
trControl = trainControl(method = "repeatedcv"),
preProcess=c('center','scale'))
predict.enet <- (predict(enetpred, testPredictors))
postResample(pred = predict.enet, obs = df_test$testClasses)
## RMSE Rsquared MAE
## 11.5742116 0.4267167 7.9885963
The elastic net model performs much better than the ordinary linear model and slightly better than both the robust linear model and the PLS model: its test RMSE of 11.57 is the lowest and its R^2 of 0.43 is the highest.
Overall, PLS, the robust linear model, and the elastic net all perform reasonably well, with the elastic net the best of the group; the table below collects the test-set metrics.
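To make the comparison concrete, the test-set metrics computed above can be gathered into one table (a sketch that reuses the prediction objects already created):
rbind(PLS  = postResample(output, testClasses),
      LM   = postResample(predict.lm, df_test$testClasses),
      RLM  = postResample(predict.rlm, df_test$testClasses),
      ENET = postResample(predict.enet, df_test$testClasses))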
Would you recommend any of your models to replace the permeability laboratory experiment?
The elastic net model performs well, so I would recommend using it to replace the permeability laboratory experiment, or at least using it together with the experiment (for example, as an inexpensive first-pass screen) to make the process more efficient and cost effective.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
df1<-ChemicalManufacturingProcess
dim(df1)
## [1] 176 58
head(df1)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00 6.25 49.58 56.97
## 2 42.44 8.01 60.97 67.48
## 3 42.03 8.01 60.97 67.48
## 4 41.42 8.01 60.97 67.48
## 5 42.49 7.47 63.33 72.25
## 6 43.57 6.12 58.36 65.31
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 12.74 19.51 43.73
## 2 14.65 19.36 53.14
## 3 14.65 19.36 53.14
## 4 14.65 19.36 53.14
## 5 14.02 17.91 54.66
## 6 15.17 21.79 51.23
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1 100 16.66 11.44
## 2 100 19.04 12.55
## 3 100 19.04 12.55
## 4 100 19.04 12.55
## 5 100 18.22 12.80
## 6 100 18.30 12.13
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1 3.46 138.09 18.83
## 2 3.46 153.67 21.05
## 3 3.46 153.67 21.05
## 4 3.46 153.67 21.05
## 5 3.05 147.61 21.05
## 6 3.78 151.88 20.76
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1 NA NA NA
## 2 0.0 0 NA
## 3 0.0 0 NA
## 4 0.0 0 NA
## 5 10.7 0 NA
## 6 12.0 0 NA
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1 NA NA NA
## 2 917 1032.2 210.0
## 3 912 1003.6 207.1
## 4 911 1014.6 213.3
## 5 918 1027.5 205.7
## 6 924 1016.8 208.9
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1 NA NA 43.00
## 2 177 178 46.57
## 3 178 178 45.07
## 4 177 177 44.92
## 5 178 178 44.96
## 6 178 178 45.32
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1 NA NA NA
## 2 NA NA 0
## 3 NA NA 0
## 4 NA NA 0
## 5 NA NA 0
## 6 NA NA 0
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1 35.5 4898 6108
## 2 34.0 4869 6095
## 3 34.8 4878 6087
## 4 34.8 4897 6102
## 5 34.6 4992 6233
## 6 34.0 4985 6222
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1 4682 35.5 4865
## 2 4617 34.0 4867
## 3 4617 34.8 4877
## 4 4635 34.8 4872
## 5 4733 33.9 4886
## 6 4786 33.4 4862
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1 6049 4665 0.0
## 2 6097 4621 0.0
## 3 6078 4621 0.0
## 4 6073 4611 0.0
## 5 6102 4659 -0.7
## 6 6115 4696 -0.6
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1 NA NA NA
## 2 3 0 3
## 3 4 1 4
## 4 5 2 5
## 5 8 4 18
## 6 9 1 1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1 4873 6074 4685
## 2 4869 6107 4630
## 3 4897 6116 4637
## 4 4892 6111 4630
## 5 4930 6151 4684
## 6 4871 6128 4687
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1 10.7 21.0 9.9
## 2 11.2 21.4 9.9
## 3 11.1 21.3 9.4
## 4 11.1 21.3 9.4
## 5 11.3 21.6 9.0
## 6 11.4 21.7 10.1
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1 69.1 156 66
## 2 68.7 169 66
## 3 69.3 173 66
## 4 69.3 171 68
## 5 69.4 171 70
## 6 68.2 173 70
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1 2.4 486 0.019
## 2 2.6 508 0.019
## 3 2.6 509 0.018
## 4 2.5 496 0.018
## 5 2.5 468 0.017
## 6 2.5 490 0.018
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1 0.5 3 7.2
## 2 2.0 2 7.2
## 3 0.7 2 7.2
## 4 1.2 2 7.2
## 5 0.2 2 7.3
## 6 0.4 2 7.2
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1 NA NA 11.6
## 2 0.1 0.15 11.1
## 3 0.0 0.00 12.0
## 4 0.0 0.00 10.6
## 5 0.0 0.00 11.0
## 6 0.0 0.00 11.5
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1 3.0 1.8 2.4
## 2 0.9 1.9 2.2
## 3 1.0 1.8 2.3
## 4 1.1 1.8 2.1
## 5 1.1 1.7 2.1
## 6 2.2 1.8 2.0
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
I will use the KNN method to impute the missing values; it should not be too computationally intensive since the data set is not very large.
# knnImpute fills each missing cell from its K nearest neighbors; caret also
# centers and scales the predictors as part of the distance calculation.
df1pp <- preProcess(df1[, 2:ncol(df1)], method = c('knnImpute'))
df1 <- cbind(df1$Yield, predict(df1pp, df1[, 2:ncol(df1)]))
colnames(df1)[1] <- "Yield"
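A quick sanity check (sketch) confirms that no missing values remain after imputation:
sum(is.na(df1))   # expected to be 0 after KNN imputation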
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Splitting the data into training and test sets is the first step; I will use an 80%/20% split.
set.seed(333)
trainingRows <- createDataPartition(df1$Yield, p = 0.8, list = FALSE)
df1_train <- df1[trainingRows, ]
df1_test <- df1[-trainingRows, ]
I will tune an ENET model for this data:
df1_enet <- train(df1_train[,2:ncol(df1_train)], df1_train$Yield,
method='enet',
metric='RMSE',
trControl = trainControl(method = "repeatedcv"),
preProcess=c('center','scale'))
predict.df1_enet <- (predict(df1_enet, df1_train[,2:ncol(df1_train)]))
postResample(pred = predict.df1_enet, obs = df1_train$Yield)
## RMSE Rsquared MAE
## 1.2910514 0.6032864 1.0545831
The RMSE from re-predicting the training set is 1.291.
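Since the question asks for the optimal resampled value of the performance metric, that estimate can also be read directly from the train object rather than from re-predicting the training data (a sketch using the df1_enet object; no new output shown):
df1_enet$bestTune                                                # fraction and lambda chosen by cross-validation
df1_enet$results[, c("lambda", "fraction", "RMSE", "Rsquared")]  # resampled metrics for each candidate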
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
Now let’s predict the test set responses.
predict.df1_enet <- (predict(df1_enet, df1_test[,2:ncol(df1_test)]))
postResample(pred = predict.df1_enet, obs = df1_test$Yield)
## RMSE Rsquared MAE
## 1.3641651 0.6637226 1.0725533
The test set RMSE (1.364) is slightly higher than the training set RMSE (1.291), while the test R^2 (0.66) is actually a bit higher than the training value (0.60); overall the model generalizes well.
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
Here are the most important predictors for my model:
library(elasticnet)
# Extract coefficients at a fraction of s = 0.1; this is a chosen value for
# obtaining a sparse coefficient set and need not match the tuned fraction
# (see df1_enet$bestTune).
enetCoef <- predict(df1_enet$finalModel, newx = as.matrix(df1_test[, 2:ncol(df1_test)]),
                    s = .1, mode = "fraction",
                    type = "coefficients")
head(sort(enetCoef$coefficients),5)
## ManufacturingProcess13 ManufacturingProcess37 ManufacturingProcess17
## -0.25153249 -0.11488070 -0.06413573
## ManufacturingProcess36 ManufacturingProcess24
## -0.05316984 -0.02116651
tail(sort(enetCoef$coefficients),5)
## ManufacturingProcess06 ManufacturingProcess15 ManufacturingProcess34
## 0.1034313 0.1581922 0.1798552
## ManufacturingProcess09 ManufacturingProcess32
## 0.5102941 0.7201123
All of the most important coefficients belong to process predictors; there is not a single biological predictor on the list. Processes 32, 09, 34, 15, and 13 have the strongest coefficients.
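caret's varImp gives a complementary ranking of predictor importance for the tuned model (a sketch; depending on caret's support for enet it may fall back to a model-free importance measure, so the ordering can differ slightly from the coefficient view above):
plot(varImp(df1_enet), top = 10)   # ten most important predictors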
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
lm <- lm(Yield ~ ManufacturingProcess32 + ManufacturingProcess09 +
           ManufacturingProcess34 + ManufacturingProcess13 +
           ManufacturingProcess15, data = df1_train)
summary(lm)
##
## Call:
## lm(formula = Yield ~ ManufacturingProcess32 + ManufacturingProcess09 +
## ManufacturingProcess34 + ManufacturingProcess13 + ManufacturingProcess15,
## data = df1_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6296 -0.7670 0.0493 0.7040 3.2429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.14021 0.09213 435.676 < 2e-16 ***
## ManufacturingProcess32 0.87274 0.10091 8.649 1.19e-14 ***
## ManufacturingProcess09 0.45736 0.15924 2.872 0.004721 **
## ManufacturingProcess34 0.30862 0.09032 3.417 0.000832 ***
## ManufacturingProcess13 -0.50055 0.17206 -2.909 0.004225 **
## ManufacturingProcess15 0.34418 0.10932 3.148 0.002012 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.105 on 138 degrees of freedom
## Multiple R-squared: 0.6521, Adjusted R-squared: 0.6395
## F-statistic: 51.73 on 5 and 138 DF, p-value: < 2.2e-16
Manufacturing processes 32, 09, 34, 13, and 15 form a good linear model that explains about 65% of the variance in yield. This information should help improve yield in future runs: processes 32, 09, 34, and 15 have positive coefficients (higher settings are associated with higher yield) while process 13 has a negative coefficient, so adjusting these controllable process variables could make a meaningful difference to the yield. The sketch below explores these relationships directly.
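A minimal sketch of that exploration, using the processed training data: the correlation of each top process variable with Yield shows the direction of the relationship, and caret's featurePlot plots each one against the response.
topPreds <- c("ManufacturingProcess32", "ManufacturingProcess09",
              "ManufacturingProcess34", "ManufacturingProcess13",
              "ManufacturingProcess15")
cor(df1_train[, topPreds], df1_train$Yield)          # sign and strength of each association
featurePlot(df1_train[, topPreds], df1_train$Yield,  # scatter plot of each predictor vs. yield
            plot = "scatter")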