In this exercise, the author uses the permeability dataset from the AppliedPredictiveModeling package to develop a model for predicting compounds' permeability. A compound's ability to permeate relevant biological membranes is critically important to understand early in the drug discovery process. Compounds that appear effective against a particular disease in research screening experiments but are poorly permeable may need to be altered in order to improve permeability, and thus the compound's ability to reach the desired target.
In this project there were 165 unique compounds, with 1,107 molecular fingerprints determined for each. A molecular fingerprint is a binary sequence of numbers that represents the presence or absence of a specific molecular substructure. The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response. The response is highly skewed, the predictors are sparse (15.5 percent are present), and many predictors are strongly associated.
https://www.rdocumentation.org/packages/AppliedPredictiveModeling/versions/1.1-7/topics/permeability
__(a) Start R and use these commands to load the data:__
library(AppliedPredictiveModeling)
library(caret)      # nearZeroVar, createMultiFolds, train
library(pls)        # plsr, R2, RMSEP
library(elasticnet) # enet (lasso)
library(Metrics)    # assumed source of the rmse() helper used below

data(permeability)
df1 <- data.frame(fingerprints)
target <- data.frame(permeability)
dim(df1)
## [1] 165 1107
#head(df1)
From the histogram of the permeability scores, we observe a heavy skew to the right, which we seek to mitigate by taking the cube root of the values.
hist(permeability, breaks=10)
hist(permeability^.33, breaks=10)
#Checking for null values
any(is.na(df1))
## [1] FALSE
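As a quick sanity check (a one-line sketch not in the original), the 15.5 percent sparsity figure quoted above can be verified directly as the proportion of 1s among all cells of the fingerprint matrix:

# Proportion of cells equal to 1 across the whole 165 x 1107 matrix (~0.155)
mean(as.matrix(df1))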
__(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?__
Step 1: Removing Duplicate Columns. To cut down on the sheer number of predictors, we remove columns that are complete duplicates of one another. The unique() function removes duplicate rows, so we first transpose the dataframe. Unlike dplyr's distinct(), unique() lets us keep the substructure IDs in the process. This alone reduces the number of columns from 1,107 to 327.
df1.transposed <- data.frame(t(df1))
df1.distinct <- data.frame(t(unique(df1.transposed)))
head(df1.distinct)
Step 2: Removing Sparse Predictors. Next, we remove sparse substructures with nearZeroVar() from the caret package, which diagnoses predictors for which the ratio of the frequency of the most common value to the frequency of the second most common value is large. If a particular level (0 or 1) of a column occurs fewer than 16 times (just under 10% of the 165 samples), we remove that column due to its relative sparsity of incidence; hence freqCut = 149/16, the frequency ratio of a column whose minority level appears exactly 16 times. This further reduces our dimensionality to 117 predictors.
df1.filter <- nearZeroVar(df1.distinct, freqCut = 149/16, saveMetrics=TRUE)
df1.select <- df1.distinct[, !df1.filter$nzv] # keep only columns not flagged as near-zero variance
dim(df1.select)
## [1] 165 117
head(df1.select)
Step 3: Removing Singularities. Finally, we remove variables whose coefficients are undefined in a linear regression due to singularity; we expect these to be predictors that are highly correlated with other predictors or with the target itself. This narrows the predictor space down to 90 dimensions. We note an adjusted R-squared of 0.7609 for the cube root of the target, but with only 18 explanatory variables attaining a p-value below .05 in the linear regression model. The cube root yields a better R-squared than either the original permeability value or its square root (a comparison sketched after the code below).
df1.all <- cbind(target,df1.select)
lmFit <- lm(permeability ~ ., data=df1.all) # fit on df1.all, which contains the response
index2 <- !is.na(lmFit$coefficients) # the (Intercept) slot lines up with the permeability column, so positional indexing works
df1.select2 <- df1.all[,index2]
dim(df1.select2)
## [1] 165 91
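The transformation comparison claimed above can be reproduced with three quick fits (a sketch not in the original write-up; it assumes df1.all as constructed above):

# Adjusted R-squared for three candidate target transformations
summary(lm(permeability     ~ ., data = df1.all))$adj.r.squared
summary(lm(permeability^.5  ~ ., data = df1.all))$adj.r.squared
summary(lm(permeability^.33 ~ ., data = df1.all))$adj.r.squared # per the text, the cube root scores best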
In our new feature space, the 90 surviving predictors (of the initial 1,107) are listed here, along with the target.
colnames(df1.select2)
## [1] "permeability" "X1" "X2" "X3" "X6"
## [6] "X11" "X12" "X15" "X16" "X25"
## [11] "X35" "X36" "X41" "X42" "X48"
## [16] "X86" "X87" "X88" "X93" "X96"
## [21] "X98" "X102" "X103" "X111" "X121"
## [26] "X125" "X141" "X143" "X150" "X153"
## [31] "X156" "X157" "X158" "X159" "X182"
## [36] "X221" "X226" "X229" "X230" "X231"
## [41] "X235" "X236" "X237" "X238" "X241"
## [46] "X242" "X247" "X248" "X251" "X257"
## [51] "X258" "X260" "X262" "X263" "X270"
## [56] "X271" "X272" "X276" "X278" "X280"
## [61] "X295" "X296" "X297" "X311" "X312"
## [66] "X313" "X314" "X315" "X319" "X320"
## [71] "X322" "X329" "X338" "X340" "X342"
## [76] "X357" "X358" "X359" "X361" "X366"
## [81] "X370" "X371" "X374" "X496" "X497"
## [86] "X503" "X507" "X509" "X510" "X571"
## [91] "X573"
__(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?__
We set a train-test split of 80/20.
# split the dataset into training and test set
#set.seed(100) # uncomment for a reproducible split
indexTrain <- sample(
  x=c(TRUE, FALSE),
  size=nrow(df1.select2),
  replace=TRUE,
  prob=c(0.8, 0.2)
)
dtTrain <- df1.select2[indexTrain,] #129 rows
dtTest <- df1.select2[!indexTrain,] #36 rows
When we tune the partial least squares (PLS) model, with the only pre-processing being the cube-root transformation of the target and the removal of singular predictors, we find that root mean squared error (RMSE) is minimized with five components. As all the explanatory variables are dummy variables, no additional scaling or transformation is needed for them. We apply 5-fold cross-validation to the training set, averaging across 10 repeats.
# Compile cross-validation settings
set.seed(100)
myfolds <- createMultiFolds(dtTrain$permeability, k = 5, times = 10)
control <- trainControl("repeatedcv", index = myfolds, selectionFunction = "best")
# Train PLS model
mod1 <- train(permeability^.33 ~ ., data = dtTrain,
method = "pls",
metric = "RMSE",
tuneLength = 20,
trControl = control)
# Check CV profile
plot(mod1)
That the PLS model is optimized at five components is further confirmed below, with an RMSE of 0.6378 using adjusted cross-validation. With five components, 74.05% of the variance in the cube root of the target can be accounted for by the model; 20 components are required for this figure to exceed 90%.
plsFit <- plsr(permeability^.33 ~ ., data = dtTrain, scale=T,validation="CV")
#summary(plsFit)
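The adjusted-CV error cited above can be read off with the pls package's RMSEP() accessor (a sketch; output omitted, as with the commented-out summary):

# Cross-validated RMSEP by number of components; the "adjCV" row holds the figures cited above
RMSEP(plsFit, estimate = "adjCV", ncomp = 1:10)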
By this measure, the cross-validated R-squared peaks at 0.48227 with six components, with five components a close second at 0.47334.
# Compile models and compare performance
options(scipen = 999)
R2(plsFit,ncomp=1:10)
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## -0.01498 0.31777 0.43101 0.46220 0.45503 0.47334
## 6 comps 7 comps 8 comps 9 comps 10 comps
## 0.48227 0.45360 0.43374 0.43987 0.44734
__(d) Predict the response for the test set. What is the test set estimate of R2?__
plsFit.test <- plsr(permeability^.33 ~ ., data = dtTest, scale=T,validation="CV") #NB: refits PLS on the test rows; see the cross-check after the output below
predict.pls <- (predict(plsFit, dtTest[,2:length(dtTest)], ncomp = 5))^3
plot(dtTest$permeability,predict.pls)
abline(0, 1)
From the 45-degree line above, we observe that our predictions tend to underestimate the compounds with high permeability scores.
R2(plsFit.test,ncomp=1:10)
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## -0.07015 0.07496 -0.03479 -0.04755 -0.15354 -0.26219
## 6 comps 7 comps 8 comps 9 comps 10 comps
## -0.26800 -0.28721 -0.33783 -0.29570 -0.24505
On the test set, R-squared is positive only for the one-component fit and turns negative as components are added; the sharp drop from the resampled training estimate indicates that the model generalizes poorly to unseen compounds.
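As a cross-check (a sketch not in the original), the trained model's test-set fit can also be scored directly from its predictions on the original permeability scale, rather than by refitting on the test rows:

# 1 - SSE/SST of the held-out predictions from the training-set model
1 - sum((dtTest$permeability - predict.pls)^2) /
    sum((dtTest$permeability - mean(dtTest$permeability))^2)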
__(e) Try building other models discussed in this chapter. Do any have better predictive performance?__
We use an ordinary linear regression as a benchmark. Note that lmFit2 below is fit on the full dataset (df1.select2), including the test rows, so its test-set RMSE is optimistic.
lmFit2 <- lm(permeability^.33 ~ ., data=df1.select2)
#summary(lmFit2) # ----- Adjusted R-squared: 0.7609
predict.lm <- (predict(lmFit2, dtTest[,2:length(dtTest)]))^3
We construct a lasso model (L1 regularization) for the permeability data using the enet() function from the elasticnet package, setting the ridge penalty lambda = 0 so that only the L1 penalty remains; at prediction time, s = .1 with mode = "fraction" selects the point on the lasso path where the coefficient L1 norm is 10% of the full least-squares solution's.
LassoModel <- enet(x = as.matrix(dtTrain[,2:ncol(dtTrain)]), y = dtTrain$permeability^.33,
lambda = 0)
LassoPred <- predict(LassoModel, newx = as.matrix(dtTest[,2:ncol(dtTest)]),
s = .1, mode = "fraction",
type = "fit")
predict.lasso <- LassoPred$fit^3
plot(dtTest$permeability,predict.lasso)
abline(0, 1)
cbind(actual=dtTest$permeability, predict.lm, predict.pls, predict.lasso)
## actual predict.lm predict.pls predict.lasso
## 7 25.3650 36.7185591 34.3951652 65.5395920326
## 8 0.5500 0.8473613 2.4696802 65205.0134383035
## 9 39.4550 22.0635556 14.6100571 16.3223809204
## 39 2.3700 2.6955601 1.7976210 0.4239620976
## 46 14.9850 16.5461144 4.7213466 32.4205671511
## 50 7.2200 3.6907005 4.2904620 2.3220132226
## 61 45.7650 34.7657486 26.8978582 6.4127641781
## 65 2.7400 4.4177367 6.1439000 13.7318629887
## 66 9.1200 11.7480335 14.0367706 11.8954026987
## 67 45.0650 43.3811421 10.4171772 -0.0067392630
## 70 1.6950 3.0311405 3.2291458 0.0737925527
## 71 6.4275 6.1246784 5.9872387 0.8675559026
## 76 1.6950 2.7268594 5.7546560 1.8039036745
## 78 13.2950 7.9471506 7.0227233 3.6189095808
## 85 18.9150 6.6502513 4.2621417 1.8509415144
## 95 12.4900 21.6666505 24.5246586 164.5801320719
## 102 17.6100 17.1120390 19.8876990 87.7128564457
## 103 4.3800 4.6163663 8.1281217 10.9070472059
## 104 1.7750 2.5526551 3.1874597 4.3649121437
## 105 1.1950 0.5772158 4.0566153 56482.8152663675
## 119 31.7075 36.7185591 34.3951652 65.5395920326
## 120 15.4700 22.7075740 27.8570682 29.9497689366
## 127 0.8650 0.8662554 1.8842996 -48352.5972391741
## 128 0.3750 0.8633652 0.9609017 1.6054357454
## 133 1.7400 4.3974224 14.3441696 12.4753647202
## 140 0.7550 0.1030742 0.7813554 2.0486708744
## 143 26.3000 31.2631532 24.2433459 25.9580597900
## 147 0.4550 0.1972776 0.2561956 0.0005934078
## 149 1.7000 4.4692968 2.5048420 1.6821784471
## 165 0.7950 0.5604702 0.4810523 -61942.0422280295
print(paste("Ordinary Linear Regression RMSE: ", rmse(dtTest$permeability, predict.lm)))
## [1] "Ordinary Linear Regression RMSE: 5.67186531238535"
print(paste("Partial Least Squares Regression RMSE: ", rmse(dtTest$permeability, predict.pls)))
## [1] "Partial Least Squares Regression RMSE: 10.2992076886302"
print(paste("Lasso Regression RMSE: ", rmse(dtTest$permeability, predict.lasso)))
## [1] "Lasso Regression RMSE: 21304.7757170117"
In terms of RMSE, the ordinary linear regression outperforms both the PLS and the lasso models on the test set, though recall that it was fit on the full dataset, which flatters its score; among the honestly held-out models, PLS beats the lasso by a wide margin.
__(f) Would you recommend any of your models to replace the permeability laboratory experiment?__
No. For biomedical applications with life-or-death implications, no degree of statistical significance can replace the laboratory experiments and clinical trials needed to verify a drug's efficacy and possible side effects.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
__(a) Start R and use these commands to load the data:__
library(VIM)  # aggr()
library(mice) # mice(), complete()

data(ChemicalManufacturingProcess)
df2 <- ChemicalManufacturingProcess
#str(df2) # Yield is the response; 57 predictors (12 biological materials, 45 manufacturing processes)
summary(df2$Yield)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.25 38.75 39.97 40.18 41.48 46.34
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. The target variable Yield contains the percent yield for each run.
__(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).__
sum(!complete.cases(df2)) #24 rows have missing values
From the VIM package's missingness plot, we see that ManufacturingProcess03 accounts for the highest share of missing values, and 86.36% of the observations are complete. Observation #1 alone accounts for several of the NA values; as it is the only case making a number of columns incomplete, we drop it from the dataset.
aggr_plot <- aggr(df2, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(df2), cex.axis=2, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## ManufacturingProcess03 0.085227273
## ManufacturingProcess11 0.056818182
## ManufacturingProcess10 0.051136364
## ManufacturingProcess25 0.028409091
## ManufacturingProcess26 0.028409091
## ManufacturingProcess27 0.028409091
## ManufacturingProcess28 0.028409091
## ManufacturingProcess29 0.028409091
## ManufacturingProcess30 0.028409091
## ManufacturingProcess31 0.028409091
## ManufacturingProcess33 0.028409091
## ManufacturingProcess34 0.028409091
## ManufacturingProcess35 0.028409091
## ManufacturingProcess36 0.028409091
## ManufacturingProcess02 0.017045455
## ManufacturingProcess06 0.011363636
## ManufacturingProcess01 0.005681818
## ManufacturingProcess04 0.005681818
## ManufacturingProcess05 0.005681818
## ManufacturingProcess07 0.005681818
## ManufacturingProcess08 0.005681818
## ManufacturingProcess12 0.005681818
## ManufacturingProcess14 0.005681818
## ManufacturingProcess22 0.005681818
## ManufacturingProcess23 0.005681818
## ManufacturingProcess24 0.005681818
## ManufacturingProcess40 0.005681818
## ManufacturingProcess41 0.005681818
## Yield 0.000000000
## BiologicalMaterial01 0.000000000
## BiologicalMaterial02 0.000000000
## BiologicalMaterial03 0.000000000
## BiologicalMaterial04 0.000000000
## BiologicalMaterial05 0.000000000
## BiologicalMaterial06 0.000000000
## BiologicalMaterial07 0.000000000
## BiologicalMaterial08 0.000000000
## BiologicalMaterial09 0.000000000
## BiologicalMaterial10 0.000000000
## BiologicalMaterial11 0.000000000
## BiologicalMaterial12 0.000000000
## ManufacturingProcess09 0.000000000
## ManufacturingProcess13 0.000000000
## ManufacturingProcess15 0.000000000
## ManufacturingProcess16 0.000000000
## ManufacturingProcess17 0.000000000
## ManufacturingProcess18 0.000000000
## ManufacturingProcess19 0.000000000
## ManufacturingProcess20 0.000000000
## ManufacturingProcess21 0.000000000
## ManufacturingProcess32 0.000000000
## ManufacturingProcess37 0.000000000
## ManufacturingProcess38 0.000000000
## ManufacturingProcess39 0.000000000
## ManufacturingProcess42 0.000000000
## ManufacturingProcess43 0.000000000
## ManufacturingProcess44 0.000000000
## ManufacturingProcess45 0.000000000
#https://datasciencebeginners.com/2018/11/07/visualization-of-imputed-values-using-vim/
df2 <- df2[-1,] #Dropping Row 1
For the remaining imputation, we apply the mice() function from the mice package; MICE stands for multivariate imputation by chained equations. The imputation strategy we use is 'norm.predict', which fills each missing cell with its linear-regression predicted value, appropriate here since all the predictors are numeric.
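The imputation chunk itself was not shown; below is a minimal sketch of the step described above, producing the completed data frame wholeData used in the rest of the analysis (the m, maxit, and seed values are assumptions):

# Impute remaining NAs with linear-regression predictions ("norm.predict")
imputed <- mice(df2, m = 1, method = "norm.predict", maxit = 5, seed = 100, printFlag = FALSE)
wholeData <- complete(imputed) # the fully imputed dataset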
#Ascertaining all missing values have been replaced
any(is.na(wholeData))
## [1] FALSE
__(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?__
# split the dataset into training and test set
#set.seed(100) # uncomment for a reproducible split
indexTrain <- sample(
  x=c(TRUE, FALSE),
  size=nrow(wholeData),
  replace=TRUE,
  prob=c(0.8, 0.2)
)
dtTrain2 <- wholeData[indexTrain,] #150 rows
dtTest2 <- wholeData[!indexTrain,] #24 rows
# mod <- lm(Yield~.,data=dtTrain2)
# summary(mod)
# Benchmark of standard linear regression: Adjusted R-squared: 0.6983
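To actually tune the lasso's fraction parameter by resampling, one could wrap elasticnet in caret (a sketch not in the original; the tuneLength and cross-validation settings are assumptions):

# Tune the lasso fraction by 5-fold CV repeated 5 times; caret's "lasso" method wraps elasticnet::enet
set.seed(100)
lassoTune <- train(Yield ~ ., data = dtTrain2,
                   method = "lasso",
                   metric = "RMSE",
                   tuneLength = 10,
                   trControl = trainControl(method = "repeatedcv", number = 5, repeats = 5))
#lassoTune$bestTune # optimal fraction; its resampled RMSE answers part (c)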
__(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?__
We again fit a lasso model to the training set and predict on the test set, arriving at a test RMSE of 1.152. This is slightly worse than the training-set RMSE of 0.964, as expected.
LassoModel2 <- enet(x = as.matrix(dtTrain2[,2:ncol(dtTrain2)]), y = dtTrain2$Yield,
lambda = 0)
LassoPred2 <- predict(LassoModel2, newx = as.matrix(dtTest2[,2:ncol(dtTest2)]),
s = .1, mode = "fraction",
type = "fit")
predict.lasso2 <- LassoPred2$fit
rmse(dtTest2$Yield, predict.lasso2)
## [1] 1.152131
LassoPred.train <- predict(LassoModel2, newx = as.matrix(dtTrain2[,2:ncol(dtTrain2)]),
s = .1, mode = "fraction",
type = "fit")
predict.lasso2.train <- LassoPred.train$fit
rmse(dtTrain2$Yield, predict.lasso2.train)
## [1] 0.9637036
__(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?__
ManufacturingProcess32 is the most important predictor according to the model. In general, the manufacturing process predictors account for a larger share of the explained variance than the biological materials do.
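One way to gauge this from the lasso fit (a sketch, not a chunk from the original; since enet() normalizes internally, raw coefficient magnitudes are only a rough guide) is to rank the coefficients at the same point on the path used for prediction:

# Coefficients on the lasso path at fraction s = 0.1; larger magnitudes suggest more influential predictors
lassoCoef <- predict(LassoModel2, s = .1, mode = "fraction", type = "coefficients")$coefficients
head(sort(abs(lassoCoef), decreasing = TRUE), 10)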
__(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?__
lmMod2 <- lm(Yield~ManufacturingProcess32,data=dtTrain2)
summary(lmMod2)
##
## Call:
## lm(formula = Yield ~ ManufacturingProcess32, data = dtTrain2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0825 -0.9690 -0.0575 0.8672 4.2859
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.11495 3.58577 1.705 0.0902 .
## ManufacturingProcess32 0.21520 0.02257 9.534 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.461 on 152 degrees of freedom
## Multiple R-squared: 0.3742, Adjusted R-squared: 0.3701
## F-statistic: 90.89 on 1 and 152 DF, p-value: < 0.00000000000000022
#Adjusted R-squared: 0.3701 (from the summary above)
# lmMod3 <- lm(Yield~ManufacturingProcess33,data=dtTrain2)
# summary(lmMod3)
# #Adjusted R-squared: 0.1332
#
# lmMod4 <- lm(Yield~BiologicalMaterial08,data=dtTrain2)
# summary(lmMod4)
# #Adjusted R-squared: 0.1459
#
# lmMod5 <- lm(Yield~BiologicalMaterial08+ManufacturingProcess33+ManufacturingProcess32,data=dtTrain2)
# summary(lmMod5)
# #Adjusted R-squared: 0.3833
From a univariate ordinary linear regression on ManufacturingProcess32, an R-squared of roughly 37% indicates that over a third of the variance in the yield can be attributed to this one predictor. Therefore, for the most cost-effective optimization of yield, an engineer would focus first and foremost on process 32.
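A scatterplot with the fitted line overlaid (a sketch not in the original) makes the positive relationship concrete: higher settings of ManufacturingProcess32 are associated with higher yield.

# Yield vs. ManufacturingProcess32, with the univariate regression line from lmMod2
plot(dtTrain2$ManufacturingProcess32, dtTrain2$Yield,
     xlab = "ManufacturingProcess32", ylab = "Yield")
abline(lmMod2)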