In this exercise, the author uses the permeability dataset from the AppliedPredictiveModeling package to develop a model for predicting compounds' permeability. A compound's ability to permeate relevant biological membranes is critically important to understand early in the drug discovery process. Compounds that appear effective against a particular disease in research screening experiments but are poorly permeable may need to be altered in order to improve permeability, and thus the compound's ability to reach the desired target.
In this project there were 165 unique compounds, with 1,107 molecular fingerprints determined for each. A molecular fingerprint is a binary sequence of numbers that represents the presence or absence of a specific molecular substructure. The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response. The response is highly skewed, the predictors are sparse (15.5 percent are present), and many predictors are strongly associated.
https://www.rdocumentation.org/packages/AppliedPredictiveModeling/versions/1.1-7/topics/permeability
__(a) Start R and use these commands to load the data:__
library(AppliedPredictiveModeling)
library(caret)      # nearZeroVar, createMultiFolds, train
library(pls)        # plsr, R2, RMSEP
library(elasticnet) # enet (lasso)
library(Metrics)    # assumed source of the rmse() helper used below

data(permeability)
df1 <- data.frame(fingerprints)
target <- data.frame(permeability)
dim(df1)
## [1] 165 1107
#head(df1)
From the histogram of the permeability scores, we observe a heavy skew to the right, which we seek to mitigate by taking the cube root of the values.
hist(permeability, breaks=10)
hist(permeability^.33, breaks=10)
#Checking for null values
any(is.na(df1))
## [1] FALSE
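As a quick sanity check (a one-line sketch not in the original), the 15.5 percent sparsity figure quoted above can be verified directly as the proportion of 1s among all cells of the fingerprint matrix:

# Proportion of cells equal to 1 across the whole 165 x 1107 matrix (~0.155)
mean(as.matrix(df1))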
__(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?__
Step 1: Removing Duplicate Columns. To cut down on the sheer number of predictors, we remove columns that are complete duplicates of one another. The unique() function removes duplicate rows, so we first transpose the dataframe. Unlike dplyr's distinct(), unique() lets us keep the substructure IDs in the process. This alone reduces the number of columns from 1,107 to 327.
df1.transposed <- data.frame(t(df1))
df1.distinct <- data.frame(t(unique(df1.transposed)))
head(df1.distinct)
Step 2: Removing Sparse Predictors. Next, we remove sparse substructures with nearZeroVar() from the caret package, which diagnoses predictors for which the ratio of the frequency of the most common value to the frequency of the second most common value is large. If a particular level (0 or 1) of a column occurs fewer than 16 times (just under 10% of the 165 samples), we remove that column due to its relative sparsity of incidence; hence freqCut = 149/16, the frequency ratio of a column whose minority level appears exactly 16 times. This further reduces our dimensionality to 117 predictors.
df1.filter <- nearZeroVar(df1.distinct, freqCut = 149/16, saveMetrics=TRUE)
df1.select <- df1.distinct[, !df1.filter$nzv] # keep only columns not flagged as near-zero variance
dim(df1.select)
## [1] 165 117
head(df1.select)
Step 3: Removing Singularities. Finally, we remove variables whose coefficients are undefined in a linear regression due to singularity; we expect these to be predictors that are highly correlated with other predictors or with the target itself. This narrows the predictor space down to 90 dimensions. We note an adjusted R-squared of 0.7609 for the cube root of the target, but with only 18 explanatory variables attaining a p-value below .05 in the linear regression model. The cube root yields a better R-squared than either the original permeability value or its square root (a comparison sketched after the code below).
df1.all <- cbind(target,df1.select)
lmFit <- lm(permeability ~ ., data=df1.all) # fit on df1.all, which contains the response
index2 <- !is.na(lmFit$coefficients) # the (Intercept) slot lines up with the permeability column, so positional indexing works
df1.select2 <- df1.all[,index2]
dim(df1.select2)
## [1] 165 91
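The transformation comparison claimed above can be reproduced with three quick fits (a sketch not in the original write-up; it assumes df1.all as constructed above):

# Adjusted R-squared for three candidate target transformations
summary(lm(permeability     ~ ., data = df1.all))$adj.r.squared
summary(lm(permeability^.5  ~ ., data = df1.all))$adj.r.squared
summary(lm(permeability^.33 ~ ., data = df1.all))$adj.r.squared # per the text, the cube root scores best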
In our new feature space, the 90 surviving predictors (of the initial 1,107) are listed here, along with the target.
colnames(df1.select2)
## [1] "permeability" "X1" "X2" "X3" "X6"
## [6] "X11" "X12" "X15" "X16" "X25"
## [11] "X35" "X36" "X41" "X42" "X48"
## [16] "X86" "X87" "X88" "X93" "X96"
## [21] "X98" "X102" "X103" "X111" "X121"
## [26] "X125" "X141" "X143" "X150" "X153"
## [31] "X156" "X157" "X158" "X159" "X182"
## [36] "X221" "X226" "X229" "X230" "X231"
## [41] "X235" "X236" "X237" "X238" "X241"
## [46] "X242" "X247" "X248" "X251" "X257"
## [51] "X258" "X260" "X262" "X263" "X270"
## [56] "X271" "X272" "X276" "X278" "X280"
## [61] "X295" "X296" "X297" "X311" "X312"
## [66] "X313" "X314" "X315" "X319" "X320"
## [71] "X322" "X329" "X338" "X340" "X342"
## [76] "X357" "X358" "X359" "X361" "X366"
## [81] "X370" "X371" "X374" "X496" "X497"
## [86] "X503" "X507" "X509" "X510" "X571"
## [91] "X573"
__(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?__
We set a train-test split of 80/20.
# split the dataset into training and test set
#set.seed(100) # uncomment for a reproducible split
indexTrain <- sample(
  x=c(TRUE, FALSE),
  size=nrow(df1.select2),
  replace=TRUE,
  prob=c(0.8, 0.2)
)
dtTrain <- df1.select2[indexTrain,] #129 rows
dtTest <- df1.select2[!indexTrain,] #36 rows
When we tune the partial least squares (PLS) model, with the only pre-processing being the cube-root transformation of the target and the removal of singular predictors, we find that root mean squared error (RMSE) is minimized with five components. As all the explanatory variables are dummy variables, no additional scaling or transformation is needed for them. We apply 5-fold cross-validation to the training set, averaging across 10 repeats.
# Compile cross-validation settings
set.seed(100)
myfolds <- createMultiFolds(dtTrain$permeability, k = 5, times = 10)
control <- trainControl("repeatedcv", index = myfolds, selectionFunction = "best")
# Train PLS model
mod1 <- train(permeability^.33 ~ ., data = dtTrain,
method = "pls",
metric = "RMSE",
tuneLength = 20,
trControl = control)
# Check CV profile
plot(mod1)
That the PLS model is optimized at five components is further confirmed below, with an RMSE of 0.6378 using adjusted cross-validation. With five components, 74.05% of the variance in the cube root of the target can be accounted for by the model; 20 components are required for this figure to exceed 90%.
plsFit <- plsr(permeability^.33 ~ ., data = dtTrain, scale=T,validation="CV")
#summary(plsFit)
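The adjusted-CV error cited above can be read off with the pls package's RMSEP() accessor (a sketch; output omitted, as with the commented-out summary):

# Cross-validated RMSEP by number of components; the "adjCV" row holds the figures cited above
RMSEP(plsFit, estimate = "adjCV", ncomp = 1:10)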
By this measure, the cross-validated R-squared peaks at 0.48227 with six components, with five components a close second at 0.47334.
# Compile models and compare performance
options(scipen = 999)
R2(plsFit,ncomp=1:10)
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## -0.01498 0.31777 0.43101 0.46220 0.45503 0.47334
## 6 comps 7 comps 8 comps 9 comps 10 comps
## 0.48227 0.45360 0.43374 0.43987 0.44734
__(d) Predict the response for the test set. What is the test set estimate of R2?__
plsFit.test <- plsr(permeability^.33 ~ ., data = dtTest, scale=T,validation="CV") #NB: refits PLS on the test rows; see the cross-check after the output below
predict.pls <- (predict(plsFit, dtTest[,2:length(dtTest)], ncomp = 5))^3
plot(dtTest$permeability,predict.pls)
abline(0, 1)
From the 45-degree line above, we observe that our predictions tend to underestimate the compounds with high permeability scores.
R2(plsFit.test,ncomp=1:10)
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## -0.07015 0.07496 -0.03479 -0.04755 -0.15354 -0.26219
## 6 comps 7 comps 8 comps 9 comps 10 comps
## -0.26800 -0.28721 -0.33783 -0.29570 -0.24505
On the test set, R-squared is positive only for the one-component fit and turns negative as components are added; the sharp drop from the resampled training estimate indicates that the model generalizes poorly to unseen compounds.
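As a cross-check (a sketch not in the original), the trained model's test-set fit can also be scored directly from its predictions on the original permeability scale, rather than by refitting on the test rows:

# 1 - SSE/SST of the held-out predictions from the training-set model
1 - sum((dtTest$permeability - predict.pls)^2) /
    sum((dtTest$permeability - mean(dtTest$permeability))^2)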
__(e) Try building other models discussed in this chapter. Do any have better predictive performance?__
We use an ordinary linear regression as a benchmark. Note that lmFit2 below is fit on the full dataset (df1.select2), including the test rows, so its test-set RMSE is optimistic.
lmFit2 <- lm(permeability^.33 ~ ., data=df1.select2)
#summary(lmFit2) # ----- Adjusted R-squared: 0.7609
predict.lm <- (predict(lmFit2, dtTest[,2:length(dtTest)]))^3
We construct a lasso model (L1 regularization) for the permeability data using the enet() function from the elasticnet package, setting the ridge penalty lambda = 0 so that only the L1 penalty remains; at prediction time, s = .1 with mode = "fraction" selects the point on the lasso path where the coefficient L1 norm is 10% of the full least-squares solution's.
LassoModel <- enet(x = as.matrix(dtTrain[,2:ncol(dtTrain)]), y = dtTrain$permeability^.33,
lambda = 0)
LassoPred <- predict(LassoModel, newx = as.matrix(dtTest[,2:ncol(dtTest)]),
s = .1, mode = "fraction",
type = "fit")
predict.lasso <- LassoPred$fit^3
plot(dtTest$permeability,predict.lasso)
abline(0, 1)
cbind(actual=dtTest$permeability, predict.lm, predict.pls, predict.lasso)
## actual predict.lm predict.pls predict.lasso
## 7 25.3650 36.7185591 34.3951652 65.5395920326
## 8 0.5500 0.8473613 2.4696802 65205.0134383035
## 9 39.4550 22.0635556 14.6100571 16.3223809204
## 39 2.3700 2.6955601 1.7976210 0.4239620976
## 46 14.9850 16.5461144 4.7213466 32.4205671511
## 50 7.2200 3.6907005 4.2904620 2.3220132226
## 61 45.7650 34.7657486 26.8978582 6.4127641781
## 65 2.7400 4.4177367 6.1439000 13.7318629887
## 66 9.1200 11.7480335 14.0367706 11.8954026987
## 67 45.0650 43.3811421 10.4171772 -0.0067392630
## 70 1.6950 3.0311405 3.2291458 0.0737925527
## 71 6.4275 6.1246784 5.9872387 0.8675559026
## 76 1.6950 2.7268594 5.7546560 1.8039036745
## 78 13.2950 7.9471506 7.0227233 3.6189095808
## 85 18.9150 6.6502513 4.2621417 1.8509415144
## 95 12.4900 21.6666505 24.5246586 164.5801320719
## 102 17.6100 17.1120390 19.8876990 87.7128564457
## 103 4.3800 4.6163663 8.1281217 10.9070472059
## 104 1.7750 2.5526551 3.1874597 4.3649121437
## 105 1.1950 0.5772158 4.0566153 56482.8152663675
## 119 31.7075 36.7185591 34.3951652 65.5395920326
## 120 15.4700 22.7075740 27.8570682 29.9497689366
## 127 0.8650 0.8662554 1.8842996 -48352.5972391741
## 128 0.3750 0.8633652 0.9609017 1.6054357454
## 133 1.7400 4.3974224 14.3441696 12.4753647202
## 140 0.7550 0.1030742 0.7813554 2.0486708744
## 143 26.3000 31.2631532 24.2433459 25.9580597900
## 147 0.4550 0.1972776 0.2561956 0.0005934078
## 149 1.7000 4.4692968 2.5048420 1.6821784471
## 165 0.7950 0.5604702 0.4810523 -61942.0422280295
print(paste("Ordinary Linear Regression RMSE: ", rmse(dtTest$permeability, predict.lm)))
## [1] "Ordinary Linear Regression RMSE: 5.67186531238535"
print(paste("Partial Least Squares Regression RMSE: ", rmse(dtTest$permeability, predict.pls)))
## [1] "Partial Least Squares Regression RMSE: 10.2992076886302"
print(paste("Lasso Regression RMSE: ", rmse(dtTest$permeability, predict.lasso)))
## [1] "Lasso Regression RMSE: 21304.7757170117"
In terms of RMSE, the ordinary linear regression outperforms both the PLS and the lasso models on the test set, though recall that it was fit on the full dataset, which flatters its score; among the honestly held-out models, PLS beats the lasso by a wide margin.
__(f) Would you recommend any of your models to replace the permeability laboratory experiment?__
No. For biomedical applications with life-or-death implications, no degree of statistical significance can replace the laboratory experiments and clinical trials needed to verify a drug's efficacy and possible side effects.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
__(a) Start R and use these commands to load the data:__
library(VIM)  # aggr()
library(mice) # mice(), complete()

data(ChemicalManufacturingProcess)
df2 <- ChemicalManufacturingProcess
#str(df2) # Yield is the response; 57 predictors (12 biological materials, 45 manufacturing processes)
summary(df2$Yield)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.25 38.75 39.97 40.18 41.48 46.34
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. The target variable Yield contains the percent yield for each run.
__(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).__
sum(!complete.cases(df2)) #24 rows have missing values
From the VIM package's missingness plot, we see that ManufacturingProcess03 accounts for the highest share of missing values, and 86.36% of the observations are complete. Observation #1 alone accounts for several of the NA values; as it is the only case making a number of columns incomplete, we drop it from the dataset.
aggr_plot <- aggr(df2, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(df2), cex.axis=2, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## ManufacturingProcess03 0.085227273
## ManufacturingProcess11 0.056818182
## ManufacturingProcess10 0.051136364
## ManufacturingProcess25 0.028409091
## ManufacturingProcess26 0.028409091
## ManufacturingProcess27 0.028409091
## ManufacturingProcess28 0.028409091
## ManufacturingProcess29 0.028409091
## ManufacturingProcess30 0.028409091
## ManufacturingProcess31 0.028409091
## ManufacturingProcess33 0.028409091
## ManufacturingProcess34 0.028409091
## ManufacturingProcess35 0.028409091
## ManufacturingProcess36 0.028409091
## ManufacturingProcess02 0.017045455
## ManufacturingProcess06 0.011363636
## ManufacturingProcess01 0.005681818
## ManufacturingProcess04 0.005681818
## ManufacturingProcess05 0.005681818
## ManufacturingProcess07 0.005681818
## ManufacturingProcess08 0.005681818
## ManufacturingProcess12 0.005681818
## ManufacturingProcess14 0.005681818
## ManufacturingProcess22 0.005681818
## ManufacturingProcess23 0.005681818
## ManufacturingProcess24 0.005681818
## ManufacturingProcess40 0.005681818
## ManufacturingProcess41 0.005681818
## Yield 0.000000000
## BiologicalMaterial01 0.000000000
## BiologicalMaterial02 0.000000000
## BiologicalMaterial03 0.000000000
## BiologicalMaterial04 0.000000000
## BiologicalMaterial05 0.000000000
## BiologicalMaterial06 0.000000000
## BiologicalMaterial07 0.000000000
## BiologicalMaterial08 0.000000000
## BiologicalMaterial09 0.000000000
## BiologicalMaterial10 0.000000000
## BiologicalMaterial11 0.000000000
## BiologicalMaterial12 0.000000000
## ManufacturingProcess09 0.000000000
## ManufacturingProcess13 0.000000000
## ManufacturingProcess15 0.000000000
## ManufacturingProcess16 0.000000000
## ManufacturingProcess17 0.000000000
## ManufacturingProcess18 0.000000000
## ManufacturingProcess19 0.000000000
## ManufacturingProcess20 0.000000000
## ManufacturingProcess21 0.000000000
## ManufacturingProcess32 0.000000000
## ManufacturingProcess37 0.000000000
## ManufacturingProcess38 0.000000000
## ManufacturingProcess39 0.000000000
## ManufacturingProcess42 0.000000000
## ManufacturingProcess43 0.000000000
## ManufacturingProcess44 0.000000000
## ManufacturingProcess45 0.000000000
#https://datasciencebeginners.com/2018/11/07/visualization-of-imputed-values-using-vim/
df2 <- df2[-1,] #Dropping Row 1
For the remaining imputation, we apply the mice() function from the mice package; MICE stands for multivariate imputation by chained equations. The imputation strategy we use is 'norm.predict', which fills each missing cell with its linear-regression predicted value, appropriate here since all the predictors are numeric.
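The imputation chunk itself was not shown; below is a minimal sketch of the step described above, producing the completed data frame wholeData used in the rest of the analysis (the m, maxit, and seed values are assumptions):

# Impute remaining NAs with linear-regression predictions ("norm.predict")
imputed <- mice(df2, m = 1, method = "norm.predict", maxit = 5, seed = 100, printFlag = FALSE)
wholeData <- complete(imputed) # the fully imputed dataset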
#Ascertaining all missing values have been replaced
any(is.na(wholeData))
## [1] FALSE
__(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?__
# split the dataset into training and test set
#set.seed(100) # uncomment for a reproducible split
indexTrain <- sample(
  x=c(TRUE, FALSE),
  size=nrow(wholeData),
  replace=TRUE,
  prob=c(0.8, 0.2)
)
dtTrain2 <- wholeData[indexTrain,] #150 rows
dtTest2 <- wholeData[!indexTrain,] #24 rows
# mod <- lm(Yield~.,data=dtTrain2)
# summary(mod)
# Benchmark of standard linear regression: Adjusted R-squared: 0.6983
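To actually tune the lasso's fraction parameter by resampling, one could wrap elasticnet in caret (a sketch not in the original; the tuneLength and cross-validation settings are assumptions):

# Tune the lasso fraction by 5-fold CV repeated 5 times; caret's "lasso" method wraps elasticnet::enet
set.seed(100)
lassoTune <- train(Yield ~ ., data = dtTrain2,
                   method = "lasso",
                   metric = "RMSE",
                   tuneLength = 10,
                   trControl = trainControl(method = "repeatedcv", number = 5, repeats = 5))
#lassoTune$bestTune # optimal fraction; its resampled RMSE answers part (c)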
__(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?__
We again fit a lasso model to the training set and predict on the test set, arriving at a test RMSE of 1.152. This is slightly worse than the training-set RMSE of 0.964, as expected.
LassoModel2 <- enet(x = as.matrix(dtTrain2[,2:ncol(dtTrain2)]), y = dtTrain2$Yield,
lambda = 0)
LassoPred2 <- predict(LassoModel2, newx = as.matrix(dtTest2[,2:ncol(dtTest2)]),
s = .1, mode = "fraction",
type = "fit")
predict.lasso2 <- LassoPred2$fit
rmse(dtTest2$Yield, predict.lasso2)
## [1] 1.152131
LassoPred.train <- predict(LassoModel2, newx = as.matrix(dtTrain2[,2:ncol(dtTrain2)]),
s = .1, mode = "fraction",
type = "fit")
predict.lasso2.train <- LassoPred.train$fit
rmse(dtTrain2$Yield, predict.lasso2.train)
## [1] 0.9637036
__(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?__
ManufacturingProcess32 is the most important predictor according to the model. In general, the manufacturing process predictors account for a larger share of the explained variance than the biological materials do.
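One way to gauge this from the lasso fit (a sketch, not a chunk from the original; since enet() normalizes internally, raw coefficient magnitudes are only a rough guide) is to rank the coefficients at the same point on the path used for prediction:

# Coefficients on the lasso path at fraction s = 0.1; larger magnitudes suggest more influential predictors
lassoCoef <- predict(LassoModel2, s = .1, mode = "fraction", type = "coefficients")$coefficients
head(sort(abs(lassoCoef), decreasing = TRUE), 10)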
__(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?__
lmMod2 <- lm(Yield~ManufacturingProcess32,data=dtTrain2)
summary(lmMod2)
##
## Call:
## lm(formula = Yield ~ ManufacturingProcess32, data = dtTrain2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0825 -0.9690 -0.0575 0.8672 4.2859
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.11495 3.58577 1.705 0.0902 .
## ManufacturingProcess32 0.21520 0.02257 9.534 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.461 on 152 degrees of freedom
## Multiple R-squared: 0.3742, Adjusted R-squared: 0.3701
## F-statistic: 90.89 on 1 and 152 DF, p-value: < 0.00000000000000022
#Adjusted R-squared: 0.3701 (from the summary above)
# lmMod3 <- lm(Yield~ManufacturingProcess33,data=dtTrain2)
# summary(lmMod3)
# #Adjusted R-squared: 0.1332
#
# lmMod4 <- lm(Yield~BiologicalMaterial08,data=dtTrain2)
# summary(lmMod4)
# #Adjusted R-squared: 0.1459
#
# lmMod5 <- lm(Yield~BiologicalMaterial08+ManufacturingProcess33+ManufacturingProcess32,data=dtTrain2)
# summary(lmMod5)
# #Adjusted R-squared: 0.3833
From a univariate ordinary linear regression on ManufacturingProcess32, an R-squared of roughly 37% indicates that over a third of the variance in the yield can be attributed to this one predictor. Therefore, for the most cost-effective optimization of yield, an engineer would focus first and foremost on process 32.
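A scatterplot with the fitted line overlaid (a sketch not in the original) makes the positive relationship concrete: higher settings of ManufacturingProcess32 are associated with higher yield.

# Yield vs. ManufacturingProcess32, with the univariate regression line from lmMod2
plot(dtTrain2$ManufacturingProcess32, dtTrain2$Yield,
     xlab = "ManufacturingProcess32", ylab = "Yield")
abline(lmMod2)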