6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

A

library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

B

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

library(caret)
x <- nearZeroVar(data.frame(fingerprints))
fp <- fingerprints[,-x]

paste("Dimensions of Fingerprints:")
## [1] "Dimensions of Fingerprints:"
dim(fingerprints)
## [1]  165 1107
paste("Dimensions of Fingerprints without low frequencies:")
## [1] "Dimensions of Fingerprints without low frequencies:"
dim(fp)
## [1] 165 388

As you can see from the dimensions, the fingerprint dataset with low frequency predictors filtered has 388 predictors remaining.

C

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

#load PLS package
library(pls,warn.conflicts = FALSE)
#set data as a dataframe
fp <- as.data.frame(fp)
#split the data
set.seed(12)
trainingRows <- createDataPartition(permeability, p=.80, list = FALSE)
fp.train <- fp[trainingRows,]
fp.test <- fp[-trainingRows,]
p.train <- permeability[trainingRows,]
p.test <- permeability[-trainingRows,]

Prior to performing PLS, the predictors should be centered and scaled, especially if the predictors are on scales of differing magnitude.

#PLS Model
set.seed(13)

p.pls <- train(fp.train,
               p.train,
              method = 'pls',
              metric = 'Rsquared',
              tuneLength = 10,
              preProcess = c('center','scale'),
              trControl = trainControl(method = 'repeatedcv',repeats = 5))

plot(p.pls)

p.pls$results
##    ncomp     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      1 12.68146 0.3665434 9.776116 3.123659  0.2771189 2.316821
## 2      2 11.55246 0.4827333 8.106699 3.113183  0.2676194 1.878597
## 3      3 11.48955 0.5019479 8.612267 2.673786  0.2330425 1.808798
## 4      4 11.72446 0.4914098 8.911053 2.691123  0.2208669 1.846095
## 5      5 11.81218 0.4733893 8.969090 2.739692  0.2282794 1.836669
## 6      6 11.66710 0.4732592 8.724174 2.897524  0.2352861 1.916482
## 7      7 11.78867 0.4656823 8.829696 2.766073  0.2254400 1.885928
## 8      8 11.94939 0.4549406 9.009219 2.650089  0.2152294 1.845693
## 9      9 12.21021 0.4441951 9.178497 2.736538  0.2200479 1.872916
## 10    10 12.42975 0.4381003 9.341113 2.831452  0.2171249 1.893385

Based on our results, ncomp3 has the lowest RMSE(11.49) and the highest rsquared value (0.502) indicating that it is the strongest model.

D

Predict the response for the test set. What is the test set estimate of R2?

p.pred <- predict(p.pls, newdata=fp.test)
postResample(pred = p.pred, obs = p.test)
##       RMSE   Rsquared        MAE 
## 12.6109752  0.4177249  9.7154060

E

Try building other models discussed in this chapter. Do any have better predictive performance?

  1. Would you recommend any of your models to replace the permeability laboratory experiment?