Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
library(caret)
#identify near-zero-variance (low-frequency) fingerprint predictors
x <- nearZeroVar(data.frame(fingerprints))
#drop those predictors from the fingerprint matrix
fp <- fingerprints[,-x]
paste("Dimensions of Fingerprints:")
## [1] "Dimensions of Fingerprints:"
dim(fingerprints)
## [1] 165 1107
paste("Dimensions of Fingerprints without low frequencies:")
## [1] "Dimensions of Fingerprints without low frequencies:"
dim(fp)
## [1] 165 388
As the dimensions show, filtering out the low-frequency predictors leaves 388 of the original 1,107 fingerprint predictors for modeling.
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
#load PLS package
library(pls, warn.conflicts = FALSE)
#convert the filtered fingerprint matrix to a data frame
fp <- as.data.frame(fp)
#split the data into training (80%) and test (20%) sets
set.seed(12)
trainingRows <- createDataPartition(permeability, p = .80, list = FALSE)
fp.train <- fp[trainingRows, ]
fp.test <- fp[-trainingRows, ]
p.train <- permeability[trainingRows, ]
p.test <- permeability[-trainingRows, ]
Prior to performing PLS, the predictors should be centered and scaled, especially if the predictors are on scales of differing magnitude.
#PLS model: center and scale predictors, tune up to 10 latent variables with repeated 10-fold CV
set.seed(13)
p.pls <- train(fp.train,
               p.train,
               method = 'pls',
               metric = 'Rsquared',
               tuneLength = 10,
               preProcess = c('center', 'scale'),
               trControl = trainControl(method = 'repeatedcv', repeats = 5))
plot(p.pls)
p.pls$results
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 12.68146 0.3665434 9.776116 3.123659 0.2771189 2.316821
## 2 2 11.55246 0.4827333 8.106699 3.113183 0.2676194 1.878597
## 3 3 11.48955 0.5019479 8.612267 2.673786 0.2330425 1.808798
## 4 4 11.72446 0.4914098 8.911053 2.691123 0.2208669 1.846095
## 5 5 11.81218 0.4733893 8.969090 2.739692 0.2282794 1.836669
## 6 6 11.66710 0.4732592 8.724174 2.897524 0.2352861 1.916482
## 7 7 11.78867 0.4656823 8.829696 2.766073 0.2254400 1.885928
## 8 8 11.94939 0.4549406 9.009219 2.650089 0.2152294 1.845693
## 9 9 12.21021 0.4441951 9.178497 2.736538 0.2200479 1.872916
## 10 10 12.42975 0.4381003 9.341113 2.831452 0.2171249 1.893385
Based on the resampling results, the model with 3 latent variables (ncomp = 3) has the lowest RMSE (11.49) and the highest R2 (0.502), so 3 latent variables are optimal and the corresponding resampled estimate of R2 is approximately 0.50.
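The selected number of components can also be read directly from the fitted train object; the check below is a small sketch that assumes the p.pls object from above.
#tuning parameter value selected by resampling
p.pls$bestTune
#resampled performance at the selected number of components
subset(p.pls$results, ncomp == p.pls$bestTune$ncomp)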
Predict the response for the test set. What is the test set estimate of R2?
p.pred <- predict(p.pls, newdata=fp.test)
postResample(pred = p.pred, obs = p.test)
## RMSE Rsquared MAE
## 12.6109752 0.4177249 9.7154060
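The test set estimate of R2 is approximately 0.42 (RMSE 12.61), somewhat lower than the resampled estimate of 0.50. A plot of observed versus predicted permeability gives a quick visual check of how well the PLS model generalizes; the snippet below is a sketch that assumes the p.pred and p.test objects from above.
#observed vs. predicted permeability on the test set
plot(p.pred, p.test,
     xlab = "Predicted permeability",
     ylab = "Observed permeability")
abline(0, 1, lty = 2) #reference line where predictions equal observations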
Try building other models discussed in this chapter. Do any have better predictive performance?
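No other models are fit here, but one way to start would be to reuse the same split, pre-processing, and resampling setup with another method from this chapter, for example an elastic net. The sketch below assumes the glmnet package is installed and uses caret's 'glmnet' method; the results would need to be run and compared with the PLS model before drawing any conclusions.
#sketch: tune an elastic net on the same training data
set.seed(13)
p.enet <- train(fp.train,
                p.train,
                method = 'glmnet',
                metric = 'Rsquared',
                tuneLength = 10,
                preProcess = c('center', 'scale'),
                trControl = trainControl(method = 'repeatedcv', repeats = 5))
#compare test set performance with the PLS model
postResample(pred = predict(p.enet, newdata = fp.test), obs = p.test)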