Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company while, at the same time, more rapidly identifying molecules that have sufficient permeability to become a drug:
library(caret)
library(AppliedPredictiveModeling)
library(tidyr)
library(dplyr)
library(e1071)
library(elasticnet)
library(knitr)
library(pls)
library(ggplot2)
library(tidyverse)
library(kableExtra)
library(RANN)
library(corrplot)
Start R and use these commands to load the data:
data(permeability)
head(permeability) %>%
kable() %>%
kable_styling()
permeability |
---|
12.520 |
1.120 |
19.405 |
1.730 |
1.680 |
0.510 |
Permeability data: this pharmaceutical data set was used to develop a model for predicting compounds' permeability (i.e., a molecule's ability to cross a membrane). It holds 165 unique compounds, each described by 1,107 binary molecular fingerprints.
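These dimensions are easy to confirm once the data are loaded (fingerprints and permeability are the two objects created by data(permeability)):
dim(fingerprints)  # 165 compounds x 1107 fingerprint predictors
dim(permeability)  # 165 x 1 matrix holding the response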
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
nzvar <- nearZeroVar(fingerprints)  # indices of near-zero-variance fingerprints
fpfilter <- fingerprints[, -nzvar]  # drop them
ncol(fpfilter)
## [1] 388
After filtering, 388 predictors remain for modeling. The nearZeroVar function identifies near-zero-variance predictors: those that have very few unique values relative to the number of samples and for which the ratio of the frequency of the most common value to the frequency of the second most common value is large.
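These criteria can be inspected directly by asking nearZeroVar for its diagnostics (a quick check; by default, predictors are flagged when freqRatio exceeds 95/5 and percentUnique falls below 10%):
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
# freqRatio: frequency of the most common value over the second most common
# percentUnique: unique values as a percentage of the number of samples
head(nzv_metrics[nzv_metrics$nzv, ])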
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
fingerprintsdf <- as.data.frame(fpfilter)
# Append the response; permeability loads as a one-column matrix, so flatten it to a vector
df <- fingerprintsdf %>% mutate(permeability = as.vector(permeability))
head(df) %>%
kable() %>%
kable_styling()
The first six of the 389 columns are shown below; the remaining 382 binary fingerprint columns are omitted for width.
X1 | X2 | X3 | X4 | X5 | X6 | … | permeability |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 1 | … | 12.520 |
0 | 0 | 0 | 0 | 0 | 0 | … | 1.120 |
0 | 0 | 0 | 0 | 0 | 1 | … | 19.405 |
0 | 0 | 0 | 0 | 0 | 0 | … | 1.730 |
0 | 0 | 0 | 0 | 0 | 0 | … | 1.680 |
0 | 0 | 0 | 0 | 0 | 0 | … | 0.510 |
# Set random seed for reproducibility
set.seed(10)
# Create an 80/20 train/test split
in_train <- createDataPartition(df$permeability, p = 0.8, times = 1, list = FALSE)
train_df <- df[in_train, ]
test_df <- df[-in_train, ]
# Tune a PLS model with 10-fold cross-validation over 1-10 components;
# center = TRUE is passed through to the underlying pls fit
pmodel <- train(permeability ~ ., data = train_df,
                method = "pls",
                center = TRUE,
                trControl = trainControl("cv", number = 10),
                tuneLength = 10)
# Plot model RMSE vs. number of components
ggplot(pmodel) +
  xlab("Number of Components") +
  ggtitle("PLS Model")
pmodel$bestTune$ncomp
## [1] 6
The best tuning parameter, ncomp = 6, minimizes the cross-validation RMSE.
summary(pmodel$finalModel)
## Data: X dimension: 133 388
## Y dimension: 133 1
## Fit method: oscorespls
## Number of components considered: 6
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 27.14 43.27 50.06 52.74 56.36 65.55
## .outcome 31.03 51.39 59.75 69.77 75.26 77.04
pmodel$results %>%
kable() %>%
kable_styling()
ncomp | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
---|---|---|---|---|---|---|
1 | 13.09201 | 0.3906819 | 10.321637 | 3.908703 | 0.2666489 | 2.876551 |
2 | 11.72473 | 0.5206390 | 8.445616 | 3.400597 | 0.1985467 | 2.163855 |
3 | 11.43216 | 0.5331825 | 8.482856 | 2.581718 | 0.1581227 | 1.931290 |
4 | 11.46963 | 0.5377636 | 8.951449 | 1.753067 | 0.1380532 | 1.541152 |
5 | 11.17025 | 0.5475971 | 8.604173 | 1.782655 | 0.1617694 | 1.294497 |
6 | 10.78514 | 0.5696499 | 8.346736 | 1.787969 | 0.1587404 | 1.319356 |
7 | 10.81335 | 0.5702843 | 8.402793 | 1.972811 | 0.1430146 | 1.230677 |
8 | 10.88491 | 0.5658647 | 8.377881 | 1.810761 | 0.1495770 | 1.357485 |
9 | 10.98097 | 0.5651603 | 8.386326 | 2.151343 | 0.1519615 | 1.618178 |
10 | 11.01367 | 0.5787205 | 8.371964 | 2.198121 | 0.1270145 | 1.692152 |
Six components capture 65.55% of the variance in the predictors and 77.04% of the variance in the response.
Predict the response for the test set. What is the test set estimate of R2?
# Make predictions
predictions <- pmodel %>% predict(test_df)
results <- data.frame(Model = "PLS Model",
RMSE = caret::RMSE(predictions, test_df$permeability),
Rsquared = caret::R2(predictions, test_df$permeability))
results %>%
kable() %>%
kable_styling()
Model | RMSE | Rsquared |
---|---|---|
PLS Model | 13.40002 | 0.2414261 |
The test-set R2 is 0.24 and the RMSE is 13.40, noticeably worse than the resampled estimates from training (R2 0.57, RMSE 10.79).
# Observed vs. predicted permeability on the test set
plot(test_df$permeability, predictions,
     xlab = "Observed", ylab = "Predicted")
abline(0, 1, lty = 2)
Try building other models discussed in this chapter. Do any have better predictive performance?
# A plain linear model as a baseline; with 388 predictors and ~130 training rows it is rank-deficient and overfits
fit <- lm(permeability ~ ., train_df)
data_clear <- fingerprints[, -nearZeroVar(fingerprints)]
data_clear <- cbind(data.frame(permeability),data_clear) #adding permeability
number <- floor(0.70 * nrow(data_clear)) # 70/30 split
idx <- sample(seq_len(nrow(data_clear)), size = number)
train_df <- data_clear[idx, ]
test_df <- data_clear[-idx, ]
#train the Elastic Net model
elastic_model <- train(x = train_df[, -1],
                       y = train_df$permeability,
                       method = 'enet',
                       metric = 'RMSE',  # error metric to optimize
                       tuneGrid = expand.grid(.fraction = seq(0, 1, by = 0.2),
                                              .lambda = seq(0, 1, by = 0.2)),
                       trControl = trainControl(method = 'cv', number = 10),
                       preProcess = c('center', 'scale'))
plot(elastic_model)
# Best params
elastic_model$bestTune
## fraction lambda
## 8 0.2 0.2
# Perf of best params
getTrainPerf(elastic_model)
## TrainRMSE TrainRsquared TrainMAE method
## 1 11.20948 0.5283559 7.925961 enet
Elastic net is a popular form of regularized linear regression that combines two penalties, the L1 (lasso) and L2 (ridge) penalty functions. The optimal settings here are fraction = 0.2 and lambda = 0.2, and the tuning chart shows R2 declining as the penalty moves away from that optimum; the resampled RMSE (11.21) and R2 (0.53) are comparable to the PLS model's.
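For reference, the same penalty mix can be written explicitly with glmnet, where alpha blends the two penalties (a minimal sketch, assuming the glmnet package is installed; alpha = 0.5 is an illustrative choice, not the tuned value above):
library(glmnet)
# alpha = 1 is pure lasso (L1); alpha = 0 is pure ridge (L2)
enet_fit <- glmnet(x = as.matrix(train_df[, -1]), y = train_df$permeability, alpha = 0.5)
plot(enet_fit, xvar = "lambda", label = TRUE)  # coefficient paths shrink as lambda grows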
Would you recommend any of your models to replace the permeability laboratory experiment?
I would not recommend replacing the laboratory experiment with either model: the best test-set R2 is only about 0.24, well short of the accuracy the lab assay provides. At most, the models could pre-screen compounds so that only promising candidates are sent for permeability testing.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
Start R and use these commands to load the data:
data(ChemicalManufacturingProcess)
chem <- ChemicalManufacturingProcess
head(chem) %>%
kable() %>%
kable_styling()
Yield and the first three of the 58 columns are shown below; the remaining predictor columns are omitted for width.
Yield | BiologicalMaterial01 | BiologicalMaterial02 | BiologicalMaterial03 | … |
---|---|---|---|---|
38.00 | 6.25 | 49.58 | 56.97 | … |
42.44 | 8.01 | 60.97 | 67.48 | … |
42.03 | 8.01 | 60.97 | 67.48 | … |
41.42 | 8.01 | 60.97 | 67.48 | … |
42.49 | 7.47 | 63.33 | 72.25 | … |
43.57 | 6.12 | 58.36 | 65.31 | … |
This data set contains information about a chemical manufacturing process, in which the goal is to understand the relationship between the process and the resulting final product yield. Raw material in this process is put through a sequence of 27 steps to generate the final pharmaceutical product. The starting material is generated from a biological unit and has a range of quality and characteristics. The objective in this project was to develop a model to predict percent yield of the manufacturing process. The data set consisted of 177 samples of biological material for which 57 characteristics were measured. Of the 57 characteristics, there were 12 measurements of the biological starting material, and 45 measurements of the manufacturing process. The process variables included measurements such as temperature, drying time, washing time, and concentrations of by–products at various steps. Some of the process measurements can be controlled, while others are observed. Predictors are continuous, count, categorical; some are correlated, and some contain missing values. Samples are not independent because sets of samples come from the same batch of biological starting material.
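The split between the two predictor groups can be tallied by name prefix:
sum(grepl("^BiologicalMaterial", names(chem)))    # 12 biological predictors
sum(grepl("^ManufacturingProcess", names(chem)))  # 45 process predictors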
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
# Make this reproducible
set.seed(42)
knn_model <- preProcess(ChemicalManufacturingProcess, "knnImpute")
df_no_missing <- predict(knn_model, ChemicalManufacturingProcess)
Missing values are imputed with the k-nearest-neighbors method; note that caret's knnImpute also centers and scales every column as part of the transformation.
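A before-and-after count of missing cells verifies that the imputation filled every gap:
sum(is.na(ChemicalManufacturingProcess))  # missing cells before imputation
sum(is.na(df_no_missing))                 # 0 after knnImpute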
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Split data at 70:30 ratio.
number <- floor(0.70 * nrow(df_no_missing)) # 70/30 split
idx <- sample(seq_len(nrow(df_no_missing)), size = number)
training_df <- df_no_missing[idx, ]
testing_df <- df_no_missing[-idx, ]
# build PLS model
pmodel <- train(
Yield ~ ., data = training_df, method = "pls",
center = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 25
)
#pls model results
plot(pmodel$results$Rsquared,
xlab = "ncomp",
ylab = "Rsquared"
)
pmodel$results %>%
filter(ncomp == pmodel$bestTune$ncomp) %>%
select(ncomp, RMSE, Rsquared) %>%
kable() %>%
kable_styling()
ncomp | RMSE | Rsquared |
---|---|---|
3 | 0.6430892 | 0.613201 |
The optimal number of components is 3, with a cross-validated RMSE of 0.64 and an R2 of 0.61; in other words, the model explains about 61% of the variance in yield.
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
# Make predictions
pred <- predict(pmodel, testing_df)
# Model Evaluation
results <- data.frame(Model = "PLS Model",
RMSE = caret::RMSE(pred, testing_df$Yield),
Rsquared = caret::R2(pred, testing_df$Yield))
results %>%
kable() %>%
kable_styling()
Model | RMSE | Rsquared |
---|---|---|
PLS Model | 0.6557724 | 0.6132977 |
RMSE is the root mean squared error; it penalizes large errors quadratically and rests on the assumption that the errors are roughly normally distributed. The test-set RMSE is 0.66 and R2 is 0.61, essentially matching the resampled training estimates (RMSE 0.64, R2 0.61), which suggests the model generalizes well.
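As a sanity check, both metrics can be computed by hand; caret::RMSE is sqrt(mean((pred - obs)^2)) and caret::R2 defaults to the squared correlation:
sqrt(mean((pred - testing_df$Yield)^2))  # matches caret::RMSE
cor(pred, testing_df$Yield)^2            # matches caret::R2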
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
pls_importance <- varImp(pmodel)$importance %>%
as.data.frame() %>%
rownames_to_column("Variable") %>%
filter(Overall >= 50) %>% # set a threshold for variable importance
arrange(desc(Overall)) %>%
mutate(importance = row_number())
varImp(pmodel) %>%
plot(., top = max(pls_importance$importance), main = "PLS Model Feature Importance")
ManufacturingProcess32 looks most important, and the manufacturing process predictors dominate the top of the list. Setting an importance threshold (here, 50%) keeps only the most influential variables for closer inspection, as the tally below shows.
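A quick tally of the thresholded importance list confirms which group dominates:
# TRUE = manufacturing process predictor, FALSE = biological predictor
table(grepl("^ManufacturingProcess", pls_importance$Variable))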
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future rounds of the manufacturing process?
important_vars <- df_no_missing %>%
select_at(vars(Yield, pls_importance$Variable))
important_vars_p <- cor.mtest(important_vars)$p
important_vars %>%
cor() %>%
corrplot(method = "color", type = "lower", order = "hclust",
tl.cex = 0.8, tl.col = "black", tl.srt = 45,
addCoef.col = "black", number.cex = 0.7,
p.mat = important_vars_p, sig.level = 0.05, diag = FALSE)
The correlation heat map shows that most of the top predictors are positively correlated with the Yield response, with ManufacturingProcess32 the most strongly correlated. Some explanatory variables are negatively correlated with one another; for example, ManufacturingProcess32 is negatively correlated with ManufacturingProcess13. Because the process predictors (unlike the biological ones) can be adjusted, this suggests raising settings that are positively correlated with yield and lowering those that are negatively correlated in future rounds of the process.
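Plotting yield against the top predictor makes the relationship concrete (a sketch using the imputed, centered-and-scaled data from above):
ggplot(df_no_missing, aes(x = ManufacturingProcess32, y = Yield)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Yield vs. ManufacturingProcess32")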