Exercise 6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  1. Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

  2. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the ‘nearZeroVar’ function from the caret package. How many predictors are left for modeling?
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
columns_to_remove <- nearZeroVar(fingerprints)
reduced_fp <- fingerprints[,-columns_to_remove]
reduced_fp <- as.data.frame(reduced_fp)
dim(reduced_fp)
## [1] 165 388
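The filter keeps 388 of the 1,107 predictors. To see roughly what it checks, here is a minimal base-R sketch on a toy matrix. The data and the helper names `freq_ratio` and `pct_unique` are hypothetical, and the cutoffs are the documented nearZeroVar defaults (freqCut = 95/5, uniqueCut = 10); this is an illustration of the idea, not caret's implementation.

```r
# Toy binary matrix (hypothetical data, not the permeability fingerprints)
toy <- cbind(
  common = rep(c(0, 1), 50),      # substructure present in half of the rows
  rare   = c(rep(0, 98), 1, 1),   # present in only 2 of 100 rows
  const  = rep(0, 100)            # never present
)

# Count of the most common value divided by the count of the second most common
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)  # a constant column has no second value
  as.numeric(counts[1] / counts[2])
}

# Number of distinct values as a percentage of the number of rows
pct_unique <- function(x) 100 * length(unique(x)) / length(x)

ratios  <- apply(toy, 2, freq_ratio)
uniques <- apply(toy, 2, pct_unique)

# Flag a column when one value dominates and there are few distinct values
flagged <- which(ratios > 95 / 5 & uniques <= 10)
names(flagged)
```

On this toy input the `rare` and `const` columns are flagged, while the balanced `common` column is kept.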
  3. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R-squared?
# Set the random number seed so we can reproduce the results
set.seed(12345)
# By default, the numbers are returned as a list. Using
# list = FALSE, a matrix of row numbers is generated.
# These samples are allocated to the training set.
trainingRows <- createDataPartition(permeability, p = .70, list = FALSE)
head(trainingRows)
##      Resample1
## [1,]         2
## [2,]         3
## [3,]         5
## [4,]         6
## [5,]         7
## [6,]         8
# Subset the data into objects for training using
# integer sub-setting.
train <- reduced_fp[trainingRows,]
perm_train <- permeability[trainingRows]
head(train)
##   X1 X2 X3 X4 X5 X6 X11 X12 X15 X16 X20 X21 X25 X26 X27 X28 X29 X35 X36 X37 X38
## 2  0  0  0  0  0  0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0
## 3  0  0  0  0  0  1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 5  0  0  0  0  0  0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0
## 6  0  0  0  0  0  0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0
## 7  0  0  0  0  0  1   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0
## 8  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   X39 X40 X41 X42 X43 X44 X46 X47 X48 X49 X50 X51 X52 X53 X54 X55 X56 X57 X58
## 2   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 3   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 5   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 6   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 7   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 8   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   X59 X60 X61 X62 X63 X64 X65 X66 X67 X68 X69 X70 X71 X72 X73 X74 X75 X76 X78
## 2   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 3   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 5   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 6   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 7   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 8   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   X79 X80 X86 X87 X88 X93 X94 X96 X97 X98 X99 X101 X102 X103 X108 X111 X118
## 2   0   0   1   0   0   0   0   1   0   0   0    1    0    0    0    0    0
## 3   0   0   1   0   0   0   0   1   0   0   0    1    0    1    1    1    0
## 5   0   0   1   0   0   0   0   1   0   0   0    1    0    0    0    0    0
## 6   0   0   1   0   0   0   0   1   0   0   0    1    0    0    0    0    0
## 7   0   0   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0
## 8   0   0   0   0   0   1   0   0   0   0   0    0    0    1    1    1    0
##   X121 X125 X126 X127 X129 X130 X133 X138 X141 X142 X143 X146 X150 X152 X153
## 2    0    0    0    0    0    0    0    0    1    0    1    0    1    1    1
## 3    0    0    0    0    0    0    0    0    0    0    1    0    1    1    1
## 5    0    0    0    0    0    0    0    0    0    0    1    0    1    1    1
## 6    0    0    0    0    0    0    0    0    0    0    1    0    1    1    1
## 7    0    0    0    0    0    0    1    1    1    0    1    0    1    1    1
## 8    0    0    0    0    0    0    0    0    0    0    1    0    1    1    1
##   X154 X156 X157 X158 X159 X162 X163 X167 X168 X169 X170 X171 X172 X173 X174
## 2    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 3    1    0    1    1    1    1    1    1    1    1    1    1    1    1    1
## 5    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 6    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 7    1    1    0    0    0    1    1    1    1    1    1    1    1    1    1
## 8    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
##   X175 X176 X177 X178 X179 X180 X181 X182 X183 X184 X185 X186 X187 X188 X189
## 2    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 3    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 5    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 6    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 7    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 8    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
##   X190 X191 X192 X193 X194 X195 X196 X197 X198 X199 X200 X201 X202 X203 X204
## 2    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 3    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 5    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 6    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 7    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 8    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
##   X205 X206 X207 X208 X209 X210 X211 X212 X213 X214 X215 X221 X223 X224 X225
## 2    1    1    1    1    1    1    1    1    1    1    1    1    1    1    0
## 3    1    1    1    1    1    1    1    1    1    1    1    0    0    0    0
## 5    1    1    1    1    1    1    1    1    1    1    1    1    1    1    0
## 6    1    1    1    1    1    1    1    1    1    1    1    1    1    1    0
## 7    1    1    1    1    1    1    1    1    1    1    1    1    1    1    0
## 8    1    1    1    1    1    1    1    1    1    1    1    1    1    1    0
##   X226 X227 X228 X229 X230 X231 X232 X233 X234 X235 X236 X237 X238 X239 X240
## 2    1    1    1    1    0    1    1    1    1    1    1    1    0    1    1
## 3    0    0    0    0    1    0    0    0    0    0    0    0    1    1    1
## 5    1    1    1    1    0    1    1    1    1    1    1    1    0    1    1
## 6    1    1    1    1    0    1    1    1    1    1    1    1    1    1    1
## 7    1    1    1    0    0    1    1    1    1    0    0    0    0    0    0
## 8    1    1    1    1    0    1    1    1    1    1    1    0    0    1    1
##   X241 X242 X244 X245 X246 X247 X248 X249 X250 X251 X253 X254 X255 X256 X257
## 2    1    0    1    1    1    1    1    1    0    0    1    1    1    1    1
## 3    1    0    1    1    1    1    1    1    0    0    1    1    1    1    1
## 5    1    0    1    1    1    1    1    1    0    0    1    1    1    1    1
## 6    1    0    1    1    1    1    1    1    0    0    1    1    1    1    1
## 7    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 8    0    0    1    1    1    1    1    0    0    0    1    1    1    1    0
##   X258 X260 X261 X262 X263 X264 X265 X266 X267 X268 X269 X270 X271 X272 X274
## 2    0    1    1    1    1    0    1    1    0    1    1    1    0    0    1
## 3    0    1    1    1    1    0    1    1    0    1    1    1    0    0    1
## 5    0    1    1    1    1    0    1    1    0    1    1    1    0    0    1
## 6    0    1    1    1    1    0    1    1    0    1    1    1    0    0    1
## 7    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1
## 8    0    1    0    1    0    0    1    1    0    1    1    0    0    1    1
##   X276 X278 X279 X280 X281 X284 X285 X286 X290 X291 X293 X294 X295 X296 X297
## 2    0    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 3    0    0    0    0    0    0    0    0    0    0    0    0    0    1    1
## 5    0    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 6    0    0    0    1    1    0    0    0    0    0    1    1    1    1    1
## 7    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 8    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   X298 X299 X300 X301 X302 X303 X304 X305 X306 X307 X308 X309 X310 X311 X312
## 2    1    1    1    1    1    1    1    1    1    1    1    1    1    0    0
## 3    0    0    0    1    1    1    0    0    0    0    0    1    1    1    1
## 5    1    1    1    1    1    1    1    1    1    1    1    1    1    0    0
## 6    1    1    1    1    1    1    1    1    1    1    1    1    1    0    1
## 7    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 8    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   X313 X314 X315 X316 X317 X318 X319 X320 X321 X322 X323 X324 X325 X326 X327
## 2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 3    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 5    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 6    1    0    0    0    0    0    0    0    0    0    0    0    1    1    1
## 7    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 8    0    0    1    0    0    0    0    0    0    0    0    0    0    0    0
##   X328 X329 X330 X331 X332 X333 X334 X335 X336 X337 X338 X339 X340 X341 X342
## 2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 3    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 5    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 6    1    1    1    1    1    1    0    1    0    0    0    0    0    0    0
## 7    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 8    0    0    0    0    0    0    0    0    0    0    1    1    1    1    1
##   X343 X344 X345 X355 X356 X357 X358 X359 X360 X361 X362 X366 X367 X368 X370
## 2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 3    1    1    0    0    0    0    0    0    0    0    0    0    0    0    0
## 5    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 6    0    0    1    0    0    0    0    0    0    0    0    0    0    0    0
## 7    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1
## 8    1    1    0    0    0    0    1    0    0    0    0    0    0    0    0
##   X371 X372 X373 X374 X376 X377 X378 X380 X381 X382 X383 X385 X386 X387 X388
## 2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 3    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 5    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 6    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 7    1    1    1    1    0    0    0    0    0    0    0    0    0    0    0
## 8    0    0    0    0    1    1    1    1    1    1    1    1    1    1    1
##   X389 X390 X392 X394 X395 X396 X398 X400 X401 X403 X406 X496 X497 X499 X503
## 2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 3    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 5    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 6    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 7    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 8    1    1    1    1    1    1    1    1    1    1    1    0    0    0    0
##   X504 X505 X506 X507 X508 X509 X510 X511 X512 X514 X515 X516 X517 X518 X519
## 2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 3    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 5    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 6    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 7    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 8    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   X520 X521 X522 X524 X529 X549 X551 X553 X554 X556 X557 X558 X559 X560 X561
## 2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 3    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 5    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 6    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 7    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 8    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   X565 X568 X571 X573 X574 X576 X577 X590 X591 X592 X593 X594 X595 X597 X598
## 2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 3    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 5    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 6    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 7    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 8    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   X599 X600 X601 X602 X603 X604 X613 X621 X679 X698 X699 X700 X701 X702 X703
## 2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 3    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 5    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 6    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 7    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 8    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   X704 X705 X719 X732 X733 X750 X751 X752 X753 X754 X755 X773 X774 X775 X776
## 2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 3    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 5    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 6    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 7    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 8    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   X780 X782 X792 X793 X795 X798 X800 X801 X805 X806 X812 X813
## 2    0    0    0    0    0    0    0    0    0    0    0    0
## 3    0    0    0    0    0    0    0    0    0    0    0    0
## 5    0    0    0    0    0    0    0    0    0    0    0    0
## 6    0    0    0    0    0    0    0    0    0    0    0    0
## 7    0    0    0    0    0    0    0    0    0    0    0    0
## 8    0    0    0    0    0    0    0    0    0    0    0    0
# Do the same for the test set using negative integers.
test <- reduced_fp[-trainingRows, ]
perm_test <- permeability[-trainingRows]
head(test)
##    X1 X2 X3 X4 X5 X6 X11 X12 X15 X16 X20 X21 X25 X26 X27 X28 X29 X35 X36 X37
## 1   0  0  0  0  0  1   0   1   0   0   0   0   0   0   0   0   0   0   0   0
## 4   0  0  0  0  0  0   0   1   0   0   0   0   0   0   0   0   0   0   0   0
## 12  1  1  1  1  1  0   0   1   0   0   1   1   1   1   1   1   1   0   0   1
## 13  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 18  0  0  0  0  0  0   0   1   0   0   0   0   0   0   0   0   0   0   0   0
## 21  1  1  1  1  1  1   0   1   0   0   1   1   1   1   1   1   1   0   0   1
##    X38 X39 X40 X41 X42 X43 X44 X46 X47 X48 X49 X50 X51 X52 X53 X54 X55 X56 X57
## 1    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 4    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 12   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
## 13   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 18   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 21   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
##    X58 X59 X60 X61 X62 X63 X64 X65 X66 X67 X68 X69 X70 X71 X72 X73 X74 X75 X76
## 1    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 4    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 12   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
## 13   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 18   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 21   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
##    X78 X79 X80 X86 X87 X88 X93 X94 X96 X97 X98 X99 X101 X102 X103 X108 X111
## 1    0   0   0   1   1   1   0   0   1   1   1   0    1    1    0    0    0
## 4    0   0   0   1   0   0   0   0   1   0   0   0    1    0    0    0    0
## 12   1   1   1   1   1   1   0   0   1   1   1   1    1    1    0    0    0
## 13   0   0   0   0   0   0   0   0   0   0   0   0    0    0    0    0    0
## 18   0   0   0   1   1   1   0   0   1   1   1   1    1    1    0    0    0
## 21   1   1   1   1   1   0   0   0   1   1   0   0    1    1    0    0    1
##    X118 X121 X125 X126 X127 X129 X130 X133 X138 X141 X142 X143 X146 X150 X152
## 1     0    0    0    0    0    0    0    0    0    0    0    1    0    1    1
## 4     0    0    0    0    0    0    0    0    0    0    0    1    0    1    1
## 12    0    0    0    0    0    0    0    0    0    1    1    1    0    0    0
## 13    0    0    0    0    0    0    0    0    0    0    0    1    0    1    1
## 18    0    0    0    0    0    0    0    0    0    0    0    1    0    1    1
## 21    0    1    0    0    0    0    0    0    0    0    1    1    0    0    0
##    X153 X154 X156 X157 X158 X159 X162 X163 X167 X168 X169 X170 X171 X172 X173
## 1     1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 4     1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 12    0    0    1    1    1    1    0    0    0    0    0    0    1    0    0
## 13    1    1    1    1    1    0    1    1    1    1    1    1    1    1    1
## 18    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 21    0    0    1    1    1    1    0    0    0    0    0    0    1    0    0
##    X174 X175 X176 X177 X178 X179 X180 X181 X182 X183 X184 X185 X186 X187 X188
## 1     1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 4     1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 12    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 13    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 18    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 21    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##    X189 X190 X191 X192 X193 X194 X195 X196 X197 X198 X199 X200 X201 X202 X203
## 1     1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 4     1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 12    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 13    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 18    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 21    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##    X204 X205 X206 X207 X208 X209 X210 X211 X212 X213 X214 X215 X221 X223 X224
## 1     1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 4     1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 12    0    0    0    0    0    0    0    0    0    0    0    1    1    1    1
## 13    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 18    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 21    0    0    0    0    0    0    0    0    0    0    0    1    1    1    1
##    X225 X226 X227 X228 X229 X230 X231 X232 X233 X234 X235 X236 X237 X238 X239
## 1     1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 4     0    1    1    1    1    0    1    1    1    1    1    1    1    1    1
## 12    0    1    1    1    1    0    1    1    1    1    1    1    1    0    1
## 13    0    1    1    1    0    0    1    1    1    1    1    0    0    0    1
## 18    0    1    1    1    1    0    1    1    1    1    1    1    1    0    1
## 21    0    1    1    1    1    0    1    1    1    1    1    1    1    0    1
##    X240 X241 X242 X244 X245 X246 X247 X248 X249 X250 X251 X253 X254 X255 X256
## 1     1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 4     1    1    0    1    1    1    1    1    1    0    0    1    1    1    1
## 12    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 13    1    0    0    1    1    1    1    0    0    0    0    1    1    1    0
## 18    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 21    1    1    1    1    1    1    1    1    1    1    0    1    1    1    1
##    X257 X258 X260 X261 X262 X263 X264 X265 X266 X267 X268 X269 X270 X271 X272
## 1     1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 4     1    0    1    1    1    1    0    1    1    0    1    1    1    0    0
## 12    1    1    1    1    1    1    1    1    1    1    1    1    1    1    0
## 13    0    0    1    0    0    0    0    1    0    0    0    0    0    0    0
## 18    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 21    1    0    1    1    1    1    1    1    1    1    1    1    1    1    1
##    X274 X276 X278 X279 X280 X281 X284 X285 X286 X290 X291 X293 X294 X295 X296
## 1     1    1    0    0    0    0    0    0    0    0    0    0    0    0    0
## 4     1    0    0    0    1    1    0    0    0    0    0    1    1    1    1
## 12    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1
## 13    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 18    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 21    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##    X297 X298 X299 X300 X301 X302 X303 X304 X305 X306 X307 X308 X309 X310 X311
## 1     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 4     1    1    1    1    1    1    1    1    1    1    1    1    1    1    0
## 12    1    0    0    0    1    1    1    0    0    0    0    0    1    1    0
## 13    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 18    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 21    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##    X312 X313 X314 X315 X316 X317 X318 X319 X320 X321 X322 X323 X324 X325 X326
## 1     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 4     1    1    0    0    0    0    0    0    0    0    0    0    0    1    1
## 12    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 13    0    0    0    1    0    0    0    0    0    0    0    0    0    0    0
## 18    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 21    0    0    1    1    0    0    0    0    0    0    0    0    0    0    0
##    X327 X328 X329 X330 X331 X332 X333 X334 X335 X336 X337 X338 X339 X340 X341
## 1     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 4     1    1    1    1    1    1    1    0    1    0    0    0    0    0    0
## 12    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 13    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 18    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 21    0    0    0    0    0    0    0    0    0    0    0    1    1    0    1
##    X342 X343 X344 X345 X355 X356 X357 X358 X359 X360 X361 X362 X366 X367 X368
## 1     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 4     0    0    0    1    0    0    0    0    0    0    0    0    0    0    0
## 12    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 13    0    0    0    0    0    0    0    0    1    1    0    1    0    0    0
## 18    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 21    1    1    0    0    1    1    1    1    1    1    1    1    1    1    1
##    X370 X371 X372 X373 X374 X376 X377 X378 X380 X381 X382 X383 X385 X386 X387
## 1     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 4     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 12    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 13    1    1    0    0    0    0    0    0    0    0    0    0    0    0    0
## 18    0    0    0    0    0    1    1    1    1    1    1    1    1    1    1
## 21    1    1    0    0    0    0    0    0    0    0    0    0    0    0    0
##    X388 X389 X390 X392 X394 X395 X396 X398 X400 X401 X403 X406 X496 X497 X499
## 1     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 4     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 12    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 13    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 18    1    1    1    1    1    1    1    1    1    1    1    1    0    0    0
## 21    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##    X503 X504 X505 X506 X507 X508 X509 X510 X511 X512 X514 X515 X516 X517 X518
## 1     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 4     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 12    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 13    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 18    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 21    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##    X519 X520 X521 X522 X524 X529 X549 X551 X553 X554 X556 X557 X558 X559 X560
## 1     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 4     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 12    0    1    1    0    0    0    0    0    0    0    0    0    0    0    0
## 13    0    0    0    0    0    0    1    1    1    1    1    1    1    1    1
## 18    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 21    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##    X561 X565 X568 X571 X573 X574 X576 X577 X590 X591 X592 X593 X594 X595 X597
## 1     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 4     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 12    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 13    1    1    0    0    0    0    0    0    0    0    0    0    0    0    0
## 18    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 21    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##    X598 X599 X600 X601 X602 X603 X604 X613 X621 X679 X698 X699 X700 X701 X702
## 1     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 4     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 12    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 13    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 18    0    0    0    0    0    0    0    1    1    0    0    0    0    0    0
## 21    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##    X703 X704 X705 X719 X732 X733 X750 X751 X752 X753 X754 X755 X773 X774 X775
## 1     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 4     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 12    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 13    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 18    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
## 21    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##    X776 X780 X782 X792 X793 X795 X798 X800 X801 X805 X806 X812 X813
## 1     0    0    0    0    0    0    0    0    0    0    0    0    0
## 4     0    0    0    0    0    0    0    0    0    0    0    0    0
## 12    0    0    0    0    0    0    0    0    0    0    0    0    0
## 13    0    0    0    0    0    0    0    0    0    0    0    0    0
## 18    0    0    0    0    0    0    0    0    0    0    0    0    0
## 21    0    0    0    0    0    0    0    0    0    0    0    0    0
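For a numeric outcome, createDataPartition samples within quantile-based groups of the response so that the training and test sets cover a similar range of values. The base-R sketch below mimics that idea under stated assumptions: the response is simulated, five groups are used (matching the documented default of the `groups` argument), and the variable names are hypothetical. It is not caret's code.

```r
set.seed(12345)
y <- rexp(165, rate = 1 / 10)   # hypothetical skewed response, not the real data

# Bin the response into five quantile-based groups
breaks <- quantile(y, probs = seq(0, 1, length.out = 6))
bins   <- cut(y, breaks = breaks, include.lowest = TRUE)

# Sample roughly 70% of the row numbers inside every group
train_idx <- unlist(lapply(split(seq_along(y), bins), function(rows) {
  rows[sample.int(length(rows), size = ceiling(0.7 * length(rows)))]
}))
train_idx <- sort(unname(train_idx))

# Everything that was not sampled goes to the test set
test_idx <- setdiff(seq_along(y), train_idx)

c(train = length(train_idx), test = length(test_idx))
```

Because the sampling happens group by group, both sets contain low, middle, and high responses, which matters for a skewed outcome like permeability.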

I center and scale the predictors before fitting. Scaling is arguably optional here because every predictor is binary and so already shares the same 0/1 range.

trans <- preProcess(train, method = c("center", "scale"))
trans
## Created from 117 samples and 388 variables
## 
## Pre-processing:
##   - centered (388)
##   - ignored (0)
##   - scaled (388)
train_pp <- predict(trans, train)
test_pp <- predict(trans, test)
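The important detail in this step is that the test set is transformed with the means and standard deviations estimated on the training set. A minimal base-R sketch of the same arithmetic, using small hypothetical matrices, is shown below. It also shows that the standard deviation of a 0/1 column depends on how often the 1 occurs, so scaling does change the relative weight of binary predictors.

```r
# Hypothetical binary predictors
train_x <- cbind(a = c(0, 0, 1, 1, 1, 0), b = c(1, 0, 0, 0, 0, 0))
test_x  <- cbind(a = c(1, 0), b = c(1, 1))

# Statistics come from the training set only
mu    <- colMeans(train_x)
sigma <- apply(train_x, 2, sd)
sigma                            # differs between the balanced and the sparse column

# Apply the same shift and scale to both sets
train_z <- sweep(sweep(train_x, 2, mu, "-"), 2, sigma, "/")
test_z  <- sweep(sweep(test_x,  2, mu, "-"), 2, sigma, "/")

colMeans(train_z)                # zero (up to rounding) by construction
apply(train_z, 2, sd)            # one by construction
colMeans(test_z)                 # generally not zero, and that is expected
```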

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
plsFit <- plsr(perm_train ~ ., data = train_pp)

train_pred <- predict(plsFit, train_pp, ncomp = 2)

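With two components, the PLS model has an R-squared of 0.479 on the training set. This is computed on the same data used to fit the model, so it is not a resampled estimate.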

lmValues1 <- data.frame(obs = perm_train, pred = train_pred)
names(lmValues1) <- c('obs', 'pred')
defaultSummary(lmValues1)
##       RMSE   Rsquared        MAE 
## 11.2383218  0.4794739  7.6065846
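For reference, the three numbers reported by defaultSummary can be reproduced by hand. The sketch below uses hypothetical observed and predicted values. Note that the R-squared here is the squared correlation between observed and predicted values, which is how caret's summary defines it, and it can differ from 1 - SSE/SST.

```r
obs  <- c(1.5, 3.2, 10.1, 25.4, 40.0)   # hypothetical observed values
pred <- c(2.0, 5.0,  8.5, 20.0, 35.5)   # hypothetical predictions

rmse <- sqrt(mean((obs - pred)^2))      # root mean squared error
rsq  <- cor(obs, pred)^2                # squared Pearson correlation
mae  <- mean(abs(obs - pred))           # mean absolute error

c(RMSE = rmse, Rsquared = rsq, MAE = mae)
```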

Plotting the resampled RMSE against the number of components shows that two components give the lowest RMSE.

plsTune <- train(train, perm_train,
                 method = "pls",
                 ## The default tuning grid evaluates
                 ## components 1... tuneLength
                 tuneLength = 20,
                 #trControl = ctrl,
                 preProc = c("center", "scale"))
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X732, X733
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X613, X621
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: X345
plot(plsTune$results$ncomp, plsTune$results$RMSE)
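Reading the optimum off a scatter plot can be error-prone, so it may help to pull the best row out of the results table directly. The sketch below uses a hypothetical table with made-up numbers that only mimics the column names of the `results` element of a train object; with the real fit, the same indexing would be applied to `plsTune$results`, and caret also stores its selection in `plsTune$bestTune`.

```r
# Hypothetical resampling summary (made-up numbers, not the real tuning results)
res <- data.frame(
  ncomp    = 1:5,
  RMSE     = c(13.2, 12.0, 12.4, 12.9, 13.5),
  Rsquared = c(0.30, 0.42, 0.40, 0.37, 0.34)
)

# Row with the smallest resampled RMSE
best <- res[which.min(res$RMSE), ]
best$ncomp      # selected number of components
best$Rsquared   # resampled R-squared for that choice

# Connecting the points makes the minimum easier to spot
plot(res$ncomp, res$RMSE, type = "b",
     xlab = "Number of components", ylab = "Resampled RMSE")
```

The `Rsquared` value in the selected row is the resampled estimate that the exercise asks for.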

  4. Predict the response for the test set. What is the test set estimate of R-squared?

Interestingly, the R-squared on the test set is higher than on the training set: 0.543.

test_pred <- predict(plsFit, test_pp, ncomp = 2)
lmValues1 <- data.frame(obs = perm_test, pred = test_pred)
names(lmValues1) <- c('obs', 'pred')
defaultSummary(lmValues1)
##       RMSE   Rsquared        MAE 
## 10.8022420  0.5427105  8.1074323
  1. Try building other models discussed in this chapter. Do any have better predictive performance?

I identified a lambda of about 0.15 as optimal based on minimizing the resampled RMSE. I only evaluated five candidate values because the search takes forever to run on my machine. I also had to start the lambda grid at a value greater than zero; otherwise I got an error.
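One plausible reason a penalty of zero causes trouble is that there are more predictors (388) than training samples (117), so the unpenalized least squares problem has no unique solution. The sketch below illustrates this with simulated data and the textbook ridge formula. It is an assumption on my part that this is what triggered the error, and the way the elasticnet package scales lambda may differ from this formula.

```r
set.seed(42)
n <- 20   # samples
p <- 50   # predictors, more than samples
X <- scale(matrix(rnorm(n * p), nrow = n))
y <- rnorm(n)
y <- y - mean(y)

# X'X is p x p but its rank cannot exceed the number of samples,
# so it is singular and cannot be inverted on its own
XtX <- crossprod(X)
rank_XtX <- qr(XtX)$rank

# Adding lambda to the diagonal makes the system solvable
lambda <- 0.15   # illustrative penalty value
ridge_beta <- solve(XtX + lambda * diag(p), crossprod(X, y))

print(rank_XtX)
print(length(ridge_beta))
```

With any positive lambda the matrix being inverted is full rank, which is why starting the grid slightly above zero avoids the problem.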

ridgeGrid <- data.frame(.lambda = seq(0.001, .2, length = 5))
set.seed(100)
ridgeRegFit <- train(
  train,
  perm_train,
  method = "ridge",
  tuneGrid = ridgeGrid,
  preProc = c("center", "scale")
)
ridgeRegFit
## Ridge Regression 
## 
## 117 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 117, 117, 117, 117, 117, 117, ... 
## Resampling results across tuning parameters:
## 
##   lambda   RMSE       Rsquared   MAE     
##   0.00100  145.69715  0.1589619  89.01047
##   0.05075   14.79445  0.3557792  10.88255
##   0.10050   14.47902  0.3746246  10.66020
##   0.15025   14.44841  0.3835864  10.64434
##   0.20000   14.52040  0.3891773  10.69327
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.15025.
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.3
ridgeModel <- enet(x = as.matrix(train_pp), y = perm_train, lambda = 0.15025)

The ridge regression gives us a test set R-squared of 0.459, which is worse than our PLS model.

ridgePred <- predict(ridgeModel, newx = as.matrix(test_pp), s = 1, mode = "fraction", type = "fit")

lmValues1 <- data.frame(obs = perm_test, pred = ridgePred$fit)
names(lmValues1) <- c('obs', 'pred')
defaultSummary(lmValues1)
##       RMSE   Rsquared        MAE 
## 12.8261042  0.4588614  9.5449348
  1. Would you recommend any of your models to replace the permeability laboratory experiment?

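Heavens no! On the test set my models explain only about half of the variation in permeability (R-squared of roughly 0.46 to 0.54), which is not accurate enough to replace the laboratory experiment.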

Exercise 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

  1. Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

  1. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

Let’s first just get a count of missing values. We have 106 missing values out of 10,208 cells (176 rows by 58 columns).

cmp_df <- ChemicalManufacturingProcess
sum(is.na(cmp_df))
## [1] 106

We can use the ‘colSums()’ function to see the missingness by column. The largest number of missing values in any one predictor is 15 (ManufacturingProcess03). I’m going to replace the missing values with the column means, using ‘na.aggregate()’ from the zoo package.

colSums(is.na(cmp_df))
##                  Yield   BiologicalMaterial01   BiologicalMaterial02 
##                      0                      0                      0 
##   BiologicalMaterial03   BiologicalMaterial04   BiologicalMaterial05 
##                      0                      0                      0 
##   BiologicalMaterial06   BiologicalMaterial07   BiologicalMaterial08 
##                      0                      0                      0 
##   BiologicalMaterial09   BiologicalMaterial10   BiologicalMaterial11 
##                      0                      0                      0 
##   BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02 
##                      0                      1                      3 
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 
##                     15                      1                      1 
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 
##                      2                      1                      1 
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11 
##                      0                      9                     10 
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14 
##                      1                      0                      1 
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17 
##                      0                      0                      0 
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20 
##                      0                      0                      0 
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23 
##                      0                      1                      1 
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 
##                      1                      5                      5 
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 
##                      5                      5                      5 
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32 
##                      5                      5                      0 
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 
##                      5                      5                      5 
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38 
##                      5                      0                      0 
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41 
##                      0                      1                      1 
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44 
##                      0                      0                      0 
## ManufacturingProcess45 
##                      0
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
cmp_df <- na.aggregate(cmp_df)
sum(is.na(cmp_df))
## [1] 0
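For anyone without the zoo package, here is a minimal base R sketch of the same column-mean imputation on a made-up data frame. The helper name `impute_mean` is mine. My understanding is that `na.aggregate()` with its default arguments does the equivalent of this for each column.

```r
# Made-up data frame with a few missing values
toy <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, 10, 20),
  z = c(5, 6, 7)
)

# Replace the NAs in one column with that column's mean
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

toy_imputed <- as.data.frame(lapply(toy, impute_mean))
print(toy_imputed)
```

One caveat: imputing before the train/test split, as done above, lets the test rows contribute to the column means. With so few missing values the effect is probably small, but computing the means on the training rows only would be the stricter approach.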
  1. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
columns_to_remove <- nearZeroVar(cmp_df)
cmp_df <- cmp_df[,-columns_to_remove]

# Set the random number seed so we can reproduce the results
set.seed(12345)
# By default, the numbers are returned as a list. Using
# list = FALSE, a matrix of row numbers is generated.
# These samples are allocated to the training set.
trainingRows <- createDataPartition(cmp_df$Yield, p = .70, list = FALSE)
head(trainingRows)
##      Resample1
## [1,]         1
## [2,]         4
## [3,]         5
## [4,]         6
## [5,]         7
## [6,]         8
# Subset the data into objects for training using
# integer sub-setting.
train <- cmp_df[trainingRows,]
train <- train |> select(-Yield)

trans <- preProcess(train, method = c("center", "scale"))
train <- predict(trans, train)

yield_train <- cmp_df$Yield[trainingRows]
head(train)
##   BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1           -0.2054709           -1.4935462          -2.58818370
## 4            2.2289466            1.2370302          -0.06878264
## 5            1.4820231            1.8028037           1.07465628
## 6           -0.3852858            0.6113231          -0.58896345
## 7            1.4958550            2.0761011           1.11301062
## 8            0.7489314            1.8675321           1.02911049
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1            0.2166325            0.4667598           -1.3596000
## 4            1.2294085            0.3900254            1.0780326
## 5            0.8953515           -0.3517407            1.4717841
## 6            1.5051381            1.6331230            0.5832527
## 7            0.7893016           -0.4540533            1.4173842
## 8            1.7861701            0.4207192            1.4873269
##   BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 1            -1.198115          -3.27566324            1.0730308
## 4             2.242458          -0.67867984            1.0730308
## 5             1.057050          -0.09377367            0.4130636
## 6             1.172700          -1.66132221            1.5881271
## 7             1.779860           0.25717004            0.3969669
## 8             1.967790           0.67830248            1.7008044
##   BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 1          -1.78548929           -1.7258510            -0.05364395
## 4           1.39343823            1.0706188            -7.59440574
## 5           0.15696193            1.0706188            -0.39506017
## 6           1.02820843            0.7053142             0.47962667
## 7           0.05494244            0.6927175             0.14320865
## 8           1.50157890            1.5744873             0.47962667
##   ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 1             0.03620792             0.07385208             -0.0157511
## 4            -1.87615195             0.07385208             -3.3337196
## 5            -1.87615195             0.07385208             -2.2198497
## 6            -1.87615195             0.07385208             -1.2651040
## 7            -1.87615195             1.00547533              0.1670145
## 8            -1.87615195             0.54957459             -0.4694826
##   ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 1           -0.005821203           -0.004328789            -0.03161382
## 4            0.398482632            2.453890276            -0.99164162
## 5            0.802571670           -0.713554810             1.00841629
## 6            0.467397042            0.620106279             1.00841629
## 7           -0.406562970            1.078552278            -0.99164162
## 8            0.282581126            1.787059732             1.00841629
##   ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 1            -0.04608268             -1.6448449            -0.02151027
## 4            -1.16811191             -0.4657744            -0.02151027
## 5             0.85616763             -0.4412105            -0.02151027
## 6             0.85616763             -0.2201348            -0.02151027
## 7             0.85616763              2.2608260             3.09339061
## 8             0.85616763              1.8432386             1.29209497
##   ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 1          -6.820423e-05            -0.01639611              1.0002106
## 4          -6.820423e-05            -0.49223212              0.3133571
## 5          -6.820423e-05            -0.49223212              0.1171133
## 6          -6.820423e-05            -0.49223212             -0.4716183
## 7           2.968453e+00            -0.49223212             -2.0415691
## 8           2.687670e+00            -0.49223212             -0.8641060
##   ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 1             0.82870919               1.161244              1.4386632
## 4             0.81019055               1.058903              0.6791950
## 5             2.56946146               3.293355              2.2627670
## 6             2.43983097               3.105730              3.1191885
## 7            -2.00464290              -0.697957             -1.7284808
## 8             0.01388898               1.110073              0.5337649
##   ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 1              0.8804051             0.15325123              0.4317426
## 4              0.3562682             0.16927605              0.9716022
## 5             -0.3176221             0.20132570              1.6239325
## 6             -0.6920056             0.14638344              1.9163564
## 7             -0.3924988            -0.09169966             -0.3780468
## 8             -0.5422522            -0.07338557             -0.1755995
##   ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 1             0.28345692              0.1935769            0.006746414
## 4             0.15319630              0.1935769           -0.110173715
## 5             0.26898351             -0.6725302            0.754376533
## 6             0.35823616             -0.5488006            1.042559950
## 7            -0.06149251              1.9257912           -1.262907380
## 8             0.01087451              0.1935769           -0.974723964
##   ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 1             0.05155938             0.02351133             0.12765205
## 4            -0.59175250            -0.69227220             0.17104710
## 5             0.67318659             1.73456465             0.25783719
## 6            -1.22422204            -1.43899124             0.12308415
## 7            -1.22422204            -1.43899124            -0.05049604
## 8            -0.59175250            -1.25231148            -0.02537259
##   ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 1              0.1312176             0.31997702              0.7520516
## 4              0.1990806             0.18728134              0.8299056
## 5              0.2724459             0.31756437              0.8688326
## 6              0.2302609             0.32480231              0.8882961
## 7              0.1000374             0.04734772              0.8493691
## 8              0.1037056             0.08112480              0.8299056
##   ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 1              0.5294589              0.6961438            -0.13275952
## 4              0.6861900              0.2311268            -0.10194716
## 5              0.8429210             -0.1408867            -0.08654098
## 6              0.8951647              0.8821505            -0.27141516
## 7              0.6339463              1.9051878            -0.36385225
## 8              0.6339463              1.6261776            -0.31763370
##   ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 1             -0.4607269              0.9938604             -1.8463324
## 4              2.3862897              1.7983734              0.1306222
## 5              2.3862897              2.6028864              0.1306222
## 6              2.7658919              2.6028864              0.1306222
## 7              0.1086764              0.5916039              0.1306222
## 8              0.4882787              0.5916039              0.1306222
##   ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 1            -0.87434774             -0.6314575            -1.14086639
## 4             0.07271827             -1.7877118             0.42107687
## 5            -2.57906655             -2.9439661            -1.81027065
## 6            -0.49552134             -1.7877118            -1.36400115
## 7            -1.91612035             -0.6314575            -0.47146214
## 8            -1.63200054             -0.6314575            -0.02519263
##   ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 1              0.7297985              0.2015935             0.03908889
## 4             -0.8578334              0.2015935            -0.44096769
## 5             -0.8578334              0.2797110            -0.44096769
## 6             -0.8578334              0.2015935            -0.44096769
## 7             -0.8578334              0.2797110            -0.44096769
## 8             -0.8578334              0.2797110            -0.44096769
##   ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 1              0.1157941             0.18834701              5.0434973
## 4             -0.4361989            -0.36640428              0.5305920
## 5             -0.4361989            -0.14450376              0.5305920
## 6             -0.4361989             0.13287188              3.1433267
## 7             -0.4361989             0.24382214             -0.4194933
## 8             -0.4361989             0.07739675             -0.1819720
##   ManufacturingProcess44 ManufacturingProcess45
## 1            -0.05623755              0.6542438
## 4            -0.05623755             -0.1233239
## 5            -0.38830687             -0.1233239
## 6            -0.05623755             -0.3825132
## 7             0.60790110              0.1358653
## 8             0.60790110              0.1358653
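# Do the same for the test set using negative integers.
test <- cmp_df[-trainingRows,]
test <- test |> select(-Yield)

# Note: the next line refits the centering and scaling on the test set
# itself. Reusing the training-set transformation (skipping the refit and
# calling predict() with the training `trans`) would keep the two sets on
# the same scale and avoid using test-set statistics. The output below was
# produced with the refit shown here.
trans <- preProcess(test, method = c("center", "scale"))
test <- predict(trans, test)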

yield_test <- cmp_df$Yield[-trainingRows]
head(test)
##    BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 2              2.246775             1.501094          -0.02150013
## 3              2.246775             1.501094          -0.02150013
## 11             1.043057             1.571158           0.68212094
## 14             1.702236             1.377134           0.46797540
## 19             1.530276             1.169636          -0.07712235
## 21             1.530276             1.169636          -0.07712235
##    BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 2             1.5134554           0.48444252             1.258718
## 3             1.5134554           0.48444252             1.258718
## 11            0.6489183           0.04395094             1.169873
## 14            3.4837958           1.32073813             1.169873
## 19            2.2305521           2.04212290             1.017975
## 21            2.2305521           2.04212290             1.017975
##    BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 2             2.3676003           -0.8360234            1.1712937
## 3             2.3676003           -0.8360234            1.1712937
## 11            0.5736843           -0.6553153            0.1497881
## 14            1.9036565            0.8677963            3.8709870
## 19            2.0119101            0.2740409            2.5941050
## 21            2.0119101            0.2740409            2.5941050
##    BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 2             1.3843339           1.16472434            -4.51602764
## 3             1.3843339           1.16472434            -4.51602764
## 11           -0.8460896           0.01448335            -0.29402659
## 14            2.4040174           2.15064519             0.03389583
## 19            2.3911372           2.21911192             0.68974065
## 21            2.3911372           2.21911192             0.68974065
##    ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 2               -2.295779             -0.1951161            -2.34074196
## 3               -2.295779             -0.1951161            -3.14152210
## 11              -2.295779              0.3319508             0.06159847
## 14              -2.295779              2.3523741             0.38191053
## 19              -2.295779             -0.1951161            -0.41886961
## 21              -2.295779              0.8370566            -1.05949373
##    ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 2              1.15239960             0.79698667             -0.8819589
## 3              0.08750045            -0.08415049              1.1120351
## 11            -0.67580138             0.61468243             -0.8819589
## 14             1.32367709             0.52353031              1.1120351
## 19             3.38645376             0.03738567              1.1120351
## 21             3.78113667            -0.23607069              1.1120351
##    ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 2               0.9903379              0.7092319             0.05916066
## 3               0.9903379             -0.4063173             0.05916066
## 11             -0.9903379              0.4638110            -0.20653257
## 14              0.9903379             -0.2501405             0.68385077
## 19              0.9903379             -0.8599740            -0.35492979
## 21              0.9903379             -0.2947624            -0.20653257
##    ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 2             0.000175907             -0.4530754             -0.5657891
## 3             0.000175907             -0.4530754              0.2240373
## 11            0.173972030             -0.4530754              1.2113203
## 14            0.629500816             -0.4530754              0.3227656
## 19            0.629500816             -0.4530754              0.9151354
## 21            0.933186673             -0.4530754              0.6189505
##    ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 2               0.2450061              1.0070572              0.1813262
## 3               0.4064732              0.8692493              0.1813262
## 11              1.5905652              2.0578418              0.2954556
## 14              0.4782363              1.2309949              0.2407361
## 19              1.8237955              2.4884913              0.3908240
## 21              1.6802691              2.1439717              0.4142753
##    ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 2              -0.3825686              0.6517261              1.5013342
## 3               0.4014143              0.8719324              1.1086103
## 11              1.3813929              1.3784070              2.1007550
## 14              1.3813929              0.4535404              0.9432529
## 19              1.0873993              1.0921388              2.0594156
## 21              0.7934058              0.7398086              1.7287007
##    ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 2               0.8110093              0.2576876             -0.8296707
## 3               0.8110093              0.2576876             -0.4926170
## 11              1.6379208              0.2576876              0.1814905
## 14              0.9425634              1.5271383              1.5297053
## 19              2.4836258              0.2576876             -0.8296707
## 21              2.3896586              0.2576876             -0.1555633
##    ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 2              -1.7555601             -0.9103277              0.3595165
## 3              -1.2089186             -0.7619357              1.0568542
## 11             -0.1156357             -0.4651518              1.7541919
## 14             -0.6622772             -1.0587196              1.4055230
## 19             -0.6622772             -0.9103277              0.5338509
## 21              0.4310058             -0.6135437              0.6085656
##    ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 2                1.344933              0.8649995              0.9865250
## 3                1.546427              1.0047519              0.9675194
## 11               2.553898              2.0029837              1.0435419
## 14               1.725533              1.9430898              0.9675194
## 19               1.389710              2.1826654              0.9675194
## 21               1.568815              2.1826654              0.9675194
##    ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 2                2.137901              1.1497141              -2.165173
## 3                1.977198              0.3456575              -1.521008
## 11               2.941416              0.3456575              -2.165173
## 14               2.298604              0.3456575              -1.735730
## 19               2.137901              0.9889028              -2.057812
## 21               2.298604              1.1497141              -2.165173
##    ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 2               1.8178659              1.0186290             1.74521492
## 3               2.5142124              1.0186290             1.74521492
## 11              1.2956061              1.4393084             0.09904423
## 14              0.4251731              1.0186290            -1.54712645
## 19             -0.4452600              0.1772701            -1.54712645
## 21             -0.2711734             -0.2434093             0.09904423
##    ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 2               1.0519370             -0.7430777              2.2192462
## 3               1.1431342             -1.9088236             -0.7223429
## 11             -0.3160211             -0.7430777              1.0878658
## 14             -0.5896127             -0.7430777             -0.9486190
## 19              1.5991203              1.5884139             -0.7223429
## 21             -0.1336267              0.4226681             -0.2697907
##    ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 2              -0.7418953              0.2916831              1.9119609
## 3              -0.7418953              0.2916831             -0.5129651
## 11             -0.7418953              0.1885027             -0.5129651
## 14              0.6869401              0.3948635             -0.5129651
## 19              0.6869401              0.3432733             -0.5129651
## 21              0.6869401              0.3432733              1.9119609
##    ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 2               1.5823454             0.01024402           -0.065615517
## 3              -0.4919898             0.40976090            0.002624621
## 11             -0.4919898            -0.03414674            0.002624621
## 14             -0.4919898             0.27658861           -0.270335931
## 19             -0.4919898             0.23219785            0.548545724
## 21              2.2737905             0.14341632            0.685025999
##    ManufacturingProcess44 ManufacturingProcess45
## 2              0.33382803              0.1853595
## 3              0.06259276              0.4044208
## 11             0.33382803             -0.6908855
## 14            -0.20864252             -0.4718243
## 19            -0.20864252              0.4044208
## 21            -0.47987780             -0.2527630
ridgeGrid <- data.frame(.lambda = seq(0.001, .3, length = 5))
set.seed(100)
ridgeRegFit <- train(
  train,
  yield_train,
  method = "ridge",
  tuneGrid = ridgeGrid,
  preProc = c("center", "scale")
)
ridgeRegFit
## Ridge Regression 
## 
## 124 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   lambda   RMSE      Rsquared   MAE     
##   0.00100  9.483782  0.2000670  2.800208
##   0.07575  2.593656  0.4211952  1.351093
##   0.15050  2.396367  0.4346943  1.292148
##   0.22525  2.362100  0.4334281  1.280578
##   0.30000  2.379935  0.4303572  1.289601
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.22525.
  1. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
ridgeModel <- enet(x = as.matrix(train), y = yield_train, lambda = 0.22525)

ridgePred <- predict(ridgeModel, newx = as.matrix(test), s = 1, mode = "fraction", type = "fit")

lmValues1 <- data.frame(obs = yield_test, pred = ridgePred$fit)
names(lmValues1) <- c('obs', 'pred')
defaultSummary(lmValues1)
##      RMSE  Rsquared       MAE 
## 1.4247436 0.4016476 1.2141389

The test set RMSE of 1.42 is lower than the resampled RMSE of 2.36 on the training set, while the test set R-squared of 0.40 is slightly below the resampled value of 0.43.
  1. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

The most important predictors are all manufacturing process metrics.

ridgeCoef <- predict(ridgeModel, newx = as.matrix(test), s = 1, mode = "fraction", type = "coefficients")

coef <- as.data.frame(ridgeCoef$coefficients)
names(coef) <- c("coefficients")

coef|> as.data.frame() |> filter(abs(coefficients) > 0.3)
##                        coefficients
## ManufacturingProcess09    0.3771488
## ManufacturingProcess32    0.5768940
## ManufacturingProcess36   -0.3284562
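The 0.3 cutoff above is arbitrary, so an alternative is to rank the coefficients by absolute size and keep the top few. The sketch below uses a named vector in which the first three values are rounded from the output above and the last two are invented filler, purely to show the ordering step.

```r
# First three values are rounded from the ridge output above;
# HypotheticalA and HypotheticalB are invented for the sketch
coef_sketch <- c(
  ManufacturingProcess09 = 0.377,
  ManufacturingProcess32 = 0.577,
  ManufacturingProcess36 = -0.328,
  HypotheticalA = 0.120,
  HypotheticalB = -0.050
)

# Sort by absolute magnitude, largest first, and keep the top three
ranked <- coef_sketch[order(abs(coef_sketch), decreasing = TRUE)]
top3 <- head(ranked, 3)
print(top3)
```

Comparing magnitudes like this is only meaningful because the predictors were centered and scaled before fitting.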
  1. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

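This is really interesting. When we build a linear model with just the three predictors that have the largest coefficients in the ridge regression model, the R-squared value is 0.60, which is higher than for any of the larger models we have tried. Keep in mind that this model is fit to all 176 runs and evaluated on the same data, so its R-squared is not directly comparable to the resampled and test set estimates above.

Because the data were centered and scaled before fitting, the coefficients are in standard deviation units. A one standard deviation increase in ManufacturingProcess09 or ManufacturingProcess32 is associated with roughly a half standard deviation increase in Yield. A one standard deviation increase in ManufacturingProcess36 is associated with about a tenth of a standard deviation decrease in Yield, although that coefficient is not statistically significant (p = 0.185). This suggests that relatively small changes in a small number of processing elements (temperature, drying time, washing time, and concentrations of by-products at various steps) can lead to meaningful changes in product Yield.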

trans <- preProcess(cmp_df, method = c("center", "scale"))
pp_cmp_df <- predict(trans, cmp_df)

model <- lm(Yield ~ ManufacturingProcess09 + ManufacturingProcess32 + ManufacturingProcess36, data = pp_cmp_df)

summary(model)
## 
## Call:
## lm(formula = Yield ~ ManufacturingProcess09 + ManufacturingProcess32 + 
##     ManufacturingProcess36, data = pp_cmp_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.44444 -0.49190  0.01062  0.32466  1.79513 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -6.907e-16  4.787e-02   0.000    1.000    
## ManufacturingProcess09  4.779e-01  4.806e-02   9.942  < 2e-16 ***
## ManufacturingProcess32  5.073e-01  7.776e-02   6.523 7.38e-10 ***
## ManufacturingProcess36 -1.035e-01  7.778e-02  -1.331    0.185    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6351 on 172 degrees of freedom
## Multiple R-squared:  0.6035, Adjusted R-squared:  0.5966 
## F-statistic: 87.27 on 3 and 172 DF,  p-value: < 2.2e-16
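The coefficients in the summary above are on the centered and scaled data, so they are in standard deviation units rather than percent yield. A sketch of how a standardized slope converts back to original units is shown below on simulated data (the variables `x` and `y` are made up, not the manufacturing data). The conversion multiplies the standardized slope by `sd(y) / sd(x)`.

```r
set.seed(7)
x <- rnorm(100, mean = 50, sd = 4)
y <- 3 + 0.25 * x + rnorm(100, sd = 1)

# Fit on the original scale
fit_raw <- lm(y ~ x)

# Fit on the centered and scaled versions of both variables
ys <- (y - mean(y)) / sd(y)
xs <- (x - mean(x)) / sd(x)
fit_std <- lm(ys ~ xs)

b_raw <- unname(coef(fit_raw)[2])
b_std <- unname(coef(fit_std)[2])

# Convert the standardized slope back to original units
b_back <- b_std * sd(y) / sd(x)
print(c(raw = b_raw, standardized = b_std, back_converted = b_back))
```

Applying the same conversion to the real model would need the standard deviations of Yield and of each process predictor, which would tell us how many percentage points of yield go with a given change in a process setting.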