In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts. Please submit a link to your Rpubs and submit the .rmd file as well.

6.2

6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug: (a) Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(permeability) The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.4.3
data(permeability)

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.3
## Loading required package: lattice
library(ggplot2)

# Columns with near-zero variance
nzv_idx <- nearZeroVar(fingerprints)
length(nzv_idx)
## [1] 719
nzv_idx[1:10]
##  [1]  7  8  9 10 13 14 17 18 19 22
fingerprints_filtered <- fingerprints[, -nzv_idx]
dim(fingerprints_filtered)
## [1] 165 388

After filtering out, there are 388 predictors left for modeling.

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R^2?

set.seed(123)

y <- permeability[,1]

inTrain <- createDataPartition(y, p = 0.7, list = FALSE)
x_train <- fingerprints_filtered[inTrain, ]
y_train <- y[inTrain]
x_test  <- fingerprints_filtered[-inTrain, ]
y_test  <- y[-inTrain]
train_df <- data.frame(permeability = y_train, x_train)
ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5
)

set.seed(624)

pls_fit <- train(
  permeability ~ .,
  data = train_df,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 20,
  trControl = ctrl,
  metric = "Rsquared"
)

pls_fit
## Partial Least Squares 
## 
## 117 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 105, 105, 105, 105, 106, 106, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.52614  0.3276551  10.590267
##    2     12.31790  0.4577164   8.719467
##    3     12.77836  0.4287870   9.656110
##    4     13.11105  0.4071152   9.959855
##    5     13.03721  0.4186377   9.770525
##    6     12.71350  0.4436353   9.295525
##    7     12.54494  0.4508120   9.281392
##    8     12.58752  0.4529055   9.413440
##    9     12.92993  0.4445795   9.758149
##   10     13.31897  0.4262053  10.055161
##   11     13.59403  0.4155700  10.240230
##   12     13.95736  0.4031960  10.521149
##   13     14.33818  0.3867688  10.695687
##   14     14.71896  0.3742757  10.879621
##   15     15.00679  0.3644702  11.097477
##   16     15.36197  0.3547855  11.322127
##   17     15.62895  0.3485080  11.484254
##   18     15.75516  0.3457461  11.674387
##   19     15.90706  0.3421411  11.782810
##   20     16.37534  0.3313220  12.110636
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 2.
pls_fit$bestTune
subset(pls_fit$results, ncomp == pls_fit$bestTune$ncomp)

Using pls_fit$bestTune to determine the best number of components, which is the number of optimal number of latent variables, is 2. The corresponding resampled Rsquared is 0.4577164.

(d) Predict the response for the test set. What is the test set estimate of R^2?

test_df <- data.frame(permeability = y_test, x_test)
test_pred <- predict(pls_fit, newdata = test_df)

# test
test_metrics <- postResample(pred = test_pred, obs = y_test)
test_metrics["Rsquared"]
##  Rsquared 
## 0.3819407

The test set estimate of R^2 is 0.3819407

(e) Try building other models discussed in this chapter. Do any have better predictive performance? First attempt other another model that was discussed earlier in the chapter and what I thought of immediately was linear regression, which I didn’t bother building for. This is due to Johnson stating that too many predictors in comparison to samples as well as the predictors being correlated which therefore makes the coefficients unstable. Using PCR is sort of limited because builds components that only capture variance in the predictors, not variance that actually matters for permeability, so it needs many components. PLS would be the most reasonable because it builds new latent components that are directly chosen to explain variation in permeability, while also handling the fact that we have many highly correlated binary predictors and way more predictors than samples.

(f) Would you recommend any of your models to replace the permeability laboratory experiment?

No I wouldn’t recommend because PLS for example still leaves the variability in permeability unexplained, so its predictions aren’t accurate enough to fully trust on new drugs. Instead I could use it to rank molecules and decide which ones to test first in the lab, but not to skip the lab work.

6.3

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(chemicalManufacturing) The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

library(RANN)
## Warning: package 'RANN' was built under R version 4.4.3
data("ChemicalManufacturingProcess")
yield <- ChemicalManufacturingProcess$Yield

# everything except Yield are predictors
predictors <- ChemicalManufacturingProcess[, setdiff(names(ChemicalManufacturingProcess), "Yield"), drop = FALSE]

predictors <- as.data.frame(predictors)

colSums(is.na(predictors))[1:10]
## BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03 
##                    0                    0                    0 
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06 
##                    0                    0                    0 
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09 
##                    0                    0                    0 
## BiologicalMaterial10 
##                    0
pp_obj <- preProcess(
  predictors,
  method = c("knnImpute", "center", "scale")
)

predictors_imputed <- predict(pp_obj, predictors)
sum(is.na(predictors_imputed))
## [1] 0
chem_complete <- data.frame(
  Yield = yield,
  predictors_imputed
)
str(chem_complete)
## 'data.frame':    176 obs. of  58 variables:
##  $ Yield                 : num  38 42.4 42 41.4 42.5 ...
##  $ BiologicalMaterial01  : num  -0.226 2.239 2.239 2.239 1.483 ...
##  $ BiologicalMaterial02  : num  -1.51 1.31 1.31 1.31 1.89 ...
##  $ BiologicalMaterial03  : num  -2.683 -0.0562 -0.0562 -0.0562 1.1359 ...
##  $ BiologicalMaterial04  : num  0.22 1.296 1.296 1.296 0.941 ...
##  $ BiologicalMaterial05  : num  0.494 0.413 0.413 0.413 -0.373 ...
##  $ BiologicalMaterial06  : num  -1.38 1.13 1.13 1.13 1.53 ...
##  $ BiologicalMaterial07  : num  -0.131 -0.131 -0.131 -0.131 -0.131 ...
##  $ BiologicalMaterial08  : num  -1.23 2.28 2.28 2.28 1.07 ...
##  $ BiologicalMaterial09  : num  -3.396 -0.723 -0.723 -0.723 -0.121 ...
##  $ BiologicalMaterial10  : num  1.101 1.101 1.101 1.101 0.416 ...
##  $ BiologicalMaterial11  : num  -1.839 1.393 1.393 1.393 0.136 ...
##  $ BiologicalMaterial12  : num  -1.77 1.1 1.1 1.1 1.1 ...
##  $ ManufacturingProcess01: num  0.215 -6.15 -6.15 -6.15 -0.278 ...
##  $ ManufacturingProcess02: num  0.566 -1.969 -1.969 -1.969 -1.969 ...
##  $ ManufacturingProcess03: num  0.377 0.198 0.109 0.466 0.109 ...
##  $ ManufacturingProcess04: num  0.566 -2.367 -3.164 -3.323 -2.208 ...
##  $ ManufacturingProcess05: num  -0.4459 0.9993 0.0625 0.4228 0.8454 ...
##  $ ManufacturingProcess06: num  -0.541 0.963 -0.112 2.185 -0.63 ...
##  $ ManufacturingProcess07: num  -0.16 -0.958 1.038 -0.958 1.038 ...
##  $ ManufacturingProcess08: num  -0.31 0.894 0.894 -1.112 0.894 ...
##  $ ManufacturingProcess09: num  -1.72 0.588 -0.382 -0.479 -0.453 ...
##  $ ManufacturingProcess10: num  -0.077 0.523 0.3143 -0.0248 -0.39 ...
##  $ ManufacturingProcess11: num  -0.0916 1.082 0.5511 0.8026 0.104 ...
##  $ ManufacturingProcess12: num  -0.481 -0.481 -0.481 -0.481 -0.481 ...
##  $ ManufacturingProcess13: num  0.9771 -0.5003 0.2877 0.2877 0.0907 ...
##  $ ManufacturingProcess14: num  0.809 0.278 0.443 0.791 2.533 ...
##  $ ManufacturingProcess15: num  1.185 0.962 0.825 1.082 3.328 ...
##  $ ManufacturingProcess16: num  0.33 0.146 0.146 0.197 0.475 ...
##  $ ManufacturingProcess17: num  0.926 -0.275 0.366 0.366 -0.356 ...
##  $ ManufacturingProcess18: num  0.151 0.156 0.183 0.17 0.208 ...
##  $ ManufacturingProcess19: num  0.456 1.51 1.093 0.983 1.619 ...
##  $ ManufacturingProcess20: num  0.311 0.185 0.185 0.156 0.294 ...
##  $ ManufacturingProcess21: num  0.211 0.211 0.211 0.211 -0.688 ...
##  $ ManufacturingProcess22: num  0.0583 -0.7223 -0.4221 -0.1218 0.7789 ...
##  $ ManufacturingProcess23: num  0.832 -1.815 -1.213 -0.612 0.591 ...
##  $ ManufacturingProcess24: num  0.891 -1.006 -0.834 -0.661 1.58 ...
##  $ ManufacturingProcess25: num  0.12 0.109 0.184 0.171 0.273 ...
##  $ ManufacturingProcess26: num  0.126 0.197 0.216 0.205 0.291 ...
##  $ ManufacturingProcess27: num  0.346 0.191 0.21 0.191 0.343 ...
##  $ ManufacturingProcess28: num  0.783 0.878 0.859 0.859 0.897 ...
##  $ ManufacturingProcess29: num  0.594 0.835 0.775 0.775 0.955 ...
##  $ ManufacturingProcess30: num  0.757 0.757 0.244 0.244 -0.165 ...
##  $ ManufacturingProcess31: num  -0.195 -0.267 -0.159 -0.159 -0.141 ...
##  $ ManufacturingProcess32: num  -0.457 1.952 2.693 2.322 2.322 ...
##  $ ManufacturingProcess33: num  0.989 0.989 0.989 1.794 2.6 ...
##  $ ManufacturingProcess34: num  -1.72 1.957 1.957 0.118 0.118 ...
##  $ ManufacturingProcess35: num  -0.8869 1.1464 1.2388 0.0373 -2.5506 ...
##  $ ManufacturingProcess36: num  -0.656 -0.656 -1.8 -1.8 -2.944 ...
##  $ ManufacturingProcess37: num  -1.154 2.216 -0.705 0.419 -1.828 ...
##  $ ManufacturingProcess38: num  0.717 -0.822 -0.822 -0.822 -0.822 ...
##  $ ManufacturingProcess39: num  0.232 0.232 0.232 0.232 0.298 ...
##  $ ManufacturingProcess40: num  0.0597 2.1491 -0.4627 -0.4627 -0.4627 ...
##  $ ManufacturingProcess41: num  -0.069 2.346 -0.441 -0.441 -0.441 ...
##  $ ManufacturingProcess42: num  0.2028 -0.0547 0.4088 -0.3122 -0.1062 ...
##  $ ManufacturingProcess43: num  2.4056 -0.0137 0.1015 0.2167 0.2167 ...
##  $ ManufacturingProcess44: num  -0.0159 0.2947 -0.0159 -0.0159 -0.3264 ...
##  $ ManufacturingProcess45: num  0.6437 0.1522 0.398 -0.0936 -0.0936 ...

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

y <- ChemicalManufacturingProcess$Yield
X <- ChemicalManufacturingProcess[, colnames(ChemicalManufacturingProcess) != "Yield"]

set.seed(123)
train_idx <- createDataPartition(y, p = 0.7, list = FALSE)

X_train <- X[train_idx, ]
y_train <- y[train_idx]

X_test  <- X[-train_idx, ]
y_test  <- y[-train_idx]

pp_obj <- preProcess(
  X_train,
  method = c("knnImpute", "center", "scale")
)

X_train_pp <- predict(pp_obj, X_train)
X_test_pp  <- predict(pp_obj, X_test)

# My model of choice is PLS

ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5
)

pls_fit <- train(
  x = X_train_pp,
  y = y_train,
  method = "pls",
  tuneLength = 20,              # try 1..20 components
  trControl = ctrl
)

pls_fit
## Partial Least Squares 
## 
## 124 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 112, 112, 111, 112, 112, 112, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     1.421228  0.4648680  1.1584972
##    2     1.205138  0.6054354  1.0072349
##    3     1.141289  0.6532558  0.9429623
##    4     1.151347  0.6548316  0.9489145
##    5     1.170626  0.6489001  0.9633433
##    6     1.185512  0.6427541  0.9654653
##    7     1.192273  0.6380087  0.9708489
##    8     1.205133  0.6329415  0.9849564
##    9     1.267243  0.6137698  1.0375404
##   10     1.325513  0.5992689  1.0768414
##   11     1.388209  0.5800552  1.1247833
##   12     1.452071  0.5655375  1.1614984
##   13     1.489109  0.5570718  1.1759213
##   14     1.501329  0.5605424  1.1761263
##   15     1.520286  0.5598108  1.1834329
##   16     1.540573  0.5582740  1.1879209
##   17     1.557011  0.5595346  1.1897146
##   18     1.580617  0.5580881  1.1961664
##   19     1.608045  0.5582227  1.2007752
##   20     1.664392  0.5513168  1.2157566
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.

So as you can see, the model picked 3 components as the final value is ncomp = 3 because that gave the lowest RMSE, about 1.14. So therefore the optimal performance metric is RMSE = 1.141289 with 3 components.

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

set.seed(123)
idx <- createDataPartition(chem_complete$Yield, p = 0.7, list = FALSE)
train_data <- chem_complete[idx, ]
test_data  <- chem_complete[-idx, ]

pls_fit <- train(
  Yield ~ ., 
  data = train_data,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 20,
  trControl = trainControl(
    method = "repeatedcv",
    number = 10,
    repeats = 5
  ),
  metric = "RMSE"
)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
pls_fit
## Partial Least Squares 
## 
## 124 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 112, 112, 111, 112, 112, 112, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.631177  0.4324276  1.245749
##    2     1.886294  0.5129642  1.245510
##    3     1.451409  0.5944219  1.056524
##    4     1.732062  0.5733243  1.159056
##    5     2.038058  0.5505969  1.252117
##    6     2.173988  0.5503784  1.284806
##    7     2.328955  0.5371246  1.341086
##    8     2.497208  0.5239341  1.400328
##    9     2.865093  0.5021550  1.543590
##   10     3.196419  0.4885783  1.659752
##   11     3.711028  0.4651931  1.843961
##   12     4.207027  0.4506260  2.002258
##   13     4.503629  0.4464268  2.090289
##   14     4.624495  0.4493170  2.119935
##   15     4.622055  0.4514259  2.118095
##   16     4.622221  0.4502016  2.120155
##   17     4.526185  0.4523319  2.088807
##   18     4.468692  0.4548248  2.066794
##   19     4.417392  0.4579625  2.044434
##   20     4.406922  0.4639619  2.030712
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
test_x <- test_data[, !(names(test_data) %in% "Yield")]
test_y <- test_data$Yield
test_pred <- predict(pls_fit, newdata = test_x)
test_results <- postResample(pred = test_pred, obs = test_y)
test_results
##      RMSE  Rsquared       MAE 
## 1.2737308 0.5438189 1.0363968

As you can see, the model with 3 components got a test RMSE of 1.2737308 and an Rsquared of about 0.5438189. Those numbers are very close to the cross-validated training results, so the model performs pretty much the same on new data as it did during training.

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

So when checking for variable importance from the PLS model, most of the top predictors are the ManufacturingProcess variables, and NOT the BiologicalMaterial ones. This means the steps in the process matter more for predicting yield than the starting material.

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Exploring the relationships on how yield changes with each top process variable, I can see which settings are linked to higher yield and which settings lead to low yield. I can use that to adjust those steps in future runs so that in a real world scenario I can keep the process in the “good” range and get more product.