In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts.

Problem 6.2

6.2. Developing a model to predict permeability (see Sect.1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

a

  1. Start R and use these commands to load the data:

library(AppliedPredictiveModeling) data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

data(permeability)

dim(fingerprints)
## [1]  165 1107
dim(permeability)
## [1] 165   1

b

  1. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package.
?caret::nearZeroVar
## starting httpd help server ... done
fingerprints_var0_filtered <- fingerprints[,-nearZeroVar(fingerprints)]

dim(fingerprints_var0_filtered)
## [1] 165 388

How many predictors are left for modeling?

Answer:

There are 388 predictors left.

c

  1. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
set.seed(6698)

# Gonna use the train/test split functions rec'ed by this text book

# Train-Test split
training_indices <- createDataPartition(permeability, p = 0.8, list = FALSE)
  # I'm using createDatPartition even though I'm less familiar with it
  # the book seems to rec it because it stratifies my sampling to have approx
  # matching outcome distributions in the groups 

train_x <- fingerprints_var0_filtered[training_indices, ]
train_y <- permeability[training_indices]

test_x  <- fingerprints_var0_filtered[-training_indices, ]
test_y  <- permeability[-training_indices]

# 10-fold cross validation
resamp_strat <- trainControl(method = "cv", number = 10)

# PLS model (Partial Least Squares)
pls_model <- train(x = train_x, 
                   y = train_y,
                   method = "pls",
                   preProc = c("center", "scale"),
                   tuneLength = 20, # 1 to 20 PLS components
                   trControl = resamp_strat)

# View
pls_model
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 118, 121, 121, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.83890  0.3502161   9.825724
##    2     11.42041  0.5186836   8.356922
##    3     11.38480  0.5204140   8.783389
##    4     11.67924  0.5030930   8.943898
##    5     11.65690  0.5083793   8.942920
##    6     11.80656  0.4942210   9.031041
##    7     11.77729  0.4937998   9.301009
##    8     11.87157  0.4800589   9.273800
##    9     11.66462  0.5002849   8.958239
##   10     11.98327  0.4878718   9.321489
##   11     12.28236  0.4779875   9.436134
##   12     12.66777  0.4599554   9.737932
##   13     13.19584  0.4409551  10.143841
##   14     13.60243  0.4263858  10.318696
##   15     14.14360  0.3991441  10.703807
##   16     14.44759  0.3898065  10.795502
##   17     15.33350  0.3570033  11.446246
##   18     15.60322  0.3538903  11.721214
##   19     16.16632  0.3421170  11.993775
##   20     16.76010  0.3337149  12.422636
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.

Answer:

The optimal number of latent variables is 3. The resampled estimate of R squared is 0.5204140.

d

  1. Predict the response for the test set. What is the test set estimate of R2?
# predictions on the test set
pls_predictions <- predict(pls_model, newdata = test_x)

# Evaluate test set performance
postResample(pred = pls_predictions, obs = test_y) 
##       RMSE   Rsquared        MAE 
## 12.2595322  0.3953602  9.2630208

Answer:

The test set estimate of R squared is 0.3953602.

e

  1. Try building other models discussed in this chapter. Do any have better predictive performance?
# tuning grid for Ridge regression
ridge_grid <- data.frame(.lambda = seq(0, 0.1, length = 15))

# Tune the Ridge model
set.seed(123)
ridge_model <- train(x = train_x, 
                     y = train_y,
                     method = "ridge",
                     preProc = c("center", "scale"),
                     tuneGrid = ridge_grid,
                     trControl = resamp_strat)
## Warning: model fit failed for Fold01: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
# View  
ridge_model
## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 118, 120, 121, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE          Rsquared   MAE         
##   0.000000000  2.427145e+11  0.2843079  8.726688e+10
##   0.007142857  2.297551e+02  0.2849035  1.775379e+02
##   0.014285714  1.663321e+01  0.3440606  1.235427e+01
##   0.021428571  1.563000e+01  0.3589179  1.162268e+01
##   0.028571429  1.504775e+01  0.3698597  1.117211e+01
##   0.035714286  1.470256e+01  0.3788302  1.093696e+01
##   0.042857143  1.431293e+01  0.3880294  1.067162e+01
##   0.050000000  1.417135e+01  0.3939735  1.055728e+01
##   0.057142857  1.392238e+01  0.4001279  1.041934e+01
##   0.064285714  1.378274e+01  0.4051858  1.031989e+01
##   0.071428571  1.363631e+01  0.4110406  1.023953e+01
##   0.078571429  1.356395e+01  0.4142743  1.018997e+01
##   0.085714286  1.343796e+01  0.4197286  1.012409e+01
##   0.092857143  1.337298e+01  0.4230017  1.009031e+01
##   0.100000000  1.335355e+01  0.4259993  1.007853e+01
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
# Predict on the test set  
ridge_predictions <- predict(ridge_model, newdata = test_x)
postResample(pred = ridge_predictions, obs = test_y)
##       RMSE   Rsquared        MAE 
## 11.0563060  0.5757314  7.9560411

The Ridge model performed quite well, with an R squared of 0.576 and an RMSE of 11.06.

# grid for Lasso 
lasso_grid <- expand.grid(.lambda = 0,
                          .fraction = seq(0.05, 1, length = 20))

# Tune the Lasso model
set.seed(123)
lasso_model <- train(x = train_x, 
                     y = train_y,
                     method = "enet",
                     preProc = c("center", "scale"),
                     tuneGrid = lasso_grid,
                     trControl = resamp_strat)
## Warning: model fit failed for Fold01: lambda=0, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
# View  
lasso_model
## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 118, 120, 121, 119, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE          Rsquared   MAE         
##   0.05      2.272906e+01  0.2944987  1.567209e+01
##   0.10      2.335909e+01  0.3252556  1.570868e+01
##   0.15      2.378598e+01  0.3668277  1.596329e+01
##   0.20      2.414854e+01  0.4028806  1.625992e+01
##   0.25      2.689048e+01  0.4166470  1.749800e+01
##   0.30      1.769667e+10  0.4189902  6.362758e+09
##   0.35      3.415875e+10  0.4165742  1.228162e+10
##   0.40      5.028502e+10  0.4156407  1.807975e+10
##   0.45      6.635794e+10  0.4065313  2.385869e+10
##   0.50      8.240066e+10  0.3929474  2.962678e+10
##   0.55      9.843204e+10  0.3792638  3.539079e+10
##   0.60      1.144634e+11  0.3664525  4.115480e+10
##   0.65      1.304948e+11  0.3532798  4.691881e+10
##   0.70      1.465262e+11  0.3402426  5.268282e+10
##   0.75      1.625576e+11  0.3309746  5.844683e+10
##   0.80      1.785890e+11  0.3218986  6.421084e+10
##   0.85      1.946203e+11  0.3111412  6.997485e+10
##   0.90      2.106517e+11  0.3013610  7.573886e+10
##   0.95      2.266831e+11  0.2920677  8.150287e+10
##   1.00      2.427145e+11  0.2843079  8.726688e+10
## 
## Tuning parameter 'lambda' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.05 and lambda = 0.
# Predict on the test set  
lasso_predictions <- predict(lasso_model, newdata = test_x)
postResample(pred = lasso_predictions, obs = test_y)
##       RMSE   Rsquared        MAE 
## 11.6793782  0.4291562  8.6132341

The Lasso model achieved an R squared of 0.429 and an RMSE of 11.68.*

Comparing Test Set Performance:

  • Ridge: RMSE = 11.06, R2 = 0.576
  • Lasso: RMSE = 11.68, R2 = 0.429
  • Partial Least Squares (PLS): RMSE = 12.26, R2 = 0.395

Both the Ridge and Lasso models had higher R squared values on the test set thatn PLS, explaining more of variance.

f

  1. Would you recommend any of your models to replace the permeability laboratory experiment?

Answer:

I would recommend the Ridge model. It had the highest R squared and lowers MSE of the three models on the test data.

Problem 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors) measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch

a

  1. Start R and use these commands to load the data:

library(AppliedPredictiveModeling) data(ChemicalManufacturingProcess)

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

b

  1. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect.3.8).
#library(RANN) # Required for knnImpute in caret

# the predictors (columns 2-58) and the response (column 1)
processPredictors <- ChemicalManufacturingProcess[, -1]
yield <- ChemicalManufacturingProcess$Yield

# Create the imputation model 
impute_model <- preProcess(processPredictors, method = "knnImpute")

# Apply the imputation to the predictor matrix
imputedPredictors <- predict(impute_model, processPredictors)

c

  1. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

I’m choosing PLS again.

set.seed(6698)

# Train-Test split
training_indices_3 <- createDataPartition(yield, p = 0.8, list = FALSE) 

train_3x <- imputedPredictors[training_indices_3, ]
train_3y <- yield[training_indices_3]

test_3x  <- imputedPredictors[-training_indices_3, ]
test_3y  <- yield[-training_indices_3]

# 10-fold cross validation
resamp_strat_3 <- trainControl(method = "cv", number = 10) # same sef as I used in 6.2 honestly

# PLS model (Partial Least Squares)
pls_model_3 <- train(x = train_3x, 
                   y = train_3y,
                   method = "pls",
                   preProc = c("center", "scale"),
                   tuneLength = 20, # 1 to 20 PLS components
                   trControl = resamp_strat_3)

# View
pls_model_3
## Partial Least Squares 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 130, 130, 129, 130, 129, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.345551  0.4861716  1.089909
##    2     1.471664  0.5247170  1.069696
##    3     1.410657  0.5822645  1.042906
##    4     1.684913  0.5499083  1.134243
##    5     1.995138  0.5311975  1.229277
##    6     2.426637  0.5142106  1.354588
##    7     2.757757  0.4978192  1.463980
##    8     2.978351  0.4921191  1.530142
##    9     3.082827  0.4875135  1.572579
##   10     3.104231  0.4822510  1.586137
##   11     3.094592  0.4761268  1.582941
##   12     3.083765  0.4694793  1.588063
##   13     2.885535  0.4713323  1.532257
##   14     2.868825  0.4733060  1.532938
##   15     2.790373  0.4742011  1.508876
##   16     2.719785  0.4824434  1.487661
##   17     2.636087  0.4767597  1.466501
##   18     2.611807  0.4776051  1.456602
##   19     2.646141  0.4738674  1.470119
##   20     2.636634  0.4746611  1.467546
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 1.

Answer:

The optimal ncomp was 1, becuase this is where the RMSE is lowest. (Though it’s worth noting that the R squared peaked at ncomp = 3)

d

  1. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
# Predict on the test set
pls_predictions3 <- predict(pls_model_3, newdata = test_3x)

# performance metrics 
testMetrics <- postResample(pred = pls_predictions3, obs = test_3y)
print(testMetrics)
##      RMSE  Rsquared       MAE 
## 2.1865026 0.2026449 1.3854237

This PLS model has an R squared is 0.2026449 on the test set. The optimal R squared from the training set was actually 0.4861716

e

  1. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
# variable importance for the PLS model
pls_importance <- varImp(pls_model_3)

# top 20 most important predictors
plot(pls_importance, top = 20, main = "Top 20 Predictors for Yield (PLS)")

The Manufacturing Process predictors are generally most important variables; they take up the top 5 spots and 12/20 spots. So yes, I think one could argue that they dominate the list. There are 8 Biological Material predictors that appear in the top 20 as well though, which is worth pointing out. ManufacturingProcess32 ranks as the top predictor here but a pretty decent margin..

f

  1. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process
# top 
imp_df <- as.data.frame(varImp(pls_model_3)$importance)
imp_df$Feature <- rownames(imp_df)
top_15_features <- head(imp_df[order(-imp_df$Overall), "Feature"], 15)

# yield and predictors in data frame
top_data <- cbind(Yield = train_3y, train_3x[, top_15_features])

# correlation matrix
cor_matrix <- cor(top_data)

# corr matrix
corrplot(cor_matrix, 
         method = "circle",      # Uses circles whose size/color represent the correlation
         type = "lower",         # Only shows the lower triangle to avoid redundancy
         tl.col = "black",       # Black text labels
         tl.cex = 0.8,           # Adjust text size to fit
         col = colorRampPalette(c("darkred", "white", "steelblue"))(100),
         title = "Correlation: Yield and Top 15 Predictors",
         mar = c(0,0,2,0))       # Margin adjustment for the title

Positive Process Drivers: ManufacturingProcess32 and ManufacturingProcess09 have the strongest positive correlations with yield. Increasing these parameters tends to lead to higher yields.

Negative Process Drivers: ManufacturingProcess36 and ManufacturingProcess13 show strong negative correlations, meaning higher levels of these parameters actively decrease the yield.

Raw Material Screening: Biological materials like BiologicalMaterial02 and BiologicalMaterial06 also show moderate positive correlations. Of course I can’t change materials mid process, but this is good to know for selecting materials up front.

Also I like how this matrix reveals some multicollinearity among predictors. For instance, ManufacturingProcess32 and ManufacturingProcess36 are highly negatively correlated with each other .Maybe these two processes are somewhat inverse from one another.

These insights could be very helpful for trying to optimize around higher yields.