Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
(a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)  # contains the permeability data
library(caret)                      # nearZeroVar(), train(), R2(), RMSE()
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
# drop the near-zero-variance fingerprint columns
fingerprints_filtered <- fingerprints[, -nearZeroVar(fingerprints)]
dim(fingerprints_filtered)
## [1] 165 388
After filtering out predictors with low frequencies, we have 388 predictors left.
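To see why columns get flagged, nearZeroVar() can also return its diagnostics instead of column indices; a minimal sketch (nzv_metrics is an illustrative name):
# Frequency ratio and percent-unique diagnostics behind the filtering
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, c("freqRatio", "percentUnique")])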
(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
set.seed(613)  # for reproducibility

# split into training (80%) and testing (20%)
train_indices <- sample(nrow(fingerprints_filtered),
                        0.8 * nrow(fingerprints_filtered),
                        replace = FALSE)
X_train <- fingerprints_filtered[train_indices, ]
X_test  <- fingerprints_filtered[-train_indices, ]
y_train <- permeability[train_indices, ]
y_test  <- permeability[-train_indices, ]
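As an aside, and not what we do here, caret's createDataPartition() draws a split stratified on the response, which can be preferable when the outcome is skewed; a sketch with illustrative names:
# Alternative 80/20 split stratified on the permeability values
part_idx <- createDataPartition(permeability[, 1], p = 0.8, list = FALSE)
X_train_alt <- fingerprints_filtered[part_idx, ]
y_train_alt <- permeability[part_idx, ]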
Let’s check to see if there are any missing values.
any(is.na(X_train)) | any(is.na(X_test))
## [1] FALSE
There are no missing values.
We will use 10-fold cross-validation as the resampling technique to determine the optimal number of latent variables, and we will center and scale the data inside the train call.
ctrl <- trainControl(method = "CV", number = 10)

pls_mod <- train(X_train, y_train,
                 method = "pls",
                 preProcess = c("center", "scale"),
                 trControl = ctrl,
                 tuneLength = 20)
plot(pls_mod)
pls_mod
## Partial Least Squares
##
## 132 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 119, 119, 120, 118, 120, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.18785 0.3966661 9.512178
## 2 11.24003 0.4799066 7.843522
## 3 11.66296 0.4226539 8.457142
## 4 11.95787 0.3994729 8.909149
## 5 12.22852 0.3801330 8.840589
## 6 12.16547 0.3880907 8.789197
## 7 12.02162 0.3996247 8.850929
## 8 12.11775 0.3964606 8.955032
## 9 11.93628 0.4141765 8.583337
## 10 12.12939 0.4039441 8.667530
## 11 12.15361 0.4044442 8.580975
## 12 12.27896 0.4040945 8.716719
## 13 12.71739 0.3745855 8.999261
## 14 12.76637 0.3819589 9.188841
## 15 12.61217 0.4037047 9.199757
## 16 12.76865 0.4029550 9.234611
## 17 12.94572 0.3900271 9.278010
## 18 12.99874 0.3859356 9.307824
## 19 13.04231 0.3820955 9.404233
## 20 13.05177 0.3853788 9.469839
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.
According to the model, the optimal number of latent variables is 2 and the corresponding resampled \(R^2\) is about 0.48.
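The same numbers can be pulled out of the train object programmatically:
# Resampled performance at the chosen number of latent variables
subset(pls_mod$results, ncomp == pls_mod$bestTune$ncomp,
       select = c(ncomp, RMSE, Rsquared))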
(d) Predict the response for the test set. What is the test set estimate of \(R^2\)?
pls_pred <- predict(pls_mod, X_test)
# r2 value
R2(pls_pred, y_test)
## [1] 0.2926345
# RMSE
RMSE(pls_pred, y_test)
## [1] 14.88432
The test estimate of \(R^2\) is 0.29.
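As a cross-check, caret's postResample() returns RMSE, \(R^2\), and MAE in one call:
# RMSE, R^2, and MAE together
postResample(pred = pls_pred, obs = y_test)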
(e) Try building other models discussed in this chapter. Do any have better predictive performance?
pca_mod <- train(X_train, y_train,
                 method = "pcr",
                 preProcess = c("center", "scale"),
                 trControl = ctrl,
                 tuneLength = 20)
plot(pca_mod)
pca_mod
## Principal Component Analysis
##
## 132 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 120, 120, 118, 118, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 14.18292 0.1622220 10.735483
## 2 14.04598 0.1777391 10.751289
## 3 12.17207 0.3720281 9.212869
## 4 12.10454 0.3980585 9.265144
## 5 11.57066 0.4604906 8.192442
## 6 11.65999 0.4587723 8.207429
## 7 11.71684 0.4548234 8.213875
## 8 11.45616 0.4742200 7.945985
## 9 11.49975 0.4724512 7.989879
## 10 11.56040 0.4663111 8.036966
## 11 11.73858 0.4508321 8.091737
## 12 11.86388 0.4452228 8.231327
## 13 11.83437 0.4474663 8.298998
## 14 11.44712 0.4861469 7.919622
## 15 11.58049 0.4716563 8.039214
## 16 11.62745 0.4671816 8.085807
## 17 11.77694 0.4597873 8.218923
## 18 11.89672 0.4552462 8.314439
## 19 11.88092 0.4544649 8.230867
## 20 11.79540 0.4665171 8.203176
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 14.
For the PCR model, the optimal number of components is 14, with a corresponding resampled \(R^2\) of about 0.49. That is slightly better than the PLS model's \(R^2\), but the resampled RMSE (11.45 vs. 11.24) is slightly worse.
pca_pred <- predict(pca_mod, X_test)
# r2 value
R2(pca_pred, y_test)
## [1] 0.3104183
# RMSE
RMSE(pca_pred, y_test)
## [1] 14.54669
The \(R^2\) value for the test set is 0.31, which is also a slight improvement over our PLS model.
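Since each train() call drew its own folds, a strictly paired comparison would require shared resampling indices (e.g., via the index argument of trainControl), but caret's resamples() still gives a useful side-by-side summary:
# Side-by-side summaries of the resampled metrics for the two models
summary(resamples(list(PLS = pls_mod, PCR = pca_mod)))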
ridge_mod <- train(X_train, y_train,
                   method = "ridge",
                   preProcess = c("center", "scale"),
                   trControl = ctrl,
                   tuneGrid = expand.grid(.lambda = seq(0, 1, 0.05)))
## Warning: model fit failed for Fold03: lambda=0.00 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
plot(ridge_mod)
ridge_mod
## Ridge Regression
##
## 132 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 118, 118, 120, 120, 117, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00 19.78384 0.3445566 13.177795
## 0.05 13.11925 0.4188681 9.267912
## 0.10 12.76826 0.4493847 9.116099
## 0.15 12.71815 0.4612040 9.115231
## 0.20 12.75480 0.4684972 9.164462
## 0.25 12.85351 0.4734550 9.234285
## 0.30 12.99410 0.4770698 9.329549
## 0.35 13.16311 0.4798142 9.449907
## 0.40 13.35526 0.4821121 9.577437
## 0.45 13.56259 0.4841312 9.720374
## 0.50 13.80914 0.4854894 9.912405
## 0.55 14.03552 0.4869424 10.079830
## 0.60 14.29248 0.4882576 10.275131
## 0.65 14.57132 0.4889218 10.480479
## 0.70 14.83915 0.4899979 10.695014
## 0.75 15.12936 0.4907728 10.945217
## 0.80 15.43554 0.4913414 11.199942
## 0.85 15.74160 0.4920793 11.454952
## 0.90 16.04671 0.4927332 11.701515
## 0.95 16.36740 0.4932643 11.955283
## 1.00 16.69451 0.4937462 12.212142
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.15.
The optimal \(\lambda\) value for the ridge model is 0.15, with an \(R^2\) of 0.46. This is lower than the resampled \(R^2\) for the PLS model.
ridge_pred <- predict(ridge_mod, X_test)
# r2 value
R2(ridge_pred, y_test)
## [1] 0.4751709
# RMSE
RMSE(ridge_pred, y_test)
## [1] 13.28983
The \(R^2\) for the test set is about 0.47, a clear improvement over the PLS model. The ridge model thus has a slightly lower resampled \(R^2\) than the PLS model on the training data but a markedly higher \(R^2\) on the test set. Since the loss in resampled performance is small and the gain on unseen (test) data is large, the ridge model performs best.
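Collecting the test-set metrics computed above into a single table makes the comparison explicit (using the prediction objects already created):
# Test-set performance of all three models
data.frame(model = c("PLS", "PCR", "Ridge"),
           Rsquared = c(R2(pls_pred, y_test),
                        R2(pca_pred, y_test),
                        R2(ridge_pred, y_test)),
           RMSE = c(RMSE(pls_pred, y_test),
                    RMSE(pca_pred, y_test),
                    RMSE(ridge_pred, y_test)))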
(f) Would you recommend any of your models to replace the permeability laboratory experiment?
I would recommend the ridge model because it has the highest \(R^2\) and the lowest RMSE on the test set. Its resampled \(R^2\) is slightly lower than that of the other models, but it generalizes best to unseen data.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
(a) Start R and use these commands to load the data:
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
library(DataExplorer)  # plot_missing() comes from the DataExplorer package

plot_missing(ChemicalManufacturingProcess,
             missing_only = TRUE,
             ggtheme = theme_classic())
We can use KNN imputation to fill in the missing values. We will use the preProcess function mentioned in Sect. 3.8 to do this.
imputations <- preProcess(ChemicalManufacturingProcess,
                          method = c("knnImpute"),
                          k = 5)
chem_man_imputed <- predict(imputations, ChemicalManufacturingProcess)
# check
any(is.na(chem_man_imputed))
## [1] FALSE
There are no missing values left.
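For reference, the missing cells can also be tallied per column before imputation; a minimal sketch (the na_counts name is just for illustration). One caveat worth noting: caret's knnImpute method automatically triggers centering and scaling, so every column of chem_man_imputed, including Yield, ends up standardized, which is why the RMSE values reported later are on a standardized scale rather than in raw percent yield.
# Count missing cells per predictor, worst offenders first (pre-imputation)
na_counts <- colSums(is.na(ChemicalManufacturingProcess))
head(sort(na_counts[na_counts > 0], decreasing = TRUE))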
(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
First, we remove near zero variance predictors from the dataset.
chem_man_filtered <- chem_man_imputed[,-nearZeroVar(chem_man_imputed)]
Next, we’ll split the data into a training and testing set.
set.seed(613)  # for reproducibility

# split into training (80%) and testing (20%)
train_indices <- sample(nrow(chem_man_filtered),
                        0.8 * nrow(chem_man_filtered),
                        replace = FALSE)
train <- chem_man_filtered[train_indices, ]
test  <- chem_man_filtered[-train_indices, ]
We will then fit a lasso model (glmnet with \(\alpha = 1\)) to the training data.
lasso_mod <- train(Yield ~ .,
                   data = train,
                   method = "glmnet",
                   preProcess = c("center", "scale"),
                   trControl = ctrl,
                   tuneGrid = expand.grid(.alpha = 1,
                                          .lambda = seq(0, 1, 0.05)))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
plot(lasso_mod)
lasso_mod
## glmnet
##
## 140 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 126, 126, 125, 126, 126, 125, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00 2.2830463 0.4771485 1.0373041
## 0.05 0.6567030 0.6180707 0.5217931
## 0.10 0.6633051 0.6075810 0.5495223
## 0.15 0.6642085 0.6268495 0.5594627
## 0.20 0.6840971 0.6388624 0.5722698
## 0.25 0.7145695 0.6392532 0.5940907
## 0.30 0.7513699 0.6336907 0.6248142
## 0.35 0.7909580 0.6230559 0.6571128
## 0.40 0.8341110 0.6001490 0.6898937
## 0.45 0.8784862 0.5526571 0.7232343
## 0.50 0.9206409 0.4794189 0.7557006
## 0.55 0.9538961 0.4424531 0.7841000
## 0.60 0.9838691 0.4408428 0.8099099
## 0.65 1.0106106 0.3377224 0.8326521
## 0.70 1.0126357 NaN 0.8339713
## 0.75 1.0126357 NaN 0.8339713
## 0.80 1.0126357 NaN 0.8339713
## 0.85 1.0126357 NaN 0.8339713
## 0.90 1.0126357 NaN 0.8339713
## 0.95 1.0126357 NaN 0.8339713
## 1.00 1.0126357 NaN 0.8339713
##
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.05.
The optimal \(\lambda\) is 0.05, with an \(R^2\) value of 0.62 and an RMSE of 0.66.
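These values can also be read directly off the fitted train object:
# Resampled metrics and the selected tuning parameters
getTrainPerf(lasso_mod)
lasso_mod$bestTune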
(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
lasso_pred <- predict(lasso_mod, test)
R2(lasso_pred, test$Yield)
## [1] 0.5426866
RMSE(lasso_pred, test$Yield)
## [1] 0.6344769
The \(R^2\) for the test set is 0.54, which is lower than the resampled \(R^2\) on the training set. The RMSE for the test set is 0.63, which is slightly lower than the training RMSE.
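Because the lasso zeroes out coefficients, it is worth seeing which predictors survive at the chosen penalty; a minimal sketch using coef() on the underlying glmnet fit (the beta name is illustrative):
# Coefficients at the selected lambda; most are exactly zero
beta <- as.matrix(coef(lasso_mod$finalModel, s = lasso_mod$bestTune$lambda))
rownames(beta)[beta[, 1] != 0]  # surviving terms (includes the intercept)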
(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
varImp(lasso_mod)
## glmnet variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess09 51.694
## ManufacturingProcess13 46.501
## ManufacturingProcess15 16.024
## ManufacturingProcess36 15.614
## ManufacturingProcess17 14.370
## BiologicalMaterial02 14.036
## ManufacturingProcess24 13.184
## ManufacturingProcess39 12.296
## ManufacturingProcess34 9.433
## ManufacturingProcess06 9.396
## ManufacturingProcess45 9.183
## BiologicalMaterial05 8.221
## ManufacturingProcess37 5.007
## ManufacturingProcess01 5.005
## ManufacturingProcess07 4.914
## ManufacturingProcess42 0.000
## BiologicalMaterial08 0.000
## ManufacturingProcess23 0.000
## ManufacturingProcess12 0.000
Of the 20 most important variables, process predictors dominate the list. The single most important variable is ManufacturingProcess32.
We can also plot these scores.
plot(varImp(lasso_mod))
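A quick tally of predictor types among the top 20 confirms the pattern, relying only on the dataset's own name prefixes (imp and top20 are illustrative names):
# Count process vs. biological predictors among the top 20
imp <- varImp(lasso_mod)$importance
top20 <- rownames(imp)[order(-imp$Overall)][1:20]
table(ifelse(grepl("^Manufacturing", top20), "process", "biological"))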
(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
Let’s take a look at the relationship of the response variable with the top 10 predictors.
# arrange() is from dplyr; rownames_to_column() is from tibble
top_10 <- varImp(lasso_mod)$importance |>
  arrange(desc(Overall)) |>
  head(10) |>
  rownames_to_column()
names(top_10) <- c("predictor", "importance")
chem_man_imputed[,c("Yield", top_10$predictor)] |>
cor() |>
corrplot(method="color",
diag=FALSE,
type="lower",
addCoef.col = "black",
number.cex=0.5)
Reading down the first column of the correlation plot shows the individual relationship between each top predictor and the response; the sign of each correlation tells you whether that predictor is positively or negatively associated with yield. Because the process predictors, unlike the biological ones, can be changed during manufacturing, the steps with the strongest associations could be adjusted to increase overall yield.
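As one illustrative follow-up (not part of the original analysis), the single strongest predictor can be plotted directly against yield to see the direction of the relationship:
# Yield vs. the top predictor (both standardized by the knnImpute preprocessing)
plot(chem_man_imputed$ManufacturingProcess32, chem_man_imputed$Yield,
     xlab = "ManufacturingProcess32 (standardized)",
     ylab = "Yield (standardized)")
abline(lm(Yield ~ ManufacturingProcess32, data = chem_man_imputed), col = "red")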