Question 6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

Part A

Start R and use these commands to load the data:

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
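A minimal sketch of the loading step, assuming the data come from the AppliedPredictiveModeling package that accompanies the textbook (the package and the `data(permeability)` call are assumptions here, since the original commands are not shown in this write-up):

```r
# Assumption: the AppliedPredictiveModeling package is installed
library(AppliedPredictiveModeling)

# Loads two objects into the workspace: `fingerprints` and `permeability`
data(permeability)

# Check the dimensions against the description above
dim(fingerprints)
length(permeability)
```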

Part B

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many are left for modeling?

 719

There are 719 variables left.
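The filtering step can be sketched as follows, assuming default `nearZeroVar` settings (the thresholds actually used are not shown). Note that `nearZeroVar` returns the indices of the columns it flags for removal, so the number of predictors left is the column count after dropping them, not the length of the returned vector. The name `filtered` is hypothetical.

```r
library(caret)
library(AppliedPredictiveModeling)
data(permeability)

# Indices of the near-zero-variance columns (the ones to drop)
nzv <- nearZeroVar(fingerprints)

# Drop the flagged columns; guard against the case where nothing is flagged
filtered <- if (length(nzv) > 0) fingerprints[, -nzv] else fingerprints

length(nzv)     # number of predictors removed
ncol(filtered)  # number of predictors left for modeling
```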

Part E

Try building other models discussed in this chapter. Do any have better predictive performance?

Summary

Model                    RMSE       Rsquared
PLS                      14.29229   0.2634860
Lasso Regression         13.98552   0.2417885
Elastic Net Regression   14.08593   0.2055749
Ridge Regression         14.59042   0.1426659
PCR                      15.90717   0.0344595

Judged by \(R^2\), none of the other models did a better job than the PLS model, although the lasso and elastic net models had slightly lower RMSE values.
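A comparison like the one summarized above could be produced along these lines. This is only a sketch on simulated stand-in data, not the code behind the table: the data, the 5-fold cross-validation, the seeds, and `tuneLength = 5` are all assumptions, and the caret method codes need the pls and elasticnet packages installed.

```r
library(caret)

set.seed(1)
# Simulated stand-in data (assumption): 60 samples, 10 predictors
x <- matrix(rnorm(60 * 10), ncol = 10,
            dimnames = list(NULL, paste0("X", 1:10)))
y <- x[, 1] - 2 * x[, 2] + rnorm(60)

ctrl <- trainControl(method = "cv", number = 5)
model_codes <- c(PLS = "pls", PCR = "pcr", Lasso = "lasso",
                 Ridge = "ridge", ElasticNet = "enet")

fits <- lapply(model_codes, function(m) {
  set.seed(2)  # same resampling folds for every model
  train(x, y, method = m, preProcess = c("center", "scale"),
        tuneLength = 5, trControl = ctrl)
})

# Best resampled RMSE and the matching R^2 for each model
summary_tbl <- t(sapply(fits, function(f) {
  unlist(f$results[which.min(f$results$RMSE), c("RMSE", "Rsquared")])
}))
summary_tbl
```

Ranking by RMSE and by \(R^2\) can disagree, so it helps to decide which metric drives the choice before comparing models.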

Part F

Would you recommend any of your models to replace the permeability laboratory experiment?

No. Even the best model (PLS) had an \(R^2\) of only about 0.26, so it explains too little of the variation in permeability to replace the laboratory experiment.

Question 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw materials before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

Part A

Start R and use these commands to load the data:

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs, while yield contains the percent yield for each run.
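A minimal loading sketch, assuming the AppliedPredictiveModeling package. In the package versions I am aware of, the data load as a single data frame, `ChemicalManufacturingProcess`, with `Yield` as one column, rather than as the two objects described above. The names `processPredictors` and `yield` are recreated here only to mirror the text.

```r
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

# Separate the response from the 57 predictors
yield <- ChemicalManufacturingProcess$Yield
processPredictors <- ChemicalManufacturingProcess[
  , names(ChemicalManufacturingProcess) != "Yield"
]

dim(processPredictors)
```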

Part B

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

I will use KNN to impute values. The caret package makes it simple:
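A sketch of KNN imputation with caret's `preProcess`, assuming the default number of neighbors (the actual settings are not shown). Two caveats: `method = "knnImpute"` also centers and scales the predictors, and depending on the caret version it may need the RANN package installed. The names `imputer` and `imputed` are hypothetical.

```r
library(caret)
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

predictors <- ChemicalManufacturingProcess[
  , names(ChemicalManufacturingProcess) != "Yield"
]

# Number of missing cells before imputation
sum(is.na(predictors))

# Estimate the imputation model, then apply it to the same data
imputer <- preProcess(predictors, method = "knnImpute")
imputed <- predict(imputer, predictors)

# Number of missing cells after imputation
sum(is.na(imputed))
```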

Part C

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

I will pre-process the data by removing the zero-variance variables before splitting it.

Now the data will be split into training and test sets using an 80:20 split.
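These two steps might look like the sketch below. The seed is arbitrary, `nearZeroVar` is used with its defaults as a stand-in for whatever filter was actually applied, and `train_df` and `test_df` are hypothetical names.

```r
library(caret)
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

df <- ChemicalManufacturingProcess

# Drop (near-)zero-variance predictors, never the response itself
nzv <- nearZeroVar(df)
nzv <- setdiff(nzv, which(names(df) == "Yield"))
if (length(nzv) > 0) df <- df[, -nzv]

# 80:20 split, stratified on the response
set.seed(42)  # arbitrary seed (assumption)
in_train <- createDataPartition(df$Yield, p = 0.8, list = FALSE)
train_df <- df[in_train, ]
test_df  <- df[-in_train, ]

c(train = nrow(train_df), test = nrow(test_df))
```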

Given the performance of the PLS model in the previous exercise, I will use it again. I’m going to further pre-process by centering and scaling the predictors.

For the performance metric I will again look at the RMSE and the \(R^2\). The ideal RMSE is zero, and the ideal \(R^2\) is 1. Here are the performance metrics for the optimal model on the training set.

ncomp RMSE Rsquared
3 0.7363066 0.5244517
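A sketch of how a PLS model could be tuned on the training set with caret. The seed, 10-fold cross-validation, and `tuneLength = 15` are assumptions, so the selected `ncomp` and the metrics will not necessarily match the values above. Imputing before the split follows the order used in this write-up, although it lets test rows influence the imputation.

```r
library(caret)
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

x <- ChemicalManufacturingProcess[
  , names(ChemicalManufacturingProcess) != "Yield"
]
y <- ChemicalManufacturingProcess$Yield

# Impute, then drop near-zero-variance predictors
x <- predict(preProcess(x, method = "knnImpute"), x)
nzv <- nearZeroVar(x)
if (length(nzv) > 0) x <- x[, -nzv]

set.seed(42)  # arbitrary seed (assumption)
in_train <- createDataPartition(y, p = 0.8, list = FALSE)

set.seed(42)
pls_fit <- train(x[in_train, ], y[in_train], method = "pls",
                 preProcess = c("center", "scale"),
                 tuneLength = 15,
                 trControl = trainControl(method = "cv", number = 10))

# Resampled metrics for the selected number of components
best <- pls_fit$results[
  pls_fit$results$ncomp == pls_fit$bestTune$ncomp,
  c("ncomp", "RMSE", "Rsquared")
]
best
```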

Part D

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

RMSE Rsquared
0.6192577 0.6771122

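On the test set, the RMSE is about 0.62 and the \(R^2\) is roughly 0.68. Both are better than the resampled estimates from the training set (RMSE of about 0.74 and \(R^2\) of about 0.52).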
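For reference, the two metrics can be computed directly from observed and predicted values. The sketch below uses made-up numbers purely for illustration. It computes \(R^2\) as the squared correlation between observed and predicted values, which is how caret's `postResample` defines it by default.

```r
# Hypothetical observed and predicted yields (illustrative numbers only)
obs  <- c(40.1, 41.3, 39.8, 42.0, 40.7)
pred <- c(40.4, 40.9, 40.1, 41.5, 40.2)

# Root mean squared error: typical size of a prediction error, in yield units
rmse <- sqrt(mean((obs - pred)^2))

# R^2 as the squared correlation between observed and predicted values
rsq <- cor(obs, pred)^2

c(RMSE = rmse, Rsquared = rsq)
```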

Part E

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

I will select the variables with a varImp score greater than or equal to 60 to be the “important” ones.

Variable n
ManufacturingProcess 9
BiologicalMaterial 3

Only 12 variables meet this threshold, and the process predictors dominate the list (9 of the 12).
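The selection and the count by predictor type could be sketched as follows. The model here is refit on all rows with assumed settings (seed, 5-fold cross-validation, `tuneLength = 10`), so the resulting counts need not match the table above. The names `imp` and `top` are hypothetical.

```r
library(caret)
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

x <- ChemicalManufacturingProcess[
  , names(ChemicalManufacturingProcess) != "Yield"
]
y <- ChemicalManufacturingProcess$Yield

# Impute (this also centers and scales), then drop near-zero-variance columns
x <- predict(preProcess(x, method = "knnImpute"), x)
nzv <- nearZeroVar(x)
if (length(nzv) > 0) x <- x[, -nzv]

set.seed(42)  # arbitrary seed (assumption)
pls_fit <- train(x, y, method = "pls", tuneLength = 10,
                 trControl = trainControl(method = "cv", number = 5))

# Importance scores are scaled to 0-100 by default
imp <- varImp(pls_fit)$importance
top <- rownames(imp)[imp$Overall >= 60]

# Strip the trailing digits to count predictors by type
table(sub("[0-9]+$", "", top))
```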

Part F

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future rounds of the manufacturing process?

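All variables are positively correlated with the yield. Manufacturing process 32 is the most important variable. It is positively correlated with the biological variables and negatively correlated with manufacturing process 13.

Because the manufacturing process predictors can be adjusted, further exploration of these relationships could show which settings to change to improve the yield. This would be valuable information for these chemical manufacturers, given that each 1% gain in yield is worth approximately one hundred thousand dollars per batch.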
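One simple way to explore these relationships is through pairwise correlations. The sketch below looks only at the two process predictors named above and uses pairwise-complete observations instead of imputed values, which is an assumption made to keep the example short. The names `vars`, `cor_with_yield`, and `cor_between` are hypothetical.

```r
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

vars <- c("ManufacturingProcess32", "ManufacturingProcess13")

# Correlation of each predictor with the yield
cor_with_yield <- cor(ChemicalManufacturingProcess[, vars],
                      ChemicalManufacturingProcess$Yield,
                      use = "pairwise.complete.obs")
cor_with_yield

# Correlation between the two predictors themselves
cor_between <- cor(ChemicalManufacturingProcess[, vars],
                   use = "pairwise.complete.obs")
cor_between
```

Scatter plots of each predictor against the yield (for example with `plot()`) would show whether these relationships are roughly linear before any process setting is changed.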