Question 6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
library(caret)
library(pls)
library(elasticnet)
library(RANN)
library(VIM)
library(e1071)
library(corrplot)
library(tidyverse)
library(knitr)
data(permeability)

The matrix `fingerprints` contains the 1,107 binary molecular predictors for the 165 compounds, while `permeability` contains the permeability response.

(b) The `fingerprint` predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the `nearZeroVar` function from the caret package. How many predictors are left for modeling?

Applying the `nearZeroVar` function removed 719 of the 1,107 predictors, leaving a total of 388 predictors for modeling.
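The filtering step could look roughly like the following. This is a minimal sketch that assumes the default `nearZeroVar` thresholds; the name `fp_filtered` is chosen here for illustration and does not come from the original code.

```r
library(AppliedPredictiveModeling)
library(caret)

data(permeability)  # loads the `fingerprints` matrix and the `permeability` response

# Column indices of predictors flagged as near-zero variance (caret defaults)
nzv_cols <- nearZeroVar(fingerprints)

# Keep only the predictors that were not flagged
fp_filtered <- fingerprints[, -nzv_cols]

length(nzv_cols)   # number of predictors removed
ncol(fp_filtered)  # number of predictors kept
```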

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding re-sampled estimate of \(R^2\)?

Looking at the tuning plot, we see that the optimal number of latent variables is 8, with a resampled \(R^2\) of 0.5378.
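One way to set up the split and the tuning run is sketched below. The seed, the 80/20 split, the 10-fold cross-validation, and the grid of 20 components are all assumptions made for illustration, since the original settings are not shown, so the selected number of components and the \(R^2\) will not necessarily match the values reported above.

```r
library(AppliedPredictiveModeling)
library(caret)
library(pls)

data(permeability)
fp_filtered <- fingerprints[, -nearZeroVar(fingerprints)]

set.seed(624)  # assumed seed, for reproducibility only
# Assumed 80/20 split, stratified on the response
train_idx <- createDataPartition(permeability[, 1], p = 0.8, list = FALSE)
train_x <- fp_filtered[train_idx, ]
train_y <- permeability[train_idx, 1]

# Tune the number of PLS components with 10-fold cross-validation (assumed)
pls_fit <- train(
  x = train_x, y = train_y,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 20,
  metric = "Rsquared",
  trControl = trainControl(method = "cv", number = 10)
)

pls_fit$bestTune  # selected number of latent variables

# Resampled R^2 at the selected number of components
best_row <- pls_fit$results[pls_fit$results$ncomp == pls_fit$bestTune$ncomp, ]
best_row$Rsquared

print(plot(pls_fit))  # resampled R^2 against the number of components
```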

(d) Predict the response for the test set. What is the test set estimate of \(R^2\)?

The model performed slightly worse on the test set, with an \(R^2\) of 0.5325 compared with the resampled training estimate of 0.5378.
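For reference, caret's `postResample` reports \(R^2\) as the squared correlation between observed and predicted values. The sketch below shows this on a handful of made-up numbers; `obs` and `pred` are hypothetical stand-ins for the held-out permeability values and for the output of `predict()` on the test set predictors.

```r
library(caret)

# Hypothetical observed and predicted values, for illustration only
obs  <- c(1.2, 3.4, 2.2, 5.1, 4.0)
pred <- c(1.0, 3.0, 2.9, 4.6, 4.4)

# Returns a named vector with RMSE, Rsquared and MAE
perf <- postResample(pred = pred, obs = obs)
perf

# The Rsquared entry equals the squared Pearson correlation
r2_manual <- cor(obs, pred)^2
all.equal(unname(perf["Rsquared"]), r2_manual)
```

Because this \(R^2\) is a squared correlation, it measures how well the predictions track the observations, not whether they are on the right scale, which is why RMSE is worth reporting alongside it.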

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

Linear Model

Before interpreting the \(R^2\) of the linear model, we should check that its residuals look reasonable.

The Observed vs. Predicted plot shows a linear relationship with no outliers, meaning the model is well calibrated. Additionally, the Residuals vs. Predicted plot shows a random cloud of points with no real pattern. The linear model performed as it should, so we can compare its \(R^2\) with the other models.
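The two diagnostic plots described above can be drawn for any pair of observed and predicted vectors. The following is a minimal base R sketch; `plot_diagnostics` is a hypothetical helper, and the values are simulated stand-ins, not the permeability data or the linear model's predictions.

```r
# Hypothetical helper: draws both diagnostic plots side by side and
# returns the residuals invisibly
plot_diagnostics <- function(observed, predicted) {
  residuals <- observed - predicted
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par))

  plot(predicted, observed, main = "Observed vs. Predicted")
  abline(0, 1, lty = 2)   # reference line for perfect calibration

  plot(predicted, residuals, main = "Residuals vs. Predicted")
  abline(h = 0, lty = 2)  # residuals should scatter evenly around zero

  invisible(residuals)
}

# Simulated stand-in values, for illustration only
set.seed(1)
predicted <- runif(50, min = 0, max = 40)
observed  <- predicted + rnorm(50, sd = 5)

res <- plot_diagnostics(observed, predicted)
summary(res)
```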

Ridge Regression

After removing the first 4 models from the list, since their RMSE was too high to be considered anyway, it is clearer why the best ridge model is the one using a \(\lambda\) of 0.0929.
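A ridge model can be tuned through caret's `"ridge"` method, which relies on the elasticnet package. The sketch below is only indicative: the \(\lambda\) grid, the seed, the split, and the 5-fold cross-validation are assumptions, because the original grid and resampling settings are not shown.

```r
library(AppliedPredictiveModeling)
library(caret)
library(elasticnet)

data(permeability)
fp_filtered <- fingerprints[, -nearZeroVar(fingerprints)]

set.seed(624)  # assumed seed
train_idx <- createDataPartition(permeability[, 1], p = 0.8, list = FALSE)

# Assumed grid of penalty values; the grid used in the original is not shown
ridge_grid <- data.frame(lambda = seq(0.02, 0.2, length.out = 6))

ridge_fit <- train(
  x = fp_filtered[train_idx, ], y = permeability[train_idx, 1],
  method = "ridge",
  preProcess = c("center", "scale"),
  tuneGrid = ridge_grid,
  trControl = trainControl(method = "cv", number = 5)  # assumed resampling
)

ridge_fit$bestTune                                     # selected lambda
ridge_fit$results[, c("lambda", "RMSE", "Rsquared")]   # resampled performance
```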

Comparing Models

When comparing the models we will focus on the resampled \(R^2\) from training and the test set \(R^2\).

Comparing resampled training \(R^2\)
model	RSquared
PLS	0.5378
Ridge	0.4533
Linear	0.2768

Comparing the resampled training \(R^2\), we see that the PLS model performed best with an \(R^2\) of 0.5378, while the linear model performed worst with an \(R^2\) of 0.2768.

Comparing test set \(R^2\)
model	RSquared
PLS	0.5325
Ridge	0.5185
Linear	0.1965

On the test set, the PLS model comes closest to its training estimate, with a difference of 0.0053, meaning the training estimate was not optimistic but fairly accurate. On the other hand, the linear regression model had a test set \(R^2\) about 0.0803 smaller than its training estimate, meaning the training estimate was overly optimistic about predicting permeability.

In both comparisons, the ridge model falls somewhere between the PLS and linear models. Therefore, the best model to predict permeability is the PLS model, and I would not replace it with any of the other models tested.
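The differences quoted above can be checked directly from the two tables. In this small sketch the column names `first_table` and `second_table` are chosen here for illustration; the values are copied from the tables as reported.

```r
# R^2 values copied from the two comparison tables
r2 <- data.frame(
  model        = c("PLS", "Ridge", "Linear"),
  first_table  = c(0.5378, 0.4533, 0.2768),
  second_table = c(0.5325, 0.5185, 0.1965)
)

# Positive gap: the first estimate was higher than the second
r2$gap <- round(r2$first_table - r2$second_table, 4)
r2
```

The gap is 0.0053 for PLS and 0.0803 for the linear model, while for ridge it is negative (-0.0652), since its second value is the higher one.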

Question 6.3.

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data:

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

Some values are missing, and they appear to be missing at random, so we can use the `preProcess` function to impute them with the KNN method discussed in Sect. 3.8.

As we can see, the data has been fully pre-processed and all missing values have been imputed.
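A minimal sketch of the KNN imputation follows, assuming the data is loaded as the `ChemicalManufacturingProcess` data frame from the AppliedPredictiveModeling package, with the response in its `Yield` column. The object names are chosen here for illustration. Note that caret's `"knnImpute"` method also centers and scales the predictors as part of the imputation.

```r
library(AppliedPredictiveModeling)
library(caret)
library(RANN)  # nearest-neighbour search used by caret's knnImpute

data(ChemicalManufacturingProcess)

# Separate the predictors from the response
predictors <- ChemicalManufacturingProcess[
  , names(ChemicalManufacturingProcess) != "Yield"
]

sum(is.na(predictors))  # missing cells before imputation

# Estimate the imputation model, then apply it to the predictors
knn_pp  <- preProcess(predictors, method = "knnImpute")
imputed <- predict(knn_pp, predictors)

sum(is.na(imputed))     # missing cells after imputation
```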

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

Given that there are 58 columns, we will want to remove some predictors. One way to do this is to start by eliminating any predictors that have zero variance. This step drops the flagged column from the predictor set.

We can then reduce the predictors further by removing those that are highly correlated with each other. This brings the column count down from 58 to 36, a reduction of 22 predictors.
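The two filtering steps could be sketched as below. The use of `nearZeroVar` for the variance filter and the 0.9 correlation cutoff are assumptions, since the original settings are not shown, so the resulting column count may differ from the one reported above.

```r
library(AppliedPredictiveModeling)
library(caret)
library(RANN)

data(ChemicalManufacturingProcess)
predictors <- ChemicalManufacturingProcess[
  , names(ChemicalManufacturingProcess) != "Yield"
]

# Impute first, so that the correlation matrix has no missing values
imputed <- predict(preProcess(predictors, method = "knnImpute"), predictors)

# Step 1: drop predictors with (near-)zero variance
nzv_cols <- nearZeroVar(imputed)
filtered <- if (length(nzv_cols) > 0) imputed[, -nzv_cols] else imputed

# Step 2: drop predictors that are highly correlated with another predictor
high_cor <- findCorrelation(cor(filtered), cutoff = 0.9)  # assumed cutoff
reduced  <- if (length(high_cor) > 0) filtered[, -high_cor] else filtered

c(start = ncol(imputed), after_nzv = ncol(filtered), final = ncol(reduced))
```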

We will process the data using the `preProcess` function, which will center and scale the data.

Using a PLS model again, we see that 9 latent variables are optimal, producing a resampled \(R^2\) of 0.5524.

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

The model performed much better on the test set, with an \(R^2\) of 0.7170, which is 0.1645 higher than the resampled estimate from the training set.
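The tuning in (c) and the test set check in (d) could be sketched together as follows. The seed, split proportion, correlation cutoff, resampling scheme, and tuning grid are assumptions, so the numbers produced will not necessarily match those reported above. Imputation is done before splitting, following the order of the steps in this report.

```r
library(AppliedPredictiveModeling)
library(caret)
library(pls)
library(RANN)

data(ChemicalManufacturingProcess)
y <- ChemicalManufacturingProcess$Yield
x <- ChemicalManufacturingProcess[
  , names(ChemicalManufacturingProcess) != "Yield"
]

# Impute, then drop near-zero-variance and highly correlated predictors
x <- predict(preProcess(x, method = "knnImpute"), x)
nzv_cols <- nearZeroVar(x)
if (length(nzv_cols) > 0) x <- x[, -nzv_cols]
high_cor <- findCorrelation(cor(x), cutoff = 0.9)  # assumed cutoff
if (length(high_cor) > 0) x <- x[, -high_cor]

set.seed(624)  # assumed seed
train_idx <- createDataPartition(y, p = 0.8, list = FALSE)  # assumed split

pls_fit <- train(
  x = x[train_idx, ], y = y[train_idx],
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 15,
  metric = "Rsquared",
  trControl = trainControl(method = "cv", number = 10)  # assumed resampling
)

# Resampled R^2 at the selected number of components
resampled_r2 <- max(pls_fit$results$Rsquared)

# Test set performance
test_pred <- predict(pls_fit, newdata = x[-train_idx, ])
test_perf <- postResample(pred = test_pred, obs = y[-train_idx])

c(ncomp = pls_fit$bestTune$ncomp,
  resampled = resampled_r2,
  test = unname(test_perf["Rsquared"]))
```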

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

The list of top 10 predictors for the PLS model
Predictors	Overall
BiologicalMaterial11	100.00
BiologicalMaterial09	84.56
ManufacturingProcess36	60.95
ManufacturingProcess33	57.75
ManufacturingProcess28	45.75
ManufacturingProcess02	45.26
BiologicalMaterial10	44.35
ManufacturingProcess04	37.93
ManufacturingProcess12	30.95
ManufacturingProcess19	30.25

The top 10 list is dominated by process predictors, which make up 70% of the top predictors.

Top 10 important predictors breakdown
type	predictors	percent
Biological	3	0.3
Process	7	0.7
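A ranking and breakdown of this kind can be obtained with caret's `varImp`. The sketch below refits a PLS model under assumed settings (seed, split, correlation cutoff, resampling), so the ranking it produces may differ from the table above.

```r
library(AppliedPredictiveModeling)
library(caret)
library(pls)
library(RANN)

data(ChemicalManufacturingProcess)
y <- ChemicalManufacturingProcess$Yield
x <- ChemicalManufacturingProcess[
  , names(ChemicalManufacturingProcess) != "Yield"
]

# Same assumed preparation steps: impute, then filter predictors
x <- predict(preProcess(x, method = "knnImpute"), x)
nzv_cols <- nearZeroVar(x)
if (length(nzv_cols) > 0) x <- x[, -nzv_cols]
high_cor <- findCorrelation(cor(x), cutoff = 0.9)  # assumed cutoff
if (length(high_cor) > 0) x <- x[, -high_cor]

set.seed(624)  # assumed seed
train_idx <- createDataPartition(y, p = 0.8, list = FALSE)
pls_fit <- train(
  x = x[train_idx, ], y = y[train_idx],
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 15,
  trControl = trainControl(method = "cv", number = 10)
)

# Scaled importance (0 to 100), sorted from most to least important
imp   <- varImp(pls_fit)$importance
top10 <- head(imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE], 10)

# Label each predictor by its name prefix
top10$type <- ifelse(grepl("^Biological", rownames(top10)),
                     "Biological", "Process")
top10
prop.table(table(top10$type))  # share of each type in the top 10
```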

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Top 10 predictors' relationship to yield
	predictors	coeff
2	BiologicalMaterial09	0.4397
4	BiologicalMaterial11	0.3132
27	ManufacturingProcess33	0.2046
25	ManufacturingProcess28	0.0934
8	ManufacturingProcess04	0.0888
15	ManufacturingProcess12	0.0271
3	BiologicalMaterial10	-0.1333
30	ManufacturingProcess36	-0.2916
18	ManufacturingProcess19	-0.3341
6	ManufacturingProcess02	-0.5824

Based on the coefficients of the top predictors, I would suggest that when the goal is to increase yield, the most attention should go to BiologicalMaterial09 and BiologicalMaterial11, as they are the top two predictors by importance and both have a positive coefficient in relation to yield. Because biological predictors cannot be changed, their use is in assessing the quality of the raw material before processing.

On the other hand, ManufacturingProcess02, ManufacturingProcess19, and ManufacturingProcess36 are also among the top 10 predictors but have a negative relationship with yield, meaning higher values of these predictors are associated with lower yield. Since process predictors can be changed, these are the settings I would suggest reviewing first in future runs.
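Since the problem statement notes that only the manufacturing process predictors can be changed, it can help to isolate them in the coefficient table. The sketch below copies the coefficients reported above; the `adjustable` and `higher_yield_when` columns are added here for illustration, and the direction labels assume the fitted relationships carry over to future runs, which the model alone cannot establish.

```r
# Coefficients copied from the table of top 10 predictors
coefs <- data.frame(
  predictor = c("BiologicalMaterial09", "BiologicalMaterial11",
                "ManufacturingProcess33", "ManufacturingProcess28",
                "ManufacturingProcess04", "ManufacturingProcess12",
                "BiologicalMaterial10", "ManufacturingProcess36",
                "ManufacturingProcess19", "ManufacturingProcess02"),
  coeff = c(0.4397, 0.3132, 0.2046, 0.0934, 0.0888,
            0.0271, -0.1333, -0.2916, -0.3341, -0.5824)
)

# Only process predictors can be changed between runs
coefs$adjustable <- grepl("^ManufacturingProcess", coefs$predictor)

# Direction of the predictor that is associated with higher yield
coefs$higher_yield_when <- ifelse(coefs$coeff > 0, "higher", "lower")

# Adjustable predictors, strongest relationship first
adjustable <- coefs[coefs$adjustable, ]
adjustable <- adjustable[order(abs(adjustable$coeff), decreasing = TRUE), ]
adjustable[, c("predictor", "coeff", "higher_yield_when")]
```

Under these assumptions, ManufacturingProcess02 has the largest coefficient in absolute value among the adjustable predictors, followed by ManufacturingProcess19 and ManufacturingProcess36.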

DATA 624 - HW 7

Stefano Biguzzi

4/8/2022
