HW 7

6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(permeability) The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

library(caret) 
library(AppliedPredictiveModeling)

data(permeability)

str(permeability)

##  num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr "permeability"

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

The data presents that 388 predictors were left for modeling.

low_freq <- nearZeroVar(fingerprints)
X <- fingerprints[,-low_freq]
str(X)

##  num [1:165, 1:388] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:388] "X1" "X2" "X3" "X4" ...

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

For the data below, it was splitted into train/test set, using the createDataPartition function.

set.seed(17)

t_Row <- createDataPartition(permeability, p=0.8, list=FALSE)
X_train <- X[t_Row, ]
y_train <- permeability[t_Row, ]
X_test <- X[-t_Row, ]
y_test <- permeability[-t_Row, ]

set.seed(17)

pls_Fit <- train(x=X_train,
                y=y_train, 
                method='pls',
                metric='Rsquared',
                tuneLength=20,
                trControl=trainControl(method='cv'),
                preProcess=c('center', 'scale')
                )

pls_Result <- pls_Fit$results
pls_Fit

## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 121, 120, 119, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.78558  0.3512671   9.984752
##    2     11.48566  0.4770397   8.255759
##    3     11.68110  0.4609739   8.677375
##    4     11.85819  0.4512404   8.909198
##    5     11.74985  0.4588907   9.022578
##    6     11.88420  0.4591695   9.075945
##    7     12.06412  0.4541519   9.255652
##    8     12.11218  0.4698237   9.401114
##    9     12.15479  0.4783946   9.271606
##   10     12.42207  0.4654890   9.326995
##   11     12.70168  0.4581171   9.406034
##   12     12.94741  0.4575447   9.663124
##   13     13.10568  0.4468662   9.735696
##   14     13.27580  0.4408783   9.883621
##   15     13.49070  0.4393861   9.954010
##   16     13.92057  0.4236587  10.308067
##   17     14.30686  0.4144228  10.473196
##   18     14.73955  0.3977299  10.818888
##   19     14.91533  0.3963792  11.034378
##   20     15.29482  0.3823758  11.384072
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 9.

Predict the response for the test set. What is the test set estimate of R2?

R2 test set is 0.2986943.

pls_Pred <- predict(pls_Fit, newdata=X_test)
postResample(pred=pls_Pred, obs=y_test)

##      RMSE  Rsquared       MAE 
## 11.455821  0.530198  8.381798

Try building other models discussed in this chapter. Do any have better prediective performance?

FOr these problem, there will be 3 models. The model with the MAX R2 will be the Lasso model showing us R2 0.5459518

set.seed(17)
r_Fit <- train(x=X_train,
                  y=y_train,
                  method='ridge',
                  metric='Rsquared',
                  tuneGrid=data.frame(.lambda = seq(0, 1, by=0.1)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale')
                  )

## Warning: model fit failed for Fold02: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed

## Warning: model fit failed for Fold10: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.

r_Fit

## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 121, 120, 119, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE      
##   0.0     17.69824  0.3387828  12.323778
##   0.1     13.16484  0.4586224   9.573189
##   0.2     12.90509  0.4751686   9.501242
##   0.3     13.03168  0.4820743   9.650479
##   0.4     13.30036  0.4863153   9.935894
##   0.5     13.67743  0.4894886  10.256849
##   0.6     14.14675  0.4912908  10.638401
##   0.7     14.66246  0.4929628  11.020308
##   0.8     15.21765  0.4943953  11.426074
##   0.9     15.81805  0.4954898  11.900490
##   1.0     16.44342  0.4964945  12.452772
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 1.

plot(r_Fit)

set.seed(17)
lasso_Fit <- train(x=X_train,
                  y=y_train,
                  method='lasso',
                  metric='Rsquared',
                  tuneGrid=data.frame(.fraction = seq(0, 0.5, by=0.05)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale')
                  )

## Warning: model fit failed for Fold02: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed

## Warning: model fit failed for Fold10: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.

## Warning in train.default(x = X_train, y = y_train, method = "lasso", metric =
## "Rsquared", : missing values found in aggregated results

lasso_Fit

## The lasso 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 121, 120, 119, 119, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE      
##   0.00      14.99248        NaN  11.899869
##   0.05      12.04587  0.5004433   8.961563
##   0.10      12.58317  0.4619831   8.828892
##   0.15      14.10187  0.4392581   9.776441
##   0.20      15.02759  0.4202044  10.174406
##   0.25      15.08830  0.4083345  10.103200
##   0.30      15.14589  0.4052072  10.229122
##   0.35      15.18624  0.4047464  10.293555
##   0.40      15.35506  0.3980124  10.405897
##   0.45      15.58076  0.3886588  10.627774
##   0.50      15.84203  0.3764416  10.839115
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.05.

plot(lasso_Fit)

set.seed(1)
enet_Fit <- train(x=X_train,
                 y=y_train,
                 method='enet',
                 metric='Rsquared',
                 tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), 
                                      .lambda = seq(0, 1, by=0.1)),
                 trControl=trainControl(method='cv'),
                 preProcess=c('center','scale')
                  )

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.

## Warning in train.default(x = X_train, y = y_train, method = "enet", metric =
## "Rsquared", : missing values found in aggregated results

enet_Fit

## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 120, 119, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.0     0.0       15.41970        NaN  12.091419
##   0.0     0.1       11.61723  0.4611141   8.257619
##   0.0     0.2       12.14064  0.4505336   8.783101
##   0.0     0.3       12.78913  0.4447343   9.268132
##   0.0     0.4       13.58610  0.4374408   9.827298
##   0.0     0.5       14.48706  0.4210673  10.382041
##   0.0     0.6       15.26000  0.4102524  10.671864
##   0.0     0.7       15.93989  0.4048366  10.913214
##   0.0     0.8       16.69682  0.3992267  11.321393
##   0.0     0.9       17.48924  0.3871120  11.763618
##   0.0     1.0       18.58494  0.3778306  12.330318
##   0.1     0.0       15.41970        NaN  12.091419
##   0.1     0.1       11.35996  0.4915684   8.017335
##   0.1     0.2       11.27531  0.5015887   8.043442
##   0.1     0.3       11.20513  0.5110566   8.027647
##   0.1     0.4       11.26273  0.5110763   8.129069
##   0.1     0.5       11.49214  0.5022354   8.305879
##   0.1     0.6       11.71000  0.4927793   8.430194
##   0.1     0.7       11.90351  0.4878547   8.538342
##   0.1     0.8       12.11296  0.4819953   8.661963
##   0.1     0.9       12.33892  0.4753494   8.822856
##   0.1     1.0       12.55343  0.4696305   8.992569
##   0.2     0.0       15.41970        NaN  12.091419
##   0.2     0.1       11.29819  0.4971055   8.081225
##   0.2     0.2       11.34551  0.5072120   7.928544
##   0.2     0.3       11.28566  0.5174924   7.990192
##   0.2     0.4       11.30461  0.5220751   8.080592
##   0.2     0.5       11.46627  0.5177649   8.298248
##   0.2     0.6       11.62510  0.5109649   8.433006
##   0.2     0.7       11.82620  0.5055078   8.564932
##   0.2     0.8       12.00071  0.5017981   8.651009
##   0.2     0.9       12.22749  0.4948431   8.813655
##   0.2     1.0       12.42486  0.4892806   8.962422
##   0.3     0.0       15.41970        NaN  12.091419
##   0.3     0.1       11.27466  0.5009573   8.091070
##   0.3     0.2       11.47359  0.5099557   7.881453
##   0.3     0.3       11.47396  0.5213435   8.063961
##   0.3     0.4       11.49929  0.5262336   8.167894
##   0.3     0.5       11.65714  0.5240146   8.410613
##   0.3     0.6       11.79819  0.5187685   8.587430
##   0.3     0.7       11.98769  0.5141657   8.760497
##   0.3     0.8       12.19169  0.5096006   8.902773
##   0.3     0.9       12.39220  0.5049634   9.058682
##   0.3     1.0       12.59417  0.4995325   9.233599
##   0.4     0.0       15.41970        NaN  12.091419
##   0.4     0.1       11.25404  0.5056775   8.080373
##   0.4     0.2       11.63823  0.5113009   7.858957
##   0.4     0.3       11.71973  0.5236318   8.190912
##   0.4     0.4       11.76623  0.5286027   8.344216
##   0.4     0.5       11.94615  0.5268705   8.576430
##   0.4     0.6       12.09957  0.5230824   8.833440
##   0.4     0.7       12.29631  0.5190477   9.046553
##   0.4     0.8       12.51110  0.5146867   9.232418
##   0.4     0.9       12.72124  0.5100153   9.400000
##   0.4     1.0       12.93584  0.5044749   9.573351
##   0.5     0.0       15.41970        NaN  12.091419
##   0.5     0.1       11.24428  0.5077260   8.072662
##   0.5     0.2       11.80485  0.5129863   7.828495
##   0.5     0.3       12.00793  0.5251863   8.351482
##   0.5     0.4       12.09280  0.5299907   8.561475
##   0.5     0.5       12.30083  0.5288039   8.822076
##   0.5     0.6       12.47467  0.5261406   9.117000
##   0.5     0.7       12.68591  0.5224979   9.369244
##   0.5     0.8       12.90625  0.5186852   9.569376
##   0.5     0.9       13.13479  0.5136638   9.763197
##   0.5     1.0       13.35206  0.5085455   9.954409
##   0.6     0.0       15.41970        NaN  12.091419
##   0.6     0.1       11.24510  0.5086957   8.058453
##   0.6     0.2       12.00165  0.5143831   7.848258
##   0.6     0.3       12.34253  0.5257645   8.540670
##   0.6     0.4       12.47138  0.5308852   8.805267
##   0.6     0.5       12.71386  0.5299948   9.132140
##   0.6     0.6       12.91580  0.5278109   9.424597
##   0.6     0.7       13.14201  0.5245287   9.706237
##   0.6     0.8       13.37205  0.5213038   9.919503
##   0.6     0.9       13.61274  0.5164102  10.144165
##   0.6     1.0       13.83953  0.5114326  10.351894
##   0.7     0.0       15.41970        NaN  12.091419
##   0.7     0.1       11.24866  0.5092526   8.040058
##   0.7     0.2       12.21310  0.5152180   7.909587
##   0.7     0.3       12.71338  0.5259876   8.746622
##   0.7     0.4       12.89518  0.5311812   9.102461
##   0.7     0.5       13.17641  0.5305080   9.466376
##   0.7     0.6       13.41332  0.5285888   9.771724
##   0.7     0.7       13.65402  0.5258617  10.054410
##   0.7     0.8       13.89839  0.5229935  10.301344
##   0.7     0.9       14.14595  0.5185389  10.538146
##   0.7     1.0       14.38595  0.5135687  10.753672
##   0.8     0.0       15.41970        NaN  12.091419
##   0.8     0.1       11.25438  0.5092670   8.022478
##   0.8     0.2       12.43212  0.5159349   7.999328
##   0.8     0.3       13.10814  0.5258492   8.960724
##   0.8     0.4       13.35914  0.5307945   9.435246
##   0.8     0.5       13.67805  0.5304771   9.816573
##   0.8     0.6       13.95180  0.5287587  10.124423
##   0.8     0.7       14.20887  0.5266314  10.431117
##   0.8     0.8       14.47080  0.5239292  10.705678
##   0.8     0.9       14.72472  0.5199382  10.947697
##   0.8     1.0       14.97598  0.5152163  11.169986
##   0.9     0.0       15.41970        NaN  12.091419
##   0.9     0.1       11.26398  0.5091136   8.000174
##   0.9     0.2       12.67388  0.5162232   8.143464
##   0.9     0.3       13.51715  0.5256316   9.245476
##   0.9     0.4       13.85627  0.5302268   9.805245
##   0.9     0.5       14.20852  0.5303995  10.197555
##   0.9     0.6       14.51987  0.5288872  10.527307
##   0.9     0.7       14.79832  0.5270899  10.845497
##   0.9     0.8       15.07545  0.5245864  11.125941
##   0.9     0.9       15.33779  0.5210748  11.373539
##   0.9     1.0       15.60098  0.5165384  11.612292
##   1.0     0.0       15.41970        NaN  12.091419
##   1.0     0.1       11.27752  0.5088380   7.975301
##   1.0     0.2       12.93414  0.5159634   8.304355
##   1.0     0.3       13.93146  0.5254618   9.522955
##   1.0     0.4       14.37678  0.5295917  10.179910
##   1.0     0.5       14.76285  0.5301598  10.590262
##   1.0     0.6       15.11221  0.5288528  10.965464
##   1.0     0.7       15.41130  0.5273473  11.279058
##   1.0     0.8       15.70466  0.5250636  11.577120
##   1.0     0.9       15.98121  0.5216979  11.869047
##   1.0     1.0       16.25240  0.5175524  12.137198
## 
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.4 and lambda = 0.7.

plot(enet_Fit)

f. Would you recommend any of your models to replace the permeability laboratory experiment?

I would not recommend any of the models used to replace the permeability laboratory experiment. The histo graph below will show us the majority of the results, which are below 5.

hist(permeability)

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is the understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch.

Start R and use these commands to load the data:

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
df.raw <- ChemicalManufacturingProcess

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

library(mice)

## Warning: package 'mice' was built under R version 4.0.5

## 
## Attaching package: 'mice'

## The following object is masked from 'package:stats':
## 
##     filter

## The following objects are masked from 'package:base':
## 
##     cbind, rbind

imp_Data <- mice(df.raw, meth='pmm', seed=500, printFlag = FALSE)

## Warning: Number of logged events: 675

df <- complete(imp_Data,1)
# Split df into predictor columns and yield column data frames
df.pred <- df[, -which(names(df) == "Yield")]
df.yield <- df[, which(names(df) == "Yield")]

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

 # Split the data into training and test sets
set.seed(139)

trn.rows <- createDataPartition(df$Yield, p = 0.80, list = FALSE)
 # Pre-processing, removed highly correlated and near-zero predictors

df.pp <- preProcess(df.pred, method = c("BoxCox", "center", "scale", "nzv", "corr"))
df.pred_trans <- predict(df.pp, df.pred)
# Train model
df.yield <- as.data.frame(df.yield)
df.pred_train <- df.pred_trans[trn.rows, ]
df.yield_train <- df.yield[trn.rows, ]
df.pred_test <- df.pred_trans[-trn.rows, ]
df.yield_test <- df.yield[-trn.rows, ]

set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
model <- train(x = df.pred_train, y = df.yield_train, 
method = "lmStepAIC", trControl = train.control, trace = FALSE)
model$results

##   parameter     RMSE  Rsquared      MAE   RMSESD RsquaredSD     MAESD
## 1      none 1.690168 0.5082275 1.198341 1.274591  0.2478585 0.5173695

model$finalModel

## 
## Call:
## lm(formula = .outcome ~ BiologicalMaterial05 + ManufacturingProcess01 + 
##     ManufacturingProcess03 + ManufacturingProcess04 + ManufacturingProcess09 + 
##     ManufacturingProcess10 + ManufacturingProcess13 + ManufacturingProcess18 + 
##     ManufacturingProcess20 + ManufacturingProcess26 + ManufacturingProcess28 + 
##     ManufacturingProcess29 + ManufacturingProcess32 + ManufacturingProcess33 + 
##     ManufacturingProcess35 + ManufacturingProcess37 + ManufacturingProcess43 + 
##     ManufacturingProcess45, data = dat)
## 
## Coefficients:
##            (Intercept)    BiologicalMaterial05  ManufacturingProcess01  
##                40.1278                  0.2944                  0.1446  
## ManufacturingProcess03  ManufacturingProcess04  ManufacturingProcess09  
##                -0.1290                  0.4066                  0.8204  
## ManufacturingProcess10  ManufacturingProcess13  ManufacturingProcess18  
##                -0.3212                 -0.4319                  2.1384  
## ManufacturingProcess20  ManufacturingProcess26  ManufacturingProcess28  
##                -2.0278                 -1.3721                 -0.2547  
## ManufacturingProcess29  ManufacturingProcess32  ManufacturingProcess33  
##                 1.5687                  1.6208                 -0.7606  
## ManufacturingProcess35  ManufacturingProcess37  ManufacturingProcess43  
##                -0.1428                 -0.1946                  0.1735  
## ManufacturingProcess45  
##                 0.2959

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

df.pred_test_results <- predict(model, df.pred_test)
df.compare <- defaultSummary(data.frame(obs = df.yield_test, pred = df.pred_test_results))
df.compare

##      RMSE  Rsquared       MAE 
## 1.1619742 0.6380871 0.9782865

df.compare2 <- defaultSummary(data.frame(obs = df.yield_test[2:32], pred = df.pred_test_results[2:32]))
df.compare2

##      RMSE  Rsquared       MAE 
## 1.1777534 0.6289377 0.9952139

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

model.imp <- varImp(model)
ggplot(model.imp, top = 10) + ggtitle("Top 10 Predictors in Model")

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

From looking at the top 10, I noticed that 7 are manufacturing processes and 3 of them are biological material.

par(mfrow=c(2,3))
plot(df$ManufacturingProcess32, df$Yield, xlab = "ManufacturingProcess32", ylab = "Yield")
plot(df$BiologicalMaterial06, df$Yield, xlab = "BiologicalMaterial06", ylab = "Yield")
plot(df$ManufacturingProcess13, df$Yield, xlab = "ManufacturingProcess13", ylab = "Yield")
plot(df$BiologicalMaterial03, df$Yield, xlab = "BiologicalMaterial03", ylab = "Yield")
plot(df$ManufacturingProcess17, df$Yield, xlab = "ManufacturingProcess17", ylab = "Yield")
plot(df$ManufacturingProcess36, df$Yield, xlab = "ManufacturingProcess36", ylab = "Yield")

HW 7

Amanda Arce

4/14/2021