Exercise 6.2: Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have sufficient permeability to become a drug:

a) Start R and use these commands to load the data:

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

data(permeability)

b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

The data set had 719 near zero variables. The applicable data set is now down to 388.

bad_fingers <- nearZeroVar(fingerprints)
perm_finger <- fingerprints[,-bad_fingers]

length(bad_fingers) 
## [1] 719
dim(perm_finger)[2]
## [1] 388

c) Split the data into a training and test set, pre-process the data, and tune a PLS model. How many latent variable are optimal and what is the corresponding resampled estimate of R squared?

Pre-processing

Pre-processing will consist of centering and scaling the data.The elasticnet model won’t work with scaling, so that model will only be centered.

cl<-makeCluster(detectCores())
registerDoParallel(cl)

set.seed(658)

seeds <- vector(mode = "list", length = 11)
for(i in 1:10) seeds[[i]]<- sample.int(n=1000, 54)
#for the last model
seeds[[11]]<-sample.int(1000, 1)

myControl <- trainControl(method='cv', seeds=seeds)

new_perm <- as.data.frame(perm_finger)

perm_train <- createDataPartition(permeability, p = 0.7, list = FALSE)

X_training <- new_perm[perm_train, ]
y_training <- permeability[perm_train, ]
X_testing <- new_perm[-perm_train, ]
y_testing <- permeability[-perm_train, ]
pls_model <- train(x=X_training,
                y=y_training, 
                method='pls',
                metric='Rsquared',
                trControl=myControl,
                preProcess=c('center', 'scale')
                )
pls_model$results %>%
  filter(ncomp == pls_model$bestTune$ncomp) %>%
  select(ncomp, RMSE, Rsquared)
##   ncomp     RMSE  Rsquared
## 1     3 12.44402 0.4636497

d) Predict the response for the test set. What is the set estimate of R squared?

<pThe R-squared of the training data was .4636, while the R-squared of the testing was .5100. This represents an increase in predictive power of 10%.

pls_pred <- predict(pls_model, newdata=X_testing)
postResample(pred=pls_pred, obs=y_testing)
##       RMSE   Rsquared        MAE 
## 10.6990767  0.5100389  8.4541193

e) Try building other models discussed in this chapter. Do any have better predictive performance?

The ridge regression’s highest R-squared score was .4818 with a lambda of 1. This is an increase from .4636, or .0182, which is a 3.9% increase in predictive power on the training set.

Also of note, one was the highest lambda value tested, so it is possible that if the range for lambda is increased that the R-squared value might also increase. Further testing should be conducted to see if that is true.

The elasticnet’s highest R-squared value was .4722. Which is a slight increase from the partial least squares model’s training R-square, yet it is also slightly lower than the ridge regression’s R-squared on the training data.

The lasso model was the worst performer of all the models. It’s highest R-squared value was .4509

Initially on the training data the ridge regression had the best R-squared score being ~4% better than the partial least squares model. This score could possibly be improved further by increasing the lambda value in the ridge regression.

ruby <- train( x=X_training,
                y=y_training,
                method='ridge',
                metric='Rsquared',
                tuneGrid=expand.grid(lambda = seq(0, 1,by=0.1)),
                trControl=myControl,
                preProcess=c('center','scale'),
                na.action = na.omit
                  )
ruby
## Ridge Regression 
## 
## 117 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 105, 105, 105, 106, 106, 105, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE      
##   0.0     15.26950  0.3654767  11.297176
##   0.1     12.96579  0.4526074   9.793751
##   0.2     12.94546  0.4602483   9.810543
##   0.3     13.11757  0.4669079  10.008800
##   0.4     13.39234  0.4724103  10.278381
##   0.5     13.78452  0.4764512  10.635906
##   0.6     14.24628  0.4784520  11.065704
##   0.7     14.77672  0.4800700  11.560552
##   0.8     15.36053  0.4810695  12.086370
##   0.9     15.98674  0.4816370  12.624526
##   1.0     16.64840  0.4818382  13.160313
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 1.
elasty <- train(x=X_training,
                 y=y_training,
                 method='enet',
                 metric='Rsquared',
                 tune=expand.grid(fraction=seq(0, 1, by=0.1), 
                                      lambda =seq(0, 1, by=0.1)),
                 trControl=myControl,
                 preProcess=c('center'),
                 na.action = na.omit
                  )
elasty
## Elasticnet 
## 
## 117 samples
## 388 predictors
## 
## Pre-processing: centered (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 105, 105, 105, 106, 106, 105, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE         Rsquared   MAE         
##   0e+00   0.050        12.73731  0.4635880      9.683101
##   0e+00   0.525        14.03838  0.3840277     10.450708
##   0e+00   1.000        17.43465  0.3596392     12.674035
##   1e-04   0.050      3256.49276  0.3875031   2166.900249
##   1e-04   0.525     31388.76539  0.2802239  20853.673198
##   1e-04   1.000     59361.10575  0.2331990  39433.057183
##   1e-01   0.050        12.37650  0.4722650      9.515095
##   1e-01   0.525        12.34334  0.4617231      9.171237
##   1e-01   1.000        12.96978  0.4525790      9.798515
## 
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.05 and lambda = 0.1.
lassy<- train(x=X_training,
                  y=y_training,
                  method='lasso',
                  metric='Rsquared',
                  tuneGrid=data.frame(.fraction = seq(0, 0.5, by=0.05)),
                  trControl=myControl,
                  preProcess=c('center','scale')
                  )
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
## Warning in train.default(x = X_training, y = y_training, method = "lasso", :
## missing values found in aggregated results
lassy
## The lasso 
## 
## 117 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 105, 105, 105, 106, 106, 105, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE      
##   0.00      15.75590        NaN  12.641787
##   0.05      12.72988  0.4509719   9.813431
##   0.10      12.10918  0.4398276   8.913847
##   0.15      12.08650  0.4261224   8.661598
##   0.20      12.33885  0.4071344   8.797435
##   0.25      12.40495  0.4097625   8.919987
##   0.30      12.56445  0.4085810   9.121651
##   0.35      12.80054  0.4026614   9.302328
##   0.40      13.04872  0.3985966   9.593485
##   0.45      13.18989  0.4007178   9.755427
##   0.50      13.29458  0.4041847   9.850746
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.05.
stopCluster(cl)

f) Would you recommend any of your models to replace the permeability laboratory experiment?

No I would not recommend any of these models as are, to replace the permeability laboratory experiment. There doesn’t seem to be a strong enough relationship between the predictors and the predicted. These models might be able to be tuned much better than what I have done, where they become a viable option as replacement. As is, these are not viable options.

Exercise 6.3 A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing (predictors), and the response of product yield. Biological predictors cannot be changed but can beused to assess the quality ofthe raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yeild by 1% will boost revenue by approximately one hundred thousand dollars per batch:

a) Start R and use these commands to load the data:

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yeild contains the percent yield for each run.

data(ChemicalManufacturingProcess)

b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8.).

Of the columns that are missing data, most are only missing one or two values. ManufacturingProcess03 is missing the most values with 15 missing, which is just over 8% of the observations. Manufacturing Processes 10 and 11 are also missing a decent amount, missing 9 and 10 values.

For this dataset I am going to use a knn imputation. Which will remove all na values, while also centering and scaling all the data.

colSums(is.na(ChemicalManufacturingProcess))
##                  Yield   BiologicalMaterial01   BiologicalMaterial02 
##                      0                      0                      0 
##   BiologicalMaterial03   BiologicalMaterial04   BiologicalMaterial05 
##                      0                      0                      0 
##   BiologicalMaterial06   BiologicalMaterial07   BiologicalMaterial08 
##                      0                      0                      0 
##   BiologicalMaterial09   BiologicalMaterial10   BiologicalMaterial11 
##                      0                      0                      0 
##   BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02 
##                      0                      1                      3 
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 
##                     15                      1                      1 
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 
##                      2                      1                      1 
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11 
##                      0                      9                     10 
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14 
##                      1                      0                      1 
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17 
##                      0                      0                      0 
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20 
##                      0                      0                      0 
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23 
##                      0                      1                      1 
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 
##                      1                      5                      5 
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 
##                      5                      5                      5 
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32 
##                      5                      5                      0 
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 
##                      5                      5                      5 
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38 
##                      5                      0                      0 
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41 
##                      0                      1                      1 
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44 
##                      0                      0                      0 
## ManufacturingProcess45 
##                      0
only_yield <- subset(ChemicalManufacturingProcess, select = c(Yield))
no_yield <- subset(ChemicalManufacturingProcess, select = -c(Yield))
imputed_data <- preProcess(no_yield, "knnImpute")
fixed_cmp <- predict(imputed_data,
                     no_yield)

c) Split the data into training and a test set, preprocess the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

Using a Ridge Regression model, the optimal performance metric is a Ridge Regression with a lambda of 0.4. This model has an R-squared of .5276 and an RMSE of 1.897.

cl2<-makeCluster(detectCores())
registerDoParallel(cl2)

set.seed(791)
training2 <- createDataPartition(only_yield$Yield, p=0.7, list=FALSE)
X_training2 <- fixed_cmp[training2, ]
y_training2 <- only_yield$Yield[training2]
X_testing2 <- fixed_cmp[-training2, ]
y_testing2 <- only_yield$Yield[-training2]
seeds <- vector(mode = "list", length = 11)
for(i in 1:10) seeds[[i]]<- sample.int(n=1000, 54)
#for the last model
seeds[[11]]<-sample.int(1000, 1)
myControl <- trainControl(method='cv', seeds=seeds)
ruby2 <- train( x=X_training2,
                y=y_training2,
                method='ridge',
                metric='RMSE',
                tuneGrid=expand.grid(.lambda = seq(0, 2, by=0.2)), 
                trControl=myControl
                )
ruby2
## Ridge Regression 
## 
## 124 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 111, 112, 112, 111, 112, 111, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE     
##   0.0     3.794717  0.3112102  2.073043
##   0.2     2.037873  0.5001433  1.306535
##   0.4     1.897186  0.5276175  1.265029
##   0.6     1.909298  0.5233858  1.309832
##   0.8     1.978693  0.5147947  1.374211
##   1.0     2.072019  0.5080121  1.446059
##   1.2     2.176858  0.5031478  1.529156
##   1.4     2.287612  0.4996483  1.619042
##   1.6     2.401310  0.4970583  1.713256
##   1.8     2.516197  0.4950584  1.808623
##   2.0     2.631163  0.4934274  1.904842
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.4.
plot(ruby2)

d)Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

The RMSE and the R-Squared values are both better on the testing data than the training data. The R-Squared improved to .604 compared to .5276, which is an improvement of 14.6% for the model fit. While the RMSE improved from 1.897 to 1.287, which is an improvement of 32%.

ruby2_pred <- predict(ruby2, newdata=X_testing2)
postResample(pred=ruby2_pred, obs=y_testing2)
##      RMSE  Rsquared       MAE 
## 1.2879312 0.6047784 1.0554196

e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

The top 10 predictors of importance are all manufacturing processes. This is not surprising since there are roughly 3.75 manufacturing processes to every 1 biological material.

importance <- predict(ruby2$finalModel, s=ruby2$bestTune[1, "lambda"], type="coef", mode="penalty")$coefficients
importance_sort <- abs(importance)
importance_sort<- importance_sort[importance_sort > 0]
importance_sort <- sort(importance_sort, decreasing = T)
importance_sort[0:10]
## ManufacturingProcess32 ManufacturingProcess36 ManufacturingProcess09 
##              0.5196149              0.3689769              0.3607273 
## ManufacturingProcess13 ManufacturingProcess17 ManufacturingProcess37 
##              0.3473881              0.3369944              0.2281104 
## ManufacturingProcess15 ManufacturingProcess11 ManufacturingProcess34 
##              0.2159516              0.2152899              0.1962589 
## ManufacturingProcess12 
##              0.1750054

f) Explore the relationships between each of the top predictors and the response. How could this information be helpfull in improving yield in future runs of the manufacturing process?

I believe these predictors have the largest relationship with yield. So if you want to improve yield you would start with the processes that play the biggest role in yield. Also, these processes could be models for all the other processes. Where you look for differences between processes that might affect the yield.

Such as, maybe these processes happen in different facilities that are in different geographic regions. Maybe these processes are affected by unique geographic characteristics, such as humidity. The processes that perform better are in low humidity geographic areas. This could mean that some processes need to be relocated to regions of low humidity. Thus you could improve yield while also lowering costs of having to keep a facility’s humidity down.