Libraries

6.2 Developing a model to predict permeability (see Sect.1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

a. Start R and use these commands to load the data:

data(permeability)
dim(fingerprints)
## [1]  165 1107
dim(permeability)
## [1] 165   1
class(fingerprints)
## [1] "matrix"
class(permeability)
## [1] "matrix"

Using the above command, two matrices are loaded. The matrix fingerprints contains 1,107 binary molecular predictors for 165 compounds, while permeability contains permeability response.

b. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

We will use the function nearZeroVar from caret package to check the predictors with low frequencies or near zero frequencies. These predictors will be removed in order to get unbiased results.

388 predictors out of 1,107 were left for modeling after removing the low variance predictors.

nzv <-  nearZeroVar(fingerprints)
fingerprints <- fingerprints[,-nzv]
dim(fingerprints)
## [1] 165 388
## 388 predictors are left for modeling after droping zerovarianc predictors

c Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)

We will use createDataPartition function from caret package to split the data into training and testing sets. Using general rule, we will split in 75-25 ration for our testing and trainig data.

Preprocess: We will use preProcess function from the caret package to process our data for training the model. Within that function, nzv identifies the numeric predictors with near zero variance and remove them rom the model, corr identifies highly correlated predictors and filter them out, center subtracts the mean of the predictor’s data from the predictor values, scale method divides values by the standard deviation of the predictor, BoxCox will be simply transforming the data, and medianimpute will impute the missing values.

Train- We will use traincontrol from caret package, to perform differnt types of cross validatiosn. there are different methods for that, we will use repeatedCV. After this preprocess and train control, we will tune our model on the training data.

https://machinelearningmastery.com/pre-process-your-dataset-in-r/
train_r <- createDataPartition(permeability, p=0.75, list=F)
fp_train <- fingerprints[train_r,]
perm_train <- permeability[train_r,]
fp_test <- fingerprints[-train_r,]
perm_test <- permeability[-train_r]
p_pro <- c("nzv",  "corr", "center","scale", "medianImpute")
t_ctrl <- trainControl(method = "repeatedcv", repeats = 5)
pls_fit <-train(fp_train, perm_train, method="pls",tuneLength=10,  preProcess=p_pro,trControl=t_ctrl)
pls_fit
## Partial Least Squares 
## 
## 125 samples
## 388 predictors
## 
## Pre-processing: centered (103), scaled (103), median imputation
##  (103), remove (285) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 113, 113, 112, 113, 113, 113, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     11.72563  0.4747334  8.743535
##    2     11.20316  0.5334524  8.367818
##    3     11.42837  0.5103012  8.831527
##    4     11.58258  0.4972196  8.765828
##    5     11.61416  0.4936065  8.754583
##    6     11.76662  0.4898154  8.776972
##    7     12.24152  0.4662024  8.999535
##    8     12.68327  0.4508898  9.502555
##    9     13.04750  0.4382922  9.823628
##   10     13.20644  0.4353327  9.881539
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.
plot(pls_fit)

print( sprintf( "%-10s: No. of optimal variables", which.min(pls_fit$results[,2])) )
## [1] "2         : No. of optimal variables"
print( sprintf( "%-10s: Corresponding r-squared", which.min(pls_fit$results[,3])) )
## [1] "10        : Corresponding r-squared"

d Predict the response for the test set. What is the test set estimate of \(R^2\)

After we fit our model, we will pedict the response for our tst set using predict function.

pls_pred <- predict(pls_fit, newdata=fp_test)
postResample(pred = pls_pred, obs = perm_test)
##       RMSE   Rsquared        MAE 
## 11.6628678  0.4115886  7.9859438

e. Try building other models discussed in this chapter. Do any have better predictive performance?

We tried to build two additional models besides PLS. Linear model and elatic net using similar preprocessing the data. We recieved warning messages loop in linear model. Research revealed that this might be due to using oomany predictors in our model. Although we removed near zero variance predictors and highly correlated predictors in the preprocessing, the lm was ## however not able to fit properly. This model might notbe a good fit. Elastic net model was built using the similar preporcessing on the data.

lm_fit = suppressWarnings(train(fp_train, perm_train, method="lm", preProcess= p_pro, trControl=t_ctrl))
# warning : - prediction from a rank-deficient fit may be misleading
## This might be due to using oomany predictors in our model. Although we removed near zero variance predictors and highly correlated predictors in the preprocessing, the lm was ## however not able to fit properly. This model might notbe a good fit. 
# https://discuss.analyticsvidhya.com/t/what-does-the-warning-message-prediction-from-a-rank-deficient-fit-mean-in-logistic-regression/4282
lm_pred <- predict(lm_fit,newdata=fp_test)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
postResample(pred = lm_pred, obs = perm_test)
##        RMSE    Rsquared         MAE 
## 34.63448647  0.06670543 21.13009032
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))
enet_fit <- train(fp_train, perm_train, method="enet", tuneGrid = enetGrid,preProcess= p_pro, trControl=t_ctrl)
enet_pred <- predict(enet_fit,newdata=fp_test)
postResample(pred = enet_pred, obs = perm_test)
##       RMSE   Rsquared        MAE 
## 12.1794890  0.3604331  8.8611718

f Would you recommend any of your models to replace the permeability laboratory experiment?

resamp = resamples(list(pls=pls_fit,enet=enet_fit, lm_fit) ) 
print(summary(resamp))
## 
## Call:
## summary.resamples(object = resamp)
## 
## Models: pls, enet, Model3 
## Number of resamples: 50 
## 
## MAE 
##            Min.   1st Qu.    Median      Mean   3rd Qu.     Max. NA's
## pls    4.179468  6.906092  7.868679  8.367818  9.287634 15.05106    0
## enet   4.663404  7.200955  8.045945  8.302044  8.996797 12.85238    0
## Model3 7.289643 13.829058 19.275556 20.865127 24.170777 51.78726    0
## 
## RMSE 
##             Min.   1st Qu.   Median     Mean  3rd Qu.      Max. NA's
## pls     5.343196  8.778338 10.34084 11.20316 13.67811  20.04514    0
## enet    6.523553  9.141823 11.07884 11.39359 13.32839  17.41499    0
## Model3 11.105107 19.729713 30.31001 34.29961 40.23166 109.10411    0
## 
## Rsquared 
##                Min.    1st Qu.    Median      Mean   3rd Qu.      Max.
## pls    0.0101119258 0.36748215 0.6014957 0.5334524 0.7237201 0.8587618
## enet   0.0420575402 0.34678604 0.5424824 0.5125444 0.7347710 0.8907623
## Model3 0.0002753923 0.05213484 0.2481770 0.2580875 0.3981954 0.7121657
##        NA's
## pls       0
## enet      0
## Model3    0
#https://stackoverflow.com/questions/36482475/create-data-partition-into-training-testing-and-validation-split-in-r
#https://stackoverflow.com/questions/46572275/the-train-function-in-r-caret-package
#https://www.rdocumentation.org/packages/caret/versions/3.21/topics/postResample
#https://machinelearningmastery.com/pre-process-your-dataset-in-r/ 

Comparing between pls model and enet model, we can prefer enet model.

6.3 A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the re-

lationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing pro-cess. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# http://appliedpredictivemodeling.com/errata
processPredictors = as.matrix(ChemicalManufacturingProcess[,2:58])
yield = ChemicalManufacturingProcess[,1]  

b.) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect.3.8).

We will use bag impute to treat the missing values. The data will be processed in the process of training the model using the function preprocess and method knnimpute. We will treat the missing values and check here also.

processPredictors_m <- preProcess(processPredictors, method="medianImpute", verbose=T)
## Computing medians for 57 predictors... done

c). Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

Using createDataPartition in the caret package, we will split the data and preprocess using the methods as above. nzv identifies the numeric predictors with near zero variance and remove them rom the model, corr identifies highly correlated predictors and filter them out, center subtracts the mean of the predictor’s data from the predictor values, scale method divides values by the standard deviation of the predictor, BoxCox will be simply transforming the data, and medianimpute will impute the missing values. We will fit partial least squares on training data.

train_r <- createDataPartition(yield, p=0.75, list=F)
pp_train <- ChemicalManufacturingProcess[train_r,-1]
y_train <-  ChemicalManufacturingProcess[train_r,1]
pp_test <- ChemicalManufacturingProcess[-train_r,-1]
y_test <-  ChemicalManufacturingProcess[-train_r,1]
p_pro <- c("nzv",  "corr", "center","scale", "medianImpute")
tr_ctrl =trainControl(method = "boot", number=5)
pls_fit <-train(pp_train, y_train, method="pls", tuneLength = 10, preProcess=p_pro,trControl=t_ctrl)
pls_fit
## Partial Least Squares 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (49), scaled (49), median imputation (49),
##  remove (8) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 119, 120, 120, 119, 120, 117, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.553330  0.4788648  1.197820
##    2     1.667977  0.5264871  1.182179
##    3     1.438762  0.5570214  1.104316
##    4     1.467156  0.5458387  1.117981
##    5     1.531955  0.5296412  1.139940
##    6     1.564005  0.5303612  1.172071
##    7     1.745984  0.5199421  1.247957
##    8     1.853568  0.5105140  1.275794
##    9     1.897786  0.5109370  1.286091
##   10     1.874342  0.5138742  1.274363
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
plot(pls_fit)

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

pls_pred <- predict(pls_fit, pp_test)
postResample(pred = pls_pred, obs = y_test)
##      RMSE  Rsquared       MAE 
## 1.7144886 0.2357345 1.0800501

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

The dominant predictors will be identified using varImp function in caret package.

impv <- varImp(pls_fit)
## Warning: package 'pls' was built under R version 3.5.3
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
plot(impv, top=20)

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

We will create a list of top five dominant variables and will check the correlation between the variables itself and response variables. Exploring this will provide us better picture to pick bewteen the dominat ratio of biolgical vs manufacturing predictors. Negative correlations and positive correlations between the predictors and response will further provide us an overview for tuning the model and reducing the predictors as wqe have seen in our linear model, more predictors will make the models complex and it can impact the model fit. We explored the dominant models in step above, we got an inference that manufacturing predictors were dominant than bilogical.

imp_train <- pp_train %>%select(ManufacturingProcess32,ManufacturingProcess13,ManufacturingProcess36, ManufacturingProcess09, ManufacturingProcess17)
cor(imp_train)
##                        ManufacturingProcess32 ManufacturingProcess13
## ManufacturingProcess32             1.00000000             -0.1459677
## ManufacturingProcess13            -0.14596770              1.0000000
## ManufacturingProcess36                     NA                     NA
## ManufacturingProcess09             0.06097087             -0.8254165
## ManufacturingProcess17            -0.07185669              0.8045073
##                        ManufacturingProcess36 ManufacturingProcess09
## ManufacturingProcess32                     NA             0.06097087
## ManufacturingProcess13                     NA            -0.82541646
## ManufacturingProcess36                      1                     NA
## ManufacturingProcess09                     NA             1.00000000
## ManufacturingProcess17                     NA            -0.76063541
##                        ManufacturingProcess17
## ManufacturingProcess32            -0.07185669
## ManufacturingProcess13             0.80450727
## ManufacturingProcess36                     NA
## ManufacturingProcess09            -0.76063541
## ManufacturingProcess17             1.00000000
cor(imp_train, y_train)
##                              [,1]
## ManufacturingProcess32  0.6106951
## ManufacturingProcess13 -0.5605128
## ManufacturingProcess36         NA
## ManufacturingProcess09  0.5243719
## ManufacturingProcess17 -0.5164468