library(tidyverse)
library(mice)
library(caret)
library(e1071)
library(psych)
library(DataExplorer)
library(RANN)
library(MASS)
library(elasticnet)

6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
(a) Start R and use these commands to load the data:
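The data ship with the AppliedPredictiveModeling package:

library(AppliedPredictiveModeling)
data(permeability)   # loads fingerprints (predictors) and permeability (response)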
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
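A minimal sketch of this step, assuming fingerprints_df is the filtered predictor set used in the partitioning code below:

nzv <- nearZeroVar(fingerprints)                        # columns with near-zero variance
fingerprints_df <- as.data.frame(fingerprints[, -nzv])  # keep only the informative fingerprints
ncol(fingerprints_df)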
## [1] 388
There are 388 predictors left for modeling.
(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
set.seed(1)
train_fingerprints <- createDataPartition(permeability, p=0.75, list=FALSE)
finger_train_x <- fingerprints_df[train_fingerprints, ]
finger_train_y <- permeability[train_fingerprints, ]
finger_test_x <- fingerprints_df[-train_fingerprints, ]
finger_test_y <- permeability[-train_fingerprints, ]
set.seed(1)
PLS_model <- train(x=finger_train_x,
y=finger_train_y,
method='pls',
metric='Rsquared',
tuneLength=20,
trControl=trainControl(method='cv'),
preProcess=c('center', 'scale')
)
PLS_model
## Partial Least Squares
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 113, 113, 112, 112, 113, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.68224 0.3836730 9.580939
## 2 11.33319 0.5086760 8.003997
## 3 11.61830 0.4903854 8.701066
## 4 11.69287 0.4808316 8.964132
## 5 11.66322 0.4931162 8.712503
## 6 11.51278 0.5096492 8.423922
## 7 11.79005 0.4947336 8.784082
## 8 11.58711 0.5096538 8.389693
## 9 11.71172 0.5082200 8.361761
## 10 11.85087 0.5007964 8.393775
## 11 11.66169 0.5179056 8.317172
## 12 11.81490 0.5060889 8.470356
## 13 12.14477 0.4846395 8.821789
## 14 12.35655 0.4653682 8.948059
## 15 12.70879 0.4412342 9.311175
## 16 13.16777 0.4071006 9.522325
## 17 13.53504 0.3924018 9.765694
## 18 13.64435 0.3882088 9.678479
## 19 13.74642 0.3813970 9.866950
## 20 13.88144 0.3741560 9.937950
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 11.
The optimal number of latent variables is 11, with a resampled R^2 of 0.5179.
(d) Predict the response for the test set. What is the test set estimate of R2?
PLS_model_pred <- predict(PLS_model, newdata=finger_test_x)
postResample(pred=PLS_model_pred, obs=finger_test_y)
##      RMSE  Rsquared       MAE
## 12.096136  0.496428  8.753309
The test-set R^2 estimate is 0.4964, somewhat lower than the resampled estimate of 0.5179.
(e) Try building other models discussed in this chapter. Do any have better predictive performance?
Linear Model
set.seed(1)
lm_model <- train(x=finger_train_x,
y=finger_train_y,
method='lm',
metric='Rsquared',
tuneLength=20,
trControl=trainControl(method='cv'),
preProc = c("center", "scale") )
lm_model
## Linear Regression
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 113, 113, 112, 112, 113, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 22.78946 0.2885173 15.69457
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
I tried an ordinary linear model for comparison; with 388 predictors and only 125 training samples it is ill-posed, and it gave by far the lowest resampled R^2 (0.289) of all the models.
Ridge Model
## Define the candidate set of values
ridgeGrid <- data.frame(.lambda = seq(0, 1, by=0.1))
set.seed(1)
ridge_model <- train(x=finger_train_x,
y=finger_train_y,
method = "ridge",
tuneGrid = ridgeGrid,
trControl = trainControl(method='cv') ,
preProc = c("center", "scale")
)
ridge_model
## Ridge Regression
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 113, 113, 112, 112, 113, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 11.87179 0.4968166 8.792555
## 0.1 12.20891 0.4831642 8.573664
## 0.2 12.15268 0.5093523 8.633298
## 0.3 12.37568 0.5208169 8.998255
## 0.4 12.77267 0.5262564 9.462909
## 0.5 13.27544 0.5291125 9.964177
## 0.6 13.84680 0.5307843 10.484285
## 0.7 14.48153 0.5316392 11.030774
## 0.8 15.16239 0.5320450 11.620427
## 0.9 15.87961 0.5321711 12.245829
## 1.0 16.62565 0.5321188 12.905411
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.
The best ridge model (lambda = 0, selected by RMSE) has a resampled RMSE of 11.87 and an R^2 of 0.497.
ridge_model_pred <- predict(ridge_model, newdata=finger_test_x)
postResample(pred=ridge_model_pred, obs=finger_test_y)
##       RMSE   Rsquared        MAE
## 12.9605527  0.3993369  8.5320310
On the test set, the ridge model has an RMSE of 12.96 and an R^2 of 0.399.
Elastic Net Model
set.seed(1)
enet_model <- train(x=finger_train_x,
y=finger_train_y,
method='enet',
metric='Rsquared',
tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1),
.lambda = seq(0, 1, by=0.1)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)
enet_model
## Elasticnet
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 113, 113, 112, 112, 113, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.0 0.0 15.49399 NaN 12.326628
## 0.0 0.1 11.57212 0.5079003 8.180273
## 0.0 0.2 11.26345 0.5209785 7.928649
## 0.0 0.3 11.44024 0.5121047 8.069737
## 0.0 0.4 11.53101 0.5088036 8.158015
## 0.0 0.5 11.73077 0.4889010 8.236416
## 0.0 0.6 11.88519 0.4784817 8.427092
## 0.0 0.7 12.03835 0.4680518 8.622767
## 0.0 0.8 11.85334 0.4834996 8.554933
## 0.0 0.9 11.77615 0.4963126 8.605190
## 0.0 1.0 11.87179 0.4968166 8.792555
## 0.1 0.0 15.49399 NaN 12.326628
## 0.1 0.1 11.60682 0.4868923 8.108766
## 0.1 0.2 11.21221 0.5134091 7.861051
## 0.1 0.3 10.82269 0.5428031 7.582170
## 0.1 0.4 10.97811 0.5354053 7.576320
## 0.1 0.5 11.22024 0.5284125 7.735419
## 0.1 0.6 11.38751 0.5254398 7.917165
## 0.1 0.7 11.51884 0.5209136 8.015020
## 0.1 0.8 11.76388 0.5078942 8.234282
## 0.1 0.9 12.02459 0.4933800 8.424731
## 0.1 1.0 12.20891 0.4831642 8.573664
## 0.2 0.0 15.49399 NaN 12.326628
## 0.2 0.1 11.59491 0.4889050 8.139073
## 0.2 0.2 11.45381 0.5045354 7.888623
## 0.2 0.3 11.05524 0.5330971 7.757323
## 0.2 0.4 11.03593 0.5386451 7.701004
## 0.2 0.5 11.19396 0.5361719 7.763050
## 0.2 0.6 11.35458 0.5364813 7.971367
## 0.2 0.7 11.50717 0.5352093 8.132110
## 0.2 0.8 11.71843 0.5281958 8.295294
## 0.2 0.9 11.96002 0.5179472 8.471142
## 0.2 1.0 12.15268 0.5093523 8.633298
## 0.3 0.0 15.49399 NaN 12.326628
## 0.3 0.1 11.58394 0.4903471 8.100621
## 0.3 0.2 11.61481 0.5011155 7.867715
## 0.3 0.3 11.32039 0.5259028 7.905974
## 0.3 0.4 11.20784 0.5393092 7.831133
## 0.3 0.5 11.38370 0.5370329 7.984787
## 0.3 0.6 11.55185 0.5391761 8.203340
## 0.3 0.7 11.73194 0.5394694 8.425060
## 0.3 0.8 11.94710 0.5345508 8.619181
## 0.3 0.9 12.17567 0.5277370 8.826829
## 0.3 1.0 12.37568 0.5208169 8.998255
## 0.4 0.0 15.49399 NaN 12.326628
## 0.4 0.1 11.59222 0.4897373 8.059071
## 0.4 0.2 11.75463 0.4989544 7.830388
## 0.4 0.3 11.62031 0.5195555 8.063408
## 0.4 0.4 11.47838 0.5371835 8.018928
## 0.4 0.5 11.66908 0.5361411 8.232440
## 0.4 0.6 11.86954 0.5392247 8.514532
## 0.4 0.7 12.08607 0.5406064 8.807609
## 0.4 0.8 12.31863 0.5367827 9.049062
## 0.4 0.9 12.55784 0.5321485 9.288048
## 0.4 1.0 12.77267 0.5262564 9.462909
## 0.5 0.0 15.49399 NaN 12.326628
## 0.5 0.1 11.58069 0.4892344 8.000376
## 0.5 0.2 11.89501 0.4973353 7.802022
## 0.5 0.3 11.94262 0.5145833 8.224020
## 0.5 0.4 11.81398 0.5342985 8.265433
## 0.5 0.5 12.03463 0.5342573 8.519582
## 0.5 0.6 12.29309 0.5371820 8.882012
## 0.5 0.7 12.52638 0.5401378 9.217409
## 0.5 0.8 12.78687 0.5375714 9.511956
## 0.5 0.9 13.04521 0.5338213 9.769124
## 0.5 1.0 13.27544 0.5291125 9.964177
## 0.6 0.0 15.49399 NaN 12.326628
## 0.6 0.1 11.57412 0.4883074 7.936659
## 0.6 0.2 12.06478 0.4950943 7.780480
## 0.6 0.3 12.27466 0.5110898 8.403381
## 0.6 0.4 12.19057 0.5316392 8.542455
## 0.6 0.5 12.44829 0.5327154 8.860062
## 0.6 0.6 12.76982 0.5350137 9.274702
## 0.6 0.7 13.03268 0.5389010 9.641580
## 0.6 0.8 13.32693 0.5372657 9.977058
## 0.6 0.9 13.60104 0.5343743 10.254008
## 0.6 1.0 13.84680 0.5307843 10.484285
## 0.7 0.0 15.49399 NaN 12.326628
## 0.7 0.1 11.58002 0.4866645 7.890607
## 0.7 0.2 12.24178 0.4936033 7.767757
## 0.7 0.3 12.61794 0.5088075 8.589239
## 0.7 0.4 12.61160 0.5291910 8.841753
## 0.7 0.5 12.91682 0.5313053 9.231047
## 0.7 0.6 13.29424 0.5331332 9.700262
## 0.7 0.7 13.61361 0.5369971 10.125374
## 0.7 0.8 13.93034 0.5364490 10.492951
## 0.7 0.9 14.22548 0.5341632 10.792614
## 0.7 1.0 14.48153 0.5316392 11.030774
## 0.8 0.0 15.49399 NaN 12.326628
## 0.8 0.1 11.58853 0.4857452 7.851179
## 0.8 0.2 12.43789 0.4919862 7.788830
## 0.8 0.3 12.97567 0.5071434 8.784728
## 0.8 0.4 13.07740 0.5265231 9.168645
## 0.8 0.5 13.43141 0.5298976 9.631066
## 0.8 0.6 13.86554 0.5313883 10.179628
## 0.8 0.7 14.23891 0.5351190 10.649056
## 0.8 0.8 14.57626 0.5355594 11.025626
## 0.8 0.9 14.89365 0.5337658 11.357953
## 0.8 1.0 15.16239 0.5320450 11.620427
## 0.9 0.0 15.49399 NaN 12.326628
## 0.9 0.1 11.59997 0.4848836 7.811398
## 0.9 0.2 12.64637 0.4907037 7.833536
## 0.9 0.3 13.35579 0.5057084 9.017988
## 0.9 0.4 13.57607 0.5241330 9.558738
## 0.9 0.5 13.98369 0.5284859 10.111824
## 0.9 0.6 14.47397 0.5296863 10.705896
## 0.9 0.7 14.89603 0.5331521 11.209826
## 0.9 0.8 15.26056 0.5343882 11.612153
## 0.9 0.9 15.59870 0.5331777 11.960549
## 0.9 1.0 15.87961 0.5321711 12.245829
## 1.0 0.0 15.49399 NaN 12.326628
## 1.0 0.1 11.61831 0.4839836 7.782999
## 1.0 0.2 12.86899 0.4896013 7.923387
## 1.0 0.3 13.75891 0.5044200 9.309156
## 1.0 0.4 14.10368 0.5221169 9.976050
## 1.0 0.5 14.56369 0.5272186 10.613105
## 1.0 0.6 15.10849 0.5281940 11.240835
## 1.0 0.7 15.57667 0.5314206 11.773773
## 1.0 0.8 15.97233 0.5332450 12.223449
## 1.0 0.9 16.33181 0.5325321 12.599066
## 1.0 1.0 16.62565 0.5321188 12.905411
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.3 and lambda = 0.1.
The optimal elastic net (fraction = 0.3, lambda = 0.1) has a resampled RMSE of 10.82 and an R^2 of 0.543, the best resampled performance so far.
enet_model_pred <- predict(enet_model, newdata=finger_test_x)
postResample(pred=enet_model_pred, obs=finger_test_y)
##       RMSE   Rsquared        MAE
## 11.4378605  0.4871299  8.0276679
On the test set, the elastic net has an RMSE of 11.44 and an R^2 of 0.487.
Lasso Model
set.seed(1)
lasso_model <- train(x=finger_train_x,
y=finger_train_y,
method='lasso',
metric='Rsquared',
tuneGrid=data.frame(.fraction = seq(0, 0.5, by=0.05)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)
lasso_model
## The lasso
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 113, 113, 112, 112, 113, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.00 15.49399 NaN 12.326628
## 0.05 12.40236 0.4940001 9.355177
## 0.10 11.57212 0.5079003 8.180273
## 0.15 11.32927 0.5193654 7.991047
## 0.20 11.26345 0.5209785 7.928649
## 0.25 11.25275 0.5244972 7.950370
## 0.30 11.44024 0.5121047 8.069737
## 0.35 11.49146 0.5121655 8.131440
## 0.40 11.53101 0.5088036 8.158015
## 0.45 11.60969 0.4991166 8.138080
## 0.50 11.73077 0.4889010 8.236416
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.25.
The optimal lasso (fraction = 0.25) has a resampled RMSE of 11.25 and an R^2 of 0.524.
lasso_model_pred <- predict(lasso_model, newdata=finger_test_x)
postResample(pred=lasso_model_pred, obs=finger_test_y)
##       RMSE   Rsquared        MAE
## 11.6861943  0.4507955  8.2491897
On the test set, the lasso has an RMSE of 11.69 and an R^2 of 0.451.
(f) Would you recommend any of your models to replace the permeability laboratory experiment?
Although the elastic net was the best model (test-set R^2 of 0.487), none of the models explains even half of the test-set variance, so I would not replace the permeability laboratory experiment.
6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
(a) Start R and use these commands to load the data:
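As with the permeability data, these come from the AppliedPredictiveModeling package:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)   # a data frame holding Yield and the 57 predictors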
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
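One possible imputation, sketched with caret's k-nearest-neighbor preprocessing (knnImpute relies on the RANN package loaded above, and it also centers and scales the predictors); cmp_df is the imputed predictor set used in the split below:

cmp_predictors <- ChemicalManufacturingProcess[, -1]           # drop Yield (the first column)
impute_pp <- preProcess(cmp_predictors, method = "knnImpute")  # k-NN imputation; also centers/scales
cmp_df <- predict(impute_pp, cmp_predictors)
sum(is.na(cmp_df))                                             # should be 0 after imputation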
(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
set.seed(1)
traincmp <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
cmp_train_x <- cmp_df[traincmp, ]
cmp_train_y <- ChemicalManufacturingProcess$Yield[traincmp]
cmp_test_x <- cmp_df[-traincmp, ]
cmp_test_y <- ChemicalManufacturingProcess$Yield[-traincmp]
set.seed(1)
enet_model_cmp <- train(x=cmp_train_x,
y=cmp_train_y,
method = "enet",
tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1),
.lambda = seq(0, 1, by=0.1)),
trControl = trainControl(method='cv') ,
preProc = c("center", "scale")
)
enet_model_cmp
## Elasticnet
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 130, 129, 130, 131, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.0 0.0 1.853746 NaN 1.5210788
## 0.0 0.1 1.108802 0.6410481 0.9022989
## 0.0 0.2 1.303297 0.5631134 0.9992883
## 0.0 0.3 1.583838 0.5489283 1.1146112
## 0.0 0.4 2.287769 0.4875303 1.3692967
## 0.0 0.5 1.684536 0.5001356 1.1973993
## 0.0 0.6 2.059977 0.4420677 1.3481055
## 0.0 0.7 2.161475 0.4321423 1.3840513
## 0.0 0.8 2.443081 0.4195810 1.4886711
## 0.0 0.9 3.067832 0.4030265 1.7024398
## 0.0 1.0 3.824290 0.3952336 1.9395205
## 0.1 0.0 1.853746 NaN 1.5210788
## 0.1 0.1 1.490538 0.5661304 1.2060613
## 0.1 0.2 1.240842 0.6017858 1.0172436
## 0.1 0.3 1.169557 0.6048527 0.9613165
## 0.1 0.4 1.147134 0.6179441 0.9440404
## 0.1 0.5 1.155216 0.6207792 0.9457192
## 0.1 0.6 1.303150 0.6058186 0.9878986
## 0.1 0.7 1.491811 0.5894862 1.0699288
## 0.1 0.8 1.727467 0.5782577 1.1732226
## 0.1 0.9 1.984839 0.5685932 1.2693750
## 0.1 1.0 2.177773 0.5610600 1.3383789
## 0.2 0.0 1.853746 NaN 1.5210788
## 0.2 0.1 1.526443 0.5586829 1.2338041
## 0.2 0.2 1.280408 0.6017426 1.0457293
## 0.2 0.3 1.184354 0.6017799 0.9695032
## 0.2 0.4 1.157860 0.6115876 0.9508594
## 0.2 0.5 1.166888 0.6149914 0.9612497
## 0.2 0.6 1.239303 0.6060876 0.9860136
## 0.2 0.7 1.389546 0.5904273 1.0425643
## 0.2 0.8 1.564758 0.5789470 1.1249489
## 0.2 0.9 1.723698 0.5727972 1.1956186
## 0.2 1.0 1.905257 0.5663482 1.2656174
## 0.3 0.0 1.853746 NaN 1.5210788
## 0.3 0.1 1.535986 0.5583653 1.2411778
## 0.3 0.2 1.294169 0.6011788 1.0556333
## 0.3 0.3 1.187222 0.6020190 0.9698249
## 0.3 0.4 1.169856 0.6047459 0.9593144
## 0.3 0.5 1.179968 0.6096699 0.9749501
## 0.3 0.6 1.247948 0.5999842 1.0040788
## 0.3 0.7 1.362622 0.5895562 1.0492750
## 0.3 0.8 1.557902 0.5708063 1.1359355
## 0.3 0.9 1.634741 0.5688482 1.1824395
## 0.3 1.0 1.777734 0.5646716 1.2404784
## 0.4 0.0 1.853746 NaN 1.5210788
## 0.4 0.1 1.536867 0.5600898 1.2414998
## 0.4 0.2 1.297976 0.6004476 1.0583370
## 0.4 0.3 1.187945 0.6018818 0.9679570
## 0.4 0.4 1.177398 0.6014972 0.9648777
## 0.4 0.5 1.196388 0.6042344 0.9881038
## 0.4 0.6 1.271206 0.5947145 1.0276171
## 0.4 0.7 1.377311 0.5859141 1.0714683
## 0.4 0.8 1.560743 0.5669782 1.1520750
## 0.4 0.9 1.643326 0.5625403 1.1987117
## 0.4 1.0 1.713231 0.5614377 1.2357967
## 0.5 0.0 1.853746 NaN 1.5210788
## 0.5 0.1 1.534881 0.5617004 1.2394971
## 0.5 0.2 1.297847 0.5995894 1.0585426
## 0.5 0.3 1.189272 0.6002391 0.9669963
## 0.5 0.4 1.181599 0.6006974 0.9682521
## 0.5 0.5 1.208742 0.6014286 0.9981950
## 0.5 0.6 1.279406 0.5940028 1.0427398
## 0.5 0.7 1.383606 0.5830325 1.0905340
## 0.5 0.8 1.551068 0.5648307 1.1652597
## 0.5 0.9 1.648545 0.5575129 1.2163921
## 0.5 1.0 1.683826 0.5581534 1.2430704
## 0.6 0.0 1.853746 NaN 1.5210788
## 0.6 0.1 1.531742 0.5630129 1.2364334
## 0.6 0.2 1.296388 0.5983830 1.0575548
## 0.6 0.3 1.190594 0.5983148 0.9670304
## 0.6 0.4 1.188874 0.5983546 0.9747279
## 0.6 0.5 1.220291 0.6003454 1.0063323
## 0.6 0.6 1.283729 0.5962420 1.0527588
## 0.6 0.7 1.398420 0.5807566 1.1101491
## 0.6 0.8 1.553107 0.5631643 1.1843810
## 0.6 0.9 1.663155 0.5536033 1.2381119
## 0.6 1.0 1.677112 0.5553603 1.2555313
## 0.7 0.0 1.853746 NaN 1.5210788
## 0.7 0.1 1.527999 0.5641473 1.2328263
## 0.7 0.2 1.294280 0.5968792 1.0559591
## 0.7 0.3 1.191627 0.5967194 0.9665409
## 0.7 0.4 1.196133 0.5963030 0.9815643
## 0.7 0.5 1.233279 0.5994186 1.0135455
## 0.7 0.6 1.292743 0.5981875 1.0614967
## 0.7 0.7 1.416831 0.5794876 1.1310712
## 0.7 0.8 1.564809 0.5619064 1.2058131
## 0.7 0.9 1.665224 0.5521636 1.2563436
## 0.7 1.0 1.686715 0.5532071 1.2738327
## 0.8 0.0 1.853746 NaN 1.5210788
## 0.8 0.1 1.523914 0.5651704 1.2289568
## 0.8 0.2 1.291588 0.5954534 1.0537625
## 0.8 0.3 1.192773 0.5952014 0.9662337
## 0.8 0.4 1.204255 0.5940977 0.9878180
## 0.8 0.5 1.247732 0.5980912 1.0209501
## 0.8 0.6 1.305575 0.5993286 1.0710943
## 0.8 0.7 1.438217 0.5790400 1.1534081
## 0.8 0.8 1.584188 0.5609944 1.2292977
## 0.8 0.9 1.668546 0.5526401 1.2748146
## 0.8 1.0 1.709048 0.5516303 1.2958008
## 0.9 0.0 1.853746 NaN 1.5210788
## 0.9 0.1 1.519604 0.5661112 1.2249625
## 0.9 0.2 1.288112 0.5945624 1.0507470
## 0.9 0.3 1.194273 0.5934335 0.9671657
## 0.9 0.4 1.212204 0.5923461 0.9927288
## 0.9 0.5 1.263402 0.5965827 1.0290884
## 0.9 0.6 1.323371 0.5994523 1.0827423
## 0.9 0.7 1.461939 0.5794052 1.1736862
## 0.9 0.8 1.599335 0.5613846 1.2504815
## 0.9 0.9 1.683221 0.5534399 1.2961717
## 0.9 1.0 1.741940 0.5504190 1.3217183
## 1.0 0.0 1.853746 NaN 1.5210788
## 1.0 0.1 1.515259 0.5669473 1.2209890
## 1.0 0.2 1.284603 0.5935182 1.0479365
## 1.0 0.3 1.195889 0.5915600 0.9691750
## 1.0 0.4 1.220144 0.5907195 0.9966857
## 1.0 0.5 1.279544 0.5950658 1.0376527
## 1.0 0.6 1.344316 0.5987442 1.0970822
## 1.0 0.7 1.486172 0.5802634 1.1940411
## 1.0 0.8 1.619926 0.5623211 1.2730445
## 1.0 0.9 1.707270 0.5544432 1.3184057
## 1.0 1.0 1.783928 0.5492656 1.3502329
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.1 and lambda = 0.
With lambda = 0 the elastic net reduces to the lasso; the optimal model (fraction = 0.1) has a resampled RMSE of 1.109 and an R^2 of 0.641.
(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
enet_model_cmp_pred <- predict(enet_model_cmp, newdata=cmp_test_x)
postResample(pred=enet_model_cmp_pred, obs=cmp_test_y)
##      RMSE  Rsquared       MAE
## 1.0654033 0.5992974 0.8569888
The test-set R^2 is 0.599 (RMSE 1.065), slightly below the resampled R^2 of 0.641 on the training set.
(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
The coefficients the elastic net zeroes out can be inspected with predict.enet from the enet package:
(coeffs_enet <- predict.enet(enet_model_cmp$finalModel, s=enet_model_cmp$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients)
## BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 0.00000000 0.00000000 0.00000000
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 0.00000000 0.00000000 0.08009576
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 0.00000000 0.00000000 0.00000000
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 0.02803795 0.00000000 0.04459625
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## -0.03900458 0.00000000 0.45465024
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## -0.20893475 0.00000000 0.05529367
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 0.00000000 -0.22093188 0.00000000
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 0.00000000 0.80671216 0.00000000
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 0.12023039 0.00000000 -0.18851032
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## -0.11091984 0.00000000 0.10247064
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 0.00000000 0.00000000 0.01716628
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 0.00000000 0.00000000 0.05066704
coeffs.sorted <- abs(coeffs_enet)                  # rank by coefficient magnitude
coeffs.sorted <- coeffs.sorted[coeffs.sorted>0]    # keep only the retained predictors
(coeffs.sorted <- sort(coeffs.sorted, decreasing = T))
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess17
## 0.80671216 0.45465024 0.22093188
## ManufacturingProcess13 ManufacturingProcess36 ManufacturingProcess34
## 0.20893475 0.18851032 0.12023039
## ManufacturingProcess37 ManufacturingProcess39 BiologicalMaterial06
## 0.11091984 0.10247064 0.08009576
## ManufacturingProcess15 ManufacturingProcess45 ManufacturingProcess06
## 0.05529367 0.05066704 0.04459625
## ManufacturingProcess07 ManufacturingProcess04 ManufacturingProcess42
## 0.03900458 0.02803795 0.01716628
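For comparison, caret's varImp can rank the predictors; for an enet fit it falls back to a model-free measure (the R^2 of a loess fit of the response against each predictor), which matches the output below:

varImp(enet_model_cmp)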
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 82.21
## ManufacturingProcess36 79.21
## BiologicalMaterial06 75.61
## BiologicalMaterial03 71.87
## ManufacturingProcess17 70.62
## BiologicalMaterial12 66.86
## ManufacturingProcess09 62.20
## ManufacturingProcess06 55.36
## BiologicalMaterial02 53.61
## ManufacturingProcess31 46.58
## ManufacturingProcess33 45.64
## BiologicalMaterial11 42.39
## BiologicalMaterial04 39.70
## ManufacturingProcess29 37.04
## ManufacturingProcess11 37.02
## ManufacturingProcess12 35.87
## BiologicalMaterial08 31.86
## BiologicalMaterial09 30.98
## BiologicalMaterial01 29.67
The manufacturing process predictors dominate the list: ManufacturingProcess32 is by far the most important predictor, and process variables outnumber the biological materials among the top coefficients.
(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
coeffsmanproc <- coeffs.sorted[grep('ManufacturingProcess', names(coeffs.sorted))] %>%
    names() %>% coeffs_enet[.]    # signed coefficients of the retained process predictors
coeffsmanproc[coeffsmanproc>0]
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess34
## 0.80671216 0.45465024 0.12023039
## ManufacturingProcess39 ManufacturingProcess15 ManufacturingProcess45
## 0.10247064 0.05529367 0.05066704
## ManufacturingProcess06 ManufacturingProcess04 ManufacturingProcess42
## 0.04459625 0.02803795 0.01716628
This information could improve yield in future runs: unlike the biological materials, the process predictors can be controlled, so settings with positive coefficients (above all ManufacturingProcess32 and ManufacturingProcess09) could be increased, and those with negative coefficients (e.g., ManufacturingProcess13 and ManufacturingProcess17) decreased.