library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.3.5     v purrr   0.3.2
## v tibble  3.1.1     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## Warning: package 'tibble' was built under R version 3.6.3
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'dplyr' was built under R version 3.6.3
## -- Conflicts ----------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(caret)
## Warning: package 'caret' was built under R version 3.6.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 3.6.3
library(RANN)
## Warning: package 'RANN' was built under R version 3.6.3

Exercise 6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 3.6.3
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

dim(fingerprints)
## [1]  165 1107
head(permeability)
##   permeability
## 1       12.520
## 2        1.120
## 3       19.405
## 4        1.730
## 5        1.680
## 6        0.510

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

# identify and drop the near-zero-variance (low-frequency) predictors
nzv_cols <- nearZeroVar(fingerprints)
high_freq_pred <- fingerprints[, -nzv_cols]
dim(high_freq_pred)
## [1] 165 388

After applying the nearZeroVar function and filtering out the low-frequency predictors, we are left with 388 predictors out of the original 1,107.
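
For reference, nearZeroVar can also return its per-predictor diagnostics, which shows why each column was flagged. A minimal sketch, reusing the fingerprints matrix loaded above:

# saveMetrics = TRUE returns the frequency ratio and percent-unique diagnostics per predictor
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics)
sum(!nzv_metrics$nzv)  # number of predictors retained (should match the 388 above)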

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

We will split the data into 80% train and 20% test.

set.seed(123)
# 80/20 train/test split, stratified on the response
rows <- createDataPartition(permeability, p = 0.8, list = FALSE)
Train_X <- high_freq_pred[rows, ]
Train_Y <- permeability[rows, ]
Test_X <- high_freq_pred[-rows, ]
Test_Y <- permeability[-rows, ]
set.seed(100)
plsTune <- train(Train_X, Train_Y,
                 method = "pls",
                 tuneLength = 20,
                 trControl = trainControl(method = "cv", number = 10),
                 preProc = c("center", "scale"))
plsTune
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.42163  0.3424330  10.281577
##    2     11.39344  0.5301068   8.356199
##    3     11.34480  0.5424005   8.785836
##    4     11.40675  0.5425886   8.900816
##    5     11.22351  0.5621411   8.744281
##    6     11.21466  0.5572367   8.570960
##    7     11.10536  0.5572047   8.626906
##    8     11.18684  0.5480061   8.821416
##    9     11.42932  0.5372119   8.880582
##   10     11.70320  0.5319391   8.955978
##   11     12.12762  0.5172641   9.280675
##   12     12.26055  0.5090364   9.395535
##   13     12.36214  0.5025415   9.436744
##   14     12.49014  0.4946650   9.502231
##   15     12.63210  0.4859888   9.481292
##   16     12.66467  0.4795197   9.577942
##   17     13.02992  0.4579860   9.747579
##   18     13.36040  0.4399996  10.087336
##   19     13.55156  0.4345577  10.154426
##   20     13.82935  0.4147223  10.376443
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 7.

The optimal number of latent variables is 7, with a cross-validated R2 of about 0.56.

plot(plsTune, main="PLS Model")

The plot above also shows that the lowest RMSE is achieved with 7 latent variables.
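
The selected number of components and its resampled metrics can also be pulled directly out of the train object; a quick sketch:

plsTune$bestTune                                          # selected ncomp
subset(plsTune$results, ncomp == plsTune$bestTune$ncomp)  # its resampled RMSE / R2 / MAE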

(d) Predict the response for the test set. What is the test set estimate of R2?

plspredict <- predict(plsTune, Test_X)

postResample(pred=plspredict, obs = Test_Y)
##       RMSE   Rsquared        MAE 
## 11.9744490  0.3618213  8.3021291

The predictions on the test set yield an R2 of 0.36, noticeably lower than the resampled training estimate of 0.56.
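
For context, the Rsquared reported by postResample is the squared correlation between observed and predicted values, so the same number can be reproduced by hand:

cor(Test_Y, plspredict)^2  # should match the Rsquared from postResample above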

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

We will also build ridge regression and elastic net models, which penalize (shrink) the regression coefficients to reduce variance and, ideally, lower the prediction error.

ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))

enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))
set.seed(100)
ridgeRegFit <- train(Train_X, Train_Y,
                     method = "ridge",
                     ## fit the model over many penalty values
                     tuneGrid = ridgeGrid,
                     trControl = trainControl(method = "cv", number = 10),
                     ## put the predictors on the same scale
                     preProc = c("center", "scale"))
ridgeRegFit
## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE       Rsquared   MAE       
##   0.000000000   16.65470  0.3246295   11.980223
##   0.007142857  495.59047  0.2665162  301.038042
##   0.014285714   15.35646  0.3824577   11.036398
##   0.021428571   14.34723  0.4126134   10.369434
##   0.028571429   13.62339  0.4391678    9.969653
##   0.035714286   13.21904  0.4571164    9.708772
##   0.042857143   12.97686  0.4683140    9.534774
##   0.050000000   12.76421  0.4788721    9.397963
##   0.057142857   12.61672  0.4868497    9.281034
##   0.064285714   12.48945  0.4941136    9.188233
##   0.071428571   12.39857  0.4997212    9.123385
##   0.078571429   12.29491  0.5066144    9.037590
##   0.085714286   29.82917  0.5099260   20.002219
##   0.092857143   12.15823  0.5158238    8.962629
##   0.100000000   12.10331  0.5201908    8.944217
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.

The ridge regression selected an optimal penalty of lambda = 0.1, which yielded a resampled RMSE of 12.10 and R2 of 0.52.

set.seed(100)
enetTune <- train(Train_X, Train_Y,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = trainControl(method = "cv", number = 10),
                  preProc = c("center", "scale"))
enetTune
## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE       Rsquared   MAE       
##   0.00    0.05       12.64612  0.4536458    9.556511
##   0.00    0.10       12.13241  0.4865922    9.005788
##   0.00    0.15       11.95181  0.4928998    9.010095
##   0.00    0.20       11.91184  0.4963333    8.989394
##   0.00    0.25       11.99180  0.4897273    8.992464
##   0.00    0.30       12.11661  0.4832279    9.047403
##   0.00    0.35       12.14583  0.4838835    9.169429
##   0.00    0.40       12.20325  0.4810649    9.236379
##   0.00    0.45       12.40468  0.4705176    9.321607
##   0.00    0.50       12.69342  0.4564725    9.462913
##   0.00    0.55       13.01673  0.4417871    9.621293
##   0.00    0.60       13.40272  0.4235588    9.906992
##   0.00    0.65       13.81848  0.4046498   10.241059
##   0.00    0.70       14.23234  0.3885519   10.585033
##   0.00    0.75       14.64289  0.3745751   10.854173
##   0.00    0.80       14.99464  0.3641484   11.038646
##   0.00    0.85       15.41385  0.3537012   11.256812
##   0.00    0.90       15.82419  0.3431272   11.516630
##   0.00    0.95       16.24695  0.3327372   11.765172
##   0.00    1.00       16.65470  0.3246295   11.980223
##   0.01    0.05       21.52719  0.4318614   14.362612
##   0.01    0.10       32.66582  0.4821915   20.404042
##   0.01    0.15       45.49499  0.4835579   27.316633
##   0.01    0.20       58.45234  0.4828223   34.022937
##   0.01    0.25       71.30469  0.4870870   40.698275
##   0.01    0.30       84.46400  0.4727429   47.499952
##   0.01    0.35       97.79259  0.4480955   54.406874
##   0.01    0.40      111.04406  0.4292212   61.234496
##   0.01    0.45      124.24639  0.4134935   68.048571
##   0.01    0.50      137.48703  0.4002412   74.504475
##   0.01    0.55      150.84513  0.3839576   80.815224
##   0.01    0.60      164.21263  0.3692603   87.118405
##   0.01    0.65      177.59402  0.3563203   93.420407
##   0.01    0.70      190.99314  0.3446008   99.704695
##   0.01    0.75      204.38414  0.3341151  105.961178
##   0.01    0.80      217.72725  0.3258306  112.190910
##   0.01    0.85      231.01640  0.3202982  118.397017
##   0.01    0.90      244.31078  0.3150213  124.583862
##   0.01    0.95      257.61766  0.3095325  130.791371
##   0.01    1.00      270.93781  0.3041826  137.010579
##   0.10    0.05       12.52168  0.4911655    9.600791
##   0.10    0.10       11.69767  0.5082528    8.570414
##   0.10    0.15       11.25855  0.5446966    8.131415
##   0.10    0.20       11.13511  0.5608032    8.145854
##   0.10    0.25       11.12998  0.5653095    8.287737
##   0.10    0.30       11.14717  0.5681918    8.355882
##   0.10    0.35       11.11414  0.5724906    8.389723
##   0.10    0.40       11.16058  0.5728561    8.471979
##   0.10    0.45       11.29909  0.5669204    8.603777
##   0.10    0.50       11.41533  0.5592956    8.701229
##   0.10    0.55       11.52421  0.5517792    8.764815
##   0.10    0.60       11.64067  0.5445765    8.809319
##   0.10    0.65       11.72329  0.5396557    8.831925
##   0.10    0.70       11.78851  0.5361196    8.843319
##   0.10    0.75       11.86101  0.5320383    8.853427
##   0.10    0.80       11.91842  0.5290675    8.869200
##   0.10    0.85       11.96955  0.5265827    8.886626
##   0.10    0.90       12.01848  0.5240601    8.913869
##   0.10    0.95       12.06135  0.5221562    8.928521
##   0.10    1.00       12.10331  0.5201908    8.944217
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.35 and lambda = 0.1.

The elastic net selected an optimal penalty of lambda = 0.1 and fraction = 0.35, which yielded a resampled RMSE of 11.11 and R2 of 0.57.

Comparing the resampled results, the elastic net edges out the other models in this exercise: its RMSE (11.11) is essentially tied with the PLS model's (11.11) while its R2 is slightly higher (0.57 vs. 0.56), and both clearly outperform ridge regression (RMSE 12.10).
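
Since all three models were tuned with the same seed and the same 10-fold cross-validation, their resampled results can be compared side by side with caret's resamples function; a sketch:

# fold-by-fold comparison of the three models (assumes identical CV folds, which
# holds here because each call used set.seed(100) immediately before train())
model_comp <- resamples(list(PLS = plsTune, Ridge = ridgeRegFit, ENet = enetTune))
summary(model_comp)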

(f) Would you recommend any of your models to replace the permeability laboratory experiment?

enetpredict <- predict(enetTune, Test_X)

postResample(pred=enetpredict, obs = Test_Y)
##      RMSE  Rsquared       MAE 
## 11.565840  0.403555  7.592838

With a test-set R2 of only about 0.4 from our best model, I don’t feel confident that any of these models could replace the laboratory experiment.

Exercise 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data:

data("ChemicalManufacturingProcess")

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

dim(ChemicalManufacturingProcess)
## [1] 176  58

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

The missing-values plot below shows that roughly 28 variables contain missing values, ranging from 0.57% to 8.52% of observations per variable.

plot_missing(ChemicalManufacturingProcess)

I will use the preProcess function from Section 3.8 with k-nearest-neighbor imputation (knnImpute) to fill in the missing values. Note that knnImpute also centers and scales the data, so all columns, including Yield, end up on a standardized scale.

preProcValues <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute"))
data_imp <- predict(preProcValues, ChemicalManufacturingProcess)

As we can see below, we were able to successfully impute the missing values.

plot_missing(data_imp)
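
A quick numeric check confirms the same thing:

sum(is.na(data_imp))  # should be 0 after imputation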

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

Based on the previous exercise, where the elastic net had the best predictive performance, I will fit an elastic net model to this data set as well.

We will split the data into 80% train and 20% test.

set.seed(123)
index <- createDataPartition(data_imp$Yield, p=0.8, list=FALSE) 
Train <- data_imp[index, ]
Test <- data_imp[-index, ]
set.seed(100)
enetTune2 <- train(x = Train[, 2:58], y = Train$Yield,
                   method = "enet",
                   tuneGrid = enetGrid,
                   trControl = trainControl(method = "cv", number = 10),
                   preProc = c("center", "scale"))
enetTune2
## Elasticnet 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE       Rsquared   MAE      
##   0.00    0.05      0.6320445  0.6289027  0.5192234
##   0.00    0.10      0.6312174  0.6246148  0.4996591
##   0.00    0.15      0.6707398  0.6092577  0.5107690
##   0.00    0.20      0.8867853  0.5356139  0.5984773
##   0.00    0.25      1.0539429  0.5271847  0.6495962
##   0.00    0.30      1.1354445  0.4920009  0.6814887
##   0.00    0.35      1.1304057  0.5006036  0.6797520
##   0.00    0.40      1.1827263  0.4975483  0.6881896
##   0.00    0.45      1.2396502  0.4883940  0.7054864
##   0.00    0.50      1.2967375  0.4799601  0.7234204
##   0.00    0.55      1.5040716  0.4676975  0.7805039
##   0.00    0.60      1.8012295  0.4601443  0.8613045
##   0.00    0.65      2.0731065  0.4553634  0.9351509
##   0.00    0.70      2.4061716  0.4506399  1.0288289
##   0.00    0.75      2.7035810  0.4369008  1.1198793
##   0.00    0.80      3.0230849  0.4125565  1.2144071
##   0.00    0.85      3.3642178  0.3940718  1.3111083
##   0.00    0.90      3.6064118  0.3777339  1.3805766
##   0.00    0.95      3.6457075  0.3617068  1.3955417
##   0.00    1.00      3.6766196  0.3485683  1.4081152
##   0.01    0.05      0.8152492  0.5668917  0.6658257
##   0.01    0.10      0.6974381  0.5842443  0.5752620
##   0.01    0.15      0.6596535  0.5834005  0.5386233
##   0.01    0.20      0.6303441  0.6161989  0.5108115
##   0.01    0.25      0.6295030  0.6278037  0.5016113
##   0.01    0.30      0.6494164  0.6231636  0.5025318
##   0.01    0.35      0.6236470  0.6363887  0.4918321
##   0.01    0.40      0.6958367  0.5876740  0.5198814
##   0.01    0.45      0.8340710  0.5529686  0.5673578
##   0.01    0.50      0.9373883  0.5376276  0.6036635
##   0.01    0.55      1.0668299  0.5200565  0.6462376
##   0.01    0.60      1.1674330  0.5025675  0.6857252
##   0.01    0.65      1.3020902  0.4763807  0.7303421
##   0.01    0.70      1.4107179  0.4575943  0.7653105
##   0.01    0.75      1.5592194  0.4446220  0.8098116
##   0.01    0.80      1.7243504  0.4359284  0.8578185
##   0.01    0.85      1.8689982  0.4282880  0.8998960
##   0.01    0.90      2.0021643  0.4219321  0.9379184
##   0.01    0.95      2.1341409  0.4166028  0.9750575
##   0.01    1.00      2.2651000  0.4114656  1.0120085
##   0.10    0.05      0.8868079  0.5113900  0.7241041
##   0.10    0.10      0.7998069  0.5640424  0.6537250
##   0.10    0.15      0.7311998  0.5784946  0.5989290
##   0.10    0.20      0.6809861  0.5805522  0.5665708
##   0.10    0.25      0.6718308  0.5719534  0.5494798
##   0.10    0.30      0.6672095  0.5841149  0.5319346
##   0.10    0.35      0.6494949  0.6023835  0.5194206
##   0.10    0.40      0.6383150  0.6156173  0.5110375
##   0.10    0.45      0.6541795  0.6174460  0.5077503
##   0.10    0.50      0.6898163  0.6146906  0.5126136
##   0.10    0.55      0.7163428  0.6121797  0.5181153
##   0.10    0.60      0.7347934  0.6094709  0.5209760
##   0.10    0.65      0.7763528  0.6035221  0.5343849
##   0.10    0.70      0.8628456  0.5715720  0.5679158
##   0.10    0.75      0.9554315  0.5506102  0.5988784
##   0.10    0.80      1.0134619  0.5453907  0.6210562
##   0.10    0.85      1.0684958  0.5341790  0.6445722
##   0.10    0.90      1.1438742  0.5200650  0.6705506
##   0.10    0.95      1.2318152  0.5071714  0.6986565
##   0.10    1.00      1.3145618  0.4975579  0.7249490
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.35 and lambda = 0.01.
plot(enetTune2)

The optimal tuning values are a penalty of lambda = 0.01 and fraction = 0.35, which yield a resampled RMSE of about 0.62 and R2 of about 0.64 (on the standardized Yield scale).
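
The same numbers can be extracted programmatically, for example with caret's getTrainPerf:

getTrainPerf(enetTune2)  # resampled RMSE / Rsquared / MAE at the selected tuning values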

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

enetpredict2 <- predict(enetTune2, Test[, 2:58])

postResample(pred=enetpredict2, obs = Test$Yield)
##      RMSE  Rsquared       MAE 
## 0.6894332 0.5369033 0.5970996

The test-set predictions yield an RMSE of 0.689, somewhat higher than the resampled training RMSE of 0.62, and an R2 of 0.537, slightly lower than the resampled training R2 of 0.64. Overall, the model generalizes reasonably well to the test set.
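
As an additional sanity check, a predicted-versus-observed plot for the test set (a simple sketch):

plot(Test$Yield, enetpredict2,
     xlab = "Observed Yield (standardized)", ylab = "Predicted Yield (standardized)")
abline(0, 1, lty = 2)  # points near this line indicate good agreement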

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

In this model, the most important predictor is ManufacturingProcess32 (see the ranking below). Neither predictor type dominates the list: the top 20 contains 11 process predictors and 9 biological predictors.

important <- varImp(enetTune2)
important
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     94.06
## BiologicalMaterial03     81.27
## ManufacturingProcess13   80.63
## ManufacturingProcess36   79.17
## ManufacturingProcess31   76.84
## BiologicalMaterial02     76.04
## ManufacturingProcess17   75.92
## ManufacturingProcess09   73.04
## BiologicalMaterial12     69.48
## ManufacturingProcess06   66.28
## BiologicalMaterial11     59.72
## ManufacturingProcess33   58.60
## ManufacturingProcess29   54.77
## BiologicalMaterial04     53.93
## ManufacturingProcess11   49.55
## BiologicalMaterial01     45.62
## BiologicalMaterial08     44.93
## BiologicalMaterial09     40.88
## ManufacturingProcess30   40.31
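
As a quick tally of the top 20 list above, a small sketch using the varImp object (the scores are stored in important$importance):

# count how many of the 20 highest-ranked predictors are process vs. biological
top20 <- head(important$importance[order(-important$importance$Overall), , drop = FALSE], 20)
table(ifelse(grepl("^Manufacturing", rownames(top20)), "Process", "Biological"))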

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

corr_data1 <- data_imp %>%
  select(Yield, ManufacturingProcess32, BiologicalMaterial06, BiologicalMaterial03, ManufacturingProcess13, ManufacturingProcess36, ManufacturingProcess31, BiologicalMaterial02, ManufacturingProcess17, ManufacturingProcess09, BiologicalMaterial12)

corr_data2 <- data_imp %>%
  select(Yield, ManufacturingProcess06, BiologicalMaterial11, ManufacturingProcess33, ManufacturingProcess29, BiologicalMaterial04, ManufacturingProcess11, BiologicalMaterial01, BiologicalMaterial08, BiologicalMaterial09, ManufacturingProcess30)

plot_correlation(corr_data1)

plot_correlation(corr_data2)

As the correlation plots above show, several of the most important predictors have fairly strong positive or negative relationships with the response, although a few variables flagged as important show only weak correlations with Yield. Knowing which process variables are positively or negatively correlated with yield can guide future runs: process settings associated with higher yield can be increased and those associated with lower yield reduced, while the biological predictors, which cannot be changed, can still be used to screen the quality of incoming raw material.
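
To make the direction and strength of these relationships explicit numerically, the pairwise correlations with Yield can also be computed directly; a small sketch for the first group of top predictors:

# correlation of each top predictor with the (standardized) response, sorted
cor_with_yield <- sapply(corr_data1[, -1], cor, y = corr_data1$Yield)
sort(cor_with_yield, decreasing = TRUE)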