Homework 7

library(tidyverse)
library(dbplyr)
library(caret)
# library(grid)
# library(gridExtra)
library(DMwR)
library(pls)
library(elasticnet)

Exercise 6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
library(knitr)
data(permeability)
summary(permeability)

##   permeability  
##  Min.   : 0.06  
##  1st Qu.: 1.55  
##  Median : 4.91  
##  Mean   :12.24  
##  3rd Qu.:15.47  
##  Max.   :55.60

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

#using library(caret)
near.zero.var <- nearZeroVar(fingerprints)
near.zero.var

##   [1]    7    8    9   10   13   14   17   18   19   22   23   24   30   31   32
##  [16]   33   34   45   77   81   82   83   84   85   89   90   91   92   95  100
##  [31]  104  105  106  107  109  110  112  113  114  115  116  117  119  120  122
##  [46]  123  124  128  131  132  134  135  136  137  139  140  144  145  147  148
##  [61]  149  151  155  160  161  164  165  166  216  217  218  219  220  222  243
##  [76]  252  259  273  275  277  282  283  287  288  289  292  346  347  348  349
##  [91]  350  351  352  353  354  363  364  365  369  375  379  384  391  393  397
## [106]  399  402  404  405  407  408  409  410  411  412  413  414  415  416  417
## [121]  418  419  420  421  422  423  424  425  426  427  428  429  430  431  432
## [136]  433  434  435  436  437  438  439  440  441  442  443  444  445  446  447
## [151]  448  449  450  451  452  453  454  455  456  457  458  459  460  461  462
## [166]  463  464  465  466  467  468  469  470  471  472  473  474  475  476  477
## [181]  478  479  480  481  482  483  484  485  486  487  488  489  490  491  492
## [196]  493  494  495  498  500  501  502  513  523  525  526  527  528  530  531
## [211]  532  533  534  535  536  537  538  539  540  541  542  543  544  545  546
## [226]  547  548  550  552  555  562  563  564  566  567  569  570  572  575  578
## [241]  579  580  581  582  583  584  585  586  587  588  589  596  605  606  607
## [256]  608  609  610  611  612  614  615  616  617  618  619  620  622  623  624
## [271]  625  626  627  628  629  630  631  632  633  634  635  636  637  638  639
## [286]  640  641  642  643  644  645  646  647  648  649  650  651  652  653  654
## [301]  655  656  657  658  659  660  661  662  663  664  665  666  667  668  669
## [316]  670  671  672  673  674  675  676  677  678  680  681  682  683  684  685
## [331]  686  687  688  689  690  691  692  693  694  695  696  697  706  707  708
## [346]  709  710  711  712  713  714  715  716  717  718  720  721  722  723  724
## [361]  725  726  727  728  729  730  731  734  735  736  737  738  739  740  741
## [376]  742  743  744  745  746  747  748  749  756  757  758  759  760  761  762
## [391]  763  764  765  766  767  768  769  770  771  772  777  778  779  781  783
## [406]  784  785  786  787  788  789  790  791  794  796  797  799  802  803  804
## [421]  807  808  809  810  811  814  815  816  817  818  819  820  821  822  823
## [436]  824  825  826  827  828  829  830  831  832  833  834  835  836  837  838
## [451]  839  840  841  842  843  844  845  846  847  848  849  850  851  852  853
## [466]  854  855  856  857  858  859  860  861  862  863  864  865  866  867  868
## [481]  869  870  871  872  873  874  875  876  877  878  879  880  881  882  883
## [496]  884  885  886  887  888  889  890  891  892  893  894  895  896  897  898
## [511]  899  900  901  902  903  904  905  906  907  908  909  910  911  912  913
## [526]  914  915  916  917  918  919  920  921  922  923  924  925  926  927  928
## [541]  929  930  931  932  933  934  935  936  937  938  939  940  941  942  943
## [556]  944  945  946  947  948  949  950  951  952  953  954  955  956  957  958
## [571]  959  960  961  962  963  964  965  966  967  968  969  970  971  972  973
## [586]  974  975  976  977  978  979  980  981  982  983  984  985  986  987  988
## [601]  989  990  991  992  993  994  995  996  997  998  999 1000 1001 1002 1003
## [616] 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018
## [631] 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033
## [646] 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048
## [661] 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063
## [676] 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078
## [691] 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093
## [706] 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107

length(near.zero.var)

## [1] 719

fingerprints.nz <- fingerprints[,-near.zero.var]
dim(fingerprints.nz)

## [1] 165 388

str(fingerprints.nz)

##  num [1:165, 1:388] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:388] "X1" "X2" "X3" "X4" ...

nearZeroVar flags 719 of the 1,107 fingerprint predictors as having zero or near-zero variance. Removing those 719 columns leaves 388 predictors for modeling.
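To make the filtering rule concrete, here is a minimal base-R sketch of the kind of check nearZeroVar performs. It assumes caret's documented default cutoffs (freqCut = 95/5 and uniqueCut = 10). The toy matrix, the helper names freq_ratio and percent_unique, and the handling of single-valued columns are my own assumptions for illustration, not caret's code.

```r
# Toy binary fingerprint matrix: 100 molecules, 3 hypothetical substructures.
set.seed(1)
toy <- cbind(
  common = rbinom(100, 1, 0.40),  # present in roughly 40% of molecules
  rare   = c(rep(0, 98), 1, 1),   # present in only 2 molecules
  absent = rep(0, 100)            # never present (zero variance)
)

# Ratio of the most common value's count to the second most common value's count.
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)  # single unique value (convention of this sketch)
  as.numeric(counts[1] / counts[2])
}

# Percentage of distinct values relative to the number of samples.
percent_unique <- function(x) 100 * length(unique(x)) / length(x)

# Flag a column if it is constant, or if it is both highly imbalanced
# and has very few distinct values.
flagged <- apply(toy, 2, function(x) {
  length(unique(x)) == 1 ||
    (freq_ratio(x) > 95 / 5 && percent_unique(x) < 10)
})
flagged
```

In this toy example the rare and absent columns are flagged and the common column is kept, which mirrors why so many sparse fingerprint columns are dropped.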

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?

highCorr <- findCorrelation(cor(fingerprints.nz), 0.90)
fingerprints.nz.2 <- fingerprints.nz[, -highCorr]
dim(fingerprints.nz.2)

## [1] 165 110

set.seed(777)
split_index <- createDataPartition(permeability, p = 0.7, list = FALSE)
X_train <- fingerprints.nz.2[split_index, ]
y_train <- permeability[split_index, ]
X_test <- fingerprints.nz.2[-split_index, ]
y_test <- permeability[-split_index, ]

Using PLS Model

#using library(pls)
plsFit <-  train(x = X_train, y = y_train, method = "pls", preProc = c("center", "scale"), trControl = trainControl("cv", number = 10), tuneLength = 25)

plsFit$results %>% filter(ncomp == plsFit$bestTune$ncomp) %>% select(ncomp, RMSE, Rsquared)

##   ncomp     RMSE  Rsquared
## 1     5 12.18165 0.4601662

plot(plsFit, main = "Training Set RMSE")

(d) Predict the response for the test set. What is the test set estimate of \(R^2\) ?

Using the held-out test set, which was split earlier into the predictors (X_test) and the response (y_test), we predict permeability and compute RMSE and Rsquared:

pls.prediction <- predict(plsFit, newdata = X_test)
results <- data.frame(Model = "PLS", RMSE = caret::RMSE(pls.prediction, y_test), Rsquared = caret::R2(pls.prediction, y_test))
results

##   Model    RMSE  Rsquared
## 1   PLS 10.4266 0.5764468
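As a reminder of what these two numbers measure, here is a small base-R sketch. As far as I know, caret::R2 reports the squared correlation between predictions and observations by default (formula = "corr"), with a "traditional" 1 - SSE/SST version available as an option. The obs and pred vectors below are made-up values for illustration only.

```r
# Hypothetical observed and predicted values, for illustration only.
obs  <- c(1.2, 4.9, 12.0, 15.5, 30.1, 55.6)
pred <- c(3.0, 6.1, 10.4, 18.2, 25.7, 41.3)

# Root mean squared error: typical size of a prediction error, in response units.
rmse <- sqrt(mean((obs - pred)^2))

# Squared correlation: how well predictions track observations (ignores bias and scale).
r2_cor <- cor(obs, pred)^2

# Traditional R-squared: 1 - SSE/SST, which also penalizes bias and scale errors.
r2_trad <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

c(RMSE = rmse, R2_corr = r2_cor, R2_traditional = r2_trad)
```

Because the correlation version ignores systematic over- or under-prediction, a model can have a higher Rsquared and still a larger RMSE, so it helps to read both metrics together.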

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

Consider the following Models:

Ridge regression

ridgeFit <- train(x = X_train, y = y_train, method = 'ridge', metric = 'Rsquared', tuneGrid = data.frame(.lambda = seq(0, 1, by = 0.1)), trControl = trainControl(method = 'cv'),
                   preProcess = c('center', 'scale'))

plot(ridgeFit)

ridge.predictions <- predict(ridgeFit, newdata = X_test)
ridge.results <- data.frame(Model = "Ridge Regression", RMSE = caret::RMSE(ridge.predictions, y_test), Rsquared = caret::R2(ridge.predictions, y_test))
ridge.results

##              Model    RMSE  Rsquared
## 1 Ridge Regression 13.2473 0.6062653

Lasso Regression

lassoFit <- train(x = X_train, y = y_train, method ='lasso', metric = 'Rsquared', tuneGrid = data.frame(.fraction = seq(0, 0.5, by = 0.05)), trControl=trainControl(method = 'cv'),
                  preProcess = c('center', 'scale'))

plot(lassoFit)

lasso.predictions <- predict(lassoFit, newdata = X_test)
lasso.results <- data.frame(Model = "Lasso Regression", RMSE = caret::RMSE(lasso.predictions, y_test), Rsquared = caret::R2(lasso.predictions, y_test))
lasso.results

##              Model      RMSE   Rsquared
## 1 Lasso Regression 194506524 0.01593297

Elastic Net Regression

elasticFit <- train(x = X_train, y = y_train, method = 'enet', metric = 'Rsquared', tuneGrid = data.frame(.fraction = seq(0, 1, by = 0.1), .lambda = seq(0, 1, by = 0.1)),
                 trControl = trainControl(method = 'cv'), preProcess = c('center', 'scale'))

plot(elasticFit)

elastic.predictions <- predict(elasticFit, newdata = X_test)
elastic.results <- data.frame(Model = "Elastic Net Regression", RMSE = caret::RMSE(elastic.predictions, y_test), Rsquared = caret::R2(elastic.predictions, y_test))
elastic.results

##                    Model     RMSE  Rsquared
## 1 Elastic Net Regression 12.79535 0.6084177
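One detail about the elastic net tuning grid is worth checking. The grid above was built with data.frame(), which pairs the fraction and lambda values element by element instead of crossing them. The base-R sketch below shows the difference. Whether a full grid would change the selected model is not something I have verified.

```r
fractions <- seq(0, 1, by = 0.1)
lambdas   <- seq(0, 1, by = 0.1)

# data.frame() pairs the two vectors row by row: fraction always equals lambda.
paired_grid <- data.frame(.fraction = fractions, .lambda = lambdas)

# expand.grid() crosses them: every fraction is tried with every lambda.
full_grid <- expand.grid(.fraction = fractions, .lambda = lambdas)

nrow(paired_grid)  # 11 candidate settings
nrow(full_grid)    # 121 candidate settings
```

Passing full_grid as tuneGrid would search all combinations, at the cost of a longer run time.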

These models show little improvement over PLS.

Let’s summarize and organize:

plsFit$results %>%
  filter(ncomp == plsFit$bestTune$ncomp) %>%
  mutate("Model" = "PLS") %>%
  select(Model, RMSE, Rsquared) %>%
  as.data.frame() %>%
  bind_rows(ridge.results) %>%
  bind_rows(lasso.results) %>%
  bind_rows(elastic.results) %>%
  arrange(desc(Rsquared))

##                    Model         RMSE   Rsquared
## 1 Elastic Net Regression 1.279535e+01 0.60841769
## 2       Ridge Regression 1.324730e+01 0.60626528
## 3                    PLS 1.218165e+01 0.46016619
## 4       Lasso Regression 1.945065e+08 0.01593297

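Note that the PLS row in this table is the cross-validated (resampled) estimate, whereas the other three rows are test-set estimates; on the test set, PLS scored an RMSE of 10.43 and an Rsquared of 0.576. Ridge and elastic net therefore have a slightly higher test-set Rsquared (about 0.61) than PLS, but also a higher RMSE (13.25 and 12.80), and a higher RMSE means larger prediction errors. Any improvement over PLS is marginal at best. The lasso result (RMSE of about 1.9e8) indicates that the fit broke down and should not be interpreted.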
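To compare all four models on the same footing, the test-set metrics reported in the individual outputs above can be collected into one table. This is a minimal base-R sketch; the numbers are copied from those outputs, not recomputed.

```r
# Test-set metrics copied from the outputs reported above.
test_results <- data.frame(
  Model    = c("PLS", "Ridge Regression", "Lasso Regression", "Elastic Net Regression"),
  RMSE     = c(10.4266, 13.2473, 194506524, 12.79535),
  Rsquared = c(0.5764468, 0.6062653, 0.01593297, 0.6084177)
)

# Rank by RMSE (lower is better).
test_results[order(test_results$RMSE), ]

# Best model under each metric.
best_by_rmse <- test_results$Model[which.min(test_results$RMSE)]
best_by_r2   <- test_results$Model[which.max(test_results$Rsquared)]
c(lowest_RMSE = best_by_rmse, highest_Rsquared = best_by_r2)
```

The two metrics disagree here: PLS has the lowest test-set RMSE, while elastic net has the highest Rsquared by a small margin.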

(f) Would you recommend any of your models to replace the permeability laboratory experiment?

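Ridge regression and elastic net have the highest test-set Rsquared (about 0.61) and perform very similarly, so of the models tried, those are the two I would recommend. However, with an RMSE of roughly 13 against permeability values that range from 0.06 to 55.60, the predictions are coarse. The models look better suited to screening compounds ahead of the laboratory experiment than to replacing it outright.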

Exercise 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

A. Start R and use these commands to load the data:

#library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")

dim(ChemicalManufacturingProcess)

## [1] 176  58

summary(ChemicalManufacturingProcess)

##      Yield       BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
##  Min.   :35.25   Min.   :4.580        Min.   :46.87        Min.   :56.97       
##  1st Qu.:38.75   1st Qu.:5.978        1st Qu.:52.68        1st Qu.:64.98       
##  Median :39.97   Median :6.305        Median :55.09        Median :67.22       
##  Mean   :40.18   Mean   :6.411        Mean   :55.69        Mean   :67.70       
##  3rd Qu.:41.48   3rd Qu.:6.870        3rd Qu.:58.74        3rd Qu.:70.43       
##  Max.   :46.34   Max.   :8.810        Max.   :64.75        Max.   :78.25       
##                                                                                
##  BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
##  Min.   : 9.38        Min.   :13.24        Min.   :40.60       
##  1st Qu.:11.24        1st Qu.:17.23        1st Qu.:46.05       
##  Median :12.10        Median :18.49        Median :48.46       
##  Mean   :12.35        Mean   :18.60        Mean   :48.91       
##  3rd Qu.:13.22        3rd Qu.:19.90        3rd Qu.:51.34       
##  Max.   :23.09        Max.   :24.85        Max.   :59.38       
##                                                                
##  BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
##  Min.   :100.0        Min.   :15.88        Min.   :11.44       
##  1st Qu.:100.0        1st Qu.:17.06        1st Qu.:12.60       
##  Median :100.0        Median :17.51        Median :12.84       
##  Mean   :100.0        Mean   :17.49        Mean   :12.85       
##  3rd Qu.:100.0        3rd Qu.:17.88        3rd Qu.:13.13       
##  Max.   :100.8        Max.   :19.14        Max.   :14.08       
##                                                                
##  BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
##  Min.   :1.770        Min.   :135.8        Min.   :18.35       
##  1st Qu.:2.460        1st Qu.:143.8        1st Qu.:19.73       
##  Median :2.710        Median :146.1        Median :20.12       
##  Mean   :2.801        Mean   :147.0        Mean   :20.20       
##  3rd Qu.:2.990        3rd Qu.:149.6        3rd Qu.:20.75       
##  Max.   :6.870        Max.   :158.7        Max.   :22.21       
##                                                                
##  ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
##  Min.   : 0.00          Min.   : 0.00          Min.   :1.47          
##  1st Qu.:10.80          1st Qu.:19.30          1st Qu.:1.53          
##  Median :11.40          Median :21.00          Median :1.54          
##  Mean   :11.21          Mean   :16.68          Mean   :1.54          
##  3rd Qu.:12.15          3rd Qu.:21.50          3rd Qu.:1.55          
##  Max.   :14.10          Max.   :22.50          Max.   :1.60          
##  NA's   :1              NA's   :3              NA's   :15            
##  ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
##  Min.   :911.0          Min.   : 923.0         Min.   :203.0         
##  1st Qu.:928.0          1st Qu.: 986.8         1st Qu.:205.7         
##  Median :934.0          Median : 999.2         Median :206.8         
##  Mean   :931.9          Mean   :1001.7         Mean   :207.4         
##  3rd Qu.:936.0          3rd Qu.:1008.9         3rd Qu.:208.7         
##  Max.   :946.0          Max.   :1175.3         Max.   :227.4         
##  NA's   :1              NA's   :1              NA's   :2             
##  ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
##  Min.   :177.0          Min.   :177.0          Min.   :38.89         
##  1st Qu.:177.0          1st Qu.:177.0          1st Qu.:44.89         
##  Median :177.0          Median :178.0          Median :45.73         
##  Mean   :177.5          Mean   :177.6          Mean   :45.66         
##  3rd Qu.:178.0          3rd Qu.:178.0          3rd Qu.:46.52         
##  Max.   :178.0          Max.   :178.0          Max.   :49.36         
##  NA's   :1              NA's   :1                                    
##  ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
##  Min.   : 7.500         Min.   : 7.500         Min.   :   0.0        
##  1st Qu.: 8.700         1st Qu.: 9.000         1st Qu.:   0.0        
##  Median : 9.100         Median : 9.400         Median :   0.0        
##  Mean   : 9.179         Mean   : 9.386         Mean   : 857.8        
##  3rd Qu.: 9.550         3rd Qu.: 9.900         3rd Qu.:   0.0        
##  Max.   :11.600         Max.   :11.500         Max.   :4549.0        
##  NA's   :9              NA's   :10             NA's   :1             
##  ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
##  Min.   :32.10          Min.   :4701           Min.   :5904          
##  1st Qu.:33.90          1st Qu.:4828           1st Qu.:6010          
##  Median :34.60          Median :4856           Median :6032          
##  Mean   :34.51          Mean   :4854           Mean   :6039          
##  3rd Qu.:35.20          3rd Qu.:4882           3rd Qu.:6061          
##  Max.   :38.60          Max.   :5055           Max.   :6233          
##                         NA's   :1                                    
##  ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
##  Min.   :   0           Min.   :31.30          Min.   :   0          
##  1st Qu.:4561           1st Qu.:33.50          1st Qu.:4813          
##  Median :4588           Median :34.40          Median :4835          
##  Mean   :4566           Mean   :34.34          Mean   :4810          
##  3rd Qu.:4619           3rd Qu.:35.10          3rd Qu.:4862          
##  Max.   :4852           Max.   :40.00          Max.   :4971          
##                                                                      
##  ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
##  Min.   :5890           Min.   :   0           Min.   :-1.8000       
##  1st Qu.:6001           1st Qu.:4553           1st Qu.:-0.6000       
##  Median :6022           Median :4582           Median :-0.3000       
##  Mean   :6028           Mean   :4556           Mean   :-0.1642       
##  3rd Qu.:6050           3rd Qu.:4610           3rd Qu.: 0.0000       
##  Max.   :6146           Max.   :4759           Max.   : 3.6000       
##                                                                      
##  ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
##  Min.   : 0.000         Min.   :0.000          Min.   : 0.000        
##  1st Qu.: 3.000         1st Qu.:2.000          1st Qu.: 4.000        
##  Median : 5.000         Median :3.000          Median : 8.000        
##  Mean   : 5.406         Mean   :3.017          Mean   : 8.834        
##  3rd Qu.: 8.000         3rd Qu.:4.000          3rd Qu.:14.000        
##  Max.   :12.000         Max.   :6.000          Max.   :23.000        
##  NA's   :1              NA's   :1              NA's   :1             
##  ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
##  Min.   :   0           Min.   :   0           Min.   :   0          
##  1st Qu.:4832           1st Qu.:6020           1st Qu.:4560          
##  Median :4855           Median :6047           Median :4587          
##  Mean   :4828           Mean   :6016           Mean   :4563          
##  3rd Qu.:4877           3rd Qu.:6070           3rd Qu.:4609          
##  Max.   :4990           Max.   :6161           Max.   :4710          
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
##  Min.   : 0.000         Min.   : 0.00          Min.   : 0.000        
##  1st Qu.: 0.000         1st Qu.:19.70          1st Qu.: 8.800        
##  Median :10.400         Median :19.90          Median : 9.100        
##  Mean   : 6.592         Mean   :20.01          Mean   : 9.161        
##  3rd Qu.:10.750         3rd Qu.:20.40          3rd Qu.: 9.700        
##  Max.   :11.500         Max.   :22.00          Max.   :11.200        
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
##  Min.   : 0.00          Min.   :143.0          Min.   :56.00         
##  1st Qu.:70.10          1st Qu.:155.0          1st Qu.:62.00         
##  Median :70.80          Median :158.0          Median :64.00         
##  Mean   :70.18          Mean   :158.5          Mean   :63.54         
##  3rd Qu.:71.40          3rd Qu.:162.0          3rd Qu.:65.00         
##  Max.   :72.50          Max.   :173.0          Max.   :70.00         
##  NA's   :5                                     NA's   :5             
##  ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
##  Min.   :2.300          Min.   :463.0          Min.   :0.01700       
##  1st Qu.:2.500          1st Qu.:490.0          1st Qu.:0.01900       
##  Median :2.500          Median :495.0          Median :0.02000       
##  Mean   :2.494          Mean   :495.6          Mean   :0.01957       
##  3rd Qu.:2.500          3rd Qu.:501.5          3rd Qu.:0.02000       
##  Max.   :2.600          Max.   :522.0          Max.   :0.02200       
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
##  Min.   :0.000          Min.   :0.000          Min.   :0.000         
##  1st Qu.:0.700          1st Qu.:2.000          1st Qu.:7.100         
##  Median :1.000          Median :3.000          Median :7.200         
##  Mean   :1.014          Mean   :2.534          Mean   :6.851         
##  3rd Qu.:1.300          3rd Qu.:3.000          3rd Qu.:7.300         
##  Max.   :2.300          Max.   :3.000          Max.   :7.500         
##                                                                      
##  ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
##  Min.   :0.00000        Min.   :0.00000        Min.   : 0.00         
##  1st Qu.:0.00000        1st Qu.:0.00000        1st Qu.:11.40         
##  Median :0.00000        Median :0.00000        Median :11.60         
##  Mean   :0.01771        Mean   :0.02371        Mean   :11.21         
##  3rd Qu.:0.00000        3rd Qu.:0.00000        3rd Qu.:11.70         
##  Max.   :0.10000        Max.   :0.20000        Max.   :12.10         
##  NA's   :1              NA's   :1                                    
##  ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
##  Min.   : 0.0000        Min.   :0.000          Min.   :0.000         
##  1st Qu.: 0.6000        1st Qu.:1.800          1st Qu.:2.100         
##  Median : 0.8000        Median :1.900          Median :2.200         
##  Mean   : 0.9119        Mean   :1.805          Mean   :2.138         
##  3rd Qu.: 1.0250        3rd Qu.:1.900          3rd Qu.:2.300         
##  Max.   :11.0000        Max.   :2.100          Max.   :2.600         
##

B. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sec 3.8).

Finding the missing values:

# count missing values per column (named na.counts to avoid masking base::is.na)
na.counts <- sort(colSums(is.na(ChemicalManufacturingProcess)))
na.counts[na.counts > 0]

## ManufacturingProcess01 ManufacturingProcess04 ManufacturingProcess05 
##                      1                      1                      1 
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess12 
##                      1                      1                      1 
## ManufacturingProcess14 ManufacturingProcess22 ManufacturingProcess23 
##                      1                      1                      1 
## ManufacturingProcess24 ManufacturingProcess40 ManufacturingProcess41 
##                      1                      1                      1 
## ManufacturingProcess06 ManufacturingProcess02 ManufacturingProcess25 
##                      2                      3                      5 
## ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28 
##                      5                      5                      5 
## ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31 
##                      5                      5                      5 
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 
##                      5                      5                      5 
## ManufacturingProcess36 ManufacturingProcess10 ManufacturingProcess11 
##                      5                      9                     10 
## ManufacturingProcess03 
##                     15

Using a KNN imputation function to fill in the missing values:

#using library(DMwR)
knn.df <- knnImputation(ChemicalManufacturingProcess[, 1:57], k = 3, meth = "weighAvg")
anyNA(knn.df)

## [1] FALSE

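anyNA() returns FALSE, so no missing values remain after KNN imputation. Note that the column subset [, 1:57] drops the last of the 58 columns, ManufacturingProcess45, so that predictor is excluded from the rest of the analysis.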
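To show the idea behind KNN imputation, here is a minimal base-R sketch that fills one missing cell from its nearest complete rows. It is not DMwR's implementation. The toy data, the helper name impute_knn_cell, and the exponentially decaying weights (chosen in the spirit of meth = "weighAvg") are assumptions for illustration.

```r
# Toy data with one missing cell; column names are hypothetical.
toy <- data.frame(
  a = c(1.0, 1.1, 0.9, 5.0, 5.2),
  b = c(10, 11, 9, 50, 52),
  c = c(2.0, 2.2, NA, 8.0, 8.4)
)

impute_knn_cell <- function(df, row, col, k = 3) {
  others   <- setdiff(names(df), col)
  complete <- which(!is.na(df[[col]]))           # rows that can act as neighbours
  scaled   <- scale(df[, others, drop = FALSE])  # put the other columns on one scale
  # Euclidean distance from the target row to each candidate neighbour.
  diffs <- sweep(scaled[complete, , drop = FALSE], 2, scaled[row, ])
  d <- sqrt(rowSums(diffs^2))
  nearest <- order(d)[seq_len(min(k, length(d)))]
  w <- exp(-d[nearest])                          # closer neighbours get more weight
  sum(w * df[[col]][complete[nearest]]) / sum(w)
}

toy$c[3] <- impute_knn_cell(toy, row = 3, col = "c", k = 2)
toy
```

Row 3 sits close to rows 1 and 2, so its imputed value falls between their values of 2.0 and 2.2 instead of being pulled toward the much larger values in rows 4 and 5.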

C. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

set.seed(777)

near_zero <- nearZeroVar(knn.df)
knn.df <- knn.df[,-near_zero]
library(caret)
inTraining <- createDataPartition(knn.df$Yield, p = 0.80, list=FALSE)
training <- knn.df[ inTraining,]
testing <- knn.df[-inTraining,]

X <- training[,2:(length(training))]
Y <- training$Yield

X_test <- testing[,2:(length(testing))]
Y_test <- testing$Yield

Using 10-fold cross-validation, repeated 10 times:

fitControl <- trainControl(method = "repeatedcv",
                          number = 10,
                          repeats = 10)
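For intuition, repeated 10-fold cross-validation can be sketched in base R as a set of random fold assignments, one per repeat. This is only an illustration. caret builds its folds differently (it balances them on the outcome), and n here is an assumed training-set size.

```r
set.seed(777)
n       <- 144  # assumed number of training samples
k       <- 10   # folds per repeat
repeats <- 10   # number of repeats

# One column per repeat; each entry is the fold (1..k) a sample is held out in.
fold_ids <- replicate(repeats, sample(rep(seq_len(k), length.out = n)))

dim(fold_ids)         # n rows, one column per repeat
table(fold_ids[, 1])  # fold sizes in the first repeat
k * repeats           # model fits per candidate value of ncomp
```

Each candidate number of components is therefore fitted 100 times, and the reported RMSE and Rsquared are averages over those held-out folds.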

Choosing PLS as our model:

model.pls <- train(X, Y, method='pls', metric='RMSE',
                   tuneLength=20, trControl = fitControl,
                   preProcess= c('center','scale'))

model.pls

## Partial Least Squares 
## 
## 144 samples
##  55 predictor
## 
## Pre-processing: centered (55), scaled (55) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 131, 131, 129, 129, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.454508  0.4230447  1.175809
##    2     1.702412  0.4816284  1.170186
##    3     1.416921  0.5408574  1.073509
##    4     1.698803  0.5284711  1.153657
##    5     2.004785  0.4983074  1.254895
##    6     2.072503  0.4920976  1.280428
##    7     2.172860  0.4936115  1.305929
##    8     2.192098  0.5020414  1.312443
##    9     2.123742  0.5183810  1.280424
##   10     2.125281  0.5145890  1.278942
##   11     2.091210  0.4995785  1.284465
##   12     2.107278  0.4940098  1.297461
##   13     2.124265  0.4912011  1.303125
##   14     2.126824  0.4892324  1.303143
##   15     2.129470  0.4935038  1.302971
##   16     2.176529  0.4937558  1.314730
##   17     2.263813  0.4875490  1.339907
##   18     2.278907  0.4818896  1.344732
##   19     2.332034  0.4750830  1.365788
##   20     2.424879  0.4699668  1.396539
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.

#plot the model
plot(model.pls)

RMSE was used to select the optimal model (smallest value). The final model uses ncomp = 3, which gives a resampled RMSE of 1.416921 and an Rsquared of 0.5408574.
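The selection rule can be reproduced by hand from the resampling table. The sketch below copies the first six rows of that table into a data frame and picks the row with the smallest RMSE. It only mimics the "smallest value" rule that train reports and makes no claim about caret's internals.

```r
# First six rows of the resampling table reported above.
cv <- data.frame(
  ncomp    = 1:6,
  RMSE     = c(1.454508, 1.702412, 1.416921, 1.698803, 2.004785, 2.072503),
  Rsquared = c(0.4230447, 0.4816284, 0.5408574, 0.5284711, 0.4983074, 0.4920976)
)

# Row with the smallest resampled RMSE.
best <- cv[which.min(cv$RMSE), ]
best

# Number of components with the largest resampled R-squared, for comparison.
best_r2_ncomp <- cv$ncomp[which.max(cv$Rsquared)]
best_r2_ncomp
```

Both criteria point to three components here, so the choice does not depend on which metric is optimized.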

D. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

pls.pred <- predict(model.pls, X_test)
postResample(pls.pred, Y_test)

##      RMSE  Rsquared       MAE 
## 1.2631505 0.6179720 0.9834409

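On the test set, RMSE is 1.263, compared with the resampled estimate of 1.417 on the training set, and Rsquared rises from 0.541 to 0.618. Test-set performance is therefore somewhat better than the cross-validated estimate, although with only 32 test samples part of this difference may be due to chance.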

E. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

model.pls$finalModel$coefficients

## , , 1 comps
## 
##                            .outcome
## BiologicalMaterial01    0.078530963
## BiologicalMaterial02    0.104634121
## BiologicalMaterial03    0.093245060
## BiologicalMaterial04    0.078967650
## BiologicalMaterial05    0.030481271
## BiologicalMaterial06    0.099600843
## BiologicalMaterial08    0.082325119
## BiologicalMaterial09    0.031353267
## BiologicalMaterial10    0.050221655
## BiologicalMaterial11    0.076941108
## BiologicalMaterial12    0.075616456
## ManufacturingProcess01 -0.029099303
## ManufacturingProcess02 -0.055710562
## ManufacturingProcess03 -0.011817223
## ManufacturingProcess04 -0.054566522
## ManufacturingProcess05  0.013594199
## ManufacturingProcess06  0.081175221
## ManufacturingProcess07 -0.011699158
## ManufacturingProcess08  0.003586114
## ManufacturingProcess09  0.099166693
## ManufacturingProcess10  0.038070848
## ManufacturingProcess11  0.065679799
## ManufacturingProcess12  0.070631571
## ManufacturingProcess13 -0.095961431
## ManufacturingProcess14  0.005942549
## ManufacturingProcess15  0.053627471
## ManufacturingProcess16 -0.015213406
## ManufacturingProcess17 -0.075380538
## ManufacturingProcess18 -0.011866856
## ManufacturingProcess19  0.038871157
## ManufacturingProcess20 -0.013308912
## ManufacturingProcess21  0.003018978
## ManufacturingProcess22 -0.001045175
## ManufacturingProcess23 -0.017620798
## ManufacturingProcess24 -0.052867152
## ManufacturingProcess25  0.003005819
## ManufacturingProcess26  0.008758739
## ManufacturingProcess27  0.002253490
## ManufacturingProcess28  0.065013133
## ManufacturingProcess29  0.032993017
## ManufacturingProcess30  0.042162167
## ManufacturingProcess31 -0.013131067
## ManufacturingProcess32  0.127059217
## ManufacturingProcess33  0.087046825
## ManufacturingProcess34  0.041990777
## ManufacturingProcess35 -0.034549158
## ManufacturingProcess36 -0.112875142
## ManufacturingProcess37 -0.028634487
## ManufacturingProcess38 -0.013227767
## ManufacturingProcess39  0.011605692
## ManufacturingProcess40 -0.002240412
## ManufacturingProcess41  0.002356966
## ManufacturingProcess42  0.006264572
## ManufacturingProcess43  0.034045175
## ManufacturingProcess44  0.024515530
## 
## , , 2 comps
## 
##                            .outcome
## BiologicalMaterial01    0.030688616
## BiologicalMaterial02    0.085145791
## BiologicalMaterial03    0.092815276
## BiologicalMaterial04    0.043547324
## BiologicalMaterial05    0.001395865
## BiologicalMaterial06    0.077704608
## BiologicalMaterial08    0.026762212
## BiologicalMaterial09   -0.005868124
## BiologicalMaterial10   -0.010402914
## BiologicalMaterial11    0.020981192
## BiologicalMaterial12    0.018578149
## ManufacturingProcess01 -0.015406532
## ManufacturingProcess02 -0.005185489
## ManufacturingProcess03 -0.040264483
## ManufacturingProcess04 -0.022738149
## ManufacturingProcess05 -0.037608382
## ManufacturingProcess06  0.183716444
## ManufacturingProcess07 -0.031163560
## ManufacturingProcess08  0.033537823
## ManufacturingProcess09  0.232184301
## ManufacturingProcess10  0.051625745
## ManufacturingProcess11  0.136528666
## ManufacturingProcess12  0.142999418
## ManufacturingProcess13 -0.238577370
## ManufacturingProcess14  0.009570907
## ManufacturingProcess15  0.090801378
## ManufacturingProcess16 -0.052376555
## ManufacturingProcess17 -0.246839174
## ManufacturingProcess18  0.019070708
## ManufacturingProcess19  0.012203917
## ManufacturingProcess20  0.015374911
## ManufacturingProcess21 -0.088680330
## ManufacturingProcess22  0.023095695
## ManufacturingProcess23  0.012559835
## ManufacturingProcess24 -0.062175258
## ManufacturingProcess25 -0.023212145
## ManufacturingProcess26 -0.010197069
## ManufacturingProcess27 -0.022719954
## ManufacturingProcess28  0.013768644
## ManufacturingProcess29  0.022047807
## ManufacturingProcess30  0.079501894
## ManufacturingProcess31 -0.044143951
## ManufacturingProcess32  0.239946586
## ManufacturingProcess33  0.122286314
## ManufacturingProcess34  0.145481540
## ManufacturingProcess35 -0.054925500
## ManufacturingProcess36 -0.208145089
## ManufacturingProcess37 -0.106952621
## ManufacturingProcess38  0.003352090
## ManufacturingProcess39  0.068447790
## ManufacturingProcess40 -0.008064720
## ManufacturingProcess41 -0.008512968
## ManufacturingProcess42  0.062499296
## ManufacturingProcess43  0.053040744
## ManufacturingProcess44  0.097587496
## 
## , , 3 comps
## 
##                            .outcome
## BiologicalMaterial01    0.023247662
## BiologicalMaterial02    0.082135176
## BiologicalMaterial03    0.080129675
## BiologicalMaterial04    0.049456439
## BiologicalMaterial05    0.031421636
## BiologicalMaterial06    0.065087490
## BiologicalMaterial08    0.003243837
## BiologicalMaterial09   -0.048286111
## BiologicalMaterial10   -0.015402768
## BiologicalMaterial11   -0.016718416
## BiologicalMaterial12   -0.026525383
## ManufacturingProcess01  0.001171243
## ManufacturingProcess02 -0.017067281
## ManufacturingProcess03 -0.022789238
## ManufacturingProcess04  0.046043122
## ManufacturingProcess05 -0.051037200
## ManufacturingProcess06  0.201923064
## ManufacturingProcess07 -0.046134058
## ManufacturingProcess08  0.057343264
## ManufacturingProcess09  0.240672868
## ManufacturingProcess10  0.024136039
## ManufacturingProcess11  0.122814370
## ManufacturingProcess12  0.100974552
## ManufacturingProcess13 -0.238496904
## ManufacturingProcess14  0.061852767
## ManufacturingProcess15  0.160962614
## ManufacturingProcess16 -0.020454811
## ManufacturingProcess17 -0.262948893
## ManufacturingProcess18  0.054710472
## ManufacturingProcess19  0.080875391
## ManufacturingProcess20  0.051238664
## ManufacturingProcess21 -0.114858537
## ManufacturingProcess22  0.026812178
## ManufacturingProcess23  0.020052349
## ManufacturingProcess24 -0.070976182
## ManufacturingProcess25 -0.006460317
## ManufacturingProcess26  0.009886727
## ManufacturingProcess27 -0.003634942
## ManufacturingProcess28 -0.028296286
## ManufacturingProcess29  0.056370890
## ManufacturingProcess30  0.083480715
## ManufacturingProcess31 -0.034691698
## ManufacturingProcess32  0.341128410
## ManufacturingProcess33  0.162000325
## ManufacturingProcess34  0.207640599
## ManufacturingProcess35 -0.051745169
## ManufacturingProcess36 -0.275998626
## ManufacturingProcess37 -0.162223255
## ManufacturingProcess38  0.001280808
## ManufacturingProcess39  0.099142288
## ManufacturingProcess40 -0.022535565
## ManufacturingProcess41 -0.031243256
## ManufacturingProcess42  0.093432000
## ManufacturingProcess43  0.059184631
## ManufacturingProcess44  0.128894622

#important variables
varImp(model.pls)

## pls variable importance
## 
##   only 20 most important variables shown (out of 55)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess36   86.15
## ManufacturingProcess09   82.59
## ManufacturingProcess13   82.24
## ManufacturingProcess17   77.17
## ManufacturingProcess06   66.33
## BiologicalMaterial02     61.84
## BiologicalMaterial06     59.97
## BiologicalMaterial08     57.53
## ManufacturingProcess33   56.99
## BiologicalMaterial11     55.29
## ManufacturingProcess12   55.28
## BiologicalMaterial12     55.17
## BiologicalMaterial01     52.79
## BiologicalMaterial03     51.80
## ManufacturingProcess11   50.45
## BiologicalMaterial04     50.37
## ManufacturingProcess28   47.62
## ManufacturingProcess34   46.23
## ManufacturingProcess02   40.31

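From the output above, ManufacturingProcess32 has the largest coefficient in the three-component model (0.341128410) and also ranks first in variable importance. The process predictors dominate the list: the six most important variables are all manufacturing process predictors, and 12 of the top 20 are process predictors versus 8 biological ones.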

F. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

In our PLS model, ManufacturingProcess32 has the largest positive coefficient (0.341128410), followed in magnitude by ManufacturingProcess36, whose coefficient is negative (-0.275998626). Some of the biological materials also have positive or negative coefficients, though smaller ones. Because the predictors were centered and scaled before fitting, the coefficients can be compared across predictors. To improve yield, the process settings with positive coefficients should be increased and those with negative coefficients decreased, within the limits the process allows.
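To make the ranking explicit, the sketch below sorts a handful of the three-component coefficients by absolute size and records the sign of each. The values are copied from the coefficient output above. The "raise"/"lower" labels are only a reading of the coefficient sign under this model, not evidence of a causal effect.

```r
# Selected coefficients from the 3-component PLS model (copied from the output above).
coefs <- c(
  ManufacturingProcess32 =  0.341128410,
  ManufacturingProcess36 = -0.275998626,
  ManufacturingProcess17 = -0.262948893,
  ManufacturingProcess09 =  0.240672868,
  ManufacturingProcess13 = -0.238496904,
  ManufacturingProcess34 =  0.207640599,
  ManufacturingProcess06 =  0.201923064,
  BiologicalMaterial02   =  0.082135176,
  BiologicalMaterial03   =  0.080129675
)

# Order by magnitude, largest first, and attach the direction suggested by the sign.
ranked <- coefs[order(abs(coefs), decreasing = TRUE)]
data.frame(
  coefficient = ranked,
  direction   = ifelse(ranked > 0, "raise", "lower")
)
```

A natural next step would be to plot each of these predictors against Yield to check that the relationship is roughly linear before changing any process setting.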

In general, though the biological materials cannot be changed during the refinement process, identifying which ingredients/materials are more vital will help to ensure a higher yield as the company can focus on obtaining high quality ingredients of those materials. Likewise, knowing the most important manufacturing process steps allows the company to pinpoint where they can start fine tuning the procedure.