CUNY 624 Homework 7 Student: Joel Park

6.2 Developing a model to predict permeability could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

A. Start R and use these commands to load the data.

library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

According to ?permeability, this data set was used to develop a model for predicting compounds’ permeability.

First, we examine the underlying structure of the fingerprints and permeability objects.

paste0("Fingerprints Matrix: ")
## [1] "Fingerprints Matrix: "
str(fingerprints)
##  num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...
paste0("Permeability Response: ")
## [1] "Permeability Response: "
str(permeability)
##  num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr "permeability"

B. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

?nearZeroVar: Diagnoses predictors that have one unique value (i.e. are zero variance predictors) or predictors that have both of the following characteristics: they have very few unique values relative to the number of samples, and the ratio of the frequency of the most common value to the frequency of the second most common value is large.
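
The two criteria described above can be sketched in base R. This is only an illustration, not caret's implementation: the helper is_near_zero_var and the toy matrix are hypothetical, and the cutoffs are assumed to mirror the nearZeroVar defaults (freqCut = 95/5, uniqueCut = 10).

```r
# Hypothetical helper: flags a column using the two criteria described above.
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) == 1) return(TRUE)            # only one unique value: zero variance
  freq_ratio <- counts[[1]] / counts[[2]]          # most common vs. second most common
  pct_unique <- 100 * length(counts) / length(x)   # unique values relative to sample size
  freq_ratio > freq_cut && pct_unique <= unique_cut
}

# Toy binary "fingerprints" (made-up data, 100 samples each).
toy <- cbind(
  rare     = c(rep(0, 99), 1),   # 99:1 frequency ratio
  balanced = rep(c(0, 1), 50),   # 1:1 frequency ratio
  constant = rep(1, 100)         # a single unique value
)

flags <- apply(toy, 2, is_near_zero_var)
print(flags)
```

In this toy example the rare and constant columns are flagged, while the balanced column is kept.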

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
near_zero_var <- nearZeroVar(fingerprints)
near_zero_var
##   [1]    7    8    9   10   13   14   17   18   19   22   23   24   30   31
##  [15]   32   33   34   45   77   81   82   83   84   85   89   90   91   92
##  [29]   95  100  104  105  106  107  109  110  112  113  114  115  116  117
##  [43]  119  120  122  123  124  128  131  132  134  135  136  137  139  140
##  [57]  144  145  147  148  149  151  155  160  161  164  165  166  216  217
##  [71]  218  219  220  222  243  252  259  273  275  277  282  283  287  288
##  [85]  289  292  346  347  348  349  350  351  352  353  354  363  364  365
##  [99]  369  375  379  384  391  393  397  399  402  404  405  407  408  409
## [113]  410  411  412  413  414  415  416  417  418  419  420  421  422  423
## [127]  424  425  426  427  428  429  430  431  432  433  434  435  436  437
## [141]  438  439  440  441  442  443  444  445  446  447  448  449  450  451
## [155]  452  453  454  455  456  457  458  459  460  461  462  463  464  465
## [169]  466  467  468  469  470  471  472  473  474  475  476  477  478  479
## [183]  480  481  482  483  484  485  486  487  488  489  490  491  492  493
## [197]  494  495  498  500  501  502  513  523  525  526  527  528  530  531
## [211]  532  533  534  535  536  537  538  539  540  541  542  543  544  545
## [225]  546  547  548  550  552  555  562  563  564  566  567  569  570  572
## [239]  575  578  579  580  581  582  583  584  585  586  587  588  589  596
## [253]  605  606  607  608  609  610  611  612  614  615  616  617  618  619
## [267]  620  622  623  624  625  626  627  628  629  630  631  632  633  634
## [281]  635  636  637  638  639  640  641  642  643  644  645  646  647  648
## [295]  649  650  651  652  653  654  655  656  657  658  659  660  661  662
## [309]  663  664  665  666  667  668  669  670  671  672  673  674  675  676
## [323]  677  678  680  681  682  683  684  685  686  687  688  689  690  691
## [337]  692  693  694  695  696  697  706  707  708  709  710  711  712  713
## [351]  714  715  716  717  718  720  721  722  723  724  725  726  727  728
## [365]  729  730  731  734  735  736  737  738  739  740  741  742  743  744
## [379]  745  746  747  748  749  756  757  758  759  760  761  762  763  764
## [393]  765  766  767  768  769  770  771  772  777  778  779  781  783  784
## [407]  785  786  787  788  789  790  791  794  796  797  799  802  803  804
## [421]  807  808  809  810  811  814  815  816  817  818  819  820  821  822
## [435]  823  824  825  826  827  828  829  830  831  832  833  834  835  836
## [449]  837  838  839  840  841  842  843  844  845  846  847  848  849  850
## [463]  851  852  853  854  855  856  857  858  859  860  861  862  863  864
## [477]  865  866  867  868  869  870  871  872  873  874  875  876  877  878
## [491]  879  880  881  882  883  884  885  886  887  888  889  890  891  892
## [505]  893  894  895  896  897  898  899  900  901  902  903  904  905  906
## [519]  907  908  909  910  911  912  913  914  915  916  917  918  919  920
## [533]  921  922  923  924  925  926  927  928  929  930  931  932  933  934
## [547]  935  936  937  938  939  940  941  942  943  944  945  946  947  948
## [561]  949  950  951  952  953  954  955  956  957  958  959  960  961  962
## [575]  963  964  965  966  967  968  969  970  971  972  973  974  975  976
## [589]  977  978  979  980  981  982  983  984  985  986  987  988  989  990
## [603]  991  992  993  994  995  996  997  998  999 1000 1001 1002 1003 1004
## [617] 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018
## [631] 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032
## [645] 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046
## [659] 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060
## [673] 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074
## [687] 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088
## [701] 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102
## [715] 1103 1104 1105 1106 1107
paste0("We have identified ", length(near_zero_var), " predictors with zero or near-zero variance.")
## [1] "We have identified 719 predictors with zero or near-zero variance."

These predictors can be removed from the matrix, as they are unlikely to contribute to the model.

fingerprints_1 <- fingerprints[,-near_zero_var]
str(fingerprints_1)
##  num [1:165, 1:388] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:388] "X1" "X2" "X3" "X4" ...
paste0("There are ", ncol(fingerprints_1), " variables left.")
## [1] "There are 388 variables left."

C. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?

Referencing 6.5 “Computing”:

Before we train a model, we need to create training and testing data sets.

# Before splitting the data and building models, combine the response variable with the predictors.
fingerprints_1 <- as.data.frame(fingerprints)
fingerprints_1$permeability <- as.vector(permeability)

# Identify zero and near-zero variance predictors, which can destabilize the models and contribute little.
z_var <- nearZeroVar(fingerprints_1)

# Now that we have identified the zero and near-zero variance predictors,
# let us eliminate them from our dataset.
fingerprints_1 <- fingerprints_1[,-z_var]
# Simple random 80/20 split into training and testing data sets.
# Reference: https://stackoverflow.com/questions/17200114/how-to-split-data-into-training-testing-sets-using-sample-function
# Note: no seed is set before sample(), so the split differs from run to run.
smp_size <- floor(0.8 * nrow(fingerprints_1))
train_ind <- sample(seq_len(nrow(fingerprints_1)), size = smp_size)

trainingData <- fingerprints_1[train_ind,]
testingData <- fingerprints_1[-train_ind,]
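
Because sample() is random, a fixed seed is one way to make a split like this reproducible. Below is a minimal sketch on a stand-in data frame; the seed value 624 and the object names toy, train_toy and test_toy are hypothetical and not part of the analysis above.

```r
set.seed(624)                    # hypothetical seed; any fixed integer works
toy <- data.frame(id = 1:165)    # stand-in for the 165 compounds

n_train   <- floor(0.8 * nrow(toy))
train_idx <- sample(seq_len(nrow(toy)), size = n_train)

train_toy <- toy[train_idx, , drop = FALSE]
test_toy  <- toy[-train_idx, , drop = FALSE]

# Resetting the same seed reproduces exactly the same indices.
set.seed(624)
train_idx_again <- sample(seq_len(nrow(toy)), size = n_train)

c(train = nrow(train_toy), test = nrow(test_toy))
```

With 165 rows, an 80/20 split of this kind gives 132 training rows and 33 testing rows.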

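Now that the data is split into training and testing datasets, we can fit a partial least squares model. As a first look, the plsr call below is fit on the full filtered data set (fingerprints_1, all 165 compounds) rather than on the training set alone.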

The pls package has functions for PLS and PCR. SIMPLS and the Rännar algorithm are both available. The plsr function can be used very much like the lm function. The number of components can be fixed using the ncomp argument or, if left to the default, the maximum number of components will be calculated. The plsr function has options for either K-fold or leave-one-out cross-validation (via the validation argument) and for the PLS algorithm to use, such as SIMPLS (via the method argument). Predictions on new samples can be calculated using the predict function.
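
The arguments described above can be tried on a small simulated data set. This is a sketch under stated assumptions: the data frame toy and its coefficients are made up for illustration, and the choice of 4 components is arbitrary.

```r
library(pls)

set.seed(1)
# Made-up data: 40 samples, 6 predictors (X1..X6), response driven by X1 and X2.
toy <- data.frame(matrix(rnorm(40 * 6), nrow = 40))
toy$y <- toy$X1 - 2 * toy$X2 + rnorm(40, sd = 0.1)

# Fix the number of components, request K-fold cross-validation,
# and choose the SIMPLS algorithm.
fit <- plsr(y ~ ., data = toy, ncomp = 4,
            validation = "CV", method = "simpls")

# Cross-validated RMSEP for 0 to 4 components.
cv_rmsep <- RMSEP(fit, estimate = "CV")
print(cv_rmsep)

# Predictions for (here) the first five rows, using two components.
pred <- predict(fit, newdata = toy[1:5, ], ncomp = 2)
dim(pred)
```

Note that predict on a plsr fit returns a three-dimensional array (samples by responses by requested component counts), so it usually needs drop() or as.vector() before further use.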

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
plsFit <- plsr(permeability ~ ., data = fingerprints_1)
summary(plsFit)
## Data:    X dimension: 165 388 
##  Y dimension: 165 1
## Fit method: kernelpls
## Number of components considered: 164
## TRAINING: % variance explained
##               1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X               29.17    42.33    49.34    52.71    60.37    66.18
## permeability    26.69    47.90    54.71    62.75    66.37    70.18
##               7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## X               68.87    71.80    73.87     75.68     78.30     80.31
## permeability    72.85    74.36    76.34     78.09     79.14     80.26
##               13 comps  14 comps  15 comps  16 comps  17 comps  18 comps
## X                82.70     84.13     85.17     86.76     87.79     89.11
## permeability     81.06     82.04     83.08     83.80     84.79     85.39
##               19 comps  20 comps  21 comps  22 comps  23 comps  24 comps
## X                90.06      90.8     91.34     91.91     92.52     92.84
## permeability     85.94      86.4     86.92     87.29     87.59     88.11
##               25 comps  26 comps  27 comps  28 comps  29 comps  30 comps
## X                93.39     93.88     94.14     94.48     94.70     94.95
## permeability     88.44     88.77     89.15     89.39     89.72     89.94
##               31 comps  32 comps  33 comps  34 comps  35 comps  36 comps
## X                95.18     95.45     95.69     95.93     96.19     96.44
## permeability     90.09     90.20     90.32     90.44     90.52     90.59
##               37 comps  38 comps  39 comps  40 comps  41 comps  42 comps
## X                96.64     96.84     97.08     97.22     97.38     97.58
## permeability     90.68     90.74     90.78     90.84     90.88     90.90
##               43 comps  44 comps  45 comps  46 comps  47 comps  48 comps
## X                97.72     97.85     97.97     98.09     98.17     98.29
## permeability     90.93     90.96     90.99     91.01     91.03     91.04
##               49 comps  50 comps  51 comps  52 comps  53 comps  54 comps
## X                98.40     98.49     98.55     98.64      98.7     98.79
## permeability     91.06     91.07     91.08     91.09      91.1     91.10
##               55 comps  56 comps  57 comps  58 comps  59 comps  60 comps
## X                98.85     98.91     98.97     99.02     99.10     99.15
## permeability     91.11     91.11     91.12     91.13     91.13     91.14
##               61 comps  62 comps  63 comps  64 comps  65 comps  66 comps
## X                99.19     99.25     99.29     99.34     99.38     99.43
## permeability     91.14     91.15     91.15     91.16     91.16     91.16
##               67 comps  68 comps  69 comps  70 comps  71 comps  72 comps
## X                99.46     99.49     99.53     99.57     99.60     99.62
## permeability     91.16     91.16     91.16     91.16     91.16     91.16
##               73 comps  74 comps  75 comps  76 comps  77 comps  78 comps
## X                99.64     99.67     99.69     99.72     99.74     99.76
## permeability     91.17     91.17     91.17     91.17     91.17     91.17
##               79 comps  80 comps  81 comps  82 comps  83 comps  84 comps
## X                99.78     99.79     99.80     99.82     99.83     99.86
## permeability     91.17     91.17     91.17     91.17     91.17     91.17
##               85 comps  86 comps  87 comps  88 comps  89 comps  90 comps
## X                99.87     99.89     99.90     99.90     99.91     99.92
## permeability     91.17     91.17     91.17     91.17     91.17     91.17
##               91 comps  92 comps  93 comps  94 comps  95 comps  96 comps
## X                99.93     99.94     99.95     99.95     99.96     99.97
## permeability     91.17     91.17     91.17     91.17     91.17     91.17
##               97 comps  98 comps  99 comps  100 comps  101 comps
## X                99.97     99.98     99.98      99.99      99.99
## permeability     91.17     91.17     91.17      91.17      91.17
##               102 comps  103 comps  104 comps  105 comps  106 comps
## X                 99.99     100.00     100.00     100.00     100.25
## permeability      91.17      91.17      91.17      91.17      91.17
##               107 comps  108 comps  109 comps  110 comps  111 comps
## X                100.51     100.76     101.02     101.27     101.52
## permeability      91.17      91.17      91.17      91.17      91.17
##               112 comps  113 comps  114 comps  115 comps  116 comps
## X                101.78     102.03     102.28     102.54     102.79
## permeability      91.17      91.17      91.17      91.17      91.17
##               117 comps  118 comps  119 comps  120 comps  121 comps
## X                103.05     103.30     103.55     103.81     104.06
## permeability      91.17      91.17      91.17      91.17      91.17
##               122 comps  123 comps  124 comps  125 comps  126 comps
## X                104.32     104.57     104.82     105.08     105.33
## permeability      91.17      91.17      91.17      91.17      91.17
##               127 comps  128 comps  129 comps  130 comps  131 comps
## X                105.58     105.84     106.09     106.35     106.60
## permeability      91.17      91.17      91.17      91.17      91.17
##               132 comps  133 comps  134 comps  135 comps  136 comps
## X                106.85     107.11     107.36     107.62     107.87
## permeability      91.17      91.17      91.17      91.17      91.17
##               137 comps  138 comps  139 comps  140 comps  141 comps
## X                108.12     108.38     108.63     108.88     109.14
## permeability      91.17      91.17      91.17      91.17      91.17
##               142 comps  143 comps  144 comps  145 comps  146 comps
## X                109.39     109.64     109.90     110.15     110.41
## permeability      91.17      91.17      91.17      91.17      91.17
##               147 comps  148 comps  149 comps  150 comps  151 comps
## X                110.66     110.91     111.17     111.42     111.68
## permeability      91.17      91.17      91.17      91.17      91.17
##               152 comps  153 comps  154 comps  155 comps  156 comps
## X                111.93     112.18     112.44     112.69     112.94
## permeability      91.17      91.17      91.17      91.17      91.17
##               157 comps  158 comps  159 comps  160 comps  161 comps
## X                113.20     113.45     113.70     113.96     114.21
## permeability      91.17      91.17      91.17      91.17      91.17
##               162 comps  163 comps  164 comps
## X                114.47     114.72     114.97
## permeability      91.17      91.17      91.17

The fit above considered 164 components, which is too many to interpret directly. Furthermore, a single training/testing split gives only one, potentially noisy, estimate of performance. To choose the number of components, we should use 10-fold cross-validation rather than relying on the simple split alone. (The intent is to cross-validate on one portion of the data and reserve the held-out portion as a validation set; note that the train call below is passed testingData, so the resampled results that follow are based on those 33 compounds.) This is set up with the trainControl function, whose result is passed to the train function.

# 10 fold Cross validation
ctrl <- trainControl(method= "cv", number = 10)
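
To make the idea of 10-fold cross-validation concrete, here is a base R sketch of how samples can be assigned to folds. It is not caret's internal procedure; the sample size n = 132 (80% of 165) and the names fold_id, fold_sizes and analysis_sizes are assumptions for illustration.

```r
set.seed(1)
n <- 132   # hypothetical number of samples being resampled
k <- 10    # number of folds

# Assign each sample to one of k folds of nearly equal size, in random order.
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_sizes     <- table(fold_id)   # samples held out in each fold
analysis_sizes <- n - fold_sizes   # samples used to fit in each fold

print(fold_sizes)
print(analysis_sizes)
```

Each sample is held out exactly once, and every model is fit on roughly 90% of the data, which is why the "Summary of sample sizes" line printed by train lists values slightly below the total sample count.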

With the resampling scheme defined in ctrl, let us pre-process the data and tune a partial least squares model.

X <- testingData[1:(length(testingData)-1)]
plsTune <- train(X, testingData$permeability,
                 method = "pls",
                 tuneLength = 20,
                 trControl = ctrl,
                 preProc = c("center", "scale"))

plsTune
## Partial Least Squares 
## 
##  33 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 29, 30, 29, 31, 30, 30, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     11.61465  0.6060687   9.373817
##    2     12.13671  0.4793820   9.078608
##    3     13.20101  0.5627764   9.930050
##    4     13.49176  0.4795114  10.549720
##    5     14.09133  0.5767719  11.043676
##    6     14.52496  0.5156036  11.774945
##    7     14.33107  0.4252268  11.578746
##    8     14.75028  0.4471746  11.799307
##    9     14.92486  0.4288702  11.870360
##   10     15.66922  0.4649396  12.524820
##   11     16.21786  0.4452180  12.887372
##   12     17.11636  0.3482682  13.518803
##   13     17.92739  0.3343669  14.151109
##   14     18.02092  0.3973943  14.010458
##   15     18.45013  0.4274785  14.360400
##   16     18.40075  0.4455370  14.386058
##   17     18.31947  0.4250162  14.334778
##   18     18.20637  0.4074364  14.169310
##   19     18.12241  0.4108052  14.015812
##   20     18.09508  0.4123270  13.987278
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 1.

The output above shows the tuned PLS model: the smallest cross-validated RMSE occurs at ncomp = 1, which train selected as the final model.

plot(plsTune)

The plot shows the cross-validated RMSE against the number of components. Again, the ncomp = 1 model has the lowest RMSE (11.61). The corresponding resampled \(R^2\) is 0.6060687.
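
The selection rule reported by train ("RMSE was used to select the optimal model using the smallest value") can be reproduced by hand. The sketch below copies the first five rows of the resampling table printed above into a data frame; the object names results and best are hypothetical.

```r
# First five rows of the resampling results printed by plsTune above.
results <- data.frame(
  ncomp    = 1:5,
  RMSE     = c(11.61465, 12.13671, 13.20101, 13.49176, 14.09133),
  Rsquared = c(0.6060687, 0.4793820, 0.5627764, 0.4795114, 0.5767719)
)

# Pick the row with the smallest cross-validated RMSE.
best <- results[which.min(results$RMSE), ]
print(best)
```

RMSE increases from ncomp = 1 through ncomp = 5 in the table, so the first row is selected.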

D. Predict the response for the test set. What is the test set estimate of \(R^2\)?

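We will use the predict() function on the tuned model to obtain our predicted values. The call below passes ncomp = 2, but predict() on a caret train object appears to use the final tuned model (here ncomp = 1) rather than forwarding that argument, so the predictions likely reflect a single latent variable.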

# Using our testing data set
# Splitting the test set into X predictors and Y response variable
X_test <- testingData[1:(length(testingData)-1)]
Y_test <- testingData[length(testingData)]
  
perm_predict <- predict(plsTune, X_test , ncomp = 2)
# Displaying the head of the predicted values
head(perm_predict)
## [1]  5.440539  4.999530  7.394288 10.289289  7.466676  5.277277
# Displaying the full set of predicted values
print(perm_predict)
##  [1]  5.4405387  4.9995297  7.3942878 10.2892886  7.4666762  5.2772768
##  [7]  8.2565242  1.8461556 18.9523525  5.5115093 17.0050075 -0.7995239
## [13] 10.9017248 11.1658602  4.0627791  3.1270625 -1.0284262  8.6617259
## [19]  6.1622974 19.0518750 20.0552504 11.3725030 16.1423673 -6.2918250
## [25] 15.5415223  4.8723623  3.3136476 11.8699725 -2.2939271 -1.1165278
## [31]  1.7088595 11.6394563 10.6668170

Now that we have predictions for the testing data set, let us compute accuracy metrics, including the test set \(R^2\).

RMSE <- function(m, o) {
  sqrt(mean((m-o)^2))
}

paste0("RMSE for Test Dataset: ", RMSE(as.vector(perm_predict), Y_test$permeability))
## [1] "RMSE for Test Dataset: 9.6434696735637"

The R-squared value helps us understand how much of the variance in the test data the model explains. Let us calculate the value.

pls.eval = data.frame(obs = Y_test$permeability, pred=perm_predict)
defaultSummary(pls.eval)
##      RMSE  Rsquared       MAE 
## 9.6434697 0.3119753 6.7925292

As we can see, the R-squared value here is 0.3120; in other words, about 31.2% of the variance in the observed permeability is accounted for by the model's predictions.
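
For reference, the three metrics reported by defaultSummary can be computed by hand. The sketch below assumes caret's default behavior of reporting Rsquared as the squared correlation between observed and predicted values; the vectors obs and pred are made up, and pred is deliberately biased upward by 1.

```r
obs  <- c(1, 2, 3, 4, 5)   # made-up observed values
pred <- obs + 1            # made-up predictions, each too high by 1

rmse    <- sqrt(mean((pred - obs)^2))
mae     <- mean(abs(pred - obs))
r2_corr <- cor(obs, pred)^2                                       # squared correlation
r2_trad <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)     # 1 - SSE/SST

c(RMSE = rmse, Rsquared_corr = r2_corr, Rsquared_trad = r2_trad, MAE = mae)
```

The squared correlation is 1 here even though every prediction is off by one unit, whereas 1 - SSE/SST is 0.5. A correlation-based R-squared therefore measures how well predictions track the observations, not whether they are unbiased, so it is worth reading alongside RMSE.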

Below is the plot of how the fitted data compares to the actual data.

plot(Y_test$permeability, perm_predict, main="Test Dataset", xlab="Observed", ylab="Tuned PLS Predicted")
abline(0,1,col="red")

The R-squared value here reflects how the PLS algorithm builds its components. Whereas PCR creates components that maximize the variance of the predictors, even when those components contribute nothing to the response variable, PLS attempts to balance both aims. Thus, PLS should produce latent variables or components that have some correlation with the response variable.
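
The contrast between PCR and PLS can be seen on a deliberately extreme toy example. Everything below is hypothetical: x1 has a large variance but is unrelated to y, while x2 has a small variance and determines y exactly. The helper r2 computes 1 - SSE/SST on the training data.

```r
library(pls)

# Made-up data: x1 is high-variance noise, x2 is low-variance signal.
toy <- data.frame(
  x1 = rep(c(-10, 10), times = 50),
  x2 = rep(c(-1, -1, 1, 1), times = 25)
)
toy$y <- toy$x2

# One component each, no scaling, so predictor variance drives PCR.
pcr_fit <- pcr(y ~ x1 + x2, data = toy, ncomp = 1)
pls_fit <- plsr(y ~ x1 + x2, data = toy, ncomp = 1)

# Hypothetical helper: training R-squared as 1 - SSE/SST.
r2 <- function(obs, fit) 1 - sum((obs - fit)^2) / sum((obs - mean(obs))^2)

r2_pcr <- r2(toy$y, drop(fitted(pcr_fit)))
r2_pls <- r2(toy$y, drop(fitted(pls_fit)))

c(PCR = r2_pcr, PLS = r2_pls)
```

With a single component, PCR follows the high-variance x1 and explains essentially none of y, while PLS weights x2 and explains essentially all of it. Real data are rarely this clean, but the example shows why PLS often needs fewer components than PCR.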

E. Try building other models discussed in this chapter. Do you have better predictive performance?

Let us try several different models. (Of note, we will use the training and testing data sets created above, from which the zero and near-zero variance predictors have already been removed. We will also split the training data set into X and Y for simplicity.)

# Splitting training data set into X and Y
X_train <- trainingData[1:(length(trainingData)-1)]
Y_train <- trainingData$permeability

Linear Regression:

lm_perm <- train(X_train, Y_train, 
                 method = "lm",
                 trControl = ctrl,
                 preProc = c("center", "scale"))
  
summary(lm_perm)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.6510  -0.0125   0.0000   0.2800  14.5205 
## 
## Coefficients: (292 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.39358    0.63939  20.947   <2e-16 ***
## X1           22.12565   10.11885   2.187   0.0355 *  
## X2           14.03468   29.03659   0.483   0.6319    
## X3           14.17997   21.86138   0.649   0.5208    
## X4                 NA         NA      NA       NA    
## X5                 NA         NA      NA       NA    
## X6            3.27877    1.23117   2.663   0.0116 *  
## X11          -0.74092    5.10234  -0.145   0.8854    
## X12           4.17457    5.71157   0.731   0.4697    
## X15          -4.94277    4.98079  -0.992   0.3278    
## X16           7.80708    3.43581   2.272   0.0293 *  
## X20                NA         NA      NA       NA    
## X21                NA         NA      NA       NA    
## X25          -5.83852    3.82140  -1.528   0.1355    
## X26                NA         NA      NA       NA    
## X27                NA         NA      NA       NA    
## X28                NA         NA      NA       NA    
## X29                NA         NA      NA       NA    
## X35          -3.51088   11.11660  -0.316   0.7540    
## X36          -0.81162    6.93560  -0.117   0.9075    
## X37                NA         NA      NA       NA    
## X38                NA         NA      NA       NA    
## X39                NA         NA      NA       NA    
## X40                NA         NA      NA       NA    
## X41          -8.17118    9.11818  -0.896   0.3763    
## X42         -11.42802   14.35742  -0.796   0.4314    
## X43                NA         NA      NA       NA    
## X44                NA         NA      NA       NA    
## X46                NA         NA      NA       NA    
## X47                NA         NA      NA       NA    
## X48         -11.60800   20.55294  -0.565   0.5758    
## X49                NA         NA      NA       NA    
## X50                NA         NA      NA       NA    
## X51                NA         NA      NA       NA    
## X52                NA         NA      NA       NA    
## X53                NA         NA      NA       NA    
## X54                NA         NA      NA       NA    
## X55                NA         NA      NA       NA    
## X56                NA         NA      NA       NA    
## X57                NA         NA      NA       NA    
## X58                NA         NA      NA       NA    
## X59                NA         NA      NA       NA    
## X60                NA         NA      NA       NA    
## X61                NA         NA      NA       NA    
## X62                NA         NA      NA       NA    
## X63                NA         NA      NA       NA    
## X64                NA         NA      NA       NA    
## X65                NA         NA      NA       NA    
## X66                NA         NA      NA       NA    
## X67                NA         NA      NA       NA    
## X68                NA         NA      NA       NA    
## X69                NA         NA      NA       NA    
## X70                NA         NA      NA       NA    
## X71                NA         NA      NA       NA    
## X72                NA         NA      NA       NA    
## X73                NA         NA      NA       NA    
## X74                NA         NA      NA       NA    
## X75                NA         NA      NA       NA    
## X76                NA         NA      NA       NA    
## X78                NA         NA      NA       NA    
## X79                NA         NA      NA       NA    
## X80                NA         NA      NA       NA    
## X86         -59.17380   25.51009  -2.320   0.0263 *  
## X87          68.74019   26.99965   2.546   0.0155 *  
## X88          35.53315   27.59269   1.288   0.2063    
## X93          -0.06263    1.18594  -0.053   0.9582    
## X94           1.89226    6.09053   0.311   0.7579    
## X96           8.74994    8.18956   1.068   0.2926    
## X97                NA         NA      NA       NA    
## X98           6.21666   20.97024   0.296   0.7686    
## X99           2.81161    5.59384   0.503   0.6184    
## X101               NA         NA      NA       NA    
## X102        -24.15461   15.71553  -1.537   0.1333    
## X103         -0.77736   12.73568  -0.061   0.9517    
## X108               NA         NA      NA       NA    
## X111         11.96785   33.84953   0.354   0.7258    
## X118         -0.36445    2.97826  -0.122   0.9033    
## X121        -20.46101   32.80013  -0.624   0.5368    
## X125         10.44544   11.92529   0.876   0.3871    
## X126         -0.76771    4.63840  -0.166   0.8695    
## X127               NA         NA      NA       NA    
## X129               NA         NA      NA       NA    
## X130               NA         NA      NA       NA    
## X133               NA         NA      NA       NA    
## X138               NA         NA      NA       NA    
## X141          0.75569    1.26242   0.599   0.5533    
## X142               NA         NA      NA       NA    
## X143        -21.43629   19.37998  -1.106   0.2762    
## X146        -24.64831   16.95287  -1.454   0.1549    
## X150         -1.21120   24.15616  -0.050   0.9603    
## X152               NA         NA      NA       NA    
## X153               NA         NA      NA       NA    
## X154               NA         NA      NA       NA    
## X156         10.33382    7.76888   1.330   0.1921    
## X157         12.88424   15.14073   0.851   0.4006    
## X158          3.47863   12.71677   0.274   0.7860    
## X159         -2.53599    7.56262  -0.335   0.7394    
## X162               NA         NA      NA       NA    
## X163               NA         NA      NA       NA    
## X167               NA         NA      NA       NA    
## X168               NA         NA      NA       NA    
## X169               NA         NA      NA       NA    
## X170               NA         NA      NA       NA    
## X171               NA         NA      NA       NA    
## X172               NA         NA      NA       NA    
## X173               NA         NA      NA       NA    
## X174               NA         NA      NA       NA    
## X175               NA         NA      NA       NA    
## X176               NA         NA      NA       NA    
## X177               NA         NA      NA       NA    
## X178               NA         NA      NA       NA    
## X179               NA         NA      NA       NA    
## X180               NA         NA      NA       NA    
## X181               NA         NA      NA       NA    
## X182         17.45184   21.63591   0.807   0.4253    
## X183               NA         NA      NA       NA    
## X184               NA         NA      NA       NA    
## X185               NA         NA      NA       NA    
## X186               NA         NA      NA       NA    
## X187               NA         NA      NA       NA    
## X188               NA         NA      NA       NA    
## X189               NA         NA      NA       NA    
## X190               NA         NA      NA       NA    
## X191               NA         NA      NA       NA    
## X192               NA         NA      NA       NA    
## X193               NA         NA      NA       NA    
## X194               NA         NA      NA       NA    
## X195               NA         NA      NA       NA    
## X196               NA         NA      NA       NA    
## X197               NA         NA      NA       NA    
## X198               NA         NA      NA       NA    
## X199               NA         NA      NA       NA    
## X200               NA         NA      NA       NA    
## X201               NA         NA      NA       NA    
## X202               NA         NA      NA       NA    
## X203               NA         NA      NA       NA    
## X204               NA         NA      NA       NA    
## X205               NA         NA      NA       NA    
## X206               NA         NA      NA       NA    
## X207               NA         NA      NA       NA    
## X208               NA         NA      NA       NA    
## X209               NA         NA      NA       NA    
## X210               NA         NA      NA       NA    
## X211               NA         NA      NA       NA    
## X212               NA         NA      NA       NA    
## X213               NA         NA      NA       NA    
## X214               NA         NA      NA       NA    
## X215               NA         NA      NA       NA    
## X221          0.60194    5.21080   0.116   0.9087    
## X223               NA         NA      NA       NA    
## X224               NA         NA      NA       NA    
## X225         17.55713    9.33078   1.882   0.0682 .  
## X226         -7.11967    5.17438  -1.376   0.1776    
## X227               NA         NA      NA       NA    
## X228               NA         NA      NA       NA    
## X229         27.72218   21.26151   1.304   0.2008    
## X230        -37.38979   19.63822  -1.904   0.0652 .  
## X231         -8.30609    5.67171  -1.464   0.1520    
## X232               NA         NA      NA       NA    
## X233               NA         NA      NA       NA    
## X234               NA         NA      NA       NA    
## X235         -9.43213    9.79838  -0.963   0.3423    
## X236               NA         NA      NA       NA    
## X237        -14.98088   30.13161  -0.497   0.6222    
## X238          9.60810   16.31921   0.589   0.5598    
## X239               NA         NA      NA       NA    
## X240               NA         NA      NA       NA    
## X241         63.24454   29.37072   2.153   0.0383 *  
## X242        -62.60554   38.20380  -1.639   0.1102    
## X244               NA         NA      NA       NA    
## X245               NA         NA      NA       NA    
## X246               NA         NA      NA       NA    
## X247         34.25861   19.06973   1.796   0.0811 .  
## X248        -32.43997   19.73289  -1.644   0.1091    
## X249               NA         NA      NA       NA    
## X250               NA         NA      NA       NA    
## X251        -34.63868   40.93272  -0.846   0.4032    
## X253               NA         NA      NA       NA    
## X254               NA         NA      NA       NA    
## X255               NA         NA      NA       NA    
## X256               NA         NA      NA       NA    
## X257         27.13300   25.00014   1.085   0.2852    
## X258        -12.89598    7.94433  -1.623   0.1135    
## X260        -41.18735   27.66173  -1.489   0.1455    
## X261               NA         NA      NA       NA    
## X262        -30.25000   17.58327  -1.720   0.0942 .  
## X263               NA         NA      NA       NA    
## X264               NA         NA      NA       NA    
## X265               NA         NA      NA       NA    
## X266               NA         NA      NA       NA    
## X267               NA         NA      NA       NA    
## X268               NA         NA      NA       NA    
## X269               NA         NA      NA       NA    
## X270        -11.71451   20.14019  -0.582   0.5645    
## X271         21.42322   14.64611   1.463   0.1525    
## X272         23.61026   23.11685   1.021   0.3141    
## X274               NA         NA      NA       NA    
## X276               NA         NA      NA       NA    
## X278         -0.33940    1.27367  -0.266   0.7914    
## X279               NA         NA      NA       NA    
## X280          0.57010    2.40131   0.237   0.8137    
## X281               NA         NA      NA       NA    
## X284               NA         NA      NA       NA    
## X285               NA         NA      NA       NA    
## X286               NA         NA      NA       NA    
## X290               NA         NA      NA       NA    
## X291               NA         NA      NA       NA    
## X293          4.77611    3.38935   1.409   0.1676    
## X294          0.04874    7.36279   0.007   0.9948    
## X295         -5.47944    5.78033  -0.948   0.3497    
## X296               NA         NA      NA       NA    
## X297         12.93794    6.18444   2.092   0.0438 *  
## X298               NA         NA      NA       NA    
## X299               NA         NA      NA       NA    
## X300               NA         NA      NA       NA    
## X301               NA         NA      NA       NA    
## X302               NA         NA      NA       NA    
## X303               NA         NA      NA       NA    
## X304               NA         NA      NA       NA    
## X305               NA         NA      NA       NA    
## X306         -0.62784    3.04688  -0.206   0.8379    
## X307               NA         NA      NA       NA    
## X308               NA         NA      NA       NA    
## X309               NA         NA      NA       NA    
## X310               NA         NA      NA       NA    
## X311         10.32867   21.07024   0.490   0.6271    
## X312        -11.42604   20.90012  -0.547   0.5881    
## X313         -6.70152   11.76107  -0.570   0.5724    
## X314               NA         NA      NA       NA    
## X315         -0.37020    3.37318  -0.110   0.9132    
## X316          0.83611    4.69428   0.178   0.8597    
## X317               NA         NA      NA       NA    
## X318               NA         NA      NA       NA    
## X319         39.17942   21.22046   1.846   0.0733 .  
## X320         -9.42523   14.90688  -0.632   0.5313    
## X321               NA         NA      NA       NA    
## X322               NA         NA      NA       NA    
## X323               NA         NA      NA       NA    
## X324               NA         NA      NA       NA    
## X325               NA         NA      NA       NA    
## X326               NA         NA      NA       NA    
## X327               NA         NA      NA       NA    
## X328               NA         NA      NA       NA    
## X329          9.21901   12.28354   0.751   0.4580    
## X330               NA         NA      NA       NA    
## X331               NA         NA      NA       NA    
## X332               NA         NA      NA       NA    
## X333               NA         NA      NA       NA    
## X334         -8.52114   10.21223  -0.834   0.4097    
## X335               NA         NA      NA       NA    
## X336               NA         NA      NA       NA    
## X337        -14.90089    7.22491  -2.062   0.0467 *  
## X338         13.20154    7.24298   1.823   0.0769 .  
## X339               NA         NA      NA       NA    
## X340          0.63232   10.25530   0.062   0.9512    
## X341               NA         NA      NA       NA    
## X342          9.08702   17.00156   0.534   0.5964    
## X343               NA         NA      NA       NA    
## X344               NA         NA      NA       NA    
## X345         -1.23216    1.32326  -0.931   0.3582    
## X355               NA         NA      NA       NA    
## X356               NA         NA      NA       NA    
## X357         -8.88744   14.83078  -0.599   0.5529    
## X358        -11.48474    4.55849  -2.519   0.0165 *  
## X359         -6.17840    7.06317  -0.875   0.3877    
## X360               NA         NA      NA       NA    
## X361          0.43850    3.11934   0.141   0.8890    
## X362               NA         NA      NA       NA    
## X366               NA         NA      NA       NA    
## X367               NA         NA      NA       NA    
## X368               NA         NA      NA       NA    
## X370         18.84214   14.96102   1.259   0.2162    
## X371        -11.36639   14.02921  -0.810   0.4233    
## X372               NA         NA      NA       NA    
## X373               NA         NA      NA       NA    
## X374         -3.56972    2.38222  -1.498   0.1430    
## X376         -1.96140   10.25448  -0.191   0.8494    
## X377               NA         NA      NA       NA    
## X378               NA         NA      NA       NA    
## X380               NA         NA      NA       NA    
## X381               NA         NA      NA       NA    
## X382               NA         NA      NA       NA    
## X383               NA         NA      NA       NA    
## X385               NA         NA      NA       NA    
## X386               NA         NA      NA       NA    
## X387               NA         NA      NA       NA    
## X388               NA         NA      NA       NA    
## X389               NA         NA      NA       NA    
## X390               NA         NA      NA       NA    
## X392               NA         NA      NA       NA    
## X394               NA         NA      NA       NA    
## X395               NA         NA      NA       NA    
## X396               NA         NA      NA       NA    
## X398               NA         NA      NA       NA    
## X400               NA         NA      NA       NA    
## X401               NA         NA      NA       NA    
## X403               NA         NA      NA       NA    
## X406               NA         NA      NA       NA    
## X496          4.12560    3.25668   1.267   0.2136    
## X497               NA         NA      NA       NA    
## X499               NA         NA      NA       NA    
## X503         -1.04095    4.31277  -0.241   0.8107    
## X504               NA         NA      NA       NA    
## X505               NA         NA      NA       NA    
## X506               NA         NA      NA       NA    
## X507          6.55895    5.80080   1.131   0.2659    
## X508               NA         NA      NA       NA    
## X509          0.51040    7.70596   0.066   0.9476    
## X510               NA         NA      NA       NA    
## X511               NA         NA      NA       NA    
## X512               NA         NA      NA       NA    
## X514               NA         NA      NA       NA    
## X515               NA         NA      NA       NA    
## X516               NA         NA      NA       NA    
## X517               NA         NA      NA       NA    
## X518               NA         NA      NA       NA    
## X519               NA         NA      NA       NA    
## X520               NA         NA      NA       NA    
## X521               NA         NA      NA       NA    
## X522               NA         NA      NA       NA    
## X524               NA         NA      NA       NA    
## X529               NA         NA      NA       NA    
## X549               NA         NA      NA       NA    
## X551               NA         NA      NA       NA    
## X553               NA         NA      NA       NA    
## X554               NA         NA      NA       NA    
## X556               NA         NA      NA       NA    
## X557               NA         NA      NA       NA    
## X558               NA         NA      NA       NA    
## X559               NA         NA      NA       NA    
## X560               NA         NA      NA       NA    
## X561               NA         NA      NA       NA    
## X565               NA         NA      NA       NA    
## X568               NA         NA      NA       NA    
## X571               NA         NA      NA       NA    
## X573               NA         NA      NA       NA    
## X574               NA         NA      NA       NA    
## X576               NA         NA      NA       NA    
## X577               NA         NA      NA       NA    
## X590               NA         NA      NA       NA    
## X591               NA         NA      NA       NA    
## X592               NA         NA      NA       NA    
## X593               NA         NA      NA       NA    
## X594               NA         NA      NA       NA    
## X595               NA         NA      NA       NA    
## X597               NA         NA      NA       NA    
## X598               NA         NA      NA       NA    
## X599               NA         NA      NA       NA    
## X600               NA         NA      NA       NA    
## X601               NA         NA      NA       NA    
## X602               NA         NA      NA       NA    
## X603               NA         NA      NA       NA    
## X604               NA         NA      NA       NA    
## X613               NA         NA      NA       NA    
## X621               NA         NA      NA       NA    
## X679               NA         NA      NA       NA    
## X698          1.31932    2.04511   0.645   0.5231    
## X699               NA         NA      NA       NA    
## X700               NA         NA      NA       NA    
## X701               NA         NA      NA       NA    
## X702               NA         NA      NA       NA    
## X703               NA         NA      NA       NA    
## X704         -1.50108    3.16413  -0.474   0.6382    
## X705               NA         NA      NA       NA    
## X719               NA         NA      NA       NA    
## X732         -0.18368    1.17882  -0.156   0.8771    
## X733               NA         NA      NA       NA    
## X750          0.52797    2.15613   0.245   0.8080    
## X751               NA         NA      NA       NA    
## X752               NA         NA      NA       NA    
## X753               NA         NA      NA       NA    
## X754               NA         NA      NA       NA    
## X755               NA         NA      NA       NA    
## X773               NA         NA      NA       NA    
## X774               NA         NA      NA       NA    
## X775               NA         NA      NA       NA    
## X776               NA         NA      NA       NA    
## X780               NA         NA      NA       NA    
## X782               NA         NA      NA       NA    
## X792               NA         NA      NA       NA    
## X793               NA         NA      NA       NA    
## X795               NA         NA      NA       NA    
## X798               NA         NA      NA       NA    
## X800               NA         NA      NA       NA    
## X801               NA         NA      NA       NA    
## X805               NA         NA      NA       NA    
## X806               NA         NA      NA       NA    
## X812               NA         NA      NA       NA    
## X813               NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.346 on 35 degrees of freedom
## Multiple R-squared:  0.9452, Adjusted R-squared:  0.7948 
## F-statistic: 6.287 on 96 and 35 DF,  p-value: 1.856e-08
lm_predict <- predict(lm_perm, X_test)
# Test-set performance of the linear model (observed vs. predicted)
lm.eval <- data.frame(obs = Y_test$permeability, pred = lm_predict)
defaultSummary(lm.eval)
##        RMSE    Rsquared         MAE 
## 22.25724007  0.09042036 12.33126817
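
The many NA rows in the coefficient table above are what lm() reports when the design matrix is rank deficient: the F-statistic line (96 and 35 DF) shows that only 96 slopes could be estimated. The following is a minimal sketch on simulated data, not the permeability data; the names n, p, X, y, dat and fit are made up for illustration. It shows the same behaviour when there are more predictors than observations.

```r
# Simulated data: more binary predictors (p = 20) than observations (n = 10),
# so the model matrix cannot have full column rank.
set.seed(1)
n <- 10
p <- 20
X <- as.data.frame(matrix(rbinom(n * p, 1, 0.5), nrow = n))
y <- rnorm(n)
dat <- data.frame(y = y, X)

fit <- lm(y ~ ., data = dat)

# Coefficients that lm() could not estimate are returned as NA
n_aliased   <- sum(is.na(coef(fit)))
n_estimated <- sum(!is.na(coef(fit)))

# The number of estimated coefficients equals the rank of the model matrix
c(estimated = n_estimated, aliased = n_aliased, rank = fit$rank)
```

A model that uses up nearly all available degrees of freedom in this way can fit the training data closely and still predict poorly, which is consistent with the gap between the training R-squared (0.9452) and the test R-squared (0.0904) reported above.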

Ridge Regression:

library(elasticnet)
## Loading required package: lars
## Loaded lars 1.2
# Grid of 15 candidate lambda (penalty) values between 0 and 0.1 for tuning
ridgeGrid <- data.frame(.lambda = seq(0, .1, length=15))

ridge_perm <- train(X_train, Y_train, 
                 method = "ridge",
                 trControl = ctrl,
                 tuneGrid = ridgeGrid,
                 preProc = c("center", "scale"))

summary(ridge_perm)
##             Length Class      Mode     
## call            4  -none-     call     
## actions       184  -none-     list     
## allset        388  -none-     numeric  
## beta.pure   71392  -none-     numeric  
## vn            388  -none-     character
## mu              1  -none-     numeric  
## normx         388  -none-     numeric  
## meanx         388  -none-     numeric  
## lambda          1  -none-     numeric  
## L1norm        184  -none-     numeric  
## penalty       184  -none-     numeric  
## df            184  -none-     numeric  
## Cp            184  -none-     numeric  
## sigma2          1  -none-     numeric  
## xNames        388  -none-     character
## problemType     1  -none-     character
## tuneValue       1  data.frame list     
## obsLevels       1  -none-     logical  
## param           0  -none-     list
ridge_predict <- predict(ridge_perm, X_test, s = 1, mode = "fraction")
# Test-set performance of the ridge model (observed vs. predicted)
ridge.eval <- data.frame(obs = Y_test$permeability, pred = ridge_predict)
defaultSummary(ridge.eval)
##       RMSE   Rsquared        MAE 
## 12.9334035  0.1523053  9.0754735
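
To illustrate why ridge regression remains usable where ordinary least squares breaks down, here is a minimal base R sketch of the textbook ridge estimate on simulated data. It is not the elasticnet implementation used above, and that package may scale its lambda differently, so the penalty values here are not comparable to ridgeGrid. The helper ridge_coef and all data are hypothetical.

```r
# Simulated data with more predictors than observations (p > n)
set.seed(2)
n <- 10
p <- 20
X <- scale(matrix(rnorm(n * p), nrow = n))  # centered and scaled predictors
y <- rnorm(n)
y <- y - mean(y)                            # centered response, no intercept needed

# Hypothetical helper: textbook ridge solution of (X'X + lambda * I) beta = X'y
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

beta_small <- ridge_coef(X, y, lambda = 0.1)
beta_large <- ridge_coef(X, y, lambda = 10)

# A larger penalty shrinks the coefficient vector toward zero
norm_small <- sqrt(sum(beta_small^2))
norm_large <- sqrt(sum(beta_large^2))
c(small_penalty = norm_small, large_penalty = norm_large)
```

Adding lambda to the diagonal makes the system solvable even though X'X itself is singular when p > n, so every predictor receives a finite, shrunken coefficient instead of NA.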

Lasso / Elastic Net Regression:

enetGrid <- expand.grid(.lambda = c(0, 0.01, .1),
                        .fraction = seq(.05, 1, length = 20))

enetTune <- train(X_train, Y_train,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"))
## Warning: model fit failed for Fold05: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
summary(enetTune)
##             Length Class      Mode     
## call            4  -none-     call     
## actions       184  -none-     list     
## allset        388  -none-     numeric  
## beta.pure   71392  -none-     numeric  
## vn            388  -none-     character
## mu              1  -none-     numeric  
## normx         388  -none-     numeric  
## meanx         388  -none-     numeric  
## lambda          1  -none-     numeric  
## L1norm        184  -none-     numeric  
## penalty       184  -none-     numeric  
## df            184  -none-     numeric  
## Cp            184  -none-     numeric  
## sigma2          1  -none-     numeric  
## xNames        388  -none-     character
## problemType     1  -none-     character
## tuneValue       2  data.frame list     
## obsLevels       1  -none-     logical  
## param           0  -none-     list
lasso_predict <- predict(enetTune, X_test)
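# Y_test is a data frame here, so set the column names explicitly to the
# obs/pred names that defaultSummary() expects
enet.eval <- data.frame(obs = Y_test, pred = lasso_predict)
colnames(enet.eval) <- c("obs", "pred")
defaultSummary(enet.eval)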
##       RMSE   Rsquared        MAE 
## 12.2386446  0.1136488  8.5947579

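We can now compare the models on their test-set RMSE and R-squared. PLS performed the best. Of the three models fitted in this section, ridge has the highest R-squared (0.152, versus 0.114 for the lasso/elastic net model and 0.090 for linear regression), while the lasso/elastic net model has a slightly lower RMSE than ridge (12.24 versus 12.93). Ordinary linear regression is clearly the worst, with an RMSE of 22.26.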
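
As a convenience, the test-set metrics printed above for the three models in this section can be collected into one table and sorted. This is only a sketch that copies the reported numbers by hand; the data frame name results is made up, and the PLS figures from the earlier section can be added as a fourth row.

```r
# Test-set metrics copied from the defaultSummary() outputs above
results <- data.frame(
  model    = c("Linear regression", "Ridge", "Lasso / elastic net"),
  RMSE     = c(22.25724007, 12.9334035, 12.2386446),
  Rsquared = c(0.09042036, 0.1523053, 0.1136488),
  MAE      = c(12.33126817, 9.0754735, 8.5947579)
)

# Sort from lowest to highest test RMSE
results_sorted <- results[order(results$RMSE), ]
results_sorted
```

Sorting by RMSE and by R-squared gives different orderings for ridge and the lasso/elastic net model, which is why both metrics are worth reporting side by side.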

F. Would you recommend any of your models to replace the permeability laboratory experiment?

The only model I would consider recommending is the PLS model, since it has the best R-squared and RMSE. However, if its R-squared is still too low, or if its predictions are not accurate enough for the intended use, then none of these models should replace the laboratory experiment.

6.3 A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

A. Start R and use these commands to load the data.

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

According to ?ChemicalManufacturingProcess, this data set contains information about a chemical manufacturing process, in which the goal is to understand the relationship between the process and the resulting final product yield. Raw material in this process is put through a sequence of 27 steps to generate the final pharmaceutical product. The starting material is generated from a biological unit and has a range of quality and characteristics. The objective in this project was to develop a model to predict percent yield of the manufacturing process.

Of the 57 characteristics, there were 12 measurements of the biological starting material, and 45 measurements of the manufacturing process. The process variables included measurements such as temperature, drying time, washing time, and concentrations of by-products at various steps. Some of the process measurements can be controlled, while others are observed. Predictors are continuous, count, categorical; some are correlated, and some contain missing values. Samples are not independent because sets of samples come from the same batch of biological starting material.

Let’s take a look at the column names, structure, and first rows of the dataset.

colnames(ChemicalManufacturingProcess)
##  [1] "Yield"                  "BiologicalMaterial01"  
##  [3] "BiologicalMaterial02"   "BiologicalMaterial03"  
##  [5] "BiologicalMaterial04"   "BiologicalMaterial05"  
##  [7] "BiologicalMaterial06"   "BiologicalMaterial07"  
##  [9] "BiologicalMaterial08"   "BiologicalMaterial09"  
## [11] "BiologicalMaterial10"   "BiologicalMaterial11"  
## [13] "BiologicalMaterial12"   "ManufacturingProcess01"
## [15] "ManufacturingProcess02" "ManufacturingProcess03"
## [17] "ManufacturingProcess04" "ManufacturingProcess05"
## [19] "ManufacturingProcess06" "ManufacturingProcess07"
## [21] "ManufacturingProcess08" "ManufacturingProcess09"
## [23] "ManufacturingProcess10" "ManufacturingProcess11"
## [25] "ManufacturingProcess12" "ManufacturingProcess13"
## [27] "ManufacturingProcess14" "ManufacturingProcess15"
## [29] "ManufacturingProcess16" "ManufacturingProcess17"
## [31] "ManufacturingProcess18" "ManufacturingProcess19"
## [33] "ManufacturingProcess20" "ManufacturingProcess21"
## [35] "ManufacturingProcess22" "ManufacturingProcess23"
## [37] "ManufacturingProcess24" "ManufacturingProcess25"
## [39] "ManufacturingProcess26" "ManufacturingProcess27"
## [41] "ManufacturingProcess28" "ManufacturingProcess29"
## [43] "ManufacturingProcess30" "ManufacturingProcess31"
## [45] "ManufacturingProcess32" "ManufacturingProcess33"
## [47] "ManufacturingProcess34" "ManufacturingProcess35"
## [49] "ManufacturingProcess36" "ManufacturingProcess37"
## [51] "ManufacturingProcess38" "ManufacturingProcess39"
## [53] "ManufacturingProcess40" "ManufacturingProcess41"
## [55] "ManufacturingProcess42" "ManufacturingProcess43"
## [57] "ManufacturingProcess44" "ManufacturingProcess45"
str(ChemicalManufacturingProcess)
## 'data.frame':    176 obs. of  58 variables:
##  $ Yield                 : num  38 42.4 42 41.4 42.5 ...
##  $ BiologicalMaterial01  : num  6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
##  $ BiologicalMaterial02  : num  49.6 61 61 61 63.3 ...
##  $ BiologicalMaterial03  : num  57 67.5 67.5 67.5 72.2 ...
##  $ BiologicalMaterial04  : num  12.7 14.7 14.7 14.7 14 ...
##  $ BiologicalMaterial05  : num  19.5 19.4 19.4 19.4 17.9 ...
##  $ BiologicalMaterial06  : num  43.7 53.1 53.1 53.1 54.7 ...
##  $ BiologicalMaterial07  : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ BiologicalMaterial08  : num  16.7 19 19 19 18.2 ...
##  $ BiologicalMaterial09  : num  11.4 12.6 12.6 12.6 12.8 ...
##  $ BiologicalMaterial10  : num  3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
##  $ BiologicalMaterial11  : num  138 154 154 154 148 ...
##  $ BiologicalMaterial12  : num  18.8 21.1 21.1 21.1 21.1 ...
##  $ ManufacturingProcess01: num  NA 0 0 0 10.7 12 11.5 12 12 12 ...
##  $ ManufacturingProcess02: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess03: num  NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
##  $ ManufacturingProcess04: num  NA 917 912 911 918 924 933 929 928 938 ...
##  $ ManufacturingProcess05: num  NA 1032 1004 1015 1028 ...
##  $ ManufacturingProcess06: num  NA 210 207 213 206 ...
##  $ ManufacturingProcess07: num  NA 177 178 177 178 178 177 178 177 177 ...
##  $ ManufacturingProcess08: num  NA 178 178 177 178 178 178 178 177 177 ...
##  $ ManufacturingProcess09: num  43 46.6 45.1 44.9 45 ...
##  $ ManufacturingProcess10: num  NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
##  $ ManufacturingProcess11: num  NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
##  $ ManufacturingProcess12: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess13: num  35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
##  $ ManufacturingProcess14: num  4898 4869 4878 4897 4992 ...
##  $ ManufacturingProcess15: num  6108 6095 6087 6102 6233 ...
##  $ ManufacturingProcess16: num  4682 4617 4617 4635 4733 ...
##  $ ManufacturingProcess17: num  35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
##  $ ManufacturingProcess18: num  4865 4867 4877 4872 4886 ...
##  $ ManufacturingProcess19: num  6049 6097 6078 6073 6102 ...
##  $ ManufacturingProcess20: num  4665 4621 4621 4611 4659 ...
##  $ ManufacturingProcess21: num  0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
##  $ ManufacturingProcess22: num  NA 3 4 5 8 9 1 2 3 4 ...
##  $ ManufacturingProcess23: num  NA 0 1 2 4 1 1 2 3 1 ...
##  $ ManufacturingProcess24: num  NA 3 4 5 18 1 1 2 3 4 ...
##  $ ManufacturingProcess25: num  4873 4869 4897 4892 4930 ...
##  $ ManufacturingProcess26: num  6074 6107 6116 6111 6151 ...
##  $ ManufacturingProcess27: num  4685 4630 4637 4630 4684 ...
##  $ ManufacturingProcess28: num  10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
##  $ ManufacturingProcess29: num  21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
##  $ ManufacturingProcess30: num  9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
##  $ ManufacturingProcess31: num  69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
##  $ ManufacturingProcess32: num  156 169 173 171 171 173 159 161 160 164 ...
##  $ ManufacturingProcess33: num  66 66 66 68 70 70 65 65 65 66 ...
##  $ ManufacturingProcess34: num  2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ ManufacturingProcess35: num  486 508 509 496 468 490 475 478 491 488 ...
##  $ ManufacturingProcess36: num  0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
##  $ ManufacturingProcess37: num  0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
##  $ ManufacturingProcess38: num  3 2 2 2 2 2 2 2 3 3 ...
##  $ ManufacturingProcess39: num  7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
##  $ ManufacturingProcess40: num  NA 0.1 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess41: num  NA 0.15 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess42: num  11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
##  $ ManufacturingProcess43: num  3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
##  $ ManufacturingProcess44: num  1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
##  $ ManufacturingProcess45: num  2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...
head(ChemicalManufacturingProcess)
##   Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00                 6.25                49.58                56.97
## 2 42.44                 8.01                60.97                67.48
## 3 42.03                 8.01                60.97                67.48
## 4 41.42                 8.01                60.97                67.48
## 5 42.49                 7.47                63.33                72.25
## 6 43.57                 6.12                58.36                65.31
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1                12.74                19.51                43.73
## 2                14.65                19.36                53.14
## 3                14.65                19.36                53.14
## 4                14.65                19.36                53.14
## 5                14.02                17.91                54.66
## 6                15.17                21.79                51.23
##   BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1                  100                16.66                11.44
## 2                  100                19.04                12.55
## 3                  100                19.04                12.55
## 4                  100                19.04                12.55
## 5                  100                18.22                12.80
## 6                  100                18.30                12.13
##   BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1                 3.46               138.09                18.83
## 2                 3.46               153.67                21.05
## 3                 3.46               153.67                21.05
## 4                 3.46               153.67                21.05
## 5                 3.05               147.61                21.05
## 6                 3.78               151.88                20.76
##   ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1                     NA                     NA                     NA
## 2                    0.0                      0                     NA
## 3                    0.0                      0                     NA
## 4                    0.0                      0                     NA
## 5                   10.7                      0                     NA
## 6                   12.0                      0                     NA
##   ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1                     NA                     NA                     NA
## 2                    917                 1032.2                  210.0
## 3                    912                 1003.6                  207.1
## 4                    911                 1014.6                  213.3
## 5                    918                 1027.5                  205.7
## 6                    924                 1016.8                  208.9
##   ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1                     NA                     NA                  43.00
## 2                    177                    178                  46.57
## 3                    178                    178                  45.07
## 4                    177                    177                  44.92
## 5                    178                    178                  44.96
## 6                    178                    178                  45.32
##   ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1                     NA                     NA                     NA
## 2                     NA                     NA                      0
## 3                     NA                     NA                      0
## 4                     NA                     NA                      0
## 5                     NA                     NA                      0
## 6                     NA                     NA                      0
##   ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1                   35.5                   4898                   6108
## 2                   34.0                   4869                   6095
## 3                   34.8                   4878                   6087
## 4                   34.8                   4897                   6102
## 5                   34.6                   4992                   6233
## 6                   34.0                   4985                   6222
##   ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1                   4682                   35.5                   4865
## 2                   4617                   34.0                   4867
## 3                   4617                   34.8                   4877
## 4                   4635                   34.8                   4872
## 5                   4733                   33.9                   4886
## 6                   4786                   33.4                   4862
##   ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1                   6049                   4665                    0.0
## 2                   6097                   4621                    0.0
## 3                   6078                   4621                    0.0
## 4                   6073                   4611                    0.0
## 5                   6102                   4659                   -0.7
## 6                   6115                   4696                   -0.6
##   ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1                     NA                     NA                     NA
## 2                      3                      0                      3
## 3                      4                      1                      4
## 4                      5                      2                      5
## 5                      8                      4                     18
## 6                      9                      1                      1
##   ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1                   4873                   6074                   4685
## 2                   4869                   6107                   4630
## 3                   4897                   6116                   4637
## 4                   4892                   6111                   4630
## 5                   4930                   6151                   4684
## 6                   4871                   6128                   4687
##   ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1                   10.7                   21.0                    9.9
## 2                   11.2                   21.4                    9.9
## 3                   11.1                   21.3                    9.4
## 4                   11.1                   21.3                    9.4
## 5                   11.3                   21.6                    9.0
## 6                   11.4                   21.7                   10.1
##   ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1                   69.1                    156                     66
## 2                   68.7                    169                     66
## 3                   69.3                    173                     66
## 4                   69.3                    171                     68
## 5                   69.4                    171                     70
## 6                   68.2                    173                     70
##   ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1                    2.4                    486                  0.019
## 2                    2.6                    508                  0.019
## 3                    2.6                    509                  0.018
## 4                    2.5                    496                  0.018
## 5                    2.5                    468                  0.017
## 6                    2.5                    490                  0.018
##   ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1                    0.5                      3                    7.2
## 2                    2.0                      2                    7.2
## 3                    0.7                      2                    7.2
## 4                    1.2                      2                    7.2
## 5                    0.2                      2                    7.3
## 6                    0.4                      2                    7.2
##   ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1                     NA                     NA                   11.6
## 2                    0.1                   0.15                   11.1
## 3                    0.0                   0.00                   12.0
## 4                    0.0                   0.00                   10.6
## 5                    0.0                   0.00                   11.0
## 6                    0.0                   0.00                   11.5
##   ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1                    3.0                    1.8                    2.4
## 2                    0.9                    1.9                    2.2
## 3                    1.0                    1.8                    2.3
## 4                    1.1                    1.8                    2.1
## 5                    1.1                    1.7                    2.1
## 6                    2.2                    1.8                    2.0

B. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sec 3.8).

What is missing from this dataset?

is_na <- sort(colSums(is.na(ChemicalManufacturingProcess)))

is_na[is_na > 0]
## ManufacturingProcess01 ManufacturingProcess04 ManufacturingProcess05 
##                      1                      1                      1 
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess12 
##                      1                      1                      1 
## ManufacturingProcess14 ManufacturingProcess22 ManufacturingProcess23 
##                      1                      1                      1 
## ManufacturingProcess24 ManufacturingProcess40 ManufacturingProcess41 
##                      1                      1                      1 
## ManufacturingProcess06 ManufacturingProcess02 ManufacturingProcess25 
##                      2                      3                      5 
## ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28 
##                      5                      5                      5 
## ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31 
##                      5                      5                      5 
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 
##                      5                      5                      5 
## ManufacturingProcess36 ManufacturingProcess10 ManufacturingProcess11 
##                      5                      9                     10 
## ManufacturingProcess03 
##                     15
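
As a quick check on the "small percentage" claim, the share of missing cells can be computed directly. The sketch below uses a tiny made-up data frame (toy and its values are hypothetical, standing in for ChemicalManufacturingProcess) so that it runs on its own.

```r
# Hypothetical stand-in for ChemicalManufacturingProcess
toy <- data.frame(
  Yield = c(38.0, 42.4, 42.0, 41.4),
  P1    = c(NA, 0.0, 0.0, 10.7),
  P2    = c(21.0, NA, NA, 21.6)
)

# Missing values per column, keeping only columns with at least one NA
na_per_col <- colSums(is.na(toy))
na_per_col <- sort(na_per_col[na_per_col > 0])

# Percentage of all cells that are missing
pct_missing <- 100 * sum(is.na(toy)) / (nrow(toy) * ncol(toy))
print(na_per_col)
print(pct_missing)
```

Replacing toy with the real data frame would give the corresponding percentage for this problem.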

Because the missing values are spread across 28 of the predictor columns, we impute them with the k-nearest-neighbors (KNN) algorithm, using 3 neighbors.

# Reference: https://www.r-bloggers.com/missing-value-treatment/
library(DMwR)
## Loading required package: grid
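# Note: columns 1:57 are Yield plus the first 56 predictors, so
# ManufacturingProcess45 (column 58) is left out of the imputed data set.
knn_output_df <- knnImputation(ChemicalManufacturingProcess[, 1:57], k = 3, meth = "weighAvg")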
anyNA(knn_output_df)
## [1] FALSE
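
To make the idea behind weighted KNN imputation concrete, here is a minimal base R sketch for a single missing cell. It is a simplified illustration and not DMwR's implementation: the helper knn_impute_cell, the toy data, and the exp(-distance) weighting are assumptions chosen for the example.

```r
# Impute one missing cell as a distance-weighted average of its k nearest
# complete rows (hypothetical helper, for illustration only)
knn_impute_cell <- function(df, row, col, k = 3) {
  others   <- setdiff(names(df), col)
  complete <- df[complete.cases(df), , drop = FALSE]

  # Standardize the other columns so no single column dominates the distance
  centers <- colMeans(complete[, others, drop = FALSE])
  scales  <- apply(complete[, others, drop = FALSE], 2, sd)
  z_comp  <- scale(complete[, others, drop = FALSE], center = centers, scale = scales)
  z_row   <- (unlist(df[row, others]) - centers) / scales

  # Euclidean distance from the incomplete row to every complete row
  d  <- sqrt(rowSums(sweep(z_comp, 2, z_row)^2))
  nn <- order(d)[seq_len(k)]

  # Closer neighbors receive larger weights
  w <- exp(-d[nn])
  sum(w * complete[nn, col]) / sum(w)
}

toy <- data.frame(a = c(1, 2, 3, 10, 2.1),
                  b = c(10, 20, 30, 100, NA))
imputed_b <- knn_impute_cell(toy, row = 5, col = "b", k = 3)
print(imputed_b)
```

The imputed value is pulled toward the rows whose other columns are most similar, which is why the distant row (a = 10) has no influence here.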

No missing values remain after KNN imputation with 3 neighbors. Let us look at the first rows of the imputed data set.

head(knn_output_df)
##   Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00                 6.25                49.58                56.97
## 2 42.44                 8.01                60.97                67.48
## 3 42.03                 8.01                60.97                67.48
## 4 41.42                 8.01                60.97                67.48
## 5 42.49                 7.47                63.33                72.25
## 6 43.57                 6.12                58.36                65.31
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1                12.74                19.51                43.73
## 2                14.65                19.36                53.14
## 3                14.65                19.36                53.14
## 4                14.65                19.36                53.14
## 5                14.02                17.91                54.66
## 6                15.17                21.79                51.23
##   BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1                  100                16.66                11.44
## 2                  100                19.04                12.55
## 3                  100                19.04                12.55
## 4                  100                19.04                12.55
## 5                  100                18.22                12.80
## 6                  100                18.30                12.13
##   BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1                 3.46               138.09                18.83
## 2                 3.46               153.67                21.05
## 3                 3.46               153.67                21.05
## 4                 3.46               153.67                21.05
## 5                 3.05               147.61                21.05
## 6                 3.78               151.88                20.76
##   ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1               11.57415               21.43705               1.547340
## 2                0.00000                0.00000               1.553157
## 3                0.00000                0.00000               1.544921
## 4                0.00000                0.00000               1.552545
## 5               10.70000                0.00000               1.550000
## 6               12.00000                0.00000               1.550000
##   ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1               933.3301               988.3955               206.1187
## 2               917.0000              1032.2000               210.0000
## 3               912.0000              1003.6000               207.1000
## 4               911.0000              1014.6000               213.3000
## 5               918.0000              1027.5000               205.7000
## 6               924.0000              1016.8000               208.9000
##   ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1                177.266                177.266                  43.00
## 2                177.000                178.000                  46.57
## 3                178.000                178.000                  45.07
## 4                177.000                177.000                  44.92
## 5                178.000                178.000                  44.96
## 6                178.000                178.000                  45.32
##   ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1               8.963704               9.175013                      0
## 2               9.561514              10.221835                      0
## 3               9.337431               9.594266                      0
## 4               9.261935              10.158557                      0
## 5               8.905721               9.786672                      0
## 6               8.944334               9.817275                      0
##   ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1                   35.5                   4898                   6108
## 2                   34.0                   4869                   6095
## 3                   34.8                   4878                   6087
## 4                   34.8                   4897                   6102
## 5                   34.6                   4992                   6233
## 6                   34.0                   4985                   6222
##   ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1                   4682                   35.5                   4865
## 2                   4617                   34.0                   4867
## 3                   4617                   34.8                   4877
## 4                   4635                   34.8                   4872
## 5                   4733                   33.9                   4886
## 6                   4786                   33.4                   4862
##   ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1                   6049                   4665                    0.0
## 2                   6097                   4621                    0.0
## 3                   6078                   4621                    0.0
## 4                   6073                   4611                    0.0
## 5                   6102                   4659                   -0.7
## 6                   6115                   4696                   -0.6
##   ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1               6.694152               4.322566               11.57939
## 2               3.000000               0.000000                3.00000
## 3               4.000000               1.000000                4.00000
## 4               5.000000               2.000000                5.00000
## 5               8.000000               4.000000               18.00000
## 6               9.000000               1.000000                1.00000
##   ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1                   4873                   6074                   4685
## 2                   4869                   6107                   4630
## 3                   4897                   6116                   4637
## 4                   4892                   6111                   4630
## 5                   4930                   6151                   4684
## 6                   4871                   6128                   4687
##   ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1                   10.7                   21.0                    9.9
## 2                   11.2                   21.4                    9.9
## 3                   11.1                   21.3                    9.4
## 4                   11.1                   21.3                    9.4
## 5                   11.3                   21.6                    9.0
## 6                   11.4                   21.7                   10.1
##   ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1                   69.1                    156                     66
## 2                   68.7                    169                     66
## 3                   69.3                    173                     66
## 4                   69.3                    171                     68
## 5                   69.4                    171                     70
## 6                   68.2                    173                     70
##   ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1                    2.4                    486                  0.019
## 2                    2.6                    508                  0.019
## 3                    2.6                    509                  0.018
## 4                    2.5                    496                  0.018
## 5                    2.5                    468                  0.017
## 6                    2.5                    490                  0.018
##   ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1                    0.5                      3                    7.2
## 2                    2.0                      2                    7.2
## 3                    0.7                      2                    7.2
## 4                    1.2                      2                    7.2
## 5                    0.2                      2                    7.3
## 6                    0.4                      2                    7.2
##   ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1                    0.0                   0.00                   11.6
## 2                    0.1                   0.15                   11.1
## 3                    0.0                   0.00                   12.0
## 4                    0.0                   0.00                   10.6
## 5                    0.0                   0.00                   11.0
## 6                    0.0                   0.00                   11.5
##   ManufacturingProcess43 ManufacturingProcess44
## 1                    3.0                    1.8
## 2                    0.9                    1.9
## 3                    1.0                    1.8
## 4                    1.1                    1.8
## 5                    1.1                    1.7
## 6                    2.2                    1.8

C. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

Let us evaluate multiple different models.

First, remove predictors with zero or near-zero variance.

near_zero <- nearZeroVar(knn_output_df)
knn_output_df <- knn_output_df[,-near_zero]

Split the data into training and testing data sets.

# Reference: https://topepo.github.io/caret/model-training-and-tuning.html
library(caret)
inTraining <- createDataPartition(knn_output_df$Yield, p = 0.80, list=FALSE)
training <- knn_output_df[ inTraining,]
testing <- knn_output_df[-inTraining,]

X <- training[,2:(length(training))]
Y <- training$Yield

X_test <- testing[,2:(length(testing))]
Y_test <- testing$Yield
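
Because the split is random, the rows that land in the training set (and every metric computed afterwards) change from run to run unless a seed is fixed first. The base R sketch below illustrates the effect of set.seed; the seed value is arbitrary, and sample() draws a simple random split, whereas createDataPartition stratifies on the outcome.

```r
n_rows  <- 176   # 144 training rows plus 32 test rows, as in the output here
n_train <- 144

set.seed(624)    # arbitrary seed, chosen for illustration only
idx_a <- sample(seq_len(n_rows), size = n_train)

set.seed(624)    # resetting the same seed reproduces the same draw
idx_b <- sample(seq_len(n_rows), size = n_train)

same_split <- identical(idx_a, idx_b)
print(same_split)
```

Calling set.seed once before createDataPartition (and before each train call) would make the reported numbers repeatable.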

Let us set up repeated 10-fold cross-validation on the training data.

fitControl <- trainControl(## 10-fold CV
                          method = "repeatedcv",
                          number = 10,
                          ## repeated ten times
                          repeats = 10)
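
To show what this resampling scheme does, here is a minimal base R sketch of repeated 10-fold cross-validation indices. It is an illustration under simplifying assumptions (plain random fold assignment and an arbitrary seed), not caret's internal code, which also stratifies on the outcome.

```r
n       <- 144   # number of training rows
k       <- 10    # folds per repeat
repeats <- 10

set.seed(1)      # arbitrary seed for the illustration
resamples <- lapply(seq_len(repeats), function(r) {
  # Shuffle fold labels 1..k across the rows, then group row indices by fold
  fold_id <- sample(rep(seq_len(k), length.out = n))
  split(seq_len(n), fold_id)   # held-out row indices for each fold
})

# Each repeat holds out every row exactly once; in total there are
# repeats * k held-out sets, and the model is refit once per set.
n_fits <- length(resamples) * k
print(n_fits)
```

The reported RMSE, R-squared, and MAE are then averages over all of these held-out sets.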

Now let us try to fit multiple different models. (We also center and scale the data.)

Linear Regression:

lmFit1 <- train(Yield ~ ., data = training,
                method = "lm",
                trControl = fitControl,
                preProcess = c("center", "scale"))
lmFit1
## Linear Regression 
## 
## 144 samples
##  55 predictor
## 
## Pre-processing: centered (55), scaled (55) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 131, 128, 131, 129, 131, 130, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   3.755443  0.3699803  1.884209
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
lm_predict <- predict(lmFit1, X_test)
pls.eval = data.frame(obs = Y_test, pred=lm_predict)
defaultSummary(pls.eval)
##         RMSE     Rsquared          MAE 
## 35.501311024  0.004407862  7.060969737
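
For reference, the three numbers reported by defaultSummary can be reproduced by hand. The sketch below uses made-up obs and pred vectors; to the best of my understanding, caret reports R-squared as the squared correlation between observed and predicted values, which is what is computed here.

```r
# Hypothetical observed and predicted yields
obs  <- c(40, 42, 38, 41)
pred <- c(39, 43, 38, 40)

rmse <- sqrt(mean((obs - pred)^2))   # root mean squared error
rsq  <- cor(obs, pred)^2             # squared correlation
mae  <- mean(abs(obs - pred))        # mean absolute error

print(c(RMSE = rmse, Rsquared = rsq, MAE = mae))
```

Because RMSE squares the errors, a few badly predicted test rows can inflate it far more than MAE, which is one possible reading of the large gap between the two in the linear regression test results above.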

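L1 Regularization (Lasso, fit as an elastic net):

# ctrl is not defined in the code shown above. The output below reports plain
# 10-fold cross-validation, so it is assumed to be:
ctrl <- trainControl(method = "cv", number = 10)

enetGrid <- expand.grid(.lambda = c(0, 0.01, .1),
                        .fraction = seq(.05, 1, length = 20))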

enetTune <- train(X, Y,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"))

enetTune
## Elasticnet 
## 
## 144 samples
##  55 predictor
## 
## Pre-processing: centered (55), scaled (55) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 129, 132, 130, 129, 128, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.00    0.05      1.293704  0.5727579  1.0568796
##   0.00    0.10      1.184856  0.5856747  0.9585989
##   0.00    0.15      1.454452  0.5259946  1.0348906
##   0.00    0.20      1.642978  0.5069678  1.1012191
##   0.00    0.25      1.694782  0.4968470  1.1265830
##   0.00    0.30      1.699699  0.4912833  1.1438488
##   0.00    0.35      1.637276  0.4904836  1.1415942
##   0.00    0.40      1.479154  0.5005173  1.1359263
##   0.00    0.45      1.726742  0.4401298  1.2208945
##   0.00    0.50      2.049722  0.3917739  1.3190214
##   0.00    0.55      2.287806  0.3737031  1.3881701
##   0.00    0.60      2.513011  0.3601518  1.4553294
##   0.00    0.65      2.727934  0.3499126  1.5191724
##   0.00    0.70      2.895844  0.3417457  1.5696663
##   0.00    0.75      3.037842  0.3346265  1.6125323
##   0.00    0.80      3.134344  0.3300129  1.6417944
##   0.00    0.85      3.224111  0.3255748  1.6692207
##   0.00    0.90      3.314813  0.3211657  1.6974550
##   0.00    0.95      3.399163  0.3164010  1.7241917
##   0.00    1.00      3.472128  0.3118088  1.7470118
##   0.01    0.05      1.556894  0.5300258  1.2520307
##   0.01    0.10      1.323959  0.5761841  1.0772309
##   0.01    0.15      1.215661  0.5774209  1.0039198
##   0.01    0.20      1.182875  0.5858000  0.9584362
##   0.01    0.25      1.163978  0.5947054  0.9316350
##   0.01    0.30      1.320091  0.5428456  0.9909924
##   0.01    0.35      1.451525  0.5229400  1.0416603
##   0.01    0.40      1.543579  0.5164543  1.0712703
##   0.01    0.45      1.618062  0.5104602  1.0964253
##   0.01    0.50      1.655560  0.5050705  1.1131580
##   0.01    0.55      1.671184  0.5010500  1.1242356
##   0.01    0.60      1.646733  0.4999308  1.1237128
##   0.01    0.65      1.603948  0.4998720  1.1178206
##   0.01    0.70      1.550242  0.4999949  1.1111020
##   0.01    0.75      1.524891  0.4975729  1.1122286
##   0.01    0.80      1.456444  0.5009853  1.0956618
##   0.01    0.85      1.390057  0.5113694  1.0824917
##   0.01    0.90      1.340287  0.5289814  1.0669862
##   0.01    0.95      1.341712  0.5266201  1.0661031
##   0.01    1.00      1.397548  0.4991660  1.1038129
##   0.10    0.05      1.668319  0.4620199  1.3414137
##   0.10    0.10      1.506849  0.5447006  1.2124366
##   0.10    0.15      1.365241  0.5728497  1.1032467
##   0.10    0.20      1.262713  0.5784281  1.0344739
##   0.10    0.25      1.215298  0.5753146  1.0025234
##   0.10    0.30      1.197250  0.5766440  0.9784317
##   0.10    0.35      1.180563  0.5844490  0.9531289
##   0.10    0.40      1.169930  0.5900891  0.9396729
##   0.10    0.45      1.265215  0.5562936  0.9732962
##   0.10    0.50      1.382741  0.5392798  1.0094634
##   0.10    0.55      1.449038  0.5323189  1.0321538
##   0.10    0.60      1.501292  0.5289396  1.0508984
##   0.10    0.65      1.556295  0.5251250  1.0720837
##   0.10    0.70      1.569835  0.5191121  1.0869812
##   0.10    0.75      1.552184  0.5126267  1.0923810
##   0.10    0.80      1.559334  0.5032859  1.1023833
##   0.10    0.85      1.570190  0.4953804  1.1115091
##   0.10    0.90      1.571994  0.4905710  1.1167554
##   0.10    0.95      1.566373  0.4883040  1.1188133
##   0.10    1.00      1.562787  0.4863931  1.1217230
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.25 and lambda = 0.01.
lasso_predict <- predict(enetTune, X_test)
pls.eval = data.frame(obs = Y_test, pred=lasso_predict)
defaultSummary(pls.eval)
##      RMSE  Rsquared       MAE 
## 1.1671136 0.5665374 0.8481957

L2 Ridge Regression:

ridgeGrid <- data.frame(.lambda = seq(0, .1, length=15))

ridgeFit <- train(X, Y, 
                 method = "ridge",
                 trControl = ctrl,
                 tuneGrid = ridgeGrid,
                 preProc = c("center", "scale"))

ridgeFit
## Ridge Regression 
## 
## 144 samples
##  55 predictor
## 
## Pre-processing: centered (55), scaled (55) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 129, 130, 130, 131, 129, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE      Rsquared   MAE     
##   0.000000000  3.462524  0.4042364  1.728815
##   0.007142857  1.685784  0.5059081  1.194027
##   0.014285714  1.373601  0.5486381  1.074836
##   0.021428571  1.290372  0.5675075  1.048995
##   0.028571429  1.296655  0.5742562  1.043411
##   0.035714286  1.337157  0.5634232  1.065428
##   0.042857143  1.379675  0.5532834  1.081110
##   0.050000000  1.416236  0.5463701  1.092619
##   0.057142857  1.446525  0.5416899  1.101289
##   0.064285714  1.471596  0.5384086  1.108356
##   0.071428571  1.492538  0.5360243  1.113998
##   0.078571429  1.510242  0.5342394  1.118955
##   0.085714286  1.525399  0.5328709  1.123079
##   0.092857143  1.538535  0.5318016  1.126555
##   0.100000000  1.550056  0.5309537  1.129562
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.02142857.
ridge_predict <- predict(ridgeFit, X_test, s = 1, mode = "fraction")
pls.eval = data.frame(obs = Y_test, pred=ridge_predict)
defaultSummary(pls.eval)
##       RMSE   Rsquared        MAE 
## 9.38631653 0.03346543 2.37322610

Partial Least Squares:

plsTune <- train(X, Y,
                 method = "pls",
                 tuneLength = 20,
                 trControl = ctrl,
                 preProc = c("center", "scale"))

summary(plsTune)
## Data:    X dimension: 144 55 
##  Y dimension: 144 1
## Fit method: oscorespls
## Number of components considered: 3
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps
## X           19.03    29.47    36.01
## .outcome    46.07    60.68    67.60
plot(plsTune)

We will be using ncomp=3.

pls_predict <- predict(plsTune, X_test , ncomp = 3)
pls.eval = data.frame(obs = Y_test, pred=pls_predict)
defaultSummary(pls.eval)
##      RMSE  Rsquared       MAE 
## 1.2345155 0.5448403 0.9492541

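Comparing the test-set RMSE and R-squared values, the L1 Lasso model performs best here (RMSE 1.167, R-squared 0.567). It is slightly ahead of partial least squares (RMSE 1.235, R-squared 0.545) and far ahead of ridge regression (RMSE 9.386) and linear regression (RMSE 35.501).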

D. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

Using the output shown above, the resampled (cross-validated) RMSE on the training set is 1.163978 and the test-set RMSE is 1.1671136. The two are comparable, so the model performed about as well on the test set as resampling suggested.

The predicted responses are below.

lasso_predict
##        5        9       11       15       16       27       52       53 
## 42.77002 41.75710 41.52341 40.82298 41.24813 36.27978 41.47797 42.21578 
##       64       65       67       71       77       81       86       91 
## 39.40874 41.85053 41.65124 39.51028 40.31666 39.89095 40.45945 39.50926 
##       97       99      100      102      108      125      132      144 
## 39.47321 38.55110 38.31076 38.15294 35.83006 39.79271 40.25121 40.00585 
##      151      158      163      166      167      170      174      175 
## 38.74265 39.12953 40.13682 38.46058 38.03143 39.67253 42.06144 40.49185

E. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

The cross-validation output shown above selected fraction = 0.25 and lambda = 0.01. The refit below instead uses lambda = 0.0 (a pure lasso) with fraction = 0.1, a setting whose resampled RMSE (1.184856) was close to the selected one (1.163978). Let us rebuild the model with the enet function so that we can see which predictors have nonzero coefficients and therefore contribute to the model.

# Rebuild the model
library(elasticnet)  # provides enet()
lassoModel <- enet(x = as.matrix(X), y = Y,
                   lambda = 0.0, normalize = TRUE)

lassoCoef <- predict(lassoModel, newx = as.matrix(X_test),
                     s=.1, mode = "fraction", type = "coefficients")

sort(lassoCoef$coefficients)
## ManufacturingProcess36 ManufacturingProcess17 ManufacturingProcess13 
##          -2.455557e+02          -1.993376e-01          -1.370443e-01 
## ManufacturingProcess37   BiologicalMaterial01   BiologicalMaterial02 
##          -1.024329e-01           0.000000e+00           0.000000e+00 
##   BiologicalMaterial04   BiologicalMaterial06   BiologicalMaterial08 
##           0.000000e+00           0.000000e+00           0.000000e+00 
##   BiologicalMaterial09   BiologicalMaterial10   BiologicalMaterial11 
##           0.000000e+00           0.000000e+00           0.000000e+00 
##   BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02 
##           0.000000e+00           0.000000e+00           0.000000e+00 
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 
##           0.000000e+00           0.000000e+00           0.000000e+00 
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess10 
##           0.000000e+00           0.000000e+00           0.000000e+00 
## ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess14 
##           0.000000e+00           0.000000e+00           0.000000e+00 
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess18 
##           0.000000e+00           0.000000e+00           0.000000e+00 
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess22 
##           0.000000e+00           0.000000e+00           0.000000e+00 
## ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25 
##           0.000000e+00           0.000000e+00           0.000000e+00 
## ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28 
##           0.000000e+00           0.000000e+00           0.000000e+00 
## ManufacturingProcess31 ManufacturingProcess33 ManufacturingProcess35 
##           0.000000e+00           0.000000e+00           0.000000e+00 
## ManufacturingProcess38 ManufacturingProcess40 ManufacturingProcess41 
##           0.000000e+00           0.000000e+00           0.000000e+00 
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess30 
##           0.000000e+00           0.000000e+00           7.597564e-04 
##   BiologicalMaterial03 ManufacturingProcess06   BiologicalMaterial05 
##           1.401715e-02           2.461960e-02           2.964591e-02 
## ManufacturingProcess39 ManufacturingProcess44 ManufacturingProcess29 
##           3.568471e-02           5.826039e-02           1.105585e-01 
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess34 
##           1.384560e-01           2.793302e-01           1.673640e+00

Most of the coefficients are exactly 0. In other words, the lasso has dropped these variables from the model. Let us keep only the variables with nonzero coefficients.

list_coef <- lassoCoef$coefficients
sort(list_coef[list_coef != 0])
## ManufacturingProcess36 ManufacturingProcess17 ManufacturingProcess13 
##          -2.455557e+02          -1.993376e-01          -1.370443e-01 
## ManufacturingProcess37 ManufacturingProcess30   BiologicalMaterial03 
##          -1.024329e-01           7.597564e-04           1.401715e-02 
## ManufacturingProcess06   BiologicalMaterial05 ManufacturingProcess39 
##           2.461960e-02           2.964591e-02           3.568471e-02 
## ManufacturingProcess44 ManufacturingProcess29 ManufacturingProcess32 
##           5.826039e-02           1.105585e-01           1.384560e-01 
## ManufacturingProcess09 ManufacturingProcess34 
##           2.793302e-01           1.673640e+00

Of the 14 predictors with nonzero coefficients, 12 are Manufacturing Process variables and 2 are Biological Material variables. In this particular model, the manufacturing process variables dominate the biological material variables.
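
The tally of predictor types can be checked by counting the names in the nonzero-coefficient output printed above. In this sketch the names are copied from that output into a vector (nonzero is a name introduced for the example); in the session itself, names(list_coef[list_coef != 0]) would supply the same values.

```r
# Predictor names with nonzero coefficients, copied from the output above
nonzero <- c("ManufacturingProcess36", "ManufacturingProcess17",
             "ManufacturingProcess13", "ManufacturingProcess37",
             "ManufacturingProcess30", "BiologicalMaterial03",
             "ManufacturingProcess06", "BiologicalMaterial05",
             "ManufacturingProcess39", "ManufacturingProcess44",
             "ManufacturingProcess29", "ManufacturingProcess32",
             "ManufacturingProcess09", "ManufacturingProcess34")

# Strip the trailing digits to get the predictor family, then count
counts <- table(sub("[0-9]+$", "", nonzero))
print(counts)
# BiologicalMaterial: 2, ManufacturingProcess: 12
```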

F. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

The coefficients above can help guide changes to the manufacturing process. The predictors with the largest coefficients in absolute value are Manufacturing Process 34 and Manufacturing Process 36. Because yield is the quantity of interest, the recommendation for increasing it would be to increase Manufacturing Process 34 and decrease Manufacturing Process 36. The former has a positive coefficient, so yield tends to rise as it increases, whereas the latter has a negative coefficient, so yield tends to fall as it increases.
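
One caution when ranking predictors by raw coefficient size is that the coefficients are expressed per unit of each predictor, and the predictors are on very different scales. The sketch below rescales the two largest coefficients to "change in yield per one standard deviation of the predictor". It is only an illustration: it uses the six head() rows printed earlier, so the standard deviations are rough, and per_sd is a name introduced for the example.

```r
# First six values of each predictor, copied from the head() output above
x36 <- c(0.019, 0.019, 0.018, 0.018, 0.017, 0.018)  # ManufacturingProcess36
x34 <- c(2.4, 2.6, 2.6, 2.5, 2.5, 2.5)              # ManufacturingProcess34

# Raw lasso coefficients, copied from the output above
raw <- c(ManufacturingProcess36 = -245.5557,
         ManufacturingProcess34 = 1.673640)

# Approximate change in yield for a one-standard-deviation change
per_sd <- raw * c(sd(x36), sd(x34))
print(per_sd)
```

On this rough scale the two effects are of similar size, even though the raw coefficients differ by more than a factor of 100. Repeating the calculation with the full training columns, and plotting each top predictor against Yield, would give a firmer basis for process recommendations.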