6.2 Developing a model to predict permeability (see Sect 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
dim(fingerprints)
## [1]  165 1107
fp <- fingerprints[, -nearZeroVar(fingerprints)]
dim(fp)
## [1] 165 388

After running nearZeroVar, our number of predictors drops to 388 out of the original 1107.
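To make the filter less of a black box, here is a minimal base-R sketch of the two statistics that nearZeroVar is documented to use: the frequency ratio and the percentage of unique values. The toy matrix and the helper names (freq_ratio, pct_unique) are hypothetical. The thresholds are taken from caret's documented defaults (freqCut = 95/5, uniqueCut = 10). This is an illustration, not caret's actual implementation.

```r
# Hypothetical fingerprint-like matrix (not the permeability data)
set.seed(1)
n <- 165
toy <- cbind(
  common = rbinom(n, 1, 0.40),            # present in roughly 40% of compounds
  rare   = c(rep(1, 3), rep(0, n - 3)),   # present in only 3 compounds
  absent = rep(0, n)                      # never present (zero variance)
)

# Frequency ratio: count of the most common value over the second most common
freq_ratio <- function(v) {
  counts <- sort(table(v), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)     # a single unique value
  as.numeric(counts[1] / counts[2])
}

# Percentage of distinct values relative to the number of samples
pct_unique <- function(v) 100 * length(unique(v)) / length(v)

fr <- apply(toy, 2, freq_ratio)
pu <- apply(toy, 2, pct_unique)

# Flag a column when both conditions hold (assumed default thresholds)
flagged <- fr > 95 / 5 & pu < 10
keep <- toy[, !flagged, drop = FALSE]
colnames(keep)
```

Using a logical mask rather than a negative index also avoids an R pitfall: if no column is flagged, indexing with -integer(0) selects zero columns instead of keeping all of them.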

c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(100)

fp <- as.data.frame(fp)

smp <- floor(0.75 * nrow(fp))

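# NOTE: x and y are drawn independently, so the rows of fpTrain and pTrain
# below do not refer to the same compounds. Reusing one index for both
# (y <- x) would keep predictors and response aligned.
x <- sample(seq_len(nrow(fp)), size = smp)
y <- sample(seq_len(nrow(permeability)), size = smp)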

#x <- createDataPartition(fp$X1, p = .75, list = FALSE)
#y <- createDataPartition(permeability, p = .75, list = FALSE)

fpTrain <- fp[x,]
fpTest <- fp[-x,]
pTrain <- permeability[y,]
pTest <- permeability[-y,]
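For comparison, a split that keeps predictors and response aligned draws a single set of row indices and applies it to both objects. The sketch below uses hypothetical synthetic data (X, resp, and the cmpd labels are made up) so that it runs without the permeability data. It illustrates the idea and is not a rerun of the analysis above.

```r
set.seed(100)
n <- 165

# Hypothetical predictors and response sharing the same row labels
X <- matrix(rbinom(n * 5, 1, 0.3), nrow = n,
            dimnames = list(paste0("cmpd", seq_len(n)), paste0("X", 1:5)))
resp <- setNames(rnorm(n), rownames(X))

# One index vector, used for both predictors and response
idx <- sample(seq_len(n), size = floor(0.75 * n))

XTrain <- X[idx, , drop = FALSE]
XTest  <- X[-idx, , drop = FALSE]
respTrain <- resp[idx]
respTest  <- resp[-idx]

# The row labels confirm that each predictor row matches its response
identical(rownames(XTrain), names(respTrain))
```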

ctrl <- trainControl(method = "cv", number = 10)

plsTune <- train(fpTrain, pTrain,
                 method = "pls",
                 tuneLength = 20, trControl = ctrl,
                 preProc = c("center", "scale"))

plsTune
## Partial Least Squares 
## 
## 123 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 111, 110, 111, 110, 111, 111, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared    MAE     
##    1     15.68753  0.07384986  12.33870
##    2     15.11851  0.08123332  11.72096
##    3     15.41619  0.08618764  12.18908
##    4     15.22515  0.11307945  11.99908
##    5     15.67184  0.09944382  12.38331
##    6     15.48768  0.09404962  12.13557
##    7     16.03970  0.07151399  12.66405
##    8     16.56358  0.05268528  13.00251
##    9     16.66535  0.05799473  13.09828
##   10     17.37128  0.06056492  13.78941
##   11     17.92980  0.05852196  14.23970
##   12     18.83832  0.04997884  14.98638
##   13     19.34537  0.04078344  15.59109
##   14     19.45846  0.04847269  15.57522
##   15     19.80906  0.04963971  15.83474
##   16     20.07748  0.05520941  15.95051
##   17     20.44437  0.05958450  16.19336
##   18     20.64521  0.06211526  16.40822
##   19     20.79998  0.06972072  16.69736
##   20     21.33585  0.06149764  17.05324
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.

Two latent variables are optimal by RMSE, and the resampled \(R^2\) for the chosen model (ncomp = 2) is 0.08123332.

d) Predict the response for the test set. What is the test set estimate of \(R^2\)?
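The selected row can be read off programmatically rather than by eye. The sketch below copies the first six rows of the resampling table above into a data frame (the later rows all have larger RMSE) and picks the row with the smallest RMSE, which is the rule stated in the output. With a fitted caret object, the same table should be available as plsTune$results.

```r
# First six rows of the resampling table printed above
res <- data.frame(
  ncomp    = 1:6,
  RMSE     = c(15.68753, 15.11851, 15.41619, 15.22515, 15.67184, 15.48768),
  Rsquared = c(0.07384986, 0.08123332, 0.08618764, 0.11307945,
               0.09944382, 0.09404962)
)

# Row chosen by the smallest-RMSE rule
best <- res[which.min(res$RMSE), ]
best

# The row with the largest R-squared is a different one
res$ncomp[which.max(res$Rsquared)]
```

Note that the smallest RMSE and the largest \(R^2\) do not coincide here, so the reported \(R^2\) depends on which selection metric is used.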

pls.pred <- predict(plsTune, newdata = fpTest)
postResample(pred = pls.pred, obs = pTest)
##        RMSE    Rsquared         MAE 
## 18.68255753  0.01121507 14.09012196

The \(R^2\) of the test set prediction is 0.01121507.
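For reference, the three numbers reported by postResample can be reproduced by hand: RMSE and MAE are the usual error summaries, and Rsquared is the squared correlation between predictions and observations. The vectors below are hypothetical values chosen only to make the arithmetic easy to follow.

```r
# Hypothetical observed and predicted values
obs  <- c(2.0, 4.5, 1.0, 7.5, 3.0)
pred <- c(2.5, 4.0, 2.0, 6.0, 3.5)

rmse <- sqrt(mean((pred - obs)^2))   # root mean squared error
rsq  <- cor(pred, obs)^2             # squared Pearson correlation
mae  <- mean(abs(pred - obs))        # mean absolute error

c(RMSE = rmse, Rsquared = rsq, MAE = mae)
```

Because this \(R^2\) is a squared correlation, it measures how well the predictions track the ordering and spread of the observations, not how close they are in absolute terms.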

e) Try building other models discussed in this chapter. Do any have better predictive performance?

ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
ridgeRegFit <- train(fpTrain, pTrain, method = "ridge",
                     tuneGrid = ridgeGrid, trControl = ctrl,
                     preProc = c("center", "scale"))
## Warning: model fit failed for Fold01: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold02: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold04: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
ridgeRegFit
## Ridge Regression 
## 
## 123 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 111, 111, 111, 111, 111, 111, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE       Rsquared    MAE      
##   0.000000000   20.51098  0.06905898   16.41406
##   0.007142857  194.28218  0.09077920  130.66816
##   0.014285714   20.66957  0.07865991   16.47118
##   0.021428571  644.30493  0.08393098  530.58719
##   0.028571429  113.58739  0.05258217   89.28957
##   0.035714286   19.54970  0.07040693   15.41411
##   0.042857143   19.24159  0.06840641   15.19483
##   0.050000000   19.04286  0.06565406   15.02679
##   0.057142857   19.01522  0.06335369   14.96659
##   0.064285714   18.81534  0.06142334   14.83894
##   0.071428571   18.72133  0.05974456   14.73841
##   0.078571429   18.62610  0.05779726   14.65993
##   0.085714286   18.55481  0.05628458   14.59050
##   0.092857143   18.49479  0.05491843   14.52929
##   0.100000000   18.44258  0.05369448   14.47271
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
rr.pred <- predict(ridgeRegFit, newdata = fpTest)
postResample(pred = rr.pred, obs = pTest)
##         RMSE     Rsquared          MAE 
## 23.542228994  0.008685525 18.737656451

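The test-set \(R^2\) for the penalized regression model, in this case ridge regression, is 0.008685525. This is below the PLS test-set value of 0.01121507, and the ridge test RMSE (23.54) is also higher than the PLS test RMSE (18.68), so ridge regression does not improve on PLS here.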
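One plausible reading of the failed fits at lambda = 0 is that there are more predictors (388) than training samples (123), so the unpenalized least-squares problem has no unique solution. The sketch below illustrates this with the textbook ridge estimator on hypothetical synthetic data. It is an assumption that this mirrors what happens inside method = "ridge", whose penalty may be parameterized differently.

```r
set.seed(42)
n <- 20    # hypothetical number of samples
p <- 50    # hypothetical number of predictors, more than n

X <- scale(matrix(rnorm(n * p), n, p))   # centered and scaled predictors
y <- rnorm(n)

XtX <- crossprod(X)

# With p > n the cross-product matrix is rank deficient, so it cannot be
# inverted and ordinary least squares (lambda = 0) is not unique
rank_XtX <- qr(XtX)$rank
rank_XtX < p

# Adding lambda to the diagonal makes the system solvable
lambda <- 0.1
beta <- solve(XtX + lambda * diag(p), crossprod(X, y - mean(y)))
length(beta)
```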

f) Would you recommend any of your models to replace the permeability laboratory experiment?

qqnorm(pls.pred, main = "PLS")
qqline(pls.pred)

qqnorm(rr.pred, main = "Ridge-Regression")
qqline(rr.pred)

Though the \(R^2\) values for both models are low, we should ultimately examine the residuals of both to see how well each model truly performs. Comparing the two, the PLS model appears to fit the data better than the ridge-regression model, which consistently under- and over-predicts.

This is something that can definitely be improved upon. Would it replace the permeability laboratory experiment? Probably not at this juncture, given test-set \(R^2\) values near 0.01, but the potential is certainly there.
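The Q-Q plots above are drawn from the predictions themselves. A residual check works on observed minus predicted values instead. The sketch below shows the idea on hypothetical synthetic vectors (obs and pred are made up). With the fitted models, pTest and pls.pred or rr.pred would take their place.

```r
set.seed(3)

# Hypothetical observed values and predictions
obs  <- rnorm(40, mean = 12, sd = 15)
pred <- 12 + 0.3 * (obs - 12) + rnorm(40, sd = 5)

# Residuals: observed minus predicted
res <- obs - pred

# Residuals against predictions: look for trends or funnel shapes
plot(pred, res, xlab = "Predicted", ylab = "Residual")
abline(h = 0, lty = 2)

# Q-Q plot of the residuals rather than of the predictions
qqnorm(res, main = "Residuals")
qqline(res)
```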

6.3 A chemical manufacturing process for a pharmaceutical product was discussed in Sect 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

a) Start R and use these commands to load the data:

data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect 3.8).

We first need to see which of the predictors have missing values.

summary(ChemicalManufacturingProcess)
##      Yield       BiologicalMaterial01 BiologicalMaterial02
##  Min.   :35.25   Min.   :4.580        Min.   :46.87       
##  1st Qu.:38.75   1st Qu.:5.978        1st Qu.:52.68       
##  Median :39.97   Median :6.305        Median :55.09       
##  Mean   :40.18   Mean   :6.411        Mean   :55.69       
##  3rd Qu.:41.48   3rd Qu.:6.870        3rd Qu.:58.74       
##  Max.   :46.34   Max.   :8.810        Max.   :64.75       
##                                                           
##  BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
##  Min.   :56.97        Min.   : 9.38        Min.   :13.24       
##  1st Qu.:64.98        1st Qu.:11.24        1st Qu.:17.23       
##  Median :67.22        Median :12.10        Median :18.49       
##  Mean   :67.70        Mean   :12.35        Mean   :18.60       
##  3rd Qu.:70.43        3rd Qu.:13.22        3rd Qu.:19.90       
##  Max.   :78.25        Max.   :23.09        Max.   :24.85       
##                                                                
##  BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
##  Min.   :40.60        Min.   :100.0        Min.   :15.88       
##  1st Qu.:46.05        1st Qu.:100.0        1st Qu.:17.06       
##  Median :48.46        Median :100.0        Median :17.51       
##  Mean   :48.91        Mean   :100.0        Mean   :17.49       
##  3rd Qu.:51.34        3rd Qu.:100.0        3rd Qu.:17.88       
##  Max.   :59.38        Max.   :100.8        Max.   :19.14       
##                                                                
##  BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
##  Min.   :11.44        Min.   :1.770        Min.   :135.8       
##  1st Qu.:12.60        1st Qu.:2.460        1st Qu.:143.8       
##  Median :12.84        Median :2.710        Median :146.1       
##  Mean   :12.85        Mean   :2.801        Mean   :147.0       
##  3rd Qu.:13.13        3rd Qu.:2.990        3rd Qu.:149.6       
##  Max.   :14.08        Max.   :6.870        Max.   :158.7       
##                                                                
##  BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
##  Min.   :18.35        Min.   : 0.00          Min.   : 0.00         
##  1st Qu.:19.73        1st Qu.:10.80          1st Qu.:19.30         
##  Median :20.12        Median :11.40          Median :21.00         
##  Mean   :20.20        Mean   :11.21          Mean   :16.68         
##  3rd Qu.:20.75        3rd Qu.:12.15          3rd Qu.:21.50         
##  Max.   :22.21        Max.   :14.10          Max.   :22.50         
##                       NA's   :1              NA's   :3             
##  ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
##  Min.   :1.47           Min.   :911.0          Min.   : 923.0        
##  1st Qu.:1.53           1st Qu.:928.0          1st Qu.: 986.8        
##  Median :1.54           Median :934.0          Median : 999.2        
##  Mean   :1.54           Mean   :931.9          Mean   :1001.7        
##  3rd Qu.:1.55           3rd Qu.:936.0          3rd Qu.:1008.9        
##  Max.   :1.60           Max.   :946.0          Max.   :1175.3        
##  NA's   :15             NA's   :1              NA's   :1             
##  ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
##  Min.   :203.0          Min.   :177.0          Min.   :177.0         
##  1st Qu.:205.7          1st Qu.:177.0          1st Qu.:177.0         
##  Median :206.8          Median :177.0          Median :178.0         
##  Mean   :207.4          Mean   :177.5          Mean   :177.6         
##  3rd Qu.:208.7          3rd Qu.:178.0          3rd Qu.:178.0         
##  Max.   :227.4          Max.   :178.0          Max.   :178.0         
##  NA's   :2              NA's   :1              NA's   :1             
##  ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
##  Min.   :38.89          Min.   : 7.500         Min.   : 7.500        
##  1st Qu.:44.89          1st Qu.: 8.700         1st Qu.: 9.000        
##  Median :45.73          Median : 9.100         Median : 9.400        
##  Mean   :45.66          Mean   : 9.179         Mean   : 9.386        
##  3rd Qu.:46.52          3rd Qu.: 9.550         3rd Qu.: 9.900        
##  Max.   :49.36          Max.   :11.600         Max.   :11.500        
##                         NA's   :9              NA's   :10            
##  ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
##  Min.   :   0.0         Min.   :32.10          Min.   :4701          
##  1st Qu.:   0.0         1st Qu.:33.90          1st Qu.:4828          
##  Median :   0.0         Median :34.60          Median :4856          
##  Mean   : 857.8         Mean   :34.51          Mean   :4854          
##  3rd Qu.:   0.0         3rd Qu.:35.20          3rd Qu.:4882          
##  Max.   :4549.0         Max.   :38.60          Max.   :5055          
##  NA's   :1                                     NA's   :1             
##  ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
##  Min.   :5904           Min.   :   0           Min.   :31.30         
##  1st Qu.:6010           1st Qu.:4561           1st Qu.:33.50         
##  Median :6032           Median :4588           Median :34.40         
##  Mean   :6039           Mean   :4566           Mean   :34.34         
##  3rd Qu.:6061           3rd Qu.:4619           3rd Qu.:35.10         
##  Max.   :6233           Max.   :4852           Max.   :40.00         
##                                                                      
##  ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
##  Min.   :   0           Min.   :5890           Min.   :   0          
##  1st Qu.:4813           1st Qu.:6001           1st Qu.:4553          
##  Median :4835           Median :6022           Median :4582          
##  Mean   :4810           Mean   :6028           Mean   :4556          
##  3rd Qu.:4862           3rd Qu.:6050           3rd Qu.:4610          
##  Max.   :4971           Max.   :6146           Max.   :4759          
##                                                                      
##  ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
##  Min.   :-1.8000        Min.   : 0.000         Min.   :0.000         
##  1st Qu.:-0.6000        1st Qu.: 3.000         1st Qu.:2.000         
##  Median :-0.3000        Median : 5.000         Median :3.000         
##  Mean   :-0.1642        Mean   : 5.406         Mean   :3.017         
##  3rd Qu.: 0.0000        3rd Qu.: 8.000         3rd Qu.:4.000         
##  Max.   : 3.6000        Max.   :12.000         Max.   :6.000         
##                         NA's   :1              NA's   :1             
##  ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
##  Min.   : 0.000         Min.   :   0           Min.   :   0          
##  1st Qu.: 4.000         1st Qu.:4832           1st Qu.:6020          
##  Median : 8.000         Median :4855           Median :6047          
##  Mean   : 8.834         Mean   :4828           Mean   :6016          
##  3rd Qu.:14.000         3rd Qu.:4877           3rd Qu.:6070          
##  Max.   :23.000         Max.   :4990           Max.   :6161          
##  NA's   :1              NA's   :5              NA's   :5             
##  ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
##  Min.   :   0           Min.   : 0.000         Min.   : 0.00         
##  1st Qu.:4560           1st Qu.: 0.000         1st Qu.:19.70         
##  Median :4587           Median :10.400         Median :19.90         
##  Mean   :4563           Mean   : 6.592         Mean   :20.01         
##  3rd Qu.:4609           3rd Qu.:10.750         3rd Qu.:20.40         
##  Max.   :4710           Max.   :11.500         Max.   :22.00         
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
##  Min.   : 0.000         Min.   : 0.00          Min.   :143.0         
##  1st Qu.: 8.800         1st Qu.:70.10          1st Qu.:155.0         
##  Median : 9.100         Median :70.80          Median :158.0         
##  Mean   : 9.161         Mean   :70.18          Mean   :158.5         
##  3rd Qu.: 9.700         3rd Qu.:71.40          3rd Qu.:162.0         
##  Max.   :11.200         Max.   :72.50          Max.   :173.0         
##  NA's   :5              NA's   :5                                    
##  ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
##  Min.   :56.00          Min.   :2.300          Min.   :463.0         
##  1st Qu.:62.00          1st Qu.:2.500          1st Qu.:490.0         
##  Median :64.00          Median :2.500          Median :495.0         
##  Mean   :63.54          Mean   :2.494          Mean   :495.6         
##  3rd Qu.:65.00          3rd Qu.:2.500          3rd Qu.:501.5         
##  Max.   :70.00          Max.   :2.600          Max.   :522.0         
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
##  Min.   :0.01700        Min.   :0.000          Min.   :0.000         
##  1st Qu.:0.01900        1st Qu.:0.700          1st Qu.:2.000         
##  Median :0.02000        Median :1.000          Median :3.000         
##  Mean   :0.01957        Mean   :1.014          Mean   :2.534         
##  3rd Qu.:0.02000        3rd Qu.:1.300          3rd Qu.:3.000         
##  Max.   :0.02200        Max.   :2.300          Max.   :3.000         
##  NA's   :5                                                           
##  ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
##  Min.   :0.000          Min.   :0.00000        Min.   :0.00000       
##  1st Qu.:7.100          1st Qu.:0.00000        1st Qu.:0.00000       
##  Median :7.200          Median :0.00000        Median :0.00000       
##  Mean   :6.851          Mean   :0.01771        Mean   :0.02371       
##  3rd Qu.:7.300          3rd Qu.:0.00000        3rd Qu.:0.00000       
##  Max.   :7.500          Max.   :0.10000        Max.   :0.20000       
##                         NA's   :1              NA's   :1             
##  ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
##  Min.   : 0.00          Min.   : 0.0000        Min.   :0.000         
##  1st Qu.:11.40          1st Qu.: 0.6000        1st Qu.:1.800         
##  Median :11.60          Median : 0.8000        Median :1.900         
##  Mean   :11.21          Mean   : 0.9119        Mean   :1.805         
##  3rd Qu.:11.70          3rd Qu.: 1.0250        3rd Qu.:1.900         
##  Max.   :12.10          Max.   :11.0000        Max.   :2.100         
##                                                                      
##  ManufacturingProcess45
##  Min.   :0.000         
##  1st Qu.:2.100         
##  Median :2.200         
##  Mean   :2.138         
##  3rd Qu.:2.300         
##  Max.   :2.600         
## 

We see that ManufacturingProcess01-ManufacturingProcess08, ManufacturingProcess10-ManufacturingProcess12, ManufacturingProcess14, ManufacturingProcess22-ManufacturingProcess31, ManufacturingProcess33-ManufacturingProcess36, ManufacturingProcess40, and ManufacturingProcess41 are all the predictors we need to impute.
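Reading the NA counts out of a long summary is error-prone, so a more compact check is to count missing values per column directly. The data frame below is a hypothetical stand-in with made-up column names. On the real data the same two lines would be applied to ChemicalManufacturingProcess.

```r
# Hypothetical stand-in for the manufacturing data
toy <- data.frame(
  Yield = c(40.1, 39.5, 41.2, 38.9),
  ProcA = c(1.5, NA, 1.6, 1.5),
  ProcB = c(930, 934, NA, NA),
  BioA  = c(6.3, 6.4, 6.1, 6.9)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Keep only the columns that actually need imputation
na_counts[na_counts > 0]
```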

library(impute)

cmp <- impute.knn(as.matrix(ChemicalManufacturingProcess))
cmp <- as.data.frame(cmp$data)
summary(cmp)
##      Yield       BiologicalMaterial01 BiologicalMaterial02
##  Min.   :35.25   Min.   :4.580        Min.   :46.87       
##  1st Qu.:38.75   1st Qu.:5.978        1st Qu.:52.68       
##  Median :39.97   Median :6.305        Median :55.09       
##  Mean   :40.18   Mean   :6.411        Mean   :55.69       
##  3rd Qu.:41.48   3rd Qu.:6.870        3rd Qu.:58.74       
##  Max.   :46.34   Max.   :8.810        Max.   :64.75       
##  BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
##  Min.   :56.97        Min.   : 9.38        Min.   :13.24       
##  1st Qu.:64.98        1st Qu.:11.24        1st Qu.:17.23       
##  Median :67.22        Median :12.10        Median :18.49       
##  Mean   :67.70        Mean   :12.35        Mean   :18.60       
##  3rd Qu.:70.43        3rd Qu.:13.22        3rd Qu.:19.90       
##  Max.   :78.25        Max.   :23.09        Max.   :24.85       
##  BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
##  Min.   :40.60        Min.   :100.0        Min.   :15.88       
##  1st Qu.:46.05        1st Qu.:100.0        1st Qu.:17.06       
##  Median :48.46        Median :100.0        Median :17.51       
##  Mean   :48.91        Mean   :100.0        Mean   :17.49       
##  3rd Qu.:51.34        3rd Qu.:100.0        3rd Qu.:17.88       
##  Max.   :59.38        Max.   :100.8        Max.   :19.14       
##  BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
##  Min.   :11.44        Min.   :1.770        Min.   :135.8       
##  1st Qu.:12.60        1st Qu.:2.460        1st Qu.:143.8       
##  Median :12.84        Median :2.710        Median :146.1       
##  Mean   :12.85        Mean   :2.801        Mean   :147.0       
##  3rd Qu.:13.13        3rd Qu.:2.990        3rd Qu.:149.6       
##  Max.   :14.08        Max.   :6.870        Max.   :158.7       
##  BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
##  Min.   :18.35        Min.   : 0.00          Min.   : 0.00         
##  1st Qu.:19.73        1st Qu.:10.78          1st Qu.:19.17         
##  Median :20.12        Median :11.40          Median :21.00         
##  Mean   :20.20        Mean   :11.20          Mean   :16.66         
##  3rd Qu.:20.75        3rd Qu.:12.12          3rd Qu.:21.50         
##  Max.   :22.21        Max.   :14.10          Max.   :22.50         
##  ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
##  Min.   :1.470          Min.   :911.0          Min.   : 923.0        
##  1st Qu.:1.530          1st Qu.:928.0          1st Qu.: 986.8        
##  Median :1.544          Median :934.0          Median : 999.4        
##  Mean   :1.540          Mean   :931.8          Mean   :1001.8        
##  3rd Qu.:1.550          3rd Qu.:936.0          3rd Qu.:1009.2        
##  Max.   :1.600          Max.   :946.0          Max.   :1175.3        
##  ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
##  Min.   :203.0          Min.   :177.0          Min.   :177.0         
##  1st Qu.:205.7          1st Qu.:177.0          1st Qu.:177.0         
##  Median :206.8          Median :177.0          Median :178.0         
##  Mean   :207.4          Mean   :177.5          Mean   :177.6         
##  3rd Qu.:208.7          3rd Qu.:178.0          3rd Qu.:178.0         
##  Max.   :227.4          Max.   :178.0          Max.   :178.0         
##  ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
##  Min.   :38.89          Min.   : 7.500         Min.   : 7.500        
##  1st Qu.:44.89          1st Qu.: 8.700         1st Qu.: 9.000        
##  Median :45.73          Median : 9.100         Median : 9.400        
##  Mean   :45.66          Mean   : 9.186         Mean   : 9.396        
##  3rd Qu.:46.52          3rd Qu.: 9.525         3rd Qu.: 9.900        
##  Max.   :49.36          Max.   :11.600         Max.   :11.500        
##  ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
##  Min.   :   0.0         Min.   :32.10          Min.   :4701          
##  1st Qu.:   0.0         1st Qu.:33.90          1st Qu.:4827          
##  Median :   0.0         Median :34.60          Median :4856          
##  Mean   : 852.9         Mean   :34.51          Mean   :4854          
##  3rd Qu.:   0.0         3rd Qu.:35.20          3rd Qu.:4882          
##  Max.   :4549.0         Max.   :38.60          Max.   :5055          
##  ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
##  Min.   :5904           Min.   :   0           Min.   :31.30         
##  1st Qu.:6010           1st Qu.:4561           1st Qu.:33.50         
##  Median :6032           Median :4588           Median :34.40         
##  Mean   :6039           Mean   :4566           Mean   :34.34         
##  3rd Qu.:6061           3rd Qu.:4619           3rd Qu.:35.10         
##  Max.   :6233           Max.   :4852           Max.   :40.00         
##  ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
##  Min.   :   0           Min.   :5890           Min.   :   0          
##  1st Qu.:4813           1st Qu.:6001           1st Qu.:4553          
##  Median :4835           Median :6022           Median :4582          
##  Mean   :4810           Mean   :6028           Mean   :4556          
##  3rd Qu.:4862           3rd Qu.:6050           3rd Qu.:4610          
##  Max.   :4971           Max.   :6146           Max.   :4759          
##  ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
##  Min.   :-1.8000        Min.   : 0.000         Min.   :0.000         
##  1st Qu.:-0.6000        1st Qu.: 3.000         1st Qu.:2.000         
##  Median :-0.3000        Median : 5.000         Median :3.000         
##  Mean   :-0.1642        Mean   : 5.406         Mean   :3.011         
##  3rd Qu.: 0.0000        3rd Qu.: 8.000         3rd Qu.:4.000         
##  Max.   : 3.6000        Max.   :12.000         Max.   :6.000         
##  ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
##  Min.   : 0.000         Min.   :   0           Min.   :   0          
##  1st Qu.: 4.000         1st Qu.:4831           1st Qu.:6020          
##  Median : 8.000         Median :4854           Median :6046          
##  Mean   : 8.823         Mean   :4825           Mean   :6013          
##  3rd Qu.:14.000         3rd Qu.:4876           3rd Qu.:6069          
##  Max.   :23.000         Max.   :4990           Max.   :6161          
##  ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
##  Min.   :   0           Min.   : 0.000         Min.   : 0.0          
##  1st Qu.:4561           1st Qu.: 0.000         1st Qu.:19.7          
##  Median :4588           Median :10.400         Median :19.9          
##  Mean   :4561           Mean   : 6.444         Mean   :20.0          
##  3rd Qu.:4609           3rd Qu.:10.700         3rd Qu.:20.4          
##  Max.   :4710           Max.   :11.500         Max.   :22.0          
##  ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
##  Min.   : 0.000         Min.   : 0.00          Min.   :143.0         
##  1st Qu.: 8.800         1st Qu.:70.10          1st Qu.:155.0         
##  Median : 9.200         Median :70.80          Median :158.0         
##  Mean   : 9.167         Mean   :70.16          Mean   :158.5         
##  3rd Qu.: 9.700         3rd Qu.:71.40          3rd Qu.:162.0         
##  Max.   :11.200         Max.   :72.50          Max.   :173.0         
##  ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
##  Min.   :56.00          Min.   :2.300          Min.   :463.0         
##  1st Qu.:62.00          1st Qu.:2.500          1st Qu.:490.0         
##  Median :64.00          Median :2.500          Median :495.5         
##  Mean   :63.49          Mean   :2.493          Mean   :495.7         
##  3rd Qu.:65.00          3rd Qu.:2.500          3rd Qu.:501.2         
##  Max.   :70.00          Max.   :2.600          Max.   :522.0         
##  ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
##  Min.   :0.01700        Min.   :0.000          Min.   :0.000         
##  1st Qu.:0.01900        1st Qu.:0.700          1st Qu.:2.000         
##  Median :0.02000        Median :1.000          Median :3.000         
##  Mean   :0.01959        Mean   :1.014          Mean   :2.534         
##  3rd Qu.:0.02000        3rd Qu.:1.300          3rd Qu.:3.000         
##  Max.   :0.02200        Max.   :2.300          Max.   :3.000         
##  ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
##  Min.   :0.000          Min.   :0.00000        Min.   :0.00000       
##  1st Qu.:7.100          1st Qu.:0.00000        1st Qu.:0.00000       
##  Median :7.200          Median :0.00000        Median :0.00000       
##  Mean   :6.851          Mean   :0.01761        Mean   :0.02358       
##  3rd Qu.:7.300          3rd Qu.:0.00000        3rd Qu.:0.00000       
##  Max.   :7.500          Max.   :0.10000        Max.   :0.20000       
##  ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
##  Min.   : 0.00          Min.   : 0.0000        Min.   :0.000         
##  1st Qu.:11.40          1st Qu.: 0.6000        1st Qu.:1.800         
##  Median :11.60          Median : 0.8000        Median :1.900         
##  Mean   :11.21          Mean   : 0.9119        Mean   :1.805         
##  3rd Qu.:11.70          3rd Qu.: 1.0250        3rd Qu.:1.900         
##  Max.   :12.10          Max.   :11.0000        Max.   :2.100         
##  ManufacturingProcess45
##  Min.   :0.000         
##  1st Qu.:2.100         
##  Median :2.200         
##  Mean   :2.138         
##  3rd Qu.:2.300         
##  Max.   :2.600

Now we have no NA’s.
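To show what nearest-neighbour imputation does in principle, here is a small base-R sketch. The function knn_impute and the toy matrix are hypothetical. This is not the algorithm implemented by impute.knn, and it skips the scaling step that would normally come first. It fills each missing cell with the average of that column over the k most similar complete rows.

```r
# Hypothetical helper: fill NAs using the k nearest complete rows
knn_impute <- function(m, k = 3) {
  complete <- m[stats::complete.cases(m), , drop = FALSE]
  out <- m
  for (i in which(!stats::complete.cases(m))) {
    miss <- is.na(m[i, ])
    # Euclidean distance to each complete row over the observed columns only
    diffs <- sweep(complete[, !miss, drop = FALSE], 2, m[i, !miss])
    d <- sqrt(rowSums(diffs^2))
    nn <- order(d)[seq_len(min(k, nrow(complete)))]
    # Average the neighbours' values in the missing columns
    out[i, miss] <- colMeans(complete[nn, miss, drop = FALSE])
  }
  out
}

toy <- rbind(
  c(1, 10, 100),
  c(2, 20, 200),
  c(3, 30, 300),
  c(2, NA, 210),   # row with a missing value
  c(10, 100, 1000)
)

imputed <- knn_impute(toy, k = 2)
imputed[4, ]
```

In this toy case the two nearest complete rows are the second and third, so the missing cell becomes the mean of 20 and 30.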

c) Split the data into a training and a test set, pre-process the data, and use a model of your choice from this chapter. What is the optimal value of the performance metric?

We’ll use Partial Least Squares, as it performed better in the previous question.

smp1 <- floor(0.75 * nrow(cmp))

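# NOTE: size = smp reuses the 123 computed in Exercise 6.2; smp1 (132) is
# never used. x1 and y1 are also drawn independently, so the predictor rows
# and yield values below do not refer to the same runs. Reusing one index
# (y1 <- x1) with size = smp1 would keep them aligned.
x1 <- sample(seq_len(nrow(cmp)), size = smp)
y1 <- sample(seq_len(nrow(cmp)), size = smp)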

cmpTrain <- cmp[x1,-1]
cmpTest <- cmp[-x1,-1]
yTrain <- cmp[y1,1]
yTest <- cmp[-y1,1]

cmpTune <- train(cmpTrain, yTrain,
                 method = "pls",
                 tuneLength = 20, trControl = ctrl,
                 preProc = c("center", "scale"))

cmpTune
## Partial Least Squares 
## 
## 123 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 111, 109, 111, 111, 111, 111, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared    MAE     
##    1     1.990940  0.08872065  1.584375
##    2     2.542706  0.03976928  1.828412
##    3     2.952260  0.08348326  2.050262
##    4     3.566405  0.07761645  2.243481
##    5     3.799803  0.06841577  2.348720
##    6     3.771451  0.06546726  2.345041
##    7     3.833465  0.06413688  2.358874
##    8     4.100629  0.07865771  2.502442
##    9     3.812132  0.07371600  2.448421
##   10     3.424531  0.04191420  2.308153
##   11     3.346966  0.06570311  2.322883
##   12     3.127500  0.06660960  2.241396
##   13     2.973043  0.06205586  2.177309
##   14     3.049624  0.05758056  2.220079
##   15     3.204395  0.05281387  2.281923
##   16     3.162355  0.06017971  2.282734
##   17     3.278946  0.06524080  2.354455
##   18     3.383178  0.06943381  2.385199
##   19     3.692067  0.06929113  2.493352
##   20     3.785993  0.07096712  2.521672
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 1.

In this instance, the optimal value was ncomp = 1, which has a resampled RMSE of 1.990940 and an \(R^2\) of 0.08872065.

d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

cmpPred <- predict(cmpTune, newdata = cmpTest)
postResample(pred = cmpPred, obs = yTest)
##      RMSE  Rsquared       MAE 
## 1.4462084 0.2041162 1.2537531

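The test-set \(R^2\) is 0.2041162 with an RMSE of 1.4462084, compared with a resampled \(R^2\) of 0.08872065 and an RMSE of 1.990940 on the training set.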

e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

cmpTune$finalModel$coefficients
## , , 1 comps
## 
##                             .outcome
## BiologicalMaterial01    0.0608653757
## BiologicalMaterial02    0.0553578126
## BiologicalMaterial03    0.0545395762
## BiologicalMaterial04    0.0400824643
## BiologicalMaterial05    0.0031263824
## BiologicalMaterial06    0.0517050408
## BiologicalMaterial07   -0.0026399147
## BiologicalMaterial08    0.0532354895
## BiologicalMaterial09    0.0491323388
## BiologicalMaterial10    0.0335511643
## BiologicalMaterial11    0.0278888413
## BiologicalMaterial12    0.0404087506
## ManufacturingProcess01  0.0135255227
## ManufacturingProcess02 -0.0186387429
## ManufacturingProcess03 -0.0243446535
## ManufacturingProcess04 -0.0157808228
## ManufacturingProcess05  0.0319798266
## ManufacturingProcess06  0.0608121273
## ManufacturingProcess07 -0.0114531023
## ManufacturingProcess08 -0.0307741942
## ManufacturingProcess09  0.0534116518
## ManufacturingProcess10  0.0131215054
## ManufacturingProcess11  0.0173837131
## ManufacturingProcess12  0.0083950179
## ManufacturingProcess13 -0.0402197652
## ManufacturingProcess14 -0.0049238624
## ManufacturingProcess15  0.0115572171
## ManufacturingProcess16  0.0039142476
## ManufacturingProcess17 -0.0205987386
## ManufacturingProcess18 -0.0501901868
## ManufacturingProcess19  0.0111998087
## ManufacturingProcess20 -0.0488868804
## ManufacturingProcess21  0.0184004041
## ManufacturingProcess22 -0.0233300208
## ManufacturingProcess23 -0.0125285815
## ManufacturingProcess24 -0.0143699415
## ManufacturingProcess25 -0.0011906891
## ManufacturingProcess26 -0.0008204135
## ManufacturingProcess27 -0.0018846585
## ManufacturingProcess28  0.0272632465
## ManufacturingProcess29  0.0046614355
## ManufacturingProcess30  0.0054802385
## ManufacturingProcess31 -0.0064500459
## ManufacturingProcess32  0.0267722840
## ManufacturingProcess33  0.0117218025
## ManufacturingProcess34 -0.0302936245
## ManufacturingProcess35 -0.0591382575
## ManufacturingProcess36 -0.0545803525
## ManufacturingProcess37  0.0105183772
## ManufacturingProcess38 -0.0302796235
## ManufacturingProcess39 -0.0443561795
## ManufacturingProcess40 -0.0140099119
## ManufacturingProcess41 -0.0127288010
## ManufacturingProcess42 -0.0194486336
## ManufacturingProcess43 -0.0067860655
## ManufacturingProcess44 -0.0162936766
## ManufacturingProcess45 -0.0204819420

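Looking at the single retained component, and keeping in mind that the predictors were centered and scaled so the coefficient magnitudes are comparable, the largest absolute coefficients are BiologicalMaterial01 (0.0608653757) and ManufacturingProcess06 (0.0608121273), followed by ManufacturingProcess35 (-0.0591382575). Of the ten largest coefficients, five are biological and five are process predictors, so neither group dominates outright. Since only 12 of the 57 predictors are biological, though, the biological materials are over-represented near the top.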
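Ranking by absolute value is easier than scanning the printed list. The sketch below copies ten of the coefficients from the output above into a named vector and sorts them by magnitude. With the fitted object, the full vector could be taken from cmpTune$finalModel$coefficients instead.

```r
# Ten coefficients copied from the output above
coefs <- c(
  BiologicalMaterial01   =  0.0608653757,
  BiologicalMaterial02   =  0.0553578126,
  BiologicalMaterial03   =  0.0545395762,
  BiologicalMaterial06   =  0.0517050408,
  BiologicalMaterial08   =  0.0532354895,
  ManufacturingProcess06 =  0.0608121273,
  ManufacturingProcess09 =  0.0534116518,
  ManufacturingProcess18 = -0.0501901868,
  ManufacturingProcess35 = -0.0591382575,
  ManufacturingProcess36 = -0.0545803525
)

# Sort by magnitude, keeping the sign for interpretation
ranked <- coefs[order(abs(coefs), decreasing = TRUE)]
ranked

# Count how many of these belong to each predictor group
table(ifelse(grepl("^Biological", names(ranked)), "Biological", "Process"))
```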

f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

The highest scoring Biological Material was 01 at 0.0608653757. In general, though the biological materials cannot be changed during the refinement process, identifying which materials matter most helps to ensure a higher yield, because the company can focus on obtaining high-quality batches of those materials. Likewise, knowing the most important manufacturing process steps, together with the sign of each coefficient, allows the company to pinpoint where to start fine-tuning the procedure and in which direction.
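A first pass at exploring these relationships is the correlation of each top predictor with yield, where the sign indicates the direction in which a process setting might be adjusted. The sketch below uses a hypothetical synthetic data frame (ProcA, ProcB, BioA and the simulated Yield are made up). On the real data the columns of cmp would be used.

```r
set.seed(7)
n <- 100

# Hypothetical predictors and a yield that depends on two of them
toy <- data.frame(ProcA = rnorm(n), ProcB = rnorm(n), BioA = rnorm(n))
toy$Yield <- 40 + 1.5 * toy$ProcA - 1.0 * toy$ProcB + rnorm(n, sd = 0.5)

# Correlation of each predictor with the response
predictors <- setdiff(names(toy), "Yield")
cors <- sapply(toy[predictors], function(v) cor(v, toy$Yield))

# Strongest relationships first; the sign gives the direction of the effect
cors[order(abs(cors), decreasing = TRUE)]
```

A positive correlation for a process predictor suggests that raising the setting may raise yield, and a negative one suggests the opposite. Correlation alone does not establish that changing the setting causes the change in yield.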