KJ 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of the product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. Manufacturing process predictors, on the other hand, can be changed during manufacturing. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.

(a)

Start R and use these commands to load the data:

> library(AppliedPredictiveModeling)
> data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

# Load data and the packages used throughout
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.2.3
library(caret)       # createDataPartition, train, preProcess, findCorrelation
library(DMwR2)       # knnImputation
library(dplyr)       # pipes and data manipulation
library(ggplot2)     # plotting
library(corrplot)    # correlation plot
library(forecast)    # accuracy()
library(kableExtra)  # kbl, kable_classic
data(ChemicalManufacturingProcess)
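
As a quick sanity check (a sketch, not part of the original output), we can confirm the data match the description above: 176 runs, with Yield plus 57 predictors.

# Confirm dimensions: 176 runs by 58 columns (Yield plus 57 predictors)
dim(ChemicalManufacturingProcess)
# Count predictors by name prefix: 12 biological, 45 process
sum(grepl('^Biological', names(ChemicalManufacturingProcess)))
sum(grepl('^Manufacturing', names(ChemicalManufacturingProcess)))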

(b)

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in those missing values (e.g., see Sect. 3.8).

We’ll first generate a summary to get a first look at the data and to show missing values, then we’ll get a count of missing values per variable.

# Take a first look
summary(ChemicalManufacturingProcess)
##      Yield       BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
##  Min.   :35.25   Min.   :4.580        Min.   :46.87        Min.   :56.97       
##  1st Qu.:38.75   1st Qu.:5.978        1st Qu.:52.68        1st Qu.:64.98       
##  Median :39.97   Median :6.305        Median :55.09        Median :67.22       
##  Mean   :40.18   Mean   :6.411        Mean   :55.69        Mean   :67.70       
##  3rd Qu.:41.48   3rd Qu.:6.870        3rd Qu.:58.74        3rd Qu.:70.43       
##  Max.   :46.34   Max.   :8.810        Max.   :64.75        Max.   :78.25       
##                                                                                
##  BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
##  Min.   : 9.38        Min.   :13.24        Min.   :40.60       
##  1st Qu.:11.24        1st Qu.:17.23        1st Qu.:46.05       
##  Median :12.10        Median :18.49        Median :48.46       
##  Mean   :12.35        Mean   :18.60        Mean   :48.91       
##  3rd Qu.:13.22        3rd Qu.:19.90        3rd Qu.:51.34       
##  Max.   :23.09        Max.   :24.85        Max.   :59.38       
##                                                                
##  BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
##  Min.   :100.0        Min.   :15.88        Min.   :11.44       
##  1st Qu.:100.0        1st Qu.:17.06        1st Qu.:12.60       
##  Median :100.0        Median :17.51        Median :12.84       
##  Mean   :100.0        Mean   :17.49        Mean   :12.85       
##  3rd Qu.:100.0        3rd Qu.:17.88        3rd Qu.:13.13       
##  Max.   :100.8        Max.   :19.14        Max.   :14.08       
##                                                                
##  BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
##  Min.   :1.770        Min.   :135.8        Min.   :18.35       
##  1st Qu.:2.460        1st Qu.:143.8        1st Qu.:19.73       
##  Median :2.710        Median :146.1        Median :20.12       
##  Mean   :2.801        Mean   :147.0        Mean   :20.20       
##  3rd Qu.:2.990        3rd Qu.:149.6        3rd Qu.:20.75       
##  Max.   :6.870        Max.   :158.7        Max.   :22.21       
##                                                                
##  ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
##  Min.   : 0.00          Min.   : 0.00          Min.   :1.47          
##  1st Qu.:10.80          1st Qu.:19.30          1st Qu.:1.53          
##  Median :11.40          Median :21.00          Median :1.54          
##  Mean   :11.21          Mean   :16.68          Mean   :1.54          
##  3rd Qu.:12.15          3rd Qu.:21.50          3rd Qu.:1.55          
##  Max.   :14.10          Max.   :22.50          Max.   :1.60          
##  NA's   :1              NA's   :3              NA's   :15            
##  ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
##  Min.   :911.0          Min.   : 923.0         Min.   :203.0         
##  1st Qu.:928.0          1st Qu.: 986.8         1st Qu.:205.7         
##  Median :934.0          Median : 999.2         Median :206.8         
##  Mean   :931.9          Mean   :1001.7         Mean   :207.4         
##  3rd Qu.:936.0          3rd Qu.:1008.9         3rd Qu.:208.7         
##  Max.   :946.0          Max.   :1175.3         Max.   :227.4         
##  NA's   :1              NA's   :1              NA's   :2             
##  ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
##  Min.   :177.0          Min.   :177.0          Min.   :38.89         
##  1st Qu.:177.0          1st Qu.:177.0          1st Qu.:44.89         
##  Median :177.0          Median :178.0          Median :45.73         
##  Mean   :177.5          Mean   :177.6          Mean   :45.66         
##  3rd Qu.:178.0          3rd Qu.:178.0          3rd Qu.:46.52         
##  Max.   :178.0          Max.   :178.0          Max.   :49.36         
##  NA's   :1              NA's   :1                                    
##  ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
##  Min.   : 7.500         Min.   : 7.500         Min.   :   0.0        
##  1st Qu.: 8.700         1st Qu.: 9.000         1st Qu.:   0.0        
##  Median : 9.100         Median : 9.400         Median :   0.0        
##  Mean   : 9.179         Mean   : 9.386         Mean   : 857.8        
##  3rd Qu.: 9.550         3rd Qu.: 9.900         3rd Qu.:   0.0        
##  Max.   :11.600         Max.   :11.500         Max.   :4549.0        
##  NA's   :9              NA's   :10             NA's   :1             
##  ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
##  Min.   :32.10          Min.   :4701           Min.   :5904          
##  1st Qu.:33.90          1st Qu.:4828           1st Qu.:6010          
##  Median :34.60          Median :4856           Median :6032          
##  Mean   :34.51          Mean   :4854           Mean   :6039          
##  3rd Qu.:35.20          3rd Qu.:4882           3rd Qu.:6061          
##  Max.   :38.60          Max.   :5055           Max.   :6233          
##                         NA's   :1                                    
##  ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
##  Min.   :   0           Min.   :31.30          Min.   :   0          
##  1st Qu.:4561           1st Qu.:33.50          1st Qu.:4813          
##  Median :4588           Median :34.40          Median :4835          
##  Mean   :4566           Mean   :34.34          Mean   :4810          
##  3rd Qu.:4619           3rd Qu.:35.10          3rd Qu.:4862          
##  Max.   :4852           Max.   :40.00          Max.   :4971          
##                                                                      
##  ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
##  Min.   :5890           Min.   :   0           Min.   :-1.8000       
##  1st Qu.:6001           1st Qu.:4553           1st Qu.:-0.6000       
##  Median :6022           Median :4582           Median :-0.3000       
##  Mean   :6028           Mean   :4556           Mean   :-0.1642       
##  3rd Qu.:6050           3rd Qu.:4610           3rd Qu.: 0.0000       
##  Max.   :6146           Max.   :4759           Max.   : 3.6000       
##                                                                      
##  ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
##  Min.   : 0.000         Min.   :0.000          Min.   : 0.000        
##  1st Qu.: 3.000         1st Qu.:2.000          1st Qu.: 4.000        
##  Median : 5.000         Median :3.000          Median : 8.000        
##  Mean   : 5.406         Mean   :3.017          Mean   : 8.834        
##  3rd Qu.: 8.000         3rd Qu.:4.000          3rd Qu.:14.000        
##  Max.   :12.000         Max.   :6.000          Max.   :23.000        
##  NA's   :1              NA's   :1              NA's   :1             
##  ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
##  Min.   :   0           Min.   :   0           Min.   :   0          
##  1st Qu.:4832           1st Qu.:6020           1st Qu.:4560          
##  Median :4855           Median :6047           Median :4587          
##  Mean   :4828           Mean   :6016           Mean   :4563          
##  3rd Qu.:4877           3rd Qu.:6070           3rd Qu.:4609          
##  Max.   :4990           Max.   :6161           Max.   :4710          
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
##  Min.   : 0.000         Min.   : 0.00          Min.   : 0.000        
##  1st Qu.: 0.000         1st Qu.:19.70          1st Qu.: 8.800        
##  Median :10.400         Median :19.90          Median : 9.100        
##  Mean   : 6.592         Mean   :20.01          Mean   : 9.161        
##  3rd Qu.:10.750         3rd Qu.:20.40          3rd Qu.: 9.700        
##  Max.   :11.500         Max.   :22.00          Max.   :11.200        
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
##  Min.   : 0.00          Min.   :143.0          Min.   :56.00         
##  1st Qu.:70.10          1st Qu.:155.0          1st Qu.:62.00         
##  Median :70.80          Median :158.0          Median :64.00         
##  Mean   :70.18          Mean   :158.5          Mean   :63.54         
##  3rd Qu.:71.40          3rd Qu.:162.0          3rd Qu.:65.00         
##  Max.   :72.50          Max.   :173.0          Max.   :70.00         
##  NA's   :5                                     NA's   :5             
##  ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
##  Min.   :2.300          Min.   :463.0          Min.   :0.01700       
##  1st Qu.:2.500          1st Qu.:490.0          1st Qu.:0.01900       
##  Median :2.500          Median :495.0          Median :0.02000       
##  Mean   :2.494          Mean   :495.6          Mean   :0.01957       
##  3rd Qu.:2.500          3rd Qu.:501.5          3rd Qu.:0.02000       
##  Max.   :2.600          Max.   :522.0          Max.   :0.02200       
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
##  Min.   :0.000          Min.   :0.000          Min.   :0.000         
##  1st Qu.:0.700          1st Qu.:2.000          1st Qu.:7.100         
##  Median :1.000          Median :3.000          Median :7.200         
##  Mean   :1.014          Mean   :2.534          Mean   :6.851         
##  3rd Qu.:1.300          3rd Qu.:3.000          3rd Qu.:7.300         
##  Max.   :2.300          Max.   :3.000          Max.   :7.500         
##                                                                      
##  ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
##  Min.   :0.00000        Min.   :0.00000        Min.   : 0.00         
##  1st Qu.:0.00000        1st Qu.:0.00000        1st Qu.:11.40         
##  Median :0.00000        Median :0.00000        Median :11.60         
##  Mean   :0.01771        Mean   :0.02371        Mean   :11.21         
##  3rd Qu.:0.00000        3rd Qu.:0.00000        3rd Qu.:11.70         
##  Max.   :0.10000        Max.   :0.20000        Max.   :12.10         
##  NA's   :1              NA's   :1                                    
##  ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
##  Min.   : 0.0000        Min.   :0.000          Min.   :0.000         
##  1st Qu.: 0.6000        1st Qu.:1.800          1st Qu.:2.100         
##  Median : 0.8000        Median :1.900          Median :2.200         
##  Mean   : 0.9119        Mean   :1.805          Mean   :2.138         
##  3rd Qu.: 1.0250        3rd Qu.:1.900          3rd Qu.:2.300         
##  Max.   :11.0000        Max.   :2.100          Max.   :2.600         
## 
# Show missing value counts
dfchem <- ChemicalManufacturingProcess  # To avoid typing this many letters
data.frame(Missing=colSums(is.na(dfchem))) %>%
    filter(Missing > 0) %>%
    kbl(caption='Missing value counts') %>%
    kable_classic(full_width=F)
Missing value counts

Predictor                 Missing
ManufacturingProcess01       1
ManufacturingProcess02       3
ManufacturingProcess03      15
ManufacturingProcess04       1
ManufacturingProcess05       1
ManufacturingProcess06       2
ManufacturingProcess07       1
ManufacturingProcess08       1
ManufacturingProcess10       9
ManufacturingProcess11      10
ManufacturingProcess12       1
ManufacturingProcess14       1
ManufacturingProcess22       1
ManufacturingProcess23       1
ManufacturingProcess24       1
ManufacturingProcess25       5
ManufacturingProcess26       5
ManufacturingProcess27       5
ManufacturingProcess28       5
ManufacturingProcess29       5
ManufacturingProcess30       5
ManufacturingProcess31       5
ManufacturingProcess33       5
ManufacturingProcess34       5
ManufacturingProcess35       5
ManufacturingProcess36       5
ManufacturingProcess40       1
ManufacturingProcess41       1
# Impute missing values using k-nearest neighbors (mice is a drop-in alternative)
#imp <- mice(dfchem, printFlag=F)
#dfchem2 <- complete(imp)
dfchem2 <- knnImputation(dfchem, k=3)  # returns the completed data frame directly

# Verify no more missing values exist
print(paste0('Missing values after imputation: ', sum(is.na(dfchem2))))
## [1] "Missing values after imputation: 0"
# Look for skewness
# As a general rule of thumb: If skewness is less than -1 or greater than 1, the distribution is highly skewed.
# If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed. 
# If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.
print("Skewness of columns:")
## [1] "Skewness of columns:"
colskewness <- apply(dfchem2, 2, e1071::skewness, type=1)
colskewness
##                  Yield   BiologicalMaterial01   BiologicalMaterial02 
##            0.313628736            0.275662556            0.246222354 
##   BiologicalMaterial03   BiologicalMaterial04   BiologicalMaterial05 
##            0.028755477            1.747184905            0.306614797 
##   BiologicalMaterial06   BiologicalMaterial07   BiologicalMaterial08 
##            0.371697783            7.462171782            0.221942745 
##   BiologicalMaterial09   BiologicalMaterial10   BiologicalMaterial11 
##           -0.270721711            2.422999528            0.361901086 
##   BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02 
##            0.306452433           -3.967337791           -1.469933180 
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 
##           -0.589468598           -0.709609748            2.619804979 
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 
##            3.080054936            0.087004003           -0.209672712 
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11 
##           -0.948742838            0.644366087           -0.014138720 
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14 
##            1.601281538            0.484400166           -0.002762522 
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17 
##            0.680136157          -12.526835771            1.172954087 
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20 
##          -12.845460479            0.299893667          -12.746809903 
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23 
##            1.743956159            0.311971249            0.187396582 
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 
##            0.355136773          -12.927210865          -12.966657435 
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 
##          -12.812938765           -0.397520290          -10.298318542 
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32 
##           -4.861799837          -12.074918634            0.213038336 
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 
##           -0.103459835           -0.234979423           -0.125325542 
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38 
##            0.180092333            0.381605455           -1.696241287 
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41 
##           -4.305766123            1.700356568            2.197063408 
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44 
##           -5.496789260            9.132598658           -5.013019018 
## ManufacturingProcess45 
##           -4.112944721
# Show skewed columns
print("Skewed columns (skewness < -1 or > 1):")
## [1] "Skewed columns (skewness < -1 or > 1):"
skewedcols <- dfchem2[colskewness < -1 | colskewness > 1]
colnames(skewedcols)
##  [1] "BiologicalMaterial04"   "BiologicalMaterial07"   "BiologicalMaterial10"  
##  [4] "ManufacturingProcess01" "ManufacturingProcess02" "ManufacturingProcess05"
##  [7] "ManufacturingProcess06" "ManufacturingProcess12" "ManufacturingProcess16"
## [10] "ManufacturingProcess17" "ManufacturingProcess18" "ManufacturingProcess20"
## [13] "ManufacturingProcess21" "ManufacturingProcess25" "ManufacturingProcess26"
## [16] "ManufacturingProcess27" "ManufacturingProcess29" "ManufacturingProcess30"
## [19] "ManufacturingProcess31" "ManufacturingProcess38" "ManufacturingProcess39"
## [22] "ManufacturingProcess40" "ManufacturingProcess41" "ManufacturingProcess42"
## [25] "ManufacturingProcess43" "ManufacturingProcess44" "ManufacturingProcess45"

(c)

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

First, we’ll choose a random seed for reproducible results. Then we’ll split the data using createDataPartition(), which attempts to split the data so that the training and test sets have approximately proportionate distributions of the outcome variable.

# Set seed
set.seed(777)

# Split into train/test; createDataPartition generates indices of the training set
train_indices <- createDataPartition(dfchem2$Yield, p=0.80, times=1, list=F)
dftrain <- dfchem2[train_indices,]
dftest <- dfchem2[-train_indices,]

# Examine train and test sets
hist(dftrain$Yield, main='Training set yield histogram')

hist(dftest$Yield, main='Test set yield histogram')
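
As a numerical complement to the histograms (a quick check), we can compare the outcome quantiles of the two sets directly:

# Verify the split preserved the outcome distribution
summary(dftrain$Yield)
summary(dftest$Yield)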

Now we’ll examine the predictors to look for correlations and examine their relationship to the outcome variable.

# Corr chart - not useful here because of the number of predictors
# chart.Correlation(dftrain)

# Corr plot
corr1 <- cor(dftrain)
corrplot(corr1, order='hclust', type='upper', diag=F)

# Show candidates for removal
print("Candidates for removal due to high correlation:")
## [1] "Candidates for removal due to high correlation:"
high_corr <- findCorrelation(corr1, cutoff=0.9, exact=T, verbose=F, names=F)
colnames(dftrain[,high_corr])
##  [1] "BiologicalMaterial02"   "BiologicalMaterial04"   "BiologicalMaterial12"  
##  [4] "ManufacturingProcess29" "ManufacturingProcess42" "ManufacturingProcess27"
##  [7] "ManufacturingProcess25" "ManufacturingProcess31" "ManufacturingProcess18"
## [10] "ManufacturingProcess40"
# Remove the highly correlated variables
dftrain2 <- dftrain[,-high_corr]
dftest2 <- dftest[,-high_corr]

Look at near-zero variance features.

# Remove NZV features
tmp_nzv <- nearZeroVar(dftrain2)
print(paste0('Near-zero variance features: ', colnames(dftrain2)[tmp_nzv]))
## [1] "Near-zero variance features: BiologicalMaterial07"
dftrain2 <- dftrain2[,-tmp_nzv]
dftest2 <- dftest2[,-tmp_nzv]

Now we’ll try modeling. First, prepare a data frame for the results, separate the outcome and predictor variables, and set cross-validation parameters.

# Results data frame
dfr <- data.frame(matrix(nrow=10, ncol=3))
colnames(dfr) <- c('Model', 'Tuning.Parameters', 'RMSE')

# Separate outcome and predictors
trainx <- dftrain2 %>% dplyr::select(-Yield)
trainy <- dftrain2$Yield
testx <- dftest2 %>% dplyr::select(-Yield)
testy <- dftest2$Yield

# Specify 10-fold cross-validation
ctrl <- trainControl(method='cv', number=10)

Pre-processing experiments

# Testing out pre-processing
print('-------------------------------------------------------')
## [1] "-------------------------------------------------------"
print('No pre-processing')
## [1] "No pre-processing"
set.seed(777)
fitlm1 <- train(x=trainx, y=trainy, method='lm', trControl=ctrl)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## (warning repeated for each of the 10 cross-validation folds)
predict(fitlm1, newdata=as.matrix(testx))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
##        4        5       18       23       27       29       32       37 
## 41.45387 42.63260 39.68963 72.65342 35.91451 37.51592 41.83047 43.24816 
##       49       50       58       60       65       72       77       80 
## 43.58761 41.80537 39.53683 39.98923 42.38107 40.76567 39.95035 38.58628 
##       88       95       96      100      105      110      111      115 
## 41.19641 40.23947 39.72705 38.16562 38.25934 38.93946 40.02140 40.38711 
##      117      126      140      154      163      169      170      174 
## 41.30805 40.68793 38.67291 39.14236 40.41033 37.47626 38.74232 42.23967
print('-------------------------------------------------------')
## [1] "-------------------------------------------------------"
print('Pre-process within train() function: Center and scaling only')
## [1] "Pre-process within train() function: Center and scaling only"
set.seed(777)
fitlm1 <- train(x=trainx, y=trainy, method='lm', trControl=ctrl, preProcess=c('center', 'scale'))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## (warning repeated for each of the 10 cross-validation folds)
predict(fitlm1, newdata=as.matrix(testx))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
##        4        5       18       23       27       29       32       37 
## 41.45387 42.63260 39.68963 72.65342 35.91451 37.51592 41.83047 43.24816 
##       49       50       58       60       65       72       77       80 
## 43.58761 41.80537 39.53683 39.98923 42.38107 40.76567 39.95035 38.58628 
##       88       95       96      100      105      110      111      115 
## 41.19641 40.23947 39.72705 38.16562 38.25934 38.93946 40.02140 40.38711 
##      117      126      140      154      163      169      170      174 
## 41.30805 40.68793 38.67291 39.14236 40.41033 37.47626 38.74232 42.23967
print('-------------------------------------------------------')
## [1] "-------------------------------------------------------"
print('Pre-process within train() function: Center, scaling, YeoJohnson')
## [1] "Pre-process within train() function: Center, scaling, YeoJohnson"
set.seed(777)
fitlm1 <- train(x=trainx, y=trainy, method='lm', trControl=ctrl, preProcess=c('center', 'scale', 'YeoJohnson'))
predict(fitlm1, newdata=as.matrix(testx))
##            4            5           18           23           27           29 
## 4.094113e+01 4.221883e+01 3.992967e+01 1.201549e+12 3.541196e+01 3.721764e+01 
##           32           37           49           50           58           60 
## 4.194176e+01 4.291529e+01 4.385981e+01 4.144476e+01 3.783137e+01 4.008025e+01 
##           65           72           77           80           88           95 
## 4.238697e+01 4.068365e+01 3.991285e+01 3.832768e+01 4.119786e+01 3.976468e+01 
##           96          100          105          110          111          115 
## 3.924898e+01 3.804538e+01 3.823628e+01 3.923236e+01 4.036698e+01 4.024042e+01 
##          117          126          140          154          163          169 
## 4.131700e+01 4.073699e+01 3.902077e+01 3.942066e+01 4.042338e+01 3.704005e+01 
##          170          174 
## 3.846184e+01 4.167869e+01
# Preprocess training x data
transobj_trainx <- preProcess(trainx, method=c('center', 'scale', 'YeoJohnson'))
trainx_trans <- predict(transobj_trainx, trainx)

# Preprocess test x data
# Note: for a deployed model, the training-set transformation should be applied
# to the test data (i.e., predict(transobj_trainx, testx)); a separate transform
# is estimated here as part of this pre-processing experiment
transobj_testx <- preProcess(testx, method=c('center', 'scale', 'YeoJohnson'))
testx_trans <- predict(transobj_testx, testx)

# Fit model using transformed x data
print('-------------------------------------------------------')
## [1] "-------------------------------------------------------"
print('Pre-process outside of train() function: Center, scaling, YeoJohnson on x data only')
## [1] "Pre-process outside of train() function: Center, scaling, YeoJohnson on x data only"
set.seed(777)
fitlm1 <- train(x=trainx_trans, y=trainy, method='lm', trControl=ctrl)
predict(fitlm1, newdata=as.matrix(testx_trans))
##        4        5       18       23       27       29       32       37 
## 42.09880 44.26170 40.49845 44.41176 36.99165 38.76723 41.38606 42.18600 
##       49       50       58       60       65       72       77       80 
## 43.61614 40.98222 37.32326 39.05236 41.34297 40.84644 39.41932 37.55382 
##       88       95       96      100      105      110      111      115 
## 41.24311 39.70256 39.30794 38.65147 38.01144 38.81168 40.49879 39.12775 
##      117      126      140      154      163      169      170      174 
## 40.86134 40.92234 38.94973 39.17437 40.01704 37.99964 39.73224 41.28282
# Preprocess training y data
transobj_trainy <- preProcess(as.data.frame(trainy), method=c('center', 'scale', 'YeoJohnson'))
trainy_trans <- predict(transobj_trainy, as.data.frame(trainy))

# Fit model using transformed x and y data
print('-------------------------------------------------------')
## [1] "-------------------------------------------------------"
print('Pre-process outside of train() function: Center, scaling, YeoJohnson on both x and y data')
## [1] "Pre-process outside of train() function: Center, scaling, YeoJohnson on both x and y data"
set.seed(777)
fitlm1 <- train(x=trainx_trans, y=trainy_trans$trainy, method='lm', trControl=ctrl)
testy_trans <- predict(fitlm1, newdata=as.matrix(testx_trans))
transobj_testy <- preProcess(as.data.frame(testy_trans), method=c('center', 'scale', 'YeoJohnson'))
predict(transobj_testy, as.data.frame(testy_trans))
##      testy_trans
## 4    1.004088890
## 5    1.920870344
## 18   0.310978196
## 23   1.854997594
## 27  -2.193780517
## 29  -0.891372789
## 32   0.709107066
## 37   1.138481858
## 49   1.722119936
## 50   0.502469197
## 58  -1.533989854
## 60  -0.492274137
## 65   0.708979938
## 72   0.422600779
## 77  -0.329975252
## 80  -1.395272642
## 88   0.658728298
## 95  -0.199432730
## 96  -0.411736013
## 100 -0.881490837
## 105 -1.187210700
## 110 -0.658494961
## 111  0.337418788
## 115 -0.365231063
## 117  0.542788610
## 126  0.502745721
## 140 -0.652864036
## 154 -0.402308668
## 163 -0.007460397
## 169 -1.229123782
## 170 -0.265956083
## 174  0.761599245

Linear regression

Since linear regression is sensitive to skewed predictors, we’ll need to transform them. But because some predictors contain zero and negative values, we’ll use the Yeo-Johnson transformation instead of Box-Cox.
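
A quick check (a sketch) confirms why Box-Cox is unsuitable here: it requires strictly positive data, and the summary above showed many predictors with minimums at or below zero.

# Count predictors containing zero or negative values; Box-Cox requires x > 0
sum(apply(trainx, 2, function(x) any(x <= 0)))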

# Linear model
set.seed(777)
fitlm1 <- train(x=trainx, y=trainy, method='lm', trControl=ctrl, preProcess=c('center', 'scale', 'YeoJohnson'))
fitlm1
## Linear Regression 
## 
## 144 samples
##  46 predictor
## 
## Pre-processing: centered (46), scaled (46), Yeo-Johnson transformation (23) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 128, 131, 129, 130, 130, ... 
## Resampling results:
## 
##   RMSE          Rsquared   MAE        
##   117739644465  0.4828238  29434911117
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
dfr[1,] = data.frame(Model='LM', Tuning.Parameters='', RMSE=fitlm1$results[['RMSE']])

# Stepwise linear model using MASS package/caret
stepGrid <- data.frame(.nvmax=seq(1, round(ncol(trainx) / 2, 0)))  # Max number of parameters to use
set.seed(777)
fitlm2 <- train(x=trainx, y=trainy, method='leapSeq', trControl=ctrl, tuneGrid=stepGrid, preProcess=c('center', 'scale', 'YeoJohnson'))
fitlm2
## Linear Regression with Stepwise Selection 
## 
## 144 samples
##  46 predictor
## 
## Pre-processing: centered (46), scaled (46), Yeo-Johnson transformation (23) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 128, 131, 129, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE          Rsquared   MAE         
##    1     1.431491e+00  0.4342738  1.144534e+00
##    2     1.158788e+00  0.6031342  9.437858e-01
##    3     1.196402e+00  0.5713593  9.745006e-01
##    4     1.200655e+00  0.5713735  9.781351e-01
##    5     1.111279e+00  0.6419103  8.914018e-01
##    6     5.362400e+10  0.5215602  1.340600e+10
##    7     5.945710e+10  0.5228394  1.486428e+10
##    8     1.241937e+00  0.5558046  1.004326e+00
##    9     5.950209e+10  0.5004446  1.487552e+10
##   10     6.496760e+10  0.4864935  1.624190e+10
##   11     5.294823e+10  0.5207291  1.323706e+10
##   12     5.388407e+10  0.3643220  1.347102e+10
##   13     5.481079e+10  0.5141754  1.370270e+10
##   14     5.240901e+10  0.4454655  1.310225e+10
##   15     8.587152e+10  0.4471269  2.146788e+10
##   16     8.385069e+10  0.5036706  2.096267e+10
##   17     5.800547e+10  0.4495873  1.450137e+10
##   18     5.477397e+10  0.4834435  1.369349e+10
##   19     7.008780e+10  0.5196428  1.752195e+10
##   20     8.466672e+10  0.5577231  2.116668e+10
##   21     8.733927e+10  0.4745233  2.183482e+10
##   22     9.003694e+10  0.4912813  2.250923e+10
##   23     9.226087e+10  0.5070177  2.306522e+10
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 5.
dfr[2,] = data.frame(Model='LM (stepwise)', Tuning.Parameters=paste0('nvmax=', fitlm2$bestTune[['nvmax']]), RMSE=min(fitlm2$results[['RMSE']]))

Robust linear regression

# Robust linear model - without PCA pre-processing this errors out with
# "'x' is singular: singular fits are not implemented in 'rlm'"
set.seed(777)
fitrlm <- train(x=trainx, y=trainy, method='rlm', preProcess=c('center', 'scale', 'pca'), trControl=ctrl)
fitrlm
## Robust Linear Model 
## 
## 144 samples
##  46 predictor
## 
## Pre-processing: centered (46), scaled (46), principal component
##  signal extraction (46) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 128, 131, 129, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   intercept  psi           RMSE       Rsquared   MAE      
##   FALSE      psi.huber     40.359305  0.5187123  40.309744
##   FALSE      psi.hampel    40.359305  0.5187123  40.309744
##   FALSE      psi.bisquare  40.359228  0.5189204  40.309711
##    TRUE      psi.huber      1.703796  0.5128949   1.152851
##    TRUE      psi.hampel     1.639212  0.5197273   1.122953
##    TRUE      psi.bisquare   1.649879  0.5141430   1.138463
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi = psi.hampel.
dfr[3,] = data.frame(Model='RLM', Tuning.Parameters=paste0('psi=', fitrlm$bestTune[['psi']]), RMSE=min(fitrlm$results[['RMSE']]))

Partial least squares

# PLS
set.seed(777)
fitpls <- train(x=trainx, y=trainy, method='pls', preProcess=c('center', 'scale'), trControl=ctrl, tuneLength=nrow(trainx) / 2)
fitpls
## Partial Least Squares 
## 
## 144 samples
##  46 predictor
## 
## Pre-processing: centered (46), scaled (46) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 128, 131, 129, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.526790  0.4263332  1.166839
##    2     1.511817  0.5062515  1.092722
##    3     1.629087  0.5535124  1.117592
##    4     1.875163  0.5474681  1.152308
##    5     2.053876  0.5302709  1.212185
##    6     2.196015  0.5197294  1.254813
##    7     2.549015  0.4954945  1.373743
##    8     2.774119  0.4848949  1.430631
##    9     2.912486  0.5013733  1.452154
##   10     3.107472  0.4982692  1.494394
##   11     3.228344  0.5000939  1.517594
##   12     3.405619  0.4938940  1.557258
##   13     3.476198  0.4977553  1.572813
##   14     3.617499  0.4972972  1.603532
##   15     3.730694  0.4994671  1.629883
##   16     3.849865  0.5003645  1.664269
##   17     3.954212  0.4996230  1.692126
##   18     4.055287  0.4950035  1.724869
##   19     4.139159  0.4924955  1.749936
##   20     4.215228  0.4921319  1.770767
##   21     4.362633  0.4897603  1.813180
##   22     4.460844  0.4886693  1.841417
##   23     4.593184  0.4861933  1.878487
##   24     4.748029  0.4829737  1.920425
##   25     4.864180  0.4808356  1.950680
##   26     4.949870  0.4783736  1.975737
##   27     4.991077  0.4773965  1.986732
##   28     5.012214  0.4765909  1.992702
##   29     5.035560  0.4759194  2.000144
##   30     5.051121  0.4748363  2.005798
##   31     5.064986  0.4744422  2.009019
##   32     5.077696  0.4744943  2.012298
##   33     5.088090  0.4743634  2.015385
##   34     5.090468  0.4746549  2.015811
##   35     5.094116  0.4749406  2.016809
##   36     5.103858  0.4748463  2.019486
##   37     5.108222  0.4749117  2.020422
##   38     5.108992  0.4747286  2.020758
##   39     5.108526  0.4746318  2.020737
##   40     5.110694  0.4745965  2.021325
##   41     5.114275  0.4745892  2.022312
##   42     5.115730  0.4746206  2.022649
##   43     5.116157  0.4746222  2.022756
##   44     5.115674  0.4746303  2.022622
##   45     5.115640  0.4746321  2.022611
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.
dfr[4,] = data.frame(Model='PLS', Tuning.Parameters=paste0('ncomp=', fitpls$bestTune), RMSE=min(fitpls$results[['RMSE']]))

Ridge regression

# Ridge regression
set.seed(777)
ridgeGrid <- data.frame(.lambda=seq(0, 0.1, length=15))
fitridge <- train(x=trainx, y=trainy, method='ridge', preProcess=c('center', 'scale'), trControl=ctrl, tuneGrid=ridgeGrid)
fitridge
## Ridge Regression 
## 
## 144 samples
##  46 predictor
## 
## Pre-processing: centered (46), scaled (46) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 128, 131, 129, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE      Rsquared   MAE     
##   0.000000000  5.115640  0.4746321  2.022611
##   0.007142857  4.007837  0.4979266  1.709757
##   0.014285714  3.520810  0.5078463  1.578087
##   0.021428571  3.237975  0.5133402  1.503131
##   0.028571429  3.050304  0.5168667  1.454249
##   0.035714286  2.915124  0.5193879  1.418881
##   0.042857143  2.812181  0.5213426  1.391944
##   0.050000000  2.730595  0.5229511  1.370663
##   0.057142857  2.663981  0.5243319  1.354138
##   0.064285714  2.608329  0.5255526  1.340362
##   0.071428571  2.560989  0.5266534  1.328619
##   0.078571429  2.520133  0.5276596  1.318586
##   0.085714286  2.484452  0.5285878  1.310057
##   0.092857143  2.452986  0.5294493  1.302455
##   0.100000000  2.425008  0.5302522  1.295635
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
dfr[5,] = data.frame(Model='Ridge', Tuning.Parameters=paste0('lambda=', fitridge$bestTune[['lambda']]), RMSE=min(fitridge$results[['RMSE']]))

Lasso regression

# Lasso regression
enetGrid <- expand.grid(.lambda=c(0, 0.01, 0.1), .fraction=seq(0.05, 1, length=20))
set.seed(777)
fitlasso <- train(x=trainx, y=trainy, method='enet', preProcess=c('center', 'scale'), trControl=ctrl, tuneGrid=enetGrid)
fitlasso
## Elasticnet 
## 
## 144 samples
##  46 predictor
## 
## Pre-processing: centered (46), scaled (46) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 128, 131, 129, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE     
##   0.00    0.05      1.496260  0.5634332  1.242877
##   0.00    0.10      1.280891  0.5850413  1.067828
##   0.00    0.15      1.244567  0.5705032  1.017070
##   0.00    0.20      1.552368  0.5657841  1.071553
##   0.00    0.25      1.693358  0.5683321  1.101153
##   0.00    0.30      1.841463  0.5634434  1.138020
##   0.00    0.35      1.960392  0.5572569  1.165177
##   0.00    0.40      2.121687  0.5552184  1.204194
##   0.00    0.45      2.265305  0.5548726  1.238814
##   0.00    0.50      2.502205  0.5286630  1.306408
##   0.00    0.55      2.775275  0.5179853  1.374875
##   0.00    0.60      3.027842  0.5135267  1.437783
##   0.00    0.65      3.330734  0.5149751  1.514301
##   0.00    0.70      3.687284  0.5117928  1.608031
##   0.00    0.75      4.021617  0.5070116  1.699490
##   0.00    0.80      4.287030  0.5007199  1.774338
##   0.00    0.85      4.522109  0.4933412  1.844119
##   0.00    0.90      4.743989  0.4860898  1.911502
##   0.00    0.95      4.927568  0.4793554  1.967701
##   0.00    1.00      5.115640  0.4746321  2.022611
##   0.01    0.05      1.552755  0.5392355  1.289136
##   0.01    0.10      1.368745  0.5773088  1.139872
##   0.01    0.15      1.228621  0.5934733  1.022713
##   0.01    0.20      1.271444  0.5651755  1.021248
##   0.01    0.25      1.512796  0.5690943  1.058149
##   0.01    0.30      1.638723  0.5724123  1.083730
##   0.01    0.35      1.754695  0.5710581  1.108978
##   0.01    0.40      1.893296  0.5652949  1.146110
##   0.01    0.45      2.001564  0.5558437  1.178080
##   0.01    0.50      2.097541  0.5539844  1.201485
##   0.01    0.55      2.234012  0.5471378  1.241234
##   0.01    0.60      2.402482  0.5342127  1.285540
##   0.01    0.65      2.583688  0.5270247  1.328696
##   0.01    0.70      2.768173  0.5214363  1.374307
##   0.01    0.75      2.956960  0.5170501  1.422059
##   0.01    0.80      3.118342  0.5158694  1.463990
##   0.01    0.85      3.338838  0.5148072  1.521653
##   0.01    0.90      3.511004  0.5118248  1.568048
##   0.01    0.95      3.651312  0.5079108  1.607902
##   0.01    1.00      3.775369  0.5027184  1.646082
##   0.10    0.05      1.623123  0.4761389  1.346421
##   0.10    0.10      1.493217  0.5555220  1.239770
##   0.10    0.15      1.372376  0.5770680  1.143356
##   0.10    0.20      1.266243  0.5905997  1.055866
##   0.10    0.25      1.206276  0.5956067  1.007595
##   0.10    0.30      1.209429  0.5795297  1.002188
##   0.10    0.35      1.347724  0.5601984  1.020873
##   0.10    0.40      1.487463  0.5653996  1.047829
##   0.10    0.45      1.584918  0.5710952  1.067204
##   0.10    0.50      1.670864  0.5721786  1.087431
##   0.10    0.55      1.750869  0.5720047  1.106169
##   0.10    0.60      1.828812  0.5705264  1.126478
##   0.10    0.65      1.909082  0.5675414  1.148811
##   0.10    0.70      1.993045  0.5662598  1.168350
##   0.10    0.75      2.096245  0.5606080  1.197623
##   0.10    0.80      2.204726  0.5522837  1.230596
##   0.10    0.85      2.286488  0.5446859  1.254934
##   0.10    0.90      2.333570  0.5389458  1.269233
##   0.10    0.95      2.378917  0.5345773  1.282115
##   0.10    1.00      2.425008  0.5302522  1.295635
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.25 and lambda = 0.1.
dfr[6,] = data.frame(
    Model='Lasso (elastic net)', 
    Tuning.Parameters=paste0('lambda=', fitlasso$bestTune[['lambda']], ', fraction=', fitlasso$bestTune[['fraction']]), 
    RMSE=min(fitlasso$results[['RMSE']])
)
# Model comparison
dfr %>%
    filter(!is.na(RMSE)) %>%
    kbl(caption='Model comparison') %>%
    kable_classic(full_width=F)
Model comparison

Model                 Tuning.Parameters           RMSE
LM                                                1.177396e+11
LM (stepwise)         nvmax=5                     1.111279e+00
RLM                   psi=psi.hampel              1.639212e+00
PLS                   ncomp=2                     1.511817e+00
Ridge                 lambda=0.1                  2.425008e+00
Lasso (elastic net)   lambda=0.1, fraction=0.25   1.206276e+00

Although the stepwise linear model achieved the lowest cross-validated RMSE, problems with outliers among its selected predictors appear to have biased the model. We’ll therefore choose the next-best model: the lasso (elastic net) with lambda=0.1 and fraction=0.25, which had a cross-validated RMSE of 1.21.

We’ll check residuals to make sure the model is valid.

# Predict y values of training data (caret's predict.train applies the final
# tuned model, so no s/fraction arguments are needed)
predy_train <- predict(fitlasso, newdata=as.matrix(trainx))

# Check residual plot
ggplot() +
    geom_point(aes(x=predy_train, y=trainy - predy_train)) +
    ggtitle('Residual plot for chosen model (lasso)')

# Breusch-Pagan test to determine homoscedasticity of residuals
# Null hypothesis: the residuals are homoscedastic.
# If the p-value is small, reject the null, i.e., consider the residuals heteroscedastic.
#bp <- bptest(fitlm2)
#if (bp$p.value > 0.05 & bp$statistic < 10) {
#  print(paste0("Breusch-Pagan test for homoscedasticity: The p-value of ", bp$p.value, " is > 0.05 and the test statistic of ", bp$statistic,
#      " is < 10, so don't reject the null; i.e., the residuals are HOMOSCEDASTIC"))
#} else if (bp$p.value <= 0.05) {
#  print(paste0("Breusch-Pagan test for homoscedasticity: The p-value of ", bp$p.value, " is <= 0.05 and the test statistic is ", bp$statistic,
#      ", so reject the null; i.e., the residuals are HETEROSCEDASTIC"))
#} else {
#  print(paste0("Breusch-Pagan test for homoscedasticity: The p-value of ", bp$p.value, " and test statistic of ", bp$statistic,
#      " are inconclusive, so homoscedasticity can't be determined using this test."))
#}

The residuals don’t exhibit any pattern and appear appropriate for a well-fit model.

(d)

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

First, we’ll calculate accuracy metrics on the test set.

# Generate y predictions (predict.train again applies the tuned model)
predy <- predict(fitlasso, newdata=as.matrix(testx))

# Get accuracy results
accuracy(predy, testy)
##                  ME     RMSE       MAE        MPE     MAPE
## Test set -0.1320439 1.343438 0.9877057 -0.4636695 2.450543

As expected, the RMSE for the test set is higher than that of the training set (1.34 versus 1.21, respectively). But the low MAPE of 2.45% suggests good model performance regardless.
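
For reference, the test-set RMSE reported by accuracy() is simply the root mean squared prediction error, which can be computed directly:

# RMSE by hand: should match the accuracy() output above
sqrt(mean((predy - testy)^2))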

(e)

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

# Extract coefficients from model (lasso). Note: with mode='fraction',
# predict.enet reads the fraction from s (0.1 here); the separate fraction
# argument is not part of predict.enet's signature and is ignored.
enetCoef <- predict(fitlasso$finalModel, s=0.1, fraction=0.5, mode='fraction', type='coefficients')
dftop <- data.frame(enetCoef$coefficients) %>%
    filter(enetCoef.coefficients != 0)

# Extract coefficients from model (stepwise linear regression)
#lmCoef <- coef(fitlm2$finalModel, id=5)
#dftop <- as.data.frame(lmCoef)

# Display coefficients
dftop %>%
    kbl(caption='Non-zero model coefficients') %>%
    kable_classic(full_width=F)
Non-zero model coefficients

Predictor                Coefficient
ManufacturingProcess09     0.1644254
ManufacturingProcess13    -0.0189290
ManufacturingProcess32     0.4286063

As shown, the three most important variables are all manufacturing process predictors. This works out favorably: these are elements that can be controlled or adjusted to affect yield, whereas the biological predictors can’t be changed.
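
As a cross-check, caret's varImp() gives a generic importance ranking; a sketch only, since for enet models caret typically falls back to a model-free (univariate filter) score, so the ordering may not match the lasso coefficients exactly.

# Generic caret importance ranking for the chosen model
plot(varImp(fitlasso), top=10)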

(f)

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

# Iterate over top predictors and plot each against Yield
for (p in rownames(dftop)) {
    if (p != '(Intercept)') {  # do not include the intercept
        plt <- dfchem2 %>%
            ggplot(aes(x=.data[[p]], y=Yield)) +  # .data pronoun selects column p
            geom_point() +
            xlab(p) +
            geom_smooth(method=lm, formula=y ~ x, linetype=2, color='darkred', se=F)
        print(plt)
    }
}

  • Based on the scatter plot for ManufacturingProcess09, increasing this process variable consistently produces better yields; each unit increase corresponds to roughly a 0.11-unit gain in yield (slopes are estimated in the sketch after this list).
  • Likewise, increasing ManufacturingProcess32 has a positive effect on yield (about 8%).
  • ManufacturingProcess13, on the other hand, has an inverse relationship with yield: roughly a 0.02-unit increase in yield for every unit decrease in ManufacturingProcess13.
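
The per-unit effects quoted above can be estimated with simple univariate fits; a sketch, noting that these slopes ignore the other predictors, so they should be treated as rough guides only.

# Univariate slope of Yield on each top predictor
for (p in c('ManufacturingProcess09', 'ManufacturingProcess13', 'ManufacturingProcess32')) {
    fit <- lm(reformulate(p, response='Yield'), data=dfchem2)
    print(paste0(p, ': slope = ', round(coef(fit)[[2]], 3)))
}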