LinearRegression KJ6.3

Kuhn and Johnson

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

LOAD DATA

Start R and use these commands to load the data:

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
Explore the relationships between each of the top predictors and the re- sponse. How could this information be helpful in improving yield in future runs of the manufacturing process?

#clear the workspace
rm(list = ls())

Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

data(package="AppliedPredictiveModeling")
data(ChemicalManufacturingProcess)
# summary(ChemicalManufacturingProcess)

The data set is in the library of applied predictive modeling. It contains 176 observations and 58 variables .The data set has the outcome variable as yield ,and 12 variables describing biological materials, and 45 variables describing manufacturing process.

DATA PREPROCESS

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in therese missing values.

I am using the mice package for the imputation, with the PMM method, and two rounds of imputations are done.

##      Yield       BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
##  Min.   :35.25   Min.   :4.580        Min.   :46.87        Min.   :56.97       
##  1st Qu.:38.75   1st Qu.:5.978        1st Qu.:52.68        1st Qu.:64.98       
##  Median :39.97   Median :6.305        Median :55.09        Median :67.22       
##  Mean   :40.18   Mean   :6.411        Mean   :55.69        Mean   :67.70       
##  3rd Qu.:41.48   3rd Qu.:6.870        3rd Qu.:58.74        3rd Qu.:70.43       
##  Max.   :46.34   Max.   :8.810        Max.   :64.75        Max.   :78.25       
##  BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
##  Min.   : 9.38        Min.   :13.24        Min.   :40.60       
##  1st Qu.:11.24        1st Qu.:17.23        1st Qu.:46.05       
##  Median :12.10        Median :18.49        Median :48.46       
##  Mean   :12.35        Mean   :18.60        Mean   :48.91       
##  3rd Qu.:13.22        3rd Qu.:19.90        3rd Qu.:51.34       
##  Max.   :23.09        Max.   :24.85        Max.   :59.38       
##  BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
##  Min.   :100.0        Min.   :15.88        Min.   :11.44       
##  1st Qu.:100.0        1st Qu.:17.06        1st Qu.:12.60       
##  Median :100.0        Median :17.51        Median :12.84       
##  Mean   :100.0        Mean   :17.49        Mean   :12.85       
##  3rd Qu.:100.0        3rd Qu.:17.88        3rd Qu.:13.13       
##  Max.   :100.8        Max.   :19.14        Max.   :14.08       
##  BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
##  Min.   :1.770        Min.   :135.8        Min.   :18.35       
##  1st Qu.:2.460        1st Qu.:143.8        1st Qu.:19.73       
##  Median :2.710        Median :146.1        Median :20.12       
##  Mean   :2.801        Mean   :147.0        Mean   :20.20       
##  3rd Qu.:2.990        3rd Qu.:149.6        3rd Qu.:20.75       
##  Max.   :6.870        Max.   :158.7        Max.   :22.21       
##  ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
##  Min.   : 0.00          Min.   : 0.00          Min.   :1.47          
##  1st Qu.:10.80          1st Qu.:19.23          1st Qu.:1.53          
##  Median :11.40          Median :21.00          Median :1.55          
##  Mean   :11.21          Mean   :16.64          Mean   :1.54          
##  3rd Qu.:12.12          3rd Qu.:21.50          3rd Qu.:1.55          
##  Max.   :14.10          Max.   :22.50          Max.   :1.60          
##  ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
##  Min.   :911.0          Min.   : 923.0         Min.   :203.0         
##  1st Qu.:927.8          1st Qu.: 986.8         1st Qu.:205.7         
##  Median :934.0          Median : 999.0         Median :206.8         
##  Mean   :931.8          Mean   :1001.6         Mean   :207.4         
##  3rd Qu.:936.0          3rd Qu.:1008.7         3rd Qu.:208.7         
##  Max.   :946.0          Max.   :1175.3         Max.   :227.4         
##  ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
##  Min.   :177.0          Min.   :177.0          Min.   :38.89         
##  1st Qu.:177.0          1st Qu.:177.0          1st Qu.:44.89         
##  Median :177.0          Median :178.0          Median :45.73         
##  Mean   :177.5          Mean   :177.6          Mean   :45.66         
##  3rd Qu.:178.0          3rd Qu.:178.0          3rd Qu.:46.52         
##  Max.   :178.0          Max.   :178.0          Max.   :49.36         
##  ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
##  Min.   : 7.500         Min.   : 7.500         Min.   :   0.0        
##  1st Qu.: 8.700         1st Qu.: 9.000         1st Qu.:   0.0        
##  Median : 9.000         Median : 9.400         Median :   0.0        
##  Mean   : 9.164         Mean   : 9.407         Mean   : 852.9        
##  3rd Qu.: 9.525         3rd Qu.: 9.900         3rd Qu.:   0.0        
##  Max.   :11.600         Max.   :11.500         Max.   :4549.0        
##  ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
##  Min.   :32.10          Min.   :4701           Min.   :5904          
##  1st Qu.:33.90          1st Qu.:4827           1st Qu.:6010          
##  Median :34.60          Median :4856           Median :6032          
##  Mean   :34.51          Mean   :4853           Mean   :6039          
##  3rd Qu.:35.20          3rd Qu.:4882           3rd Qu.:6061          
##  Max.   :38.60          Max.   :5055           Max.   :6233          
##  ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
##  Min.   :   0           Min.   :31.30          Min.   :   0          
##  1st Qu.:4561           1st Qu.:33.50          1st Qu.:4813          
##  Median :4588           Median :34.40          Median :4835          
##  Mean   :4566           Mean   :34.34          Mean   :4810          
##  3rd Qu.:4619           3rd Qu.:35.10          3rd Qu.:4862          
##  Max.   :4852           Max.   :40.00          Max.   :4971          
##  ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
##  Min.   :5890           Min.   :   0           Min.   :-1.8000       
##  1st Qu.:6001           1st Qu.:4553           1st Qu.:-0.6000       
##  Median :6022           Median :4582           Median :-0.3000       
##  Mean   :6028           Mean   :4556           Mean   :-0.1642       
##  3rd Qu.:6050           3rd Qu.:4610           3rd Qu.: 0.0000       
##  Max.   :6146           Max.   :4759           Max.   : 3.6000       
##  ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
##  Min.   : 0.000         Min.   :0.000          Min.   : 0.000        
##  1st Qu.: 3.000         1st Qu.:2.000          1st Qu.: 4.000        
##  Median : 5.000         Median :3.000          Median : 8.000        
##  Mean   : 5.386         Mean   :3.017          Mean   : 8.898        
##  3rd Qu.: 8.000         3rd Qu.:4.000          3rd Qu.:14.000        
##  Max.   :12.000         Max.   :6.000          Max.   :23.000        
##  ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
##  Min.   :   0           Min.   :   0           Min.   :   0          
##  1st Qu.:4828           1st Qu.:6019           1st Qu.:4563          
##  Median :4854           Median :6046           Median :4586          
##  Mean   :4827           Mean   :6014           Mean   :4564          
##  3rd Qu.:4876           3rd Qu.:6069           3rd Qu.:4610          
##  Max.   :4990           Max.   :6161           Max.   :4710          
##  ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
##  Min.   : 0.000         Min.   : 0.00          Min.   : 0.00         
##  1st Qu.: 0.000         1st Qu.:19.70          1st Qu.: 8.80         
##  Median :10.400         Median :20.00          Median : 9.15         
##  Mean   : 6.405         Mean   :20.06          Mean   : 9.17         
##  3rd Qu.:10.700         3rd Qu.:20.50          3rd Qu.: 9.70         
##  Max.   :11.500         Max.   :22.00          Max.   :11.20         
##  ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
##  Min.   : 0.00          Min.   :143.0          Min.   :56.00         
##  1st Qu.:69.90          1st Qu.:155.0          1st Qu.:62.00         
##  Median :70.75          Median :158.0          Median :64.00         
##  Mean   :70.04          Mean   :158.5          Mean   :63.61         
##  3rd Qu.:71.40          3rd Qu.:162.0          3rd Qu.:65.00         
##  Max.   :72.50          Max.   :173.0          Max.   :70.00         
##  ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
##  Min.   :2.300          Min.   :463.0          Min.   :0.01700       
##  1st Qu.:2.500          1st Qu.:490.0          1st Qu.:0.01900       
##  Median :2.500          Median :496.0          Median :0.01900       
##  Mean   :2.491          Mean   :495.9          Mean   :0.01952       
##  3rd Qu.:2.500          3rd Qu.:502.0          3rd Qu.:0.02000       
##  Max.   :2.600          Max.   :522.0          Max.   :0.02200       
##  ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
##  Min.   :0.000          Min.   :0.000          Min.   :0.000         
##  1st Qu.:0.700          1st Qu.:2.000          1st Qu.:7.100         
##  Median :1.000          Median :3.000          Median :7.200         
##  Mean   :1.014          Mean   :2.534          Mean   :6.851         
##  3rd Qu.:1.300          3rd Qu.:3.000          3rd Qu.:7.300         
##  Max.   :2.300          Max.   :3.000          Max.   :7.500         
##  ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
##  Min.   :0.00000        Min.   :0.00000        Min.   : 0.00         
##  1st Qu.:0.00000        1st Qu.:0.00000        1st Qu.:11.40         
##  Median :0.00000        Median :0.00000        Median :11.60         
##  Mean   :0.01818        Mean   :0.02415        Mean   :11.21         
##  3rd Qu.:0.00000        3rd Qu.:0.00000        3rd Qu.:11.70         
##  Max.   :0.10000        Max.   :0.20000        Max.   :12.10         
##  ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
##  Min.   : 0.0000        Min.   :0.000          Min.   :0.000         
##  1st Qu.: 0.6000        1st Qu.:1.800          1st Qu.:2.100         
##  Median : 0.8000        Median :1.900          Median :2.200         
##  Mean   : 0.9119        Mean   :1.805          Mean   :2.138         
##  3rd Qu.: 1.0250        3rd Qu.:1.900          3rd Qu.:2.300         
##  Max.   :11.0000        Max.   :2.100          Max.   :2.600

Now the data is completed and ready to be analyzed.

library(e1071)
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

TRAINING TESTING SPLIT

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

train_chem <- C_M_P2[1:130,] 
test_chem <- C_M_P2 [131:176,]

The data is split into training dataset of the 1st 130 observations, and testing dataset with the rest of 46 observations. This is 80 /20 split .They both contain the original 58 variables. But without missing data.

## [1] 19

## [1] 39

The caret package has a function to find correlations among variables. I used that, with the cutoff points of 0.7 as the threshold, which results in 19 variables that are highly correlated. After filtering out these 19 variables, there are remaining 39 explainable variables to be used in analysis.

library(corrplot)

## corrplot 0.84 loaded

correlation_chem <- cor(filtered_train_chem, method = "pearson")
corrplot(correlation_chem, type="upper", method="color")

This correlation plot is a visualization of the remaining 39 variables, which is already taken out the highly correlated ones.

DATA PREDICTION

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

library(pls)

set.seed(100)
ctrl <- trainControl(method = "cv", number = 10)

plsTune2 <- train(filtered_train_chem[,2:39],filtered_train_chem[,1],
                 method="pls",
                 tuneLength=20,
                 trControl=ctrl,
                 preProc=c("center","scale"))
plsTune2$bestTune

##    ncomp
## 19    19

plsTune2$results

##    ncomp     RMSE  Rsquared       MAE    RMSESD RsquaredSD     MAESD
## 1      1 2.073545 0.7570869 1.6158044 0.8764292 0.14509107 0.5065408
## 2      2 1.915597 0.7915543 1.3506882 0.8390431 0.17535559 0.3746553
## 3      3 1.525990 0.8696215 1.1587579 0.4349058 0.06143772 0.2579549
## 4      4 1.897189 0.8187616 1.2378116 1.1832356 0.18303063 0.5389155
## 5      5 1.963573 0.8128172 1.2331389 1.6160088 0.23467799 0.6560934
## 6      6 2.146138 0.7952321 1.2536089 1.9703495 0.26040360 0.7387702
## 7      7 2.390184 0.7857642 1.3286480 2.4596378 0.27353801 0.8801379
## 8      8 2.326514 0.7874180 1.2523691 2.3026289 0.26839864 0.7798256
## 9      9 2.365424 0.7912818 1.2339710 2.4896982 0.27286811 0.8200504
## 10    10 2.488107 0.7855064 1.2607733 2.7546992 0.28258928 0.8794479
## 11    11 2.357707 0.7954019 1.1954375 2.5781113 0.26711059 0.7918826
## 12    12 2.237697 0.8054063 1.1397488 2.4442107 0.25380639 0.7450187
## 13    13 2.144837 0.8148693 1.0941577 2.3682589 0.24474339 0.7056096
## 14    14 1.930114 0.8356747 1.0230721 2.0719448 0.21511426 0.6229877
## 15    15 1.654861 0.8690194 0.9347058 1.6431573 0.15297974 0.4970126
## 16    16 1.497130 0.8910031 0.8895384 1.3525178 0.10659545 0.3909605
## 17    17 1.227332 0.9170523 0.8061245 0.6593939 0.05145399 0.1997950
## 18    18 1.122657 0.9229582 0.7798127 0.3576277 0.04987600 0.1549173
## 19    19 1.067462 0.9207771 0.7512899 0.4563599 0.07186039 0.2082580
## 20    20 1.242128 0.8905825 0.8054621 0.6001110 0.10614256 0.2311430

summary(plsTune2)

## Data:    X dimension: 130 38 
##  Y dimension: 130 1
## Fit method: oscorespls
## Number of components considered: 19
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X           15.71    20.71    24.92    29.84    35.35    38.76    42.11
## .outcome    80.35    88.61    91.55    92.73    93.47    94.57    95.28
##           8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X           45.45    48.98     52.19     55.39     58.72     60.92     63.62
## .outcome    95.74    96.07     96.30     96.47     96.58     96.65     96.69
##           15 comps  16 comps  17 comps  18 comps  19 comps
## X            66.76     69.86     72.80     76.07     78.69
## .outcome     96.71     96.73     96.75     96.77     96.78

plot(plsTune2)

The detailed Performance evaluation in terms RMSE, R ^2, MAE etc are listed as results . The plot of num of components to RMSE shows that the RMSE is hightest at 10 components, drop down gradually as components are added. But interesting , with three components, the first 3, RMSE is also relatively low. The optimum number of components is 10.

CMP_Prediction2 <- predict(plsTune2, newdata=test_chem[,2:58], ncomp=1:10)


cor(as.numeric(CMP_Prediction2), test_chem[,1]) ^ 2

## [1] 0.1257143

Next we used this model with the 10 components to predict the test data. And the correlation of outcome yield on the test data is 0.125.

PREDICTOR EVALUATION

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

library(caret)
varImp(plsTune2)

## pls variable importance
## 
##   only 20 most important variables shown (out of 38)
## 
##                        Overall
## BiologicalMaterial06    100.00
## BiologicalMaterial11     71.11
## BiologicalMaterial12     67.92
## BiologicalMaterial04     67.89
## BiologicalMaterial08     67.15
## ManufacturingProcess02   57.44
## ManufacturingProcess33   46.93
## ManufacturingProcess36   39.02
## ManufacturingProcess15   38.49
## ManufacturingProcess05   33.30
## ManufacturingProcess04   31.81
## ManufacturingProcess19   30.23
## ManufacturingProcess42   29.25
## ManufacturingProcess09   28.81
## ManufacturingProcess06   28.40
## BiologicalMaterial09     23.99
## ManufacturingProcess11   23.95
## ManufacturingProcess23   17.67
## ManufacturingProcess01   16.24
## ManufacturingProcess34   14.58

The caret package has the function varImp, which is what we used to determine the most important predictors in our training model. The first 5 most important predictors are biological ones, which we understand cannot be modified. But the good news is the next nine variables ranked belong to manufacturing process, which can be modified. And remaining 20 variables are split between manufacturing process and the biological ones.

Imp <- varImp(plsTune2) 
plot(Imp, top =20)

The top 20 most important variables are plotted out as well below.

TOP PREDICTORS DECIPHER

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

The correlation among these variables our shown in this correlation plot.

par(mfrow = c(2, 3))
## why I do not have the line in red color
plot(C_M_P2$Yield, C_M_P2$ManufacturingProcess02)
abline(lm(C_M_P2$Yield~C_M_P2$ManufacturingProcess02), col="red")

plot(C_M_P2$Yield, C_M_P2$ManufacturingProcess33)
abline(lm(C_M_P2$Yield~C_M_P2$ManufacturingProcess33), col="red")

plot(C_M_P2$Yield, C_M_P2$ManufacturingProcess36)
abline(lm(C_M_P2$Yield~C_M_P2$ManufacturingProcess36), col="red")

plot(C_M_P2$Yield, C_M_P2$ManufacturingProcess15)
abline(lm(C_M_P2$Yield~C_M_P2$ManufacturingProcess15), col="red")

plot(C_M_P2$Yield, C_M_P2$ManufacturingProcess05)
abline(lm(C_M_P2$Yield~C_M_P2$ManufacturingProcess05), col="red")

plot(C_M_P2$Yield, C_M_P2$ManufacturingProcess04)
abline(lm(C_M_P2$Yield~C_M_P2$ManufacturingProcess04), col="red")

I have selected the top six most important predictors to their relationship of the outcome yield. As we can see some are positively correlated with outcome, while others are negatively correlated with the outcome. I intended to plot out a linear line in red color for the outcome, but for some unknown reason the linear model line failed to show up do too some bucks in the program which I will fix later.