A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response, product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. Manufacturing process predictors, on the other hand, can be adjusted during production. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.
Start R and use these commands to load the data:
> library(AppliedPredictiveModeling)
> data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
# Load data
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.2.3
data(ChemicalManufacturingProcess)
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in those missing values (e.g., see Sect. 3.8).
We'll start with a summary to get a first look at the data and reveal the missing values, then count the missing values per variable.
# Take a first look
summary(ChemicalManufacturingProcess)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## Min. :35.25 Min. :4.580 Min. :46.87 Min. :56.97
## 1st Qu.:38.75 1st Qu.:5.978 1st Qu.:52.68 1st Qu.:64.98
## Median :39.97 Median :6.305 Median :55.09 Median :67.22
## Mean :40.18 Mean :6.411 Mean :55.69 Mean :67.70
## 3rd Qu.:41.48 3rd Qu.:6.870 3rd Qu.:58.74 3rd Qu.:70.43
## Max. :46.34 Max. :8.810 Max. :64.75 Max. :78.25
##
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## Min. : 9.38 Min. :13.24 Min. :40.60
## 1st Qu.:11.24 1st Qu.:17.23 1st Qu.:46.05
## Median :12.10 Median :18.49 Median :48.46
## Mean :12.35 Mean :18.60 Mean :48.91
## 3rd Qu.:13.22 3rd Qu.:19.90 3rd Qu.:51.34
## Max. :23.09 Max. :24.85 Max. :59.38
##
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## Min. :100.0 Min. :15.88 Min. :11.44
## 1st Qu.:100.0 1st Qu.:17.06 1st Qu.:12.60
## Median :100.0 Median :17.51 Median :12.84
## Mean :100.0 Mean :17.49 Mean :12.85
## 3rd Qu.:100.0 3rd Qu.:17.88 3rd Qu.:13.13
## Max. :100.8 Max. :19.14 Max. :14.08
##
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## Min. :1.770 Min. :135.8 Min. :18.35
## 1st Qu.:2.460 1st Qu.:143.8 1st Qu.:19.73
## Median :2.710 Median :146.1 Median :20.12
## Mean :2.801 Mean :147.0 Mean :20.20
## 3rd Qu.:2.990 3rd Qu.:149.6 3rd Qu.:20.75
## Max. :6.870 Max. :158.7 Max. :22.21
##
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## Min. : 0.00 Min. : 0.00 Min. :1.47
## 1st Qu.:10.80 1st Qu.:19.30 1st Qu.:1.53
## Median :11.40 Median :21.00 Median :1.54
## Mean :11.21 Mean :16.68 Mean :1.54
## 3rd Qu.:12.15 3rd Qu.:21.50 3rd Qu.:1.55
## Max. :14.10 Max. :22.50 Max. :1.60
## NA's :1 NA's :3 NA's :15
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## Min. :911.0 Min. : 923.0 Min. :203.0
## 1st Qu.:928.0 1st Qu.: 986.8 1st Qu.:205.7
## Median :934.0 Median : 999.2 Median :206.8
## Mean :931.9 Mean :1001.7 Mean :207.4
## 3rd Qu.:936.0 3rd Qu.:1008.9 3rd Qu.:208.7
## Max. :946.0 Max. :1175.3 Max. :227.4
## NA's :1 NA's :1 NA's :2
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## Min. :177.0 Min. :177.0 Min. :38.89
## 1st Qu.:177.0 1st Qu.:177.0 1st Qu.:44.89
## Median :177.0 Median :178.0 Median :45.73
## Mean :177.5 Mean :177.6 Mean :45.66
## 3rd Qu.:178.0 3rd Qu.:178.0 3rd Qu.:46.52
## Max. :178.0 Max. :178.0 Max. :49.36
## NA's :1 NA's :1
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## Min. : 7.500 Min. : 7.500 Min. : 0.0
## 1st Qu.: 8.700 1st Qu.: 9.000 1st Qu.: 0.0
## Median : 9.100 Median : 9.400 Median : 0.0
## Mean : 9.179 Mean : 9.386 Mean : 857.8
## 3rd Qu.: 9.550 3rd Qu.: 9.900 3rd Qu.: 0.0
## Max. :11.600 Max. :11.500 Max. :4549.0
## NA's :9 NA's :10 NA's :1
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## Min. :32.10 Min. :4701 Min. :5904
## 1st Qu.:33.90 1st Qu.:4828 1st Qu.:6010
## Median :34.60 Median :4856 Median :6032
## Mean :34.51 Mean :4854 Mean :6039
## 3rd Qu.:35.20 3rd Qu.:4882 3rd Qu.:6061
## Max. :38.60 Max. :5055 Max. :6233
## NA's :1
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## Min. : 0 Min. :31.30 Min. : 0
## 1st Qu.:4561 1st Qu.:33.50 1st Qu.:4813
## Median :4588 Median :34.40 Median :4835
## Mean :4566 Mean :34.34 Mean :4810
## 3rd Qu.:4619 3rd Qu.:35.10 3rd Qu.:4862
## Max. :4852 Max. :40.00 Max. :4971
##
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## Min. :5890 Min. : 0 Min. :-1.8000
## 1st Qu.:6001 1st Qu.:4553 1st Qu.:-0.6000
## Median :6022 Median :4582 Median :-0.3000
## Mean :6028 Mean :4556 Mean :-0.1642
## 3rd Qu.:6050 3rd Qu.:4610 3rd Qu.: 0.0000
## Max. :6146 Max. :4759 Max. : 3.6000
##
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## Min. : 0.000 Min. :0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.:2.000 1st Qu.: 4.000
## Median : 5.000 Median :3.000 Median : 8.000
## Mean : 5.406 Mean :3.017 Mean : 8.834
## 3rd Qu.: 8.000 3rd Qu.:4.000 3rd Qu.:14.000
## Max. :12.000 Max. :6.000 Max. :23.000
## NA's :1 NA's :1 NA's :1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.:4832 1st Qu.:6020 1st Qu.:4560
## Median :4855 Median :6047 Median :4587
## Mean :4828 Mean :6016 Mean :4563
## 3rd Qu.:4877 3rd Qu.:6070 3rd Qu.:4609
## Max. :4990 Max. :6161 Max. :4710
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.:19.70 1st Qu.: 8.800
## Median :10.400 Median :19.90 Median : 9.100
## Mean : 6.592 Mean :20.01 Mean : 9.161
## 3rd Qu.:10.750 3rd Qu.:20.40 3rd Qu.: 9.700
## Max. :11.500 Max. :22.00 Max. :11.200
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## Min. : 0.00 Min. :143.0 Min. :56.00
## 1st Qu.:70.10 1st Qu.:155.0 1st Qu.:62.00
## Median :70.80 Median :158.0 Median :64.00
## Mean :70.18 Mean :158.5 Mean :63.54
## 3rd Qu.:71.40 3rd Qu.:162.0 3rd Qu.:65.00
## Max. :72.50 Max. :173.0 Max. :70.00
## NA's :5 NA's :5
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## Min. :2.300 Min. :463.0 Min. :0.01700
## 1st Qu.:2.500 1st Qu.:490.0 1st Qu.:0.01900
## Median :2.500 Median :495.0 Median :0.02000
## Mean :2.494 Mean :495.6 Mean :0.01957
## 3rd Qu.:2.500 3rd Qu.:501.5 3rd Qu.:0.02000
## Max. :2.600 Max. :522.0 Max. :0.02200
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.700 1st Qu.:2.000 1st Qu.:7.100
## Median :1.000 Median :3.000 Median :7.200
## Mean :1.014 Mean :2.534 Mean :6.851
## 3rd Qu.:1.300 3rd Qu.:3.000 3rd Qu.:7.300
## Max. :2.300 Max. :3.000 Max. :7.500
##
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## Min. :0.00000 Min. :0.00000 Min. : 0.00
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:11.40
## Median :0.00000 Median :0.00000 Median :11.60
## Mean :0.01771 Mean :0.02371 Mean :11.21
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:11.70
## Max. :0.10000 Max. :0.20000 Max. :12.10
## NA's :1 NA's :1
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## Min. : 0.0000 Min. :0.000 Min. :0.000
## 1st Qu.: 0.6000 1st Qu.:1.800 1st Qu.:2.100
## Median : 0.8000 Median :1.900 Median :2.200
## Mean : 0.9119 Mean :1.805 Mean :2.138
## 3rd Qu.: 1.0250 3rd Qu.:1.900 3rd Qu.:2.300
## Max. :11.0000 Max. :2.100 Max. :2.600
##
# Show missing value counts
dfchem <- ChemicalManufacturingProcess # To avoid typing this many letters
data.frame(Missing=colSums(is.na(dfchem))) %>%
filter(Missing > 0) %>%
kbl(caption='Missing value counts') %>%
kable_classic(full_width=F)
| Predictor | Missing |
|---|---|
| ManufacturingProcess01 | 1 |
| ManufacturingProcess02 | 3 |
| ManufacturingProcess03 | 15 |
| ManufacturingProcess04 | 1 |
| ManufacturingProcess05 | 1 |
| ManufacturingProcess06 | 2 |
| ManufacturingProcess07 | 1 |
| ManufacturingProcess08 | 1 |
| ManufacturingProcess10 | 9 |
| ManufacturingProcess11 | 10 |
| ManufacturingProcess12 | 1 |
| ManufacturingProcess14 | 1 |
| ManufacturingProcess22 | 1 |
| ManufacturingProcess23 | 1 |
| ManufacturingProcess24 | 1 |
| ManufacturingProcess25 | 5 |
| ManufacturingProcess26 | 5 |
| ManufacturingProcess27 | 5 |
| ManufacturingProcess28 | 5 |
| ManufacturingProcess29 | 5 |
| ManufacturingProcess30 | 5 |
| ManufacturingProcess31 | 5 |
| ManufacturingProcess33 | 5 |
| ManufacturingProcess34 | 5 |
| ManufacturingProcess35 | 5 |
| ManufacturingProcess36 | 5 |
| ManufacturingProcess40 | 1 |
| ManufacturingProcess41 | 1 |
# Impute missing values with k-nearest neighbours; knnImputation() returns the
# completed data frame directly, so no separate complete() step (a mice function) is needed
#imp <- mice(dfchem, printFlag=F)
dfchem2 <- knnImputation(dfchem, k=3)
# Verify no more missing values exist
print(paste0('Missing values after imputation: ', sum(is.na(dfchem2))))
## [1] "Missing values after imputation: 0"
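As an aside, caret offers an equivalent route (shown only as a sketch, not used for the rest of the analysis): preProcess() with method='knnImpute'. Note that this method centers and scales the data as part of the nearest-neighbour distance calculation, so the imputed values come back on the transformed scale rather than the original units.
# Alternative imputation sketch using caret's preProcess(); the result is
# centered/scaled, unlike the knnImputation() output above
pp_imp <- preProcess(dfchem, method='knnImpute', k=3)
dfchem2_alt <- predict(pp_imp, dfchem)
print(paste0('Missing values after caret imputation: ', sum(is.na(dfchem2_alt))))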
# Look for skewness
# As a general rule of thumb: If skewness is less than -1 or greater than 1, the distribution is highly skewed.
# If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
# If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.
print("Skewness of columns:")## [1] "Skewness of columns:"
colskewness <- apply(dfchem2, 2, e1071::skewness, type=1)
colskewness## Yield BiologicalMaterial01 BiologicalMaterial02
## 0.313628736 0.275662556 0.246222354
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## 0.028755477 1.747184905 0.306614797
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## 0.371697783 7.462171782 0.221942745
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## -0.270721711 2.422999528 0.361901086
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## 0.306452433 -3.967337791 -1.469933180
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## -0.589468598 -0.709609748 2.619804979
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## 3.080054936 0.087004003 -0.209672712
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## -0.948742838 0.644366087 -0.014138720
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## 1.601281538 0.484400166 -0.002762522
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## 0.680136157 -12.526835771 1.172954087
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## -12.845460479 0.299893667 -12.746809903
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## 1.743956159 0.311971249 0.187396582
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## 0.355136773 -12.927210865 -12.966657435
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## -12.812938765 -0.397520290 -10.298318542
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## -4.861799837 -12.074918634 0.213038336
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## -0.103459835 -0.234979423 -0.125325542
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## 0.180092333 0.381605455 -1.696241287
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## -4.305766123 1.700356568 2.197063408
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## -5.496789260 9.132598658 -5.013019018
## ManufacturingProcess45
## -4.112944721
# Show skewed columns
print("Skewed columns (skewness < -1 or > 1):")## [1] "Skewed columns (skewness < -1 or > 1):"
skewedcols <- dfchem2[colskewness < -1 | colskewness > 1]
colnames(skewedcols)## [1] "BiologicalMaterial04" "BiologicalMaterial07" "BiologicalMaterial10"
## [4] "ManufacturingProcess01" "ManufacturingProcess02" "ManufacturingProcess05"
## [7] "ManufacturingProcess06" "ManufacturingProcess12" "ManufacturingProcess16"
## [10] "ManufacturingProcess17" "ManufacturingProcess18" "ManufacturingProcess20"
## [13] "ManufacturingProcess21" "ManufacturingProcess25" "ManufacturingProcess26"
## [16] "ManufacturingProcess27" "ManufacturingProcess29" "ManufacturingProcess30"
## [19] "ManufacturingProcess31" "ManufacturingProcess38" "ManufacturingProcess39"
## [22] "ManufacturingProcess40" "ManufacturingProcess41" "ManufacturingProcess42"
## [25] "ManufacturingProcess43" "ManufacturingProcess44" "ManufacturingProcess45"
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
First, we'll choose a random seed for reproducible results. Then we'll split the data using createDataPartition(), which attempts to split the data so that the training and test sets have approximately proportionate distributions of the outcome variable.
# Set seed
set.seed(777)
# Split into train/test; createDataPartition generates indices of the training set
train_indices <- createDataPartition(dfchem2$Yield, p=0.80, times=1, list=F)
dftrain <- dfchem2[train_indices,]
dftest <- dfchem2[-train_indices,]
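# Sanity check on the stratified split (a sketch; output not shown): because
# createDataPartition() samples within quantile groups of the outcome, the Yield
# quantiles of the two sets should be similar
quantile(dftrain$Yield, probs=c(0.1, 0.25, 0.5, 0.75, 0.9))
quantile(dftest$Yield, probs=c(0.1, 0.25, 0.5, 0.75, 0.9))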
# Examine train and test sets
hist(dftrain$Yield, main='Training set yield histogram')
hist(dftest$Yield, main='Test set yield histogram')
Now we'll examine the predictors for correlations and look at their relationship to the outcome variable.
# Corr chart - not useful here because of the number of predictors
# chart.Correlation(dftrain)
# Corr plot
corr1 <- cor(dftrain)
corrplot(corr1, order='hclust', type='upper', diag=F)
# Show candidates for removal
print("Candidates for removal due to high correlation:")
## [1] "Candidates for removal due to high correlation:"
high_corr <- findCorrelation(corr1, cutoff=0.9, exact=T, verbose=F, names=F)
colnames(dftrain[,high_corr])## [1] "BiologicalMaterial02" "BiologicalMaterial04" "BiologicalMaterial12"
## [4] "ManufacturingProcess29" "ManufacturingProcess42" "ManufacturingProcess27"
## [7] "ManufacturingProcess25" "ManufacturingProcess31" "ManufacturingProcess18"
## [10] "ManufacturingProcess40"
# Remove the highly correlated variables
dftrain2 <- dftrain[,-high_corr]
dftest2 <- dftest[,-high_corr]
Look at near-zero variance features.
# Remove NZV features
tmp_nzv <- nearZeroVar(dftrain2)
print(paste0('Near-zero variance features: ', colnames(dftrain2)[tmp_nzv]))
## [1] "Near-zero variance features: BiologicalMaterial07"
dftrain2 <- dftrain2[,-tmp_nzv]
dftest2 <- dftest2[,-tmp_nzv]
Now we'll try modeling. First, prepare a data frame for the results, separate the outcome and predictor variables, and set cross-validation parameters.
# Results data frame
dfr <- data.frame(matrix(nrow=10, ncol=3))
colnames(dfr) <- c('Model', 'Tuning.Parameters', 'RMSE')
# Separate outcome and predictors
trainx <- dftrain2 %>% dplyr::select(-Yield)
trainy <- dftrain2$Yield
testx <- dftest2 %>% dplyr::select(-Yield)
testy <- dftest2$Yield
# Specify 10-fold cross-validation
ctrl <- trainControl(method='cv', number=10)
Test preprocessing
# Testing out pre-processing
print('-------------------------------------------------------')
## [1] "-------------------------------------------------------"
print('No pre-processing')
## [1] "No pre-processing"
set.seed(777)
fitlm1 <- train(x=trainx, y=trainy, method='lm', trControl=ctrl)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## (identical warning repeated for each of the 10 cross-validation folds; duplicates omitted)
predict(fitlm1, newdata=as.matrix(testx))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## 4 5 18 23 27 29 32 37
## 41.45387 42.63260 39.68963 72.65342 35.91451 37.51592 41.83047 43.24816
## 49 50 58 60 65 72 77 80
## 43.58761 41.80537 39.53683 39.98923 42.38107 40.76567 39.95035 38.58628
## 88 95 96 100 105 110 111 115
## 41.19641 40.23947 39.72705 38.16562 38.25934 38.93946 40.02140 40.38711
## 117 126 140 154 163 169 170 174
## 41.30805 40.68793 38.67291 39.14236 40.41033 37.47626 38.74232 42.23967
print('-------------------------------------------------------')
## [1] "-------------------------------------------------------"
print('Pre-process within train() function: Center and scaling only')
## [1] "Pre-process within train() function: Center and scaling only"
set.seed(777)
fitlm1 <- train(x=trainx, y=trainy, method='lm', trControl=ctrl, preProcess=c('center', 'scale'))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## (identical warning repeated for each of the 10 cross-validation folds; duplicates omitted)
predict(fitlm1, newdata=as.matrix(testx))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## 4 5 18 23 27 29 32 37
## 41.45387 42.63260 39.68963 72.65342 35.91451 37.51592 41.83047 43.24816
## 49 50 58 60 65 72 77 80
## 43.58761 41.80537 39.53683 39.98923 42.38107 40.76567 39.95035 38.58628
## 88 95 96 100 105 110 111 115
## 41.19641 40.23947 39.72705 38.16562 38.25934 38.93946 40.02140 40.38711
## 117 126 140 154 163 169 170 174
## 41.30805 40.68793 38.67291 39.14236 40.41033 37.47626 38.74232 42.23967
print('-------------------------------------------------------')
## [1] "-------------------------------------------------------"
print('Pre-process within train() function: Center, scaling, YeoJohnson')
## [1] "Pre-process within train() function: Center, scaling, YeoJohnson"
set.seed(777)
fitlm1 <- train(x=trainx, y=trainy, method='lm', trControl=ctrl, preProcess=c('center', 'scale', 'YeoJohnson'))
predict(fitlm1, newdata=as.matrix(testx))
## 4 5 18 23 27 29
## 4.094113e+01 4.221883e+01 3.992967e+01 1.201549e+12 3.541196e+01 3.721764e+01
## 32 37 49 50 58 60
## 4.194176e+01 4.291529e+01 4.385981e+01 4.144476e+01 3.783137e+01 4.008025e+01
## 65 72 77 80 88 95
## 4.238697e+01 4.068365e+01 3.991285e+01 3.832768e+01 4.119786e+01 3.976468e+01
## 96 100 105 110 111 115
## 3.924898e+01 3.804538e+01 3.823628e+01 3.923236e+01 4.036698e+01 4.024042e+01
## 117 126 140 154 163 169
## 4.131700e+01 4.073699e+01 3.902077e+01 3.942066e+01 4.042338e+01 3.704005e+01
## 170 174
## 3.846184e+01 4.167869e+01
# Preprocess training x data
transobj_trainx <- preProcess(trainx, method=c('center', 'scale', 'YeoJohnson'))
trainx_trans <- predict(transobj_trainx, trainx)
# Preprocess test x data (note: strictly, the transformation estimated on the training
# set, transobj_trainx, should be applied to the test set; a separate preProcess() on
# the test data is used here only for this exploratory comparison)
transobj_testx <- preProcess(testx, method=c('center', 'scale', 'YeoJohnson'))
testx_trans <- predict(transobj_testx, testx)
# Fit model using transformed x data
print('-------------------------------------------------------')
## [1] "-------------------------------------------------------"
print('Pre-process outside of train() function: Center, scaling, YeoJohnson on x data only')
## [1] "Pre-process outside of train() function: Center, scaling, YeoJohnson on x data only"
set.seed(777)
fitlm1 <- train(x=trainx_trans, y=trainy, method='lm', trControl=ctrl)
predict(fitlm1, newdata=as.matrix(testx_trans))
## 4 5 18 23 27 29 32 37
## 42.09880 44.26170 40.49845 44.41176 36.99165 38.76723 41.38606 42.18600
## 49 50 58 60 65 72 77 80
## 43.61614 40.98222 37.32326 39.05236 41.34297 40.84644 39.41932 37.55382
## 88 95 96 100 105 110 111 115
## 41.24311 39.70256 39.30794 38.65147 38.01144 38.81168 40.49879 39.12775
## 117 126 140 154 163 169 170 174
## 40.86134 40.92234 38.94973 39.17437 40.01704 37.99964 39.73224 41.28282
# Preprocess training y data
transobj_trainy <- preProcess(as.data.frame(trainy), method=c('center', 'scale', 'YeoJohnson'))
trainy_trans <- predict(transobj_trainy, as.data.frame(trainy))
# Fit model using transformed x and y data
print('-------------------------------------------------------')
## [1] "-------------------------------------------------------"
print('Pre-process outside of train() function: Center, scaling, YeoJohnson on both x and y data')
## [1] "Pre-process outside of train() function: Center, scaling, YeoJohnson on both x and y data"
set.seed(777)
fitlm1 <- train(x=trainx_trans, y=trainy_trans$trainy, method='lm', trControl=ctrl)
testy_trans <- predict(fitlm1, newdata=as.matrix(testx_trans))
transobj_testy <- preProcess(as.data.frame(testy_trans), method=c('center', 'scale', 'YeoJohnson'))
predict(transobj_testy, as.data.frame(testy_trans))
## testy_trans
## 4 1.004088890
## 5 1.920870344
## 18 0.310978196
## 23 1.854997594
## 27 -2.193780517
## 29 -0.891372789
## 32 0.709107066
## 37 1.138481858
## 49 1.722119936
## 50 0.502469197
## 58 -1.533989854
## 60 -0.492274137
## 65 0.708979938
## 72 0.422600779
## 77 -0.329975252
## 80 -1.395272642
## 88 0.658728298
## 95 -0.199432730
## 96 -0.411736013
## 100 -0.881490837
## 105 -1.187210700
## 110 -0.658494961
## 111 0.337418788
## 115 -0.365231063
## 117 0.542788610
## 126 0.502745721
## 140 -0.652864036
## 154 -0.402308668
## 163 -0.007460397
## 169 -1.229123782
## 170 -0.265956083
## 174 0.761599245
Linear regression
Since linear regression is sensitive to skewness, we'll transform the predictors. Because several predictors contain zero or negative values, which Box-Cox cannot handle, we'll use the Yeo-Johnson transformation instead.
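To see why Box-Cox isn't applicable here (a quick sketch; the exact count depends on the split), we can count how many predictors contain zero or negative values, which Box-Cox cannot transform but Yeo-Johnson can:
# Predictors with zero or negative values; these rule out a Box-Cox transformation
nonpos <- sapply(trainx, function(x) any(x <= 0))
print(paste0('Predictors with zero or negative values: ', sum(nonpos)))
head(names(trainx)[nonpos])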
# Linear model
set.seed(777)
fitlm1 <- train(x=trainx, y=trainy, method='lm', trControl=ctrl, preProcess=c('center', 'scale', 'YeoJohnson'))
fitlm1
## Linear Regression
##
## 144 samples
## 46 predictor
##
## Pre-processing: centered (46), scaled (46), Yeo-Johnson transformation (23)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 128, 131, 129, 130, 130, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 117739644465 0.4828238 29434911117
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
dfr[1,] = data.frame(Model='LM', Tuning.Parameters='', RMSE=fitlm1$results[['RMSE']])
# Stepwise linear model via caret's 'leapSeq' method (leaps package)
stepGrid <- data.frame(.nvmax=seq(1, round(ncol(trainx) / 2, 0))) # Max number of parameters to use
set.seed(777)
fitlm2 <- train(x=trainx, y=trainy, method='leapSeq', trControl=ctrl, tuneGrid=stepGrid, preProcess=c('center', 'scale', 'YeoJohnson'))
fitlm2
## Linear Regression with Stepwise Selection
##
## 144 samples
## 46 predictor
##
## Pre-processing: centered (46), scaled (46), Yeo-Johnson transformation (23)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 128, 131, 129, 130, 130, ...
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 1 1.431491e+00 0.4342738 1.144534e+00
## 2 1.158788e+00 0.6031342 9.437858e-01
## 3 1.196402e+00 0.5713593 9.745006e-01
## 4 1.200655e+00 0.5713735 9.781351e-01
## 5 1.111279e+00 0.6419103 8.914018e-01
## 6 5.362400e+10 0.5215602 1.340600e+10
## 7 5.945710e+10 0.5228394 1.486428e+10
## 8 1.241937e+00 0.5558046 1.004326e+00
## 9 5.950209e+10 0.5004446 1.487552e+10
## 10 6.496760e+10 0.4864935 1.624190e+10
## 11 5.294823e+10 0.5207291 1.323706e+10
## 12 5.388407e+10 0.3643220 1.347102e+10
## 13 5.481079e+10 0.5141754 1.370270e+10
## 14 5.240901e+10 0.4454655 1.310225e+10
## 15 8.587152e+10 0.4471269 2.146788e+10
## 16 8.385069e+10 0.5036706 2.096267e+10
## 17 5.800547e+10 0.4495873 1.450137e+10
## 18 5.477397e+10 0.4834435 1.369349e+10
## 19 7.008780e+10 0.5196428 1.752195e+10
## 20 8.466672e+10 0.5577231 2.116668e+10
## 21 8.733927e+10 0.4745233 2.183482e+10
## 22 9.003694e+10 0.4912813 2.250923e+10
## 23 9.226087e+10 0.5070177 2.306522e+10
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 5.
dfr[2,] = data.frame(Model='LM', Tuning.Parameters=paste0('nvmax=', fitlm2$bestTune[['nvmax']]), RMSE=min(fitlm2$results[['RMSE']]))
Robust linear regression
# Robust linear model; without PCA pre-processing, rlm errors with "'x' is singular:
# singular fits are not implemented in 'rlm'", so PCA is included in the pre-processing
set.seed(777)
fitrlm <- train(x=trainx, y=trainy, method='rlm', preProcess=c('center', 'scale', 'pca'), trControl=ctrl)
fitrlm
## Robust Linear Model
##
## 144 samples
## 46 predictor
##
## Pre-processing: centered (46), scaled (46), principal component
## signal extraction (46)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 128, 131, 129, 130, 130, ...
## Resampling results across tuning parameters:
##
## intercept psi RMSE Rsquared MAE
## FALSE psi.huber 40.359305 0.5187123 40.309744
## FALSE psi.hampel 40.359305 0.5187123 40.309744
## FALSE psi.bisquare 40.359228 0.5189204 40.309711
## TRUE psi.huber 1.703796 0.5128949 1.152851
## TRUE psi.hampel 1.639212 0.5197273 1.122953
## TRUE psi.bisquare 1.649879 0.5141430 1.138463
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi = psi.hampel.
dfr[3,] = data.frame(Model='RLM', Tuning.Parameters='intercept=TRUE, psi=psi.hampel', RMSE=min(fitrlm$results[['RMSE']]))
Partial least squares
# PLS
set.seed(777)
fitpls <- train(x=trainx, y=trainy, method='pls', preProcess=c('center', 'scale'), trControl=ctrl, tuneLength=nrow(trainx) / 2)
fitpls
## Partial Least Squares
##
## 144 samples
## 46 predictor
##
## Pre-processing: centered (46), scaled (46)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 128, 131, 129, 130, 130, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.526790 0.4263332 1.166839
## 2 1.511817 0.5062515 1.092722
## 3 1.629087 0.5535124 1.117592
## 4 1.875163 0.5474681 1.152308
## 5 2.053876 0.5302709 1.212185
## 6 2.196015 0.5197294 1.254813
## 7 2.549015 0.4954945 1.373743
## 8 2.774119 0.4848949 1.430631
## 9 2.912486 0.5013733 1.452154
## 10 3.107472 0.4982692 1.494394
## 11 3.228344 0.5000939 1.517594
## 12 3.405619 0.4938940 1.557258
## 13 3.476198 0.4977553 1.572813
## 14 3.617499 0.4972972 1.603532
## 15 3.730694 0.4994671 1.629883
## 16 3.849865 0.5003645 1.664269
## 17 3.954212 0.4996230 1.692126
## 18 4.055287 0.4950035 1.724869
## 19 4.139159 0.4924955 1.749936
## 20 4.215228 0.4921319 1.770767
## 21 4.362633 0.4897603 1.813180
## 22 4.460844 0.4886693 1.841417
## 23 4.593184 0.4861933 1.878487
## 24 4.748029 0.4829737 1.920425
## 25 4.864180 0.4808356 1.950680
## 26 4.949870 0.4783736 1.975737
## 27 4.991077 0.4773965 1.986732
## 28 5.012214 0.4765909 1.992702
## 29 5.035560 0.4759194 2.000144
## 30 5.051121 0.4748363 2.005798
## 31 5.064986 0.4744422 2.009019
## 32 5.077696 0.4744943 2.012298
## 33 5.088090 0.4743634 2.015385
## 34 5.090468 0.4746549 2.015811
## 35 5.094116 0.4749406 2.016809
## 36 5.103858 0.4748463 2.019486
## 37 5.108222 0.4749117 2.020422
## 38 5.108992 0.4747286 2.020758
## 39 5.108526 0.4746318 2.020737
## 40 5.110694 0.4745965 2.021325
## 41 5.114275 0.4745892 2.022312
## 42 5.115730 0.4746206 2.022649
## 43 5.116157 0.4746222 2.022756
## 44 5.115674 0.4746303 2.022622
## 45 5.115640 0.4746321 2.022611
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.
dfr[4,] = data.frame(Model='PLS', Tuning.Parameters=paste0('ncomp=', fitpls$bestTune[['ncomp']]), RMSE=min(fitpls$results[['RMSE']]))
Ridge regression
# Ridge regression
set.seed(777)
ridgeGrid <- data.frame(.lambda=seq(0, 0.1, length=15))
fitridge <- train(x=trainx, y=trainy, method='ridge', preProcess=c('center', 'scale'), trControl=ctrl, tuneGrid=ridgeGrid)
fitridge
## Ridge Regression
##
## 144 samples
## 46 predictor
##
## Pre-processing: centered (46), scaled (46)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 128, 131, 129, 130, 130, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 5.115640 0.4746321 2.022611
## 0.007142857 4.007837 0.4979266 1.709757
## 0.014285714 3.520810 0.5078463 1.578087
## 0.021428571 3.237975 0.5133402 1.503131
## 0.028571429 3.050304 0.5168667 1.454249
## 0.035714286 2.915124 0.5193879 1.418881
## 0.042857143 2.812181 0.5213426 1.391944
## 0.050000000 2.730595 0.5229511 1.370663
## 0.057142857 2.663981 0.5243319 1.354138
## 0.064285714 2.608329 0.5255526 1.340362
## 0.071428571 2.560989 0.5266534 1.328619
## 0.078571429 2.520133 0.5276596 1.318586
## 0.085714286 2.484452 0.5285878 1.310057
## 0.092857143 2.452986 0.5294493 1.302455
## 0.100000000 2.425008 0.5302522 1.295635
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
dfr[5,] = data.frame(Model='Ridge', Tuning.Parameters=paste0('lambda=', fitridge$bestTune[['lambda']]), RMSE=min(fitridge$results[['RMSE']]))
Lasso regression
# Lasso regression
enetGrid <- expand.grid(.lambda=c(0, 0.01, 0.1), .fraction=seq(0.05, 1, length=20))
set.seed(777)
fitlasso <- train(x=trainx, y=trainy, method='enet', preProcess=c('center', 'scale'), trControl=ctrl, tuneGrid=enetGrid)
fitlasso
## Elasticnet
##
## 144 samples
## 46 predictor
##
## Pre-processing: centered (46), scaled (46)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 128, 131, 129, 130, 130, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 1.496260 0.5634332 1.242877
## 0.00 0.10 1.280891 0.5850413 1.067828
## 0.00 0.15 1.244567 0.5705032 1.017070
## 0.00 0.20 1.552368 0.5657841 1.071553
## 0.00 0.25 1.693358 0.5683321 1.101153
## 0.00 0.30 1.841463 0.5634434 1.138020
## 0.00 0.35 1.960392 0.5572569 1.165177
## 0.00 0.40 2.121687 0.5552184 1.204194
## 0.00 0.45 2.265305 0.5548726 1.238814
## 0.00 0.50 2.502205 0.5286630 1.306408
## 0.00 0.55 2.775275 0.5179853 1.374875
## 0.00 0.60 3.027842 0.5135267 1.437783
## 0.00 0.65 3.330734 0.5149751 1.514301
## 0.00 0.70 3.687284 0.5117928 1.608031
## 0.00 0.75 4.021617 0.5070116 1.699490
## 0.00 0.80 4.287030 0.5007199 1.774338
## 0.00 0.85 4.522109 0.4933412 1.844119
## 0.00 0.90 4.743989 0.4860898 1.911502
## 0.00 0.95 4.927568 0.4793554 1.967701
## 0.00 1.00 5.115640 0.4746321 2.022611
## 0.01 0.05 1.552755 0.5392355 1.289136
## 0.01 0.10 1.368745 0.5773088 1.139872
## 0.01 0.15 1.228621 0.5934733 1.022713
## 0.01 0.20 1.271444 0.5651755 1.021248
## 0.01 0.25 1.512796 0.5690943 1.058149
## 0.01 0.30 1.638723 0.5724123 1.083730
## 0.01 0.35 1.754695 0.5710581 1.108978
## 0.01 0.40 1.893296 0.5652949 1.146110
## 0.01 0.45 2.001564 0.5558437 1.178080
## 0.01 0.50 2.097541 0.5539844 1.201485
## 0.01 0.55 2.234012 0.5471378 1.241234
## 0.01 0.60 2.402482 0.5342127 1.285540
## 0.01 0.65 2.583688 0.5270247 1.328696
## 0.01 0.70 2.768173 0.5214363 1.374307
## 0.01 0.75 2.956960 0.5170501 1.422059
## 0.01 0.80 3.118342 0.5158694 1.463990
## 0.01 0.85 3.338838 0.5148072 1.521653
## 0.01 0.90 3.511004 0.5118248 1.568048
## 0.01 0.95 3.651312 0.5079108 1.607902
## 0.01 1.00 3.775369 0.5027184 1.646082
## 0.10 0.05 1.623123 0.4761389 1.346421
## 0.10 0.10 1.493217 0.5555220 1.239770
## 0.10 0.15 1.372376 0.5770680 1.143356
## 0.10 0.20 1.266243 0.5905997 1.055866
## 0.10 0.25 1.206276 0.5956067 1.007595
## 0.10 0.30 1.209429 0.5795297 1.002188
## 0.10 0.35 1.347724 0.5601984 1.020873
## 0.10 0.40 1.487463 0.5653996 1.047829
## 0.10 0.45 1.584918 0.5710952 1.067204
## 0.10 0.50 1.670864 0.5721786 1.087431
## 0.10 0.55 1.750869 0.5720047 1.106169
## 0.10 0.60 1.828812 0.5705264 1.126478
## 0.10 0.65 1.909082 0.5675414 1.148811
## 0.10 0.70 1.993045 0.5662598 1.168350
## 0.10 0.75 2.096245 0.5606080 1.197623
## 0.10 0.80 2.204726 0.5522837 1.230596
## 0.10 0.85 2.286488 0.5446859 1.254934
## 0.10 0.90 2.333570 0.5389458 1.269233
## 0.10 0.95 2.378917 0.5345773 1.282115
## 0.10 1.00 2.425008 0.5302522 1.295635
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.25 and lambda = 0.1.
dfr[6,] = data.frame(
  Model='Lasso (elastic net)',
  Tuning.Parameters=paste0('lambda=', fitlasso$bestTune[['lambda']], ', fraction=', fitlasso$bestTune[['fraction']]),
  RMSE=min(fitlasso$results[['RMSE']])
)
# Model comparison
dfr %>%
filter(!is.na(RMSE)) %>%
kbl(caption='Model comparison') %>%
kable_classic(full_width=F)
| Model | Tuning.Parameters | RMSE |
|---|---|---|
| LM |  | 1.177396e+11 |
| LM | nvmax=5 | 1.111279e+00 |
| RLM | intercept=TRUE, psi=psi.hampel | 1.639212e+00 |
| PLS | ncomp=2 | 1.511817e+00 |
| Ridge | lambda=0.1 | 2.425008e+00 |
| Lasso (elastic net) | lambda=0.1, fraction=0.25 | 1.206276e+00 |
Although the stepwise linear model achieved the lowest resampled RMSE, outliers in its selected predictors appear to have destabilized the fit (note the extreme RMSE values for many nvmax settings). We'll therefore choose the next-best model, the lasso. The lasso (elastic net) model with lambda=0.1 and fraction=0.25 was the best-performing of the remaining models on the training data, with a cross-validated RMSE of 1.21.
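One way to see that instability (a sketch; output not shown) is to compare the fold-level resampling results of the two candidates. Because both models were trained with the same seed and trainControl object, their resampling indices match, and extreme per-fold RMSE values for the stepwise fit would point to the outlier problem described above.
# Compare per-fold cross-validation RMSE for the stepwise and lasso fits
res <- resamples(list(Stepwise=fitlm2, Lasso=fitlasso))
summary(res)
bwplot(res, metric='RMSE')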
We’ll check residuals to make sure the model is valid.
# Predict y values of the training data; predict() on a caret train object uses the
# final model at the optimal tuning parameters, so s/fraction need not be supplied
predy_train <- predict(fitlasso, newdata=trainx)
# Check residual plot
ggplot() +
geom_point(aes(x=predy_train, y=trainy - predy_train)) +
ggtitle('Residual plot for best-performing model (lasso)')
# Breusch-Pagan test to determine homoscedasticity of residuals
# Null hypothesis: the residuals are homoscedastic.
# If the p-value is small, reject the null, i.e., consider the residuals heteroscedastic.
#bp <- bptest(fitlm2)
#if (bp$p.value > 0.05 & bp$statistic < 10) {
# print(paste0("Breusch-Pagan test for homoscedasticity: The p-value of ", bp$p.value, " is > 0.05 and the test statistic of ", bp$statistic,
# " is < 10, so don't reject the null; i.e., the residuals are HOMOSCEDASTIC"))
#} else if (bp$p.value <= 0.05) {
# print(paste0("Breusch-Pagan test for homoscedasticity: The p-value of ", bp$p.value, " is <= 0.05 and the test statistic is ", bp$statistic,
# ", so reject the null; i.e., the residuals are HETEROSCEDASTIC"))
#} else {
# print(paste0("Breusch-Pagan test for homoscedasticity: The p-value of ", bp$p.value, " and test statistic of ", bp$statistic,
# " are inconclusive, so homoscedasticity can't be determined using this test."))
#}
The residuals don't exhibit any pattern and appear appropriate for a well-fitted model.
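If a formal check is desired, one rough option (a sketch; lmtest::bptest() expects an lm object, so an auxiliary regression of the residuals on the fitted values stands in for the lasso fit) is:
# Rough Breusch-Pagan check on the lasso residuals; a large p-value is consistent
# with homoscedastic residuals
library(lmtest)
aux <- lm(I(trainy - predy_train) ~ predy_train)
bptest(aux)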
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
First, calculate the accuracy based on the test set.
# Generate y predictions on the test set (again, predict.train uses the optimal tuning parameters)
predy <- predict(fitlasso, newdata=testx)
# Get accuracy results
accuracy(predy, testy)
## ME RMSE MAE MPE MAPE
## Test set -0.1320439 1.343438 0.9877057 -0.4636695 2.450543
As expected, the RMSE on the test set is somewhat higher than the resampled RMSE on the training set (1.34 versus 1.21). The MAPE of 2.45 indicates that predictions are, on average, within about 2.5% of the observed yields, so the model still performs well on held-out data.
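For reference, the two headline numbers from accuracy() can be reproduced directly from their definitions (a sketch):
# RMSE and MAPE computed by hand from the test-set predictions
rmse_test <- sqrt(mean((testy - predy)^2))
mape_test <- mean(abs((testy - predy) / testy)) * 100
print(c(RMSE=rmse_test, MAPE=mape_test))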
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
# Extract coefficients from the lasso (enet) model; with mode='fraction', the s argument
# gives the fraction of the full L1 path at which coefficients are taken. A smaller
# fraction (0.1) than the tuned value is used here to obtain a sparse, interpretable set.
enetCoef <- predict(fitlasso$finalModel, s=0.1, mode='fraction', type='coefficients')
dftop <- data.frame(enetCoef$coefficients) %>%
filter(enetCoef.coefficients != 0)
# Extract coefficients from model (stepwise linear regression)
#lmCoef <- coef(fitlm2$finalModel, id=5)
#dftop <- as.data.frame(lmCoef)
# Display coefficients
dftop %>%
kbl(caption='Non-zero model coefficients') %>%
kable_classic(full_width=F)
| Predictor | enetCoef.coefficients |
|---|---|
| ManufacturingProcess09 | 0.1644254 |
| ManufacturingProcess13 | -0.0189290 |
| ManufacturingProcess32 | 0.4286063 |
As shown, all three retained predictors are manufacturing process measurements. This works out favorably: these are settings that can be controlled or adjusted to influence yield, whereas the biological predictors cannot be changed.
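As a model-free cross-check (a sketch, not part of the original analysis), caret's filterVarImp() ranks predictors by the strength of their univariate relationship with the outcome; the top of that ranking can be compared against the non-zero lasso coefficients above.
# Univariate importance of each predictor with respect to Yield
fvi <- filterVarImp(x=trainx, y=trainy)
head(fvi[order(-fvi$Overall), , drop=FALSE], 10)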
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
# Iterate over the top predictors
for (p in rownames(dftop)) {
  if (p != '(Intercept)') { # do not include the intercept
    plt <- dfchem2 %>%
      ggplot(aes(x=.data[[p]], y=Yield)) +
      geom_point() +
      xlab(p) +
      geom_smooth(method='lm', formula=y ~ x, linetype=2, color='darkred', se=F)
    print(plt)
  }
}
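To attach a number to the trends visible in these plots (a sketch), we can also compute the correlation of each retained predictor with Yield on the imputed data; positive values suggest settings to increase in future runs, negative values settings to decrease.
# Correlation of each top predictor with Yield
top_preds <- setdiff(rownames(dftop), '(Intercept)')
sapply(top_preds, function(p) cor(dfchem2[[p]], dfchem2$Yield))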