DATA 624 HW4
KJ 6.3
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurmenets of the manufacturing process (predictors), and the response of the product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
(a)
Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
library(tidyverse)
library(kableExtra)
library(e1071)
library(mice)
library(corrplot)
library(caret)
library(pls)
data(ChemicalManufacturingProcess)
The matrix processPredictors
contains the 57 predictors
(12 describing the input biological material and 45 describing the
process predictors) for the 176 manufacturing runs. yield
contains the percent yield for each run.
(b)
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.
## Total NAs
## 106
Columns with NAs,
colSums(is.na(cmp)) |>
data.frame() |>
rename("NAs" = 1) |>
filter(NAs > 0) |>
arrange(desc(NAs)) |>
kable()
NAs | |
---|---|
ManufacturingProcess03 | 15 |
ManufacturingProcess11 | 10 |
ManufacturingProcess10 | 9 |
ManufacturingProcess25 | 5 |
ManufacturingProcess26 | 5 |
ManufacturingProcess27 | 5 |
ManufacturingProcess28 | 5 |
ManufacturingProcess29 | 5 |
ManufacturingProcess30 | 5 |
ManufacturingProcess31 | 5 |
ManufacturingProcess33 | 5 |
ManufacturingProcess34 | 5 |
ManufacturingProcess35 | 5 |
ManufacturingProcess36 | 5 |
ManufacturingProcess02 | 3 |
ManufacturingProcess06 | 2 |
ManufacturingProcess01 | 1 |
ManufacturingProcess04 | 1 |
ManufacturingProcess05 | 1 |
ManufacturingProcess07 | 1 |
ManufacturingProcess08 | 1 |
ManufacturingProcess12 | 1 |
ManufacturingProcess14 | 1 |
ManufacturingProcess22 | 1 |
ManufacturingProcess23 | 1 |
ManufacturingProcess24 | 1 |
ManufacturingProcess40 | 1 |
ManufacturingProcess41 | 1 |
Imputing values using the mice
packages predictive mean
matching – the function provided in the book (impute.knn
function in the impute
package) – generated a series of
errors I was unable to resolve.
set.seed(1)
# store mids object (multiple imputations by chained equations)
imputed_vals <- mice(cmp, method = "pmm", m = 5, printFlag = F)
# return complete dataset
trans <- complete(imputed_vals)
# check NAs
sprintf("There are %d NA's",sum(is.na(trans)))
## [1] "There are 0 NA's"
(c)
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Before splitting the data, we want to a look at the distribution and correlation amongst the predictors. To do this, we’ll calculate the skewness statistic and examine a correlation plot of the variables,
skew.vals <- apply(trans, MARGIN = 2, skewness) |>
data.frame() |>
rename("skew" = 1) |>
filter(skew > 2 | skew < -2) |>
arrange(desc(skew))
sprintf("Number of skewed variables: %d",nrow(skew.vals))
## [1] "Number of skewed variables: 20"
skew | |
---|---|
ManufacturingProcess43 | 9.054875 |
BiologicalMaterial07 | 7.398664 |
ManufacturingProcess06 | 3.061119 |
ManufacturingProcess05 | 2.554790 |
BiologicalMaterial10 | 2.402378 |
ManufacturingProcess41 | 2.119406 |
ManufacturingProcess01 | -3.932338 |
ManufacturingProcess45 | -4.077941 |
ManufacturingProcess39 | -4.269121 |
ManufacturingProcess30 | -4.333221 |
ManufacturingProcess44 | -4.970355 |
ManufacturingProcess42 | -5.450008 |
ManufacturingProcess29 | -10.207474 |
ManufacturingProcess31 | -11.273361 |
ManufacturingProcess16 | -12.420225 |
ManufacturingProcess20 | -12.638327 |
ManufacturingProcess27 | -12.675018 |
ManufacturingProcess18 | -12.736138 |
ManufacturingProcess25 | -12.793140 |
ManufacturingProcess26 | -12.851872 |
Generally, a skewness between -0.5 and 0.5 indicates a relatively small amount of skewness (0 = perfect symmetry). Any values lower than -2 or greater than 2 are considered skewed. A positive skew indicates a right-skew, while a negative skew indicates a left-skew.
In order to reduce skewness we can use a Box-Cox transformation depending on the approach we take. Now we take a look at a correlation matrix,
trans |>
select(-Yield) |>
cor() |>
round(2) |>
corrplot::corrplot(method = "square",
order = "alphabet",
tl.cex = 0.3,
type = "lower")
We can see there’s quite a bit of correlation amongst the predictors. A Partial Least Squares model may be in order.
set.seed(1)
# takes a vector of data as the first argument to return a list of indices
training_rows <- createDataPartition(trans$Yield,
p = .80,
list = FALSE)
# subset the data into objects for training using integer subsetting
train.set <- cmp[training_rows,]
test.set <- cmp[-training_rows,]
First we create the training and testing sets, then we pre-process the data. I am going to try the Partial Least Squares model, which we can preprocess as we fit the model. The optimal value for this is the number of components which minimizes the RMSE.
set.seed(1)
# specify type of resampling
ctrl <- trainControl(method = "cv", number = 10)
# set aside predictors and response
train.x <- train.set |> select(-Yield)
train.y <- train.set$Yield
test.x <- test.set |> select(-Yield)
test.y<- test.set$Yield
# tune and fit the model with pre processing
plsTune <- train(train.x, train.y,
method = "pls",
tuneLength = 20,
trControl = ctrl,
preProc = c("center","scale"))
plsTune
## Partial Least Squares
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 130, 129, 130, 131, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.566705 0.4247629 1.227834
## 2 1.587705 0.4966616 1.151398
## 3 1.351363 0.5581612 1.071164
## 4 1.783367 0.5433330 1.196660
## 5 2.013469 0.5281434 1.255710
## 6 2.249506 0.5206636 1.318019
## 7 2.469632 0.5197227 1.381629
## 8 2.650143 0.5183121 1.444127
## 9 2.673926 0.5210755 1.472676
## 10 2.743457 0.5245765 1.492636
## 11 2.826852 0.5229362 1.526187
## 12 2.967788 0.5026011 1.590029
## 13 3.096250 0.4874232 1.636990
## 14 3.247232 0.4651949 1.700138
## 15 3.421738 0.4408043 1.760898
## 16 3.475470 0.4327055 1.791162
## 17 3.515570 0.4256927 1.804170
## 18 3.455536 0.4164697 1.797471
## 19 3.371133 0.4187467 1.774526
## 20 3.330286 0.4143005 1.768120
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
optimal.model <- plsTune$bestTune[['ncomp']]
optimal.rmse <- plsTune$results[['RMSE']][[3]]
cat(sprintf("The optimal number of components to minimize the RMSE is %i\nThe RMSE using %i components is %f.",
optimal.model,
optimal.model,
optimal.rmse))
## The optimal number of components to minimize the RMSE is 3
## The RMSE using 3 components is 1.351363.
(d)
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
## RMSE Rsquared MAE
## 1.1137324 0.5967705 0.8842548
The RMSE value is actually lower on the test set than it was on the training set.
The residuals appear to be random and centered around mean zero. The model appears to fit well.
(e)
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
The varImp
function in the caret
package
calculates variable importance for objects produced by the
train
function.
The top 6 most important variables are all process predictors.
(f)
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
Let’s use the ten most important variables for this exercise,
# store 10 most important variable names
m.imp <- var.imp$importance |>
data.frame() |>
arrange(desc(Overall)) |>
head(10) |>
row.names()
# correlation matrix of 10 most important variables and response variable
trans |>
select(all_of(c("Yield",m.imp))) |>
cor() |>
round(2) |>
corrplot::corrplot(method = "square",
order = "alphabet",
tl.cex = 0.6,
type = "lower")
We can observe that we might reduce the processes with strong negative correlation with the yield - Manufacturing Processes 13, 17, and 36 - as reducing those processes will increase the yield.