DATA 624 HW4

KJ 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurmenets of the manufacturing process (predictors), and the response of the product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

(a)

Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
library(tidyverse)
library(kableExtra)
library(e1071)
library(mice)
library(corrplot)
library(caret)
library(pls)

data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

cmp <- ChemicalManufacturingProcess

(b)

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.

cat(sprintf("Total NAs\n%d",sum(is.na(cmp))))
## Total NAs
## 106

Columns with NAs,

colSums(is.na(cmp)) |> 
  data.frame() |>
  rename("NAs" = 1) |>
  filter(NAs > 0) |>
  arrange(desc(NAs)) |>
  kable()
NAs
ManufacturingProcess03 15
ManufacturingProcess11 10
ManufacturingProcess10 9
ManufacturingProcess25 5
ManufacturingProcess26 5
ManufacturingProcess27 5
ManufacturingProcess28 5
ManufacturingProcess29 5
ManufacturingProcess30 5
ManufacturingProcess31 5
ManufacturingProcess33 5
ManufacturingProcess34 5
ManufacturingProcess35 5
ManufacturingProcess36 5
ManufacturingProcess02 3
ManufacturingProcess06 2
ManufacturingProcess01 1
ManufacturingProcess04 1
ManufacturingProcess05 1
ManufacturingProcess07 1
ManufacturingProcess08 1
ManufacturingProcess12 1
ManufacturingProcess14 1
ManufacturingProcess22 1
ManufacturingProcess23 1
ManufacturingProcess24 1
ManufacturingProcess40 1
ManufacturingProcess41 1

Imputing values using the mice packages predictive mean matching – the function provided in the book (impute.knn function in the impute package) – generated a series of errors I was unable to resolve.

set.seed(1)

# store mids object (multiple imputations by chained equations)
imputed_vals <- mice(cmp, method = "pmm", m = 5, printFlag = F)
# return complete dataset
trans <- complete(imputed_vals)

# check NAs
sprintf("There are %d NA's",sum(is.na(trans)))
## [1] "There are 0 NA's"

(c)

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

Before splitting the data, we want to a look at the distribution and correlation amongst the predictors. To do this, we’ll calculate the skewness statistic and examine a correlation plot of the variables,

skew.vals <- apply(trans, MARGIN = 2, skewness) |>
                  data.frame() |>
                  rename("skew" = 1) |>
                  filter(skew > 2 | skew < -2) |>
                  arrange(desc(skew))

sprintf("Number of skewed variables: %d",nrow(skew.vals))
## [1] "Number of skewed variables: 20"
kable(skew.vals)
skew
ManufacturingProcess43 9.054875
BiologicalMaterial07 7.398664
ManufacturingProcess06 3.061119
ManufacturingProcess05 2.554790
BiologicalMaterial10 2.402378
ManufacturingProcess41 2.119406
ManufacturingProcess01 -3.932338
ManufacturingProcess45 -4.077941
ManufacturingProcess39 -4.269121
ManufacturingProcess30 -4.333221
ManufacturingProcess44 -4.970355
ManufacturingProcess42 -5.450008
ManufacturingProcess29 -10.207474
ManufacturingProcess31 -11.273361
ManufacturingProcess16 -12.420225
ManufacturingProcess20 -12.638327
ManufacturingProcess27 -12.675018
ManufacturingProcess18 -12.736138
ManufacturingProcess25 -12.793140
ManufacturingProcess26 -12.851872

Generally, a skewness between -0.5 and 0.5 indicates a relatively small amount of skewness (0 = perfect symmetry). Any values lower than -2 or greater than 2 are considered skewed. A positive skew indicates a right-skew, while a negative skew indicates a left-skew.

In order to reduce skewness we can use a Box-Cox transformation depending on the approach we take. Now we take a look at a correlation matrix,

trans |>
  select(-Yield) |>
  cor() |>
  round(2) |>
  corrplot::corrplot(method = "square",
            order = "alphabet",
            tl.cex = 0.3,
            type = "lower")

We can see there’s quite a bit of correlation amongst the predictors. A Partial Least Squares model may be in order.

set.seed(1)

# takes a vector of data as the first argument to return a list of indices
training_rows <- createDataPartition(trans$Yield,
                                     p = .80,
                                     list = FALSE)

# subset the data into objects for training using integer subsetting
train.set <- cmp[training_rows,]
test.set <- cmp[-training_rows,]

First we create the training and testing sets, then we pre-process the data. I am going to try the Partial Least Squares model, which we can preprocess as we fit the model. The optimal value for this is the number of components which minimizes the RMSE.

set.seed(1)
# specify type of resampling 
ctrl <- trainControl(method = "cv", number = 10)

# set aside predictors and response 
train.x <- train.set |> select(-Yield)
train.y <- train.set$Yield

test.x <- test.set |> select(-Yield)
test.y<- test.set$Yield

# tune and fit the model with pre processing 
plsTune <- train(train.x, train.y,
                 method = "pls",
                 tuneLength = 20,
                 trControl = ctrl,
                 preProc = c("center","scale"))

plsTune
## Partial Least Squares 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 130, 129, 130, 131, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.566705  0.4247629  1.227834
##    2     1.587705  0.4966616  1.151398
##    3     1.351363  0.5581612  1.071164
##    4     1.783367  0.5433330  1.196660
##    5     2.013469  0.5281434  1.255710
##    6     2.249506  0.5206636  1.318019
##    7     2.469632  0.5197227  1.381629
##    8     2.650143  0.5183121  1.444127
##    9     2.673926  0.5210755  1.472676
##   10     2.743457  0.5245765  1.492636
##   11     2.826852  0.5229362  1.526187
##   12     2.967788  0.5026011  1.590029
##   13     3.096250  0.4874232  1.636990
##   14     3.247232  0.4651949  1.700138
##   15     3.421738  0.4408043  1.760898
##   16     3.475470  0.4327055  1.791162
##   17     3.515570  0.4256927  1.804170
##   18     3.455536  0.4164697  1.797471
##   19     3.371133  0.4187467  1.774526
##   20     3.330286  0.4143005  1.768120
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
optimal.model <- plsTune$bestTune[['ncomp']]
optimal.rmse <- plsTune$results[['RMSE']][[3]]

cat(sprintf("The optimal number of components to minimize the RMSE is %i\nThe RMSE using %i components is %f.",
            optimal.model,
            optimal.model,
            optimal.rmse))
## The optimal number of components to minimize the RMSE is 3
## The RMSE using 3 components is 1.351363.
plot(plsTune)

(d)

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

set.seed(1)
preds <- predict(plsTune, test.x)
postResample(preds, test.y)
##      RMSE  Rsquared       MAE 
## 1.1137324 0.5967705 0.8842548

The RMSE value is actually lower on the test set than it was on the training set.

plot(residuals(plsTune), ylab='residuals')
abline(h = 0, col = 'red')

The residuals appear to be random and centered around mean zero. The model appears to fit well.

(e)

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

The varImp function in the caret package calculates variable importance for objects produced by the train function.

# variable importance
var.imp <- varImp(plsTune)
plot(var.imp, top = 10)

The top 6 most important variables are all process predictors.

(f)

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Let’s use the ten most important variables for this exercise,

# store 10 most important variable names
m.imp <- var.imp$importance |>
  data.frame() |>
  arrange(desc(Overall)) |>
  head(10) |>
  row.names()

# correlation matrix of 10 most important variables and response variable
trans |>
  select(all_of(c("Yield",m.imp))) |>
  cor() |>
  round(2) |>
  corrplot::corrplot(method = "square",
            order = "alphabet",
            tl.cex = 0.6,
            type = "lower")

We can observe that we might reduce the processes with strong negative correlation with the yield - Manufacturing Processes 13, 17, and 36 - as reducing those processes will increase the yield.