—————————————————————————

Student Name : Sachid Deshmukh

—————————————————————————

6.3 A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

a. Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version
## 3.6.3
data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

b. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

Check missing values

## [1] "Number of columns with missing values =  28"
## [1] "Names of columns with missing values =  ManufacturingProcess01, ManufacturingProcess02, ManufacturingProcess03, ManufacturingProcess04, ManufacturingProcess05, ManufacturingProcess06, ManufacturingProcess07, ManufacturingProcess08, ManufacturingProcess10, ManufacturingProcess11, ManufacturingProcess12, ManufacturingProcess14, ManufacturingProcess22, ManufacturingProcess23, ManufacturingProcess24, ManufacturingProcess25, ManufacturingProcess26, ManufacturingProcess27, ManufacturingProcess28, ManufacturingProcess29, ManufacturingProcess30, ManufacturingProcess31, ManufacturingProcess33, ManufacturingProcess34, ManufacturingProcess35, ManufacturingProcess36, ManufacturingProcess40, ManufacturingProcess41"
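
The code for this check was not echoed in the report. A minimal sketch of one way to produce the two lines above in base R follows. The helper name `missing_columns` and the toy data frame are illustrative assumptions; the same call would apply to `ChemicalManufacturingProcess`.

```r
# Hypothetical helper: names of the columns that contain at least one NA.
missing_columns <- function(df) {
  has_na <- colSums(is.na(df)) > 0
  names(df)[has_na]
}

# Toy data standing in for ChemicalManufacturingProcess (values are illustrative).
toy <- data.frame(
  Yield = c(38.0, 42.4, 41.2),
  BiologicalMaterial01 = c(6.25, 8.01, 7.50),
  ManufacturingProcess01 = c(NA, 0.0, 10.7),
  ManufacturingProcess03 = c(NA, NA, 1.56)
)

cols <- missing_columns(toy)
print(paste("Number of columns with missing values = ", length(cols)))
print(paste("Names of columns with missing values = ", paste(cols, collapse = ", ")))
```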

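Let’s impute the columns that contain missing values.

Which columns have missing values, and what is the missingness pattern? Let’s use the VIM package to find out. In the output below, the Count column gives the fraction of the 176 runs that are missing for each variable, not a raw count.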

## 
##  Variables sorted by number of missings: 
##                Variable       Count
##  ManufacturingProcess03 0.085227273
##  ManufacturingProcess11 0.056818182
##  ManufacturingProcess10 0.051136364
##  ManufacturingProcess25 0.028409091
##  ManufacturingProcess26 0.028409091
##  ManufacturingProcess27 0.028409091
##  ManufacturingProcess28 0.028409091
##  ManufacturingProcess29 0.028409091
##  ManufacturingProcess30 0.028409091
##  ManufacturingProcess31 0.028409091
##  ManufacturingProcess33 0.028409091
##  ManufacturingProcess34 0.028409091
##  ManufacturingProcess35 0.028409091
##  ManufacturingProcess36 0.028409091
##  ManufacturingProcess02 0.017045455
##  ManufacturingProcess06 0.011363636
##  ManufacturingProcess01 0.005681818
##  ManufacturingProcess04 0.005681818
##  ManufacturingProcess05 0.005681818
##  ManufacturingProcess07 0.005681818
##  ManufacturingProcess08 0.005681818
##  ManufacturingProcess12 0.005681818
##  ManufacturingProcess14 0.005681818
##  ManufacturingProcess22 0.005681818
##  ManufacturingProcess23 0.005681818
##  ManufacturingProcess24 0.005681818
##  ManufacturingProcess40 0.005681818
##  ManufacturingProcess41 0.005681818
##                   Yield 0.000000000
##    BiologicalMaterial01 0.000000000
##    BiologicalMaterial02 0.000000000
##    BiologicalMaterial03 0.000000000
##    BiologicalMaterial04 0.000000000
##    BiologicalMaterial05 0.000000000
##    BiologicalMaterial06 0.000000000
##    BiologicalMaterial07 0.000000000
##    BiologicalMaterial08 0.000000000
##    BiologicalMaterial09 0.000000000
##    BiologicalMaterial10 0.000000000
##    BiologicalMaterial11 0.000000000
##    BiologicalMaterial12 0.000000000
##  ManufacturingProcess09 0.000000000
##  ManufacturingProcess13 0.000000000
##  ManufacturingProcess15 0.000000000
##  ManufacturingProcess16 0.000000000
##  ManufacturingProcess17 0.000000000
##  ManufacturingProcess18 0.000000000
##  ManufacturingProcess19 0.000000000
##  ManufacturingProcess20 0.000000000
##  ManufacturingProcess21 0.000000000
##  ManufacturingProcess32 0.000000000
##  ManufacturingProcess37 0.000000000
##  ManufacturingProcess38 0.000000000
##  ManufacturingProcess39 0.000000000
##  ManufacturingProcess42 0.000000000
##  ManufacturingProcess43 0.000000000
##  ManufacturingProcess44 0.000000000
##  ManufacturingProcess45 0.000000000
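
The fractions in this table convert back to counts by multiplying by the 176 runs; for example, 0.085227273 × 176 = 15 missing values for ManufacturingProcess03. A base R sketch of the same kind of summary, shown on an illustrative toy data frame rather than the real data:

```r
# Toy data standing in for ChemicalManufacturingProcess (values are illustrative).
toy <- data.frame(
  Yield = c(38.0, 42.4, 41.2, 40.1),
  ManufacturingProcess01 = c(NA, 0.0, 10.7, 12.0),
  ManufacturingProcess03 = c(NA, NA, 1.56, 1.55)
)

# Fraction of missing cells per column, largest first.
prop_missing <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(prop_missing)

# Convert each fraction back to a number of missing rows.
n_missing <- round(prop_missing * nrow(toy))
print(n_missing)
```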

Let’s use the MICE package to impute the missing values.

## 
##  iter imp variable
##   1   1  ManufacturingProcess01  ManufacturingProcess02  ManufacturingProcess03  ManufacturingProcess04  ManufacturingProcess05  ManufacturingProcess06  ManufacturingProcess07  ManufacturingProcess08  ManufacturingProcess10  ManufacturingProcess11  ManufacturingProcess12  ManufacturingProcess14  ManufacturingProcess22  ManufacturingProcess23  ManufacturingProcess24  ManufacturingProcess25  ManufacturingProcess26  ManufacturingProcess27  ManufacturingProcess28  ManufacturingProcess29  ManufacturingProcess30  ManufacturingProcess31  ManufacturingProcess33  ManufacturingProcess34  ManufacturingProcess35  ManufacturingProcess36  ManufacturingProcess40  ManufacturingProcess41
##   1   2  ManufacturingProcess01  ManufacturingProcess02  ManufacturingProcess03  ManufacturingProcess04  ManufacturingProcess05  ManufacturingProcess06  ManufacturingProcess07  ManufacturingProcess08  ManufacturingProcess10  ManufacturingProcess11  ManufacturingProcess12  ManufacturingProcess14  ManufacturingProcess22  ManufacturingProcess23  ManufacturingProcess24  ManufacturingProcess25  ManufacturingProcess26  ManufacturingProcess27  ManufacturingProcess28  ManufacturingProcess29  ManufacturingProcess30  ManufacturingProcess31  ManufacturingProcess33  ManufacturingProcess34  ManufacturingProcess35  ManufacturingProcess36  ManufacturingProcess40  ManufacturingProcess41
##   2   1  ManufacturingProcess01  ManufacturingProcess02  ManufacturingProcess03  ManufacturingProcess04  ManufacturingProcess05  ManufacturingProcess06  ManufacturingProcess07  ManufacturingProcess08  ManufacturingProcess10  ManufacturingProcess11  ManufacturingProcess12  ManufacturingProcess14  ManufacturingProcess22  ManufacturingProcess23  ManufacturingProcess24  ManufacturingProcess25  ManufacturingProcess26  ManufacturingProcess27  ManufacturingProcess28  ManufacturingProcess29  ManufacturingProcess30  ManufacturingProcess31  ManufacturingProcess33  ManufacturingProcess34  ManufacturingProcess35  ManufacturingProcess36  ManufacturingProcess40  ManufacturingProcess41
##   2   2  ManufacturingProcess01  ManufacturingProcess02  ManufacturingProcess03  ManufacturingProcess04  ManufacturingProcess05  ManufacturingProcess06  ManufacturingProcess07  ManufacturingProcess08  ManufacturingProcess10  ManufacturingProcess11  ManufacturingProcess12  ManufacturingProcess14  ManufacturingProcess22  ManufacturingProcess23  ManufacturingProcess24  ManufacturingProcess25  ManufacturingProcess26  ManufacturingProcess27  ManufacturingProcess28  ManufacturingProcess29  ManufacturingProcess30  ManufacturingProcess31  ManufacturingProcess33  ManufacturingProcess34  ManufacturingProcess35  ManufacturingProcess36  ManufacturingProcess40  ManufacturingProcess41
##   3   1  ManufacturingProcess01  ManufacturingProcess02  ManufacturingProcess03  ManufacturingProcess04  ManufacturingProcess05  ManufacturingProcess06  ManufacturingProcess07  ManufacturingProcess08  ManufacturingProcess10  ManufacturingProcess11  ManufacturingProcess12  ManufacturingProcess14  ManufacturingProcess22  ManufacturingProcess23  ManufacturingProcess24  ManufacturingProcess25  ManufacturingProcess26  ManufacturingProcess27  ManufacturingProcess28  ManufacturingProcess29  ManufacturingProcess30  ManufacturingProcess31  ManufacturingProcess33  ManufacturingProcess34  ManufacturingProcess35  ManufacturingProcess36  ManufacturingProcess40  ManufacturingProcess41
##   3   2  ManufacturingProcess01  ManufacturingProcess02  ManufacturingProcess03  ManufacturingProcess04  ManufacturingProcess05  ManufacturingProcess06  ManufacturingProcess07  ManufacturingProcess08  ManufacturingProcess10  ManufacturingProcess11  ManufacturingProcess12  ManufacturingProcess14  ManufacturingProcess22  ManufacturingProcess23  ManufacturingProcess24  ManufacturingProcess25  ManufacturingProcess26  ManufacturingProcess27  ManufacturingProcess28  ManufacturingProcess29  ManufacturingProcess30  ManufacturingProcess31  ManufacturingProcess33  ManufacturingProcess34  ManufacturingProcess35  ManufacturingProcess36  ManufacturingProcess40  ManufacturingProcess41
##   4   1  ManufacturingProcess01  ManufacturingProcess02  ManufacturingProcess03  ManufacturingProcess04  ManufacturingProcess05  ManufacturingProcess06  ManufacturingProcess07  ManufacturingProcess08  ManufacturingProcess10  ManufacturingProcess11  ManufacturingProcess12  ManufacturingProcess14  ManufacturingProcess22  ManufacturingProcess23  ManufacturingProcess24  ManufacturingProcess25  ManufacturingProcess26  ManufacturingProcess27  ManufacturingProcess28  ManufacturingProcess29  ManufacturingProcess30  ManufacturingProcess31  ManufacturingProcess33  ManufacturingProcess34  ManufacturingProcess35  ManufacturingProcess36  ManufacturingProcess40  ManufacturingProcess41
##   4   2  ManufacturingProcess01  ManufacturingProcess02  ManufacturingProcess03  ManufacturingProcess04  ManufacturingProcess05  ManufacturingProcess06  ManufacturingProcess07  ManufacturingProcess08  ManufacturingProcess10  ManufacturingProcess11  ManufacturingProcess12  ManufacturingProcess14  ManufacturingProcess22  ManufacturingProcess23  ManufacturingProcess24  ManufacturingProcess25  ManufacturingProcess26  ManufacturingProcess27  ManufacturingProcess28  ManufacturingProcess29  ManufacturingProcess30  ManufacturingProcess31  ManufacturingProcess33  ManufacturingProcess34  ManufacturingProcess35  ManufacturingProcess36  ManufacturingProcess40  ManufacturingProcess41
##   5   1  ManufacturingProcess01  ManufacturingProcess02  ManufacturingProcess03  ManufacturingProcess04  ManufacturingProcess05  ManufacturingProcess06  ManufacturingProcess07  ManufacturingProcess08  ManufacturingProcess10  ManufacturingProcess11  ManufacturingProcess12  ManufacturingProcess14  ManufacturingProcess22  ManufacturingProcess23  ManufacturingProcess24  ManufacturingProcess25  ManufacturingProcess26  ManufacturingProcess27  ManufacturingProcess28  ManufacturingProcess29  ManufacturingProcess30  ManufacturingProcess31  ManufacturingProcess33  ManufacturingProcess34  ManufacturingProcess35  ManufacturingProcess36  ManufacturingProcess40  ManufacturingProcess41
##   5   2  ManufacturingProcess01  ManufacturingProcess02  ManufacturingProcess03  ManufacturingProcess04  ManufacturingProcess05  ManufacturingProcess06  ManufacturingProcess07  ManufacturingProcess08  ManufacturingProcess10  ManufacturingProcess11  ManufacturingProcess12  ManufacturingProcess14  ManufacturingProcess22  ManufacturingProcess23  ManufacturingProcess24  ManufacturingProcess25  ManufacturingProcess26  ManufacturingProcess27  ManufacturingProcess28  ManufacturingProcess29  ManufacturingProcess30  ManufacturingProcess31  ManufacturingProcess33  ManufacturingProcess34  ManufacturingProcess35  ManufacturingProcess36  ManufacturingProcess40  ManufacturingProcess41
## Warning: Number of logged events: 270
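
The imputation call itself was not echoed. The log above shows five iterations and two imputed data sets, so a call along the following lines would be consistent with it. The exact arguments, the seed, and the choice of which completed data set to keep are assumptions, and a toy data frame stands in for the real data.

```r
library(mice)

# Toy numeric data with a few missing cells (illustrative only).
set.seed(100)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
toy$x2[c(3, 17)] <- NA
toy$x3[c(8, 21, 40)] <- NA

# m = 2 imputed data sets and maxit = 5 iterations, matching the log above.
imp <- mice(toy, m = 2, maxit = 5, printFlag = FALSE, seed = 100)

# Keep the first completed data set (an arbitrary choice for this sketch).
toy.imputed <- mice::complete(imp, 1)

print(sum(is.na(toy.imputed)))
```

With the real data, the completed data frame would need to be assigned back before the split in part c; otherwise the model would still see the missing values.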

c. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

# train test split
set.seed(100)
rows = nrow(ChemicalManufacturingProcess)
t.index <- sample(1:rows, size = round(0.75*rows), replace=FALSE)
df.train <- ChemicalManufacturingProcess[t.index ,]
df.test <- ChemicalManufacturingProcess[-t.index ,]
df.train.x = df.train[,-1]
df.train.y = df.train[,1]
df.test.x = df.test[,-1]
df.test.y = df.test[,1]

The data frame has many predictors (57) relative to the number of training samples (132), so let’s use a penalized regression model.

library(caret)

# Candidate ridge penalties to tune over
ridgeGrid = data.frame(.lambda = seq(0, .1, length = 15))


ridgeReg.fit = train(df.train.x, df.train.y, method = 'ridge', tuneGrid = ridgeGrid, trControl  = trainControl(), preProc = c("center", "scale"))

ridgeReg.fit
## Ridge Regression 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE      Rsquared   MAE     
##   0.000000000  4.426153  0.1833702  2.132187
##   0.007142857  1.956353  0.4111441  1.376672
##   0.014285714  1.746569  0.4613601  1.278776
##   0.021428571  1.661009  0.4830388  1.233125
##   0.028571429  1.611079  0.4964931  1.206454
##   0.035714286  1.577266  0.5060123  1.186955
##   0.042857143  1.552409  0.5132714  1.171608
##   0.050000000  1.533189  0.5191031  1.158948
##   0.057142857  1.517835  0.5239682  1.148549
##   0.064285714  1.505295  0.5281401  1.139731
##   0.071428571  1.494895  0.5317897  1.132096
##   0.078571429  1.486178  0.5350292  1.125741
##   0.085714286  1.478818  0.5379359  1.120782
##   0.092857143  1.472575  0.5405653  1.116576
##   0.100000000  1.467267  0.5429592  1.112989
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.

The optimal resampled RMSE is 1.467 (R-squared 0.543), reached at lambda = 0.1. This is the largest penalty in the grid and RMSE was still decreasing there, so a wider grid may give a lower value.

d. Predict the response for the test set. What is the value of the performance metric and how does this compare with the re-sampled performance metric on the training set?

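library(elasticnet)

# Refit at the tuned penalty; enet with a fixed lambda and s = 1 gives the ridge solution
ridgeModel = enet(x = as.matrix(df.train.x), y = df.train.y, lambda = 0.1)
ridgePred = predict(ridgeModel, newx = as.matrix(df.test.x), s=1, mode = 'fraction')

# Compare the predictions with the held-out test responses
df.test.y = as.numeric(df.test.y)
ridgePred = as.numeric(ridgePred$fit)
rmse = sqrt(mean((ridgePred - df.test.y)^2))

print(paste('Test RMSE = ', rmse))

The figure of 1.299 reported earlier for this step came from comparing the test predictions with df.train.y and dividing the root of the summed squared errors by the number of test rows, so it is not a valid test RMSE. The value printed by the code above should be compared with the resampled training RMSE of 1.467 at lambda = 0.1. A test RMSE above that would mean the resampled estimate was somewhat optimistic.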
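
RMSE is the square root of the mean squared error, computed against the responses of the same rows that were predicted. A minimal sketch with made-up numbers (the helper name `rmse` and the vectors are illustrative):

```r
# Root mean squared error between observed and predicted values.
rmse <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted))
  sqrt(mean((observed - predicted)^2))
}

# Illustrative values only, not taken from the yield data.
observed  <- c(40, 42, 38, 41)
predicted <- c(41, 40, 38, 43)

# Squared errors are 1, 4, 0, 4; their mean is 2.25.
print(rmse(observed, predicted))
```

The length check matters here: R recycles the shorter vector silently when one length is a multiple of the other (132 training rows against 44 test rows), so a mismatched comparison would run without a warning.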

e. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

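Let’s use Boruta to assess variable importance. Boruta is a feature-selection wrapper built around random forests, so this ranking is independent of the ridge model fitted above.

library(Boruta)

boruta.train = Boruta(df.train.x, df.train.y)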
print(boruta.train)
## Boruta performed 99 iterations in 15.04611 secs.
##  30 attributes confirmed important: BiologicalMaterial01,
## BiologicalMaterial02, BiologicalMaterial03, BiologicalMaterial04,
## BiologicalMaterial05 and 25 more;
##  24 attributes confirmed unimportant: BiologicalMaterial07,
## ManufacturingProcess03, ManufacturingProcess04,
## ManufacturingProcess05, ManufacturingProcess07 and 19 more;
##  3 tentative attributes left: BiologicalMaterial10,
## ManufacturingProcess25, ManufacturingProcess43;
# Draw the importance boxplots without the default x-axis, then add vertical labels
plot(boruta.train, xlab = "", xaxt = "n")
lz <- lapply(1:ncol(boruta.train$ImpHistory), function(i)
  boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[, i]), i])
names(lz) <- colnames(boruta.train$ImpHistory)
# Order the labels by median importance to match the plotted order
Labels <- sort(sapply(lz, median))
axis(side = 1, las = 2, labels = names(Labels),
     at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7)

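Process predictors dominate the list. Of the 30 attributes confirmed important, at most 10 can be biological, because BiologicalMaterial07 is rejected and BiologicalMaterial10 is only tentative, which leaves at least 20 process predictors. This is plausible for the yield response, and it is encouraging, because the process predictors are the ones that can be changed.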
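
One way to check which group dominates is to tally the confirmed attributes by name prefix. In the sketch below, `getSelectedAttributes()` is the Boruta accessor that would supply the names from `boruta.train`. The helper `tally_by_family` and the short example vector are illustrative assumptions, not the actual list of 30 confirmed attributes.

```r
# Hypothetical helper: count predictors per family by stripping the trailing digits.
tally_by_family <- function(predictor_names) {
  table(sub("[0-9]+$", "", predictor_names))
}

# With the fitted object this would be:
#   confirmed <- Boruta::getSelectedAttributes(boruta.train, withTentative = FALSE)
# Illustrative stand-in so the sketch runs on its own (not the real result):
confirmed <- c("BiologicalMaterial01", "BiologicalMaterial02",
               "ManufacturingProcess09", "ManufacturingProcess13",
               "ManufacturingProcess32")

counts <- tally_by_family(confirmed)
print(counts)
```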

f. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

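Plot the correlation matrix of the data.

From the correlation matrix, Yield shows its stronger positive correlations with the manufacturing process predictors. Because these predictors can be adjusted, the process settings most strongly correlated with Yield are the natural candidates to tune in future runs.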
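
The correlations with Yield can also be ranked numerically. A minimal sketch, assuming the response is in a column named `Yield`; the helper name `rank_by_correlation` and the simulated data are illustrative:

```r
# Hypothetical helper: correlation of every predictor with the response,
# sorted by absolute value so the strongest relationships come first.
rank_by_correlation <- function(df, response = "Yield") {
  predictors <- df[, setdiff(names(df), response), drop = FALSE]
  r <- cor(predictors, df[[response]], use = "pairwise.complete.obs")[, 1]
  r[order(abs(r), decreasing = TRUE)]
}

# Simulated data: p1 raises the response, p2 lowers it more weakly, p3 is noise.
set.seed(100)
n <- 200
toy <- data.frame(p1 = rnorm(n), p2 = rnorm(n), p3 = rnorm(n))
toy$Yield <- 40 + 2 * toy$p1 - 1 * toy$p2 + rnorm(n)

ranked <- rank_by_correlation(toy)
print(round(ranked, 2))
```

A signed ranking is useful here because the process predictors can be adjusted: a setting with a strong positive correlation is a candidate to raise and one with a strong negative correlation a candidate to lower, subject to confirmation by experiment, since correlation alone does not establish cause.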