Exercise 6

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
library(caret)        # preProcess(), train(), varImp(), defaultSummary()
library(elasticnet)   # enet()
library(visdat)       # vis_dat()
library(ggplot2)
library(dplyr)        # %>% pipe
library(kableExtra)   # kbl(), kable_styling()

data(ChemicalManufacturingProcess)
data <- ChemicalManufacturingProcess

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

Missing Yield & Biological Predictors

Let’s first review the yield and the biological predictors. No values are missing from any of these columns, and all of them are numeric.

yield_and_bio <- data[, c(1:13)]  # Yield plus the 12 biological predictors

vis_dat(yield_and_bio)
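
A one-line numeric check confirms this (it should return 0):

sum(is.na(yield_and_bio))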

Missing Process Predictors

Now let’s review whether any values are missing from the process predictors. There are considerably more missing values among the manufacturing process predictors.

process <- data[, c(14:58)]  # the 45 manufacturing process predictors

vis_dat(process)

The first observation is missing values for many of the manufacturing process predictors:
* ManufacturingProcess01 to ManufacturingProcess08
* ManufacturingProcess10 to ManufacturingProcess12
* ManufacturingProcess22 to ManufacturingProcess24
* ManufacturingProcess40 to ManufacturingProcess41

The following variables have the most missing values, concentrated in observations 1-25:
* ManufacturingProcess03, ManufacturingProcess10, & ManufacturingProcess11

The following variables also have many missing values, concentrated in observations 170-176:
* ManufacturingProcess25 to ManufacturingProcess36, with the exception of ManufacturingProcess32
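
These patterns can be confirmed numerically with base R; a minimal sketch:

sort(colSums(is.na(process)), decreasing = TRUE)[1:10]  # predictors with the most NAs
rowSums(is.na(process))[1]                              # NAs in the first run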

To complete our analysis, we impute the missing values with the preProcess() function using method = "knnImpute". Note that knnImpute also centers and scales every column (the nearest-neighbor distances require standardized data), so the imputed values shown below are in standard-deviation units.

imp <- preProcess(data, method = "knnImpute")  # knnImpute also centers and scales
imputed <- predict(imp, data)

imputed[c(1:5), ] %>% 
  kbl() %>%
  kable_styling()
| Yield      | BiologicalMaterial01 | BiologicalMaterial02 | BiologicalMaterial03 | …   |
|-----------:|---------------------:|---------------------:|---------------------:|-----|
| -1.1792673 | -0.2261036           | -1.514098            | -2.683036            | …   |
|  1.2263678 |  2.2391498           |  1.308996            | -0.056235            | …   |
|  1.0042258 |  2.2391498           |  1.308996            | -0.056235            | …   |
|  0.6737219 |  2.2391498           |  1.308996            | -0.056235            | …   |
|  1.2534583 |  1.4827653           |  1.893939            |  1.135948            | …   |

(First five rows of the imputed data; the remaining 54 predictor columns, the rest of the BiologicalMaterial variables and ManufacturingProcess01 through ManufacturingProcess45, are omitted here for readability. All values are standardized.)
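
As a quick sanity check, we can confirm that no missing values remain after imputation (a one-line check; it should return 0):

sum(is.na(imputed))
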
(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

For this problem, we will use a ridge regression model. Ridge regression is well suited to data sets whose predictors are highly correlated: it adds a penalty on the size of the coefficients to the sum of squared errors. In caret's "ridge" method, the tuning parameter controlling this penalty is lambda.
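
Concretely, ridge regression chooses the coefficients that minimize the penalized sum of squared errors, where a larger $\lambda$ shrinks the coefficients more strongly toward zero:

$$
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2
$$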

Here we start by using 80% of the data set to train our model, working with the imputed data set, imputed. To do this we randomly sample 80% of the observations by row number.

set.seed(123)
ids <- sample(nrow(imputed), nrow(imputed) * 0.8)  # indices for the 80% training split

train_pred <- imputed[ids, -1]     # training predictors
train_output <- imputed[ids, 1]    # training Yield

test_pred <- imputed[-ids, -1]
test_output <- imputed[-ids, 1]
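
A quick check of the split sizes (the training set should contain 140 of the 176 runs, which matches the resampling summary below):

dim(train_pred)       # 140 rows, 57 predictors
length(test_output)   # 36 held-out runs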

Next we determine which value of lambda returns the best model: the best lambda is the one with the smallest cross-validated RMSE.

ctrl <- trainControl(method = "cv", number = 10)
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
set.seed(123)
ridgeRegFit <- train(train_pred, train_output,
                     method = 'ridge',
                     # the candidate lambda penalty values
                     tuneGrid = ridgeGrid,
                     trControl = ctrl,
                     # center and scale the predictors
                     preProc = c("center", "scale"))
ridgeRegFit
## Ridge Regression 
## 
## 140 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 126, 126, 125, 125, 125, 127, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE       Rsquared   MAE      
##   0.000000000  1.0221525  0.3674877  0.7492669
##   0.007142857  0.7394789  0.5192863  0.6083392
##   0.014285714  0.7067733  0.5427204  0.5836053
##   0.021428571  0.7004064  0.5455998  0.5818266
##   0.028571429  0.6990414  0.5459039  0.5802543
##   0.035714286  0.6989843  0.5459852  0.5790184
##   0.042857143  0.6993876  0.5461214  0.5776078
##   0.050000000  0.7000090  0.5463097  0.5763720
##   0.057142857  0.7007663  0.5465197  0.5761176
##   0.064285714  0.7016265  0.5467293  0.5763379
##   0.071428571  0.7025735  0.5469253  0.5767611
##   0.078571429  0.7035971  0.5471011  0.5771880
##   0.085714286  0.7046902  0.5472540  0.5775532
##   0.092857143  0.7058465  0.5473833  0.5778765
##   0.100000000  0.7070609  0.5474898  0.5784133
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.03571429.

This plot shows how RMSE changes as lambda increases. Our best model uses lambda of ~0.0357.

plot(ridgeRegFit)

This ridge regression has a cross-validated R^2 of about 55% and an RMSE of about 0.70. (Because knnImpute standardized every column, including Yield, the RMSE is in standard-deviation units of yield.)

# Row of the tuning results with the smallest cross-validated RMSE
best_ridge_model <- ridgeRegFit$results[which.min(ridgeRegFit$results$RMSE), ]
best_ridge_model

ridge_model <- enet(x = as.matrix(train_pred), y = train_output, 
                    lambda = best_ridge_model$lambda, 
                    normalize = TRUE)
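
Refitting with enet() mirrors the textbook's workflow. Note that caret's "ridge" method itself wraps elasticnet::enet(), so an equivalent set of predictions (with the centering and scaling applied automatically) could also come straight from the tuned object; a minimal sketch, where ridgePred_alt is our own name:

ridgePred_alt <- predict(ridgeRegFit, newdata = test_pred)
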
(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

Here we use our model, ridge_model, to see how well it predicts on the test data. Setting s = 1 with mode = "fraction" asks predict() for the fit at the end of the regularization path, i.e., the pure ridge solution with no additional lasso-style shrinkage.

ridgePred_Test <- predict(ridge_model, newx = as.matrix(test_pred), s = 1, mode = "fraction", type = "fit")
ridgeCoeff_Test <- predict(ridge_model, newx = as.matrix(test_pred), s = 1, mode = "fraction", type = "coefficients")

Now we can compare how our model performed on our train data versus our test data.

model_values <- data.frame(obs = test_output, pred = ridgePred_Test$fit)
defaultSummary(model_values)
##       RMSE   Rsquared        MAE 
## 3.26725126 0.07583988 0.96056057

The R^2 on the test set is only 7.6%, and the test RMSE is 3.27, versus a resampled RMSE of about 0.70 on the training set. Our model performed far worse on the test data than on the training data, which suggests it generalizes poorly to the held-out runs.

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

varImp(ridgeRegFit)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     87.97
## ManufacturingProcess13   85.29
## BiologicalMaterial03     80.45
## ManufacturingProcess17   75.65
## BiologicalMaterial12     66.75
## BiologicalMaterial02     65.96
## ManufacturingProcess09   64.64
## ManufacturingProcess31   62.26
## ManufacturingProcess06   61.40
## ManufacturingProcess36   61.07
## ManufacturingProcess33   50.95
## BiologicalMaterial04     49.13
## ManufacturingProcess29   44.79
## ManufacturingProcess11   44.27
## BiologicalMaterial11     43.55
## ManufacturingProcess30   42.95
## ManufacturingProcess02   42.68
## BiologicalMaterial01     39.24
## ManufacturingProcess27   34.92

ManufacturingProcess32 is the most important variable, followed by BiologicalMaterial06. The manufacturing process predictors dominate the list, holding 13 of the top 20 spots, and appear to be more important in our model.
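
We can tally the two predictor types among the top 20 to make this concrete; a minimal sketch reusing the varImp() scores (the object names imp_scores and top20 are ours):

imp_scores <- varImp(ridgeRegFit)$importance
top20 <- head(imp_scores[order(-imp_scores$Overall), , drop = FALSE], 20)
table(grepl("^Manufacturing", rownames(top20)))  # TRUE = process predictors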

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

ggplot(data = data, aes(x = ManufacturingProcess32, y = Yield)) + geom_point()

cor(data$ManufacturingProcess32, data$Yield)
## [1] 0.6083321

ManufacturingProcess32 appears to take discrete, whole-number values. There is a fairly strong positive linear relationship between it and Yield, with a correlation of about 0.61, so the measurements from ManufacturingProcess32 would be extremely helpful in trying to maximize Yield.

ggplot(data = data, aes(x = BiologicalMaterial06, y = Yield)) + geom_point()

boxplot(data$BiologicalMaterial06)

BiologicalMaterial06 is a continuous variable. There is a linear relationship between BiologicalMaterial06 and Yield, but it is not as strong as the relationship between ManufacturingProcess32 and Yield. We do see outliers within BiologicalMaterial06, which supports using a penalized linear regression method such as lasso or ridge regression. We may also want to consider removing these outliers to build a better model; one way to flag them is sketched below.
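
One way to flag those outliers is the standard 1.5 × IQR boxplot rule; a minimal sketch (the object name out_vals is ours):

# Values flagged as outliers by the boxplot rule
out_vals <- boxplot.stats(data$BiologicalMaterial06)$out
which(data$BiologicalMaterial06 %in% out_vals)

# For comparison with ManufacturingProcess32, the correlation with Yield
cor(data$BiologicalMaterial06, data$Yield)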