6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.

(a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
data <- ChemicalManufacturingProcess
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
Let’s first review the yield and biological predictors. No values are missing from any of these columns, and all of them are numeric.
library(visdat)

# Yield plus the 12 biological predictors
yield_and_bio <- data[, c(1:13)]
vis_dat(yield_and_bio)
Now let’s check whether any values are missing from the process predictors. There are considerably more missing values among the manufacturing process predictors.
# The 45 manufacturing process predictors
process <- data[, c(14:58)]
vis_dat(process)
The first observation is missing many manufacturing process predictors:
* ManufacturingProcess01 to ManufacturingProcess08
* ManufacturingProcess10 to ManufacturingProcess12
* ManufacturingProcess22 to ManufacturingProcess24
* ManufacturingProcess40 and ManufacturingProcess41
The variables with the most missing values fall into two groups. The first group is missing in observations 1-25:
* ManufacturingProcess03, ManufacturingProcess10, & ManufacturingProcess11
The second group is missing in observations 170-176:
* ManufacturingProcess25 to ManufacturingProcess36, with the exception of ManufacturingProcess32
These missing counts can be confirmed numerically, as in the sketch below.
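As a quick numeric cross-check of the visual summary, a minimal sketch using base R:

# Count missing values per column and show only columns with at least one NA
na_counts <- colSums(is.na(data))
sort(na_counts[na_counts > 0], decreasing = TRUE)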
To complete our analysis, we will impute the missing values with caret's preProcess() function, using method = "knnImpute". Note that knnImpute requires the data to be centered and scaled, so preProcess() standardizes every column as part of the imputation; this is why the table of imputed values below shows standardized values.
library(caret)
library(kableExtra)

imp <- preProcess(data, method = "knnImpute")
imputed <- predict(imp, data)

# Display the first five rows of the imputed (and standardized) data
imputed[c(1:5), ] %>%
  kbl() %>%
  kable_styling()
| Yield | BiologicalMaterial01 | BiologicalMaterial02 | BiologicalMaterial03 | BiologicalMaterial04 | BiologicalMaterial05 | BiologicalMaterial06 | BiologicalMaterial07 | BiologicalMaterial08 | BiologicalMaterial09 | BiologicalMaterial10 | BiologicalMaterial11 | BiologicalMaterial12 | ManufacturingProcess01 | ManufacturingProcess02 | ManufacturingProcess03 | ManufacturingProcess04 | ManufacturingProcess05 | ManufacturingProcess06 | ManufacturingProcess07 | ManufacturingProcess08 | ManufacturingProcess09 | ManufacturingProcess10 | ManufacturingProcess11 | ManufacturingProcess12 | ManufacturingProcess13 | ManufacturingProcess14 | ManufacturingProcess15 | ManufacturingProcess16 | ManufacturingProcess17 | ManufacturingProcess18 | ManufacturingProcess19 | ManufacturingProcess20 | ManufacturingProcess21 | ManufacturingProcess22 | ManufacturingProcess23 | ManufacturingProcess24 | ManufacturingProcess25 | ManufacturingProcess26 | ManufacturingProcess27 | ManufacturingProcess28 | ManufacturingProcess29 | ManufacturingProcess30 | ManufacturingProcess31 | ManufacturingProcess32 | ManufacturingProcess33 | ManufacturingProcess34 | ManufacturingProcess35 | ManufacturingProcess36 | ManufacturingProcess37 | ManufacturingProcess38 | ManufacturingProcess39 | ManufacturingProcess40 | ManufacturingProcess41 | ManufacturingProcess42 | ManufacturingProcess43 | ManufacturingProcess44 | ManufacturingProcess45 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -1.1792673 | -0.2261036 | -1.514098 | -2.683036 | 0.2201765 | 0.4941942 | -1.382888 | -0.1313107 | -1.233131 | -3.3962895 | 1.1005296 | -1.838655 | -1.770922 | 0.2154105 | 0.5662872 | 0.3765810 | 0.5655598 | -0.4459347 | -0.5414997 | -0.1596700 | -0.3095182 | -1.7201524 | -0.0770090 | -0.0915734 | -0.4806937 | 0.9771151 | 0.8093999 | 1.1846438 | 0.3303945 | 0.9263296 | 0.1505348 | 0.4563798 | 0.3109942 | 0.2109804 | 0.0583331 | 0.8317688 | 0.8907291 | 0.1200183 | 0.1256347 | 0.3460352 | 0.7826636 | 0.5943242 | 0.7566948 | -0.1952552 | -0.4568829 | 0.9890307 | -1.7202722 | -0.8869472 | -0.6557774 | -1.1540243 | 0.7174727 | 0.2317270 | 0.0596971 | -0.0690077 | 0.2027957 | 2.4056473 | -0.0158806 | 0.6437185 |
| 1.2263678 | 2.2391498 | 1.308996 | -0.056235 | 1.2964386 | 0.4128555 | 1.129077 | -0.1313107 | 2.282619 | -0.7227225 | 1.1005296 | 1.393395 | 1.098985 | -6.1497028 | -1.9692525 | 0.1979962 | -2.3669726 | 0.9993332 | 0.9625383 | -0.9580199 | 0.8941637 | 0.5883746 | 0.5229740 | 1.0820476 | -0.4806937 | -0.5003098 | 0.2775205 | 0.9617071 | 0.1455765 | -0.2753953 | 0.1559773 | 1.5095063 | 0.1849230 | 0.2109804 | -0.7223009 | -1.8147683 | -1.0060115 | 0.1093082 | 0.1966227 | 0.1906613 | 0.8779201 | 0.8347250 | 0.7566948 | -0.2672523 | 1.9517531 | 0.9890307 | 1.9568096 | 1.1463833 | -0.6557774 | 2.2161351 | -0.8224687 | 0.2317270 | 2.1490969 | 2.3462628 | -0.0547226 | -0.0137466 | 0.2946725 | 0.1522024 |
| 1.0042258 | 2.2391498 | 1.308996 | -0.056235 | 1.2964386 | 0.4128555 | 1.129077 | -0.1313107 | 2.282619 | -0.7227225 | 1.1005296 | 1.393395 | 1.098985 | -6.1497028 | -1.9692525 | 0.1087038 | -3.1638563 | 0.0624642 | -0.1117745 | 1.0378549 | 0.8941637 | -0.3815947 | 0.3142842 | 0.5511238 | -0.4806937 | 0.2876502 | 0.4425865 | 0.8245152 | 0.1455765 | 0.3655246 | 0.1831898 | 1.0926437 | 0.1849230 | 0.2109804 | -0.4220571 | -1.2132826 | -0.8335805 | 0.1842786 | 0.2159831 | 0.2104362 | 0.8588688 | 0.7746248 | 0.2444430 | -0.1592567 | 2.6928719 | 0.9890307 | 1.9568096 | 1.2388074 | -1.8000420 | -0.7046697 | -0.8224687 | 0.2317270 | -0.4626528 | -0.4405878 | 0.4088104 | 0.1014627 | -0.0158806 | 0.3979605 |
| 0.6737219 | 2.2391498 | 1.308996 | -0.056235 | 1.2964386 | 0.4128555 | 1.129077 | -0.1313107 | 2.282619 | -0.7227225 | 1.1005296 | 1.393395 | 1.098985 | -6.1497028 | -1.9692525 | 0.4658734 | -3.3232331 | 0.4227984 | 2.1850322 | -0.9580199 | -1.1119728 | -0.4785917 | -0.0248366 | 0.8026141 | -0.4806937 | 0.2876502 | 0.7910592 | 1.0817499 | 0.1967569 | 0.3655246 | 0.1695836 | 0.9829430 | 0.1562704 | 0.2109804 | -0.1218132 | -0.6117969 | -0.6611496 | 0.1708910 | 0.2052273 | 0.1906613 | 0.8588688 | 0.7746248 | 0.2444430 | -0.1592567 | 2.3223125 | 1.7943843 | 0.1182687 | 0.0372939 | -1.8000420 | 0.4187168 | -0.8224687 | 0.2317270 | -0.4626528 | -0.4405878 | -0.3122410 | 0.2166719 | -0.0158806 | -0.0935556 |
| 1.2534583 | 1.4827653 | 1.893939 | 1.135948 | 0.9414412 | -0.3734185 | 1.534835 | -0.1313107 | 1.071310 | -0.1205678 | 0.4162193 | 0.136256 | 1.098985 | -0.2784345 | -1.9692525 | 0.1087038 | -2.2075958 | 0.8453722 | -0.6304083 | 1.0378549 | 0.8941637 | -0.4527258 | -0.3900436 | 0.1040301 | -0.4806937 | 0.0906602 | 2.5334227 | 3.3282665 | 0.4754056 | -0.3555103 | 0.2076811 | 1.6192070 | 0.2938027 | -0.6884239 | 0.7789183 | 0.5911745 | 1.5804530 | 0.2726365 | 0.2912733 | 0.3432102 | 0.8969714 | 0.9549255 | -0.1653585 | -0.1412574 | 2.3223125 | 2.5997378 | 0.1182687 | -2.5505812 | -2.9443066 | -1.8280562 | -0.8224687 | 0.2981503 | -0.4626528 | -0.4405878 | -0.1062263 | 0.2166719 | -0.3264336 | -0.0935556 |
For this problem, we will use a ridge regression model. Ridge regression is useful when the predictors in the data set are highly correlated: it adds a penalty on the size of the coefficients to the sum of squared errors. The strength of the penalty is controlled by the tuning parameter lambda.
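For reference, ridge regression chooses coefficients to minimize the penalized sum of squares

$$\sum_{i=1}^{n}\left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2,$$

so larger values of lambda shrink the coefficients more strongly toward zero (lambda = 0 recovers ordinary least squares).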
We start by using 80% of the data set to train our model, working with the imputed data set, imputed. To do this we randomly sample 80% of the observations by row number.
set.seed(123)
ids <- sample(nrow(imputed), nrow(imputed) * 0.8)

# Column 1 is Yield; the remaining columns are the predictors
train_pred <- imputed[ids, -1]
train_output <- imputed[ids, 1]
test_pred <- imputed[-ids, -1]
test_output <- imputed[-ids, 1]
Next we determine which value of lambda returns the best model, i.e., the one with the lowest cross-validated RMSE.
ctrl <- trainControl(method = "cv", number = 10)
# Candidate lambda penalty values
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
set.seed(123)
ridgeRegFit <- train(train_pred, train_output,
                     method = 'ridge',
                     tuneGrid = ridgeGrid,
                     trControl = ctrl,
                     # center & scale so all predictors are on the same scale
                     preProc = c("center", "scale"))
ridgeRegFit
## Ridge Regression
##
## 140 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 126, 126, 125, 125, 125, 127, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 1.0221525 0.3674877 0.7492669
## 0.007142857 0.7394789 0.5192863 0.6083392
## 0.014285714 0.7067733 0.5427204 0.5836053
## 0.021428571 0.7004064 0.5455998 0.5818266
## 0.028571429 0.6990414 0.5459039 0.5802543
## 0.035714286 0.6989843 0.5459852 0.5790184
## 0.042857143 0.6993876 0.5461214 0.5776078
## 0.050000000 0.7000090 0.5463097 0.5763720
## 0.057142857 0.7007663 0.5465197 0.5761176
## 0.064285714 0.7016265 0.5467293 0.5763379
## 0.071428571 0.7025735 0.5469253 0.5767611
## 0.078571429 0.7035971 0.5471011 0.5771880
## 0.085714286 0.7046902 0.5472540 0.5775532
## 0.092857143 0.7058465 0.5473833 0.5778765
## 0.100000000 0.7070609 0.5474898 0.5784133
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.03571429.
This plot shows how RMSE changes as lambda increases. Our best model uses lambda ~0.0357.
plot(ridgeRegFit)
This ridge regression has a cross-validated R^2 of about 55% and an RMSE of about 0.70.
library(elasticnet)

# Pull out the row of tuning results with the lowest RMSE
best_ridge_model <- ridgeRegFit$results[ridgeRegFit$results$RMSE == min(ridgeRegFit$results$RMSE), ]
best_ridge_model

# Refit at the chosen lambda with enet() so we can call predict() on new data
ridge_model <- enet(x = as.matrix(train_pred), y = train_output,
                    lambda = best_ridge_model$lambda,
                    normalize = TRUE)
Here we use our model, ridge_model, to see how well it predicts on our test data. We set s = 1 and mode = "fraction" so that predict() returns the full ridge solution rather than a partially shrunken point on the path.
ridgePred_Test <- predict(ridge_model, newx = as.matrix(test_pred),
                          s = 1, mode = "fraction", type = "fit")
ridgeCoeff_Test <- predict(ridge_model, newx = as.matrix(test_pred),
                           s = 1, mode = "fraction", type = "coefficients")
Now we can compare how our model performed on our train data versus our test data.
model_values <- data.frame(obs = test_output, pred = ridgePred_Test$fit)
defaultSummary(model_values)
## RMSE Rsquared MAE
## 3.26725126 0.07583988 0.96056057
The R^2 value on the test set is only 7.6% and the RMSE is 3.27. Our model did not perform as well on the testing data as it did on the training data.
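To make the gap explicit, a minimal sketch that reuses the same predict() call on the training predictors and computes the training-set performance:

# Evaluate the same ridge fit on the training data for a side-by-side comparison
ridgePred_Train <- predict(ridge_model, newx = as.matrix(train_pred),
                           s = 1, mode = "fraction", type = "fit")
defaultSummary(data.frame(obs = train_output, pred = ridgePred_Train$fit))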
varImp(ridgeRegFit)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 87.97
## ManufacturingProcess13 85.29
## BiologicalMaterial03 80.45
## ManufacturingProcess17 75.65
## BiologicalMaterial12 66.75
## BiologicalMaterial02 65.96
## ManufacturingProcess09 64.64
## ManufacturingProcess31 62.26
## ManufacturingProcess06 61.40
## ManufacturingProcess36 61.07
## ManufacturingProcess33 50.95
## BiologicalMaterial04 49.13
## ManufacturingProcess29 44.79
## ManufacturingProcess11 44.27
## BiologicalMaterial11 43.55
## ManufacturingProcess30 42.95
## ManufacturingProcess02 42.68
## BiologicalMaterial01 39.24
## ManufacturingProcess27 34.92
ManufacturingProcess32 is the most important variable, followed by BiologicalMaterial06. The manufacturing predictors dominate the list and appear to be more important in our model.
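As a quick check of that claim, a minimal sketch tallying predictor types among the top 20 by importance (it assumes every predictor name starts with one of the two prefixes seen above):

# Tally the 20 most important variables by name prefix
imp_df <- varImp(ridgeRegFit)$importance
top20 <- rownames(imp_df)[order(-imp_df$Overall)][1:20]
table(ifelse(grepl("^Manufacturing", top20), "process", "biological"))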
library(ggplot2)

ggplot(data = data, aes(x = ManufacturingProcess32, y = Yield)) + geom_point()
cor(data$ManufacturingProcess32, data$Yield)
## [1] 0.6083321
ManufacturingProcess32 appears to take discrete, whole-number values. There is a fairly strong positive linear relationship between the two variables, with a correlation of about 0.61. The measurements from ManufacturingProcess32 would be extremely helpful in trying to maximize the Yield.
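The same check extends to the other highly ranked predictors; a minimal sketch (the variable names are taken from the importance table above) of their correlations with Yield:

# Correlations of several top-ranked predictors with Yield; pairwise-complete
# observations are used because some process predictors contain NAs
top_vars <- c("ManufacturingProcess32", "BiologicalMaterial06",
              "ManufacturingProcess13", "BiologicalMaterial03")
cor(data[, top_vars], data$Yield, use = "pairwise.complete.obs")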
ggplot(data=data,aes(x=BiologicalMaterial06,y=Yield)) + geom_point()
boxplot(data$BiologicalMaterial06)
BiologicalMaterial06 is a continuous variable. There is a linear relationship between BiologicalMaterial06 and the Yield, but it isn't as strong as the relationship between ManufacturingProcess32 and the Yield. Here we do see outliers within BiologicalMaterial06. This could support using a penalized linear regression method such as lasso or ridge regression. We may want to remove the outliers in order to create a better model.
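To identify which runs those outliers correspond to, a minimal sketch applying the standard 1.5 * IQR boxplot rule:

# Flag BiologicalMaterial06 outliers using the 1.5 * IQR rule
q <- quantile(data$BiologicalMaterial06, c(0.25, 0.75))
fence <- 1.5 * diff(q)
which(data$BiologicalMaterial06 < q[1] - fence |
        data$BiologicalMaterial06 > q[2] + fence)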