Textbook: Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer, New York, 2013.
# Required R packages
library(tidyverse)
library(AppliedPredictiveModeling)
library(caret)
# Minor R Packages
library(kableExtra)
library(psych)Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
Start R and use these commands to load the data:
The matrix fingerprints contain the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
This function is good for diagnosing predictors that have one unique value or predictors that have very few unique values relative to the number of samples and the ratio of the frequency of the most common value to the frequency of the second most common value is large.
## [1] "As a result, nearly, 35% of the predictors are left for modeling."
Moreover, to build a smaller model without predictors with extremely high correlations, it is best to reduce the number of predictors such that there are no absolute pairwise correlations above 0.90. The list below shows only significant correlations (at 5% level) for the top 10 highest correlations by the correlation coefficient. The results show that these ten have a perfect correlation of 1. There are 152 pairs of perfect correlation.
corr = cor(fingerprints.new)
corr[corr == 1] = NA
corr[abs(corr) < 0.85] = NA
corr = na.omit(reshape::melt(corr))
head(corr[order(-abs(corr$value)),], 10) ## X1 X2 value
## 3575 X133 X16 1
## 3576 X138 X16 1
## 4676 X37 X25 1
## 4687 X49 X25 1
## 4709 X71 X25 1
## 4710 X72 X25 1
## 7385 X25 X37 1
## 7403 X49 X37 1
## 7425 X71 X37 1
## 7426 X72 X37 1
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
For pre-processing, center was used to subtracts the mean of the predictor’s data from the predictor values and scale to divide by the standard deviation. Next, the data is split into training and testing sets with a ratio of 7:3.
set.seed(525)
# Pre-processing
filters = preProcess(fingerprints.new, method = c("center", "scale"))
response = predict(filters, fingerprints.new)
# Indices for 70% of the data set
intrain = createDataPartition(permeability, p = 0.7)[[1]]
# Separate test and training sets
## Predictive variables
train.p = response[intrain,]
test.p = response[-intrain,]
## Response variables
train.r = permeability[intrain,]
test.r = permeability[-intrain,]A PLS model is fitted and it works by successively extracting factors from both predictive and target variables such that covariance between the extracted factors is maximized.
set.seed(525)
# Tuning PLS Model
pls.model = train(train.p, train.r,
method = "pls",
metric = "Rsquared",
tuneLength = 10,
trControl = trainControl(method = "cv"))
pls.model## Partial Least Squares
##
## 117 samples
## 110 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 105, 105, 105, 105, 105, 105, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 11.14305 0.5001935 8.014767
## 2 10.69170 0.5395304 7.845220
## 3 11.16686 0.5061752 8.606873
## 4 11.26578 0.5099554 8.596693
## 5 10.98327 0.5196048 8.451220
## 6 10.94557 0.5223379 8.477930
## 7 11.17571 0.5062397 8.440740
## 8 11.36017 0.4933461 8.748603
## 9 11.52887 0.4820818 8.975561
## 10 11.79531 0.4635016 9.282334
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 2.
From the partial least squared model, the number of components which resulted in the smallest root mean squared error is 2. It has RMSE = 10.69, \(R^2\) = 0.54, and MAE = 7.85. While it does account for the largest portion of the variability in the data than all other latent variables, it also produces the smallest error. Moreover, from the plots with the predictions, the fit with the observed values is a bit further from the 1:1 line, and the residuals are quite large.
Predict the response for the test set. What is the test set estimate of \(R^2\)?
Using the test set of predictors, predictions were determined to see how well it compares to the actual values.
prediction = predict(pls.model, test.p)
xyplot(test.r ~ prediction, type = c("p", "g"),
main = "Predicted vs Observed",
xlab = "Predicted",
ylab = "Observed",
panel = function(x, y) {
panel.xyplot(x, y)
panel.abline(lm(y ~ x))
})## RMSE Rsquared MAE
## 13.1567360 0.4178801 9.0272188
The plot of the predictions versus the actual permeability responses suggests that the fit may not be close but it does a pretty good generalizing the fit of the data. The performance statistics suggest that the model fits the test data with higher error, RMSE = 13.16. It also accounts for only 41.8% of the variability of the data with the mean absolute error is 9.03.
Try building other models discussed in this chapter. Do any have better predictive performance?
Penalized models will be built and compared to the PLS model. The glmnet method in caret has an alpha argument that determines what type of penalized model is fit, i.e. ridge or lasso. If alpha = 0 then a ridge regression model is fit, and if alpha = 1 then a lasso model is fit. Moreover, to tune over various penalties to define the amount of shrinkage. The best lambda is then defined as the lambda that minimizes the cross-validation prediction error rate.
From the ridge model below, the best tune is with a lambda = 99.39 since \(R^2\) was used to select the optimal model using the largest value. It is equal to 0.54, while the RMSE is 11.24. Moreover, from the plots with the predictions, the fit with the observed values is a bit further from the 1:1 line, and the residuals are quite large.
# Ridge Regression
set.seed(525)
lambda = round(seq(80, 120, length = 100),5)
ridge.model = train(x = train.p, y = train.r, method = "glmnet",
trControl = trainControl("cv", number = 10), metric = "Rsquared",
tuneGrid = expand.grid(alpha = 0, lambda = lambda)
)
ridge.model$bestTune## alpha lambda
## 49 0 99.39394
data.frame(rsquared = ridge.model[["results"]][["Rsquared"]][as.numeric(rownames(ridge.model$bestTune))],
rmse = ridge.model[["results"]][["RMSE"]][as.numeric(rownames(ridge.model$bestTune))])## rsquared rmse
## 1 0.5437861 11.2432
From the lasso model below, the best tune is with a lambda = 0.273 since \(R^2\) was also used to select the optimal model using the largest value. It is equal to 0.54, while the RMSE is 10.59. Moreover, from the plots with the predictions, the fit with the observed values is closer to the 1:1 line, and the residuals are smaller than compared to the PLS and ridge models.
# Lasso Regression
set.seed(525)
lambda = round(seq(0, 3, length = 100),5)
lasso.model = train(x = train.p, y = train.r, method = "glmnet",
trControl = trainControl("cv", number = 10), metric = "Rsquared",
tuneGrid = expand.grid(alpha = 1, lambda = lambda)
)
lasso.model$bestTune## alpha lambda
## 10 1 0.27273
data.frame(rsquared = lasso.model[["results"]][["Rsquared"]][as.numeric(rownames(lasso.model$bestTune))],
rmse = lasso.model[["results"]][["RMSE"]][as.numeric(rownames(lasso.model$bestTune))])## rsquared rmse
## 1 0.543528 10.58578
From the elastic net model below, the best tune is with an alpha = 0.1 and lambda = 3.69 since \(R^2\) was also used to select the optimal model using the largest value. It is equal to 0.56, while the RMSE is 10.42. Similar to the appearance of the Lasso model, the elastic net model predictions fit with the observed values closer to the 1:1 line, and the residuals are much smaller than the PLS and Ridge models.
# Elastic Net Regression
set.seed(525)
elastic.model = train(x = train.p, y = train.r, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneLength = 10, metric = "Rsquared"
)
elastic.model$bestTune## alpha lambda
## 9 0.1 3.689559
data.frame(rsquared = elastic.model[["results"]][["Rsquared"]][as.numeric(rownames(elastic.model$bestTune))],
rmse = elastic.model[["results"]][["RMSE"]][as.numeric(rownames(elastic.model$bestTune))])## rsquared rmse
## 1 0.5621277 10.42212
ggplot(elastic.model) + labs(title = "R-squared Error of Elastic Model") + theme(legend.position = "top")By conducting a resampling method, performance metrics were collected and analyzed to determine the model that best fits the training data. The results below suggest that the elastic net model had a mean \(R^2\) = 0.56 from the 10 sample cross-validations. Additionally, the root mean squared error is also the smallest among the models, RMSE = 10.42. It can therefore be stated that the elastic net model best fitted the training data than the PLS, ridge, and LASSO models.
set.seed(525)
summary(resamples(list(pls = pls.model, ridge = ridge.model, lasso = lasso.model, elastic = elastic.model)))##
## Call:
## summary.resamples(object = resamples(list(pls = pls.model, ridge =
## ridge.model, lasso = lasso.model, elastic = elastic.model)))
##
## Models: pls, ridge, lasso, elastic
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 4.516380 6.319551 7.822575 7.845220 9.033235 12.07860 0
## ridge 5.673263 7.267001 7.638879 8.255648 8.842271 11.97652 0
## lasso 4.894922 6.265963 7.734898 7.978629 8.933502 13.80871 0
## elastic 4.578180 5.623109 7.479047 7.533469 8.670289 12.37257 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 6.502382 8.522319 10.326830 10.69170 12.97002 15.47320 0
## ridge 7.108027 10.435837 10.659700 11.24320 11.90559 15.01630 0
## lasso 5.909654 7.997035 10.108416 10.58578 12.73794 17.89737 0
## elastic 6.563718 7.638264 9.796075 10.42212 12.91785 16.43266 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.23821228 0.3059683 0.5509111 0.5395304 0.7687725 0.8201935 0
## ridge 0.20239897 0.3471501 0.5409930 0.5437861 0.7809898 0.8970033 0
## lasso 0.09491479 0.4499906 0.5087114 0.5435280 0.7000218 0.8744062 0
## elastic 0.15299062 0.4041701 0.5708622 0.5621277 0.7216551 0.8982463 0
Now, let’s use the best model, i.e. elastic net model with alpha = 0.1 and lambda = 3.69 to make predictions with the test predictive variables and compare the accuracy to the actual test responses.
accuracy = function(models, predictors, response){
acc = list()
i = 1
for (model in models){
predictions = predict(model, newdata = predictors)
acc[[i]] = postResample(pred = predictions, obs = response)
i = i + 1
}
names(acc) = c("pls", "ridge", "lasso", "elastic")
return(acc)
}
models = list(pls.model, ridge.model, lasso.model, elastic.model)
accuracy(models, test.p, test.r)## $pls
## RMSE Rsquared MAE
## 13.1567360 0.4178801 9.0272188
##
## $ridge
## RMSE Rsquared MAE
## 14.3035711 0.3942777 10.3668457
##
## $lasso
## RMSE Rsquared MAE
## 12.8591783 0.4277793 9.6944967
##
## $elastic
## RMSE Rsquared MAE
## 12.4468982 0.5003148 9.0763473
From the results, it can be concluded that the elastic net model predicted the test response with the best accuracy. It has \(R^2\) = 0.50 and RMSE = 12.45.
Would you recommend any of your models to replace the permeability laboratory experiment?
Typically, for predicting physical processes, \(R^2\) should be greater than 50%, as this only explains about 29% of the standard deviation. From the descriptive statistic, it is clear that the predictions are not as similar to the actual values. The predictive interval is smaller than what some of the actual values can be. Moreover, the predictions have a median of 10.19 whereas the actual data is 4.91. In conclusion, it is not recommended to use this model to replace the permeability laboratory experiment.
rbind(actual = describe(permeability)[,-c(1,6,7)],
prediction = describe(predict(elastic.model, newdata = test.p))[,-c(1,6,7)])## n mean sd median min max range skew kurtosis se
## actual 165 12.24 15.58 4.91 0.06 55.60 55.54 1.40 0.56 1.21
## prediction 48 12.09 8.75 10.19 -2.86 35.78 38.64 1.05 0.64 1.26
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, the manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
Start R and use these commands to load the data:
data(ChemicalManufacturingProcess)
psych::describe(ChemicalManufacturingProcess)[,-c(1,5,6,10,13)] %>%
kable() %>% kable_styling(bootstrap_options = "striped") %>%
scroll_box(width = "100%", height = "300px")| n | mean | sd | mad | min | max | skew | kurtosis | |
|---|---|---|---|---|---|---|---|---|
| Yield | 176 | 40.1765341 | 1.8456664 | 1.9718580 | 35.250 | 46.340 | 0.3109596 | -0.1132944 |
| BiologicalMaterial01 | 176 | 6.4114205 | 0.7139225 | 0.6745830 | 4.580 | 8.810 | 0.2733165 | 0.4567758 |
| BiologicalMaterial02 | 176 | 55.6887500 | 4.0345806 | 4.5812340 | 46.870 | 64.750 | 0.2441269 | -0.7050911 |
| BiologicalMaterial03 | 176 | 67.7050000 | 4.0010641 | 4.2773010 | 56.970 | 78.250 | 0.0285108 | -0.1235203 |
| BiologicalMaterial04 | 176 | 12.3492614 | 1.7746607 | 1.3714050 | 9.380 | 23.090 | 1.7323153 | 7.0564614 |
| BiologicalMaterial05 | 176 | 18.5986364 | 1.8441408 | 1.8829020 | 13.240 | 24.850 | 0.3040053 | 0.2198005 |
| BiologicalMaterial06 | 176 | 48.9103977 | 3.7460718 | 3.9437160 | 40.600 | 59.380 | 0.3685344 | -0.3654933 |
| BiologicalMaterial07 | 176 | 100.0141477 | 0.1077423 | 0.0000000 | 100.000 | 100.830 | 7.3986642 | 53.0417012 |
| BiologicalMaterial08 | 176 | 17.4947727 | 0.6769536 | 0.5930400 | 15.880 | 19.140 | 0.2200539 | 0.0627721 |
| BiologicalMaterial09 | 176 | 12.8500568 | 0.4151757 | 0.4225410 | 11.440 | 14.080 | -0.2684177 | 0.2927765 |
| BiologicalMaterial10 | 176 | 2.8006250 | 0.5991433 | 0.4003020 | 1.770 | 6.870 | 2.4023783 | 11.6471845 |
| BiologicalMaterial11 | 176 | 146.9531818 | 4.8204704 | 4.1142150 | 135.810 | 158.730 | 0.3588211 | 0.0162456 |
| BiologicalMaterial12 | 176 | 20.1998864 | 0.7735440 | 0.6671700 | 18.350 | 22.210 | 0.3038443 | 0.0146595 |
| ManufacturingProcess01 | 175 | 11.2074286 | 1.8224342 | 1.0378200 | 0.000 | 14.100 | -3.9201855 | 21.8688069 |
| ManufacturingProcess02 | 173 | 16.6826590 | 8.4715694 | 1.4826000 | 0.000 | 22.500 | -1.4307675 | 0.1062466 |
| ManufacturingProcess03 | 161 | 1.5395652 | 0.0223983 | 0.0148260 | 1.470 | 1.600 | -0.4799447 | 1.7280557 |
| ManufacturingProcess04 | 175 | 931.8514286 | 6.2744406 | 5.9304000 | 911.000 | 946.000 | -0.6979357 | 0.0631282 |
| ManufacturingProcess05 | 175 | 1001.6931429 | 30.5272134 | 17.3464200 | 923.000 | 1175.300 | 2.5872769 | 11.7446904 |
| ManufacturingProcess06 | 174 | 207.4017241 | 2.6993999 | 1.9273800 | 203.000 | 227.400 | 3.0419007 | 17.3764864 |
| ManufacturingProcess07 | 175 | 177.4800000 | 0.5010334 | 0.0000000 | 177.000 | 178.000 | 0.0793788 | -2.0050587 |
| ManufacturingProcess08 | 175 | 177.5542857 | 0.4984706 | 0.0000000 | 177.000 | 178.000 | -0.2165645 | -1.9642262 |
| ManufacturingProcess09 | 176 | 45.6601136 | 1.5464407 | 1.2157320 | 38.890 | 49.360 | -0.9406685 | 3.2701986 |
| ManufacturingProcess10 | 167 | 9.1790419 | 0.7666884 | 0.5930400 | 7.500 | 11.600 | 0.6492504 | 0.6317264 |
| ManufacturingProcess11 | 166 | 9.3855422 | 0.7157336 | 0.6671700 | 7.500 | 11.500 | -0.0193109 | 0.3227966 |
| ManufacturingProcess12 | 175 | 857.8114286 | 1784.5282624 | 0.0000000 | 0.000 | 4549.000 | 1.5786729 | 0.4951353 |
| ManufacturingProcess13 | 176 | 34.5079545 | 1.0152800 | 0.8895600 | 32.100 | 38.600 | 0.4802776 | 1.9593883 |
| ManufacturingProcess14 | 175 | 4853.8685714 | 54.5236412 | 40.0302000 | 4701.000 | 5055.000 | -0.0109687 | 1.0781378 |
| ManufacturingProcess15 | 176 | 6038.9204545 | 58.3125023 | 40.7715000 | 5904.000 | 6233.000 | 0.6743478 | 1.2162163 |
| ManufacturingProcess16 | 176 | 4565.8011364 | 351.6973215 | 42.9954000 | 0.000 | 4852.000 | -12.4202248 | 158.3981993 |
| ManufacturingProcess17 | 176 | 34.3437500 | 1.2482059 | 1.1860800 | 31.300 | 40.000 | 1.1629715 | 4.6626982 |
| ManufacturingProcess18 | 176 | 4809.6818182 | 367.4777364 | 34.8411000 | 0.000 | 4971.000 | -12.7361378 | 163.7375845 |
| ManufacturingProcess19 | 176 | 6028.1988636 | 45.5785689 | 36.3237000 | 5890.000 | 6146.000 | 0.2973414 | 0.2962151 |
| ManufacturingProcess20 | 176 | 4556.4602273 | 349.0089784 | 42.9954000 | 0.000 | 4759.000 | -12.6383268 | 162.0663905 |
| ManufacturingProcess21 | 176 | -0.1642045 | 0.7782930 | 0.4447800 | -1.800 | 3.600 | 1.7291140 | 5.0274763 |
| ManufacturingProcess22 | 175 | 5.4057143 | 3.3306262 | 4.4478000 | 0.000 | 12.000 | 0.3148909 | -1.0175458 |
| ManufacturingProcess23 | 175 | 3.0171429 | 1.6625499 | 1.4826000 | 0.000 | 6.000 | 0.1967985 | -0.9975572 |
| ManufacturingProcess24 | 175 | 8.8342857 | 5.7994224 | 7.4130000 | 0.000 | 23.000 | 0.3593200 | -1.0207362 |
| ManufacturingProcess25 | 171 | 4828.1754386 | 373.4810865 | 34.0998000 | 0.000 | 4990.000 | -12.6310220 | 160.3293620 |
| ManufacturingProcess26 | 171 | 6015.5964912 | 464.8674900 | 38.5476000 | 0.000 | 6161.000 | -12.6694398 | 160.9849144 |
| ManufacturingProcess27 | 171 | 4562.5087719 | 353.9848679 | 35.5824000 | 0.000 | 4710.000 | -12.5174778 | 158.3931091 |
| ManufacturingProcess28 | 171 | 6.5918129 | 5.2489823 | 1.0378200 | 0.000 | 11.500 | -0.4556335 | -1.7907822 |
| ManufacturingProcess29 | 171 | 20.0111111 | 1.6638879 | 0.4447800 | 0.000 | 22.000 | -10.0848133 | 119.4378857 |
| ManufacturingProcess30 | 171 | 9.1614035 | 0.9760824 | 0.7413000 | 0.000 | 11.200 | -4.7557268 | 43.0848842 |
| ManufacturingProcess31 | 171 | 70.1847953 | 5.5557816 | 0.8895600 | 0.000 | 72.500 | -11.8231008 | 146.0094297 |
| ManufacturingProcess32 | 176 | 158.4659091 | 5.3972456 | 4.4478000 | 143.000 | 173.000 | 0.2112252 | 0.0602714 |
| ManufacturingProcess33 | 171 | 63.5438596 | 2.4833813 | 1.4826000 | 56.000 | 70.000 | -0.1310030 | 0.2740324 |
| ManufacturingProcess34 | 171 | 2.4935673 | 0.0543910 | 0.0000000 | 2.300 | 2.600 | -0.2634497 | 1.0013075 |
| ManufacturingProcess35 | 171 | 495.5964912 | 10.8196874 | 8.8956000 | 463.000 | 522.000 | -0.1556154 | 0.4130958 |
| ManufacturingProcess36 | 171 | 0.0195731 | 0.0008739 | 0.0014826 | 0.017 | 0.022 | 0.1453141 | -0.0557822 |
| ManufacturingProcess37 | 176 | 1.0136364 | 0.4450828 | 0.4447800 | 0.000 | 2.300 | 0.3783578 | 0.0698597 |
| ManufacturingProcess38 | 176 | 2.5340909 | 0.6493753 | 0.0000000 | 0.000 | 3.000 | -1.6818052 | 3.9189211 |
| ManufacturingProcess39 | 176 | 6.8511364 | 1.5054943 | 0.1482600 | 0.000 | 7.500 | -4.2691214 | 16.4987895 |
| ManufacturingProcess40 | 175 | 0.0177143 | 0.0382885 | 0.0000000 | 0.000 | 0.100 | 1.6768073 | 0.8164458 |
| ManufacturingProcess41 | 175 | 0.0237143 | 0.0538242 | 0.0000000 | 0.000 | 0.200 | 2.1686898 | 3.6290714 |
| ManufacturingProcess42 | 176 | 11.2062500 | 1.9416092 | 0.2965200 | 0.000 | 12.100 | -5.4500082 | 28.5288867 |
| ManufacturingProcess43 | 176 | 0.9119318 | 0.8679860 | 0.2965200 | 0.000 | 11.000 | 9.0548747 | 101.0332345 |
| ManufacturingProcess44 | 176 | 1.8051136 | 0.3220062 | 0.1482600 | 0.000 | 2.100 | -4.9703552 | 25.0876065 |
| ManufacturingProcess45 | 176 | 2.1380682 | 0.4069043 | 0.1482600 | 0.000 | 2.600 | -4.0779411 | 18.7565001 |
The matrix ChemicalManufacturingProcess contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
The variable description highlights that some variables have less than n = 176, these are the variables with missing values that must be imputed. Moreover, they’re quite a few variables that are heavily skewed, this suggests further transformation for normality is needed.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
Here is a plot of the portion of missing values per specific variable that needs imputation to make the set complete.
na.counts = as.data.frame(((sapply(ChemicalManufacturingProcess,
function(x) sum(is.na(x))))/ nrow(ChemicalManufacturingProcess))*100)
names(na.counts) = "counts"
na.counts = cbind(variables = rownames(na.counts), data.frame(na.counts, row.names = NULL))
na.counts[na.counts$counts > 0,] %>% arrange(counts) %>% mutate(name = factor(variables, levels = variables)) %>%
ggplot(aes(x = name, y = counts)) + geom_segment( aes(xend = name, yend = 0)) +
geom_point(size = 4, color = "steelblue2") + coord_flip() + theme_bw() +
labs(title = "Proportion of Missing Data", x = "Variables", y = "% of Missing data") +
scale_y_continuous(labels = scales::percent_format(scale = 1))Because these proportions are not too extreme for most of the variables, the imputation by k-Nearest Neighbor is conducted. The distance computation for defining the nearest neighbors is based on Gower distance (Gower, 1971), which can now handle distance variables of the type binary, categorical, ordered, continuous, and semi-continuous. As a result, the data set is now complete.
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Once again, to build a smaller model without predictors with extremely high correlations, it is best to reduce the number of predictors such that there are no absolute pairwise correlations above 0.90. The list below shows only significant correlations (at 5% level) for the top 10 highest correlations by the correlation coefficient. The results show that these ten have a perfect correlation of 1. Afterward, the data is pre-processed to fulfill the assumption of normality using the Yeo-Johnson transformation (Yeo & Johnson, 2000). This technique attempts to find the value of lambda that minimizes the Kullback-Leibler distance between the normal distribution and the transformed distribution. This method has the advantage of working without having to worry about the domain of x.
corr = cor(chemical)
corr[corr == 1] = NA
corr[abs(corr) < 0.85] = NA
corr = na.omit(reshape::melt(corr))
head(corr[order(-abs(corr$value)),], 10)## X1 X2 value
## 2090 ManufacturingProcess26 ManufacturingProcess25 0.9975339
## 2146 ManufacturingProcess25 ManufacturingProcess26 0.9975339
## 2148 ManufacturingProcess27 ManufacturingProcess26 0.9960721
## 2204 ManufacturingProcess26 ManufacturingProcess27 0.9960721
## 2091 ManufacturingProcess27 ManufacturingProcess25 0.9934932
## 2203 ManufacturingProcess25 ManufacturingProcess27 0.9934932
## 1685 ManufacturingProcess20 ManufacturingProcess18 0.9917474
## 1797 ManufacturingProcess18 ManufacturingProcess20 0.9917474
## 2095 ManufacturingProcess31 ManufacturingProcess25 0.9706780
## 2431 ManufacturingProcess25 ManufacturingProcess31 0.9706780
tooHigh = findCorrelation(cor(chemical), 0.90)
chemical = chemical[, -tooHigh]
(pre.process = preProcess(chemical, method = c("YeoJohnson", "center", "scale")))## Created from 176 samples and 47 variables
##
## Pre-processing:
## - centered (47)
## - ignored (0)
## - scaled (47)
## - Yeo-Johnson transformation (41)
##
## Lambda estimates for Yeo-Johnson transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.8404 0.7085 0.8795 0.9698 1.2484 2.7992
Next, the data is split into training and testing data in a ratio of 4:1, and an elastic net model is fitted.
set.seed(525)
intrain = createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.8, list = FALSE)
train.p = chemical[intrain, ]
train.r = ChemicalManufacturingProcess$Yield[intrain]
test.p = chemical[-intrain, ]
test.r = ChemicalManufacturingProcess$Yield[-intrain]
# Elastic Net Regression
elastic.model = train(x = train.p, y = train.r, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneLength = 10, metric = "Rsquared"
)
elastic.model$bestTune## alpha lambda
## 29 0.3 0.4052523
ggplot(elastic.model) + labs(title = "R-squared Error of Elastic Model") + theme(legend.position = "top")From the elastic net model above, the best tune is with an alpha = 0.3 and lambda = 0.40 since \(R^2\) was used to select the optimal model using the largest value. It is equal to 0.60, while the RMSE is 1.17. Moreover, from the plots with the predictions, the fit with the observed values is quite close to the 1:1 line, and the residuals are quite small.
data.frame(rsquared = elastic.model[["results"]][["Rsquared"]][as.numeric(rownames(elastic.model$bestTune))],
rmse = elastic.model[["results"]][["RMSE"]][as.numeric(rownames(elastic.model$bestTune))])## rsquared rmse
## 1 0.6041717 1.173975
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
This example was particularly interesting because the splitting affected how well the model fit the test data. Using limited training data (< 80%) resulted in a very poor fit. The prediction result below suggests that the elastic net model has an \(R^2\) = 0.71, this means that it accounts for nearly 71% of the variability in the testing data and explains almost 50% of the standard deviation, performing better than the training data. Additionally, the root mean squared error is also smaller, RMSE = 1.02. The basic statistical description suggests that the predictions are similar to that of the actual data. It can therefore be stated that the elastic net model fitted the training data well and produce reasonable predictions.
prediction = predict(elastic.model, test.p)
xyplot(test.r ~ prediction, type = c("p", "g"),
main = "Predicted vs Observed",
xlab = "Predicted",
ylab = "Observed",
panel = function(x, y) {
panel.xyplot(x, y)
panel.abline(lm(y ~ x))
})## RMSE Rsquared MAE
## 1.0264508 0.7105380 0.8501965
rbind(actual = describe(ChemicalManufacturingProcess$Yield)[,-c(1,6,7)],
prediction = describe(prediction)[,-c(1,6,7)])## n mean sd median min max range skew kurtosis se
## actual 176 40.18 1.85 39.97 35.25 46.34 11.09 0.31 -0.11 0.14
## prediction 32 40.27 1.43 40.09 37.09 43.36 6.27 0.00 -0.59 0.25
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
The top 20 most important variable out 47 is ranked below. The caret::varImp calculation of variable importance for regression is the relationship between each predictor and the outcome from a linear model fit and the \(R^2\) statistic is calculated for this model against the intercept only null model. This number is returned as a relative measure of variable importance. The list shows that there are a few variables that are shrunk to zero, and the most contribution variable is ManufacturingProcess32. As a result, it shows that Manufacturing Processes dominate the list.
## glmnet variable importance
##
## only 20 most important variables shown (out of 47)
##
## Overall
## ManufacturingProcess32 100.00000
## ManufacturingProcess09 48.10894
## ManufacturingProcess13 42.33765
## ManufacturingProcess17 41.18566
## ManufacturingProcess36 29.06552
## ManufacturingProcess06 28.39314
## BiologicalMaterial06 23.80905
## ManufacturingProcess34 19.33645
## BiologicalMaterial03 14.99996
## ManufacturingProcess30 13.69657
## ManufacturingProcess39 11.23695
## ManufacturingProcess37 9.87180
## ManufacturingProcess15 4.61453
## ManufacturingProcess43 2.59006
## BiologicalMaterial05 2.18062
## ManufacturingProcess07 1.82420
## ManufacturingProcess11 0.06249
## ManufacturingProcess19 0.00000
## ManufacturingProcess21 0.00000
## ManufacturingProcess28 0.00000
From the model, the coefficients of each variable can explain the effect on the target variable. Here, it is clear why ManufacturingProcess32 is the most important variable out of 47, because it has the largest, absolute coefficient.
coeffs = coef(elastic.model$finalModel, elastic.model$bestTune$lambda)
(var = data.frame(cbind(variables = coeffs@Dimnames[[1]][coeffs@i+1], coef = coeffs@x)))## variables coef
## 1 (Intercept) 40.2049689581959
## 2 BiologicalMaterial03 0.083252016351283
## 3 BiologicalMaterial05 0.0121027874973156
## 4 BiologicalMaterial06 0.132143737679476
## 5 ManufacturingProcess06 0.157586151710682
## 6 ManufacturingProcess07 -0.0101245893237502
## 7 ManufacturingProcess09 0.267011760065412
## 8 ManufacturingProcess11 0.00034682448703331
## 9 ManufacturingProcess13 -0.234980200822203
## 10 ManufacturingProcess15 0.0256113311735726
## 11 ManufacturingProcess17 -0.228586487389925
## 12 ManufacturingProcess30 0.0760179761264339
## 13 ManufacturingProcess32 0.555014807434588
## 14 ManufacturingProcess34 0.107320143860194
## 15 ManufacturingProcess36 -0.161317934159577
## 16 ManufacturingProcess37 -0.0547899777019612
## 17 ManufacturingProcess39 0.0623667148787627
## 18 ManufacturingProcess43 0.0143752239743199
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
A positive coefficient indicates that as the value of the predictor increases, the mean of the response variable also tends to increase. A negative coefficient suggests that as the predictor increases, the response variable tends to decrease. The coefficient value signifies how much the mean of the Yield changes given a one-unit shift in the predictor variable while all other variables in the model are held constant. This property of holding the other variables constant is important because it allows for the assessment of the effect of each variable in isolation from the others.
For the positive coefficients, ManufacturingProcess32 improved the yield tremendously, and it also has the highest, positive correlation than the other variables in the model. The ManufacturingProcess32 coefficient in the regression equation is 0.555. This coefficient represents the mean increase of the yield for every additional unit of ManufacturingProcess32.
## variables coef
## 1 (Intercept) 40.2049689581959
## 2 BiologicalMaterial03 0.083252016351283
## 3 BiologicalMaterial05 0.0121027874973156
## 4 BiologicalMaterial06 0.132143737679476
## 5 ManufacturingProcess06 0.157586151710682
## 7 ManufacturingProcess09 0.267011760065412
## 8 ManufacturingProcess11 0.00034682448703331
## 10 ManufacturingProcess15 0.0256113311735726
## 12 ManufacturingProcess30 0.0760179761264339
## 13 ManufacturingProcess32 0.555014807434588
## 14 ManufacturingProcess34 0.107320143860194
## 17 ManufacturingProcess39 0.0623667148787627
## 18 ManufacturingProcess43 0.0143752239743199
## [1] 0.6083321
For the negative coefficients, ManufacturingProcess13 improved the yield tremendously, and it also has the lowest correlation than the other variables in the model. The ManufacturingProcess13 coefficient in the regression equation is -0.235. This coefficient represents the mean decrease of the yield for every additional unit of ManufacturingProcess13.
## variables coef
## 6 ManufacturingProcess07 -0.0101245893237502
## 9 ManufacturingProcess13 -0.234980200822203
## 11 ManufacturingProcess17 -0.228586487389925
## 15 ManufacturingProcess36 -0.161317934159577
## 16 ManufacturingProcess37 -0.0547899777019612
## [1] -0.5036797