library(tidyr)
library(dplyr)
library(knitr)
library(utils)
library(ggplot2)
library(corrplot)
library(purrr)
library(mice)
library(psych)
library(gridExtra)
library(caret)
library(MASS)
library(pls)
library(lars)
library(elasticnet)
library(AppliedPredictiveModeling)
6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
data(permeability)
# Convert to data frames.
fp.full = data.frame(fingerprints)
pb.full = data.frame(permeability)
cat("Number of Samples = ", nrow(fp.full), "\nNumber of Predictors =", ncol(fp.full))
## Number of Samples = 165
## Number of Predictors = 1107
The \(fingerprints\) data frame has 165 rows and 1107 columns, so there are far more predictors (p = 1107) than samples (n = 165). Ordinary multiple linear regression therefore cannot be applied directly; instead we need techniques suited to the p > n setting, such as dimensionality reduction (e.g., PCA/PCR), partial least squares, or penalized regression.
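To see this concretely, the short check below (illustrative only; the object name ols.check is not part of the original analysis) fits ordinary least squares on all 1107 fingerprints. The fit is rank-deficient, so lm() returns NA for the coefficients it cannot estimate.
ols.check = lm(permeability ~ ., data = cbind(pb.full, fp.full))
# Number of coefficients lm() could not estimate because p > n:
sum(is.na(coef(ols.check)))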
zv = nearZeroVar(fp.full)
# Filter out predictors that have near-zero variance.
fp = fp.full[, -zv]
# The response is a single column; keep it as a numeric vector.
pb = pb.full[, 1]
cat(length(zv),
"variables have near-zero variance. ", "After filtering, no. of predictors is ",
ncol(fp))
## 719 variables have near-zero variance. After filtering, no. of predictors is 388
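For reference, nearZeroVar() can also report its underlying diagnostics (frequency ratio and percent unique values) when called with saveMetrics = TRUE; a brief optional inspection might look like this (the name nzv.metrics is illustrative):
nzv.metrics = nearZeroVar(fp.full, saveMetrics = TRUE)
head(nzv.metrics[nzv.metrics$nzv, ])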
set.seed(100)
# 70/30 split: sample() here permutes the indices 1:115, so the training set is
# rows 1-115 (in random order) and the test set is the remaining 50 rows.
index = sample(1:floor(0.7 * nrow(fp)))
trainingData = fp[index,]
trainingY = pb[index]
testData = fp[-index,]
testY = pb[-index]
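Because sample(1:floor(0.7 * nrow(fp))) only shuffles the first 70% of the row indices, the test set is always the last 30% of rows. The original split is kept so the results below are unchanged, but a fully random, outcome-stratified 70/30 split could be sketched with caret::createDataPartition (the alt.* objects are illustrative and not used further):
alt.index = createDataPartition(pb, p = 0.7, list = FALSE)
alt.train = fp[alt.index, ]
alt.test = fp[-alt.index, ]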
# To check if preprocessing is needed we summarize the skewness of each predictor.
sk = psych::describe(trainingData) %>% dplyr::select(vars, skew)
ggplot(sk, aes(vars)) +
geom_line(aes(y=skew)) +
ggtitle("Skewness of Permeability predictor variables") +
xlab("Variable Index") + ylab("Skew")
Since a skewness above roughly 1.0 indicates substantial right skew, a number of predictors are highly right-skewed (and a few are left-skewed). Centering and scaling does not remove skewness, but it does put the binary fingerprint predictors on a common scale so that no single predictor dominates the projections; we therefore center and scale the data when invoking the PLS algorithm.
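To quantify that observation, a simple count of predictors whose sample skewness exceeds 1 in absolute value might look like this (illustrative only):
sum(abs(sk$skew) > 1, na.rm = TRUE)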
ctrl = trainControl(method = "cv", number = 10)
plsTune = train(trainingData, trainingY, method = "pls",
tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))
plot(plsTune)
plsTune
## Partial Least Squares
##
## 115 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 103, 104, 104, 103, 103, 104, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 11.52104 0.3799249 8.743790
## 2 10.78168 0.4502599 7.665984
## 3 11.06183 0.4127380 8.378497
## 4 11.55061 0.3969529 8.741814
## 5 11.45602 0.4370471 8.464968
## 6 11.63693 0.4382304 8.735028
## 7 11.91421 0.4431521 8.987942
## 8 12.00891 0.4305037 9.229677
## 9 12.24457 0.4224444 9.515594
## 10 12.52004 0.4258139 9.598684
## 11 13.06931 0.4091522 9.971984
## 12 13.32178 0.3927331 10.154233
## 13 13.54099 0.3888934 10.286393
## 14 13.85842 0.3639380 10.434880
## 15 13.96797 0.3495088 10.412449
## 16 13.95448 0.3641351 10.339549
## 17 14.10628 0.3616655 10.517164
## 18 14.11821 0.3528987 10.638628
## 19 14.25320 0.3489858 10.683669
## 20 14.50345 0.3378590 10.946146
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.
We find that the optimal model uses only two PLS components, with a resampled \(R^2\) of 0.4502599 and \(RMSE\) of 10.78168.
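The same values can be pulled programmatically from the train object rather than read off the table (a small convenience, not part of the original analysis):
plsTune$bestTune
subset(plsTune$results, ncomp == plsTune$bestTune$ncomp)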
plsPred = predict(plsTune, testData)
plsResult = data.frame(obs = testY, pred = plsPred)
defaultSummary(plsResult)
## RMSE Rsquared MAE
## 13.2729127 0.5485723 10.2893589
The test set estimate of \(R^2\) is 0.5485723, which is higher than the resampled estimate of 0.45 obtained during training, although the test set RMSE (13.27) is worse than the resampled RMSE (10.78). In other words, the model explains somewhat more variance on the test set but with larger absolute errors.
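Equivalently, caret's postResample() computes the same test-set summary directly from the prediction and observation vectors:
postResample(pred = plsPred, obs = testY)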
# Build a predictive model using Principal Component Analysis and the Robust Linear Model.
set.seed(100)
# Using the RLM algorithm, the output shows that the algorithm fails to converge even after
# increasing the number of iterations to 40 and reports that two variables -- X613 and X621 --
# have zero variance. Accordingly, these variables were removed from the training set.
#
trainingData2 = trainingData %>% dplyr::select(-X613, -X621)
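# (Illustrative check, not part of the original analysis.) Confirm which fingerprint
# columns are constant within this particular training split:
names(trainingData)[apply(trainingData, 2, var) == 0]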
rlmPCA <- train(trainingData2, trainingY,
method = "rlm", preProcess = "pca", trControl = ctrl, maxit = 40)
rlmPCA
## Robust Linear Model
##
## 115 samples
## 386 predictors
##
## Pre-processing: principal component signal extraction (386),
## centered (386), scaled (386)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 103, 103, 103, 105, 103, 106, ...
## Resampling results across tuning parameters:
##
## intercept psi RMSE Rsquared MAE
## FALSE psi.huber 16.19835 0.3058460 12.356960
## FALSE psi.hampel 16.20411 0.3317913 12.733633
## FALSE psi.bisquare 16.63757 0.2962591 12.451415
## TRUE psi.huber 12.45713 0.2939160 9.091191
## TRUE psi.hampel 12.19295 0.3091146 9.037580
## TRUE psi.bisquare 13.31820 0.2491322 9.216890
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi
## = psi.hampel.
Since the RLM-with-PCA model failed to converge, its summary results were not used for comparison.
xyplot(testY ~ plsPred, type = c("p", "g"), xlab = "Predicted", ylab = "Observed")
On the basis of the predicted results and the modest \(R^2\) values, I would not recommend replacing the permeability laboratory experiment with this model.
6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
data(ChemicalManufacturingProcess)
# ChemicalManufacturingProcess is a data frame with 58 columns: Yield (the response,
# the percent yield of each run) plus 57 predictors (12 describing the input biological
# material and 45 describing the manufacturing process), for 176 manufacturing runs.
chem.full = ChemicalManufacturingProcess
names(chem.full)
## [1] "Yield" "BiologicalMaterial01"
## [3] "BiologicalMaterial02" "BiologicalMaterial03"
## [5] "BiologicalMaterial04" "BiologicalMaterial05"
## [7] "BiologicalMaterial06" "BiologicalMaterial07"
## [9] "BiologicalMaterial08" "BiologicalMaterial09"
## [11] "BiologicalMaterial10" "BiologicalMaterial11"
## [13] "BiologicalMaterial12" "ManufacturingProcess01"
## [15] "ManufacturingProcess02" "ManufacturingProcess03"
## [17] "ManufacturingProcess04" "ManufacturingProcess05"
## [19] "ManufacturingProcess06" "ManufacturingProcess07"
## [21] "ManufacturingProcess08" "ManufacturingProcess09"
## [23] "ManufacturingProcess10" "ManufacturingProcess11"
## [25] "ManufacturingProcess12" "ManufacturingProcess13"
## [27] "ManufacturingProcess14" "ManufacturingProcess15"
## [29] "ManufacturingProcess16" "ManufacturingProcess17"
## [31] "ManufacturingProcess18" "ManufacturingProcess19"
## [33] "ManufacturingProcess20" "ManufacturingProcess21"
## [35] "ManufacturingProcess22" "ManufacturingProcess23"
## [37] "ManufacturingProcess24" "ManufacturingProcess25"
## [39] "ManufacturingProcess26" "ManufacturingProcess27"
## [41] "ManufacturingProcess28" "ManufacturingProcess29"
## [43] "ManufacturingProcess30" "ManufacturingProcess31"
## [45] "ManufacturingProcess32" "ManufacturingProcess33"
## [47] "ManufacturingProcess34" "ManufacturingProcess35"
## [49] "ManufacturingProcess36" "ManufacturingProcess37"
## [51] "ManufacturingProcess38" "ManufacturingProcess39"
## [53] "ManufacturingProcess40" "ManufacturingProcess41"
## [55] "ManufacturingProcess42" "ManufacturingProcess43"
## [57] "ManufacturingProcess44" "ManufacturingProcess45"
cat("The data set has", nrow(chem.full), "rows and", sum(is.na(chem.full)), "missing values.")
## The data set has 176 rows and 106 missing values.
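Before imputing, it helps to see where the missing values are concentrated; the quick summary below (illustrative) lists the columns with the most NAs:
na.counts = sort(colSums(is.na(chem.full)), decreasing = TRUE)
head(na.counts[na.counts > 0])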
# The MICE library can be used for imputation of missing data.
imputed.1 <- mice(chem.full, m=5, maxit = 5, seed = 500, printFlag=F)
# Keep the second of the five completed (imputed) data sets.
chem = complete(imputed.1, 2)
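# (Sanity check, illustrative.) Confirm that the completed data set has no remaining NAs:
sum(is.na(chem))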
set.seed(100)
# 70/30 split, as before: the first 123 rows (shuffled) form the training set.
index = sample(1:floor(0.7 * nrow(chem)))
trainingData = chem[index,] %>% dplyr::select(-Yield)
trainingY = chem[index,]$Yield
testData = chem[-index,] %>% dplyr::select(-Yield)
testY = chem[-index,]$Yield
# For this problem we choose an OLS model with PCA processing.
ctrl <- trainControl(method = "cv", number = 10)
set.seed(100)
lmFit1 <- train(x = trainingData, y = trainingY,
method = "lm", preProcess = "pca", trControl = ctrl)
summary(lmFit1)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.41548 -0.50439 -0.08832 0.57300 2.43094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.52211 0.08584 472.082 < 2e-16 ***
## PC1 -0.30817 0.02801 -11.003 < 2e-16 ***
## PC2 0.27135 0.03623 7.490 3.52e-11 ***
## PC3 0.22704 0.03855 5.889 5.85e-08 ***
## PC4 -0.13608 0.04737 -2.873 0.005017 **
## PC5 0.28509 0.04912 5.804 8.50e-08 ***
## PC6 -0.08342 0.05143 -1.622 0.108126
## PC7 0.11547 0.05273 2.190 0.030991 *
## PC8 -0.01759 0.05635 -0.312 0.755584
## PC9 -0.31350 0.05856 -5.353 5.99e-07 ***
## PC10 -0.15654 0.06224 -2.515 0.013588 *
## PC11 0.13317 0.06459 2.062 0.041952 *
## PC12 -0.15437 0.06912 -2.233 0.027871 *
## PC13 -0.14907 0.07292 -2.044 0.043674 *
## PC14 -0.21261 0.07305 -2.911 0.004494 **
## PC15 -0.22180 0.07699 -2.881 0.004900 **
## PC16 -0.02903 0.08325 -0.349 0.728071
## PC17 0.14229 0.08937 1.592 0.114652
## PC18 -0.09485 0.08957 -1.059 0.292305
## PC19 -0.33653 0.09349 -3.600 0.000508 ***
## PC20 -0.19197 0.09762 -1.967 0.052148 .
## PC21 -0.21414 0.09998 -2.142 0.034772 *
## PC22 -0.26422 0.10736 -2.461 0.015663 *
## PC23 -0.09398 0.11137 -0.844 0.400868
## PC24 -0.01211 0.11540 -0.105 0.916672
## PC25 -0.13917 0.12193 -1.141 0.256560
## PC26 0.13988 0.12432 1.125 0.263356
## PC27 0.27330 0.13063 2.092 0.039091 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.952 on 95 degrees of freedom
## Multiple R-squared: 0.7937, Adjusted R-squared: 0.735
## F-statistic: 13.53 on 27 and 95 DF, p-value: < 2.2e-16
lmPred = predict(lmFit1, testData)
lmResult = data.frame(obs = testY, pred = lmPred)
defaultSummary(lmResult)
## RMSE Rsquared MAE
## 5.43779339 0.04074351 3.25249500
p1 = xyplot(testY ~ lmPred, type = c("p", "g"), xlab = "Predicted", ylab = "Observed")
# Residuals on the test set (observed minus predicted).
p2 = xyplot((testY - lmPred) ~ lmPred, type = c("p", "g"), xlab = "Predicted", ylab = "Residuals")
grid.arrange(p1, p2, ncol=2, nrow=1)
# To identify which predictors drive the principal components, we can examine the
# PCA pre-processing stored in the LM model.
lmFit1$preProcess
## Created from 123 samples and 57 variables
##
## Pre-processing:
## - centered (57)
## - ignored (0)
## - principal component signal extraction (57)
## - scaled (57)
##
## PCA needed 27 components to capture 95 percent of the variance
rot = lmFit1$preProcess$rotation
# Loadings of the first 10 predictors on the first 4 principal components:
rot[1:10,1:4]
## PC1 PC2 PC3 PC4
## BiologicalMaterial01 -0.24414112 0.002841420 -0.093145051 0.041660633
## BiologicalMaterial02 -0.28533000 0.080126938 -0.072748512 -0.036361864
## BiologicalMaterial03 -0.19063359 0.130255258 -0.007997775 -0.180652612
## BiologicalMaterial04 -0.24749288 -0.004208954 -0.097306538 -0.008616674
## BiologicalMaterial05 -0.12355626 -0.067379849 -0.107109954 0.047886870
## BiologicalMaterial06 -0.26592377 0.085802466 -0.061300955 -0.043836489
## BiologicalMaterial07 0.02821910 0.035888228 0.088372949 0.072540870
## BiologicalMaterial08 -0.25615824 0.076885976 -0.028259858 0.041870187
## BiologicalMaterial09 -0.04167002 0.127086920 0.035417195 -0.184801313
## BiologicalMaterial10 -0.19329028 -0.044790465 -0.107697018 0.017421899
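To turn these loadings into a ranking of influential predictors, one could order the absolute loadings on a given component; the short sketch below is illustrative and uses only objects already created above.
# Predictors with the largest absolute loadings on the first component:
head(sort(abs(rot[, "PC1"]), decreasing = TRUE), 10)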