Exercises from Kuhn and Johnson Applied Predictive Modeling, Chapter 6 Linear Regression and Its Cousins. All R code is displayed at the end for clarity.
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
Start R and use these commands to load the data:

> library(AppliedPredictiveModeling)
> data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\) ?
Predict the response for the test set. What is the test set estimate of \(R^2\)?
Try building other models discussed in this chapter. Do any have better predictive performance?
Would you recommend any of your models to replace the permeability laboratory experiment?
We load the permeability dataset below and display the dimensions of its predictors and response.
library(AppliedPredictiveModeling)
data("permeability")## [1] "Dimensions of fingerprints: rows: 165 columns: 1107"
## [1] "Dimensions of permability: rows: 165 columns: 1"
We use nearZeroVar to identify predictors with very low variance or large imbalance between the highest frequency values. The default algorithm uses a 95/5 imbalance ratio between the most common and second most common value.
For a binary 0-1 variable \(X\) with mean \(p = E[X]\), the most common value occurs with frequency \(\max(p, 1-p)\), so the 95/5 ratio condition \(\max(p,1-p)/\min(p,1-p) \geq 19\) is equivalent to \(\max(p, 1-p) \geq 0.95\), i.e.

\[ | E[X] - 0.5 | \geq 0.45. \]

That is, near zero variance is equivalent to a mean close to 0 or close to 1.
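As a quick check of this equivalence (a minimal sketch, not part of the original analysis; the variable names are illustrative), a binary predictor with a 192/8 split should be flagged under the default freqCut of 95/5, while a balanced one should not:

library(caret)
# Hedged sketch: a 192/8 split (mean 0.04) gives a frequency ratio of
# 192/8 = 24 >= 19, tripping the default 95/5 freqCut; a 50/50 split does not.
x_rare   <- rep(c(1, 0), times = c(8, 192))
x_common <- rep(c(1, 0), times = c(100, 100))
nearZeroVar(data.frame(x_rare, x_common), saveMetrics = TRUE)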
The first 5 near-zero-variance predictors and their summary statistics are displayed below.
| Name | fingerprints |
| Number of rows | 165 |
| Number of columns | 1107 |
| Column type frequency: | |
| numeric | 5 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| X7 | 0 | 1 | 1.00 | 0.00 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
| X8 | 0 | 1 | 1.00 | 0.00 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
| X9 | 0 | 1 | 0.02 | 0.13 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| X10 | 0 | 1 | 0.02 | 0.15 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| X13 | 0 | 1 | 0.96 | 0.20 | 0 | 1 | 1 | 1 | 1 | ▁▁▁▁▇ |
## [1] "Dimensions of fingerprint predictors retained: rows: 165 columns: 388"
We conclude that 719 predictors were removed and 388 predictors are left for modeling.
We look at the distribution of the response variable permeability below. Since it is skewed to the right, a Box-Cox transformation may help in linear-type regression.
We investigate the Guerrero algorithm from fabletools and compare the resulting transformation to a log transform. As shown below, the Guerrero-selected transform with \(\lambda = -0.148\) produces a more symmetric density plot (in blue) than the log transform with \(\lambda = 0\) (in green). Therefore, we choose to use \(\lambda = -0.148\).
## [1] -0.1486984
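For reference, the transformation applied here is the Box-Cox family (for positive \(y\), which permeability is; fabletools::box_cox uses a sign-adjusted variant that coincides with this on positive data):

\[ w = \begin{cases} \log(y), & \lambda = 0 \\ \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0. \end{cases} \]

A negative \(\lambda\) such as \(-0.148\) compresses the long right tail somewhat more aggressively than the log.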
To split the data into training and test sets, we use an 80%-20% split. We use a stratified random sampling approach by forming quintiles on the response variable permeability; within each quintile, we divide the observations into training and test partitions. Therefore, the summary statistics of the test and training data should look reasonably similar.
The test data will be used in 6.2d. The training data will be used in this section to choose a model and the number of PLS components.
# When using caret::createDataPartition with a continuous variable,
# the default behavior is to take a stratified random sample
# within 5 groups defined by quintiles of the variable, here permeability.
set.seed(10023)
train_indices = createDataPartition(bcperm , times = 1, list = FALSE, p = 0.8 )

The summary statistics of the training data containing 133 observations are below.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.4933 0.4243 1.4170 1.2086 2.2498 3.0250
## [1] 133
These are similar to those of the test data containing 32 observations:
## [1] 32
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.6453 0.5139 1.4281 1.3740 2.2753 2.9718
We now tune the PLS model using 10-fold cross-validation repeated 9 times.
# perm_df defines the permeability dataframe joined to the reduced set of "big-variance" predictors
perm_df = data.frame( cbind(bcperm, fingerprints_bv ) )
perm_train = perm_df[train_indices,]
perm_trainY = perm_train[,1]
perm_trainX = perm_train[,2:ncol(perm_train)]
perm_test = perm_df[ -train_indices,]
perm_testY = perm_test[,1]
perm_testX = perm_test[,2:ncol(perm_test)]

The model specification and training parameters below are used to find the optimal number of PLS components.

repeats = 9
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = repeats, selectionFunction = "oneSE")
set.seed(102030)
plsTune <- train(x = perm_trainX, y = perm_trainY,
method = "pls",
tuneGrid = expand.grid(ncomp = 1:20),
trControl = ctrl ,
preProcess = c("center", "scale")
)
plsTune

## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold, repeated 9 times)
## Summary of sample sizes: 120, 119, 121, 120, 120, 121, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.1141205 0.3143435 0.8807533
## 2 1.0618425 0.3738377 0.8347404
## 3 1.0436316 0.4032896 0.7988877
## 4 1.0574966 0.4032326 0.8202256
## 5 1.0504322 0.4331503 0.8261032
## 6 1.0312813 0.4515938 0.8001931
## 7 1.0049849 0.4780387 0.7695146
## 8 0.9770128 0.5103656 0.7495774
## 9 0.9554692 0.5376859 0.7339096
## 10 0.9591361 0.5406371 0.7396557
## 11 0.9730300 0.5400550 0.7454174
## 12 0.9921338 0.5286890 0.7627678
## 13 1.0094432 0.5205338 0.7761608
## 14 1.0230539 0.5153133 0.7869745
## 15 1.0408789 0.5024099 0.8031697
## 16 1.0588834 0.4920740 0.8209528
## 17 1.0831165 0.4782526 0.8364812
## 18 1.1023218 0.4686852 0.8466800
## 19 1.1116160 0.4693260 0.8475908
## 20 1.1291208 0.4613194 0.8582763
##
## RMSE was used to select the optimal model using the one SE rule.
## The final value used for the model was ncomp = 8.
The model output above states that ncomp equal to 8 is the best answer according to the one standard error rule introduced by Leo Breiman. The rule is set with the selectionFunction parameter in the trainControl tuning setup prior to model tuning.
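To see the rule mechanically, here is a sketch assuming the plsTune object above (num is the total resample count, 10 folds × 9 repeats = 90):

# Hedged sketch: apply caret's oneSE selection function directly to the
# resampling table; it returns the row index of the chosen tuning value.
best_row <- caret::oneSE(plsTune$results, metric = "RMSE",
                         num = 90, maximize = FALSE)
plsTune$results$ncomp[best_row]  # 8, matching the printed model summary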
In the plots below, we show the RMSE and \(R^2\) from the above tuning exercise plotted against the different number of PLS components or latent variables.
The cross-validated RMSE reaches its minimum at ncomp equal to 9 and \(R^2\) peaks near ncomp equal to 10, but both curves are nearly flat in that region, so the one-SE choice of ncomp equal to 8 gives up little accuracy. We agree with this finding. The corresponding cross-validation \(R^2\) is 0.5103656.
Next, we compute \(R^2\) and \(RMSE\) of the test data below.
pred_perm = predict(plsTune, newdata = perm_testX)
(testRMSE = caret::RMSE(pred_perm, perm_testY ))

## [1] 0.9154151

(testR2 = caret::R2(pred_perm, perm_testY) )

## [1] 0.5405468
The test set RMSE of the model is 0.9154151 and the test set \(R^2\) is 0.5405468. We plot the diagnostic plots below.
We now apply the same caret model fitting approach to PCR and elastic net models to compare predictive performance. A table of the test data results at the end of this section shows that elastic net is the top performing model.

Using the same training/test data split and cross-validation settings, we observe that PCR requires more components: it needs \(ncomp=11\) latent variables to achieve a comparable level of accuracy, whereas PLS regression requires only 8.
set.seed(102030)
pcrTune <- train(x = perm_trainX, y = perm_trainY,
method = "pcr",
tuneGrid = expand.grid(ncomp = 1:20),
trControl = ctrl ,
preProcess = c("center", "scale")
)
pcrTune

## Principal Component Analysis
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold, repeated 9 times)
## Summary of sample sizes: 120, 119, 121, 120, 120, 121, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.293126 0.07920901 1.0906939
## 2 1.296688 0.07994117 1.0925286
## 3 1.210043 0.20196183 1.0031270
## 4 1.145841 0.26716680 0.9388025
## 5 1.142884 0.27844073 0.9286222
## 6 1.150580 0.27240294 0.9330602
## 7 1.135464 0.28994583 0.9095592
## 8 1.090564 0.33981803 0.8674796
## 9 1.102164 0.32826835 0.8783589
## 10 1.078221 0.35024938 0.8574133
## 11 1.063008 0.37340710 0.8376561
## 12 1.044617 0.39567077 0.8173886
## 13 1.049565 0.38835493 0.8163060
## 14 1.060419 0.37781904 0.8176979
## 15 1.070093 0.36982634 0.8230622
## 16 1.078642 0.36160264 0.8282541
## 17 1.091929 0.34890402 0.8388649
## 18 1.093206 0.34732713 0.8443113
## 19 1.098423 0.34154747 0.8478375
## 20 1.108231 0.33413103 0.8520652
##
## RMSE was used to select the optimal model using the one SE rule.
## The final value used for the model was ncomp = 11.
We conclude that the PCR model prefers ncomp = 11 latent variables, three more than the PLS model's 8. This is expected: PLS chooses directions that maximize covariance with the response, while PCR's components summarize the predictors without reference to the response, so PCR typically needs more of them. The plots above for both models confirm that PLS outperforms on cross-validated RMSE in addition to being the more parsimonious model.
pred_perm_pcr = predict(pcrTune, newdata = perm_testX)
(testRMSE_pcr = caret::RMSE(pred_perm_pcr, perm_testY ))

## [1] 0.9518364

(testR2_pcr = caret::R2(pred_perm_pcr, perm_testY) )

## [1] 0.2733075
The test set RMSE of the model is 0.9518364 and the test set \(R^2\) is 0.2733075. We plot the diagnostic plots below. Both look reasonable.
Lastly, we consider the elastic net model using the elasticnet package. We make some alterations in the training procedure:
set.seed(102030)
ctrl <- trainControl(method = "cv", selectionFunction = "best")
enetGrid <- expand.grid(lambda = c(0, 0.05, .1 , 0.5, 0.8 ),
fraction = seq(.01, 1, length = 10))
enetTune <- train(x = perm_trainX, y = perm_trainY ,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl ,
preProcess = c("center", "scale")
)

We display the parameter tunings below, which show that elastic net outperforms the PLS and PCR regressions. The final elastic net model uses the ridge penalty \(\lambda = 0.1\) and the lasso fraction \(s = 0.12\).
## Elasticnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 121, 120, 120, 121, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.01 1.1704448 0.4450415 0.9524403
## 0.00 0.12 0.9469834 0.5643682 0.7574892
## 0.00 0.23 1.0022209 0.5514359 0.7872867
## 0.00 0.34 1.0918181 0.5301740 0.8540700
## 0.00 0.45 1.1860356 0.5027960 0.9304146
## 0.00 0.56 1.2873792 0.4838903 1.0034147
## 0.00 0.67 1.3983954 0.4586070 1.1211482
## 0.00 0.78 1.4963757 0.4368210 1.1994294
## 0.00 0.89 1.5939157 0.4202430 1.2808476
## 0.00 1.00 1.6725596 0.4105048 1.3382308
## 0.05 0.01 1.1983976 0.4256382 0.9710976
## 0.05 0.12 1.2870747 0.5416545 0.9654491
## 0.05 0.23 1.6818005 0.5410243 1.2627384
## 0.05 0.34 2.0753044 0.5408564 1.5484362
## 0.05 0.45 2.4854776 0.5305450 1.8570758
## 0.05 0.56 2.8553998 0.5187479 2.1129468
## 0.05 0.67 3.2196498 0.5094695 2.3669359
## 0.05 0.78 3.5790047 0.5074803 2.6374493
## 0.05 0.89 3.9390078 0.5061804 2.9070814
## 0.05 1.00 4.3001080 0.5051146 3.1775626
## 0.10 0.01 1.1988166 0.4369915 0.9993257
## 0.10 0.12 0.8498910 0.6087014 0.6557170
## 0.10 0.23 0.8744997 0.5841962 0.6739327
## 0.10 0.34 0.8810256 0.5803549 0.6742691
## 0.10 0.45 0.8976766 0.5780301 0.6905788
## 0.10 0.56 0.9120180 0.5697710 0.7075800
## 0.10 0.67 0.9225326 0.5617471 0.7173633
## 0.10 0.78 0.9278022 0.5587795 0.7223083
## 0.10 0.89 0.9295182 0.5582120 0.7229917
## 0.10 1.00 0.9319942 0.5570152 0.7244517
## 0.50 0.01 1.2205188 0.4369915 1.0213279
## 0.50 0.12 0.8832301 0.5795634 0.6860241
## 0.50 0.23 0.8930349 0.5853319 0.7053811
## 0.50 0.34 0.9475531 0.5684995 0.7483258
## 0.50 0.45 0.9738020 0.5646285 0.7660949
## 0.50 0.56 0.9976230 0.5618253 0.7815977
## 0.50 0.67 1.0158279 0.5571365 0.7935008
## 0.50 0.78 1.0249255 0.5567606 0.8037094
## 0.50 0.89 1.0290046 0.5597098 0.8102308
## 0.50 1.00 1.0323044 0.5617373 0.8156342
## 0.80 0.01 1.2202535 0.4369915 1.0208854
## 0.80 0.12 0.8894920 0.5684689 0.6897530
## 0.80 0.23 0.9177546 0.5833350 0.7330459
## 0.80 0.34 1.0240080 0.5577688 0.8173089
## 0.80 0.45 1.0754851 0.5486166 0.8551375
## 0.80 0.56 1.1138559 0.5445552 0.8876842
## 0.80 0.67 1.1408920 0.5407047 0.9070979
## 0.80 0.78 1.1577136 0.5396857 0.9231084
## 0.80 0.89 1.1677908 0.5415257 0.9340646
## 0.80 1.00 1.1738798 0.5446962 0.9421534
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.12 and lambda = 0.1.
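For reference, the enet parameterization can be read as follows (a standard description of the elasticnet package's tuning parameters, not output from this analysis): lambda is the quadratic ridge penalty, and fraction is the point \(s\) along the \(L_1\) regularization path,

\[ \hat{\beta}(\lambda, s) = \underset{\beta}{\arg\min} \; \| y - X\beta \|_2^2 + \lambda \| \beta \|_2^2 \quad \text{subject to} \quad \| \beta \|_1 \leq s \, \| \beta^{full} \|_1 , \]

where \(\beta^{full}\) is the unpenalized solution at the end of the path. The selected \(s = 0.12\) keeps only a small fraction of the full model's coefficient mass, which is why most coefficients shrink to exactly zero below.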
Below, we compute the out of sample test results and then plot model diagnostics.
pred_enet = predict(enetTune, newdata = perm_testX)
(testRMSE_enet = caret::RMSE(pred_enet, perm_testY ))

## [1] 0.7277125

(testR2_enet = caret::R2(pred_enet, perm_testY) )

## [1] 0.568133
The above plots look reasonable.
Finally, we observe that the elastic net model has reduced the predictor dimension from 388 to only 45 predictors with non-zero coefficients. The predictors and their elastic net coefficients are listed below in descending order of coefficient value.
coefs = predict.enet(enetTune$finalModel, s=enetTune$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients
nonzero_coefs = coefs[ abs(coefs) > 0.00001 ]
var_names = names(nonzero_coefs)
rownames(nonzero_coefs) <- NULL
enet_coefs = data.frame( names = var_names, coefs = nonzero_coefs )
dim(enet_coefs)

## [1] 45 2
For completeness, we list the coefficients in tabular format too.
Elastic Net Non-zero Coefficients
| names | coefs |
|---|---|
| X6 | 0.494 |
| X698 | 0.080 |
| X719 | 0.080 |
| X141 | 0.050 |
| X133 | 0.027 |
| X138 | 0.027 |
| X16 | 0.027 |
| X315 | 0.024 |
| X316 | 0.007 |
| X318 | 0.007 |
| X317 | 0.007 |
| X108 | 0.007 |
| X103 | 0.007 |
| X340 | 0.001 |
| X344 | 0.001 |
| X553 | 0.000 |
| X551 | 0.000 |
| X554 | 0.000 |
| X304 | -0.001 |
| X305 | -0.001 |
| X298 | -0.001 |
| X280 | -0.001 |
| X281 | -0.001 |
| X300 | -0.001 |
| X299 | -0.001 |
| X293 | -0.009 |
| X613 | -0.015 |
| X621 | -0.015 |
| X258 | -0.024 |
| X245 | -0.028 |
| X244 | -0.028 |
| X240 | -0.028 |
| X253 | -0.028 |
| X246 | -0.028 |
| X254 | -0.028 |
| X239 | -0.028 |
| X157 | -0.028 |
| X99 | -0.029 |
| X307 | -0.031 |
| X295 | -0.031 |
| X294 | -0.031 |
| X308 | -0.031 |
| X271 | -0.034 |
| X158 | -0.050 |
| X93 | -0.277 |
To summarize our findings in the table below, elastic net outperforms PLS and PCR by both RMSE and R-Squared on the test dataset.
All Models on Test DataSet
| model | RMSE | RSquared | NumVars |
|---|---|---|---|
| Partial Least Squares | 0.92 | 0.54 | 8 |
| Principal Components Regression | 0.95 | 0.27 | 11 |
| Elastic Net | 0.73 | 0.57 | 45 |
None of the three models produced here should be relied upon for production purposes for several reasons:
The small test set sample size relative to the number of predictors and model complexity.
The predictors are anonymized in our data set. There is no way to interpret the model for reasonableness.
From a business point of view, we don't know the accuracy requirement or the costs of Type I vs. Type II errors.
The original paper by Kansy, Senner and Gubernator (1997) cited by the dataset does not share useful statistical analyses or a sufficient sample size to convince this reader of its model accuracy. For example, it states that when the drugs are binned into 3 groups by human absorption (a proxy for permeability), "nearly 80% of the compounds would be correctly predicted in their in vivo absorption ability (Table 1)." While the authors conclude that the PAMPA method used together with CACO-2 cell measurements could be a promising high throughput method to predict absorption, Table 1 contains only 25 compounds, as shown below:
PAMPA Table 1
One possible way to judge the utility of the model is to bin the permeability into 3 classes of permeability (low, medium and high). By transforming a regression problem into a classification problem, we can evaluate whether the new model achieves the 80% accuracy level that appears acceptable to the authors of the original paper.
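A hedged sketch of that idea, assuming the perm_trainY, perm_testY and pred_enet objects from above (all on the Box-Cox scale; the tertile cut points are our assumption, not the paper's):

# Sketch: bin observed and predicted permeability into three classes using
# training-set tertiles, then compute the share of exact-bin agreement to
# compare against the paper's ~80% benchmark.
cuts   <- quantile(perm_trainY, probs = c(0, 1/3, 2/3, 1))
bin3   <- function(v) cut(v, breaks = c(-Inf, cuts[2], cuts[3], Inf),
                          labels = c("low", "medium", "high"))
obs_3  <- bin3(perm_testY)
pred_3 <- bin3(pred_enet)
mean(obs_3 == pred_3)                        # overall 3-class accuracy
table(observed = obs_3, predicted = pred_3)  # where the misclassifications fall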
The experimental equipment should continue to be used, in tandem with ongoing assessment of the model's predictive accuracy based on the information provided.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors),
measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
data(chemicalManufacturing)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
There is a minor typo in the 6.3 assignment: the dataset is named ChemicalManufacturingProcess rather than the name given above. The authors' book errata webpage acknowledges this error.
We save the predictors which are columns 2 to 58 of ChemicalManufacturingProcess into a dataframe chemX of \(N=176\) rows and \(M=57\) columns and the response in chemY in a dataframe of \(N=176\) rows and \(M=1\) columns.
data("ChemicalManufacturingProcess")
chemX = ChemicalManufacturingProcess[2:58]
chemY = ChemicalManufacturingProcess[1]
dim(chemX)

## [1] 176 57

dim(chemY)

## [1] 176 1
Due to the long variable names, we will abbreviate all predictors as follows:
BiologicalMaterial will be abbreviated to B, and ManufacturingProcess will be abbreviated to M.

chemX %>%
  rename_with(~ str_replace(.x, "ManufacturingProcess", "M"),
              starts_with("ManufacturingProcess")) %>%
  rename_with(~ str_replace(.x, "BiologicalMaterial", "B"),
              starts_with("Biological")) -> chemX2

Although exercise 6.3b asks us to do imputation before splitting the data, that is a bad practice because the imputation process can alter the training data based on properties of the test data. The predictions would then have look-ahead bias, and the model accuracy would be artificially inflated.
We perform data splitting in 6.3c before imputation and preprocessing. Then we will tune a PLS model.
We split the \(N\) observations into a 75-25 train/test split based on stratified random sampling of observations based on their response variable value Yield. Yield has no missing data which is helpful for stratified random sampling. Also, no transformation of the response Yield is warranted as the histogram looks symmetric. This is shown in the skim output of Yield below and its histogram.
| Name | chemY |
| Number of rows | 176 |
| Number of columns | 1 |
| Column type frequency: | |
| numeric | 1 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Yield | 0 | 1 | 40.18 | 1.85 | 35.25 | 38.75 | 39.97 | 41.48 | 46.34 | ▁▇▇▃▁ |
We split the data into 75/25 train/test sets below and put into the necessary dataframes.
set.seed(19372)
train_indices = createDataPartition(chemY$Yield , times = 1, list = FALSE, p = 0.75 )

We drop the near-zero-variance predictor B07 from both the train and test data sets. The number of predictors declines from \(M=57\) to \(M=56\) as a result. The results are shown below.
## [1] "Training data removed variables:B07"
## [1] "Dimension of train data before and after near zero variance check: "
## [1] 132 57
## [1] 132 56
## [1] "Test data removed variables:B07"
## [1] "Dimension of test data before and after near zero variance check: "
## [1] 44 57
## [1] 44 56
Next we identify and fix zeros and outliers in the predictors before imputing missing values. The reason is that the kNN algorithm may be affected by outliers. The table below uses skim to report the minimum, maximum, median, mean and standard deviation for each predictor. In addition, we construct z-score metrics that tell us how many standard deviations the minimum and maximum lie from the mean. For example, the values of the predictor \(M18\) are clustered around 4800 except for observation 13, which has a value of 0.

\[ \text{zscore}_{\min}(M18) = \frac{ \min(M18) - \mu(M18) }{\sigma(M18)}, \qquad \text{zscore}_{\max}(M18) = \frac{ \max(M18) - \mu(M18) }{\sigma(M18)} \]

We identify 16 predictors with this issue in the table below and replace the zero outliers with the median.
sk_train_bv = skim(chemX_train_bv)
sk_train_bv %>%
mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
) %>%
filter( zscore_extreme > 4 , numeric.p0 == 0 ) %>%
arrange( desc(zscore_extreme)) %>% yank("numeric")

Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist | zscore_min | zscore_max | zscore_extreme |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M18 | 0 | 1.00 | 4800.09 | 423.29 | 0 | 4813.00 | 4834.00 | 4862.0 | 4971.0 | ▁▁▁▁▇ | -11.34 | 0.40 | 11.34 |
| M20 | 0 | 1.00 | 4545.26 | 402.00 | 0 | 4550.00 | 4579.00 | 4605.5 | 4759.0 | ▁▁▁▁▇ | -11.31 | 0.53 | 11.31 |
| M16 | 0 | 1.00 | 4554.23 | 404.48 | 0 | 4554.75 | 4585.50 | 4619.0 | 4852.0 | ▁▁▁▁▇ | -11.26 | 0.74 | 11.26 |
| M26 | 3 | 0.98 | 6001.38 | 534.47 | 0 | 6018.00 | 6041.00 | 6069.0 | 6161.0 | ▁▁▁▁▇ | -11.23 | 0.30 | 11.23 |
| M25 | 3 | 0.98 | 4816.81 | 429.30 | 0 | 4829.00 | 4852.00 | 4876.0 | 4990.0 | ▁▁▁▁▇ | -11.22 | 0.40 | 11.22 |
| M27 | 3 | 0.98 | 4550.56 | 406.49 | 0 | 4556.00 | 4584.00 | 4607.0 | 4696.0 | ▁▁▁▁▇ | -11.19 | 0.36 | 11.19 |
| M31 | 3 | 0.98 | 70.07 | 6.36 | 0 | 70.10 | 70.80 | 71.4 | 72.5 | ▁▁▁▁▇ | -11.01 | 0.38 | 11.01 |
| M29 | 3 | 0.98 | 19.93 | 1.87 | 0 | 19.70 | 19.90 | 20.4 | 22.0 | ▁▁▁▁▇ | -10.65 | 1.11 | 10.65 |
| M43 | 0 | 1.00 | 0.91 | 0.98 | 0 | 0.50 | 0.75 | 1.0 | 11.0 | ▇▁▁▁▁ | -0.92 | 10.27 | 10.27 |
| M30 | 3 | 0.98 | 9.15 | 1.05 | 0 | 8.80 | 9.10 | 9.7 | 11.2 | ▁▁▁▃▇ | -8.74 | 1.96 | 8.74 |
| M01 | 1 | 0.99 | 11.32 | 1.45 | 0 | 10.80 | 11.40 | 12.2 | 14.1 | ▁▁▁▅▇ | -7.80 | 1.91 | 7.80 |
| M42 | 0 | 1.00 | 11.28 | 1.75 | 0 | 11.40 | 11.60 | 11.7 | 12.1 | ▁▁▁▁▇ | -6.45 | 0.47 | 6.45 |
| M44 | 0 | 1.00 | 1.82 | 0.29 | 0 | 1.80 | 1.90 | 1.9 | 2.1 | ▁▁▁▁▇ | -6.22 | 0.96 | 6.22 |
| M45 | 0 | 1.00 | 2.14 | 0.37 | 0 | 2.10 | 2.20 | 2.3 | 2.6 | ▁▁▁▂▇ | -5.72 | 1.21 | 5.72 |
| M39 | 0 | 1.00 | 6.85 | 1.51 | 0 | 7.10 | 7.20 | 7.3 | 7.5 | ▁▁▁▁▇ | -4.54 | 0.43 | 4.54 |
| M38 | 0 | 1.00 | 2.55 | 0.62 | 0 | 2.00 | 3.00 | 3.0 | 3.0 | ▁▁▁▅▇ | -4.09 | 0.73 | 4.09 |
Re-running the same check after the replacement returns an empty table, confirming that no flagged zeros remain:

## # A tibble: 0 x 15
## # … with 15 variables: skim_type <chr>, skim_variable <chr>, n_missing <int>,
## #   complete_rate <dbl>, numeric.mean <dbl>, numeric.sd <dbl>,
## #   numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>, numeric.p75 <dbl>,
## #   numeric.p100 <dbl>, numeric.hist <chr>, zscore_min <dbl>, zscore_max <dbl>,
## #   zscore_extreme <dbl>
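As an aside, the per-column ifelse calls shown in the appendix can be written more compactly; here is a hedged sketch using dplyr::across (chemX_train_nz_alt is an illustrative name, and na.rm = TRUE guards the flagged columns that also contain NAs):

# Sketch: replace outlier zeros with the column median for the flagged set.
zero_vars <- c("M18","M20","M16","M26","M25","M27","M31","M29",
               "M43","M30","M01","M42","M44","M45","M39","M38")
chemX_train_nz_alt <- chemX_train_bv %>%
  mutate(across(all_of(zero_vars),
                ~ ifelse(.x == 0, median(.x, na.rm = TRUE), .x)))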
We also observe that the variance of the affected predictors has gone down significantly. For example, the standard deviation of the predictor M18 fell from 423 to 44 (nearly 10 times smaller).
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| M18 | 0 | 1 | 4836.71 | 44.14 | 4701 | 4813.75 | 4834 | 4862 | 4971 | ▁▃▇▃▁ |
For the test dataset, 5 predictors have the same issue. So we apply the same changes.
sk_test_bv = skim(chemX_test_bv)
sk_test_bv %>%
mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
) %>%
filter( zscore_extreme > 4 , numeric.p0 == 0 ) %>%
arrange( desc(zscore_extreme)) %>% yank("numeric")

Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist | zscore_min | zscore_max | zscore_extreme |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M39 | 0 | 1 | 6.86 | 1.52 | 0 | 7.10 | 7.20 | 7.30 | 7.4 | ▁▁▁▁▇ | -4.52 | 0.35 | 4.52 |
| M42 | 0 | 1 | 10.98 | 2.44 | 0 | 11.38 | 11.50 | 11.62 | 12.0 | ▁▁▁▁▇ | -4.51 | 0.42 | 4.51 |
| M44 | 0 | 1 | 1.76 | 0.40 | 0 | 1.80 | 1.80 | 1.90 | 2.1 | ▁▁▁▁▇ | -4.42 | 0.85 | 4.42 |
| M45 | 0 | 1 | 2.12 | 0.50 | 0 | 2.10 | 2.20 | 2.30 | 2.5 | ▁▁▁▁▇ | -4.28 | 0.77 | 4.28 |
| M01 | 0 | 1 | 10.86 | 2.63 | 0 | 10.67 | 11.35 | 12.00 | 14.1 | ▁▁▁▆▇ | -4.13 | 1.23 | 4.13 |
We apply the same treatment of zero values to the test dataset; re-running the check likewise returns no remaining flagged predictors:

## NULL
To summarize, we identified 16 predictors in the training set and 5 predictors in the test set with outlier zeros. By replacing these values with the medians of their respective datasets, we reduced the variance of these predictors significantly. We chose not to apply the same outlier treatment to maximum values because we cannot be certain they are data errors.
As we will show, the amount of missing data is small and its occurrence follows the same pattern in the test and training sets. We will do exploratory data analysis and data imputation on the training and test partitions separately.
Looking column-wise in the test data, 14 predictors have missing data. The 3 most impacted predictors are M03 (86.4% complete), M11 (93.2% complete) and M10 (95.5% complete). Only manufacturing process predictors have missing data, which suggests that future industrial applications should fix these issues at the factory. The skim output below displays the affected variables.
skim(chemX_test_nz) %>% filter( n_missing > 0 ) %>% arrange( desc( n_missing) ) %>% yank("numeric")

Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| M03 | 6 | 0.86 | 1.54 | 0.03 | 1.48 | 1.52 | 1.54 | 1.55 | 1.60 | ▁▂▇▁▁ |
| M11 | 3 | 0.93 | 9.41 | 0.73 | 7.70 | 9.00 | 9.40 | 9.90 | 11.30 | ▂▇▇▅▁ |
| M10 | 2 | 0.95 | 9.21 | 0.81 | 7.50 | 8.83 | 9.20 | 9.60 | 11.40 | ▃▇▇▅▁ |
| M25 | 2 | 0.95 | 4863.07 | 36.61 | 4785.00 | 4844.75 | 4866.50 | 4883.75 | 4966.00 | ▂▅▇▂▁ |
| M26 | 2 | 0.95 | 6059.26 | 40.39 | 5973.00 | 6038.00 | 6052.50 | 6073.75 | 6137.00 | ▁▃▇▁▃ |
| M27 | 2 | 0.95 | 4599.21 | 43.36 | 4518.00 | 4575.00 | 4594.00 | 4621.50 | 4710.00 | ▃▇▆▂▂ |
| M28 | 2 | 0.95 | 7.68 | 4.93 | 0.00 | 0.00 | 10.50 | 10.95 | 11.40 | ▃▁▁▁▇ |
| M29 | 2 | 0.95 | 20.26 | 0.68 | 19.30 | 19.70 | 20.05 | 20.70 | 21.70 | ▇▇▅▁▅ |
| M30 | 2 | 0.95 | 9.19 | 0.73 | 7.60 | 8.72 | 9.15 | 9.80 | 10.90 | ▂▇▆▇▁ |
| M31 | 2 | 0.95 | 70.53 | 1.17 | 67.90 | 69.70 | 70.70 | 71.40 | 72.30 | ▂▂▃▇▅ |
| M33 | 2 | 0.95 | 63.40 | 2.24 | 58.00 | 62.25 | 63.50 | 64.75 | 68.00 | ▂▂▇▃▁ |
| M34 | 2 | 0.95 | 2.51 | 0.06 | 2.40 | 2.50 | 2.50 | 2.50 | 2.60 | ▂▁▇▁▂ |
| M35 | 2 | 0.95 | 496.12 | 11.05 | 463.00 | 490.00 | 497.00 | 504.75 | 517.00 | ▁▃▇▇▅ |
| M36 | 2 | 0.95 | 0.02 | 0.00 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | ▁▇▁▇▃ |
Next, we look row-wise at which observations have the most missing values in its predictors columns and display the results in order. A total of 9 observations have missing data points across the 14 predictors.
chemX_test_nz %>% # The dataframe
filter_all( any_vars(is.na(.))) %>% # include any rows where any column is NA
select_if( ~ any(is.na(.))) -> missing_test_nz
# only include those columns where some row has NA in that column
missing_test_nz %>% kable(caption = "missing test data - rows") %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

missing test data - rows
| M03 | M10 | M11 | M25 | M26 | M27 | M28 | M29 | M30 | M31 | M33 | M34 | M35 | M36 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | NA | NA | NA | 4869 | 6107 | 4630 | 11.2 | 21.4 | 9.9 | 68.7 | 66 | 2.6 | 508 | 0.019 |
| 3 | NA | NA | NA | 4897 | 6116 | 4637 | 11.1 | 21.3 | 9.4 | 69.3 | 66 | 2.6 | 509 | 0.018 |
| 15 | NA | 9.4 | 10.2 | 4874 | 6125 | 4659 | 11.3 | 21.6 | 9.9 | 68.5 | 64 | 2.5 | 490 | 0.019 |
| 17 | NA | 9.0 | 10.0 | 4860 | 6100 | 4659 | 11.1 | 21.3 | 10.0 | 68.7 | 68 | 2.5 | 495 | 0.019 |
| 18 | NA | 9.5 | 10.6 | 4815 | 6055 | 4631 | 10.8 | 20.9 | 10.6 | 69.0 | 64 | 2.5 | 497 | 0.020 |
| 19 | NA | 8.9 | 9.8 | 4876 | 6109 | 4696 | 11.1 | 21.4 | 9.8 | 68.8 | 64 | 2.4 | 514 | 0.021 |
| 98 | 1.55 | 9.1 | NA | 4869 | 6050 | 4574 | 10.5 | 19.9 | 8.8 | 71.3 | 64 | 2.5 | 493 | 0.019 |
| 173 | 1.56 | 9.5 | 9.9 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 176 | 1.55 | 9.9 | 10.3 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
missing_test_nz %>%
mutate( num_na = rowSums( is.na(.) ) ) -> missing_summary_test_nz
num_na_test_cells = sum(missing_summary_test_nz$num_na)
missing_summary_test_nz %>% # add a count of columns with NA values
select(num_na ) %>% # pull out the count
arrange( desc(num_na)) %>% # sort by the count descending
bind_rows(., tibble( num_na = num_na_test_cells ) ) %>%
kable(caption = "Test Data with missing predictor values") %>%
kable_styling(bootstrap_options = c("hover", "striped"),
position = "left"
)

Test Data with missing predictor values
| num_na | |
|---|---|
| 173 | 11 |
| 176 | 11 |
| 2 | 3 |
| 3 | 3 |
| 15 | 1 |
| 17 | 1 |
| 18 | 1 |
| 19 | 1 |
| 98 | 1 |
| …10 | 33 |
pct_missing_test = num_na_test_cells/ ( nrow(chemX_test_nz) * ncol(chemX_test_nz)) * 100

Tallying the NA values across all cells in the test data, we found 33 missing cells, which equals 33 / (44 × 56) ≈ 1.34 percent of the test data. Looking at the table above, of the 9 observations with missing values, two observations (173 and 176) have 11 predictors with missing values each; the rest are much less affected.
In the training data, 28 manufacturing predictors have missing data. All biological predictors have complete data. The 3 most impacted predictors are M03, M10, M11 with 93.2%, 94.7%, 94.7% completeness rates. This is consistent with the pattern in the test data. The most common completeness rate is 99.2% (1 missing value).
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| M03 | 9 | 0.93 | 1.54 | 0.02 | 1.47 | 1.54 | 1.54 | 1.55 | 1.60 | ▁▂▆▇▁ |
| M10 | 7 | 0.95 | 9.17 | 0.76 | 7.70 | 8.70 | 9.00 | 9.50 | 11.60 | ▃▇▆▁▁ |
| M11 | 7 | 0.95 | 9.38 | 0.72 | 7.50 | 9.00 | 9.40 | 9.90 | 11.50 | ▂▆▇▅▁ |
| M25 | 4 | 0.97 | 4854.45 | 40.40 | 4721.00 | 4829.00 | 4852.50 | 4876.00 | 4990.00 | ▁▃▇▃▁ |
| M26 | 4 | 0.97 | 6048.27 | 45.79 | 5901.00 | 6018.75 | 6041.50 | 6069.50 | 6161.00 | ▁▂▇▃▂ |
| M27 | 4 | 0.97 | 4586.11 | 46.99 | 4416.00 | 4556.75 | 4584.50 | 4607.25 | 4696.00 | ▁▁▇▇▂ |
| M29 | 4 | 0.97 | 20.09 | 0.61 | 18.90 | 19.70 | 19.90 | 20.40 | 22.00 | ▂▇▂▁▁ |
| M30 | 4 | 0.97 | 9.22 | 0.66 | 7.50 | 8.80 | 9.15 | 9.70 | 11.20 | ▁▅▇▃▁ |
| M31 | 4 | 0.97 | 70.62 | 1.36 | 59.80 | 70.18 | 70.80 | 71.40 | 72.50 | ▁▁▁▂▇ |
| M02 | 3 | 0.98 | 16.89 | 8.35 | 0.00 | 19.50 | 21.10 | 21.50 | 22.50 | ▂▁▁▁▇ |
| M28 | 3 | 0.98 | 6.24 | 5.32 | 0.00 | 0.00 | 10.30 | 10.70 | 11.50 | ▆▁▁▁▇ |
| M33 | 3 | 0.98 | 63.59 | 2.56 | 56.00 | 62.00 | 64.00 | 65.00 | 70.00 | ▁▅▇▆▂ |
| M34 | 3 | 0.98 | 2.49 | 0.05 | 2.30 | 2.50 | 2.50 | 2.50 | 2.60 | ▁▂▁▇▁ |
| M35 | 3 | 0.98 | 495.43 | 10.78 | 468.00 | 490.00 | 495.00 | 501.00 | 522.00 | ▁▂▇▃▁ |
| M36 | 3 | 0.98 | 0.02 | 0.00 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | ▂▇▇▁▂ |
| M01 | 2 | 0.98 | 11.41 | 1.06 | 9.00 | 10.80 | 11.40 | 12.20 | 14.10 | ▃▅▇▆▁ |
| M06 | 2 | 0.98 | 207.52 | 2.91 | 203.00 | 205.70 | 206.80 | 208.70 | 227.40 | ▇▅▁▁▁ |
| M04 | 1 | 0.99 | 932.34 | 6.14 | 911.00 | 928.00 | 934.00 | 937.00 | 946.00 | ▁▂▃▇▁ |
| M05 | 1 | 0.99 | 999.84 | 26.93 | 923.00 | 987.50 | 999.20 | 1008.10 | 1175.30 | ▁▇▁▁▁ |
| M07 | 1 | 0.99 | 177.50 | 0.50 | 177.00 | 177.00 | 177.00 | 178.00 | 178.00 | ▇▁▁▁▇ |
| M08 | 1 | 0.99 | 177.53 | 0.50 | 177.00 | 177.00 | 178.00 | 178.00 | 178.00 | ▇▁▁▁▇ |
| M12 | 1 | 0.99 | 868.13 | 1794.45 | 0.00 | 0.00 | 0.00 | 0.00 | 4549.00 | ▇▁▁▁▂ |
| M14 | 1 | 0.99 | 4852.15 | 52.70 | 4713.00 | 4826.50 | 4858.00 | 4881.00 | 4992.00 | ▁▂▇▃▁ |
| M22 | 1 | 0.99 | 5.53 | 3.34 | 0.00 | 3.00 | 5.00 | 8.00 | 12.00 | ▇▆▇▅▅ |
| M23 | 1 | 0.99 | 3.01 | 1.61 | 0.00 | 2.00 | 3.00 | 4.00 | 6.00 | ▇▆▇▆▇ |
| M24 | 1 | 0.99 | 8.99 | 5.76 | 0.00 | 4.00 | 8.00 | 14.00 | 23.00 | ▆▇▃▅▁ |
| M40 | 1 | 0.99 | 0.02 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.10 | ▇▁▁▁▂ |
| M41 | 1 | 0.99 | 0.02 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.20 | ▇▁▁▁▁ |
chemX_train_nz %>% # The dataframe
filter_all( any_vars(is.na(.))) %>% # include any rows where any column is NA
select_if( ~ any(is.na(.))) -> missing_train_nz # only include those columns where some row has NA in that column
missing_train_nz %>% kable(caption = "missing training data - rows") %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

missing training data - rows
| M01 | M02 | M03 | M04 | M05 | M06 | M07 | M08 | M10 | M11 | M12 | M14 | M22 | M23 | M24 | M25 | M26 | M27 | M28 | M29 | M30 | M31 | M33 | M34 | M35 | M36 | M40 | M41 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4898 | NA | NA | NA | 4873 | 6074 | 4685 | 10.7 | 21.0 | 9.9 | 69.1 | 66 | 2.4 | 486 | 0.019 | NA | NA |
| 4 | NA | 0.0 | NA | 911 | 1014.6 | 213.3 | 177 | 177 | NA | NA | 0 | 4897 | 5 | 2 | 5 | 4892 | 6111 | 4630 | 11.1 | 21.3 | 9.4 | 69.3 | 68 | 2.5 | 496 | 0.018 | 0.0 | 0.0 |
| 5 | 10.7 | 0.0 | NA | 918 | 1027.5 | 205.7 | 178 | 178 | NA | NA | 0 | 4992 | 8 | 4 | 18 | 4930 | 6151 | 4684 | 11.3 | 21.6 | 9.0 | 69.4 | 70 | 2.5 | 468 | 0.017 | 0.0 | 0.0 |
| 6 | 12.0 | 0.0 | NA | 924 | 1016.8 | 208.9 | 178 | 178 | NA | NA | 0 | 4985 | 9 | 1 | 1 | 4871 | 6128 | 4687 | 11.4 | 21.7 | 10.1 | 68.2 | 70 | 2.5 | 490 | 0.018 | 0.0 | 0.0 |
| 16 | 11.1 | 0.0 | NA | 928 | 1073.6 | 207.1 | 177 | 177 | 9.6 | 10.2 | 0 | 4805 | 12 | 4 | 16 | 4848 | 6095 | 4630 | 11.1 | 21.2 | 10.3 | 68.5 | 66 | 2.5 | 493 | 0.019 | 0.1 | 0.2 |
| 20 | 12.7 | 0.0 | NA | 929 | 1175.3 | 204.1 | 177 | 177 | 8.9 | 9.9 | 0 | 4961 | 4 | 3 | 4 | 4850 | 6094 | 4665 | 11.0 | 21.2 | 10.1 | 68.7 | 62 | 2.5 | 490 | 0.020 | 0.0 | 0.0 |
| 22 | 10.9 | 0.0 | NA | 936 | 1024.4 | 218.7 | 178 | 178 | NA | NA | 0 | 4791 | 6 | 2 | 16 | 4918 | 6152 | 4618 | 11.4 | 21.1 | 8.8 | 70.1 | 68 | 2.4 | 479 | 0.018 | 0.0 | 0.0 |
| 23 | 11.1 | 0.0 | NA | 937 | 997.7 | 209.6 | 177 | 177 | NA | NA | 0 | NA | 7 | 3 | 17 | 4906 | 6134 | 4626 | 11.2 | 21.0 | 8.9 | 70.1 | 66 | 2.4 | 492 | 0.019 | 0.0 | 0.0 |
| 24 | 11.3 | 0.0 | NA | 940 | 1031.3 | 215.1 | 177 | 177 | NA | NA | 0 | 4864 | 1 | 1 | 1 | 4875 | 6095 | 4606 | 11.2 | 20.9 | 9.5 | 69.7 | 65 | 2.5 | 468 | 0.018 | 0.0 | 0.0 |
| 90 | 10.8 | 21.5 | 1.55 | 935 | 954.8 | NA | 178 | 178 | 8.4 | 8.3 | 0 | 4866 | 8 | 3 | 3 | 4897 | 6047 | 4572 | 10.3 | 19.7 | 8.1 | 72.2 | 64 | 2.5 | 493 | 0.020 | 0.0 | 0.0 |
| 108 | 9.8 | 21.9 | 1.50 | 934 | 987.9 | 208.9 | 178 | 178 | 9.4 | 9.3 | 0 | 4807 | 11 | 5 | 5 | NA | NA | NA | 0.0 | NA | NA | NA | 62 | 2.5 | 485 | 0.019 | 0.0 | 0.0 |
| 134 | 10.9 | NA | 1.55 | 934 | 1000.1 | 208.7 | 177 | 178 | 8.7 | 9.4 | 0 | 4875 | 2 | 2 | 13 | 4816 | 6037 | 4568 | 0.0 | 19.8 | 9.6 | 70.6 | 62 | 2.5 | 495 | 0.020 | 0.0 | 0.0 |
| 139 | 10.0 | NA | 1.55 | 935 | 986.1 | 205.2 | 177 | 178 | 9.5 | 9.5 | 0 | 4823 | 8 | 2 | 2 | 4822 | 6018 | 4606 | 0.0 | 19.8 | 9.6 | 70.6 | 60 | 2.6 | 497 | 0.020 | 0.0 | 0.0 |
| 172 | 12.8 | 21.5 | 1.54 | 935 | 1027.0 | 206.2 | 178 | 177 | 9.6 | 9.6 | 0 | 4829 | 2 | 2 | 8 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.0 | 0.0 |
| 174 | 13.0 | 20.4 | 1.55 | 930 | 1040.0 | 208.7 | 178 | 177 | 10.1 | 10.4 | 0 | 4795 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.0 | 0.0 |
| 175 | 14.1 | 21.6 | 1.55 | 935 | 1044.8 | 208.0 | 177 | 177 | 10.2 | 10.3 | 0 | 4793 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.0 | 0.0 |
missing_train_nz %>%
mutate( num_na = rowSums( is.na(.) ) ) -> missing_summary_train_nz
num_na_cells_train = sum(missing_summary_train_nz$num_na)
missing_summary_train_nz %>% # add a count of columns with NA values
select(num_na ) %>% # pull out the count
arrange( desc(num_na)) %>% # sort by the count descending
bind_rows(., tibble( num_na = num_na_cells_train ) ) %>%
kable(caption = "Training Data with missing predictor values") %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

Training Data with missing predictor values
| num_na | |
|---|---|
| 1 | 16 |
| 172 | 11 |
| 174 | 11 |
| 175 | 11 |
| 108 | 6 |
| 4 | 4 |
| 23 | 4 |
| 5 | 3 |
| 6 | 3 |
| 22 | 3 |
| 24 | 3 |
| 16 | 1 |
| 20 | 1 |
| 90 | 1 |
| 134 | 1 |
| 139 | 1 |
| …17 | 80 |
pct_missing_train = num_na_cells_train/ ( nrow(chemX_train_nz) * ncol(chemX_train_nz)) * 100

We conclude that a small and similar percentage of values is missing in the train and test data sets: 1.08% of cells in the training set and 1.34% of cells in the test set.
Given the small percentage of missing data and the prior treatment of zero values, we expect kNN imputation to work well. We will use kNN imputation to replace those values via caret's preProcess functionality.
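For intuition, here is a minimal sketch of the idea behind kNN imputation (not caret's exact implementation; caret's knnImpute automatically triggers centering and scaling and defaults to k = 5 neighbors):

# Sketch: impute one missing cell (row, col) of a scaled numeric matrix as
# the mean of that column over the k complete rows nearest in Euclidean
# distance; columns missing in the target row are skipped via na.rm.
impute_one <- function(x_scaled, row, col, k = 5) {
  complete <- x_scaled[stats::complete.cases(x_scaled), , drop = FALSE]
  d <- sqrt(rowSums(sweep(complete[, -col, drop = FALSE], 2,
                          x_scaled[row, -col], "-")^2, na.rm = TRUE))
  mean(complete[order(d)[1:k], col])
}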
# Build the caret function to preprocess the Chemical data and impute missing values.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(chemX_train_nz, method = c("center", "scale", "knnImpute"))
# Becomes the source data for the model building
chemX_train_pp = predict( preProcFunc, chemX_train_nz )
# Becomes the final version of test data for prediction
chemX_test_pp = predict( preProcFunc, chemX_test_nz )

Next, we visualize the histogram and density of all the scaled, imputed variables constructed above using caret. The plot below shows that, after imputation and removal of zeros, no predictors have odd densities or extremely long tails, either of which would indicate problems with outliers.
Next, we compare the density plots of each of the in-scope predictors between the training and test datasets. We conclude that both data sets are similar. Most importantly, the test data was not used to adjust or impute the training data.
We use a partial least squares model (PLS) for prediction of the Yield response. We use standard training arguments shown below to fit the model. We use repeated cross validation with 9 repeats of a 10-fold cross validation process on the training data. Lastly, we use Rsquared as the performance metric and the oneSE rule to choose the number of latent variables (a.k.a. components) to use in prediction.
repeats = 9
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = repeats, selectionFunction = "oneSE")
set.seed(102030)
plsTune <- train(x = chemX_train_pp, y = chemY_train,
method = "pls",
tuneGrid = expand.grid(ncomp = 1:20),
trControl = ctrl ,
preProcess = c("center", "scale") ,
metric = "Rsquared"
)
plsTune

## Partial Least Squares
##
## 132 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold, repeated 9 times)
## Summary of sample sizes: 119, 118, 120, 120, 119, 120, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.351048 0.4731740 1.1140210
## 2 1.260863 0.5272981 1.0203550
## 3 1.203779 0.5660351 0.9880331
## 4 1.226812 0.5616346 0.9927586
## 5 1.233899 0.5598473 1.0068803
## 6 1.239375 0.5624817 1.0190834
## 7 1.238962 0.5666788 1.0233615
## 8 1.244006 0.5677683 1.0267789
## 9 1.280803 0.5562125 1.0546023
## 10 1.304934 0.5486787 1.0802050
## 11 1.330050 0.5392558 1.0934149
## 12 1.364049 0.5257729 1.1144246
## 13 1.384194 0.5232286 1.1247945
## 14 1.401976 0.5203377 1.1320915
## 15 1.414448 0.5147468 1.1385261
## 16 1.420774 0.5091043 1.1428049
## 17 1.431185 0.5058737 1.1497164
## 18 1.436509 0.5066840 1.1504161
## 19 1.441805 0.5075062 1.1480669
## 20 1.454394 0.5039491 1.1523342
##
## Rsquared was used to select the optimal model using the one SE rule.
## The final value used for the model was ncomp = 3.
The plots above show the RMSE and \(R^2\) profiles of the model plotted against the number of latent variables. The RMSE curve reaches its minimum at ncomp equal to 3, and the \(R^2\) curve has a local maximum there. Thus, our visual study of the RMSE and \(R^2\) profiles is consistent with the oneSE rule. Therefore, we decide to use a PLS model with 3 latent variables.
Now we evaluate the model performance on the test data set.
pred_chemY = predict(plsTune, newdata = chemX_test_pp)
(testRMSE_chem = caret::RMSE(pred_chemY, chemY_test ))

## [1] 1.22817

(testR2_chem = caret::R2( pred_chemY, chemY_test) )

## [1] 0.5872227
The test set RMSE of the model is 1.2281695 and the test set \(R^2\) is 0.5872227. We plot the diagnostic plots below.
The PLS model residuals on the test data are generally consistent with the resampled training statistics. There are no strange or noticeable patterns in the residual plot or the observed-versus-predicted plot. There is a slight possibility that the variance of the residuals grows as the predicted value increases; however, this could be a small-sample artifact.
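One quick way to probe that suspicion (a hedged sketch assuming the chemY_test and pred_chemY objects above):

# Spearman correlation between absolute residuals and predictions;
# a value near 0 argues against strong heteroscedasticity.
resid_chem <- chemY_test - pred_chemY
cor(abs(resid_chem), pred_chemY, method = "spearman")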
Based on the comparison table below, we conclude that the test statistics are consistent with training statistics.
All Datasets using PLS Tuned Model
| dataset | RMSE | RSquared | NumObservations |
|---|---|---|---|
| Training Data | 1.20 | 0.57 | 132 |
| Test Data | 1.23 | 0.59 | 44 |
We consider the variable importance of the top 10 variables below.

We see that M32, M13 and M36 are the top three variables. Manufacturing predictors dominate the list; the first biological predictor, B02, enters the importance ranking at number six. Therefore, the controllable manufacturing process predictors carry most of the model's explanatory power.

The featurePlot above shows the sensitivity of yield to each of the top 4 predictors. The green line shows the loess curve fitted to the data. Based on the slope of each fitted line, we conclude that as M32 and M09 increase, the yield increases, and as M13 and M36 increase, the yield decreases. We should advise, based on the model, that yield can be controlled through changes to the manufacturing process guided by these parameter estimates.
We summarize all the R code used in this project in this appendix for ease of reading.
knitr::opts_chunk$set(echo = FALSE)
library(knitr)
library(tidyverse)
library(kableExtra)
library(cowplot)
library(skimr)
library(caret)
library(AppliedPredictiveModeling)
library(pls)
library(elasticnet)
library(glmnet)
.code-bg {
background-color: lightgray;
}
runProblem1 = TRUE
runModels1 = TRUE
runProblem2 = TRUE
> library(AppliedPredictiveModeling)
> data(permeability)
library(AppliedPredictiveModeling)
data("permeability")
print(paste0( "Dimensions of fingerprints: rows: ", nrow(fingerprints), " columns: ", ncol(fingerprints) ) )
print(paste0( "Dimensions of permability: rows: ", nrow(permeability), " columns: ", ncol(permeability) ) )
sk_fingerprints = skim(fingerprints)
nzv_vars = nearZeroVar(fingerprints, names = TRUE)
# Save the non-zero-variance predictors as a "bv" (big variance set of predictors)
fingerprints_bv = fingerprints[, -nearZeroVar(fingerprints) ]
sk_fingerprints %>% filter( skim_variable %in% nzv_vars[1:5] )
print(paste0( "Dimensions of fingerprint predictors retained: rows: ", nrow(fingerprints_bv), " columns: ", ncol(fingerprints_bv) ) )
predictors_removed = ncol(fingerprints) - ncol(fingerprints_bv)
as_tibble(permeability) %>% ggplot(aes(x=permeability)) + geom_histogram() + ggtitle("Distribution of Permeability on entire dataset")
library(fpp3)
perm_ts = as_tibble(permeability) %>% mutate( id = row_number()) %>% as_tsibble(index = id )
(lambda <- perm_ts %>% features( permeability , features = guerrero) %>% pull(lambda_guerrero) )
perm_ts = perm_ts %>% mutate( bc_perm = box_cox(permeability, lambda), log_perm = log(permeability) )
perm_ts %>% ggplot() +
geom_density(aes(log_perm), alpha = 0.5 , fill = "green") +
geom_density(aes(bc_perm), alpha = 0.5, fill = "blue")
# Box-Cox values of permeability
bcperm = perm_ts$bc_perm
# When using caret::createDataPartition with a continuous variable,
# the default behavior is to take a stratified random sample
# within 5 groups defined by quintiles of the variable, here permeability
#
set.seed(10023)
train_indices = createDataPartition(bcperm , times = 1, list = FALSE, p = 0.8 )
summary( bcperm[train_indices])
length(bcperm[train_indices])
length(bcperm[-train_indices])
summary( bcperm[-train_indices])
# perm_df defines the permeability dataframe joined to the reduced set of "big-variance" predictors
perm_df = data.frame( cbind(bcperm, fingerprints_bv ) )
perm_train = perm_df[train_indices,]
perm_trainY = perm_train[,1]
perm_trainX = perm_train[,2:ncol(perm_train)]
perm_test = perm_df[ -train_indices,]
perm_testY = perm_test[,1]
perm_testX = perm_test[,2:ncol(perm_test)]
repeats = 9
#repeats = 1
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = repeats, selectionFunction = "oneSE")
set.seed(102030)
plsTune <- train(x = perm_trainX, y = perm_trainY,
method = "pls",
tuneGrid = expand.grid(ncomp = 1:20),
trControl = ctrl ,
preProcess = c("center", "scale")
)
plsTune
plsResamples <- plsTune$results
plsResamples$Model <- "PLS"
plsPlotData <- plsResamples
xyplot(RMSE ~ ncomp,
data = plsPlotData,
#aspect = 1,
xlab = "# Components",
ylab = "RMSE (Cross-Validation)",
auto.key = list(columns = 2),
groups = Model,
type = c("o") )
xyplot(Rsquared ~ ncomp,
data = plsPlotData,
#aspect = 1,
xlab = "# Components",
ylab = "R2 (Cross-Validation)",
auto.key = list(columns = 2),
groups = Model,
type = c("o"))
pred_perm = predict(plsTune, newdata = perm_testX)
(testRMSE = caret::RMSE(pred_perm, perm_testY ))
(testR2 = caret::R2(pred_perm, perm_testY) )
xyplot( perm_testY ~ pred_perm, type = c("p", "g"),
main = "PLS Model Prediction vs. Observation of Test Data",
xlab = "Predicted",
ylab = "Observed", panel = function(...){
panel.abline(a = 0, b = 1 , lty = 2)
panel.xyplot(...)
})
xyplot( perm_testY - pred_perm ~ pred_perm, type = c("p", "g"),
main = "PLS Model Prediction vs. Residual of Test Data",
xlab = "Predicted", ylab = "Residual", panel = function(...){
panel.abline(a = 0, b = 0 , lty = 2)
panel.xyplot(...)
})
set.seed(102030)
pcrTune <- train(x = perm_trainX, y = perm_trainY,
method = "pcr",
tuneGrid = expand.grid(ncomp = 1:20),
trControl = ctrl ,
preProcess = c("center", "scale")
)
pcrTune
pcrResamples <- pcrTune$results
pcrResamples$Model <- "PCR"
jointPlotData <- rbind( pcrResamples, plsResamples )
xyplot(RMSE ~ ncomp,
data = jointPlotData,
#aspect = 1,
main = "PCR and PLS RMSE Results on Test Data" ,
xlab = "# Components",
ylab = "RMSE (Cross-Validation)",
auto.key = list(columns = 2),
groups = Model,
type = c("o", "g"))
xyplot(Rsquared ~ ncomp,
data = jointPlotData,
main = "PCR and PLS R2 Results on Test Data" ,
#aspect = 1,
xlab = "# Components",
ylab = "R2 (Cross-Validation)",
auto.key = list(columns = 2),
groups = Model,
type = c("o"))
pred_perm_pcr = predict(pcrTune, newdata = perm_testX)
(testRMSE_pcr = caret::RMSE(pred_perm_pcr, perm_testY ))
(testR2_pcr = caret::R2(pred_perm_pcr, perm_testY) )
xyplot( perm_testY ~ pred_perm_pcr, type = c("p", "g"),
xlab = "Predicted", ylab = "Observed",
main = "Observed vs. Predicted Values for PCR Model" ,
panel = function(...){
panel.abline(a = 0, b = 1 , lty = 2)
panel.xyplot(...)
})
xyplot( perm_testY - pred_perm_pcr ~ pred_perm,
type = c("p", "g"),
xlab = "Predicted",
ylab = "Residual",
main = "Residual vs. Prediction for PCR" ,
panel = function(...){
panel.abline(a = 0, b = 0 , lty = 1 )
panel.xyplot(...)
})
set.seed(102030)
ctrl <- trainControl(method = "cv", selectionFunction = "best")
enetGrid <- expand.grid(lambda = c(0, 0.05, .1 , 0.5, 0.8 ),
fraction = seq(.01, 1, length = 10))
enetTune <- train(x = perm_trainX, y = perm_trainY ,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl ,
preProcess = c("center", "scale")
)
enetTune
pred_enet = predict(enetTune, newdata = perm_testX)
(testRMSE_enet = caret::RMSE(pred_enet, perm_testY ))
(testR2_enet = caret::R2(pred_enet, perm_testY) )
xyplot( perm_testY ~ pred_enet, type = c("p", "g"),
xlab = "Predicted", ylab = "Observed",
main = "Observed vs. Predicted Values for Elastic Net Model" ,
panel = function(...){
panel.abline(a = 0, b = 1 , lty = 2)
panel.xyplot(...)
})
xyplot( perm_testY - pred_enet ~ pred_enet,
type = c("p", "g"),
xlab = "Predicted",
ylab = "Residual",
main = "Residual vs. Prediction for Elastic Net" ,
panel = function(...){
panel.abline(a = 0, b = 0 , lty = 1 )
panel.xyplot(...)
})
coefs = predict.enet(enetTune$finalModel, s=enetTune$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients
nonzero_coefs = coefs[ abs(coefs) > 0.00001 ]
var_names = names(nonzero_coefs)
rownames(nonzero_coefs) <- NULL
enet_coefs = data.frame( names = var_names, coefs = nonzero_coefs )
dim(enet_coefs)
enet_coefs %>% ggplot(aes(x = coefs, y = names)) + geom_col() + ggtitle("Elastic Net Model Standardized Coefficients")
enet_coefs %>% arrange( desc(coefs)) %>%
kable(caption = "Elastic Net Non-zero Coefficients", digits = 3 ) %>%
kable_styling(bootstrap_options = c("hover", "striped" ), position = "left" )
test_results = data.frame(model = c("Partial Least Squares", "Principal Components Regression", "Elastic Net") ,
RMSE = c( testRMSE, testRMSE_pcr, testRMSE_enet ) ,
RSquared = c( testR2, testR2_pcr, testR2_enet ) ,
NumVars = c( 8 , 11, 45 )
)
test_results %>% kable(caption = "All Models on Test DataSet", digits = 2 ) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
data("ChemicalManufacturingProcess")
chemX = ChemicalManufacturingProcess[2:58]
chemY = ChemicalManufacturingProcess[1]
dim(chemX)
dim(chemY)
chemX %>%
  rename_with(~ str_replace(.x, "ManufacturingProcess", "M"),
              starts_with("ManufacturingProcess")) %>%
  rename_with(~ str_replace(.x, "BiologicalMaterial", "B"),
              starts_with("Biological")) -> chemX2
skim(chemY)
chemY %>% ggplot(aes(x=Yield)) + geom_histogram(bins = 20, color= "white", fill = "red") +
labs(title = "Yield distribution")
set.seed(19372)
train_indices = createDataPartition(chemY$Yield , times = 1, list = FALSE, p = 0.75 )
chemX_test = chemX2[-train_indices , ]
chemX_train = chemX2[ train_indices , ]
chemY_test = chemY[-train_indices , ]
chemY_train = chemY[ train_indices , ]
paste0("Training data removed variables:", (nzv_chemtrain = nearZeroVar(chemX_train, names = TRUE) ) )
# Save the non-zero-variance predictors as a "bv" (big variance set of predictors)
chemX_train_bv = chemX_train[, -nearZeroVar(chemX_train) ]
paste0("Dimension of train data before and after near zero variance check: ")
dim(chemX_train)
dim(chemX_train_bv)
paste0("Test data removed variables:", (nzv_chemtest = nearZeroVar(chemX_test, names = TRUE) ) )
# Save the non-zero-variance predictors as a "bv" (big variance set of predictors)
chemX_test_bv = chemX_test[, -nearZeroVar(chemX_test) ]
paste0("Dimension of test data before and after near zero variance check: ")
dim(chemX_test)
dim(chemX_test_bv)
sk_train_bv = skim(chemX_train_bv)
sk_train_bv %>%
mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
) %>%
filter( zscore_extreme > 4 , numeric.p0 == 0 ) %>%
arrange( desc(zscore_extreme)) %>% yank("numeric")
# Replace the outlier zeros with the median value for the variables below.
chemX_train_bv %>% mutate( M18 = ifelse(M18==0, median(M18), M18 ) ,
M20 = ifelse(M20==0, median(M20), M20 ) ,
M16 = ifelse(M16==0, median(M16), M16 ) ,
M26 = ifelse(M26==0, median(M26), M26 ) ,
M25 = ifelse(M25==0, median(M25), M25 ) ,
M27 = ifelse(M27==0, median(M27), M27 ) ,
M31 = ifelse(M31==0, median(M31), M31 ) ,
M29 = ifelse(M29==0, median(M29), M29 ) ,
M43 = ifelse(M43==0, median(M43), M43 ) ,
M30 = ifelse(M30==0, median(M30), M30 ) ,
M01 = ifelse(M01==0, median(M01), M01 ) ,
M42 = ifelse(M42==0, median(M42), M42 ) ,
M44 = ifelse(M44==0, median(M44), M44 ) ,
M45 = ifelse(M45==0, median(M45), M45 ) ,
M39 = ifelse(M39==0, median(M39), M39 ) ,
M38 = ifelse(M38==0, median(M38), M38 ) ) -> chemX_train_nz
sk_train_nz = skim(chemX_train_nz)
sk_train_nz %>%
mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
) %>%
filter( zscore_extreme > 4 , numeric.p0 == 0 ) %>%
arrange( desc(zscore_extreme))
sk_train_nz %>% yank("numeric") %>% filter(skim_variable== 'M18')
sk_test_bv = skim(chemX_test_bv)
sk_test_bv %>%
mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
) %>%
filter( zscore_extreme > 4 , numeric.p0 == 0 ) %>%
arrange( desc(zscore_extreme)) %>% yank("numeric")
# Replace the outlier zeros with the median value for the variables below.
chemX_test_bv %>% mutate( M39 = ifelse(M39==0, median(M39), M39 ) ,
M42 = ifelse(M42==0, median(M42), M42 ) ,
M44 = ifelse(M44==0, median(M44), M44 ) ,
M45 = ifelse(M45==0, median(M45), M45 ) ,
M01 = ifelse(M01==0, median(M01), M01 ) ) -> chemX_test_nz
sk_test_nz = skim(chemX_test_nz)
sk_test_nz %>%
mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
) %>%
filter( zscore_extreme > 4 , numeric.p0 == 0 ) %>%
arrange( desc(zscore_extreme)) %>% yank("numeric")
skim(chemX_test_nz) %>% filter( n_missing > 0 ) %>% arrange( desc( n_missing) ) %>% yank("numeric")
chemX_test_nz %>% # The dataframe
filter_all( any_vars(is.na(.))) %>% # include any rows where any column is NA
select_if( ~ any(is.na(.))) -> missing_test_nz
# only include those columns where some row has NA in that column
missing_test_nz %>% kable(caption = "missing test data - rows") %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
missing_test_nz %>%
mutate( num_na = rowSums( is.na(.) ) ) -> missing_summary_test_nz
num_na_test_cells = sum(missing_summary_test_nz$num_na)
missing_summary_test_nz %>% # add a count of columns with NA values
select(num_na ) %>% # pull out the count
arrange( desc(num_na)) %>% # sort by the count descending
bind_rows(., tibble( num_na = num_na_test_cells ) ) %>%
kable(caption = "Test Data with missing predictor values") %>%
kable_styling(bootstrap_options = c("hover", "striped"),
position = "left"
)
pct_missing_test = num_na_test_cells/ ( nrow(chemX_test_nz) * ncol(chemX_test_nz)) * 100
skim(chemX_train_nz) %>% filter( n_missing > 0 ) %>% arrange( desc( n_missing) ) %>% yank("numeric")
chemX_train_nz %>% # The dataframe
filter_all( any_vars(is.na(.))) %>% # include any rows where any column is NA
select_if( ~ any(is.na(.))) -> missing_train_nz # only include those columns where some row has NA in that column
missing_train_nz %>% kable(caption = "missing training data - rows") %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
missing_train_nz %>%
mutate( num_na = rowSums( is.na(.) ) ) -> missing_summary_train_nz
num_na_cells_train = sum(missing_summary_train_nz$num_na)
missing_summary_train_nz %>% # add a count of columns with NA values
select(num_na ) %>% # pull out the count
arrange( desc(num_na)) %>% # sort by the count descending
bind_rows(., tibble( num_na = num_na_cells_train ) ) %>%
kable(caption = "Training Data with missing predictor values") %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
pct_missing_train = num_na_cells_train/ ( nrow(chemX_train_nz) * ncol(chemX_train_nz)) * 100
# Build the caret function to preprocess the Chemical data and impute missing values.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(chemX_train_nz, method = c("center", "scale", "knnImpute"))
# Becomes the source data for the model building
chemX_train_pp = predict( preProcFunc, chemX_train_nz )
# Becomes the final version of test data for prediction
chemX_test_pp = predict( preProcFunc, chemX_test_nz )
chemX_train_pp %>%
mutate(id = row_number()) %>%
pivot_longer( !id, names_to = "X", values_to = "Val" ) -> piv_chemX_train
plot1 = piv_chemX_train %>% filter(!is.na(Val)) %>% ggplot(aes(x=Val)) +
facet_wrap( ~X , scales = "free" , ncol = 4) +
geom_histogram(aes(y = ..density..), binwidth = .25, fill = "red", alpha = 0.5) +
geom_density( fill="blue", alpha = 0.3) +
labs(title = "Training Data: Histogram/Density of Imputed Standardized Predictors")
plot1
chemX_test_pp %>%
mutate(id = row_number()) %>%
pivot_longer( !id, names_to = "X", values_to = "Val" ) -> piv_chemX_test
plot2 = piv_chemX_test %>% filter(!is.na(Val)) %>% ggplot(aes(x=Val)) +
facet_wrap( ~X , scales = "free" , ncol = 4) +
geom_histogram(aes(y = ..density..), binwidth = .25, fill = "red", alpha = 0.5) +
geom_density( fill="blue", alpha = 0.3) +
labs(title = "Test Data: Histogram and Density of Standardized Predictors")
plot2
piv_chemX = rbind(piv_chemX_test %>% mutate(Set = "Test"),
piv_chemX_train %>% mutate(Set = "Train") )
plot3 = piv_chemX %>% ggplot(aes(x=Val, fill = Set )) +
facet_wrap( ~X , scales = "free" , ncol = 4) +
geom_density( alpha = 0.5) +
labs(title = "Data: Density of Imputed Training and Test Standardized Predictors")
plot3
repeats = 9
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = repeats, selectionFunction = "oneSE")
set.seed(102030)
plsTune <- train(x = chemX_train_pp, y = chemY_train,
method = "pls",
tuneGrid = expand.grid(ncomp = 1:20),
trControl = ctrl ,
preProcess = c("center", "scale") ,
metric = "Rsquared"
)
plsTune
plsResamples <- plsTune$results
plsResamples$Model <- "PLS"
plsPlotData <- plsResamples
xyplot(RMSE ~ ncomp,
data = plsPlotData,
#aspect = 1,
xlab = "# Components",
ylab = "RMSE (Cross-Validation)",
auto.key = list(columns = 2),
groups = Model,
main = "Chemical Manufacturing Process" ,
type = c("o") )
xyplot(Rsquared ~ ncomp,
data = plsPlotData,
#aspect = 1,
xlab = "# Components",
ylab = "R2 (Cross-Validation)",
auto.key = list(columns = 2),
groups = Model,
main = "Chemical Manufacturing Process",
type = c("o"))
pred_chemY = predict(plsTune, newdata = chemX_test_pp)
(testRMSE_chem = caret::RMSE(pred_chemY, chemY_test ))
(testR2_chem = caret::R2( pred_chemY, chemY_test) )
xyplot( chemY_test ~ pred_chemY, type = c("p", "g"),
main = "PLS Model Prediction vs. Observation of Test Data",
xlab = "Predicted",
ylab = "Observed", panel = function(...){
panel.abline(a = 0, b = 1 , lty = 2)
panel.xyplot(...)
})
xyplot( chemY_test - pred_chemY ~ pred_chemY, type = c("p", "g"),
main = "PLS Model Prediction vs. Residual of Test Data",
xlab = "Predicted", ylab = "Residual", panel = function(...){
panel.abline(a = 0, b = 0 , lty = 2)
panel.xyplot(...)
})
test_results = data.frame(dataset = c("Training Data", "Test Data") ,
RMSE = c( plsTune$results$RMSE[ plsTune$bestTune$ncomp] , testRMSE_chem ) ,
RSquared = c( plsTune$results$Rsquared[ plsTune$bestTune$ncomp], testR2_chem ) ,
NumObservations = c( nrow( chemX_train_pp) , nrow( chemX_test_pp ) )
)
test_results %>% kable(caption = "All Datasets using PLS Tuned Model", digits = 2 ) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
importanceScores = varImp(plsTune, scale = FALSE)
plot(importanceScores , top=10 )
vars_to_plot = c("M32", "M13", "M36", "M09")
theme1 <- trellis.par.get()
theme1$plot.symbol$col = rgb(.2, .2, .2, .4)
theme1$plot.symbol$pch = 16
theme1$plot.line$col = "green"
theme1$plot.line$lwd = 2
trellis.par.set(theme1)
caret::featurePlot(x = chemX_train_pp[, vars_to_plot],
y = chemY_train , plot = "scatter" ,
type = c("p", "smooth") ,
span = 0.5,
layout = c(4,1))