Data 624 HW 7 (Week 10) Linear Regression

Alexander Ng

Due 11/7/2021

Overview

Exercises from Kuhn and Johnson Applied Predictive Modeling, Chapter 6 Linear Regression and Its Cousins. All R code is displayed at the end for clarity.

Exercise 6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  (a) Start R and use these commands to load the data:
> library(AppliedPredictiveModeling)
> data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

  (b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

  (c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?

  (d) Predict the response for the test set. What is the test set estimate of \(R^2\)?

  (e) Try building other models discussed in this chapter. Do any have better predictive performance?

  (f) Would you recommend any of your models to replace the permeability laboratory experiment?

6.2a Response

We load the permeability dataset below and display the dimensions of its predictors and response.

library(AppliedPredictiveModeling)
data("permeability")
## [1] "Dimensions of fingerprints: rows: 165 columns: 1107"
## [1] "Dimensions of permability: rows: 165 columns: 1"

6.2b Response

We use nearZeroVar to identify predictors with very low variance or large imbalance between the highest frequency values. The default algorithm uses a 95/5 imbalance ratio between the most common and second most common value.

For a binary 0-1 variable \(X\), the 95/5 ratio is equivalent to

\[ | E[X] - 0.5 | \geq 0.45 \]

That is, near zero variance is equivalent to a mean close to 0 or close to 1.
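As a concrete check of that threshold, here is a small toy example (toy data, not from the assignment; the column names are made up) showing which binary predictors nearZeroVar flags:

library(caret)
# Toy data: 165 rows to mirror the fingerprint matrix.
# "rare" has a 160/5 split (ratio 32 > 19), so it should be flagged;
# "common" has a 140/25 split (ratio 5.6 < 19), so it should not.
toy <- data.frame(rare   = c(rep(0, 160), rep(1, 5)),
                  common = c(rep(0, 140), rep(1, 25)))
nearZeroVar(toy, names = TRUE)   # expect "rare"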

The first 5 near zero variance predictors and their summary statistics are displayed.

Data summary
Name fingerprints
Number of rows 165
Number of columns 1107
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
X7 0 1 1.00 0.00 1 1 1 1 1 ▁▁▇▁▁
X8 0 1 1.00 0.00 1 1 1 1 1 ▁▁▇▁▁
X9 0 1 0.02 0.13 0 0 0 0 1 ▇▁▁▁▁
X10 0 1 0.02 0.15 0 0 0 0 1 ▇▁▁▁▁
X13 0 1 0.96 0.20 0 1 1 1 1 ▁▁▁▁▇
## [1] "Dimensions of fingerprint predictors retained: rows: 165 columns: 388"

We conclude that 719 predictors were removed and 388 predictors are left for modeling.

6.2c Response

We look at the distribution of the response variable permeability below. Since it is skewed to the right, a Box-Cox transformation may help in linear-type regression.


We investigate the guerrero algorithm from fabletools and compare the transformation to a log transform. As shown below, the guerrero transform with \(\lambda = -0.148\) produces a more symmetric density plot (in blue) than a log transform with \(\lambda = 0\) (in green). Therefore, we choose to use \(\lambda = -0.148\).

## [1] -0.1486984
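Since \(\lambda \neq 0\), all model fitting and error metrics below are on the Box-Cox scale. A minimal sketch of mapping a prediction back to raw permeability units, assuming the inv_box_cox helper from fabletools (loaded via fpp3, and documented alongside box_cox) is available:

# Sketch: predictions below live on the Box-Cox scale; inv_box_cox
# (fabletools) maps them back to raw permeability units.
library(fabletools)
lambda <- -0.1486984
bc <- box_cox(10, lambda)    # transform a raw permeability of 10
inv_box_cox(bc, lambda)      # recovers 10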

To split the data into a test and training set, we use an 80%-20% split of training and test data. We use a stratified random sampling approach by forming quintiles on the response variable permeability. Within each quintile, we divide the observations into training and test partitions. Therefore, the summary statistics of the test and training data should look reasonably similar.

The test data will be used in 6.2d. The training data will be used in this section to choose a model and the number of PLS components.

#  When using caret::createDataPartition with a continuous response,
#  the default behavior is to take a stratified random sample within
#  5 groups defined by quintiles of the variable permeability.
#
set.seed(10023)
train_indices = createDataPartition(bcperm , times = 1, list = FALSE, p = 0.8 )

The summary statistics of the training data containing 133 observations are below.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.4933  0.4243  1.4170  1.2086  2.2498  3.0250
## [1] 133

These are similar to those of the test data containing 32 observations:

## [1] 32
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.6453  0.5139  1.4281  1.3740  2.2753  2.9718

We now tune the PLS model using 10-fold cross validation repeated 9 times.

# perm_df defines the permeability dataframe joined to the reduced set of "big-variance" predictors
perm_df = data.frame( cbind(bcperm, fingerprints_bv ) )

perm_train = perm_df[train_indices,]
perm_trainY = perm_train[,1]
perm_trainX = perm_train[,2:ncol(perm_train)]

perm_test  = perm_df[ -train_indices,]
perm_testY = perm_test[,1]
perm_testX = perm_test[,2:ncol(perm_test)]

The model specification and training parameters below are used to find the optimal PLS components.

repeats = 9   # 9 repeats of 10-fold CV

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = repeats,  selectionFunction = "oneSE")

set.seed(102030)

plsTune <- train(x = perm_trainX, y = perm_trainY,
                 method = "pls",
                 tuneGrid = expand.grid(ncomp = 1:20),
                 trControl = ctrl ,
                 preProcess = c("center", "scale")
                 )
plsTune
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold, repeated 9 times) 
## Summary of sample sizes: 120, 119, 121, 120, 120, 121, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     1.1141205  0.3143435  0.8807533
##    2     1.0618425  0.3738377  0.8347404
##    3     1.0436316  0.4032896  0.7988877
##    4     1.0574966  0.4032326  0.8202256
##    5     1.0504322  0.4331503  0.8261032
##    6     1.0312813  0.4515938  0.8001931
##    7     1.0049849  0.4780387  0.7695146
##    8     0.9770128  0.5103656  0.7495774
##    9     0.9554692  0.5376859  0.7339096
##   10     0.9591361  0.5406371  0.7396557
##   11     0.9730300  0.5400550  0.7454174
##   12     0.9921338  0.5286890  0.7627678
##   13     1.0094432  0.5205338  0.7761608
##   14     1.0230539  0.5153133  0.7869745
##   15     1.0408789  0.5024099  0.8031697
##   16     1.0588834  0.4920740  0.8209528
##   17     1.0831165  0.4782526  0.8364812
##   18     1.1023218  0.4686852  0.8466800
##   19     1.1116160  0.4693260  0.8475908
##   20     1.1291208  0.4613194  0.8582763
## 
## RMSE was used to select the optimal model using  the one SE rule.
## The final value used for the model was ncomp = 8.

The model output above states that ncomp equal to 8 is the best answer according to the one standard error rule introduced by Leo Breiman. The rule is set with the selectionFunction parameter in the trainControl tuning setup prior to model tuning.
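For reference, the same selection can be reproduced by hand with caret's exported oneSE helper. This is a sketch assuming the resampled results live in plsTune$results with RMSE and RMSESD columns (num = 90 because we ran 10 folds times 9 repeats):

# Re-derive the one-SE choice from the resampling table.
best_row <- caret::oneSE(plsTune$results, metric = "RMSE", num = 90, maximize = FALSE)
plsTune$results[best_row, c("ncomp", "RMSE", "Rsquared")]   # ncomp = 8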

In the plots below, we show the RMSE and \(R^2\) from the above tuning exercise plotted against the different number of PLS components or latent variables.

The cross-validation \(R^2\) and RMSE profiles both support the choice of ncomp equal to 8 components. We agree with this finding.

The corresponding cross-validated \(R^2\) is 0.5103656.

6.2d Response

Next, we compute \(R^2\) and \(RMSE\) of the test data below.

pred_perm = predict(plsTune, newdata = perm_testX)

(testRMSE = caret::RMSE(pred_perm, perm_testY ))
## [1] 0.9154151
(testR2 = caret::R2(pred_perm, perm_testY) )
## [1] 0.5405468

The test set RMSE of the model is 0.9154151 and the test set \(R^2\) is 0.5405468. We plot the diagnostic plots below.

6.2e Response

We will consider applying the same caret model fitting approach using PCR and Elastic Net models to evaluate performance. A table of the test data results at the end of this section shows that the elastic net is the top performing model.

Using the same training and test data split and cross validation counts, we observe that PCR requires more latent variables: the PCR model needs \(ncomp=11\) latent variables to achieve its chosen accuracy, whereas PLS regression requires only 8 components.

set.seed(102030)

pcrTune <- train(x = perm_trainX, y = perm_trainY,
                 method = "pcr",
                 tuneGrid = expand.grid(ncomp = 1:20),
                 trControl = ctrl ,
                 preProcess = c("center", "scale")
                 )
pcrTune
## Principal Component Analysis 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold, repeated 9 times) 
## Summary of sample sizes: 120, 119, 121, 120, 120, 121, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared    MAE      
##    1     1.293126  0.07920901  1.0906939
##    2     1.296688  0.07994117  1.0925286
##    3     1.210043  0.20196183  1.0031270
##    4     1.145841  0.26716680  0.9388025
##    5     1.142884  0.27844073  0.9286222
##    6     1.150580  0.27240294  0.9330602
##    7     1.135464  0.28994583  0.9095592
##    8     1.090564  0.33981803  0.8674796
##    9     1.102164  0.32826835  0.8783589
##   10     1.078221  0.35024938  0.8574133
##   11     1.063008  0.37340710  0.8376561
##   12     1.044617  0.39567077  0.8173886
##   13     1.049565  0.38835493  0.8163060
##   14     1.060419  0.37781904  0.8176979
##   15     1.070093  0.36982634  0.8230622
##   16     1.078642  0.36160264  0.8282541
##   17     1.091929  0.34890402  0.8388649
##   18     1.093206  0.34732713  0.8443113
##   19     1.098423  0.34154747  0.8478375
##   20     1.108231  0.33413103  0.8520652
## 
## RMSE was used to select the optimal model using  the one SE rule.
## The final value used for the model was ncomp = 11.

We conclude that the PCR model prefers to use ncomp = 11 latent variables, three more than the PLS model. This means PLS is more parsimonious in addition to showing higher predictive accuracy in cross validation.

The plots above for both the PLS and PCR models show that PLS outperforms PCR on cross-validated RMSE in addition to forming the more parsimonious model.

pred_perm_pcr = predict(pcrTune, newdata = perm_testX)

(testRMSE_pcr = caret::RMSE(pred_perm_pcr, perm_testY ))
## [1] 0.9518364
(testR2_pcr = caret::R2(pred_perm_pcr, perm_testY) )
## [1] 0.2733075

The test set RMSE of the model is 0.9518364 and the test set \(R^2\) is 0.2733075. We plot the diagnostic plots below. Both look reasonable.

Lastly, we consider the elastic net model using the elasticnet package. We make two alterations to the training procedure: a single (unrepeated) 10-fold cross validation, and the "best" selection rule instead of the one-SE rule:

set.seed(102030)

ctrl <- trainControl(method = "cv",  selectionFunction = "best")

enetGrid <- expand.grid(lambda = c(0, 0.05, .1 , 0.5,  0.8 ),    
                        fraction = seq(.01, 1, length = 10))

enetTune <- train(x = perm_trainX, y = perm_trainY ,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = ctrl ,
                  preProcess = c("center", "scale")
                  )

We display the parameter tunings below, which show that the elastic net outperforms the PLS and PCR regressions. The final elastic net model uses the ridge penalty \(\lambda = 0.1\) and the lasso fraction \(s = 0.12\) in our specification.

## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 121, 120, 120, 121, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE       Rsquared   MAE      
##   0.00    0.01      1.1704448  0.4450415  0.9524403
##   0.00    0.12      0.9469834  0.5643682  0.7574892
##   0.00    0.23      1.0022209  0.5514359  0.7872867
##   0.00    0.34      1.0918181  0.5301740  0.8540700
##   0.00    0.45      1.1860356  0.5027960  0.9304146
##   0.00    0.56      1.2873792  0.4838903  1.0034147
##   0.00    0.67      1.3983954  0.4586070  1.1211482
##   0.00    0.78      1.4963757  0.4368210  1.1994294
##   0.00    0.89      1.5939157  0.4202430  1.2808476
##   0.00    1.00      1.6725596  0.4105048  1.3382308
##   0.05    0.01      1.1983976  0.4256382  0.9710976
##   0.05    0.12      1.2870747  0.5416545  0.9654491
##   0.05    0.23      1.6818005  0.5410243  1.2627384
##   0.05    0.34      2.0753044  0.5408564  1.5484362
##   0.05    0.45      2.4854776  0.5305450  1.8570758
##   0.05    0.56      2.8553998  0.5187479  2.1129468
##   0.05    0.67      3.2196498  0.5094695  2.3669359
##   0.05    0.78      3.5790047  0.5074803  2.6374493
##   0.05    0.89      3.9390078  0.5061804  2.9070814
##   0.05    1.00      4.3001080  0.5051146  3.1775626
##   0.10    0.01      1.1988166  0.4369915  0.9993257
##   0.10    0.12      0.8498910  0.6087014  0.6557170
##   0.10    0.23      0.8744997  0.5841962  0.6739327
##   0.10    0.34      0.8810256  0.5803549  0.6742691
##   0.10    0.45      0.8976766  0.5780301  0.6905788
##   0.10    0.56      0.9120180  0.5697710  0.7075800
##   0.10    0.67      0.9225326  0.5617471  0.7173633
##   0.10    0.78      0.9278022  0.5587795  0.7223083
##   0.10    0.89      0.9295182  0.5582120  0.7229917
##   0.10    1.00      0.9319942  0.5570152  0.7244517
##   0.50    0.01      1.2205188  0.4369915  1.0213279
##   0.50    0.12      0.8832301  0.5795634  0.6860241
##   0.50    0.23      0.8930349  0.5853319  0.7053811
##   0.50    0.34      0.9475531  0.5684995  0.7483258
##   0.50    0.45      0.9738020  0.5646285  0.7660949
##   0.50    0.56      0.9976230  0.5618253  0.7815977
##   0.50    0.67      1.0158279  0.5571365  0.7935008
##   0.50    0.78      1.0249255  0.5567606  0.8037094
##   0.50    0.89      1.0290046  0.5597098  0.8102308
##   0.50    1.00      1.0323044  0.5617373  0.8156342
##   0.80    0.01      1.2202535  0.4369915  1.0208854
##   0.80    0.12      0.8894920  0.5684689  0.6897530
##   0.80    0.23      0.9177546  0.5833350  0.7330459
##   0.80    0.34      1.0240080  0.5577688  0.8173089
##   0.80    0.45      1.0754851  0.5486166  0.8551375
##   0.80    0.56      1.1138559  0.5445552  0.8876842
##   0.80    0.67      1.1408920  0.5407047  0.9070979
##   0.80    0.78      1.1577136  0.5396857  0.9231084
##   0.80    0.89      1.1677908  0.5415257  0.9340646
##   0.80    1.00      1.1738798  0.5446962  0.9421534
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.12 and lambda = 0.1.

Below, we compute the out of sample test results and then plot model diagnostics.

pred_enet = predict(enetTune, newdata = perm_testX)

(testRMSE_enet = caret::RMSE(pred_enet, perm_testY ))
## [1] 0.7277125
(testR2_enet = caret::R2(pred_enet, perm_testY) )
## [1] 0.568133

The above plots look reasonable.

Lastly, we observe that the elastic net model has reduced the predictor dimension from 388 to only 45 predictors with non-zero coefficients. The predictors and their elastic net coefficients are listed below in descending order of coefficient value.

coefs = predict.enet(enetTune$finalModel, s=enetTune$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients 

nonzero_coefs = coefs[ abs(coefs) > 0.00001 ]
var_names = names(nonzero_coefs)

# unname() prevents the variable names from reappearing as row names.
enet_coefs = data.frame( names = var_names, coefs = unname(nonzero_coefs) )

dim(enet_coefs)
## [1] 45  2

For completeness, we list the coefficients in tabular format too.

Elastic Net Non-zero Coefficients

names coefs
X6 0.494
X698 0.080
X719 0.080
X141 0.050
X133 0.027
X138 0.027
X16 0.027
X315 0.024
X316 0.007
X318 0.007
X317 0.007
X108 0.007
X103 0.007
X340 0.001
X344 0.001
X553 0.000
X551 0.000
X554 0.000
X304 -0.001
X305 -0.001
X298 -0.001
X280 -0.001
X281 -0.001
X300 -0.001
X299 -0.001
X293 -0.009
X613 -0.015
X621 -0.015
X258 -0.024
X245 -0.028
X244 -0.028
X240 -0.028
X253 -0.028
X246 -0.028
X254 -0.028
X239 -0.028
X157 -0.028
X99 -0.029
X307 -0.031
X295 -0.031
X294 -0.031
X308 -0.031
X271 -0.034
X158 -0.050
X93 -0.277

To summarize our findings in the table below, elastic net outperforms PLS and PCR by both RMSE and R-Squared on the test dataset.

All Models on Test DataSet

model RMSE RSquared NumVars
Partial Least Squares 0.92 0.54 8
Principal Components Regression 0.95 0.27 11
Elastic Net 0.73 0.57 45

6.2f Response

None of the three models produced here should be relied upon for production purposes: even the best model, the elastic net, achieves a test set \(R^2\) of only 0.57, leaving substantial unexplained variance.

(Table 1 from the original PAMPA assay paper, omitted here.)

One possible way to judge the utility of the model is to bin permeability into 3 classes (low, medium and high). By transforming the regression problem into a classification problem, we can evaluate whether the model achieves the roughly 80% accuracy level that appears acceptable to the authors of the original paper, as sketched below.
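A hedged sketch of that binning idea, reusing the objects above (the tercile thresholds and class labels are illustrative assumptions, not taken from the original paper):

# Cut observed and predicted permeability (Box-Cox scale) into
# low/medium/high classes at the training-set terciles, then measure
# classification accuracy against the ~80% benchmark.
breaks   <- quantile(perm_trainY, probs = c(0, 1/3, 2/3, 1))
labels   <- c("low", "medium", "high")
obs_cls  <- cut(perm_testY, breaks = breaks, labels = labels, include.lowest = TRUE)
pred_cls <- cut(pred_enet,  breaks = breaks, labels = labels, include.lowest = TRUE)
mean(pred_cls == obs_cls, na.rm = TRUE)  # NA if a value falls outside the training range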

The experimental equipment should continue to be used in tandem with the model, with ongoing assessment of the model's predictive accuracy based on the provided information.

Exercise 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

  (a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)

data(chemicalManufacturing)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

  (b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

  (c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

  (d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

  (e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

  (f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Response 6.3a

There is a minor typo in the 6.3 assignment.

– The dataset is named ChemicalManufacturingProcess rather than the name given above. The authors' book errata webpage acknowledges this error.

We save the predictors, which are columns 2 to 58 of ChemicalManufacturingProcess, into a dataframe chemX of \(N=176\) rows and \(M=57\) columns, and the response into chemY, a dataframe of \(N=176\) rows and \(M=1\) column.

data("ChemicalManufacturingProcess")

chemX = ChemicalManufacturingProcess[2:58]
chemY = ChemicalManufacturingProcess[1]
dim(chemX)
## [1] 176  57
dim(chemY)
## [1] 176   1

Due to the long variable names, we will abbreviate all predictors as follows:

chemX %>% rename_at(vars(starts_with("ManufacturingProcess") ), 
                    ~ str_replace(., "ManufacturingProcess", "M")) %>% 
  rename_at(vars(starts_with("Biological")), 
            ~ str_replace( ., "BiologicalMaterial", "B")) -> chemX2

Response 6.3b and 6.3c Combined

Although exercise 6.3b asks us to do imputation before splitting the data, that is bad practice because the imputation can then use properties of the test data to alter the training data. The predictions would suffer look-ahead bias, and the measured model accuracy would be artificially inflated.

We perform data splitting in 6.3c before imputation and preprocessing. Then we will tune a PLS model.

Splitting the data

We split the \(N\) observations into a 75-25 train/test split using stratified random sampling on the response variable Yield. Yield has no missing data, which is helpful for stratified random sampling. Also, no transformation of the response Yield is warranted, as its histogram looks symmetric. This is shown in the skim output of Yield below and its histogram.

Data summary
Name chemY
Number of rows 176
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Yield 0 1 40.18 1.85 35.25 38.75 39.97 41.48 46.34 ▁▇▇▃▁

We split the data into 75/25 train/test sets below and store the partitions in the necessary dataframes.

set.seed(19372)
train_indices = createDataPartition(chemY$Yield , times = 1, list = FALSE, p = 0.75 )

Near Zero Variance

We drop the near-zero-variance predictor B07 from both the train and test data sets. The number of predictors declines from \(M=57\) to \(M=56\) as a result. The results are shown below.

## [1] "Training data removed variables:B07"
## [1] "Dimension of train data before and after near zero variance check: "
## [1] 132  57
## [1] 132  56
## [1] "Test data removed variables:B07"
## [1] "Dimension of test data before and after near zero variance check: "
## [1] 44 57
## [1] 44 56

Outliers and Zeroes

Next we identify and fix zeros and outliers in the predictors before imputing missing values, because the kNN algorithm may be affected by outliers. The table below uses skim to report the minimum, maximum, median, mean and standard deviation for each predictor. In addition, we construct Z-score metrics that tell us how many standard deviations the minimum and maximum lie from the mean. For example, the values of the predictor \(M18\) are clustered around 4800 except for observation 13, which has a value of 0.

\[ \text{zscore\_min}(M18) = \frac{ \min(M18) - \mu(M18) }{\sigma(M18)}, \qquad \text{zscore\_max}(M18) = \frac{ \max(M18) - \mu(M18) }{\sigma(M18)} \]

We identify 16 predictors with this issue in the table below and replace the zero outliers with the median.

sk_train_bv = skim(chemX_train_bv)

sk_train_bv %>% 
  mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
          zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
          zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
          )  %>%
  filter( zscore_extreme > 4 , numeric.p0 == 0 ) %>%
  arrange( desc(zscore_extreme)) %>% yank("numeric")

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist zscore_min zscore_max zscore_extreme
M18 0 1.00 4800.09 423.29 0 4813.00 4834.00 4862.0 4971.0 ▁▁▁▁▇ -11.34 0.40 11.34
M20 0 1.00 4545.26 402.00 0 4550.00 4579.00 4605.5 4759.0 ▁▁▁▁▇ -11.31 0.53 11.31
M16 0 1.00 4554.23 404.48 0 4554.75 4585.50 4619.0 4852.0 ▁▁▁▁▇ -11.26 0.74 11.26
M26 3 0.98 6001.38 534.47 0 6018.00 6041.00 6069.0 6161.0 ▁▁▁▁▇ -11.23 0.30 11.23
M25 3 0.98 4816.81 429.30 0 4829.00 4852.00 4876.0 4990.0 ▁▁▁▁▇ -11.22 0.40 11.22
M27 3 0.98 4550.56 406.49 0 4556.00 4584.00 4607.0 4696.0 ▁▁▁▁▇ -11.19 0.36 11.19
M31 3 0.98 70.07 6.36 0 70.10 70.80 71.4 72.5 ▁▁▁▁▇ -11.01 0.38 11.01
M29 3 0.98 19.93 1.87 0 19.70 19.90 20.4 22.0 ▁▁▁▁▇ -10.65 1.11 10.65
M43 0 1.00 0.91 0.98 0 0.50 0.75 1.0 11.0 ▇▁▁▁▁ -0.92 10.27 10.27
M30 3 0.98 9.15 1.05 0 8.80 9.10 9.7 11.2 ▁▁▁▃▇ -8.74 1.96 8.74
M01 1 0.99 11.32 1.45 0 10.80 11.40 12.2 14.1 ▁▁▁▅▇ -7.80 1.91 7.80
M42 0 1.00 11.28 1.75 0 11.40 11.60 11.7 12.1 ▁▁▁▁▇ -6.45 0.47 6.45
M44 0 1.00 1.82 0.29 0 1.80 1.90 1.9 2.1 ▁▁▁▁▇ -6.22 0.96 6.22
M45 0 1.00 2.14 0.37 0 2.10 2.20 2.3 2.6 ▁▁▁▂▇ -5.72 1.21 5.72
M39 0 1.00 6.85 1.51 0 7.10 7.20 7.3 7.5 ▁▁▁▁▇ -4.54 0.43 4.54
M38 0 1.00 2.55 0.62 0 2.00 3.00 3.0 3.0 ▁▁▁▅▇ -4.09 0.73 4.09
## # A tibble: 0 x 15
## # … with 15 variables: skim_type <chr>, skim_variable <chr>, n_missing <int>,
## #   complete_rate <dbl>, numeric.mean <dbl>, numeric.sd <dbl>,
## #   numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>, numeric.p75 <dbl>,
## #   numeric.p100 <dbl>, numeric.hist <chr>, zscore_min <dbl>, zscore_max <dbl>,
## #   zscore_extreme <dbl>

The empty tibble above confirms that no flagged zero-outlier predictors remain after the replacement. We also observe that the variance of the affected predictors has gone down significantly. For example, the predictor M18 reduced its standard deviation from 423 to 44 (nearly 10 times smaller).

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
M18 0 1 4836.71 44.14 4701 4813.75 4834 4862 4971 ▁▃▇▃▁
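This claim can be verified directly from the two training data frames (a quick check, using the objects defined in the appendix code):

sd(chemX_train_bv$M18)   # ~423, before the zero is replaced
sd(chemX_train_nz$M18)   # ~44, after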

For the test dataset, 5 predictors have the same issue. So we apply the same changes.

sk_test_bv = skim(chemX_test_bv)

sk_test_bv %>% 
  mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
          zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
          zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
          )  %>%
  filter( zscore_extreme > 4 , numeric.p0 == 0 ) %>%
  arrange( desc(zscore_extreme)) %>% yank("numeric")

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist zscore_min zscore_max zscore_extreme
M39 0 1 6.86 1.52 0 7.10 7.20 7.30 7.4 ▁▁▁▁▇ -4.52 0.35 4.52
M42 0 1 10.98 2.44 0 11.38 11.50 11.62 12.0 ▁▁▁▁▇ -4.51 0.42 4.51
M44 0 1 1.76 0.40 0 1.80 1.80 1.90 2.1 ▁▁▁▁▇ -4.42 0.85 4.42
M45 0 1 2.12 0.50 0 2.10 2.20 2.30 2.5 ▁▁▁▁▇ -4.28 0.77 4.28
M01 0 1 10.86 2.63 0 10.67 11.35 12.00 14.1 ▁▁▁▆▇ -4.13 1.23 4.13

We apply the same treatment of zero values to the test dataset.


To summarize, we identified 16 predictors in the training set and 5 predictors in the test set with outlier zeros. By replacing these values with the median of their respective datasets, we reduced the variance of these predictors significantly. We chose not to apply the same outlier treatment to maximum values because we cannot be certain they are data errors. A more compact version of the replacement code is sketched below.
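The per-column mutate used for this replacement (shown in the code appendix) can be written more compactly. This is a sketch assuming dplyr >= 1.0 for across(); note it computes the median over the non-zero values, a slight refinement of the appendix code:

# Hypothetical helper: replace zeros with the median of the non-zero values.
fix_zeros <- function(df, vars) {
  df %>% mutate(across(all_of(vars),
                       ~ ifelse(. == 0, median(.[. != 0], na.rm = TRUE), .)))
}
# e.g. chemX_train_nz <- fix_zeros(chemX_train_bv, c("M18", "M20", "M16"))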

Missing Data Analysis

As we will show, the amount of missing data is small and its occurrence follows the same pattern in the test and training sets. We will do exploratory data analysis and data imputation on the training and test partitions separately.

Looking column-wise in the test data, 14 predictors have missing data. The 3 most impacted predictors are M03 (86.4% complete), M11 (93.2% complete) and M10 (95.5% complete). Only manufacturing process predictors have missing data, which suggests that future industrial applications should fix these issues at the factory. The skim output below displays the affected variables.

skim(chemX_test_nz) %>% filter( n_missing > 0 ) %>% arrange( desc( n_missing) ) %>% yank("numeric")

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
M03 6 0.86 1.54 0.03 1.48 1.52 1.54 1.55 1.60 ▁▂▇▁▁
M11 3 0.93 9.41 0.73 7.70 9.00 9.40 9.90 11.30 ▂▇▇▅▁
M10 2 0.95 9.21 0.81 7.50 8.83 9.20 9.60 11.40 ▃▇▇▅▁
M25 2 0.95 4863.07 36.61 4785.00 4844.75 4866.50 4883.75 4966.00 ▂▅▇▂▁
M26 2 0.95 6059.26 40.39 5973.00 6038.00 6052.50 6073.75 6137.00 ▁▃▇▁▃
M27 2 0.95 4599.21 43.36 4518.00 4575.00 4594.00 4621.50 4710.00 ▃▇▆▂▂
M28 2 0.95 7.68 4.93 0.00 0.00 10.50 10.95 11.40 ▃▁▁▁▇
M29 2 0.95 20.26 0.68 19.30 19.70 20.05 20.70 21.70 ▇▇▅▁▅
M30 2 0.95 9.19 0.73 7.60 8.72 9.15 9.80 10.90 ▂▇▆▇▁
M31 2 0.95 70.53 1.17 67.90 69.70 70.70 71.40 72.30 ▂▂▃▇▅
M33 2 0.95 63.40 2.24 58.00 62.25 63.50 64.75 68.00 ▂▂▇▃▁
M34 2 0.95 2.51 0.06 2.40 2.50 2.50 2.50 2.60 ▂▁▇▁▂
M35 2 0.95 496.12 11.05 463.00 490.00 497.00 504.75 517.00 ▁▃▇▇▅
M36 2 0.95 0.02 0.00 0.02 0.02 0.02 0.02 0.02 ▁▇▁▇▃

Next, we look row-wise at which observations have the most missing values across their predictor columns and display the results in order. A total of 9 observations have missing data points across the 14 predictors.

chemX_test_nz %>%                           # The dataframe
  filter_all( any_vars(is.na(.))) %>%    # include any rows where any column is NA
  select_if( ~ any(is.na(.))) -> missing_test_nz  
       # only include those columns where some row has NA in that column

missing_test_nz %>% kable(caption = "missing test data - rows") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")                                                                                             

missing test data - rows

M03 M10 M11 M25 M26 M27 M28 M29 M30 M31 M33 M34 M35 M36
2 NA NA NA 4869 6107 4630 11.2 21.4 9.9 68.7 66 2.6 508 0.019
3 NA NA NA 4897 6116 4637 11.1 21.3 9.4 69.3 66 2.6 509 0.018
15 NA 9.4 10.2 4874 6125 4659 11.3 21.6 9.9 68.5 64 2.5 490 0.019
17 NA 9.0 10.0 4860 6100 4659 11.1 21.3 10.0 68.7 68 2.5 495 0.019
18 NA 9.5 10.6 4815 6055 4631 10.8 20.9 10.6 69.0 64 2.5 497 0.020
19 NA 8.9 9.8 4876 6109 4696 11.1 21.4 9.8 68.8 64 2.4 514 0.021
98 1.55 9.1 NA 4869 6050 4574 10.5 19.9 8.8 71.3 64 2.5 493 0.019
173 1.56 9.5 9.9 NA NA NA NA NA NA NA NA NA NA NA
176 1.55 9.9 10.3 NA NA NA NA NA NA NA NA NA NA NA
missing_test_nz %>%
  mutate( num_na = rowSums( is.na(.) ) ) -> missing_summary_test_nz

num_na_test_cells = sum(missing_summary_test_nz$num_na)

missing_summary_test_nz %>%  # add a count of columns with NA values
  select(num_na ) %>%        # pull out the count
  arrange( desc(num_na)) %>% # sort by the count descending
  bind_rows(., tibble( num_na = num_na_test_cells ) ) %>%
  kable(caption = "Test Data with missing predictor values") %>%
  kable_styling(bootstrap_options = c("hover", "striped"),
                position = "left"
                )

Test Data with missing predictor values

num_na
173 11
176 11
2 3
3 3
15 1
17 1
18 1
19 1
98 1
…10 33
pct_missing_test = num_na_test_cells/ ( nrow(chemX_test_nz) * ncol(chemX_test_nz)) * 100

Tallying the NA values across all cells in the test data, we found 33 cells of missing data, or about 1.34 percent of the test data. Looking at the table above, of the 9 observations with missing values, two (173 and 176) have 11 predictors with missing values each. The rest of the observations are much less affected.

In the training data, 28 manufacturing predictors have missing data. All biological predictors have complete data. The 3 most impacted predictors are M03, M10, M11 with 93.2%, 94.7%, 94.7% completeness rates. This is consistent with the pattern in the test data. The most common completeness rate is 99.2% (1 missing value).

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
M03 9 0.93 1.54 0.02 1.47 1.54 1.54 1.55 1.60 ▁▂▆▇▁
M10 7 0.95 9.17 0.76 7.70 8.70 9.00 9.50 11.60 ▃▇▆▁▁
M11 7 0.95 9.38 0.72 7.50 9.00 9.40 9.90 11.50 ▂▆▇▅▁
M25 4 0.97 4854.45 40.40 4721.00 4829.00 4852.50 4876.00 4990.00 ▁▃▇▃▁
M26 4 0.97 6048.27 45.79 5901.00 6018.75 6041.50 6069.50 6161.00 ▁▂▇▃▂
M27 4 0.97 4586.11 46.99 4416.00 4556.75 4584.50 4607.25 4696.00 ▁▁▇▇▂
M29 4 0.97 20.09 0.61 18.90 19.70 19.90 20.40 22.00 ▂▇▂▁▁
M30 4 0.97 9.22 0.66 7.50 8.80 9.15 9.70 11.20 ▁▅▇▃▁
M31 4 0.97 70.62 1.36 59.80 70.18 70.80 71.40 72.50 ▁▁▁▂▇
M02 3 0.98 16.89 8.35 0.00 19.50 21.10 21.50 22.50 ▂▁▁▁▇
M28 3 0.98 6.24 5.32 0.00 0.00 10.30 10.70 11.50 ▆▁▁▁▇
M33 3 0.98 63.59 2.56 56.00 62.00 64.00 65.00 70.00 ▁▅▇▆▂
M34 3 0.98 2.49 0.05 2.30 2.50 2.50 2.50 2.60 ▁▂▁▇▁
M35 3 0.98 495.43 10.78 468.00 490.00 495.00 501.00 522.00 ▁▂▇▃▁
M36 3 0.98 0.02 0.00 0.02 0.02 0.02 0.02 0.02 ▂▇▇▁▂
M01 2 0.98 11.41 1.06 9.00 10.80 11.40 12.20 14.10 ▃▅▇▆▁
M06 2 0.98 207.52 2.91 203.00 205.70 206.80 208.70 227.40 ▇▅▁▁▁
M04 1 0.99 932.34 6.14 911.00 928.00 934.00 937.00 946.00 ▁▂▃▇▁
M05 1 0.99 999.84 26.93 923.00 987.50 999.20 1008.10 1175.30 ▁▇▁▁▁
M07 1 0.99 177.50 0.50 177.00 177.00 177.00 178.00 178.00 ▇▁▁▁▇
M08 1 0.99 177.53 0.50 177.00 177.00 178.00 178.00 178.00 ▇▁▁▁▇
M12 1 0.99 868.13 1794.45 0.00 0.00 0.00 0.00 4549.00 ▇▁▁▁▂
M14 1 0.99 4852.15 52.70 4713.00 4826.50 4858.00 4881.00 4992.00 ▁▂▇▃▁
M22 1 0.99 5.53 3.34 0.00 3.00 5.00 8.00 12.00 ▇▆▇▅▅
M23 1 0.99 3.01 1.61 0.00 2.00 3.00 4.00 6.00 ▇▆▇▆▇
M24 1 0.99 8.99 5.76 0.00 4.00 8.00 14.00 23.00 ▆▇▃▅▁
M40 1 0.99 0.02 0.04 0.00 0.00 0.00 0.00 0.10 ▇▁▁▁▂
M41 1 0.99 0.02 0.05 0.00 0.00 0.00 0.00 0.20 ▇▁▁▁▁
chemX_train_nz %>%                           # The dataframe
  filter_all( any_vars(is.na(.))) %>%    # include any rows where any column is NA
  select_if( ~ any(is.na(.))) -> missing_train_nz     # only include those columns where some row has NA in that column

missing_train_nz %>% kable(caption = "missing training data - rows") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

missing training data - rows

M01 M02 M03 M04 M05 M06 M07 M08 M10 M11 M12 M14 M22 M23 M24 M25 M26 M27 M28 M29 M30 M31 M33 M34 M35 M36 M40 M41
1 NA NA NA NA NA NA NA NA NA NA NA 4898 NA NA NA 4873 6074 4685 10.7 21.0 9.9 69.1 66 2.4 486 0.019 NA NA
4 NA 0.0 NA 911 1014.6 213.3 177 177 NA NA 0 4897 5 2 5 4892 6111 4630 11.1 21.3 9.4 69.3 68 2.5 496 0.018 0.0 0.0
5 10.7 0.0 NA 918 1027.5 205.7 178 178 NA NA 0 4992 8 4 18 4930 6151 4684 11.3 21.6 9.0 69.4 70 2.5 468 0.017 0.0 0.0
6 12.0 0.0 NA 924 1016.8 208.9 178 178 NA NA 0 4985 9 1 1 4871 6128 4687 11.4 21.7 10.1 68.2 70 2.5 490 0.018 0.0 0.0
16 11.1 0.0 NA 928 1073.6 207.1 177 177 9.6 10.2 0 4805 12 4 16 4848 6095 4630 11.1 21.2 10.3 68.5 66 2.5 493 0.019 0.1 0.2
20 12.7 0.0 NA 929 1175.3 204.1 177 177 8.9 9.9 0 4961 4 3 4 4850 6094 4665 11.0 21.2 10.1 68.7 62 2.5 490 0.020 0.0 0.0
22 10.9 0.0 NA 936 1024.4 218.7 178 178 NA NA 0 4791 6 2 16 4918 6152 4618 11.4 21.1 8.8 70.1 68 2.4 479 0.018 0.0 0.0
23 11.1 0.0 NA 937 997.7 209.6 177 177 NA NA 0 NA 7 3 17 4906 6134 4626 11.2 21.0 8.9 70.1 66 2.4 492 0.019 0.0 0.0
24 11.3 0.0 NA 940 1031.3 215.1 177 177 NA NA 0 4864 1 1 1 4875 6095 4606 11.2 20.9 9.5 69.7 65 2.5 468 0.018 0.0 0.0
90 10.8 21.5 1.55 935 954.8 NA 178 178 8.4 8.3 0 4866 8 3 3 4897 6047 4572 10.3 19.7 8.1 72.2 64 2.5 493 0.020 0.0 0.0
108 9.8 21.9 1.50 934 987.9 208.9 178 178 9.4 9.3 0 4807 11 5 5 NA NA NA 0.0 NA NA NA 62 2.5 485 0.019 0.0 0.0
134 10.9 NA 1.55 934 1000.1 208.7 177 178 8.7 9.4 0 4875 2 2 13 4816 6037 4568 0.0 19.8 9.6 70.6 62 2.5 495 0.020 0.0 0.0
139 10.0 NA 1.55 935 986.1 205.2 177 178 9.5 9.5 0 4823 8 2 2 4822 6018 4606 0.0 19.8 9.6 70.6 60 2.6 497 0.020 0.0 0.0
172 12.8 21.5 1.54 935 1027.0 206.2 178 177 9.6 9.6 0 4829 2 2 8 NA NA NA NA NA NA NA NA NA NA NA 0.0 0.0
174 13.0 20.4 1.55 930 1040.0 208.7 178 177 10.1 10.4 0 4795 0 0 0 NA NA NA NA NA NA NA NA NA NA NA 0.0 0.0
175 14.1 21.6 1.55 935 1044.8 208.0 177 177 10.2 10.3 0 4793 0 0 0 NA NA NA NA NA NA NA NA NA NA NA 0.0 0.0
missing_train_nz %>%
  mutate( num_na = rowSums( is.na(.) ) ) -> missing_summary_train_nz

num_na_cells_train = sum(missing_summary_train_nz$num_na)

missing_summary_train_nz %>%  # add a count of columns with NA values
  select(num_na ) %>%        # pull out the count
  arrange( desc(num_na)) %>% # sort by the count descending
  bind_rows(., tibble( num_na = num_na_cells_train ) ) %>%
  kable(caption = "Training Data with missing predictor values") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

Training Data with missing predictor values

num_na
1 16
172 11
174 11
175 11
108 6
4 4
23 4
5 3
6 3
22 3
24 3
16 1
20 1
90 1
134 1
139 1
…17 80
pct_missing_train = num_na_cells_train/ ( nrow(chemX_train_nz) * ncol(chemX_train_nz)) * 100

We conclude that the train and test sets contain small and similar percentages of missing values (roughly 1.1% and 1.3%, respectively).

Due to the small percentage of missing data and after treatment of zero values, we expect kNN imputation to work well.

KNN Imputation

We will use knn imputation to replace those values by using caret preProcess functionality.

# Build the caret function to preprocess the Chemical data and impute missing values.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(chemX_train_nz, method = c("center", "scale", "knnImpute"))

# Becomes the source data for the model building
chemX_train_pp = predict( preProcFunc,  chemX_train_nz )

# Becomes the final version of test data for prediction
chemX_test_pp  = predict( preProcFunc,  chemX_test_nz )

Next, we visualize the histogram and density of all the scaled, imputed predictors constructed above using caret. The plot below shows that, after imputation and removal of zeros, no predictor has an odd density or an extremely long tail; either of those patterns would indicate outlier problems.

Next, we compare the density plots of each of the in-scope predictors between the training and test datasets. We conclude that both data sets are similar. Most importantly, the test data was not used to adjust or impute the training data.

PLS Model Tuning

We use a partial least squares model (PLS) for prediction of the Yield response. We use standard training arguments shown below to fit the model. We use repeated cross validation with 9 repeats of a 10-fold cross validation process on the training data. Lastly, we use Rsquared as the performance metric and the oneSE rule to choose the number of latent variables (a.k.a. components) to use in prediction.

repeats = 9

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = repeats,  selectionFunction = "oneSE")

set.seed(102030)

plsTune <- train(x = chemX_train_pp, y = chemY_train,
                 method = "pls",
                 tuneGrid = expand.grid(ncomp = 1:20),
                 trControl = ctrl ,
                 preProcess = c("center", "scale") ,
                 metric = "Rsquared"
                 )
plsTune
## Partial Least Squares 
## 
## 132 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold, repeated 9 times) 
## Summary of sample sizes: 119, 118, 120, 120, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     1.351048  0.4731740  1.1140210
##    2     1.260863  0.5272981  1.0203550
##    3     1.203779  0.5660351  0.9880331
##    4     1.226812  0.5616346  0.9927586
##    5     1.233899  0.5598473  1.0068803
##    6     1.239375  0.5624817  1.0190834
##    7     1.238962  0.5666788  1.0233615
##    8     1.244006  0.5677683  1.0267789
##    9     1.280803  0.5562125  1.0546023
##   10     1.304934  0.5486787  1.0802050
##   11     1.330050  0.5392558  1.0934149
##   12     1.364049  0.5257729  1.1144246
##   13     1.384194  0.5232286  1.1247945
##   14     1.401976  0.5203377  1.1320915
##   15     1.414448  0.5147468  1.1385261
##   16     1.420774  0.5091043  1.1428049
##   17     1.431185  0.5058737  1.1497164
##   18     1.436509  0.5066840  1.1504161
##   19     1.441805  0.5075062  1.1480669
##   20     1.454394  0.5039491  1.1523342
## 
## Rsquared was used to select the optimal model using  the one SE rule.
## The final value used for the model was ncomp = 3.

The plots above show the RMSE and \(R^2\) profiles of the model plotted against the number of latent variables. The RMSE profile reaches its minimum at ncomp equal to 3, and the \(R^2\) profile has a local maximum there. Thus, our visual study of the RMSE and \(R^2\) profiles is consistent with the oneSE rule.

Therefore, we decide to use a PLS model with 3 latent variables.

Response 6.3d

Now we evaluate the model performance on the test data set.

pred_chemY = predict(plsTune, newdata = chemX_test_pp)

(testRMSE_chem = caret::RMSE(pred_chemY, chemY_test ))
## [1] 1.22817
(testR2_chem   = caret::R2(  pred_chemY, chemY_test) )
## [1] 0.5872227

The test set RMSE of the model is 1.2281695 and the test set \(R^2\) is 0.5872227. We plot the diagnostic plots below.

The PLS model residuals on the test data look broadly consistent with the resampled training statistics. There are no strange or noticeable patterns in the residuals plot or the observed-versus-predicted plot. There is a slight possibility that the residual variance grows as the predicted value increases; however, this could be a small-sample issue.

Based on the comparison table below, we conclude that the test statistics are consistent with training statistics.

All Datasets using PLS Tuned Model

dataset RMSE RSquared NumObservations
Training Data 1.20 0.57 132
Test Data 1.23 0.59 44

Response 6.3e

We consider the variable importance of the top 10 variables below.

We see that M32, M13 and M36 are the top three variables. Manufacturing predictors dominate the list; the first biological predictor, B02, enters the variable importance scores at rank six. Therefore, the manufacturing process predictors, which unlike the biological predictors can be adjusted, carry most of the predictive information.
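The same information can be pulled as a table instead of a plot (a sketch, using the varImp object computed in the appendix code):

# List the ten largest PLS importance scores.
imp <- varImp(plsTune, scale = FALSE)$importance
imp$predictor <- rownames(imp)
head(imp[order(-imp$Overall), c("predictor", "Overall")], 10)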

Response 6.3f

The featurePlot above shows the sensitivity of yield to each of the top 4 predictors. The green line shows the loess curve fitted to the data. Based on the slope of each fitted curve, we conclude that yield responds systematically to these top process predictors.

We therefore advise, based on the model, that yield can be controlled through changes to the manufacturing process settings, guided by these estimated relationships.

Code

We summarize all the R code used in this project in this appendix for ease of reading.

knitr::opts_chunk$set(echo = FALSE)

library(knitr)
library(tidyverse)
library(kableExtra)
library(cowplot)
library(skimr)
library(caret)
library(AppliedPredictiveModeling)
library(pls)
library(elasticnet)
library(glmnet)

runProblem1 = TRUE
runModels1  = TRUE
runProblem2 = TRUE

> library(AppliedPredictiveModeling)
> data(permeability)
library(AppliedPredictiveModeling)
data("permeability")
print(paste0( "Dimensions of fingerprints: rows: ", nrow(fingerprints), " columns: ", ncol(fingerprints) ) )
print(paste0( "Dimensions of permability: rows: ", nrow(permeability), " columns: ", ncol(permeability) ) )

sk_fingerprints = skim(fingerprints)

nzv_vars = nearZeroVar(fingerprints, names = TRUE) 
# Keep the predictors that pass the near-zero-variance filter ("bv" = big-variance set)
fingerprints_bv = fingerprints[, -nearZeroVar(fingerprints) ]

sk_fingerprints %>% filter( skim_variable %in% nzv_vars[1:5] )

print(paste0( "Dimensions of fingerprint predictors retained: rows: ", nrow(fingerprints_bv), " columns: ", ncol(fingerprints_bv) ) )
predictors_removed = ncol(fingerprints) - ncol(fingerprints_bv)
as_tibble(permeability) %>% ggplot(aes(x=permeability)) + geom_histogram() + ggtitle("Distribution of Permeability on entire dataset")

library(fpp3)
perm_ts = as_tibble(permeability) %>% mutate( id = row_number()) %>% as_tsibble(index = id )

(lambda <- perm_ts %>% features( permeability , features = guerrero) %>% pull(lambda_guerrero) )

perm_ts = perm_ts %>% mutate( bc_perm = box_cox(permeability, lambda), log_perm = log(permeability) )

perm_ts %>% ggplot() + 
  geom_density(aes(log_perm), alpha = 0.5 , fill = "green") +
  geom_density(aes(bc_perm), alpha = 0.5, fill = "blue")

# Box-Cox values of permeability
bcperm = perm_ts$bc_perm

#  When using caret::createDataPartition with a continuous response,
#  the default behavior is to take a stratified random sample within
#  5 groups defined by quintiles of the variable permeability.
#
set.seed(10023)
train_indices = createDataPartition(bcperm , times = 1, list = FALSE, p = 0.8 )

summary( bcperm[train_indices])
length(bcperm[train_indices])

length(bcperm[-train_indices])
summary( bcperm[-train_indices])
# perm_df defines the permeability dataframe joined to the reduced set of "big-variance" predictors
perm_df = data.frame( cbind(bcperm, fingerprints_bv ) )

perm_train = perm_df[train_indices,]
perm_trainY = perm_train[,1]
perm_trainX = perm_train[,2:ncol(perm_train)]

perm_test  = perm_df[ -train_indices,]
perm_testY = perm_test[,1]
perm_testX = perm_test[,2:ncol(perm_test)]

repeats = 9
#repeats = 1


ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = repeats,  selectionFunction = "oneSE")

set.seed(102030)

plsTune <- train(x = perm_trainX, y = perm_trainY,
                 method = "pls",
                 tuneGrid = expand.grid(ncomp = 1:20),
                 trControl = ctrl ,
                 preProcess = c("center", "scale")
                 )
plsTune

plsResamples <- plsTune$results
plsResamples$Model <- "PLS"

plsPlotData <- plsResamples

xyplot(RMSE ~ ncomp,
       data = plsPlotData,
       #aspect = 1,
       xlab = "# Components",
       ylab = "RMSE (Cross-Validation)",
       auto.key = list(columns = 2),
       groups = Model,
       type = c("o") )


xyplot(Rsquared ~ ncomp,
       data = plsPlotData,
       #aspect = 1,
       xlab = "# Components",
       ylab = "R2 (Cross-Validation)",
       auto.key = list(columns = 2),
       groups = Model,
       type = c("o"))


pred_perm = predict(plsTune, newdata = perm_testX)

(testRMSE = caret::RMSE(pred_perm, perm_testY ))
(testR2 = caret::R2(pred_perm, perm_testY) )

xyplot( perm_testY ~ pred_perm, type = c("p", "g"), 
        main = "PLS Model Prediction vs. Observation of Test Data",
        xlab = "Predicted", 
        ylab = "Observed", panel = function(...){ 
  panel.abline(a = 0, b = 1 , lty = 2)
  panel.xyplot(...)
  })


xyplot( perm_testY - pred_perm ~ pred_perm, type = c("p", "g"), 
        main = "PLS Model Prediction vs. Residual of Test Data",
        xlab = "Predicted", ylab = "Residual", panel = function(...){ 
  panel.abline(a = 0, b = 0 , lty = 2)
  panel.xyplot(...)
  })


set.seed(102030)

pcrTune <- train(x = perm_trainX, y = perm_trainY,
                 method = "pcr",
                 tuneGrid = expand.grid(ncomp = 1:20),
                 trControl = ctrl ,
                 preProcess = c("center", "scale")
                 )
pcrTune


pcrResamples <- pcrTune$results
pcrResamples$Model <- "PCR"

jointPlotData <- rbind( pcrResamples, plsResamples )

xyplot(RMSE ~ ncomp,
       data = jointPlotData,
       #aspect = 1,
       main = "PCR and PLS RMSE Results on Test Data" ,
       xlab = "# Components",
       ylab = "RMSE (Cross-Validation)",
       auto.key = list(columns = 2),
       groups = Model,
       type = c("o", "g"))



xyplot(Rsquared ~ ncomp,
       data = jointPlotData,
      main = "PCR and PLS R2 Results on Test Data" ,
       #aspect = 1,
       xlab = "# Components",
       ylab = "R2 (Cross-Validation)",
       auto.key = list(columns = 2),
       groups = Model,
       type = c("o"))


pred_perm_pcr = predict(pcrTune, newdata = perm_testX)

(testRMSE_pcr = caret::RMSE(pred_perm_pcr, perm_testY ))
(testR2_pcr = caret::R2(pred_perm_pcr, perm_testY) )

xyplot( perm_testY ~ pred_perm_pcr, type = c("p", "g"), 
        xlab = "Predicted", ylab = "Observed", 
        main = "Observed vs. Predicted Values for PCR Model" ,
        panel = function(...){ 
          panel.abline(a = 0, b = 1 , lty = 2)
          panel.xyplot(...)
  })
xyplot( perm_testY - pred_perm_pcr ~ pred_perm, 
        type = c("p", "g"), 
        xlab = "Predicted", 
        ylab = "Residual", 
        main = "Residual vs. Prediction for PCR" ,
        panel = function(...){ 
  panel.abline(a = 0, b = 0 , lty = 1 )
  panel.xyplot(...)
  })



set.seed(102030)

ctrl <- trainControl(method = "cv",  selectionFunction = "best")

enetGrid <- expand.grid(lambda = c(0, 0.05, .1 , 0.5,  0.8 ),    
                        fraction = seq(.01, 1, length = 10))

enetTune <- train(x = perm_trainX, y = perm_trainY ,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = ctrl ,
                  preProcess = c("center", "scale")
                  )

enetTune

pred_enet = predict(enetTune, newdata = perm_testX)

(testRMSE_enet = caret::RMSE(pred_enet, perm_testY ))
(testR2_enet = caret::R2(pred_enet, perm_testY) )

xyplot( perm_testY ~ pred_enet, type = c("p", "g"), 
        xlab = "Predicted", ylab = "Observed", 
        main = "Observed vs. Predicted Values for Elastic Net Model" ,
        panel = function(...){ 
          panel.abline(a = 0, b = 1 , lty = 2)
          panel.xyplot(...)
  })
xyplot( perm_testY - pred_enet ~ pred_enet, 
        type = c("p", "g"), 
        xlab = "Predicted", 
        ylab = "Residual", 
        main = "Residual vs. Prediction for Elastic Net" ,
        panel = function(...){ 
  panel.abline(a = 0, b = 0 , lty = 1 )
  panel.xyplot(...)
  })


coefs = predict.enet(enetTune$finalModel, s=enetTune$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients 

nonzero_coefs = coefs[ abs(coefs) > 0.00001 ]
var_names = names(nonzero_coefs)

# unname() prevents the variable names from reappearing as row names.
enet_coefs = data.frame( names = var_names, coefs = unname(nonzero_coefs) )

dim(enet_coefs)

enet_coefs %>% ggplot(aes(x = coefs, y = names)) + geom_col() + ggtitle("Elastic Net Model Standardized Coefficients")

enet_coefs %>% arrange( desc(coefs)) %>%  
  kable(caption = "Elastic Net Non-zero Coefficients", digits = 3 ) %>% 
  kable_styling(bootstrap_options = c("hover", "striped" ), position = "left" )


test_results = data.frame(model = c("Partial Least Squares", "Principal Components Regression", "Elastic Net") ,
                          RMSE  = c( testRMSE, testRMSE_pcr, testRMSE_enet ) ,
                          RSquared = c( testR2, testR2_pcr, testR2_enet ) ,
                          NumVars = c(  8 ,  11, 45 )
                          )

test_results %>% kable(caption = "All Models on Test DataSet", digits = 2 ) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

data("ChemicalManufacturingProcess")

chemX = ChemicalManufacturingProcess[2:58]
chemY = ChemicalManufacturingProcess[1]
dim(chemX)
dim(chemY)

chemX %>% rename_at(vars(starts_with("ManufacturingProcess") ), 
                    ~ str_replace(., "ManufacturingProcess", "M")) %>% 
  rename_at(vars(starts_with("Biological")), 
            ~ str_replace( ., "BiologicalMaterial", "B")) -> chemX2

skim(chemY)

chemY %>% ggplot(aes(x=Yield)) + geom_histogram(bins = 20, color= "white", fill = "red") + 
  labs(title = "Yield distribution")

set.seed(19372)
train_indices = createDataPartition(chemY$Yield , times = 1, list = FALSE, p = 0.75 )
chemX_test   = chemX2[-train_indices , ]
chemX_train  = chemX2[ train_indices , ]
chemY_test   = chemY[-train_indices , ]
chemY_train  = chemY[ train_indices , ]


paste0("Training data removed variables:", (nzv_chemtrain = nearZeroVar(chemX_train, names = TRUE) ) )

# Keep the predictors that pass the near-zero-variance filter ("bv" = big-variance set)
chemX_train_bv = chemX_train[, -nearZeroVar(chemX_train) ]
paste0("Dimension of train data before and after near zero variance check: ")
dim(chemX_train)
dim(chemX_train_bv)


paste0("Test data removed variables:", (nzv_chemtest = nearZeroVar(chemX_test, names = TRUE) ) )

# Keep the predictors that pass the near-zero-variance filter ("bv" = big-variance set)
chemX_test_bv = chemX_test[, -nearZeroVar(chemX_test) ]
paste0("Dimension of test data before and after near zero variance check: ")
dim(chemX_test)
dim(chemX_test_bv)



sk_train_bv = skim(chemX_train_bv)

sk_train_bv %>% 
  mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
          zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
          zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
          )  %>%
  filter( zscore_extreme > 4 , numeric.p0 == 0 ) %>%
  arrange( desc(zscore_extreme)) %>% yank("numeric")

# Replace the outlier zeros with the median value for the variables below.
# Note: for columns that also contain NAs, median() without na.rm = TRUE
# returns NA, so those zeros become NA and are later filled by kNN imputation;
# either way the zero outlier is removed.

chemX_train_bv %>% mutate( M18 = ifelse(M18==0, median(M18), M18 ) ,
                           M20 = ifelse(M20==0, median(M20), M20 ) ,
                           M16 = ifelse(M16==0, median(M16), M16 ) ,
                           M26 = ifelse(M26==0, median(M26), M26 ) ,
                           M25 = ifelse(M25==0, median(M25), M25 ) ,
                           M27 = ifelse(M27==0, median(M27), M27 ) ,
                           M31 = ifelse(M31==0, median(M31), M31 ) ,
                           M29 = ifelse(M29==0, median(M29), M29 ) ,
                           M43 = ifelse(M43==0, median(M43), M43 ) ,
                           M30 = ifelse(M30==0, median(M30), M30 ) ,
                           M01 = ifelse(M01==0, median(M01), M01 ) ,
                           M42 = ifelse(M42==0, median(M42), M42 ) ,
                           M44 = ifelse(M44==0, median(M44), M44 ) ,
                           M45 = ifelse(M45==0, median(M45), M45 ) ,
                           M39 = ifelse(M39==0, median(M39), M39 ) ,                           
                           M38 = ifelse(M38==0, median(M38), M38 ) ) -> chemX_train_nz 
                           

sk_train_nz = skim(chemX_train_nz)

sk_train_nz %>% 
  mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
          zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
          zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
          )  %>%
  filter( zscore_extreme > 4 , numeric.p0 == 0 ) %>%
  arrange( desc(zscore_extreme))

sk_train_nz %>% yank("numeric") %>% filter(skim_variable== 'M18')


sk_test_bv = skim(chemX_test_bv)

sk_test_bv %>% 
  mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
          zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
          zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
          )  %>%
  filter( zscore_extreme > 4 , numeric.p0 == 0 ) %>%
  arrange( desc(zscore_extreme)) %>% yank("numeric")

# Replace the outlier zeros with the median value for the variables below.

chemX_test_bv %>% mutate( M39 = ifelse(M39==0, median(M39), M39 ) ,
                          M42 = ifelse(M42==0, median(M42), M42 ) ,
                          M44 = ifelse(M44==0, median(M44), M44 ) ,
                          M45 = ifelse(M45==0, median(M45), M45 ) ,
                          M01 = ifelse(M01==0, median(M01), M01 ) ) -> chemX_test_nz 
                           

sk_test_nz = skim(chemX_test_nz)

sk_test_nz %>% 
  mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
          zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
          zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
          )  %>%
  filter( zscore_extreme > 4 , numeric.p0 == 0 ) %>%
  arrange( desc(zscore_extreme)) %>% yank("numeric")

skim(chemX_test_nz) %>% filter( n_missing > 0 ) %>% arrange( desc( n_missing) ) %>% yank("numeric")
chemX_test_nz %>%                           # The dataframe
  filter_all( any_vars(is.na(.))) %>%    # include any rows where any column is NA
  select_if( ~ any(is.na(.))) -> missing_test_nz  
       # only include those columns where some row has NA in that column

missing_test_nz %>% kable(caption = "missing test data - rows") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")                                                                                             
missing_test_nz %>%
  mutate( num_na = rowSums( is.na(.) ) ) -> missing_summary_test_nz

num_na_test_cells = sum(missing_summary_test_nz$num_na)

missing_summary_test_nz %>%  # add a count of columns with NA values
  select(num_na ) %>%        # pull out the count
  arrange( desc(num_na)) %>% # sort by the count descending
  bind_rows(., tibble( num_na = num_na_test_cells ) ) %>%
  kable(caption = "Test Data with missing predictor values") %>%
  kable_styling(bootstrap_options = c("hover", "striped"),
                position = "left"
                )

pct_missing_test = num_na_test_cells/ ( nrow(chemX_test_nz) * ncol(chemX_test_nz)) * 100
skim(chemX_train_nz) %>% filter( n_missing > 0 ) %>% arrange( desc( n_missing) ) %>% yank("numeric")
chemX_train_nz %>%                           # The dataframe
  filter_all( any_vars(is.na(.))) %>%    # include any rows where any column is NA
  select_if( ~ any(is.na(.))) -> missing_train_nz     # only include those columns where some row has NA in that column

missing_train_nz %>% kable(caption = "missing training data - rows") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
missing_train_nz %>%
  mutate( num_na = rowSums( is.na(.) ) ) -> missing_summary_train_nz

num_na_cells_train = sum(missing_summary_train_nz$num_na)

missing_summary_train_nz %>%  # add a count of columns with NA values
  select(num_na ) %>%        # pull out the count
  arrange( desc(num_na)) %>% # sort by the count descending
  bind_rows(., tibble( num_na = num_na_cells_train ) ) %>%
  kable(caption = "Training Data with missing predictor values") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")


pct_missing_train = num_na_cells_train/ ( nrow(chemX_train_nz) * ncol(chemX_train_nz)) * 100

# Build the caret function to preprocess the Chemical data and impute missing values.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(chemX_train_nz, method = c("center", "scale", "knnImpute"))

# Becomes the source data for the model building
chemX_train_pp = predict( preProcFunc,  chemX_train_nz )

# Becomes the final version of test data for prediction
chemX_test_pp  = predict( preProcFunc,  chemX_test_nz )


chemX_train_pp %>%
  mutate(id = row_number()) %>% 
  pivot_longer( !id, names_to = "X", values_to = "Val" ) -> piv_chemX_train
plot1 = piv_chemX_train %>% filter(!is.na(Val)) %>% ggplot(aes(x=Val)) + 
  facet_wrap( ~X ,  scales = "free" ,  ncol = 4) + 
  geom_histogram(aes(y = ..density..), binwidth = .25,  fill = "red", alpha = 0.5) +
  geom_density( fill="blue", alpha = 0.3) +
  labs(title = "Training Data: Histogram/Density of Imputed Standardized Predictors")

plot1

chemX_test_pp %>% 
  mutate(id = row_number()) %>% 
  pivot_longer( !id, names_to = "X", values_to = "Val" ) -> piv_chemX_test
plot2 = piv_chemX_test %>% filter(!is.na(Val)) %>% ggplot(aes(x=Val)) + 
  facet_wrap( ~X ,  scales = "free" ,  ncol = 4) + 
  geom_histogram(aes(y = ..density..), binwidth = .25,  fill = "red", alpha = 0.5) +
  geom_density( fill="blue", alpha = 0.3) +
  labs(title = "Test Data: Histogram and Density of Standardized Predictors")

plot2

piv_chemX = rbind(piv_chemX_test %>% mutate(Set = "Test"), 
                  piv_chemX_train %>% mutate(Set = "Train") )


plot3 = piv_chemX %>% ggplot(aes(x=Val, fill = Set )) + 
  facet_wrap( ~X ,  scales = "free" ,  ncol = 4) + 
  geom_density( alpha = 0.5) +
  labs(title = "Data: Density of Imputed Training and Test Standardized Predictors")

plot3
repeats = 9

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = repeats,  selectionFunction = "oneSE")

set.seed(102030)

plsTune <- train(x = chemX_train_pp, y = chemY_train,
                 method = "pls",
                 tuneGrid = expand.grid(ncomp = 1:20),
                 trControl = ctrl ,
                 preProcess = c("center", "scale") ,
                 metric = "Rsquared"
                 )
plsTune

plsResamples <- plsTune$results
plsResamples$Model <- "PLS"

plsPlotData <- plsResamples

xyplot(RMSE ~ ncomp,
       data = plsPlotData,
       #aspect = 1,
       xlab = "# Components",
       ylab = "RMSE (Cross-Validation)",
       auto.key = list(columns = 2),
       groups = Model,
       main = "Chemical Manufacturing Process" ,
       type = c("o") )


xyplot(Rsquared ~ ncomp,
       data = plsPlotData,
       #aspect = 1,
       xlab = "# Components",
       ylab = "R2 (Cross-Validation)",
       auto.key = list(columns = 2),
       groups = Model,
       main = "Chemical Manufacturing Process",
       type = c("o"))


pred_chemY = predict(plsTune, newdata = chemX_test_pp)

(testRMSE_chem = caret::RMSE(pred_chemY, chemY_test ))
(testR2_chem   = caret::R2(  pred_chemY, chemY_test) )

xyplot( chemY_test ~ pred_chemY, type = c("p", "g"), 
        main = "PLS Model Prediction vs. Observation of Test Data",
        xlab = "Predicted", 
        ylab = "Observed", panel = function(...){ 
  panel.abline(a = 0, b = 1 , lty = 2)
  panel.xyplot(...)
  })


xyplot( chemY_test - pred_chemY ~ pred_chemY, type = c("p", "g"), 
        main = "PLS Model Prediction vs. Residual of Test Data",
        xlab = "Predicted", ylab = "Residual", panel = function(...){ 
  panel.abline(a = 0, b = 0 , lty = 2)
  panel.xyplot(...)
  })



test_results = data.frame(dataset = c("Training Data", "Test Data") ,
                          RMSE  = c(  plsTune$results$RMSE[ plsTune$bestTune$ncomp] , testRMSE_chem ) ,
                          RSquared = c( plsTune$results$Rsquared[ plsTune$bestTune$ncomp],  testR2_chem  ) ,
                          NumObservations = c( nrow( chemX_train_pp) , nrow( chemX_test_pp ) )
                          )

test_results %>% kable(caption = "All Datasets using PLS Tuned Model", digits = 2 ) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

importanceScores =  varImp(plsTune, scale = FALSE)


plot(importanceScores , top=10 )

vars_to_plot = c("M32", "M13", "M36", "M09")

theme1 <- trellis.par.get()
theme1$plot.symbol$col = rgb(.2, .2, .2, .4)
theme1$plot.symbol$pch = 16
theme1$plot.line$col = "green"
theme1$plot.line$lwd = 2
trellis.par.set(theme1)

caret::featurePlot(x = chemX_train_pp[, vars_to_plot], 
                   y = chemY_train , plot = "scatter" ,
                   type = c("p", "smooth") ,
                   span = 0.5, 
                   layout = c(4,1))