Data 624 HW 7 (Week 10) Linear Regression

Alexander Ng

Due 11/7/2021

Overview

Exercises from Kuhn and Johnson Applied Predictive Modeling, Chapter 6 Linear Regression and Its Cousins. All R code is displayed at the end for clarity.

Exercise 6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  (a) Start R and use these commands to load the data:
> library(AppliedPredictiveModeling)
> data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

  (b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

  (c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?

  (d) Predict the response for the test set. What is the test set estimate of \(R^2\)?

  (e) Try building other models discussed in this chapter. Do any have better predictive performance?

  (f) Would you recommend any of your models to replace the permeability laboratory experiment?

6.2a Response

We load the permeability dataset below and display the dimensions of its predictors and response.

library(AppliedPredictiveModeling)
data("permeability")
## [1] "Dimensions of fingerprints: rows: 165 columns: 1107"
## [1] "Dimensions of permability: rows: 165 columns: 1"

6.2b Response

We use nearZeroVar to identify predictors with very low variance or large imbalance between the highest frequency values. The default algorithm uses a 95/5 imbalance ratio between the most common and second most common value.

For a binary 0-1 variable \(X\), the 95/5 ratio is equivalent to

\[ | E[X] - 0.5 | \geq 0.45 \]

That is, near zero variance is equivalent to a mean close to 0 or close to 1.
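As a concrete check of that threshold, here is a small toy example (toy data, not from the assignment; the column names are made up) showing which binary predictors nearZeroVar flags:

library(caret)
# Toy data: 165 rows to mirror the fingerprint matrix.
# "rare" has a 160/5 split (ratio 32 > 19), so it should be flagged;
# "common" has a 140/25 split (ratio 5.6 < 19), so it should not.
toy <- data.frame(rare   = c(rep(0, 160), rep(1, 5)),
                  common = c(rep(0, 140), rep(1, 25)))
nearZeroVar(toy, names = TRUE)   # expect "rare"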

The first 5 near zero variance predictors and their summary statistics are displayed.

Data summary
Name fingerprints
Number of rows 165
Number of columns 1107
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
X7 0 1 1.00 0.00 1 1 1 1 1 ▁▁▇▁▁
X8 0 1 1.00 0.00 1 1 1 1 1 ▁▁▇▁▁
X9 0 1 0.02 0.13 0 0 0 0 1 ▇▁▁▁▁
X10 0 1 0.02 0.15 0 0 0 0 1 ▇▁▁▁▁
X13 0 1 0.96 0.20 0 1 1 1 1 ▁▁▁▁▇
## [1] "Dimensions of fingerprint predictors retained: rows: 165 columns: 388"

We conclude that 719 predictors were removed and 388 predictors are left for modeling.

6.2c Response

We look at the distribution of the response variable permeability below. Since it is skewed to the right, a Box-Cox transformation may help in linear-type regression.


We investigate the guerrero algorithm from fabletools and compare the transformation to a log transform. As shown below, the guerrero transform with \(\lambda = -0.148\) produces a more symmetric density plot (in blue) than a log transform with \(\lambda = 0\) (in green). Therefore, we choose to use \(\lambda = -0.148\).

## [1] -0.1486984
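Since \(\lambda \neq 0\), all model fitting and error metrics below are on the Box-Cox scale. A minimal sketch of mapping a prediction back to raw permeability units, assuming the inv_box_cox helper from fabletools (loaded via fpp3, and documented alongside box_cox) is available:

# Sketch: predictions below live on the Box-Cox scale; inv_box_cox
# (fabletools) maps them back to raw permeability units.
library(fabletools)
lambda <- -0.1486984
bc <- box_cox(10, lambda)    # transform a raw permeability of 10
inv_box_cox(bc, lambda)      # recovers 10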

To split the data into a test and training set, we use an 80%-20% split of training and test data. We use a stratified random sampling approach by forming quintiles on the response variable permeability. Within each quintile, we divide the observations into training and test partitions. Therefore, the summary statistics of the test and training data should look reasonably similar.

The test data will be used in 6.2d. The training data will be used in this section to choose a model and the number of PLS components.

#  When using caret::createDataPartition with a continuous response,
#  the default behavior is to take a stratified random sample within
#  5 groups defined by quintiles of the variable permeability.
#
set.seed(10023)
train_indices = createDataPartition(bcperm , times = 1, list = FALSE, p = 0.8 )

The summary statistics of the training data containing 133 observations are below.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.4933  0.4243  1.4170  1.2086  2.2498  3.0250
## [1] 133

These are similar to those of the test data containing 32 observations:

## [1] 32
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.6453  0.5139  1.4281  1.3740  2.2753  2.9718

We now tune the PLS model using 10-fold cross validation repeated 9 times.

# perm_df defines the permeability dataframe joined to the reduced set of "big-variance" predictors
perm_df = data.frame( cbind(bcperm, fingerprints_bv ) )

perm_train = perm_df[train_indices,]
perm_trainY = perm_train[,1]
perm_trainX = perm_train[,2:ncol(perm_train)]

perm_test  = perm_df[ -train_indices,]
perm_testY = perm_test[,1]
perm_testX = perm_test[,2:ncol(perm_test)]

The model specification and training parameters below are used to find the optimal PLS components.

repeats = 9   # 9 repeats of 10-fold CV

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = repeats,  selectionFunction = "oneSE")

set.seed(102030)

plsTune <- train(x = perm_trainX, y = perm_trainY,
                 method = "pls",
                 tuneGrid = expand.grid(ncomp = 1:20),
                 trControl = ctrl ,
                 preProcess = c("center", "scale")
                 )
plsTune
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold, repeated 9 times) 
## Summary of sample sizes: 120, 119, 121, 120, 120, 121, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     1.1141205  0.3143435  0.8807533
##    2     1.0618425  0.3738377  0.8347404
##    3     1.0436316  0.4032896  0.7988877
##    4     1.0574966  0.4032326  0.8202256
##    5     1.0504322  0.4331503  0.8261032
##    6     1.0312813  0.4515938  0.8001931
##    7     1.0049849  0.4780387  0.7695146
##    8     0.9770128  0.5103656  0.7495774
##    9     0.9554692  0.5376859  0.7339096
##   10     0.9591361  0.5406371  0.7396557
##   11     0.9730300  0.5400550  0.7454174
##   12     0.9921338  0.5286890  0.7627678
##   13     1.0094432  0.5205338  0.7761608
##   14     1.0230539  0.5153133  0.7869745
##   15     1.0408789  0.5024099  0.8031697
##   16     1.0588834  0.4920740  0.8209528
##   17     1.0831165  0.4782526  0.8364812
##   18     1.1023218  0.4686852  0.8466800
##   19     1.1116160  0.4693260  0.8475908
##   20     1.1291208  0.4613194  0.8582763
## 
## RMSE was used to select the optimal model using  the one SE rule.
## The final value used for the model was ncomp = 8.

The model output above states that ncomp equal to 8 is the best answer according to the one standard error rule introduced by Leo Breiman. The rule is set with the selectionFunction parameter in the trainControl tuning setup prior to model tuning.
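For reference, the same selection can be reproduced by hand with caret's exported oneSE helper. This is a sketch assuming the resampled results live in plsTune$results with RMSE and RMSESD columns (num = 90 because we ran 10 folds times 9 repeats):

# Re-derive the one-SE choice from the resampling table.
best_row <- caret::oneSE(plsTune$results, metric = "RMSE", num = 90, maximize = FALSE)
plsTune$results[best_row, c("ncomp", "RMSE", "Rsquared")]   # ncomp = 8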

In the plots below, we show the RMSE and \(R^2\) from the above tuning exercise plotted against the different number of PLS components or latent variables.

The cross-validation \(R^2\) and RMSE profiles both support the choice of ncomp equal to 8 components. We agree with this finding.

The corresponding cross-validated \(R^2\) is 0.5103656.

6.2d Response

Next, we compute \(R^2\) and \(RMSE\) of the test data below.

pred_perm = predict(plsTune, newdata = perm_testX)

(testRMSE = caret::RMSE(pred_perm, perm_testY ))
## [1] 0.9154151
(testR2 = caret::R2(pred_perm, perm_testY) )
## [1] 0.5405468

The test set RMSE of the model is 0.9154151 and the test set \(R^2\) is 0.5405468. We plot the diagnostic plots below.

6.2e Response

We will consider applying the same caret model fitting approach using PCR and Elastic Net models to evaluate performance. A table of the test data results at the end of this section shows that the elastic net is the top performing model.

Using the same training and test data split and cross validation counts, we observe that PCR requires more latent variables: the PCR model needs \(ncomp=11\) latent variables to achieve its chosen accuracy, whereas PLS regression requires only 8 components.

set.seed(102030)

pcrTune <- train(x = perm_trainX, y = perm_trainY,
                 method = "pcr",
                 tuneGrid = expand.grid(ncomp = 1:20),
                 trControl = ctrl ,
                 preProcess = c("center", "scale")
                 )
pcrTune
## Principal Component Analysis 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold, repeated 9 times) 
## Summary of sample sizes: 120, 119, 121, 120, 120, 121, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared    MAE      
##    1     1.293126  0.07920901  1.0906939
##    2     1.296688  0.07994117  1.0925286
##    3     1.210043  0.20196183  1.0031270
##    4     1.145841  0.26716680  0.9388025
##    5     1.142884  0.27844073  0.9286222
##    6     1.150580  0.27240294  0.9330602
##    7     1.135464  0.28994583  0.9095592
##    8     1.090564  0.33981803  0.8674796
##    9     1.102164  0.32826835  0.8783589
##   10     1.078221  0.35024938  0.8574133
##   11     1.063008  0.37340710  0.8376561
##   12     1.044617  0.39567077  0.8173886
##   13     1.049565  0.38835493  0.8163060
##   14     1.060419  0.37781904  0.8176979
##   15     1.070093  0.36982634  0.8230622
##   16     1.078642  0.36160264  0.8282541
##   17     1.091929  0.34890402  0.8388649
##   18     1.093206  0.34732713  0.8443113
##   19     1.098423  0.34154747  0.8478375
##   20     1.108231  0.33413103  0.8520652
## 
## RMSE was used to select the optimal model using  the one SE rule.
## The final value used for the model was ncomp = 11.

We conclude that the PCR model prefers to use ncomp = 11 latent variables, three more than the PLS model. This means PLS is more parsimonious in addition to showing higher predictive accuracy in cross validation.

The plots above for both the PLS and PCR models show that PLS outperforms PCR on cross-validated RMSE in addition to forming the more parsimonious model.

pred_perm_pcr = predict(pcrTune, newdata = perm_testX)

(testRMSE_pcr = caret::RMSE(pred_perm_pcr, perm_testY ))
## [1] 0.9518364
(testR2_pcr = caret::R2(pred_perm_pcr, perm_testY) )
## [1] 0.2733075

The test set RMSE of the model is 0.9518364 and the test set \(R^2\) is 0.2733075. We plot the diagnostic plots below. Both look reasonable.

Lastly, we consider the elastic net model using the elasticnet package. We make two alterations to the training procedure: a single (unrepeated) 10-fold cross validation, and the "best" selection rule instead of the one-SE rule:

set.seed(102030)

ctrl <- trainControl(method = "cv",  selectionFunction = "best")

enetGrid <- expand.grid(lambda = c(0, 0.05, .1 , 0.5,  0.8 ),    
                        fraction = seq(.01, 1, length = 10))

enetTune <- train(x = perm_trainX, y = perm_trainY ,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = ctrl ,
                  preProcess = c("center", "scale")
                  )

We display the parameter tunings below, which show that the elastic net outperforms the PLS and PCR regressions. The final elastic net model uses the ridge penalty \(\lambda = 0.1\) and the lasso fraction \(s = 0.12\) in our specification.

## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 121, 120, 120, 121, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE       Rsquared   MAE      
##   0.00    0.01      1.1704448  0.4450415  0.9524403
##   0.00    0.12      0.9469834  0.5643682  0.7574892
##   0.00    0.23      1.0022209  0.5514359  0.7872867
##   0.00    0.34      1.0918181  0.5301740  0.8540700
##   0.00    0.45      1.1860356  0.5027960  0.9304146
##   0.00    0.56      1.2873792  0.4838903  1.0034147
##   0.00    0.67      1.3983954  0.4586070  1.1211482
##   0.00    0.78      1.4963757  0.4368210  1.1994294
##   0.00    0.89      1.5939157  0.4202430  1.2808476
##   0.00    1.00      1.6725596  0.4105048  1.3382308
##   0.05    0.01      1.1983976  0.4256382  0.9710976
##   0.05    0.12      1.2870747  0.5416545  0.9654491
##   0.05    0.23      1.6818005  0.5410243  1.2627384
##   0.05    0.34      2.0753044  0.5408564  1.5484362
##   0.05    0.45      2.4854776  0.5305450  1.8570758
##   0.05    0.56      2.8553998  0.5187479  2.1129468
##   0.05    0.67      3.2196498  0.5094695  2.3669359
##   0.05    0.78      3.5790047  0.5074803  2.6374493
##   0.05    0.89      3.9390078  0.5061804  2.9070814
##   0.05    1.00      4.3001080  0.5051146  3.1775626
##   0.10    0.01      1.1988166  0.4369915  0.9993257
##   0.10    0.12      0.8498910  0.6087014  0.6557170
##   0.10    0.23      0.8744997  0.5841962  0.6739327
##   0.10    0.34      0.8810256  0.5803549  0.6742691
##   0.10    0.45      0.8976766  0.5780301  0.6905788
##   0.10    0.56      0.9120180  0.5697710  0.7075800
##   0.10    0.67      0.9225326  0.5617471  0.7173633
##   0.10    0.78      0.9278022  0.5587795  0.7223083
##   0.10    0.89      0.9295182  0.5582120  0.7229917
##   0.10    1.00      0.9319942  0.5570152  0.7244517
##   0.50    0.01      1.2205188  0.4369915  1.0213279
##   0.50    0.12      0.8832301  0.5795634  0.6860241
##   0.50    0.23      0.8930349  0.5853319  0.7053811
##   0.50    0.34      0.9475531  0.5684995  0.7483258
##   0.50    0.45      0.9738020  0.5646285  0.7660949
##   0.50    0.56      0.9976230  0.5618253  0.7815977
##   0.50    0.67      1.0158279  0.5571365  0.7935008
##   0.50    0.78      1.0249255  0.5567606  0.8037094
##   0.50    0.89      1.0290046  0.5597098  0.8102308
##   0.50    1.00      1.0323044  0.5617373  0.8156342
##   0.80    0.01      1.2202535  0.4369915  1.0208854
##   0.80    0.12      0.8894920  0.5684689  0.6897530
##   0.80    0.23      0.9177546  0.5833350  0.7330459
##   0.80    0.34      1.0240080  0.5577688  0.8173089
##   0.80    0.45      1.0754851  0.5486166  0.8551375
##   0.80    0.56      1.1138559  0.5445552  0.8876842
##   0.80    0.67      1.1408920  0.5407047  0.9070979
##   0.80    0.78      1.1577136  0.5396857  0.9231084
##   0.80    0.89      1.1677908  0.5415257  0.9340646
##   0.80    1.00      1.1738798  0.5446962  0.9421534
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.12 and lambda = 0.1.

Below, we compute the out of sample test results and then plot model diagnostics.

pred_enet = predict(enetTune, newdata = perm_testX)

(testRMSE_enet = caret::RMSE(pred_enet, perm_testY ))
## [1] 0.7277125
(testR2_enet = caret::R2(pred_enet, perm_testY) )
## [1] 0.568133

The above plots look reasonable.

Lastly, we observe that the elastic net model has reduced the predictor dimension from 388 to only 45 predictors with non-zero coefficients. The predictors and their elastic net coefficients are listed below in descending order of coefficient value.

coefs = predict.enet(enetTune$finalModel, s=enetTune$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients 

nonzero_coefs = coefs[ abs(coefs) > 0.00001 ]
var_names = names(nonzero_coefs)

# unname() prevents the variable names from reappearing as row names.
enet_coefs = data.frame( names = var_names, coefs = unname(nonzero_coefs) )

dim(enet_coefs)
## [1] 45  2

For completeness, we list the coefficients in tabular format too.

Elastic Net Non-zero Coefficients

names coefs
X6 0.494
X698 0.080
X719 0.080
X141 0.050
X133 0.027
X138 0.027
X16 0.027
X315 0.024
X316 0.007
X318 0.007
X317 0.007
X108 0.007
X103 0.007
X340 0.001
X344 0.001
X553 0.000
X551 0.000
X554 0.000
X304 -0.001
X305 -0.001
X298 -0.001
X280 -0.001
X281 -0.001
X300 -0.001
X299 -0.001
X293 -0.009
X613 -0.015
X621 -0.015
X258 -0.024
X245 -0.028
X244 -0.028
X240 -0.028
X253 -0.028
X246 -0.028
X254 -0.028
X239 -0.028
X157 -0.028
X99 -0.029
X307 -0.031
X295 -0.031
X294 -0.031
X308 -0.031
X271 -0.034
X158 -0.050
X93 -0.277

To summarize our findings in the table below, elastic net outperforms PLS and PCR by both RMSE and R-Squared on the test dataset.

All Models on Test DataSet

model RMSE RSquared NumVars
Partial Least Squares 0.92 0.54 8
Principal Components Regression 0.95 0.27 11
Elastic Net 0.73 0.57 45

6.2f Response

None of the three models produced here should be relied upon for production purposes: even the best model, the elastic net, achieves a test set \(R^2\) of only 0.57, leaving substantial unexplained variance.

(Table 1 from the original PAMPA assay paper, omitted here.)

One possible way to judge the utility of the model is to bin permeability into 3 classes (low, medium and high). By transforming the regression problem into a classification problem, we can evaluate whether the model achieves the roughly 80% accuracy level that appears acceptable to the authors of the original paper, as sketched below.
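A hedged sketch of that binning idea, reusing the objects above (the tercile thresholds and class labels are illustrative assumptions, not taken from the original paper):

# Cut observed and predicted permeability (Box-Cox scale) into
# low/medium/high classes at the training-set terciles, then measure
# classification accuracy against the ~80% benchmark.
breaks   <- quantile(perm_trainY, probs = c(0, 1/3, 2/3, 1))
labels   <- c("low", "medium", "high")
obs_cls  <- cut(perm_testY, breaks = breaks, labels = labels, include.lowest = TRUE)
pred_cls <- cut(pred_enet,  breaks = breaks, labels = labels, include.lowest = TRUE)
mean(pred_cls == obs_cls, na.rm = TRUE)  # NA if a value falls outside the training range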

The experimental equipment should continue to be used in tandem with the model, with ongoing assessment of the model's predictive accuracy based on the provided information.

Exercise 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

  (a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)

data(chemicalManufacturing)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

  (b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

  (c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

  (d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

  (e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

  (f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Response 6.3a

There is a minor typo in the 6.3 assignment.

– The dataset is named ChemicalManufacturingProcess rather than the name given above. The authors' book errata webpage acknowledges this error.

We save the predictors, which are columns 2 to 58 of ChemicalManufacturingProcess, into a dataframe chemX of \(N=176\) rows and \(M=57\) columns, and the response into chemY, a dataframe of \(N=176\) rows and \(M=1\) column.

data("ChemicalManufacturingProcess")

chemX = ChemicalManufacturingProcess[2:58]
chemY = ChemicalManufacturingProcess[1]
dim(chemX)
## [1] 176  57
dim(chemY)
## [1] 176   1

Due to the long variable names, we will abbreviate all predictors as follows:

chemX %>% rename_at(vars(starts_with("ManufacturingProcess") ), 
                    ~ str_replace(., "ManufacturingProcess", "M")) %>% 
  rename_at(vars(starts_with("Biological")), 
            ~ str_replace( ., "BiologicalMaterial", "B")) -> chemX2

Response 6.3b and 6.3c Combined

Although exercise 6.3b asks us to do imputation before splitting the data, that is bad practice because the imputation can then use properties of the test data to alter the training data. The predictions would suffer look-ahead bias, and the measured model accuracy would be artificially inflated.

We perform data splitting in 6.3c before imputation and preprocessing. Then we will tune a PLS model.

Splitting the data

We split the \(N\) observations into a 75-25 train/test split using stratified random sampling on the response variable Yield. Yield has no missing data, which is helpful for stratified random sampling. Also, no transformation of the response Yield is warranted, as its histogram looks symmetric. This is shown in the skim output of Yield below and its histogram.

Data summary
Name chemY
Number of rows 176
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Yield 0 1 40.18 1.85 35.25 38.75 39.97 41.48 46.34 ▁▇▇▃▁

We split the data into 75/25 train/test sets below and store the partitions in the necessary dataframes.

set.seed(19372)
train_indices = createDataPartition(chemY$Yield , times = 1, list = FALSE, p = 0.75 )

Near Zero Variance

We drop the near-zero-variance predictor B07 from both the train and test data sets. The number of predictors declines from \(M=57\) to \(M=56\) as a result. The results are shown below.

## [1] "Training data removed variables:B07"
## [1] "Dimension of train data before and after near zero variance check: "
## [1] 132  57
## [1] 132  56
## [1] "Test data removed variables:B07"
## [1] "Dimension of test data before and after near zero variance check: "
## [1] 44 57
## [1] 44 56

Outliers and Zeroes

Next we identify and fix zeros and outliers in the predictors before imputing missing values, because the kNN algorithm may be affected by outliers. The table below uses skim to report the minimum, maximum, median, mean and standard deviation for each predictor. In addition, we construct Z-score metrics that tell us how many standard deviations the minimum and maximum lie from the mean. For example, the values of the predictor \(M18\) are clustered around 4800 except for observation 13, which has a value of 0.

\[ \text{zscore\_min}(M18) = \frac{ \min(M18) - \mu(M18) }{\sigma(M18)}, \qquad \text{zscore\_max}(M18) = \frac{ \max(M18) - \mu(M18) }{\sigma(M18)} \]

We identify 16 predictors with this issue in the table below and replace the zero outliers with the median.

sk_train_bv = skim(chemX_train_bv)

sk_train_bv %>% 
  mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
          zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
          zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
          )  %>%
  filter( zscore_extreme > 4 , numeric.p0 == 0 ) %>%
  arrange( desc(zscore_extreme)) %>% yank("numeric")

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist zscore_min zscore_max zscore_extreme
M18 0 1.00 4800.09 423.29 0 4813.00 4834.00 4862.0 4971.0 ▁▁▁▁▇ -11.34 0.40 11.34
M20 0 1.00 4545.26 402.00 0 4550.00 4579.00 4605.5 4759.0 ▁▁▁▁▇ -11.31 0.53 11.31
M16 0 1.00 4554.23 404.48 0 4554.75 4585.50 4619.0 4852.0 ▁▁▁▁▇ -11.26 0.74 11.26
M26 3 0.98 6001.38 534.47 0 6018.00 6041.00 6069.0 6161.0 ▁▁▁▁▇ -11.23 0.30 11.23
M25 3 0.98 4816.81 429.30 0 4829.00 4852.00 4876.0 4990.0 ▁▁▁▁▇ -11.22 0.40 11.22
M27 3 0.98 4550.56 406.49 0 4556.00 4584.00 4607.0 4696.0 ▁▁▁▁▇ -11.19 0.36 11.19
M31 3 0.98 70.07 6.36 0 70.10 70.80 71.4 72.5 ▁▁▁▁▇ -11.01 0.38 11.01
M29 3 0.98 19.93 1.87 0 19.70 19.90 20.4 22.0 ▁▁▁▁▇ -10.65 1.11 10.65
M43 0 1.00 0.91 0.98 0 0.50 0.75 1.0 11.0 ▇▁▁▁▁ -0.92 10.27 10.27
M30 3 0.98 9.15 1.05 0 8.80 9.10 9.7 11.2 ▁▁▁▃▇ -8.74 1.96 8.74
M01 1 0.99 11.32 1.45 0 10.80 11.40 12.2 14.1 ▁▁▁▅▇ -7.80 1.91 7.80
M42 0 1.00 11.28 1.75 0 11.40 11.60 11.7 12.1 ▁▁▁▁▇ -6.45 0.47 6.45
M44 0 1.00 1.82 0.29 0 1.80 1.90 1.9 2.1 ▁▁▁▁▇ -6.22 0.96 6.22
M45 0 1.00 2.14 0.37 0 2.10 2.20 2.3 2.6 ▁▁▁▂▇ -5.72 1.21 5.72
M39 0 1.00 6.85 1.51 0 7.10 7.20 7.3 7.5 ▁▁▁▁▇ -4.54 0.43 4.54
M38 0 1.00 2.55 0.62 0 2.00 3.00 3.0 3.0 ▁▁▁▅▇ -4.09 0.73 4.09
## # A tibble: 0 x 15
## # … with 15 variables: skim_type <chr>, skim_variable <chr>, n_missing <int>,
## #   complete_rate <dbl>, numeric.mean <dbl>, numeric.sd <dbl>,
## #   numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>, numeric.p75 <dbl>,
## #   numeric.p100 <dbl>, numeric.hist <chr>, zscore_min <dbl>, zscore_max <dbl>,
## #   zscore_extreme <dbl>

The empty tibble above confirms that no flagged zero-outlier predictors remain after the replacement. We also observe that the variance of the affected predictors has gone down significantly. For example, the predictor M18 reduced its standard deviation from 423 to 44 (nearly 10 times smaller).

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
M18 0 1 4836.71 44.14 4701 4813.75 4834 4862 4971 ▁▃▇▃▁
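This claim can be verified directly from the two training data frames (a quick check, using the objects defined in the appendix code):

sd(chemX_train_bv$M18)   # ~423, before the zero is replaced
sd(chemX_train_nz$M18)   # ~44, after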

For the test dataset, 5 predictors have the same issue. So we apply the same changes.

sk_test_bv = skim(chemX_test_bv)

sk_test_bv %>% 
  mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
          zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
          zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
          )  %>%
  filter( zscore_extreme > 4 , numeric.p0 == 0 ) %>%
  arrange( desc(zscore_extreme)) %>% yank("numeric")

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist zscore_min zscore_max zscore_extreme
M39 0 1 6.86 1.52 0 7.10 7.20 7.30 7.4 ▁▁▁▁▇ -4.52 0.35 4.52
M42 0 1 10.98 2.44 0 11.38 11.50 11.62 12.0 ▁▁▁▁▇ -4.51 0.42 4.51
M44 0 1 1.76 0.40 0 1.80 1.80 1.90 2.1 ▁▁▁▁▇ -4.42 0.85 4.42
M45 0 1 2.12 0.50 0 2.10 2.20 2.30 2.5 ▁▁▁▁▇ -4.28 0.77 4.28
M01 0 1 10.86 2.63 0 10.67 11.35 12.00 14.1 ▁▁▁▆▇ -4.13 1.23 4.13

We apply the same treatment of zero values to the test dataset.


To summarize, we identified 16 predictors in the training set and 5 predictors in the test set with outlier zeros. By replacing these values with the median of their respective datasets, we reduced the variance of these predictors significantly. We chose not to apply the same outlier treatment to maximum values because we cannot be certain they are data errors. A more compact version of the replacement code is sketched below.
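The per-column mutate used for this replacement (shown in the code appendix) can be written more compactly. This is a sketch assuming dplyr >= 1.0 for across(); note it computes the median over the non-zero values, a slight refinement of the appendix code:

# Hypothetical helper: replace zeros with the median of the non-zero values.
fix_zeros <- function(df, vars) {
  df %>% mutate(across(all_of(vars),
                       ~ ifelse(. == 0, median(.[. != 0], na.rm = TRUE), .)))
}
# e.g. chemX_train_nz <- fix_zeros(chemX_train_bv, c("M18", "M20", "M16"))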

Missing Data Analysis

As we will show, the amount of missing data is small and its occurrence follows the same pattern in the test and training sets. We will do exploratory data analysis and data imputation on the training and test partitions separately.

Looking column-wise in the test data, 14 predictors have missing data. The 3 most impacted predictors are M03 (86.4% complete), M11 (93.2% complete) and M10 (95.5% complete). Only manufacturing process predictors have missing data, which suggests that future industrial applications should fix these issues at the factory. The skim output below displays the affected variables.

skim(chemX_test_nz) %>% filter( n_missing > 0 ) %>% arrange( desc( n_missing) ) %>% yank("numeric")

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
M03 6 0.86 1.54 0.03 1.48 1.52 1.54 1.55 1.60 ▁▂▇▁▁
M11 3 0.93 9.41 0.73 7.70 9.00 9.40 9.90 11.30 ▂▇▇▅▁
M10 2 0.95 9.21 0.81 7.50 8.83 9.20 9.60 11.40 ▃▇▇▅▁
M25 2 0.95 4863.07 36.61 4785.00 4844.75 4866.50 4883.75 4966.00 ▂▅▇▂▁
M26 2 0.95 6059.26 40.39 5973.00 6038.00 6052.50 6073.75 6137.00 ▁▃▇▁▃
M27 2 0.95 4599.21 43.36 4518.00 4575.00 4594.00 4621.50 4710.00 ▃▇▆▂▂
M28 2 0.95 7.68 4.93 0.00 0.00 10.50 10.95 11.40 ▃▁▁▁▇
M29 2 0.95 20.26 0.68 19.30 19.70 20.05 20.70 21.70 ▇▇▅▁▅
M30 2 0.95 9.19 0.73 7.60 8.72 9.15 9.80 10.90 ▂▇▆▇▁
M31 2 0.95 70.53 1.17 67.90 69.70 70.70 71.40 72.30 ▂▂▃▇▅
M33 2 0.95 63.40 2.24 58.00 62.25 63.50 64.75 68.00 ▂▂▇▃▁
M34 2 0.95 2.51 0.06 2.40 2.50 2.50 2.50 2.60 ▂▁▇▁▂
M35 2 0.95 496.12 11.05 463.00 490.00 497.00 504.75 517.00 ▁▃▇▇▅
M36 2 0.95 0.02 0.00 0.02 0.02 0.02 0.02 0.02 ▁▇▁▇▃

Next, we look row-wise at which observations have the most missing values across their predictor columns and display the results in order. A total of 9 observations have missing data points across the 14 predictors.

chemX_test_nz %>%                           # The dataframe
  filter_all( any_vars(is.na(.))) %>%    # include any rows where any column is NA
  select_if( ~ any(is.na(.))) -> missing_test_nz  
       # only include those columns where some row has NA in that column

missing_test_nz %>% kable(caption = "missing test data - rows") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")                                                                                             

missing test data - rows

M03 M10 M11 M25 M26 M27 M28 M29 M30 M31 M33 M34 M35 M36
2 NA NA NA 4869 6107 4630 11.2 21.4 9.9 68.7 66 2.6 508 0.019
3 NA NA NA 4897 6116 4637 11.1 21.3 9.4 69.3 66 2.6 509 0.018
15 NA 9.4 10.2 4874 6125 4659 11.3 21.6 9.9 68.5 64 2.5 490 0.019
17 NA 9.0 10.0 4860 6100 4659 11.1 21.3 10.0 68.7 68 2.5 495 0.019
18 NA 9.5 10.6 4815 6055 4631 10.8 20.9 10.6 69.0 64 2.5 497 0.020
19 NA 8.9 9.8 4876 6109 4696 11.1 21.4 9.8 68.8 64 2.4 514 0.021
98 1.55 9.1 NA 4869 6050 4574 10.5 19.9 8.8 71.3 64 2.5 493 0.019
173 1.56 9.5 9.9 NA NA NA NA NA NA NA NA NA NA NA
176 1.55 9.9 10.3 NA NA NA NA NA NA NA NA NA NA NA
missing_test_nz %>%
  mutate( num_na = rowSums( is.na(.) ) ) -> missing_summary_test_nz

num_na_test_cells = sum(missing_summary_test_nz$num_na)

missing_summary_test_nz %>%  # add a count of columns with NA values
  select(num_na ) %>%        # pull out the count
  arrange( desc(num_na)) %>% # sort by the count descending
  bind_rows(., tibble( num_na = num_na_test_cells ) ) %>%
  kable(caption = "Test Data with missing predictor values") %>%
  kable_styling(bootstrap_options = c("hover", "striped"),
                position = "left"
                )

Test Data with missing predictor values

num_na
173 11
176 11
2 3
3 3
15 1
17 1
18 1
19 1
98 1
…10 33
pct_missing_test = num_na_test_cells/ ( nrow(chemX_test_nz) * ncol(chemX_test_nz)) * 100

Tallying the NA values across all cells in the test data, we found 33 cells of missing data, or about 1.34 percent of the test data. Looking at the table above, of the 9 observations with missing values, two (173 and 176) have 11 predictors with missing values each. The rest of the observations are much less affected.

In the training data, 28 manufacturing predictors have missing data. All biological predictors have complete data. The 3 most impacted predictors are M03, M10, M11 with 93.2%, 94.7%, 94.7% completeness rates. This is consistent with the pattern in the test data. The most common completeness rate is 99.2% (1 missing value).

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
M03 9 0.93 1.54 0.02 1.47 1.54 1.54 1.55 1.60 ▁▂▆▇▁
M10 7 0.95 9.17 0.76 7.70 8.70 9.00 9.50 11.60 ▃▇▆▁▁
M11 7 0.95 9.38 0.72 7.50 9.00 9.40 9.90 11.50 ▂▆▇▅▁
M25 4 0.97 4854.45 40.40 4721.00 4829.00 4852.50 4876.00 4990.00 ▁▃▇▃▁
M26 4 0.97 6048.27 45.79 5901.00 6018.75 6041.50 6069.50 6161.00 ▁▂▇▃▂
M27 4 0.97 4586.11 46.99 4416.00 4556.75 4584.50 4607.25 4696.00 ▁▁▇▇▂
M29 4 0.97 20.09 0.61 18.90 19.70 19.90 20.40 22.00 ▂▇▂▁▁
M30 4 0.97 9.22 0.66 7.50 8.80 9.15 9.70 11.20 ▁▅▇▃▁
M31 4 0.97 70.62 1.36 59.80 70.18 70.80 71.40 72.50 ▁▁▁▂▇
M02 3 0.98 16.89 8.35 0.00 19.50 21.10 21.50 22.50 ▂▁▁▁▇
M28 3 0.98 6.24 5.32 0.00 0.00 10.30 10.70 11.50 ▆▁▁▁▇
M33 3 0.98 63.59 2.56 56.00 62.00 64.00 65.00 70.00 ▁▅▇▆▂
M34 3 0.98 2.49 0.05 2.30 2.50 2.50 2.50 2.60 ▁▂▁▇▁
M35 3 0.98 495.43 10.78 468.00 490.00 495.00 501.00 522.00 ▁▂▇▃▁
M36 3 0.98 0.02 0.00 0.02 0.02 0.02 0.02 0.02 ▂▇▇▁▂
M01 2 0.98 11.41 1.06 9.00 10.80 11.40 12.20 14.10 ▃▅▇▆▁
M06 2 0.98 207.52 2.91 203.00 205.70 206.80 208.70 227.40 ▇▅▁▁▁
M04 1 0.99 932.34 6.14 911.00 928.00 934.00 937.00 946.00 ▁▂▃▇▁
M05 1 0.99 999.84 26.93 923.00 987.50 999.20 1008.10 1175.30 ▁▇▁▁▁
M07 1 0.99 177.50 0.50 177.00 177.00 177.00 178.00 178.00 ▇▁▁▁▇
M08 1 0.99 177.53 0.50 177.00 177.00 178.00 178.00 178.00 ▇▁▁▁▇
M12 1 0.99 868.13 1794.45 0.00 0.00 0.00 0.00 4549.00 ▇▁▁▁▂
M14 1 0.99 4852.15 52.70 4713.00 4826.50 4858.00 4881.00 4992.00 ▁▂▇▃▁
M22 1 0.99 5.53 3.34 0.00 3.00 5.00 8.00 12.00 ▇▆▇▅▅
M23 1 0.99 3.01 1.61 0.00 2.00 3.00 4.00 6.00 ▇▆▇▆▇
M24 1 0.99 8.99 5.76 0.00 4.00 8.00 14.00 23.00 ▆▇▃▅▁
M40 1 0.99 0.02 0.04 0.00 0.00 0.00 0.00 0.10 ▇▁▁▁▂
M41 1 0.99 0.02 0.05 0.00 0.00 0.00 0.00 0.20 ▇▁▁▁▁
chemX_train_nz %>%                           # The dataframe
  filter_all( any_vars(is.na(.))) %>%    # include any rows where any column is NA
  select_if( ~ any(is.na(.))) -> missing_train_nz     # only include those columns where some row has NA in that column

missing_train_nz %>% kable(caption = "missing training data - rows") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

missing training data - rows

M01 M02 M03 M04 M05 M06 M07 M08 M10 M11 M12 M14 M22 M23 M24 M25 M26 M27 M28 M29 M30 M31 M33 M34 M35 M36 M40 M41
1 NA NA NA NA NA NA NA NA NA NA NA 4898 NA NA NA 4873 6074 4685 10.7 21.0 9.9 69.1 66 2.4 486 0.019 NA NA
4 NA 0.0 NA 911 1014.6 213.3 177 177 NA NA 0 4897 5 2 5 4892 6111 4630 11.1 21.3 9.4 69.3 68 2.5 496 0.018 0.0 0.0
5 10.7 0.0 NA 918 1027.5 205.7 178 178 NA NA 0 4992 8 4 18 4930 6151 4684 11.3 21.6 9.0 69.4 70 2.5 468 0.017 0.0 0.0
6 12.0 0.0 NA 924 1016.8 208.9 178 178 NA NA 0 4985 9 1 1 4871 6128 4687 11.4 21.7 10.1 68.2 70 2.5 490 0.018 0.0 0.0
16 11.1 0.0 NA 928 1073.6 207.1 177 177 9.6 10.2 0 4805 12 4 16 4848 6095 4630 11.1 21.2 10.3 68.5 66 2.5 493 0.019 0.1 0.2
20 12.7 0.0 NA 929 1175.3 204.1 177 177 8.9 9.9 0 4961 4 3 4 4850 6094 4665 11.0 21.2 10.1 68.7 62 2.5 490 0.020 0.0 0.0
22 10.9 0.0 NA 936 1024.4 218.7 178 178 NA NA 0 4791 6 2 16 4918 6152 4618 11.4 21.1 8.8 70.1 68 2.4 479 0.018 0.0 0.0
23 11.1 0.0 NA 937 997.7 209.6 177 177 NA NA 0 NA 7 3 17 4906 6134 4626 11.2 21.0 8.9 70.1 66 2.4 492 0.019 0.0 0.0
24 11.3 0.0 NA 940 1031.3 215.1 177 177 NA NA 0 4864 1 1 1 4875 6095 4606 11.2 20.9 9.5 69.7 65 2.5 468 0.018 0.0 0.0
90 10.8 21.5 1.55 935 954.8 NA 178 178 8.4 8.3 0 4866 8 3 3 4897 6047 4572 10.3 19.7 8.1 72.2 64 2.5 493 0.020 0.0 0.0
108 9.8 21.9 1.50 934 987.9 208.9 178 178 9.4 9.3 0 4807 11 5 5 NA NA NA 0.0 NA NA NA 62 2.5 485 0.019 0.0 0.0
134 10.9 NA 1.55 934 1000.1 208.7 177 178 8.7 9.4 0 4875 2 2 13 4816 6037 4568 0.0 19.8 9.6 70.6 62 2.5 495 0.020 0.0 0.0
139 10.0 NA 1.55 935 986.1 205.2 177 178 9.5 9.5 0 4823 8 2 2 4822 6018 4606 0.0 19.8 9.6 70.6 60 2.6 497 0.020 0.0 0.0
172 12.8 21.5 1.54 935 1027.0 206.2 178 177 9.6 9.6 0 4829 2 2 8 NA NA NA NA NA NA NA NA NA NA NA 0.0 0.0
174 13.0 20.4 1.55 930 1040.0 208.7 178 177 10.1 10.4 0 4795 0 0 0 NA NA NA NA NA NA NA NA NA NA NA 0.0 0.0
175 14.1 21.6 1.55 935 1044.8 208.0 177 177 10.2 10.3 0 4793 0 0 0 NA NA NA NA NA NA NA NA NA NA NA 0.0 0.0
missing_train_nz %>%
  mutate( num_na = rowSums( is.na(.) ) ) -> missing_summary_train_nz

num_na_cells_train = sum(missing_summary_train_nz$num_na)

missing_summary_train_nz %>%  # add a count of columns with NA values
  select(num_na ) %>%        # pull out the count
  arrange( desc(num_na)) %>% # sort by the count descending
  bind_rows(., tibble( num_na = num_na_cells_train ) ) %>%
  kable(caption = "Training Data with missing predictor values") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

Training Data with missing predictor values

num_na
1 16
172 11
174 11
175 11
108 6
4 4
23 4
5 3
6 3
22 3
24 3
16 1
20 1
90 1
134 1
139 1
…17 80
pct_missing_train = num_na_cells_train/ ( nrow(chemX_train_nz) * ncol(chemX_train_nz)) * 100

We conclude that the train and test sets contain small and similar percentages of missing values (roughly 1.1% and 1.3%, respectively).

Due to the small percentage of missing data and after treatment of zero values, we expect kNN imputation to work well.

KNN Imputation

We will use knn imputation to replace those values by using caret preProcess functionality.

# Build the caret function to preprocess the Chemical data and impute missing values.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(chemX_train_nz, method = c("center", "scale", "knnImpute"))

# Becomes the source data for the model building
chemX_train_pp = predict( preProcFunc,  chemX_train_nz )

# Becomes the final version of test data for prediction
chemX_test_pp  = predict( preProcFunc,  chemX_test_nz )

Next, we visualize the histogram and density of all the scaled, imputed predictors constructed above using caret. The plot below shows that, after imputation and removal of zeros, no predictor has an odd density or an extremely long tail; either of those patterns would indicate outlier problems.

Next, we compare the density plots of each of the in-scope predictors between the training and test datasets. We conclude that both data sets are similar. Most importantly, the test data was not used to adjust or impute the training data.

PLS Model Tuning

We use a partial least squares model (PLS) for prediction of the Yield response. We use standard training arguments shown below to fit the model. We use repeated cross validation with 9 repeats of a 10-fold cross validation process on the training data. Lastly, we use Rsquared as the performance metric and the oneSE rule to choose the number of latent variables (a.k.a. components) to use in prediction.

repeats = 9

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = repeats,  selectionFunction = "oneSE")

set.seed(102030)

plsTune <- train(x = chemX_train_pp, y = chemY_train,
                 method = "pls",
                 tuneGrid = expand.grid(ncomp = 1:20),
                 trControl = ctrl ,
                 preProcess = c("center", "scale") ,
                 metric = "Rsquared"
                 )
plsTune
## Partial Least Squares 
## 
## 132 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold, repeated 9 times) 
## Summary of sample sizes: 119, 118, 120, 120, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     1.351048  0.4731740  1.1140210
##    2     1.260863  0.5272981  1.0203550
##    3     1.203779  0.5660351  0.9880331
##    4     1.226812  0.5616346  0.9927586
##    5     1.233899  0.5598473  1.0068803
##    6     1.239375  0.5624817  1.0190834
##    7     1.238962  0.5666788  1.0233615
##    8     1.244006  0.5677683  1.0267789
##    9     1.280803  0.5562125  1.0546023
##   10     1.304934  0.5486787  1.0802050
##   11     1.330050  0.5392558  1.0934149
##   12     1.364049  0.5257729  1.1144246
##   13     1.384194  0.5232286  1.1247945
##   14     1.401976  0.5203377  1.1320915
##   15     1.414448  0.5147468  1.1385261
##   16     1.420774  0.5091043  1.1428049
##   17     1.431185  0.5058737  1.1497164
##   18     1.436509  0.5066840  1.1504161
##   19     1.441805  0.5075062  1.1480669
##   20     1.454394  0.5039491  1.1523342
## 
## Rsquared was used to select the optimal model using  the one SE rule.
## The final value used for the model was ncomp = 3.

The plots above show the RMSE and \(R^2\) profiles of the model plotted against the number of latent variables. The RMSE profile reaches its minimum at ncomp equal to 3, and the \(R^2\) profile has a local maximum there. Thus, our visual study of the RMSE and \(R^2\) profiles is consistent with the oneSE rule.

Therefore, we decide to use a PLS model with 3 latent variables.

Response 6.3d

Now we evaluate the model performance on the test data set.

pred_chemY = predict(plsTune, newdata = chemX_test_pp)

(testRMSE_chem = caret::RMSE(pred_chemY, chemY_test ))
## [1] 1.22817
(testR2_chem   = caret::R2(  pred_chemY, chemY_test) )
## [1] 0.5872227

The test set RMSE of the model is 1.2281695 and the test set \(R^2\) is 0.5872227. We plot the diagnostic plots below.

The PLS model residuals on the test data look broadly consistent with the resampled training statistics. There are no strange or noticeable patterns in the residuals plot or the observed-versus-predicted plot. There is a slight possibility that the residual variance grows as the predicted value increases; however, this could be a small-sample issue.

Based on the comparison table below, we conclude that the test statistics are consistent with training statistics.

All Datasets using PLS Tuned Model

dataset RMSE RSquared NumObservations
Training Data 1.20 0.57 132
Test Data 1.23 0.59 44

Response 6.3e

We consider the variable importance of the top 10 variables below.

We see that M32, M13 and M36 are the top three variables. Manufacturing predictors dominate the list; the first biological predictor, B02, enters the variable importance scores at rank six. Therefore, the manufacturing process predictors, which unlike the biological predictors can be adjusted, carry most of the predictive information.
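The same information can be pulled as a table instead of a plot (a sketch, using the varImp object computed in the appendix code):

# List the ten largest PLS importance scores.
imp <- varImp(plsTune, scale = FALSE)$importance
imp$predictor <- rownames(imp)
head(imp[order(-imp$Overall), c("predictor", "Overall")], 10)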

Response 6.3f

The featurePlot above shows the sensitivity of yield to each of the top 4 predictors. The green line shows the loess curve fitted to the data. Based on the slope of each fitted curve, we conclude that yield responds systematically to these top process predictors.

We therefore advise, based on the model, that yield can be controlled through changes to the manufacturing process settings, guided by these estimated relationships.

Code

We summarize all the R code used in this project in this appendix for ease of reading.

knitr::opts_chunk$set(echo = FALSE)

library(knitr)
library(tidyverse)
library(kableExtra)
library(cowplot)
library(skimr)
library(caret)
library(AppliedPredictiveModeling)
library(pls)
library(elasticnet)
library(glmnet)

runProblem1 = TRUE
runModels1  = TRUE
runProblem2 = TRUE

> library(AppliedPredictiveModeling)
> data(permeability)
library(AppliedPredictiveModeling)
data("permeability")
print(paste0( "Dimensions of fingerprints: rows: ", nrow(fingerprints), " columns: ", ncol(fingerprints) ) )
print(paste0( "Dimensions of permability: rows: ", nrow(permeability), " columns: ", ncol(permeability) ) )

sk_fingerprints = skim(fingerprints)

nzv_vars = nearZeroVar(fingerprints, names = TRUE) 
# Keep the predictors that pass the near-zero-variance filter ("bv" = big-variance set)
fingerprints_bv = fingerprints[, -nearZeroVar(fingerprints) ]

sk_fingerprints %>% filter( skim_variable %in% nzv_vars[1:5] )

print(paste0( "Dimensions of fingerprint predictors retained: rows: ", nrow(fingerprints_bv), " columns: ", ncol(fingerprints_bv) ) )
predictors_removed = ncol(fingerprints) - ncol(fingerprints_bv)
as_tibble(permeability) %>% ggplot(aes(x=permeability)) + geom_histogram() + ggtitle("Distribution of Permeability on entire dataset")

library(fpp3)
perm_ts = as_tibble(permeability) %>% mutate( id = row_number()) %>% as_tsibble(index = id )

(lambda <- perm_ts %>% features( permeability , features = guerrero) %>% pull(lambda_guerrero) )

perm_ts = perm_ts %>% mutate( bc_perm = box_cox(permeability, lambda), log_perm = log(permeability) )

perm_ts %>% ggplot() + 
  geom_density(aes(log_perm), alpha = 0.5 , fill = "green") +
  geom_density(aes(bc_perm), alpha = 0.5, fill = "blue")

# Box-Cox values of permeability
bcperm = perm_ts$bc_perm

#  When using caret::createDataPartition with a continuous response,
#  the default behavior is to take a stratified random sample within
#  5 groups defined by quintiles of the variable permeability.
#
set.seed(10023)
train_indices = createDataPartition(bcperm , times = 1, list = FALSE, p = 0.8 )

summary( bcperm[train_indices])
length(bcperm[train_indices])

length(bcperm[-train_indices])
summary( bcperm[-train_indices])
# perm_df defines the permeability dataframe joined to the reduced set of "big-variance" predictors
perm_df = data.frame( cbind(bcperm, fingerprints_bv ) )

perm_train = perm_df[train_indices,]
perm_trainY = perm_train[,1]
perm_trainX = perm_train[,2:ncol(perm_train)]

perm_test  = perm_df[ -train_indices,]
perm_testY = perm_test[,1]
perm_testX = perm_test[,2:ncol(perm_test)]

repeats = 9
#repeats = 1


ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = repeats,  selectionFunction = "oneSE")

set.seed(102030)

plsTune <- train(x = perm_trainX, y = perm_trainY,
                 method = "pls",
                 tuneGrid = expand.grid(ncomp = 1:20),
                 trControl = ctrl ,
                 preProcess = c("center", "scale")
                 )
plsTune

plsResamples <- plsTune$results
plsResamples$Model <- "PLS"

plsPlotData <- plsResamples

xyplot(RMSE ~ ncomp,
       data = plsPlotData,
       #aspect = 1,
       xlab = "# Components",
       ylab = "RMSE (Cross-Validation)",
       auto.key = list(columns = 2),
       groups = Model,
       type = c("o") )


xyplot(Rsquared ~ ncomp,
       data = plsPlotData,
       #aspect = 1,
       xlab = "# Components",
       ylab = "R2 (Cross-Validation)",
       auto.key = list(columns = 2),
       groups = Model,
       type = c("o"))


pred_perm = predict(plsTune, newdata = perm_testX)

(testRMSE = caret::RMSE(pred_perm, perm_testY ))
(testR2 = caret::R2(pred_perm, perm_testY) )

xyplot( perm_testY ~ pred_perm, type = c("p", "g"), 
        main = "PLS Model Prediction vs. Observation of Test Data",
        xlab = "Predicted", 
        ylab = "Observed", panel = function(...){ 
  panel.abline(a = 0, b = 1 , lty = 2)
  panel.xyplot(...)
  })


xyplot( perm_testY - pred_perm ~ pred_perm, type = c("p", "g"), 
        main = "PLS Model Prediction vs. Residual of Test Data",
        xlab = "Predicted", ylab = "Residual", panel = function(...){ 
  panel.abline(a = 0, b = 0 , lty = 2)
  panel.xyplot(...)
  })


set.seed(102030)

pcrTune <- train(x = perm_trainX, y = perm_trainY,
                 method = "pcr",
                 tuneGrid = expand.grid(ncomp = 1:20),
                 trControl = ctrl ,
                 preProcess = c("center", "scale")
                 )
pcrTune


pcrResamples <- pcrTune$results
pcrResamples$Model <- "PCR"

jointPlotData <- rbind( pcrResamples, plsResamples )

xyplot(RMSE ~ ncomp,
       data = jointPlotData,
       #aspect = 1,
       main = "PCR and PLS RMSE Results on Test Data" ,
       xlab = "# Components",
       ylab = "RMSE (Cross-Validation)",
       auto.key = list(columns = 2),
       groups = Model,
       type = c("o", "g"))



xyplot(Rsquared ~ ncomp,
       data = jointPlotData,
      main = "PCR and PLS R2 Results on Test Data" ,
       #aspect = 1,
       xlab = "# Components",
       ylab = "R2 (Cross-Validation)",
       auto.key = list(columns = 2),
       groups = Model,
       type = c("o"))


pred_perm_pcr = predict(pcrTune, newdata = perm_testX)

(testRMSE_pcr = caret::RMSE(pred_perm_pcr, perm_testY ))
(testR2_pcr = caret::R2(pred_perm_pcr, perm_testY) )

xyplot( perm_testY ~ pred_perm_pcr, type = c("p", "g"), 
        xlab = "Predicted", ylab = "Observed", 
        main = "Observed vs. Predicted Values for PCR Model" ,
        panel = function(...){ 
          panel.abline(a = 0, b = 1 , lty = 2)
          panel.xyplot(...)
  })
xyplot( perm_testY - pred_perm_pcr ~ pred_perm, 
        type = c("p", "g"), 
        xlab = "Predicted", 
        ylab = "Residual", 
        main = "Residual vs. Prediction for PCR" ,
        panel = function(...){ 
  panel.abline(a = 0, b = 0 , lty = 1 )
  panel.xyplot(...)
  })



set.seed(102030)

ctrl <- trainControl(method = "cv",  selectionFunction = "best")

enetGrid <- expand.grid(lambda = c(0, 0.05, .1 , 0.5,  0.8 ),    
                        fraction = seq(.01, 1, length = 10))

enetTune <- train(x = perm_trainX, y = perm_trainY ,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = ctrl ,
                  preProcess = c("center", "scale")
                  )

enetTune

pred_enet = predict(enetTune, newdata = perm_testX)

(testRMSE_enet = caret::RMSE(pred_enet, perm_testY ))
(testR2_enet = caret::R2(pred_enet, perm_testY) )

xyplot( perm_testY ~ pred_enet, type = c("p", "g"), 
        xlab = "Predicted", ylab = "Observed", 
        main = "Observed vs. Predicted Values for Elastic Net Model" ,
        panel = function(...){ 
          panel.abline(a = 0, b = 1 , lty = 2)
          panel.xyplot(...)
  })
xyplot( perm_testY - pred_enet ~ pred_enet, 
        type = c("p", "g"), 
        xlab = "Predicted", 
        ylab = "Residual", 
        main = "Residual vs. Prediction for Elastic Net" ,
        panel = function(...){ 
  panel.abline(a = 0, b = 0 , lty = 1 )
  panel.xyplot(...)
  })


coefs = predict.enet(enetTune$finalModel, s=enetTune$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients 

nonzero_coefs = coefs[ abs(coefs) > 0.00001 ]
var_names = names(nonzero_coefs)

# unname() prevents the variable names from reappearing as row names.
enet_coefs = data.frame( names = var_names, coefs = unname(nonzero_coefs) )

dim(enet_coefs)

enet_coefs %>% ggplot(aes(x = coefs, y = names)) + geom_col() + ggtitle("Elastic Net Model Standardized Coefficients")

enet_coefs %>% arrange( desc(coefs)) %>%  
  kable(caption = "Elastic Net Non-zero Coefficients", digits = 3 ) %>% 
  kable_styling(bootstrap_options = c("hover", "striped" ), position = "left" )


test_results = data.frame(model = c("Partial Least Squares", "Principal Components Regression", "Elastic Net") ,
                          RMSE  = c( testRMSE, testRMSE_pcr, testRMSE_enet ) ,
                          RSquared = c( testR2, testR2_pcr, testR2_enet ) ,
                          NumVars = c(  8 ,  11, 45 )
                          )

test_results %>% kable(caption = "All Models on Test DataSet", digits = 2 ) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

data("ChemicalManufacturingProcess")

chemX = ChemicalManufacturingProcess[2:58]
chemY = ChemicalManufacturingProcess[1]
dim(chemX)
dim(chemY)

chemX %>% rename_at(vars(starts_with("ManufacturingProcess") ), 
                    ~ str_replace(., "ManufacturingProcess", "M")) %>% 
  rename_at(vars(starts_with("Biological")), 
            ~ str_replace( ., "BiologicalMaterial", "B")) -> chemX2

skim(chemY)

chemY %>% ggplot(aes(x=Yield)) + geom_histogram(bins = 20, color= "white", fill = "red") + 
  labs(title = "Yield distribution")

set.seed(19372)
train_indices = createDataPartition(chemY$Yield , times = 1, list = FALSE, p = 0.75 )
chemX_test   = chemX2[-train_indices , ]
chemX_train  = chemX2[ train_indices , ]
chemY_test   = chemY[-train_indices , ]
chemY_train  = chemY[ train_indices , ]


paste0("Training data removed variables:", (nzv_chemtrain = nearZeroVar(chemX_train, names = TRUE) ) )

# Keep the predictors that pass the near-zero-variance filter ("bv" = big-variance set)
chemX_train_bv = chemX_train[, -nearZeroVar(chemX_train) ]
paste0("Dimension of train data before and after near zero variance check: ")
dim(chemX_train)
dim(chemX_train_bv)


paste0("Test data removed variables:", (nzv_chemtest = nearZeroVar(chemX_test, names = TRUE) ) )

# Keep the predictors that pass the near-zero-variance filter ("bv" = big-variance set)
chemX_test_bv = chemX_test[, -nearZeroVar(chemX_test) ]
paste0("Dimension of test data before and after near zero variance check: ")
dim(chemX_test)
dim(chemX_test_bv)



sk_train_bv = skim(chemX_train_bv)

sk_train_bv %>% 
  mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
          zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
          zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
          )  %>%
  filter( zscore_extreme > 4 , numeric.p0 == 0 ) %>%
  arrange( desc(zscore_extreme)) %>% yank("numeric")

# Replace the outlier zeros with the median value for the variables below.
# Note: for columns that also contain NAs, median() without na.rm = TRUE
# returns NA, so those zeros become NA and are later filled by kNN imputation;
# either way the zero outlier is removed.

chemX_train_bv %>% mutate( M18 = ifelse(M18==0, median(M18), M18 ) ,
                           M20 = ifelse(M20==0, median(M20), M20 ) ,
                           M16 = ifelse(M16==0, median(M16), M16 ) ,
                           M26 = ifelse(M26==0, median(M26), M26 ) ,
                           M25 = ifelse(M25==0, median(M25), M25 ) ,
                           M27 = ifelse(M27==0, median(M27), M27 ) ,
                           M31 = ifelse(M31==0, median(M31), M31 ) ,
                           M29 = ifelse(M29==0, median(M29), M29 ) ,
                           M43 = ifelse(M43==0, median(M43), M43 ) ,
                           M30 = ifelse(M30==0, median(M30), M30 ) ,
                           M01 = ifelse(M01==0, median(M01), M01 ) ,
                           M42 = ifelse(M42==0, median(M42), M42 ) ,
                           M44 = ifelse(M44==0, median(M44), M44 ) ,
                           M45 = ifelse(M45==0, median(M45), M45 ) ,
                           M39 = ifelse(M39==0, median(M39), M39 ) ,                           
                           M38 = ifelse(M38==0, median(M38), M38 ) ) -> chemX_train_nz 
                           

sk_train_nz = skim(chemX_train_nz)

sk_train_nz %>% 
  mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
          zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
          zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
          )  %>%
  filter( zscore_extreme > 4 , numeric.p0 == 0 ) %>%
  arrange( desc(zscore_extreme))

sk_train_nz %>% yank("numeric") %>% filter(skim_variable== 'M18')


sk_test_bv = skim(chemX_test_bv)

sk_test_bv %>% 
  mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
          zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
          zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
          )  %>%
  filter( zscore_extreme > 4 , numeric.p0 == 0 ) %>%
  arrange( desc(zscore_extreme)) %>% yank("numeric")

# Replace the outlier zeros with the median value for the variables below.

chemX_test_bv %>% mutate( M39 = ifelse(M39==0, median(M39), M39 ) ,
                          M42 = ifelse(M42==0, median(M42), M42 ) ,
                          M44 = ifelse(M44==0, median(M44), M44 ) ,
                          M45 = ifelse(M45==0, median(M45), M45 ) ,
                          M01 = ifelse(M01==0, median(M01), M01 ) ) -> chemX_test_nz 
                           

sk_test_nz = skim(chemX_test_nz)

sk_test_nz %>% 
  mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
          zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
          zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
          )  %>%
  filter( zscore_extreme > 4 , numeric.p0 == 0 ) %>%
  arrange( desc(zscore_extreme)) %>% yank("numeric")

skim(chemX_test_nz) %>% filter( n_missing > 0 ) %>% arrange( desc( n_missing) ) %>% yank("numeric")
chemX_test_nz %>%                           # The dataframe
  filter_all( any_vars(is.na(.))) %>%    # include any rows where any column is NA
  select_if( ~ any(is.na(.))) -> missing_test_nz  
       # only include those columns where some row has NA in that column

missing_test_nz %>% kable(caption = "missing test data - rows") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")                                                                                             
missing_test_nz %>%
  mutate( num_na = rowSums( is.na(.) ) ) -> missing_summary_test_nz

num_na_test_cells = sum(missing_summary_test_nz$num_na)

missing_summary_test_nz %>%  # add a count of columns with NA values
  select(num_na ) %>%        # pull out the count
  arrange( desc(num_na)) %>% # sort by the count descending
  bind_rows(., tibble( num_na = num_na_test_cells ) ) %>%
  kable(caption = "Test Data with missing predictor values") %>%
  kable_styling(bootstrap_options = c("hover", "striped"),
                position = "left"
                )

pct_missing_test = num_na_test_cells/ ( nrow(chemX_test_nz) * ncol(chemX_test_nz)) * 100
skim(chemX_train_nz) %>% filter( n_missing > 0 ) %>% arrange( desc( n_missing) ) %>% yank("numeric")
chemX_train_nz %>%                           # The dataframe
  filter_all( any_vars(is.na(.))) %>%    # include any rows where any column is NA
  select_if( ~ any(is.na(.))) -> missing_train_nz     # only include those columns where some row has NA in that column

missing_train_nz %>% kable(caption = "missing training data - rows") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
missing_train_nz %>%
  mutate( num_na = rowSums( is.na(.) ) ) -> missing_summary_train_nz

num_na_cells_train = sum(missing_summary_train_nz$num_na)

missing_summary_train_nz %>%  # add a count of columns with NA values
  select(num_na ) %>%        # pull out the count
  arrange( desc(num_na)) %>% # sort by the count descending
  bind_rows(., tibble( num_na = num_na_cells_train ) ) %>%
  kable(caption = "Training Data with missing predictor values") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")


pct_missing_train = num_na_cells_train/ ( nrow(chemX_train_nz) * ncol(chemX_train_nz)) * 100

# Build the caret function to preprocess the Chemical data and impute missing values.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(chemX_train_nz, method = c("center", "scale", "knnImpute"))

# Becomes the source data for the model building
chemX_train_pp = predict( preProcFunc,  chemX_train_nz )

# Becomes the final version of test data for prediction
chemX_test_pp  = predict( preProcFunc,  chemX_test_nz )


chemX_train_pp %>%
  mutate(id = row_number()) %>% 
  pivot_longer( !id, names_to = "X", values_to = "Val" ) -> piv_chemX_train
plot1 = piv_chemX_train %>% filter(!is.na(Val)) %>% ggplot(aes(x=Val)) + 
  facet_wrap( ~X ,  scales = "free" ,  ncol = 4) + 
  geom_histogram(aes(y = ..density..), binwidth = .25,  fill = "red", alpha = 0.5) +
  geom_density( fill="blue", alpha = 0.3) +
  labs(title = "Training Data: Histogram/Density of Imputed Standardized Predictors")

plot1

chemX_test_pp %>% 
  mutate(id = row_number()) %>% 
  pivot_longer( !id, names_to = "X", values_to = "Val" ) -> piv_chemX_test
plot2 = piv_chemX_test %>% filter(!is.na(Val)) %>% ggplot(aes(x=Val)) + 
  facet_wrap( ~X ,  scales = "free" ,  ncol = 4) + 
  geom_histogram(aes(y = ..density..), binwidth = .25,  fill = "red", alpha = 0.5) +
  geom_density( fill="blue", alpha = 0.3) +
  labs(title = "Test Data: Histogram and Density of Standardized Predictors")

plot2

piv_chemX = rbind(piv_chemX_test %>% mutate(Set = "Test"), 
                  piv_chemX_train %>% mutate(Set = "Train") )


plot3 = piv_chemX %>% ggplot(aes(x=Val, fill = Set )) + 
  facet_wrap( ~X ,  scales = "free" ,  ncol = 4) + 
  geom_density( alpha = 0.5) +
  labs(title = "Data: Density of Imputed Training and Test Standardized Predictors")

plot3
repeats = 9

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = repeats,  selectionFunction = "oneSE")

set.seed(102030)

plsTune <- train(x = chemX_train_pp, y = chemY_train,
                 method = "pls",
                 tuneGrid = expand.grid(ncomp = 1:20),
                 trControl = ctrl ,
                 preProcess = c("center", "scale") ,
                 metric = "Rsquared"
                 )
plsTune

plsResamples <- plsTune$results
plsResamples$Model <- "PLS"

plsPlotData <- plsResamples

xyplot(RMSE ~ ncomp,
       data = plsPlotData,
       #aspect = 1,
       xlab = "# Components",
       ylab = "RMSE (Cross-Validation)",
       auto.key = list(columns = 2),
       groups = Model,
       main = "Chemical Manufacturing Process" ,
       type = c("o") )


xyplot(Rsquared ~ ncomp,
       data = plsPlotData,
       #aspect = 1,
       xlab = "# Components",
       ylab = "R2 (Cross-Validation)",
       auto.key = list(columns = 2),
       groups = Model,
       main = "Chemical Manufacturing Process",
       type = c("o"))


pred_chemY = predict(plsTune, newdata = chemX_test_pp)

(testRMSE_chem = caret::RMSE(pred_chemY, chemY_test ))
(testR2_chem   = caret::R2(  pred_chemY, chemY_test) )

xyplot( chemY_test ~ pred_chemY, type = c("p", "g"), 
        main = "PLS Model Prediction vs. Observation of Test Data",
        xlab = "Predicted", 
        ylab = "Observed", panel = function(...){ 
  panel.abline(a = 0, b = 1 , lty = 2)
  panel.xyplot(...)
  })


xyplot( chemY_test - pred_chemY ~ pred_chemY, type = c("p", "g"), 
        main = "PLS Model Prediction vs. Residual of Test Data",
        xlab = "Predicted", ylab = "Residual", panel = function(...){ 
  panel.abline(a = 0, b = 0 , lty = 2)
  panel.xyplot(...)
  })



test_results = data.frame(dataset = c("Training Data", "Test Data") ,
                          RMSE  = c(  plsTune$results$RMSE[ plsTune$bestTune$ncomp] , testRMSE_chem ) ,
                          RSquared = c( plsTune$results$Rsquared[ plsTune$bestTune$ncomp],  testR2_chem  ) ,
                          NumObservations = c( nrow( chemX_train_pp) , nrow( chemX_test_pp ) )
                          )

test_results %>% kable(caption = "All Datasets using PLS Tuned Model", digits = 2 ) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

importanceScores =  varImp(plsTune, scale = FALSE)


plot(importanceScores , top=10 )

vars_to_plot = c("M32", "M13", "M36", "M09")

theme1 <- trellis.par.get()
theme1$plot.symbol$col = rgb(.2, .2, .2, .4)
theme1$plot.symbol$pch = 16
theme1$plot.line$col = "green"
theme1$plot.line$lwd = 2
trellis.par.set(theme1)

caret::featurePlot(x = chemX_train_pp[, vars_to_plot], 
                   y = chemY_train , plot = "scatter" ,
                   type = c("p", "smooth") ,
                   span = 0.5, 
                   layout = c(4,1))