Week 10 - Linear Regression - Homework

C. Rosemond 11.01.20

library(tidyverse)
library(caret)
library(AppliedPredictiveModeling)
library(corrplot)
library(e1071)
library(impute)
library(pls)


6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

a. Start R and use these commands to load the data:

data(permeability)

The matrix fingerprints contains the 1107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.


b. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

nzv <- nearZeroVar(fingerprints)
fp <- fingerprints[,-nzv]

719 predictors show near-zero variance. Responding to the prompt, removing those predictors leaves 388 predictors for modeling.
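
As a rough illustration of what the filter measures, the sketch below recomputes the two statistics that nearZeroVar() reports (frequency ratio and percent unique) in base R on toy fingerprints. The helper near_zero_var_stats and the toy vectors are hypothetical, and the cutoffs (95/5 and 10) are assumed to match caret's documented defaults; this is not caret's implementation.

```r
# Hypothetical helper: mimics the two statistics nearZeroVar() reports.
# Cutoffs are assumed to mirror caret's defaults (freqCut = 95/5, uniqueCut = 10).
near_zero_var_stats <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common value's count
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Share of distinct values relative to the number of observations
  percent_unique <- 100 * length(counts) / length(x)
  list(
    freq_ratio = freq_ratio,
    percent_unique = percent_unique,
    nzv = freq_ratio > freq_cut && percent_unique <= unique_cut
  )
}

# Toy sparse fingerprint: 3 of 165 compounds contain the substructure
sparse_bit <- c(rep(0, 162), rep(1, 3))
# Toy common fingerprint: 80 of 165 compounds contain the substructure
common_bit <- c(rep(0, 85), rep(1, 80))

sparse_stats <- near_zero_var_stats(sparse_bit)  # frequency ratio 162 / 3 = 54
common_stats <- near_zero_var_stats(common_bit)  # frequency ratio 85 / 80

print(sparse_stats$nzv)
print(common_stats$nzv)
```

Under these assumed cutoffs, only the sparse toy fingerprint would be flagged.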


c. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R-squared?

set.seed(624)
fp <- cbind(permeability, fp)
index <- createDataPartition(fp[,1], p = .80, list = FALSE)
fp_train <- as.data.frame(fp[index,]) # 133 observations
fp_test <- as.data.frame(fp[-index,]) # 32 observations

Using an 80/20 split results in a training set of 133 observations and a test set of 32 observations.


head(sort(abs(sapply(fp_train[-1], skewness)), decreasing = TRUE), 10)
##     X345     X732     X733     X780     X782     X792     X793     X800     X801     X806 
## 3.961832 3.961832 3.961832 3.961832 3.961832 3.961832 3.961832 3.961832 3.961832 3.961832

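Numerous predictors show relatively high skewness, so a Box-Cox transformation is included in pre-processing. Note, however, that Box-Cox is defined only for strictly positive values, so it may leave these 0/1 indicators unchanged.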


corr <- cor(fp_train[-1])
hicorr <- findCorrelation(corr)
length(hicorr)
## [1] 282

Given the number of predictors, visualizing correlations among them is difficult. findCorrelation, at its default pairwise correlation cutoff of 0.90, flags 282 predictors for removal. Removing these highly correlated predictors in pre-processing, and losing information, is not ideal. Regardless, they will be removed for this exercise.
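
The count returned by findCorrelation is the number of columns suggested for removal, which can be smaller than the number of predictors involved in a high correlation. The base R sketch below makes that distinction on toy data; the data frame toy and its columns are hypothetical, and dropping "b" by hand stands in for the choice that findCorrelation would make.

```r
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.05),  # near-duplicate of a
  c = rnorm(n)                  # unrelated predictor
)

# Absolute pairwise correlations, ignoring the diagonal
abs_corr <- abs(cor(toy))
diag(abs_corr) <- 0

# Predictors that take part in at least one correlation above 0.90
involved <- colnames(abs_corr)[apply(abs_corr, 2, max) > 0.90]

# Dropping only one member of the pair is enough to clear the cutoff
kept <- toy[, setdiff(colnames(toy), "b")]
kept_corr <- abs(cor(kept))
max_remaining <- max(kept_corr[upper.tri(kept_corr)])

print(involved)
print(max_remaining < 0.90)
```

Two predictors are involved in the high correlation here, but only one needs to be removed.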


set.seed(624)
fp_train_slim <- fp_train %>% select(-permeability) %>% select(-all_of(hicorr))
fp_transform <- fp_train_slim %>% preProcess(method = c("BoxCox", "center", "scale")) %>% predict(fp_train_slim) %>% cbind(fp_train$permeability) %>% rename(permeability = "fp_train$permeability")
(fp_pls <- train(permeability ~ .,
            data = fp_transform,
            method = "pls",
            tuneLength = 20,
            trControl = trainControl(method = "cv", number = 10)
            ))
## Partial Least Squares 
## 
## 133 samples
## 106 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 121, 118, 119, 121, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.69401  0.3822378   9.028831
##    2     12.11650  0.4536446   9.020781
##    3     12.16977  0.4510712   9.355811
##    4     12.18054  0.4415965   9.286277
##    5     11.92587  0.4731016   8.991573
##    6     11.87701  0.4716946   9.082994
##    7     11.99605  0.4572040   9.146597
##    8     12.22095  0.4457056   9.383409
##    9     12.31013  0.4404575   9.338620
##   10     12.54428  0.4281621   9.514354
##   11     12.65192  0.4178258   9.544799
##   12     12.86063  0.3931739   9.589682
##   13     12.90922  0.3913631   9.595150
##   14     12.99622  0.3866327   9.653869
##   15     13.15415  0.3862983   9.923034
##   16     13.22796  0.3785622  10.058236
##   17     13.54645  0.3593305  10.280342
##   18     13.59821  0.3585050  10.281220
##   19     13.82503  0.3399958  10.403201
##   20     14.05068  0.3315941  10.600575
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.

The PLS model is tuned with tuneLength = 20, which evaluates 1 through 20 components, using 10-fold cross-validation for resampling.


plot(fp_pls, main = "Partial Least Squares - Training RMSE on # of components")

fp_pls$results %>% select(c(ncomp, RMSE, Rsquared)) %>% filter(ncomp == 5)
##   ncomp     RMSE  Rsquared
## 1     5 11.92587 0.4731016

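Per the plot and the resampling table, RMSE is minimized at six components (RMSE of approximately 11.88, R-squared of approximately 0.47), which is the value train() selects. The five-component model shown above is nearly equivalent, with RMSE and R-squared values of approximately 11.93 and 0.47, respectively.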
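
As a quick check on the selection, the sketch below re-enters the first eight resampled RMSE values from the table printed above and finds the minimum in base R. The data frame name pls_cv is hypothetical; the numbers are copied from the train() output.

```r
# Resampled RMSE for ncomp = 1..8, copied from the train() output above
pls_cv <- data.frame(
  ncomp = 1:8,
  RMSE = c(12.69401, 12.11650, 12.16977, 12.18054,
           11.92587, 11.87701, 11.99605, 12.22095)
)

# Row with the smallest resampled RMSE
best <- pls_cv[which.min(pls_cv$RMSE), ]

# How much worse the five-component model is than the six-component model
gap_5_vs_6 <- pls_cv$RMSE[5] - pls_cv$RMSE[6]

print(best)
print(gap_5_vs_6)
```

The gap between five and six components is small relative to the RMSE itself, so a more parsimonious choice may be defensible.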


d. Predict the response for the test set. What is the test set estimate of R-squared?

set.seed(624)
fp_test_slim <- fp_test %>% select(-permeability) %>% select(-all_of(hicorr))
fp_test_transform <- fp_test_slim %>% preProcess(method = c("BoxCox", "center", "scale")) %>% predict(fp_test_slim) %>% cbind(fp_test$permeability) %>% rename(permeability = "fp_test$permeability")
## Warning in preProcess.default(., method = c("BoxCox", "center", "scale")): These variables have zero variances: X561, X568, X595, X621
fp_pls_test <- predict(fp_pls, fp_test_transform)
RMSE(fp_pls_test, fp_test_transform$permeability)
## [1] 11.52554
caret::R2(fp_pls_test, fp_test_transform$permeability)
## [1] 0.4439708

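Four predictors (X561, X568, X595, X621) ended up with zero variance in the test set, which is not necessarily surprising given its small size and the sparsity of the data. The RMSE of the test set predictions is approximately 11.53, and the R-squared is approximately 0.44, slightly below the resampled estimate of about 0.47. Note that the pre-processing above is re-estimated on the test set; applying the transformation estimated on the training set is the more conventional choice and could shift these figures.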
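
A common alternative is to estimate centering and scaling on the training set only and reuse those estimates on the test set. The base R sketch below shows the pattern on toy matrices (train_x and test_x are hypothetical). With caret, the analogous step would presumably be to store the preProcess object fitted on fp_train_slim and call predict() with it on fp_test_slim.

```r
set.seed(624)
# Toy training and test predictors (hypothetical data)
train_x <- matrix(rnorm(40, mean = 5, sd = 2), ncol = 2)
test_x  <- matrix(rnorm(10, mean = 5, sd = 2), ncol = 2)

# Estimate the centering and scaling constants on the training set only
train_center <- colMeans(train_x)
train_scale  <- apply(train_x, 2, sd)

# Apply the same constants to both sets
train_z <- scale(train_x, center = train_center, scale = train_scale)
test_z  <- scale(test_x,  center = train_center, scale = train_scale)

# Training columns are exactly standardized; test columns generally are not
print(round(colMeans(train_z), 10))
print(round(colMeans(test_z), 3))
```

Reusing the training estimates keeps the test set on the scale the model was fitted on and avoids problems with test-set columns that happen to have zero variance.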


e. Try building other models discussed in this chapter. Do any have better predictive performance?

set.seed(624)
(fp_ridge <- train(permeability ~ .,
                  data = fp_transform,
                  method = "ridge",
                  metric = "Rsquared",
                  tuneGrid = data.frame(.lambda = seq(0, 1, .05)),
                  trControl = trainControl(method = "cv", number = 10)
                  ))
## Ridge Regression 
## 
## 133 samples
## 106 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 121, 118, 119, 121, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE          Rsquared   MAE         
##   0.00    8.746033e+18  0.1914868  2.425713e+18
##   0.05    1.216769e+01  0.4400932  9.078270e+00
##   0.10    1.195083e+01  0.4641163  8.932269e+00
##   0.15    1.192581e+01  0.4747767  8.950300e+00
##   0.20    1.196521e+01  0.4807897  9.007660e+00
##   0.25    1.203842e+01  0.4845063  9.077840e+00
##   0.30    1.213392e+01  0.4868907  9.149850e+00
##   0.35    1.224624e+01  0.4884270  9.231524e+00
##   0.40    1.237232e+01  0.4893901  9.318460e+00
##   0.45    1.251024e+01  0.4899496  9.406859e+00
##   0.50    1.265867e+01  0.4902163  9.498359e+00
##   0.55    1.281656e+01  0.4902660  9.595004e+00
##   0.60    1.298309e+01  0.4901525  9.700865e+00
##   0.65    1.315752e+01  0.4899148  9.812644e+00
##   0.70    1.333919e+01  0.4895820  9.941743e+00
##   0.75    1.352748e+01  0.4891760  1.008697e+01
##   0.80    1.372183e+01  0.4887136  1.024224e+01
##   0.85    1.392169e+01  0.4882078  1.039945e+01
##   0.90    1.412655e+01  0.4876689  1.055967e+01
##   0.95    1.433594e+01  0.4871049  1.071868e+01
##   1.00    1.454938e+01  0.4865223  1.087816e+01
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 0.55.
fp_ridge_test <- predict(fp_ridge, fp_test_transform)
RMSE(fp_ridge_test, fp_test_transform$permeability)
## [1] 11.62925
caret::R2(fp_ridge_test, fp_test_transform$permeability)
## [1] 0.5237753

Predicting on the test set, a ridge regression model (\(\lambda\) = 0.55, selected on resampled R-squared) returns an RMSE of approximately 11.63 and an R-squared value of approximately 0.52. The R-squared is higher than its counterpart for the PLS model (0.44), while the RMSE is marginally higher (11.63 versus 11.53).


set.seed(624)
(fp_enet <- train(permeability ~ .,
                 data = fp_transform,
                 method = "enet",
                 metric = 'Rsquared',
                 tuneGrid= expand.grid(.fraction = seq(0, 1, 0.1), .lambda = seq(0, 0.5, 0.1)),
                 trControl=trainControl(method='cv', number = 10)
                 ))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.
## Warning in train.default(x, y, weights = w, ...): missing values found in aggregated results
## Elasticnet 
## 
## 133 samples
## 106 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 121, 118, 119, 121, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE          Rsquared   MAE         
##   0.0     0.0       1.547007e+01        NaN  1.233021e+01
##   0.0     0.1       8.746033e+17  0.1919134  2.425713e+17
##   0.0     0.2       1.749207e+18  0.2197221  4.851426e+17
##   0.0     0.3       2.623810e+18  0.2140722  7.277140e+17
##   0.0     0.4       3.498413e+18  0.2042531  9.702853e+17
##   0.0     0.5       4.373017e+18  0.1983343  1.212857e+18
##   0.0     0.6       5.247620e+18  0.2014248  1.455428e+18
##   0.0     0.7       6.122223e+18  0.1951127  1.697999e+18
##   0.0     0.8       6.996827e+18  0.1931870  1.940571e+18
##   0.0     0.9       7.871430e+18  0.1921536  2.183142e+18
##   0.0     1.0       8.746033e+18  0.1914868  2.425713e+18
##   0.1     0.0       1.547007e+01        NaN  1.233021e+01
##   0.1     0.1       1.240978e+01  0.4127940  9.078734e+00
##   0.1     0.2       1.192706e+01  0.4383011  8.665132e+00
##   0.1     0.3       1.203048e+01  0.4395095  8.967555e+00
##   0.1     0.4       1.186121e+01  0.4504098  8.938470e+00
##   0.1     0.5       1.178722e+01  0.4568207  8.903919e+00
##   0.1     0.6       1.177338e+01  0.4620791  8.853870e+00
##   0.1     0.7       1.177849e+01  0.4663905  8.820909e+00
##   0.1     0.8       1.180348e+01  0.4682777  8.839128e+00
##   0.1     0.9       1.186852e+01  0.4666818  8.887351e+00
##   0.1     1.0       1.195083e+01  0.4641163  8.932269e+00
##   0.2     0.0       1.547007e+01        NaN  1.233021e+01
##   0.2     0.1       1.260630e+01  0.4007224  9.382142e+00
##   0.2     0.2       1.183872e+01  0.4424599  8.513622e+00
##   0.2     0.3       1.198894e+01  0.4437912  8.902009e+00
##   0.2     0.4       1.188883e+01  0.4559725  8.904585e+00
##   0.2     0.5       1.181665e+01  0.4637756  8.890707e+00
##   0.2     0.6       1.181158e+01  0.4689208  8.895337e+00
##   0.2     0.7       1.184408e+01  0.4727918  8.921803e+00
##   0.2     0.8       1.189632e+01  0.4759050  8.967041e+00
##   0.2     0.9       1.191887e+01  0.4796978  8.977207e+00
##   0.2     1.0       1.196521e+01  0.4807897  9.007660e+00
##   0.3     0.0       1.547007e+01        NaN  1.233021e+01
##   0.3     0.1       1.270015e+01  0.3956592  9.531132e+00
##   0.3     0.2       1.188093e+01  0.4396724  8.494756e+00
##   0.3     0.3       1.190840e+01  0.4503394  8.781468e+00
##   0.3     0.4       1.196380e+01  0.4569393  8.930444e+00
##   0.3     0.5       1.190049e+01  0.4670544  8.898140e+00
##   0.3     0.6       1.190328e+01  0.4733318  8.914333e+00
##   0.3     0.7       1.196626e+01  0.4764603  8.984166e+00
##   0.3     0.8       1.203235e+01  0.4806459  9.055404e+00
##   0.3     0.9       1.208809e+01  0.4846014  9.108575e+00
##   0.3     1.0       1.213392e+01  0.4868907  9.149850e+00
##   0.4     0.0       1.547007e+01        NaN  1.233021e+01
##   0.4     0.1       1.275722e+01  0.3926075  9.607002e+00
##   0.4     0.2       1.193008e+01  0.4363910  8.482466e+00
##   0.4     0.3       1.188630e+01  0.4534966  8.709890e+00
##   0.4     0.4       1.201857e+01  0.4595908  8.963715e+00
##   0.4     0.5       1.201309e+01  0.4688136  8.933680e+00
##   0.4     0.6       1.204056e+01  0.4758417  8.960643e+00
##   0.4     0.7       1.212545e+01  0.4789854  9.046077e+00
##   0.4     0.8       1.220231e+01  0.4839448  9.127177e+00
##   0.4     0.9       1.229657e+01  0.4872646  9.235356e+00
##   0.4     1.0       1.237232e+01  0.4893901  9.318460e+00
##   0.5     0.0       1.547007e+01        NaN  1.233021e+01
##   0.5     0.1       1.278640e+01  0.3909361  9.642823e+00
##   0.5     0.2       1.197911e+01  0.4329004  8.480990e+00
##   0.5     0.3       1.190175e+01  0.4555702  8.662550e+00
##   0.5     0.4       1.208876e+01  0.4621049  8.993517e+00
##   0.5     0.5       1.216198e+01  0.4691318  9.014131e+00
##   0.5     0.6       1.222057e+01  0.4765356  9.027456e+00
##   0.5     0.7       1.231767e+01  0.4805636  9.115088e+00
##   0.5     0.8       1.242331e+01  0.4851973  9.240936e+00
##   0.5     0.9       1.255021e+01  0.4881150  9.388832e+00
##   0.5     1.0       1.265867e+01  0.4902163  9.498359e+00
## 
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 1 and lambda = 0.5.
fp_enet_test <- predict(fp_enet, fp_test_transform)
RMSE(fp_enet_test, fp_test_transform$permeability)
## [1] 11.57336
caret::R2(fp_enet_test, fp_test_transform$permeability)
## [1] 0.5176123

Predicting on the test set, an elastic net model (fraction = 1 and \(\lambda\) = 0.5) returns an RMSE of approximately 11.57 and an R-squared value of approximately 0.52. The latter is higher than its counterpart for the PLS model and roughly the same as its counterpart for the ridge model; at fraction = 1 the resampled results match the ridge results for the same \(\lambda\). Note: train() warns about missing values in the resampled performance measures. These correspond to the fraction = 0 rows, where R-squared is NaN: with every coefficient shrunk to zero the predictions are constant, so their correlation with the observed values is undefined.
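
The NaN entries in the Rsquared column can be reproduced in base R. The sketch below assumes that R-squared is computed as a squared correlation and that fold-level values are averaged with missing values dropped; the vectors obs and constant_pred are hypothetical.

```r
set.seed(624)
obs <- rnorm(12)

# A model with every coefficient at zero predicts the same value for each case
constant_pred <- rep(mean(obs), length(obs))

# Correlation with a constant vector is undefined, so R warns and returns NA
fold_r2 <- suppressWarnings(cor(constant_pred, obs)^2)

# Averaging folds that are all NA, with NAs dropped, yields NaN
aggregated <- mean(c(fold_r2, fold_r2), na.rm = TRUE)

print(fold_r2)
print(aggregated)
```

RMSE and MAE remain defined for constant predictions, which is consistent with those columns being populated in the fraction = 0 rows.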


f. Would you recommend any of your models to replace the permeability laboratory experiment?

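No. None of the models performs particularly well by either RMSE or R-squared: test set R-squared values range from roughly 0.44 to 0.52, so about half of the variation in permeability remains unexplained.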



6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between the biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:


a. Start R and use these commands to load the data:

data("ChemicalManufacturingProcess")

The data set contains the 57 predictors (12 describing input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.


b. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

sum(is.na(ChemicalManufacturingProcess[-1]))
## [1] 106
sum(is.na(ChemicalManufacturingProcess[1])) # zero missings in response
## [1] 0

The predictor set contains 106 missing values, roughly 1% of its 10,032 cells (176 runs by 57 predictors). To confirm, the response variable contains zero missing values.


imputed <- impute.knn(as.matrix(ChemicalManufacturingProcess), rng.seed = 624)
cmp <- as.data.frame(imputed$data)
sum(is.na(cmp))
## [1] 0

Imputing values where none may exist can be problematic, particularly with limited domain expertise. For the purposes of this exercise, data are assumed to be missing at random, and K-nearest neighbors (KNN) imputation is used to estimate the values. The imputation process uses 10 neighbors, the impute.knn default.


c. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

set.seed(624)
index <- createDataPartition(cmp$Yield, p = .80, list = FALSE)
cmp_train <- cmp[index,] # 144 observations
cmp_test <- cmp[-index,] # 32 observations

An 80/20 split is used to create a training set of 144 runs and a test set of 32 runs.


head(sort(abs(sapply(cmp_train[-1], skewness)), decreasing = TRUE), 10)
## ManufacturingProcess18 ManufacturingProcess20 ManufacturingProcess26 ManufacturingProcess25 ManufacturingProcess27 ManufacturingProcess31 ManufacturingProcess29 ManufacturingProcess43 
##              11.519798              11.456952              11.430506              11.394196              11.310738              10.736764               9.302817               8.934634 
##   BiologicalMaterial07 ManufacturingProcess42 
##               8.221086               4.904776

Numerous predictors show high skewness, so a Box-Cox transformation seems appropriate.
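
To illustrate the effect of the transformation, the sketch below applies a Box-Cox transform to a right-skewed toy sample and compares skewness before and after. The helpers box_cox and skew and the lognormal sample are hypothetical; lambda is fixed at 0 (the log case) rather than estimated, whereas caret estimates lambda for each predictor.

```r
# Box-Cox transform for a fixed lambda; defined only for strictly positive x
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Sample skewness: third central moment divided by the cubed standard deviation
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(624)
x <- rlnorm(500)  # right-skewed toy predictor

skew_before <- skew(x)
skew_after  <- skew(box_cox(x, lambda = 0))

print(skew_before)
print(skew_after)
```

The positivity check matters in practice: predictors with zero or negative values cannot be transformed this way.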


ex <- nearZeroVar(cmp_train[-1], saveMetrics = TRUE)
ex %>% arrange(-freqRatio, percentUnique, -nzv) %>% head()
##                        freqRatio percentUnique zeroVar   nzv
## BiologicalMaterial07   71.000000      1.388889   FALSE  TRUE
## ManufacturingProcess41  6.500000      2.777778   FALSE FALSE
## ManufacturingProcess28  5.400000     14.583333   FALSE FALSE
## ManufacturingProcess12  4.760000      1.388889   FALSE FALSE
## ManufacturingProcess34  4.636364      6.250000   FALSE FALSE
## ManufacturingProcess40  4.333333      1.388889   FALSE FALSE
sum(ex$nzv)
## [1] 1

A check for near-zero variance predictors returns just one: BiologicalMaterial07, with a frequency ratio of 71. This predictor will be removed in pre-processing for this exercise, though in general, the relatively small number of predictors available as well as limited personal domain expertise would suggest leaving it in. More information is typically better than less.


corr <- cor((cmp_train %>% select(-c("Yield","BiologicalMaterial07"))))
corrplot::corrplot(corr)

hicorr <- findCorrelation(corr)

A plot of between-predictor correlations--unwieldy labels aside--reveals that the biological materials show some positive correlations amongst them, and that clusters of manufacturing processes are highly correlated. findCorrelation, at its default pairwise correlation cutoff of 0.90, flags ten predictors for removal (56 candidates enter and 46 remain in the model below). As was the case with the near-zero variance predictor, removing these highly correlated predictors in pre-processing, and losing information, is not ideal. Regardless, they will be removed for this exercise.

Lastly, Partial Least Squares (PLS), the modeling method of choice, requires centering and scaling. Unlike principal component analysis, PLS is a supervised method that considers the response and predictors in identifying components.


set.seed(624)
cmp_train_slim <- cmp_train %>% select(-c(Yield, BiologicalMaterial07)) %>% select(-all_of(hicorr))
cmp_train_transform <- cmp_train_slim %>% preProcess(method = c("BoxCox", "center", "scale")) %>% predict(cmp_train_slim) %>% cbind(cmp_train$Yield) %>% rename(Yield = "cmp_train$Yield")
(cmp_pls <- train(Yield ~ .,
            data = cmp_train_transform,
            method = "pls",
            tuneLength = 20,
            trControl = trainControl(method = "cv", number = 10)
            ))
## Partial Least Squares 
## 
## 144 samples
##  46 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 130, 130, 130, 129, 130, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     1.330348  0.4763544  1.0834481
##    2     1.213766  0.5625348  0.9840348
##    3     1.196188  0.5774340  0.9886366
##    4     1.184384  0.5911442  0.9810267
##    5     1.196831  0.5826518  0.9916942
##    6     1.201138  0.5722715  0.9885999
##    7     1.226767  0.5583218  1.0228333
##    8     1.272341  0.5388357  1.0604890
##    9     1.282468  0.5328993  1.0545322
##   10     1.289072  0.5385708  1.0516185
##   11     1.294834  0.5375744  1.0460450
##   12     1.313342  0.5287376  1.0546193
##   13     1.328259  0.5284952  1.0591376
##   14     1.341635  0.5208995  1.0724941
##   15     1.355117  0.5182529  1.0877087
##   16     1.358845  0.5191536  1.0877400
##   17     1.359345  0.5198062  1.0914817
##   18     1.352915  0.5246695  1.0937597
##   19     1.358462  0.5261390  1.0984270
##   20     1.367990  0.5270736  1.1019394
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 4.

The PLS model is tuned with tuneLength = 20, which evaluates 1 through 20 components, using 10-fold cross-validation for resampling.


plot(cmp_pls, main = "Partial Least Squares - Training RMSE on # of components")

cmp_pls$results %>% select(c(ncomp, RMSE, Rsquared)) %>% filter(ncomp == 3)
##   ncomp     RMSE Rsquared
## 1     3 1.196188 0.577434

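Per the plot and the resampling table, RMSE is minimized at four components (RMSE of approximately 1.18, R-squared of approximately 0.59), which is the value train() selects. The three-component model shown above is nearly equivalent, with RMSE and R-squared values of approximately 1.20 and 0.58, respectively.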


d. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

set.seed(624)
cmp_test_slim <- cmp_test %>% select(-c(Yield, BiologicalMaterial07)) %>% select(-all_of(hicorr))
cmp_test_transform <- cmp_test_slim %>% preProcess(method = c("BoxCox", "center", "scale")) %>% predict(cmp_test_slim) %>% cbind(cmp_test$Yield) %>% rename(Yield = "cmp_test$Yield")
cmp_pls_test <- predict(cmp_pls, cmp_test_transform)
RMSE(cmp_pls_test, cmp_test_transform$Yield)
## [1] 1.21805
caret::R2(cmp_pls_test, cmp_test_transform$Yield)
## [1] 0.6710741

The RMSE of test set predictions is approximately 1.22, and the associated R-squared value is approximately 0.67. The RMSE is roughly the same as its resampled counterpart on the training set (1.18), while the R-squared is somewhat higher (0.67 versus 0.59).
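
The two metrics can move somewhat independently, which the base R sketch below illustrates on toy values. The helpers rmse and r2_corr are hypothetical and assume R-squared is computed as the squared correlation between predictions and observations; a constant bias in the predictions changes RMSE but leaves that R-squared untouched.

```r
# Root mean squared error
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# R-squared as the squared correlation between predictions and observations
r2_corr <- function(pred, obs) cor(pred, obs)^2

obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 2.5, 2.5, 4.5)

rmse_plain <- rmse(pred, obs)        # every error is 0.5 in absolute value
r2_plain   <- r2_corr(pred, obs)

# Adding a constant bias inflates RMSE but does not change the correlation
rmse_biased <- rmse(pred + 10, obs)
r2_biased   <- r2_corr(pred + 10, obs)

print(c(rmse_plain, rmse_biased))
print(c(r2_plain, r2_biased))
```

For that reason it may be worth reading the two test set figures together rather than relying on R-squared alone.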


e. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

varImp(cmp_pls)$importance %>%
  arrange(-Overall) %>%
  rownames_to_column("predictor") %>%
  top_n(10) %>%
  ggplot(aes(x = reorder(predictor, Overall), y = Overall)) +
    geom_col() +
    ggtitle("Top 10 predictors of product yield, by importance in PLS model") +
    xlab(NULL) +
    ylab("Importance") +
    coord_flip()

The plot depicts the top ten most important predictors in the PLS model. The manufacturing process predictors tend to be more important than the biological material predictors.


f. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

top10 <- varImp(cmp_pls)$importance %>%
  rownames_to_column("variable") %>%
  top_n(10) %>%
  arrange(-Overall) %>%
  select(-Overall) %>%
  unlist()
cmp %>%
  select(Yield, unname(top10)) %>%
  cor() %>%
  corrplot::corrplot()

Each of the top ten most important predictors shows at least a moderate correlation with Yield. Because process predictors can be adjusted, addressing any adverse effects of ManufacturingProcess13, ManufacturingProcess36, and ManufacturingProcess17 could increase yields, as could emphasizing ManufacturingProcess32, ManufacturingProcess09, or ManufacturingProcess33. These are correlations rather than demonstrated causal effects, so any change would need to be confirmed in subsequent runs.


Sources

Kuhn, M. (2019). The caret package. Retrieved October 25, 2020 from https://topepo.github.io/caret/index.html.

Kuhn, M. and Johnson, K. (2013). Applied predictive modeling. doi 10.1007/978-1-4614-6849-3