Problem No. 6.2

The dataset fingerprint has several hundred low variance predictors. Examine some of them and filter them out:

low_var <- nearZeroVar(fingerprints)
table(fingerprints[,7])

## 
##   1 
## 165

table(fingerprints[,1087])

## 
##   0   1 
## 164   1

fingerprints <- fingerprints[,-low_var]

For the most part, these variables are mostly one value, with a small amount of secondary values. Removing them leaves 388 predictors remaining.

Partial least squares model

We can see below that many of the remaining independent variables are highly correlated with eachother. This makes running a linear regression problematic, because linear regression requires limited correlation between predictors to produce sensible parameter estimates.

findCorrelation(fingerprints)

##   [1]   6   8  10  13  20  21  22  23  24  25  26  27  28  29  30  31  32
##  [18]  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49
##  [35]  50  51  52  53  54  55  56  57  58  59  60  61  62  63  65  67  68
##  [52]  69  71  72  73  74  75  77  78  81  82  83  84  85  86  87  89  90
##  [69]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
##  [86] 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124
## [103] 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141
## [120] 142 143 144 145 146 147 148 149 150 152 153 154 155 156 157 158 159
## [137] 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
## [154] 177 178 179 180 181 183 184 185 186 187 188 189 190 191 192 193 194
## [171] 195 196 197 198 199 200 201 202 203 204 205 209 210 212 213 214 215
## [188] 216 218 221 223 224 225 226 227 228 229 230 231 232 233 234 235 236
## [205] 237 238 239 240 241 242 243 244 245 246 247 248 249 252 253 255 256
## [222] 257 260 261 262 263 267 268 269 270 271 272 273 274 275 276 277 278
## [239] 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295
## [256] 296 298 300 301 302 303 304 305 307 322 326 327 328 329 330 332 334
## [273] 335 336 337 338 347 349 350 351 352 357  12  14   1  18   3  64   9
## [290]  66 151  88  76  79  80  70   0  11  15  17  16  19   2   5   4   7

There are three potential solutions to this. First, we can remove the highly correlated predictors and try to apply least squares. Second, we could apply principle components analysis (PCA), transforming the predictor space into a smaller set of uncorrelated linear transformations of \(X\), then regress \(y\) with least squares. However, neither of these is a guarantee linear regression will work, and the latter option makes parameter estimates difficult to interpret.

Instead, I will attempt a third option, partial least squares (PLS) regression. Similiar to PCA, this method creates linear combinations of the predictors. But unlike PCA, the linear combinations are ‘optimized’ with regard to the dependent variable such that they optimally represent the variation between the predictors and the dependent variable.

First, let’s assemble our data into a single data frame, and then split into a train and test set for cross-validation:

df <- as.data.frame(fingerprints) %>%
  mutate(permeability = permeability)

set.seed(1804)
train_ix <- createDataPartition(df$permeability, p=0.8, list=FALSE)
df_train <- df[train_ix, ]
df_test <- df[-train_ix, ]

Next, fit the model using caret functionality. The determine the optimal number of components to construct, I set up a tuning grid that considers between 1 and 20 components:

ncomp_grid <- expand.grid(.ncomp=1:20)
m <- train(x=fingerprints,
           y=as.numeric(permeability),
           method='pls',
           tuneGrid=ncomp_grid)
m$results

##    ncomp     RMSE  Rsquared       MAE   RMSESD RsquaredSD     MAESD
## 1      1 13.37025 0.2725706 10.018944 1.640362 0.12379638 1.2231750
## 2      2 12.23829 0.3913878  8.628303 1.733991 0.11060395 1.0894052
## 3      3 12.18972 0.4024949  8.852800 1.478972 0.09399176 0.9328696
## 4      4 12.39863 0.3986252  9.144199 1.332974 0.10765960 0.8030293
## 5      5 12.34828 0.4144449  8.913195 1.515397 0.11626737 0.8232598
## 6      6 12.38531 0.4186866  8.965897 1.579712 0.11141798 0.8888452
## 7      7 12.28857 0.4335285  8.987729 1.561448 0.11079135 0.9220147
## 8      8 12.37247 0.4337851  9.168246 1.569104 0.11264925 1.0273177
## 9      9 12.48168 0.4344410  9.276370 1.660569 0.10882410 1.1549753
## 10    10 12.70577 0.4244861  9.364427 1.708348 0.10746999 1.2717124
## 11    11 12.89079 0.4202559  9.463989 1.657418 0.10243136 1.1480815
## 12    12 13.04552 0.4139146  9.594402 1.694032 0.10262387 1.1742957
## 13    13 13.24306 0.4055237  9.684707 1.626240 0.10143651 1.1222862
## 14    14 13.33127 0.4008868  9.764454 1.581731 0.09934271 1.0592563
## 15    15 13.60672 0.3880002  9.928917 1.592925 0.10181765 1.0844762
## 16    16 13.89030 0.3768414 10.103772 1.605639 0.09937927 1.0604847
## 17    17 14.13612 0.3679427 10.323218 1.645648 0.09950948 1.0654986
## 18    18 14.38509 0.3603861 10.538409 1.810424 0.10090865 1.1952592
## 19    19 14.67593 0.3505389 10.788531 1.881610 0.10023536 1.2766361
## 20    20 14.90480 0.3420204 10.983828 1.915988 0.10121105 1.3132307

plot(m)

The results of tuning are displayed. The \(RMSE\) is minimized where ncomp is between 2 and 10—the graph suggests it is most minimized at 3 components. Within the range of 2–10 components, \(MAE\) is smallest at ncomp=3. This provides an model with an \(R^2 = 0.44\). The caret package iotself agrees with 3 components, so I accept 3 as the optimal number of components:

m$bestTune

##   ncomp
## 3     3

Apply this model to the test set:

m_y_hat <- predict(m, newdata=df_test)
defaultSummary( data.frame(obs=as.numeric(df_test$permeability), pred=m_y_hat) )

##       RMSE   Rsquared        MAE 
## 10.0968255  0.3923439  6.5123117

Applied to the test set, the optimal PLS model boasts on \(RMSE\) even lower than the test set, of 10.10. It’s \(R^2 = 0.39\), which is respectable compared to performance on training data.

A plot of the observed versus predicted values shows a reasonable fit, despite some troubling outliers. In practice, more data would probably be required for me to feel completely confident in this model.

scatter.smooth(df_test$permeability, m_y_hat, xlab='Observed', ylab='Predicted')

Penalized regression models

I try to improve performance using a penalized regression model. The tuning funtionality in the caret package allows us to easily test both a ridge regression model \(\alpha = 0\) and elasticnet model \(\alpha=1\). These models may be helpful here, where there are more predictors than observed cases.

Note: In practice, this step should be accomplished before examining the test set above!

Create a tuning grid for \(\alpha\) and \(\lambda\), fit the models, and examine the best fit:

m1_grid <- expand.grid(alpha=0:1,
                       lambda=10^seq(-3, 3, length=100))
set.seed(1805)
m1 <- train(x=fingerprints,
            y=as.numeric(permeability),
            method='glmnet',
            trControl=trainControl('cv', number=4),
            tuneGrid=m1_grid)

plot(m1)

m1$bestTune

##     alpha    lambda
## 143     1 0.3511192

We see from the graph that the ridge regression performs best where \(\lambda < 100\) and \(RMSE\) steadily increases after that. The elasticnet model performs better than ridge regression only at very small values of \(\lambda\). The best performance is an elasticnet model with \(\lambda = 0.35\).

During training, it has an \(RMSE = 11.68\) and \(R^2 = 0.43\). This is comparable to the above PLS model’s performance on the training set.

m1$results %>% filter(round(lambda, 2) == 0.31, alpha == 1)

##   alpha    lambda     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1     1 0.3053856 11.67767 0.4332336 8.578657 1.093457  0.1075829 1.318257

Applied to our test set, we see that performs vastly better than the PLS model. \(RMSE\) is about 25 percent samller at 7.65, and \(R^2\) takes a large bump to 0.64!

m1_y_hat <- predict(m1, newdata=df_test)
defaultSummary( data.frame(obs=as.numeric(df_test$permeability), pred=m1_y_hat) )

##      RMSE  Rsquared       MAE 
## 7.6477011 0.6354759 5.7751656

The observed v. predicted plot also looks improved over the PLS model. This elasticnet model is without qualification a superior model.

scatter.smooth(df_test$permeability, m1_y_hat, xlab='Observed', ylab='Predicted')

The purpose of modeling this data was to find a cheaper and faster alternative to applying an assay to determine a compound’s permeability. Is the elasticnet model worth implementing over the status quo process? To answer this, we would need to know a few things:

Tolerance for error in the process (is the \(RMSE\) and \(R^2\) above acceptable?)
Cost of running an assay
Accuracy of assay

If an assay is expensive enough and/or the tolerance for error high enough, it may make sense to implement the elasticnet model. However, if tolerance for error is extremely low and assays have near 100 percent accuracy, the model should not be implemented as is.

Problem No. 6.3

Load the data:

data("ChemicalManufacturingProcess")
df <- ChemicalManufacturingProcess
colnames(df) <- tolower(colnames(df))

Examine the data set a bit, we determine:

The dependent variable has a roughly normal shape
The independent variables have vastly different ranges
A few columns have <15 missing values

This suggests a few first steps: Scaling and center the data is a good idea. We can also use caret’s bagging imputation function for the missing values, which will also center and scale. This can be handled in the train function.

Divide into train and test sets:

set.seed(1805)
train_ix <- createDataPartition(df$yield, p=0.8, list=FALSE)
df_train <- df[train_ix, ]
df_test <- df[-train_ix, ]

Now, test ridge and elasticnet regressions:

set.seed(1806)
m2_grid <- expand.grid(alpha=0:1,
                       lambda=10^seq(-3, 3, length=100))

m2 <- train(x=df_train[, 2:58],
            y=df_train$yield,
            method='glmnet',
            preProcess='bagImpute',
            trControl=trainControl('cv', number=4),
            tuneGrid=m2_grid)

plot(m2)

m2$bestTune

##     alpha   lambda
## 140     1 0.231013

m2$results %>% filter(round(lambda, 3) == 0.043, alpha == 1)

##   alpha     lambda     RMSE  Rsquared       MAE    RMSESD RsquaredSD
## 1     1 0.04328761 1.226074 0.5791323 0.9231774 0.2293747  0.1030213
##        MAESD
## 1 0.06473497

The best fit comes from an elasticnet model with \(\lambda = 0.043\). This readout suggests such a model has \(RMSE = 1.1\), and interestingly, an \(R^2 = 0.63\). This is fairly reasonable performance.

Test set shows the model fits similarly to out-of-sample data. Both the \(RMSE\) and \(R^2\) are comparable at 1.42, and 0.58, respectively.

m2_y_hat <- predict(m2, newdata=df_test)
defaultSummary( data.frame(obs=as.numeric(df_test$yield), pred=m2_y_hat) )

##      RMSE  Rsquared       MAE 
## 1.5472973 0.4963041 1.2456969

scatter.smooth(df_test$yield, m2_y_hat, xlab='Observed', ylab='Predicted')

What are the most important predictors? The caret package offers a varImp function, which has S3 methods for many kinds of models, including glmnet. However, the documentation suggests the calcuation is simply, ‘the absolute value of the coefficients corresponding the the tuned model’; the corresponding calculation for linear regression is ‘the absolute value of the t-statistic for each model parameter.’ I am skeptical either of these measures deserve the interpretation ‘most import variable.’

Nonetheless running the function:

varImp(m2)

## glmnet variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                         Overall
## manufacturingprocess34 100.0000
## manufacturingprocess09  31.9224
## manufacturingprocess13  19.0217
## manufacturingprocess32  18.2703
## manufacturingprocess17   7.9088
## manufacturingprocess11   2.2548
## manufacturingprocess06   1.3735
## biologicalmaterial03     0.4547
## manufacturingprocess04   0.0000
## manufacturingprocess29   0.0000
## manufacturingprocess26   0.0000
## manufacturingprocess38   0.0000
## manufacturingprocess20   0.0000
## biologicalmaterial01     0.0000
## manufacturingprocess21   0.0000
## manufacturingprocess27   0.0000
## manufacturingprocess43   0.0000
## manufacturingprocess19   0.0000
## manufacturingprocess07   0.0000
## manufacturingprocess41   0.0000

We see that the manufacturing process variables seem to dominate the top of the list. By a long shot, the ‘most important’ variable is manufacturingprocess36. Examining this relationship directly, we see a strong and steep near-linear relationship:

ggplot(df, aes(x=manufacturingprocess36, y=yield)) +
  geom_point(alpha=0.33) +
  geom_smooth()

This plot suggests keeping this manufactering process around 0.018 could help produce increased yield.

DATA 624—Week No. 10

Ben Horvath

April 4, 2020

Problem No. 6.2

Partial least squares model

Penalized regression models

Problem No. 6.3