Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
Start R and use these commands to load the data:
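The loading chunk isn't reproduced here; it was presumably the textbook's commands (assuming the AppliedPredictiveModeling package is installed):
library(AppliedPredictiveModeling)
data(permeability)  # creates the fingerprints matrix and the permeability response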
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package.
fingerprints <- as.data.frame(fingerprints)
# permeability <- as.data.frame(permeability)
near0 <- nearZeroVar(fingerprints)
fp <- fingerprints[,-near0]
How many predictors are left for modeling?
There are 388 predictors out of 1107 left for modeling.
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
trainingRows <- createDataPartition(permeability, p = .80, list= FALSE)
train_x <- fp[trainingRows, ]
train_y <- permeability[trainingRows]
test_x <- fp[-trainingRows, ]
test_y <- permeability[-trainingRows]
caret package
pls_model <- train(train_x, train_y, method="pls",
tuneLength = 20,
# trControl=trainControl(method="repeatedcv",repeats=5),
preProcess = c("center","scale"))
pls_model
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 133, 133, 133, 133, 133, 133, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.84920 0.2709024 10.636429
## 2 12.91205 0.3698073 9.213105
## 3 12.93696 0.3786810 9.671910
## 4 13.41528 0.3513935 10.194430
## 5 13.23003 0.3749945 10.009937
## 6 13.24614 0.3840792 9.949120
## 7 13.08158 0.3949232 9.849193
## 8 13.10461 0.4004220 9.924006
## 9 13.20249 0.4013161 10.010824
## 10 13.43059 0.3965166 10.164667
## 11 13.66669 0.3892159 10.297159
## 12 13.82120 0.3848419 10.417121
## 13 14.05644 0.3767074 10.658454
## 14 14.41308 0.3573801 10.927938
## 15 14.67841 0.3470513 11.154293
## 16 14.92639 0.3337784 11.360811
## 17 15.10416 0.3289769 11.508837
## 18 15.39667 0.3189454 11.758257
## 19 15.66749 0.3125186 11.956640
## 20 15.98925 0.3048928 12.233394
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.
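The block that follows looks like the variance-explained summary of caret's final two-component fit; the chunk isn't shown, but it was presumably something like:
summary(pls_model$finalModel)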
## Data: X dimension: 133 388
## Y dimension: 133 1
## Fit method: oscorespls
## Number of components considered: 2
## TRAINING: % variance explained
## 1 comps 2 comps
## X 21.43 34.77
## .outcome 32.02 48.73
pls package
library(pls)
train <- train_x
train['y'] <- train_y
test <- test_x
test['y'] <- test_y
plsFit <- plsr(y ~ ., data = train, scale = TRUE, center = TRUE,
ncomp=20, validation = "CV")
Both the caret and the pls packages find that 2 components are optimal (although when using cross-validation I can’t always get reproducible results, and sometimes more components are recommended). For the caret model the corresponding \(R^2\) value is 0.3698073. For the pls model the corresponding \(R^2\) value is 0.4872599 (if I calculated that correctly).
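The calculation behind that 0.4872599 isn't shown; one way to get a cross-validated \(R^2\) from the plsr fit is the pls package's R2 function (a sketch):
R2(plsFit, estimate = "CV")            # cross-validated R^2 by number of components
R2(plsFit, estimate = "CV", ncomp = 2) # just the two-component value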
Predict the response for the test set. What is the test set estimate of \(R^2\)?
caret package predictions
## RMSE Rsquared MAE
## 10.2093175 0.5067774 7.4334935
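The chunk that produced the caret numbers above isn't shown; a minimal version, assuming the objects defined earlier, would be:
caret_pred <- predict(pls_model, test_x)
defaultSummary(data.frame(obs = test_y, pred = caret_pred))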
pls package predictions
pls_pred <- predict(plsFit, test_x, ncomp=2)
pls.eval <- data.frame(obs=test_y, pred=pls_pred[,1,1])
defaultSummary(pls.eval)
## RMSE Rsquared MAE
## 10.2093175 0.5067774 7.4334935
I’m a little shocked that my model evaluation statistics all came out exactly the same for both models, but it makes sense: caret’s "pls" method calls the pls package under the hood, so with the same pre-processing (centering and scaling) and the same two components, both fits should produce identical predictions.
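One way to check that caret's "pls" method is backed by the pls package (a small sketch using caret's getModelInfo utility):
getModelInfo("pls", regex = FALSE)[[1]]$library  # returns "pls"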
Try building other models discussed in this chapter. Do any have better predictive performance?
caret
ridge_grid <- data.frame(.lambda = seq(0, .1, length=11))
ctrl <- trainControl(method = "cv", number = 10)
ridge_model <- train(train_x, train_y,
method = "ridge",
tuneGrid = ridge_grid,
trControl = ctrl,
preProcess = c("center","scale"))
ridge_model
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 121, 121, 118, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00 14.86420 0.3787189 11.350326
## 0.01 108.35841 0.2926301 77.502509
## 0.02 14.45405 0.3741074 10.628271
## 0.03 13.95151 0.3964276 10.286067
## 0.04 13.77835 0.4078377 10.154959
## 0.05 13.43403 0.4235641 9.900218
## 0.06 13.27553 0.4328217 9.772313
## 0.07 13.16316 0.4398899 9.676881
## 0.08 13.05329 0.4471737 9.600932
## 0.09 12.97369 0.4529305 9.538200
## 0.10 12.90751 0.4579974 9.491182
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
caret model
## RMSE Rsquared MAE
## 10.4890920 0.6111045 8.1409039
elasticnet model
## RMSE Rsquared MAE
## 10.4850440 0.6112443 8.1364082
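The elasticnet-package fit behind these numbers isn't shown; it was presumably along these lines (a sketch; enet with s = 1 and a nonzero \(\lambda\) corresponds to ridge regression):
library(elasticnet)
enet_fit <- enet(x = as.matrix(train_x), y = train_y, lambda = 0.1, normalize = TRUE)
enet_pred <- predict(enet_fit, as.matrix(test_x), s = 1, mode = "fraction", type = "fit")
defaultSummary(data.frame(obs = test_y, pred = enet_pred$fit))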
After tuning \(\lambda\) with the caret train function, a model fit with the elasticnet package using \(\lambda = 0.1\) (effectively ridge regression, since s = 1) gave a noticeably better \(R^2\) (about 0.61 vs. 0.51 for PLS), although since we are comparing two different models this may not be the best measure of performance. The RMSE and MAE are not much different (in fact marginally worse), so on balance I would still say the elastic net model has the better predictive performance.
The model created by the train function in the caret package returned the same results, so I’m not really sure why the textbook has you recreate the model with the elasticnet package.
The plots of actual vs. predicted values above suggest that the elastic net model is the better fit.
The Q-Q plots also show a better overall fit for the elastic net model, even though the two PLS models (built with the caret and pls packages) fit the majority of the data better; the PLS fits have five significant outliers.
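The plots themselves aren't reproduced here; for the PLS model they were along these lines (a sketch using pls.eval from above):
plot(pls.eval$obs, pls.eval$pred, xlab = "Observed permeability", ylab = "Predicted permeability")
abline(0, 1, lty = 2)              # reference line: perfect prediction
res <- pls.eval$obs - pls.eval$pred
qqnorm(res); qqline(res)           # normal Q-Q plot of the residuals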
Would you recommend any of your models to replace the permeability laboratory experiment?
No, based on the plots above I don’t think any of the models would be accurate enough to justify replacing the permeability laboratory experiment.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
Start R and use these commands to load the data:
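As before, the loading chunk isn't reproduced here; presumably:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)  # a data frame with Yield plus 57 predictors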
The data frame ChemicalManufacturingProcess contains the 57 predictors (12 describing the input biological material and 45 describing the manufacturing process) for the 176 manufacturing runs, along with the target variable Yield, which contains the percent yield for each run.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
# missing <- data.frame(t(apply(is.na(ChemicalManufacturingProcess), 2, sum)))
# t(missing[,colSums(missing) > 0])
library(naniar) # for missing data viz
missingness <- ChemicalManufacturingProcess %>%
miss_var_summary()
kable(missingness %>%
filter(pct_miss > 0)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| variable | n_miss | pct_miss |
|---|---|---|
| ManufacturingProcess03 | 15 | 8.5227273 |
| ManufacturingProcess11 | 10 | 5.6818182 |
| ManufacturingProcess10 | 9 | 5.1136364 |
| ManufacturingProcess25 | 5 | 2.8409091 |
| ManufacturingProcess26 | 5 | 2.8409091 |
| ManufacturingProcess27 | 5 | 2.8409091 |
| ManufacturingProcess28 | 5 | 2.8409091 |
| ManufacturingProcess29 | 5 | 2.8409091 |
| ManufacturingProcess30 | 5 | 2.8409091 |
| ManufacturingProcess31 | 5 | 2.8409091 |
| ManufacturingProcess33 | 5 | 2.8409091 |
| ManufacturingProcess34 | 5 | 2.8409091 |
| ManufacturingProcess35 | 5 | 2.8409091 |
| ManufacturingProcess36 | 5 | 2.8409091 |
| ManufacturingProcess02 | 3 | 1.7045455 |
| ManufacturingProcess06 | 2 | 1.1363636 |
| ManufacturingProcess01 | 1 | 0.5681818 |
| ManufacturingProcess04 | 1 | 0.5681818 |
| ManufacturingProcess05 | 1 | 0.5681818 |
| ManufacturingProcess07 | 1 | 0.5681818 |
| ManufacturingProcess08 | 1 | 0.5681818 |
| ManufacturingProcess12 | 1 | 0.5681818 |
| ManufacturingProcess14 | 1 | 0.5681818 |
| ManufacturingProcess22 | 1 | 0.5681818 |
| ManufacturingProcess23 | 1 | 0.5681818 |
| ManufacturingProcess24 | 1 | 0.5681818 |
| ManufacturingProcess40 | 1 | 0.5681818 |
| ManufacturingProcess41 | 1 | 0.5681818 |
imputer <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
CMP_imputed <- predict(imputer, ChemicalManufacturingProcess)
missing <- CMP_imputed %>%
miss_var_summary()
kable(missing %>%
filter(pct_miss > 0)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| variable | n_miss | pct_miss |
|---|---|---|

The empty table confirms that no missing values remain after imputation.
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
After removing variables with near-zero variance, 56 of the 57 predictors are left for modeling (57 of the 58 columns, counting Yield).
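The filtering chunk isn't shown above; it was presumably along these lines (a sketch, assuming the imputed data from the previous step):
nzv <- nearZeroVar(CMP_imputed)  # flags the near-zero-variance predictor(s)
cmp <- CMP_imputed[, -nzv]       # 57 columns remain: Yield plus 56 predictors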
trainingRows <- createDataPartition(cmp$Yield, p = .80, list= FALSE)
train <- cmp[trainingRows, ]
test <- cmp[-trainingRows, ]
caret package
CMP_pls_model <- train(train[,-1], train[,"Yield"], method="pls",
tuneLength=20,
# trControl=trainControl(method="repeatedcv",repeats=5),
preProcess=c("center","scale"))
CMP_pls_model
## Partial Least Squares
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.023764 0.2662151 0.7024952
## 2 1.517135 0.2556752 0.7678175
## 3 1.144342 0.3256068 0.7054432
## 4 1.051293 0.3594072 0.6832918
## 5 1.134892 0.3501609 0.7039725
## 6 1.186446 0.3432564 0.7301779
## 7 1.269517 0.3291640 0.7605063
## 8 1.409004 0.2999849 0.7995211
## 9 1.527249 0.2787915 0.8366923
## 10 1.635630 0.2667980 0.8673658
## 11 1.728955 0.2603218 0.8882150
## 12 1.783722 0.2508235 0.9134676
## 13 1.869813 0.2296608 0.9384069
## 14 2.043987 0.2143848 0.9771190
## 15 2.197101 0.2077721 1.0105228
## 16 2.316361 0.1920839 1.0408806
## 17 2.471245 0.1893137 1.0761429
## 18 2.591035 0.1916388 1.1029218
## 19 2.688780 0.1857432 1.1276270
## 20 2.815421 0.1805420 1.1566318
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 1.
I’m not impressed with the results, so let’s try an elastic net…
caret
CMP_ridge_model <- train(train[,-1], train$Yield,
method = "ridge",
tuneGrid = ridge_grid,
trControl = ctrl,
preProcess = c("center","scale"))
CMP_ridge_model
## Ridge Regression
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 131, 131, 128, 130, 128, 131, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00 2.331986 0.3814637 1.1611843
## 0.01 1.159724 0.4470011 0.7405773
## 0.02 1.198792 0.4572407 0.7279004
## 0.03 1.183358 0.4632850 0.7177072
## 0.04 1.160956 0.4674776 0.7106364
## 0.05 1.139159 0.4708060 0.7041047
## 0.06 1.119448 0.4737120 0.6988919
## 0.07 1.101965 0.4764262 0.6939708
## 0.08 1.086531 0.4790872 0.6896225
## 0.09 1.072910 0.4817887 0.6860740
## 0.10 1.060882 0.4846009 0.6832646
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
The ridge/elastic net model gives noticeably better resampled performance than the PLS model, so let's follow the textbook examples and also fit it directly with the elasticnet package, using the optimal \(\lambda\) found during the caret tuning.
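The chunk defining CMP_ridge_model2 (used below) isn't shown; presumably it was something like this (a sketch, assuming the elasticnet package):
library(elasticnet)
# ridge fit on the training predictors, with the lambda selected by caret
CMP_ridge_model2 <- enet(x = as.matrix(train[,-1]), y = train$Yield, lambda = 0.1, normalize = TRUE)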
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
caret model
## RMSE Rsquared MAE
## 0.5727537 0.7324409 0.4646206
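Again the prediction chunk for the caret model isn't shown; a minimal version would be:
CMP_caret_pred <- predict(CMP_ridge_model, test[,-1])
defaultSummary(data.frame(obs = test$Yield, pred = CMP_caret_pred))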
elasticnet model
CMP_ridge_pred2 <- predict(CMP_ridge_model2, as.matrix(test[,-1]), s = 1,
mode = "fraction",
type = "fit")
## RMSE Rsquared MAE
## 0.5727537 0.7324409 0.4646206
Once again we get almost exactly the same performance from the models trained with the caret package and the elasticnet package. Each has a test-set \(R^2\) of about 0.73, which is significantly higher than the resampled \(R^2\) of roughly 0.48 on the training set.
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
CMP_enet_coef <- predict(CMP_ridge_model2, as.matrix(test[,-1]), s = 1,
mode = "fraction",
type = "coefficients")
top3 <- head(sort(CMP_enet_coef$coefficients, decreasing = TRUE), 3)
bottom3 <- tail(sort(CMP_enet_coef$coefficients, decreasing = TRUE), 3)
top3
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess04
##              0.3898830              0.2392460              0.1989751
bottom3
## ManufacturingProcess36 ManufacturingProcess13 ManufacturingProcess37
##             -0.1442161             -0.1476988             -0.1734662
The 6 most influential predictors (3 of which are positively influential and 3 of which are negatively influential) are all manufacturing process predictors.
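A quick way to check whether process or biological predictors dominate more broadly (a sketch using the coefficients computed above):
# of the 10 largest |coefficients|, how many are process vs. biological predictors?
coefs <- sort(abs(CMP_enet_coef$coefficients), decreasing = TRUE)[1:10]
table(ifelse(grepl("^Manufacturing", names(coefs)), "process", "biological"))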
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
top_vars <- c('ManufacturingProcess32', 'ManufacturingProcess09', 'ManufacturingProcess04',
'ManufacturingProcess36', 'ManufacturingProcess13', 'ManufacturingProcess37')
mod <- lm(Yield~ManufacturingProcess32+ManufacturingProcess09+ManufacturingProcess04+
ManufacturingProcess36+ManufacturingProcess13+ManufacturingProcess37,
data = CMP_imputed)
summary(mod)
##
## Call:
## lm(formula = Yield ~ ManufacturingProcess32 + ManufacturingProcess09 +
## ManufacturingProcess04 + ManufacturingProcess36 + ManufacturingProcess13 +
## ManufacturingProcess37, data = CMP_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.45215 -0.48449 -0.02652 0.37212 1.59264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.001134 0.046399 -0.024 0.9805
## ManufacturingProcess32 0.526013 0.082529 6.374 1.69e-09 ***
## ManufacturingProcess09 0.389563 0.078537 4.960 1.70e-06 ***
## ManufacturingProcess04 0.100731 0.053438 1.885 0.0611 .
## ManufacturingProcess36 -0.098533 0.077647 -1.269 0.2062
## ManufacturingProcess13 -0.148247 0.078044 -1.900 0.0592 .
## ManufacturingProcess37 -0.118954 0.049016 -2.427 0.0163 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6155 on 169 degrees of freedom
## Multiple R-squared: 0.6342, Adjusted R-squared: 0.6212
## F-statistic: 48.82 on 6 and 169 DF, p-value: < 2.2e-16
A simple linear model using only the top 6 predictors achieves an adjusted \(R^2\) of 0.62, only slightly lower than that of the much more complicated elastic net model. The top two predictors, ManufacturingProcess32 and ManufacturingProcess09, are by far the most significant terms in this model, so they are the first two processes I would examine when trying to optimize the manufacturing process. Since both have a positive effect on yield, it may also be worth looking at the three processes with negative coefficients to see whether improvements there could increase yield.
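As a further check on direction and strength (a small sketch, using the imputed data and the top_vars vector defined above):
# simple correlations between each of the top predictors and Yield
sapply(CMP_imputed[, top_vars], function(x) cor(x, CMP_imputed$Yield))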