Data 624 HW7

11/5/2021

Gabe Abreu

6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  (a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 3.6.3
data("permeability")
dplyr::glimpse(fingerprints)
##  num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...
dplyr::glimpse(permeability)
##  num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr "permeability"
  (b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package.

How many predictors are left for modeling?

fpframe <- data.frame(fingerprints)
fpframe <- fpframe[,-c(nearZeroVar(fpframe))]
length(fpframe)
## [1] 388

There are 388 predictors left for modeling.
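
For reference, nearZeroVar can also report the diagnostics it uses (frequency ratio and percent of unique values) instead of just the column indices; a quick sketch (nzv_metrics is just an illustrative name):

# Inspect the near-zero-variance diagnostics for each fingerprint column
nzv_metrics <- nearZeroVar(data.frame(fingerprints), saveMetrics = TRUE)
head(nzv_metrics[order(-nzv_metrics$freqRatio), ])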

  (c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
set.seed(100)

#Create Data Partition 

initsplit <- createDataPartition(permeability, p=0.8, list=FALSE)


#Create Training Data to tune the model
training <- fpframe[initsplit,]

training_perm <- permeability[initsplit,]


#Create testing data to evaluate the model
test <- fpframe[-initsplit,]

test_perm <- permeability[-initsplit,]


ctrl <- trainControl(method = 'cv', number = 10)
set.seed(100)

partial_fit <- train(training,
                     training_perm,
                     method = 'pls',
                     metric = 'Rsquared',
                     tuneLength = 10,
                     trControl = ctrl,
                     preProcess =  c("center", "scale"))
plot(partial_fit)

partial_fit
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     12.85492  0.3362303  9.721289
##    2     11.82217  0.4436244  8.092500
##    3     11.71991  0.4718842  8.673662
##    4     11.56038  0.4864682  8.683859
##    5     11.40392  0.4845769  8.561044
##    6     11.08213  0.4956908  8.266156
##    7     11.21951  0.4861156  8.526250
##    8     11.25938  0.4897819  8.413176
##    9     11.37410  0.4889105  8.555701
##   10     11.55558  0.4772198  8.541903
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 6.

Based on the plot and fit summary, the optimal model uses 6 latent variables (ncomp = 6) with a resampled Rsquared of 0.4956908.
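
The chosen tune and its resampled metrics can also be pulled straight from the caret object rather than read off the table; a small sketch:

# Selected number of components and its cross-validated metrics
partial_fit$bestTune
merge(partial_fit$bestTune, partial_fit$results)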

  (d) Predict the response for the test set. What is the test set estimate of R2?
#Prediction of test with PLS

set.seed(100)

model_pls <- predict(partial_fit, test)

model_comp <- data.frame(obs=test_perm, pred=model_pls)
defaultSummary(model_comp)
##      RMSE  Rsquared       MAE 
## 11.773628  0.471816  8.672468
plot(model_pls, test_perm, xlab = "Predicted permeability", ylab = "Observed permeability")

The test set Rsquared value is 0.471816, slightly below the resampled estimate of 0.496.

  (e) Try building other models discussed in this chapter. Do any have better predictive performance?

Creating a model using Linear Regression:

set.seed(100)
lm_model <- train(training, training_perm, method = 'lm', trControl = ctrl)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## (the same warning is emitted for each of the 10 cross-validation folds)
lm_model
## Linear Regression 
## 
## 133 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   33.37193  0.1545846  21.39547
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Predicting using the Linear Regression Model:

set.seed(100)
lm_predict <- predict(lm_model, test)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
lmValues <- data.frame(obs=test_perm, pred = lm_predict)
defaultSummary(lmValues)
##        RMSE    Rsquared         MAE 
## 24.66203992  0.02516594 13.10952089

Ridge Model

## Define the set of values
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
set.seed(100)

ridgeRegFit <- train(training, training_perm, method = "ridge", tuneGrid = ridgeGrid, trControl = ctrl,
                      preProcess = c("center","scale"))

ridgeRegFit
## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE       Rsquared   MAE      
##   0.000000000   13.56846  0.3858837  10.092990
##   0.007142857   15.60215  0.2689935  11.569288
##   0.014285714  115.84292  0.3196042  82.969549
##   0.021428571   13.70226  0.3589300  10.306764
##   0.028571429   13.43111  0.3751570  10.118236
##   0.035714286   13.10860  0.3906398   9.825295
##   0.042857143   12.83212  0.4061188   9.623303
##   0.050000000   12.70364  0.4137372   9.503849
##   0.057142857   12.57114  0.4223368   9.407583
##   0.064285714   12.39352  0.4333250   9.282137
##   0.071428571   12.32787  0.4382032   9.225255
##   0.078571429   12.22797  0.4449271   9.148902
##   0.085714286   12.16429  0.4496660   9.092823
##   0.092857143   12.11078  0.4540054   9.042764
##   0.100000000   12.07131  0.4575539   9.008745
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
set.seed(100)
ridgePred <- predict(ridgeRegFit, test)
ridgeValues <- data.frame(obs=test_perm, pred=ridgePred)
defaultSummary(ridgeValues)
##       RMSE   Rsquared        MAE 
## 11.8168925  0.5174488  8.5967495

Lasso Grid

LassoGrid <- expand.grid(.lambda=c(0,0.01, .1), .fraction=seq(.05, 1, length = 20))

set.seed(100)

TedLasso <- train(training, training_perm,
                  method = "enet",
                  tuneGrid = LassoGrid,
                  trControl = ctrl,
                  preProcess = c("center", "scale"))
TedLasso
## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.00    0.05      12.47588  0.4208801   9.435047
##   0.00    0.10      11.78923  0.4716106   8.087365
##   0.00    0.15      11.31565  0.4991800   7.876385
##   0.00    0.20      11.17163  0.5008068   7.850165
##   0.00    0.25      11.16277  0.4980254   7.930180
##   0.00    0.30      11.23723  0.4959779   8.095477
##   0.00    0.35      11.26952  0.4957004   8.124020
##   0.00    0.40      11.31378  0.4928421   8.136045
##   0.00    0.45      11.38768  0.4888795   8.208561
##   0.00    0.50      11.55436  0.4782987   8.370018
##   0.00    0.55      11.73737  0.4668470   8.518631
##   0.00    0.60      11.94065  0.4556659   8.677040
##   0.00    0.65      12.14505  0.4462439   8.835489
##   0.00    0.70      12.36943  0.4362257   9.022197
##   0.00    0.75      12.60672  0.4244317   9.212477
##   0.00    0.80      12.82164  0.4142255   9.398498
##   0.00    0.85      13.00746  0.4074243   9.553866
##   0.00    0.90      13.20164  0.4002530   9.729740
##   0.00    0.95      13.39717  0.3927319   9.922369
##   0.00    1.00      13.56846  0.3858837  10.092990
##   0.01    0.05      11.52012  0.4972266   8.020639
##   0.01    0.10      11.05901  0.5045453   7.654380
##   0.01    0.15      11.09148  0.4997759   7.886225
##   0.01    0.20      11.24961  0.4891585   8.013324
##   0.01    0.25      11.38150  0.4811294   8.128488
##   0.01    0.30      11.52513  0.4695141   8.307715
##   0.01    0.35      11.78957  0.4526466   8.579942
##   0.01    0.40      12.04254  0.4396044   8.856650
##   0.01    0.45      12.28144  0.4274378   9.106965
##   0.01    0.50      12.55242  0.4136716   9.379548
##   0.01    0.55      12.84534  0.3990217   9.662862
##   0.01    0.60      13.11087  0.3847593   9.881114
##   0.01    0.65      13.37187  0.3715481  10.056764
##   0.01    0.70      13.63199  0.3585939  10.219595
##   0.01    0.75      13.89619  0.3457505  10.397912
##   0.01    0.80      14.13207  0.3340052  10.550947
##   0.01    0.85      14.38560  0.3219061  10.726579
##   0.01    0.90      14.65589  0.3098366  10.928835
##   0.01    0.95      14.86956  0.3009707  11.089620
##   0.01    1.00      15.04527  0.2942010  11.215431
##   0.10    0.05      12.10929  0.4616972   9.100726
##   0.10    0.10      11.46167  0.5013271   7.839306
##   0.10    0.15      11.20941  0.5091979   7.559372
##   0.10    0.20      11.03835  0.5113379   7.595595
##   0.10    0.25      11.03369  0.5104672   7.781120
##   0.10    0.30      11.08814  0.5082211   7.872714
##   0.10    0.35      11.21817  0.5010496   7.985921
##   0.10    0.40      11.31391  0.4953345   8.018761
##   0.10    0.45      11.38706  0.4910837   8.068842
##   0.10    0.50      11.45076  0.4870859   8.162696
##   0.10    0.55      11.50113  0.4847988   8.266132
##   0.10    0.60      11.57130  0.4810821   8.370649
##   0.10    0.65      11.64927  0.4759461   8.454399
##   0.10    0.70      11.72488  0.4713180   8.546678
##   0.10    0.75      11.78704  0.4681958   8.624364
##   0.10    0.80      11.84749  0.4655544   8.708469
##   0.10    0.85      11.91036  0.4631435   8.788172
##   0.10    0.90      11.96377  0.4613488   8.860670
##   0.10    0.95      12.01680  0.4595391   8.935237
##   0.10    1.00      12.07131  0.4575539   9.008745
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.25 and lambda = 0.1.
LassoPred <- predict(TedLasso, test)
LassoModel <- data.frame(obs = test_perm, pred=LassoPred)
defaultSummary(LassoModel)
##       RMSE   Rsquared        MAE 
## 11.7971613  0.4787237  8.2583335

ElasticNet

enetGrid <- expand.grid(.lambda=c(0,0.1,.1), .fraction= seq(.05, 1, length =20))

set.seed(100)

enetTune <- train( training, training_perm,
                   method="enet",
                   tuneGrid = enetGrid,
                   trControl = ctrl,
                   preProcess = c("center", "scale"))

plot(enetTune)

EnetPred <- predict(enetTune, test)
EnetModel <- data.frame(obs = test_perm, pred=EnetPred)
defaultSummary(EnetModel)
##       RMSE   Rsquared        MAE 
## 11.7971613  0.4787237  8.2583335
  (f) Would you recommend any of your models to replace the permeability laboratory experiment?

Based on the Rsquared values, I would not recommend these models to replace the permeability laboratory experiment, with the possible exception of the ridge model. The best performing model on the test set was ridge regression, with an Rsquared of roughly 0.517. The lasso and elastic net runs produced identical test results; the elastic net grid repeats lambda = 0.1 (likely a typo for 0.01), so both searches selected the same final model (fraction = 0.25, lambda = 0.1).
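
For a side-by-side view, the test-set summaries computed above can be stacked into one table; a minimal sketch reusing the observed/predicted frames created for each model (test_results is just an illustrative name):

# Gather the test-set metrics for every model fit in this problem
test_results <- rbind(
  PLS        = defaultSummary(model_comp),
  Linear     = defaultSummary(lmValues),
  Ridge      = defaultSummary(ridgeValues),
  Lasso      = defaultSummary(LassoModel),
  ElasticNet = defaultSummary(EnetModel)
)
round(test_results, 3)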

6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

  (a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
  (b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

Let’s quickly plot the missing data:

library(DataExplorer)  # plot_missing() comes from the DataExplorer package
plot_missing(ChemicalManufacturingProcess)

While it's good to see the missing-percentage values, this doesn't tell us which columns contain missing values or where we might need to impute data.

colnames(ChemicalManufacturingProcess[!complete.cases(ChemicalManufacturingProcess),])
##  [1] "Yield"                  "BiologicalMaterial01"   "BiologicalMaterial02"  
##  [4] "BiologicalMaterial03"   "BiologicalMaterial04"   "BiologicalMaterial05"  
##  [7] "BiologicalMaterial06"   "BiologicalMaterial07"   "BiologicalMaterial08"  
## [10] "BiologicalMaterial09"   "BiologicalMaterial10"   "BiologicalMaterial11"  
## [13] "BiologicalMaterial12"   "ManufacturingProcess01" "ManufacturingProcess02"
## [16] "ManufacturingProcess03" "ManufacturingProcess04" "ManufacturingProcess05"
## [19] "ManufacturingProcess06" "ManufacturingProcess07" "ManufacturingProcess08"
## [22] "ManufacturingProcess09" "ManufacturingProcess10" "ManufacturingProcess11"
## [25] "ManufacturingProcess12" "ManufacturingProcess13" "ManufacturingProcess14"
## [28] "ManufacturingProcess15" "ManufacturingProcess16" "ManufacturingProcess17"
## [31] "ManufacturingProcess18" "ManufacturingProcess19" "ManufacturingProcess20"
## [34] "ManufacturingProcess21" "ManufacturingProcess22" "ManufacturingProcess23"
## [37] "ManufacturingProcess24" "ManufacturingProcess25" "ManufacturingProcess26"
## [40] "ManufacturingProcess27" "ManufacturingProcess28" "ManufacturingProcess29"
## [43] "ManufacturingProcess30" "ManufacturingProcess31" "ManufacturingProcess32"
## [46] "ManufacturingProcess33" "ManufacturingProcess34" "ManufacturingProcess35"
## [49] "ManufacturingProcess36" "ManufacturingProcess37" "ManufacturingProcess38"
## [52] "ManufacturingProcess39" "ManufacturingProcess40" "ManufacturingProcess41"
## [55] "ManufacturingProcess42" "ManufacturingProcess43" "ManufacturingProcess44"
## [58] "ManufacturingProcess45"

The colnames() call above is applied to the rows with incomplete cases, so it simply lists every column name rather than pinpointing which columns actually contain NAs. Based on the missing percentages, no column is missing a large share of its values, so we can go ahead with imputation.
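
A per-column count shows exactly which predictors contain missing cells and how many each has; a small sketch (na_counts is just an illustrative name):

# Count missing values per column and keep only the columns with at least one NA
na_counts <- colSums(is.na(ChemicalManufacturingProcess))
sort(na_counts[na_counts > 0], decreasing = TRUE)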

Predictive Mean Matching (PMM) is somewhat involved to explain, but it is a robust imputation method: it preserves the variability of the observed data and never imputes values outside the range of what was actually observed.
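
The imputation object imputed_chem used in the next chunk is created off-screen; a minimal sketch of that step, assuming the mice package (PMM is mice's default method for numeric columns, and the seed value here is an assumption):

library(mice)

# Multiple imputation with predictive mean matching; printFlag = FALSE suppresses the iteration log
imputed_chem <- mice(ChemicalManufacturingProcess, method = "pmm", seed = 100, printFlag = FALSE)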

complete_chem <- complete(imputed_chem)
plot_missing(complete_chem)

  (c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
length(complete_chem)
## [1] 58

Let's eliminate predictors with low frequencies, as we did in the previous problem.

chemframe <- complete_chem[,-c(nearZeroVar(complete_chem))]
length(chemframe)
## [1] 57

One column is removed from the data set; the sketch below shows which one.
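
A quick check, mapping the nearZeroVar index back to a column name:

# Name of the column flagged as near-zero variance
colnames(complete_chem)[nearZeroVar(complete_chem)]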

Let’s remove columns with high correlation, using 90% as the cutoff point.

corr_data <- findCorrelation(cor(chemframe), cutoff = 0.90)
chemframe <- chemframe[, -corr_data]
length(chemframe)
## [1] 47

Ten columns were removed from the data set; the sketch below recovers their names. For partitioning, we will follow a setup similar to the one in problem 6.2, this time with a 75/25 split.
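
Since corr_data indexes the frame as it stood after the near-zero-variance filter, the dropped names can be recovered like this (a sketch):

# Names of the highly correlated columns that were removed
colnames(complete_chem[, -nearZeroVar(complete_chem)])[corr_data]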

Now I’m going to subset the data after eliminating columns from the data set.

predictors <- subset(chemframe, select = -Yield)
yield <- subset(chemframe, select="Yield")
set.seed(100)

#Create Data Partition 

initsplit2 <- createDataPartition(yield$Yield, p=0.75, list=FALSE)


#Create Training Data to tune the model
training2 <- predictors[initsplit2,]
training_yield <- yield[initsplit2,]


#Create testing data to evaluate the model
test2 <- predictors[-initsplit2,]
test_yield <- yield[-initsplit2,]

Part of the cleaning process is centering and scaling the data. (Note that the ridge fit below is trained on the unscaled training2 frame, as its summary reports "No pre-processing"; the centered and scaled copies created here are kept for reference. This matters for ridge regression, which is sensitive to predictor scale.)

chemTransformed <- preProcess(training2, method = c("center", "scale"))

CSchem <- predict(chemTransformed, training2)
TCSchem <- predict(chemTransformed, test2)
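
If you would rather have caret learn the centering and scaling inside each resampling fold, as in problem 6.2, the preProcess argument can be passed to train directly; a sketch (ridge_scaled is just an illustrative name, and this fit would differ slightly from the one reported below):

# Alternative: center and scale within each cross-validation fold
ridge_scaled <- train(training2, training_yield,
                      method = "ridge",
                      tuneGrid = ridgeGrid,
                      trControl = ctrl,
                      preProcess = c("center", "scale"))
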
  (d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

Ridge Model

set.seed(100)

ridgeRegFit2 <- train(training2, training_yield, method = "ridge", tuneGrid = ridgeGrid, trControl = ctrl)

ridgeRegFit2
## Ridge Regression 
## 
## 132 samples
##  46 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 119, 118, 120, 119, 118, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE      Rsquared   MAE     
##   0.000000000  6.029453  0.3955710  2.428251
##   0.007142857  5.074384  0.4243074  2.128553
##   0.014285714  4.528278  0.4378886  1.967736
##   0.021428571  4.154274  0.4465980  1.857503
##   0.028571429  3.875036  0.4529265  1.775767
##   0.035714286  3.655417  0.4578691  1.710912
##   0.042857143  3.476523  0.4619154  1.657543
##   0.050000000  3.327061  0.4653395  1.612988
##   0.057142857  3.199769  0.4683091  1.575069
##   0.064285714  3.089716  0.4709339  1.542566
##   0.071428571  2.993407  0.4732896  1.514245
##   0.078571429  2.908281  0.4754304  1.489740
##   0.085714286  2.832411  0.4773961  1.468135
##   0.092857143  2.764311  0.4792171  1.449080
##   0.100000000  2.702813  0.4809167  1.432074
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
set.seed(100)
ridgePred2 <- predict(ridgeRegFit2, test2)
ridgeValues2 <- data.frame(obs=test_yield, pred=ridgePred2)
defaultSummary(ridgeValues2)
##      RMSE  Rsquared       MAE 
## 1.1873794 0.5870424 0.9484809

The test set Rsquared is 0.5870424, with an RMSE of 1.1873794 and an MAE of 0.9484809.

I experimented with different data partitions, and 0.75 seems to hit a sweet spot for ridge regression; splits much larger or smaller produced noticeably worse Rsquared values on the test data. It's something to keep in mind when working in the professional world: experiment with different training/test splits when building models. The model also performed better on the test set (Rsquared 0.587) than in resampling on the training set (Rsquared 0.481), which is the opposite of what usually happens.

The best-tuned model used a lambda of 0.1.
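
To put the resampled estimate next to the test-set numbers, the row for the chosen penalty can be pulled from the caret results table; a small sketch:

# Cross-validated metrics for the selected penalty (the row with the lowest RMSE)
ridgeRegFit2$results[which.min(ridgeRegFit2$results$RMSE), ]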

  (e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

Ridge Model Top Variables:

plot(varImp(ridgeRegFit2), top=20)

The most important variables are the manufacturing process predictors, led by ManufacturingProcess13 and ManufacturingProcess32. The only biological predictor in the top five is BiologicalMaterial06.
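
The same ranking can be printed as a table if the plot is hard to read; a sketch pulling the importance scores out of the caret object (imp is just an illustrative name):

# Top ten predictors by scaled importance
imp <- varImp(ridgeRegFit2)$importance
head(imp[order(-imp$Overall), , drop = FALSE], 10)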

  (f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

The top predictors underscore the importance of the manufacturing process steps, the area where future investment should pay the biggest dividends. That is not to say the biological materials are irrelevant, particularly BiologicalMaterial06 and BiologicalMaterial09. When I tried to extract the coefficients for the top predictors, I got "NULL". Ideally, you would want to know whether each top predictor relates positively or negatively to yield: predictors with a positive relationship should be increased or optimized, while the influence of negatively related predictors should be reduced.
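
Since the coefficients were not readily extractable, a simple way to get the direction of each relationship is to correlate the predictors named above with Yield on the imputed data; a sketch (the four-variable selection is just the predictors discussed in this write-up):

# Sign and strength of the association between each top predictor and the response
top_vars <- c("ManufacturingProcess13", "ManufacturingProcess32",
              "BiologicalMaterial06", "BiologicalMaterial09")
cor(complete_chem[, top_vars], complete_chem$Yield)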