Homework Assigment Week 10

6.2: Developing a model to predict permeability could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version
## 3.4.4
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

(b)

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

zero = nearZeroVar(fingerprints)
print(length(zero))
## [1] 719
fingerprints = fingerprints[,-zero]
print(length(fingerprints))
## [1] 64020

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

training_data <- createDataPartition(permeability, p = 0.75, list = FALSE)
train_fingerprints <- fingerprints[training_data,]
train_permeability <- permeability[training_data,]

test_fingerprints <- fingerprints[-training_data,]
test_permeability <- permeability[-training_data,]
set.seed(20)
ctrl <- trainControl(method = "cv", number = 10)
pls_model = train(x = train_fingerprints, y = train_permeability, method = "pls", tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))
pls_model
## Partial Least Squares 
## 
## 125 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 113, 113, 112, 112, 112, 113, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.35431  0.3788960  10.238557
##    2     12.03533  0.4568540   8.373472
##    3     12.05526  0.4437779   9.011565
##    4     12.26022  0.4356737   9.534174
##    5     11.84654  0.4753485   8.853376
##    6     11.63297  0.4887330   8.628693
##    7     11.42108  0.5217297   8.592258
##    8     11.31022  0.5375519   8.699285
##    9     11.20955  0.5556686   8.346522
##   10     11.21450  0.5523972   8.297355
##   11     11.43030  0.5323027   8.643355
##   12     11.61295  0.5246278   8.766786
##   13     11.36923  0.5541333   8.542773
##   14     11.51247  0.5386136   8.574184
##   15     11.75300  0.5236620   8.811551
##   16     11.84792  0.5182653   8.808581
##   17     11.74837  0.5262432   8.624581
##   18     11.88590  0.5262800   8.739195
##   19     11.95151  0.5237069   8.773778
##   20     12.10903  0.5142074   8.872265
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 9.
plot(pls_model, main="RMSE Error vs Components")

(d) Predict the response for the test set. What is the test set estimate of R2?

y_hat = predict(pls_model, newdata=test_fingerprints)
r2 = cor(y_hat,test_permeability,method="pearson")^2
rmse = sqrt( mean( (y_hat-test_permeability)^2 ) )
print(r2)
## [1] 0.3118688
print(rmse)
## [1] 11.9756
plot(varImp(pls_model))

v_imp <- varImp(pls_model)
v_imp
## pls variable importance
## 
##   only 20 most important variables shown (out of 388)
## 
##      Overall
## X6    100.00
## X15    56.34
## X126   47.90
## X127   47.90
## X118   46.95
## X503   44.77
## X371   44.08
## X129   37.95
## X245   36.87
## X246   36.87
## X244   36.87
## X254   36.87
## X157   36.87
## X253   36.87
## X239   36.87
## X240   36.87
## X141   36.62
## X374   35.00
## X130   33.47
## X125   33.47

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

library(elasticnet)
## Warning: package 'elasticnet' was built under R version 3.4.4
## Loading required package: lars
## Loaded lars 1.2
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))

ridgeRegFit<- train(x = train_fingerprints, y = train_permeability, method = "ridge", tuneGrid = ridgeGrid, trControl = ctrl, preProc = c("center", "scale"))
## Warning: model fit failed for Fold10: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
y_hat = predict(ridgeRegFit, newdata=test_fingerprints )
r2_enet = cor(y_hat,test_permeability,method="pearson")^2
rmse_enet = sqrt( mean( (y_hat-test_permeability)^2 ) )
plot(ridgeRegFit, main="RMSE Error vs Components")

plot(varImp(ridgeRegFit))

v_imp <- varImp(ridgeRegFit)
v_imp
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 388)
## 
##      Overall
## X239  100.00
## X253  100.00
## X245  100.00
## X246  100.00
## X244  100.00
## X157  100.00
## X240  100.00
## X254  100.00
## X129   82.91
## X262   74.82
## X266   74.82
## X6     74.14
## X260   73.78
## X265   73.78
## X247   72.04
## X255   72.04
## X125   60.08
## X130   60.08
## X16    58.62
## X138   58.62

(f) Would you recommend any of your models to replace the permeability laboratory experiment?

Seems as though I could not get a second model to work on this data.

6.3: A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

(b) A small percentage of cells in the predictor set contain missing values. Use

an imputation function to fill in these missing values (e.g., see Sect. 3.8).

predictors <- subset(ChemicalManufacturingProcess,select= -Yield)
yield <- subset(ChemicalManufacturingProcess,select="Yield")

samples = dim(predictors)[1]
features = dim(predictors)[2]

missing_rows = apply(predictors, 1, function(x) sum(is.na(x)))
for(x in 1:features ){
  blanks = is.na(predictors[,x] )
  predictors[blanks,x] = missing_rows[x]
}

(c) Split the data into a training and a test set, pre-process the data, and

tune a model of your choice from this chapter. What is the optimal value of the performance metric?

training_data <- createDataPartition(yield$Yield, 
                                    p = 0.75, 
                                    list = FALSE)



train_predictors <- predictors[training_data,]
train_yield <- yield[training_data,]

test_predictors <- predictors[-training_data,]
test_yield <- yield[-training_data,]


pre_process <- preProcess(train_predictors,method=c("BoxCox","center","scale","knnImpute"))
pretrain_predictors <- predict(pre_process,train_predictors)
pretest_predictors <- predict(pre_process,test_predictors)


non_zero_pretp <- nearZeroVar(pretrain_predictors)
pretrain_predictors  <- pretrain_predictors[-non_zero_pretp]
pretest_predictors <- pretest_predictors[-non_zero_pretp]
set.seed(100)
ctrl <- trainControl(method = "cv", number = 10)

pls <- train(x = pretrain_predictors, y = train_yield,
                 method = "pls",
                 tuneLength = 15,
                 trControl = ctrl)
pls
## Partial Least Squares 
## 
## 132 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 118, 118, 117, 119, 119, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     1.437916  0.4047685  1.1746895
##    2     1.258876  0.5323482  1.0522654
##    3     1.209884  0.5822127  1.0090559
##    4     1.213152  0.5961061  1.0348065
##    5     1.220107  0.6142757  1.0220275
##    6     1.209217  0.6187165  0.9933124
##    7     1.205730  0.6161206  0.9850353
##    8     1.209255  0.6162727  0.9883973
##    9     1.221528  0.6157812  0.9786977
##   10     1.246213  0.6063629  0.9839947
##   11     1.283121  0.5974357  1.0121375
##   12     1.321362  0.5803813  1.0330123
##   13     1.349693  0.5705133  1.0445883
##   14     1.389765  0.5653152  1.0708333
##   15     1.403119  0.5615079  1.0820164
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 7.
plot(pls)

plot(varImp(pls))

v_imp <- varImp(pls)
v_imp
## pls variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   75.80
## ManufacturingProcess13   72.35
## ManufacturingProcess17   64.27
## ManufacturingProcess12   49.22
## BiologicalMaterial06     47.25
## BiologicalMaterial02     47.17
## BiologicalMaterial03     45.20
## BiologicalMaterial08     44.89
## BiologicalMaterial04     44.46
## BiologicalMaterial12     43.24
## BiologicalMaterial11     41.94
## BiologicalMaterial01     37.60
## ManufacturingProcess28   36.80
## BiologicalMaterial10     34.03
## ManufacturingProcess04   33.83
## ManufacturingProcess37   30.60
## ManufacturingProcess02   28.10
## ManufacturingProcess15   27.77
## ManufacturingProcess43   24.72

(d) Predict the response for the test set. What is the value of the performance

metric and how does this compare with the resampled performance metric on the training set?

y_hat = predict(pls, newdata=pretest_predictors)
r2 = cor(y_hat,test_yield,method="pearson")^2
rmse = sqrt( mean( (y_hat-test_yield)^2 ) )
print(r2)
## [1] 0.03535169
print(rmse)
## [1] 209986.9
lm_model = train(pretrain_predictors, train_yield, method="lm", preProcess=c("center","scale"), trControl=trainControl(method="repeatedcv",repeats=5) )
y_hat = predict( lm_model, newdata=pretest_predictors)
r2_lm = cor(y_hat,test_yield,method="pearson")^2
rmse_lm = sqrt( mean( (y_hat-test_yield)^2 ) )

print(r2_lm)
## [1] 0.03535586
print(rmse_lm)
## [1] 238261.3

(e) Which predictors are most important in the model you have trained? Do

either the biological or process predictors dominate the list?

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing?