Homework Assignment Week 10
6.2: Developing a model to predict permeability could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
(a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version
## 3.4.4
library(caret)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
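A quick dimension check confirms that description (a convenience snippet, not part of the original output):
dim(fingerprints)   # 165 compounds x 1107 binary predictors
dim(permeability)   # one permeability value per compound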
(b)
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
zero = nearZeroVar(fingerprints)
print(length(zero))
## [1] 719
fingerprints = fingerprints[, -zero]
print(ncol(fingerprints))
## [1] 388
This leaves 388 of the original 1,107 predictors for modeling.
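For reference, nearZeroVar flags predictors whose frequency ratio and percent of unique values fall past caret's defaults (freqCut = 95/5, uniqueCut = 10). Run before the filtering step above, saveMetrics exposes those diagnostics:
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)  # one row per predictor
head(nzv_metrics)  # columns: freqRatio, percentUnique, zeroVar, nzv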
(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
set.seed(20)
training_data <- createDataPartition(permeability, p = 0.75, list = FALSE)
train_fingerprints <- fingerprints[training_data, ]
train_permeability <- permeability[training_data, ]
test_fingerprints <- fingerprints[-training_data, ]
test_permeability <- permeability[-training_data, ]
ctrl <- trainControl(method = "cv", number = 10)
pls_model = train(x = train_fingerprints, y = train_permeability,
                  method = "pls", tuneLength = 20,
                  trControl = ctrl, preProc = c("center", "scale"))
pls_model
## Partial Least Squares
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 113, 113, 112, 112, 112, 113, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.35431 0.3788960 10.238557
## 2 12.03533 0.4568540 8.373472
## 3 12.05526 0.4437779 9.011565
## 4 12.26022 0.4356737 9.534174
## 5 11.84654 0.4753485 8.853376
## 6 11.63297 0.4887330 8.628693
## 7 11.42108 0.5217297 8.592258
## 8 11.31022 0.5375519 8.699285
## 9 11.20955 0.5556686 8.346522
## 10 11.21450 0.5523972 8.297355
## 11 11.43030 0.5323027 8.643355
## 12 11.61295 0.5246278 8.766786
## 13 11.36923 0.5541333 8.542773
## 14 11.51247 0.5386136 8.574184
## 15 11.75300 0.5236620 8.811551
## 16 11.84792 0.5182653 8.808581
## 17 11.74837 0.5262432 8.624581
## 18 11.88590 0.5262800 8.739195
## 19 11.95151 0.5237069 8.773778
## 20 12.10903 0.5142074 8.872265
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 9.
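So the optimal model uses 9 latent variables, with a resampled R2 of about 0.556. Those values can also be pulled out programmatically (a convenience check on the table above):
best_ncomp <- pls_model$bestTune$ncomp
subset(pls_model$results, ncomp == best_ncomp)[, c("RMSE", "Rsquared")]  # 11.21 and 0.556 for ncomp = 9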
plot(pls_model, main = "RMSE vs. Number of Components")
(d) Predict the response for the test set. What is the test set estimate of R2?
y_hat = predict(pls_model, newdata=test_fingerprints)
r2 = cor(y_hat,test_permeability,method="pearson")^2
rmse = sqrt(mean((y_hat - test_permeability)^2))
print(r2)
## [1] 0.3118688
print(rmse)
## [1] 11.9756
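caret's postResample computes the same test-set metrics in a single call (an equivalent cross-check of the manual arithmetic above):
postResample(pred = y_hat, obs = test_permeability)  # returns RMSE, Rsquared, MAE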
plot(varImp(pls_model))
v_imp <- varImp(pls_model)
v_imp
## pls variable importance
##
## only 20 most important variables shown (out of 388)
##
## Overall
## X6 100.00
## X15 56.34
## X126 47.90
## X127 47.90
## X118 46.95
## X503 44.77
## X371 44.08
## X129 37.95
## X245 36.87
## X246 36.87
## X244 36.87
## X254 36.87
## X157 36.87
## X253 36.87
## X239 36.87
## X240 36.87
## X141 36.62
## X374 35.00
## X130 33.47
## X125 33.47
(e) Try building other models discussed in this chapter. Do any have better predictive performance?
library(elasticnet)
## Warning: package 'elasticnet' was built under R version 3.4.4
## Loading required package: lars
## Loaded lars 1.2
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
ridgeRegFit <- train(x = train_fingerprints, y = train_permeability,
                     method = "ridge", tuneGrid = ridgeGrid,
                     trControl = ctrl, preProc = c("center", "scale"))
## Warning: model fit failed for Fold10: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
y_hat = predict(ridgeRegFit, newdata=test_fingerprints )
r2_enet = cor(y_hat,test_permeability,method="pearson")^2
rmse_enet = sqrt(mean((y_hat - test_permeability)^2))
plot(ridgeRegFit, main = "RMSE vs. Weight Decay")
plot(varImp(ridgeRegFit))
v_imp <- varImp(ridgeRegFit)
v_imp
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 388)
##
## Overall
## X239 100.00
## X253 100.00
## X245 100.00
## X246 100.00
## X244 100.00
## X157 100.00
## X240 100.00
## X254 100.00
## X129 82.91
## X262 74.82
## X266 74.82
## X6 74.14
## X260 73.78
## X265 73.78
## X247 72.04
## X255 72.04
## X125 60.08
## X130 60.08
## X16 58.62
## X138 58.62
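For a working second model, an elastic net from the same chapter would be a natural next attempt. A sketch via caret's enet method (the tuning grid here is illustrative and was not run for this report; lambda starts above zero to avoid the ill-posed unpenalized fit that broke the ridge fold):
enetGrid <- expand.grid(.lambda = c(0.01, 0.1),
                        .fraction = seq(0.05, 1, length = 20))
set.seed(20)
enet_fit <- train(x = train_fingerprints, y = train_permeability,
                  method = "enet", tuneGrid = enetGrid,
                  trControl = ctrl, preProc = c("center", "scale"))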
(f) Would you recommend any of your models to replace the permeability laboratory experiment?
Not as it stands. The ridge fit failed on the resampling fold where lambda = 0 (with 388 predictors and only about 112 training samples per fold, the unpenalized fit is ill-posed), so I could not get a reliable second model on this data, and the PLS model's test-set R2 of about 0.31 is too low to justify replacing the laboratory experiment.
6.3: A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
(a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
(b) A small percentage of cells in the predictor set contain missing values. Use
an imputation function to fill in these missing values (e.g., see Sect. 3.8).
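Before imputing, it is worth quantifying the missingness (a quick check; outputs not reproduced here):
mean(is.na(ChemicalManufacturingProcess)) * 100  # percent of all cells that are NA
sort(colSums(is.na(ChemicalManufacturingProcess)), decreasing = TRUE)[1:5]  # most affected predictors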
predictors <- subset(ChemicalManufacturingProcess, select = -Yield)
yield <- subset(ChemicalManufacturingProcess, select = "Yield")
samples = dim(predictors)[1]
features = dim(predictors)[2]
missing_counts = apply(predictors, 2, function(x) sum(is.na(x)))  # NAs per predictor, for reference
# Mean-impute: fill each predictor's missing cells with that column's mean
for (x in 1:features) {
  blanks = is.na(predictors[, x])
  predictors[blanks, x] = mean(predictors[, x], na.rm = TRUE)
}
(c) Split the data into a training and a test set, pre-process the data, and
tune a model of your choice from this chapter. What is the optimal value of the performance metric?
set.seed(100)
training_data <- createDataPartition(yield$Yield,
p = 0.75,
list = FALSE)
train_predictors <- predictors[training_data,]
train_yield <- yield[training_data,]
test_predictors <- predictors[-training_data,]
test_yield <- yield[-training_data,]
pre_process <- preProcess(train_predictors,method=c("BoxCox","center","scale","knnImpute"))
pretrain_predictors <- predict(pre_process,train_predictors)
pretest_predictors <- predict(pre_process,test_predictors)
nzv_cols <- nearZeroVar(pretrain_predictors)
pretrain_predictors <- pretrain_predictors[, -nzv_cols]
pretest_predictors <- pretest_predictors[, -nzv_cols]
ctrl <- trainControl(method = "cv", number = 10)
pls <- train(x = pretrain_predictors, y = train_yield,
method = "pls",
tuneLength = 15,
                trControl = ctrl)
pls
## Partial Least Squares
##
## 132 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 118, 117, 119, 119, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.437916 0.4047685 1.1746895
## 2 1.258876 0.5323482 1.0522654
## 3 1.209884 0.5822127 1.0090559
## 4 1.213152 0.5961061 1.0348065
## 5 1.220107 0.6142757 1.0220275
## 6 1.209217 0.6187165 0.9933124
## 7 1.205730 0.6161206 0.9850353
## 8 1.209255 0.6162727 0.9883973
## 9 1.221528 0.6157812 0.9786977
## 10 1.246213 0.6063629 0.9839947
## 11 1.283121 0.5974357 1.0121375
## 12 1.321362 0.5803813 1.0330123
## 13 1.349693 0.5705133 1.0445883
## 14 1.389765 0.5653152 1.0708333
## 15 1.403119 0.5615079 1.0820164
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 7.
The optimal resampled RMSE is 1.2057 at ncomp = 7, with a corresponding resampled R2 of 0.6161.
plot(pls)
plot(varImp(pls))
v_imp <- varImp(pls)
v_imp
## pls variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 75.80
## ManufacturingProcess13 72.35
## ManufacturingProcess17 64.27
## ManufacturingProcess12 49.22
## BiologicalMaterial06 47.25
## BiologicalMaterial02 47.17
## BiologicalMaterial03 45.20
## BiologicalMaterial08 44.89
## BiologicalMaterial04 44.46
## BiologicalMaterial12 43.24
## BiologicalMaterial11 41.94
## BiologicalMaterial01 37.60
## ManufacturingProcess28 36.80
## BiologicalMaterial10 34.03
## ManufacturingProcess04 33.83
## ManufacturingProcess37 30.60
## ManufacturingProcess02 28.10
## ManufacturingProcess15 27.77
## ManufacturingProcess43 24.72
(d) Predict the response for the test set. What is the value of the performance
metric and how does this compare with the resampled performance metric on the training set?
y_hat = predict(pls, newdata=pretest_predictors)
r2 = cor(y_hat,test_yield,method="pearson")^2
rmse = sqrt( mean( (y_hat-test_yield)^2 ) )
print(r2)
## [1] 0.03535169
print(rmse)
## [1] 209986.9
lm_model = train(pretrain_predictors, train_yield, method = "lm",
                 preProcess = c("center", "scale"),
                 trControl = trainControl(method = "repeatedcv", repeats = 5))
y_hat = predict( lm_model, newdata=pretest_predictors)
r2_lm = cor(y_hat,test_yield,method="pearson")^2
rmse_lm = sqrt( mean( (y_hat-test_yield)^2 ) )
print(r2_lm)
## [1] 0.03535586
print(rmse_lm)
## [1] 238261.3
Both models perform far worse on the test set (R2 of roughly 0.035 each) than the resampled estimates on the training set (R2 = 0.6161, RMSE = 1.2057 for PLS); a test RMSE this many orders of magnitude above the resampled RMSE suggests a problem in the test-set prediction pipeline rather than ordinary overfitting.
(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
The manufacturing process predictors dominate: the five most important predictors (ManufacturingProcess32, 09, 13, 17, and 12) are all process measurements, with the biological materials clustered just below them.
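A quick tally of the top 20 makes the split explicit (a convenience sketch using the varImp object above):
imp <- v_imp$importance
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
table(ifelse(grepl("^Manufacturing", top20), "process", "biological"))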