Source Code: https://github.com/djlofland/DATA624_PredictiveAnalytics/tree/master/Homework_7
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
## [1] "165 1107"
## [1] "165 1"
Note: no missing values found
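For reference, a quick check along these lines confirms it (a sketch; the original chunk isn't shown):
# count missing values in the predictors and the response
sum(is.na(fingerprints))
sum(is.na(permeability))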
Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
## [1] "165 388"
We have 388 predictors left for modeling.
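The filtering step itself isn't shown above; the fp and pm objects used below were presumably built along these lines (a minimal sketch):
library(caret)
# drop near-zero-variance fingerprint columns; keep the response as a plain vector
nzv <- nearZeroVar(fingerprints)
fp <- fingerprints[, -nzv]   # 165 x 388 after filtering
pm <- permeability[, 1]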
set.seed(424242)
# get training/test split
trainingRows <- createDataPartition(pm, p=0.8, list=FALSE)
# Build training datasets
trainX <- fp[trainingRows,]
trainY <- pm[trainingRows]
# put remaining rows into the test sets
testX <- fp[-trainingRows,]
testY <- pm[-trainingRows]
# Build a DF
trainingData <- as.data.frame(trainX)
trainingData$Perm <- trainY
ctrl <- trainControl(method='cv', number=10)
# run CV to identify the number of latent components
plsTune <- train(trainX, trainY,
method="pls",
tuneLength=20,
trControl = ctrl)
plot(plsTune)
# Get the number of components which minimizes RMSE
(optimal_num_comp <- which.min(plsTune$results[,2]))
## [1] 7
## [1] 0.5797039
I found the optimal number of PLS latent components to be 7, with a corresponding resampled \(R^2\) of 0.5797039.
plsFit <- plsr(Perm ~., data=trainingData, ncomp=optimal_num_comp)
plsPred <- predict(plsFit, testX, ncomp=optimal_num_comp)
head(plsPred)
## , , 7 comps
##
## Perm
## 2 2.201710
## 3 18.503569
## 4 5.680499
## 8 5.720243
## 19 7.133854
## 22 28.728169
plsValues <- data.frame(obs = testY, pred=plsPred)
colnames(plsValues) = c('obs', 'pred')
defaultSummary(plsValues)
## RMSE Rsquared MAE
## 13.3211161 0.4067846 9.1956516
\(R^2\) for the test set is 0.4068, with an RMSE of 13.32.
# Ridge Model
set.seed(100)
ridgeGrid <- data.frame(.lambda=seq(0, 0.2, length=21))
ctrl <- trainControl(method='cv', number=10)
ridgeRegFit <- train(trainX, trainY,
method="ridge",
tuneGrid = ridgeGrid,
trControl = ctrl)
## Warning: model fit failed for Fold02: lambda=0.00 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold06: lambda=0.00 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold07: lambda=0.00 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
These warnings are expected: at lambda = 0 the ridge fit reduces to unpenalized least squares, which is ill-posed when a fold has more predictors (388) than training rows (~120).
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00 14.04059 0.3155877 9.731223
## 0.01 15.05256 0.3259500 10.671451
## 0.02 13.20238 0.3886225 9.379421
## 0.03 12.51383 0.4167591 8.895310
## 0.04 12.18571 0.4374073 8.720544
## 0.05 11.83622 0.4527312 8.434725
## 0.06 11.66755 0.4629389 8.320563
## 0.07 11.56659 0.4721971 8.255973
## 0.08 11.46803 0.4796161 8.179014
## 0.09 11.35867 0.4860730 8.089975
## 0.10 11.30413 0.4910185 8.041902
## 0.11 11.26373 0.4952875 8.001179
## 0.12 11.23325 0.4990213 7.978158
## 0.13 11.21365 0.5023382 7.964948
## 0.14 11.22380 0.5053486 7.984859
## 0.15 11.21283 0.5082730 7.986088
## 0.16 11.16735 0.5106052 7.958787
## 0.17 11.17855 0.5123371 7.973737
## 0.18 11.16861 0.5145551 7.979657
## 0.19 11.17309 0.5162831 7.993145
## 0.20 11.19009 0.5177423 8.022084
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.16.
# Do prediction and get R^2
ridgePred <- predict(ridgeRegFit, newdata=testX)
ridgeValues <- data.frame(obs = testY, pred=ridgePred)
colnames(ridgeValues) = c('obs', 'pred')
defaultSummary(ridgeValues)
## RMSE Rsquared MAE
## 13.7684863 0.3937314 10.1500314
set.seed(100)
# Elastic net (lambda = 0 corresponds to the lasso)
enetGrid <- expand.grid(.lambda=c(0, 0.01, 0.1),
.fraction=seq(0.05, 1, length = 20))
ctrl <- trainControl(method='cv', number=10)
enetTune <- train(trainX, trainY,
method="enet",
tuneGrid = enetGrid,
trControl = ctrl)
## Warning: model fit failed for Fold02: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold06: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold07: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
(These are the same expected lambda = 0 failures seen with the ridge fit above.)
## Length Class Mode
## call 4 -none- call
## actions 192 -none- list
## allset 388 -none- numeric
## beta.pure 74496 -none- numeric
## vn 388 -none- character
## mu 1 -none- numeric
## normx 388 -none- numeric
## meanx 388 -none- numeric
## lambda 1 -none- numeric
## L1norm 192 -none- numeric
## penalty 192 -none- numeric
## df 192 -none- numeric
## Cp 192 -none- numeric
## sigma2 1 -none- numeric
## xNames 388 -none- character
## problemType 1 -none- character
## tuneValue 2 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
lassoPred <- predict(enetTune, newdata=testX)
lassoValues <- data.frame(obs = testY, pred=lassoPred)
colnames(lassoValues) = c('obs', 'pred')
defaultSummary(lassoValues)
## RMSE Rsquared MAE
## 14.0007919 0.3538517 9.7501498
I would NOT recommend replacing the laboratory experiments outright, but a model could be used to help prioritize the order of research. Then, based on business needs, if it made sense to stop early, we would be more likely to have already screened the most promising molecules. Among the models fit here, the best (PLS and ridge) explain only ~40% of the variability in permeability on the test set, which may or may not be sufficient to justify altering processes.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
data("ChemicalManufacturingProcess")
cmp <- as_tibble(ChemicalManufacturingProcess)
x_raw <- cmp[,2:58]
y_raw <- as.matrix(cmp$Yield)
print(paste(nrow(x_raw), ncol(x_raw)))
## [1] "176 57"
## [1] "176 1"
# Various NA plots to inspect data
knitr::kable(miss_var_summary(cmp) %>% filter(n_miss > 0),
caption = 'Missing Values',
format="html",
table.attr="style='width:50%;'") %>%
kableExtra::kable_styling()
| variable | n_miss | pct_miss |
|---|---|---|
| ManufacturingProcess03 | 15 | 8.5227273 |
| ManufacturingProcess11 | 10 | 5.6818182 |
| ManufacturingProcess10 | 9 | 5.1136364 |
| ManufacturingProcess25 | 5 | 2.8409091 |
| ManufacturingProcess26 | 5 | 2.8409091 |
| ManufacturingProcess27 | 5 | 2.8409091 |
| ManufacturingProcess28 | 5 | 2.8409091 |
| ManufacturingProcess29 | 5 | 2.8409091 |
| ManufacturingProcess30 | 5 | 2.8409091 |
| ManufacturingProcess31 | 5 | 2.8409091 |
| ManufacturingProcess33 | 5 | 2.8409091 |
| ManufacturingProcess34 | 5 | 2.8409091 |
| ManufacturingProcess35 | 5 | 2.8409091 |
| ManufacturingProcess36 | 5 | 2.8409091 |
| ManufacturingProcess02 | 3 | 1.7045455 |
| ManufacturingProcess06 | 2 | 1.1363636 |
| ManufacturingProcess01 | 1 | 0.5681818 |
| ManufacturingProcess04 | 1 | 0.5681818 |
| ManufacturingProcess05 | 1 | 0.5681818 |
| ManufacturingProcess07 | 1 | 0.5681818 |
| ManufacturingProcess08 | 1 | 0.5681818 |
| ManufacturingProcess12 | 1 | 0.5681818 |
| ManufacturingProcess14 | 1 | 0.5681818 |
| ManufacturingProcess22 | 1 | 0.5681818 |
| ManufacturingProcess23 | 1 | 0.5681818 |
| ManufacturingProcess24 | 1 | 0.5681818 |
| ManufacturingProcess40 | 1 | 0.5681818 |
| ManufacturingProcess41 | 1 | 0.5681818 |
Because we see some patterns in how features are missing, I'll use a KNN impute approach. Note that caret's knnImpute also normalizes the data by default, which is annoying, so I'm going to use knn.impute from bnstruct instead.
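The imputation chunk isn't shown; a minimal sketch of the bnstruct call (x_imputed is the imputed matrix used below):
library(bnstruct)
# k-NN imputation on the raw predictor matrix; unlike caret's knnImpute,
# this leaves the data on its original scale
x_imputed <- knn.impute(as.matrix(x_raw), k = 10)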
Let’s check for and drop any features with near zero variance
# Check for columns with little variance - candidate to drop
lowVariance <- nearZeroVar(x_imputed, names = TRUE)
head(lowVariance)
## [1] "BiologicalMaterial07"
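Only BiologicalMaterial07 is flagged, so we drop it before moving on (a one-line sketch; x_filtered is an assumed intermediate name):
# drop the near-zero-variance column(s)
x_filtered <- x_imputed[, setdiff(colnames(x_imputed), lowVariance)]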
Let's deal with outliers by replacing them with the MEDIAN of that feature:
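The replacement rule isn't shown in the original; here is a sketch assuming a simple z-score cutoff, where values more than 3 standard deviations from the column mean count as outliers:
# replace per-column outliers (|z| > 3) with that column's median
x_outliers <- apply(x_filtered, 2, function(col) {
  col[abs((col - mean(col)) / sd(col)) > 3] <- median(col)
  col
})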
Now, let's drop any columns with high pairwise correlations:
# Find and drop high correlation features
correlations <- cor(x_outliers)
highCorr <- findCorrelation(correlations, names=TRUE, cutoff=0.9)
(highCorr)
## [1] "BiologicalMaterial02" "ManufacturingProcess26" "BiologicalMaterial04"
## [4] "BiologicalMaterial12" "ManufacturingProcess11" "ManufacturingProcess27"
## [7] "ManufacturingProcess44" "ManufacturingProcess40"
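Dropping those eight columns leaves 48 predictors. The drop step isn't shown; here is a sketch of what produced the matrix printed below (x_final is an assumed name):
# remove the highly correlated columns and inspect the result
x_final <- x_outliers[, setdiff(colnames(x_outliers), highCorr)]
head(x_final)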
## BiologicalMaterial01 BiologicalMaterial03 BiologicalMaterial05
## [1,] 6.25 56.97 19.51
## [2,] 8.01 67.48 19.36
## [3,] 8.01 67.48 19.36
## [4,] 8.01 67.48 19.36
## [5,] 7.47 72.25 17.91
## [6,] 6.12 65.31 21.79
## BiologicalMaterial06 BiologicalMaterial08 BiologicalMaterial09
## [1,] 43.73 16.66 11.44
## [2,] 53.14 19.04 12.55
## [3,] 53.14 19.04 12.55
## [4,] 53.14 19.04 12.55
## [5,] 54.66 18.22 12.80
## [6,] 51.23 18.30 12.13
## BiologicalMaterial10 BiologicalMaterial11 ManufacturingProcess01
## [1,] 3.46 138.09 11.1
## [2,] 3.46 153.67 0.0
## [3,] 3.46 153.67 0.0
## [4,] 3.46 153.67 0.0
## [5,] 3.05 147.61 10.7
## [6,] 3.78 151.88 12.0
## ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## [1,] 0 1.55 934
## [2,] 0 1.55 917
## [3,] 0 1.55 912
## [4,] 0 1.55 911
## [5,] 0 1.55 918
## [6,] 0 1.55 924
## ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## [1,] 1092.2 207.5 177
## [2,] 1032.2 210.0 177
## [3,] 1003.6 207.1 178
## [4,] 1014.6 213.3 177
## [5,] 1027.5 205.7 178
## [6,] 1016.8 208.9 178
## ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## [1,] 177 43.00 8.9
## [2,] 178 46.57 8.9
## [3,] 178 45.07 8.9
## [4,] 177 44.92 9.0
## [5,] 178 44.96 10.7
## [6,] 178 45.32 8.9
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## [1,] 0 35.5 4898
## [2,] 0 34.0 4869
## [3,] 0 34.8 4878
## [4,] 0 34.8 4897
## [5,] 0 34.6 4992
## [6,] 0 34.0 4985
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## [1,] 6108 4682 35.5
## [2,] 6095 4617 34.0
## [3,] 6087 4617 34.8
## [4,] 6102 4635 34.8
## [5,] 6233 4733 33.9
## [6,] 6222 4786 33.4
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## [1,] 4865 6049 4665
## [2,] 4867 6097 4621
## [3,] 4877 6078 4621
## [4,] 4872 6073 4611
## [5,] 4886 6102 4659
## [6,] 4862 6115 4696
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## [1,] 0.0 7 2
## [2,] 0.0 3 0
## [3,] 0.0 4 1
## [4,] 0.0 5 2
## [5,] -0.7 8 4
## [6,] -0.6 9 1
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess28
## [1,] 3 4873 10.7
## [2,] 3 4869 11.2
## [3,] 4 4897 11.1
## [4,] 5 4892 11.1
## [5,] 18 4930 11.3
## [6,] 1 4871 11.4
## ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## [1,] 21.0 9.9 69.1
## [2,] 21.4 9.9 68.7
## [3,] 21.3 9.4 69.3
## [4,] 21.3 9.4 69.3
## [5,] 21.6 9.0 69.4
## [6,] 21.7 10.1 68.2
## ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## [1,] 156 66 2.4
## [2,] 169 66 2.6
## [3,] 173 66 2.6
## [4,] 171 68 2.5
## [5,] 171 70 2.5
## [6,] 173 70 2.5
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## [1,] 486 0.019 0.5
## [2,] 508 0.019 2.0
## [3,] 509 0.018 0.7
## [4,] 496 0.018 1.2
## [5,] 468 0.017 0.2
## [6,] 490 0.018 0.4
## ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess41
## [1,] 3 7.2 0.00
## [2,] 2 7.2 0.15
## [3,] 2 7.2 0.00
## [4,] 2 7.2 0.00
## [5,] 2 7.3 0.00
## [6,] 2 7.2 0.00
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess45
## [1,] 11.6 3.0 2.4
## [2,] 11.1 0.9 2.2
## [3,] 12.0 1.0 2.3
## [4,] 10.6 1.1 2.1
## [5,] 11.0 1.1 2.1
## [6,] 11.5 2.2 2.0
Now, let's transform our features (center, scale, and Box-Cox) to put them on comparable scales and reduce skew:
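A sketch of the transformation step, assuming caret's preProcess (consistent with the summary printed below; x_transf is the transformed set used for the split):
# estimate Box-Cox / center / scale transforms, print the summary, and apply them
(trans <- preProcess(as.data.frame(x_final), method = c('BoxCox', 'center', 'scale')))
x_transf <- predict(trans, as.data.frame(x_final))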
## Created from 176 samples and 48 variables
##
## Pre-processing:
## - Box-Cox transformation (33)
## - centered (48)
## - ignored (0)
## - scaled (48)
##
## Lambda estimates for Box-Cox transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.00000 -2.00000 0.00000 -0.08182 2.00000 2.00000
Time to split into training and test sets:
# get training/test split
trainingRows <- createDataPartition(y_raw, p=0.8, list=FALSE)
# Build training datasets
trainX <- x_transf[trainingRows,]
trainY <- y_raw[trainingRows]
# put remaining rows into the test sets
testX <- x_transf[-trainingRows,]
testY <- y_raw[-trainingRows]
# Build a DF
trainingData <- as.data.frame(trainX)
trainingData$Yield <- trainY
Build a model:
set.seed(100)
# Elastic net (lambda = 0 corresponds to the lasso)
enetGrid <- expand.grid(.lambda=c(0, 0.01, 0.1),
.fraction=seq(0.05, 1, length = 20))
ctrl <- trainControl(method='cv', number=10)
enetTune <- train(trainX, trainY,
method="enet",
tuneGrid = enetGrid,
trControl = ctrl)
plot(enetTune)
## Length Class Mode
## call 4 -none- call
## actions 77 -none- list
## allset 48 -none- numeric
## beta.pure 3696 -none- numeric
## vn 48 -none- character
## mu 1 -none- numeric
## normx 48 -none- numeric
## meanx 48 -none- numeric
## lambda 1 -none- numeric
## L1norm 77 -none- numeric
## penalty 77 -none- numeric
## df 77 -none- numeric
## Cp 77 -none- numeric
## sigma2 1 -none- numeric
## xNames 48 -none- character
## problemType 1 -none- character
## tuneValue 2 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
lassoPred <- predict(enetTune, newdata=trainX)
lassoValues <- data.frame(obs = trainY, pred=lassoPred)
colnames(lassoValues) = c('obs', 'pred')
(train_values <- defaultSummary(lassoValues))
## RMSE Rsquared MAE
## 1.0072681 0.6971485 0.8157123
lassoPred <- predict(enetTune, newdata=testX)
lassoValues <- data.frame(obs = testY, pred=lassoPred)
colnames(lassoValues) = c('obs', 'pred')
(test_values <- defaultSummary(lassoValues))
## RMSE Rsquared MAE
## 1.1413167 0.6794059 0.9102490
With the training data, we had an RMSE of 1.007 and an \(R^2\) of 0.697. On the test data, the RMSE was 1.141 with an \(R^2\) of 0.679. These seem reasonable and quite good; as expected, the test RMSE is somewhat higher than the training RMSE.
We have more Manufacturing features in the top 20 (15) than Biological (5). That said, we started with far more Manufacturing features (45 vs. 12), so this could simply reflect the imbalance in predictor counts… or maybe the Manufacturing Processes really do have more impact than the base Biological material.
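The enetImp object used below isn't defined in the excerpt; presumably it came from caret's varImp, along these lines:
# model-based variable importance for the tuned elastic net
enetImp <- varImp(enetTune)
plot(enetImp, top = 20)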
# Get the top 20 feature names (dplyr's arrange() drops row names, so order by index instead)
imp <- enetImp[["importance"]]
features <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
cmp_imp <- as.data.frame(x_outliers)[features]
cmp_imp$Yield <- cmp$Yield
featurePlot(x = cmp_imp[features],
y = cmp_imp$Yield,
plot = "scatter",
type = c("p", 'smooth'),
span = .5,
pch = 20)
# Show feature correlations/target by decreasing correlation
#stack(sort(cor(clean_df[,1], clean_df[,2:ncol(clean_df)])[,], decreasing=TRUE))
Since we cannot affect the Biological features, we can ignore those for now. For the Manufacturing features, however, we can look for those that show a relationship with yield and try steering the process toward the more favorable values. At this point we haven't established causation, merely correlation, so changing those values might not lead to improvements; still, they are good candidates for exploration. I would set up experiments that independently change each candidate Manufacturing feature, starting from the top of the list (by feature importance), to see whether it is a lever for improving yield. Note: just because a variable appears at the top of the feature importance doesn't mean it is the lever with the most impact if changed, but it's a good place to start and better than randomly adjusting manufacturing levers.