Week 10 HW - Data 624
KJ 6.2
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
Part A
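The chunk that loads the data is not echoed here; a minimal sketch, assuming the data come from the AppliedPredictiveModeling package as the exercise describes:
library(AppliedPredictiveModeling)
data(permeability)  # loads the fingerprints and permeability objects
dim(fingerprints)   # 1,107 binary substructure indicators for 165 compounds
dim(permeability)   # single-column permeability response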
## [1] 165 1107
## [1] 165 1
The fingerprints dataset has 165 rows and 1,107 columns, whereas the permeability dataset is a single variable: the permeability value for each of the 165 compounds.
Part B
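The filtering chunk is not echoed either; a sketch of how it was likely done with caret's nearZeroVar. The fp and pm names are assumed to be the filtered predictor matrix and the response vector used in the partitioning code of Part C.
library(caret)
low_var <- nearZeroVar(fingerprints)   # near-zero-variance columns
fp <- fingerprints[, -low_var]         # filtered predictor matrix
pm <- permeability[, 1]                # response as a plain numeric vector
paste(dim(fp), collapse = " ")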
## [1] "165 388"
There is a big drop in the number of columns as a result of the nearZeroVar function; only 388 predictors remain.
Part C
I hope that Max Kuhn updates this textbook with the new tidymodels framework; doing these data partitions in caret is not as elegant. We will run cross-validation for the PLS model and plot the RMSE aggregated across the folds.
set.seed(9450)
trainingRows <- createDataPartition(pm, p=0.8, list=FALSE)
train_X <- fp[trainingRows,]
train_y <- pm[trainingRows]
test_X <- fp[-trainingRows,]
test_y <- pm[-trainingRows]
trainingData <- as.data.frame(train_X)
trainingData$Perm <- train_y
ctrl <- trainControl(method='cv', number=10)
pls_tune <- train(train_X, train_y,
                  method="pls",
                  tuneLength=20,
                  trControl = ctrl)
plot(pls_tune)
Looks like 7 components gives the minimum RMSE.
## [1] 7
Yup. And the corresponding resampled \(R^{2}\):
## [1] 0.535406
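The chunk that printed these two values is not shown; a sketch of how they could be pulled from the caret tuning object. This also defines num_comp, which the prediction code below relies on.
num_comp <- pls_tune$bestTune$ncomp                            # optimal number of components
num_comp
pls_tune$results$Rsquared[pls_tune$results$ncomp == num_comp]  # resampled R^2 at that setting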
Part D
Predict the response for the test set. What is the test set estimate of \(R^{2}\)?
pls_fit <- pls::plsr(Perm ~., data=trainingData, ncomp=num_comp)
pls_pred <- predict(pls_fit, test_X, ncomp=num_comp)
pls_data <- data.frame(obs = test_y, pred=pls_pred)
colnames(pls_data) = c('obs', 'pred')
defaultSummary(pls_data)
## RMSE Rsquared MAE
## 11.0335910 0.5013147 7.9989932
Not amazing for a more traditional, laboratory-driven science; \(R^{2} = 0.501\) leaves us wanting.
Part E
Try building other models discussed in this chapter. Do any have better predictive performance?
We will start with a ridge model.
# SLOW - only run if necessary!
set.seed(9450)
ridge_grid <- data.frame(.lambda=seq(0, 0.2, length=21))
ctrl <- trainControl(method='cv', number=10)
ridge_fit <- train(train_X,
                   train_y,
                   method="ridge",
                   tuneGrid = ridge_grid,
                   trControl = ctrl)
ridge_fit
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 121, 118, 119, 120, 118, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00 14.80785 0.3830517 10.829144
## 0.01 26.65945 0.3730143 18.106596
## 0.02 14.16303 0.3967416 10.460063
## 0.03 13.72493 0.4112870 10.152576
## 0.04 13.44947 0.4223362 9.941886
## 0.05 13.27366 0.4308666 9.807140
## 0.06 13.10813 0.4376126 9.711171
## 0.07 12.99534 0.4441777 9.648426
## 0.08 12.92424 0.4481606 9.629948
## 0.09 12.84905 0.4518230 9.591166
## 0.10 12.77074 0.4563365 9.555339
## 0.11 12.77780 0.4577874 9.580700
## 0.12 12.77244 0.4606524 9.593622
## 0.13 12.72514 0.4642391 9.577156
## 0.14 12.70184 0.4664237 9.575529
## 0.15 12.66699 0.4684782 9.559133
## 0.16 12.66902 0.4703680 9.570430
## 0.17 12.65046 0.4723218 9.568137
## 0.18 12.66479 0.4739432 9.591773
## 0.19 12.65435 0.4754373 9.592564
## 0.20 12.69938 0.4769667 9.640392
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.17.
This better be worth the time it took.
ridge_pred <- predict(ridge_fit, newdata=test_X)
ridge_data <- data.frame(obs = test_y, pred=ridge_pred)
colnames(ridge_data) = c('obs', 'pred')
defaultSummary(ridge_data)
## RMSE Rsquared MAE
## 10.7076085 0.5592362 7.7623095
That’s a little better! I’ll take any improvement over only being able to explain half the variance in the data.
Now for ElasticNet.
set.seed(9450)
enet_grid <- expand.grid(.lambda=c(0, 0.01, 0.1),
                         .fraction=seq(0.05, 1, length = 20))
ctrl <- trainControl(method='cv', number=10)
enet_tune <- train(train_X, train_y,
                   method="enet",
                   tuneGrid = enet_grid,
                   trControl = ctrl)
plot(enet_tune)
enet_pred <- predict(enet_tune, newdata=test_X)
enet_data <- data.frame(obs = test_y, pred=enet_pred)
colnames(enet_data) = c('obs', 'pred')
defaultSummary(enet_data)
## RMSE Rsquared MAE
## 10.4453044 0.5376642 7.6238743
Looks like the ridge and ElasticNet models were the best, but they are still not great. If this is all we have, then we should try to use it as a screening tool without canceling the ongoing laboratory experiments.
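For a side-by-side view, the three test-set summaries computed above can be stacked into one table (a sketch reusing the objects already defined):
rbind(PLS   = defaultSummary(pls_data),
      Ridge = defaultSummary(ridge_data),
      ENet  = defaultSummary(enet_data))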
KJ 6.3
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
Part A
data("ChemicalManufacturingProcess")
cmp <- as.data.frame(ChemicalManufacturingProcess)
x_raw <- cmp[,2:58]
y_raw <- as.matrix(cmp$Yield)
dim(x_raw)
dim(y_raw)
## [1] 176 57
## [1] 176 1
Similar to 6.2: a feature set (176 rows by 57 predictors) and a single-column target, the yield.
Part B
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
There is not a ton of missing data, but it will be nice to fill these values in. Several of the variables are missing the same number of cells, which might be telling.
KNN imputation looks like a good choice.
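The exploration behind those comments is not shown; a minimal sketch of how the missingness could be summarized:
miss_counts <- colSums(is.na(x_raw))                    # missing cells per predictor
sort(miss_counts[miss_counts > 0], decreasing = TRUE)   # predictors with any gaps
sum(is.na(x_raw)) / prod(dim(x_raw))                    # overall fraction of missing cells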
Part C
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
We will take a standard approach to this and possibly re-assess once we see the results. KNN imputation seems like a good idea. From there we will drop the low-variance predictors. Then, because these models like normally distributed, centered, and scaled data, we will transform accordingly. I am leaving the outliers rather than deleting them or replacing them with a summary statistic because they may be meaningful. Also, we are using cross-validation to compare results, which should give some robustness even with outliers present.
x_imp <- bnstruct::knn.impute(as.matrix(x_raw), k=10)   # fill missing cells with 10-nearest-neighbour imputation
low_var <- nearZeroVar(x_imp)                           # index the near-zero-variance predictors
x_lowvar <- x_imp[,-low_var]                            # drop them
x_trans <- preProcess(x_lowvar, method=c('center', 'scale', 'BoxCox'))
x_trans <- predict(x_trans, x_lowvar)                   # apply the centering/scaling/Box-Cox transform
trainingRows <- createDataPartition(y_raw, p=0.8, list=FALSE)
train_X <- x_trans[trainingRows,]
train_y <- y_raw[trainingRows]
test_X <- x_trans[-trainingRows,]
test_y <- y_raw[-trainingRows]
trainingData <- as.data.frame(train_X)
trainingData$Yield <- train_y
set.seed(9450)
enet_grid <- expand.grid(.lambda=c(0, 0.01, 0.1),
                         .fraction=seq(0.05, 1, length = 20))
ctrl <- trainControl(method='cv', number=10)
enet_tune <- train(train_X, train_y,
                   method="enet",
                   tuneGrid = enet_grid,
                   trControl = ctrl)
plot(enet_tune)
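To answer the question about the optimal value of the performance metric, the chosen tuning parameters and their cross-validated RMSE can be read off the tuning object (a sketch):
enet_tune$bestTune                          # selected lambda and fraction
min(enet_tune$results$RMSE, na.rm = TRUE)   # best cross-validated RMSE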
Part D
Predict and collect metrics. The predictions below are made on the training rows, so these numbers describe the in-sample fit.
enet_pred <- predict(enet_tune, newdata=train_X)
enet_data <- data.frame(obs = train_y, pred=enet_pred)
colnames(enet_data) = c('obs', 'pred')
model_results <- defaultSummary(enet_data)
model_results
## RMSE Rsquared MAE
## 0.9491038 0.7405267 0.7680710
This seems like a good result, although because these are training-set predictions they will be somewhat optimistic; the cross-validation results and a test-set check are the fairer yardsticks.
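Since those metrics come from the training rows, the actual test-set estimate would follow the same pattern on the held-out data (a sketch; enet_test_pred and enet_test_data are hypothetical names):
enet_test_pred <- predict(enet_tune, newdata = test_X)             # predict on held-out rows
enet_test_data <- data.frame(obs = test_y, pred = enet_test_pred)
defaultSummary(enet_test_data)                                     # test-set RMSE, R-squared, MAE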
Part E
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
This is good: the top four most important features are all manufacturing processes, which means we have some amount of control over them. Seven of the top ten features are related to the manufacturing process.
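The importance calculation is not shown; it was presumably done with caret's varImp, roughly as sketched below. This also defines the feature_value object referenced in the plotting chunk of Part F.
feature_value <- varImp(enet_tune)   # variable importance from the tuned elastic net
plot(feature_value, top = 10)        # top-10 importance plot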
Part F
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
library(dplyr)
features <-
  as.data.frame(feature_value[["importance"]]) %>%
  arrange(-Overall) %>%
  top_n(20) %>%
  rownames()
cmp_imp <- as.data.frame(x_trans)[features]
cmp_imp$Yield <- cmp$Yield
featurePlot(x = cmp_imp[features],
            y = cmp_imp$Yield,
            plot = "scatter",
            type = c("p"))
I would start with the things we can reasonably control, which are the production processes. I would look for ones that show some kind of relationship with yield and test it by changing only one process at a time. Processes 13, 32, 33, and 09 would be candidates for such trials.