Basically just running the above code, with comments to disect. Then implementing SHAP analytics with comments and speculations. Might make for an interesting paper – esp given the comment/rebuttal spat Doyle et al had with that Chinese author wrt “what” the model is actually predicting. SHAP, being an advanced – almost perfect – method for transforming a “black box” model into a “white box”, could possibly help.
Pacman package intelligently loads and, if necessary, installs packages. This improves portability and also speeds up execution as unnecessary installs/loads wont be run. The syntax is also cleaner and less tedious than the usual library/require functions.
Here, we set up freuquently re/used graphics.
Yield is the primary predictive output being considered. It is derived from actual chemical essays processed offplatform (Python?)
# ============================================================================
# Load yield data and stitch it together
# ============================================================================
# In my case, there were 3 1536-well plates and the data for each plate
# was analyzed by quadrant (via 4 384-well plates)
# make func for portability
# Plate 1.1
plate1.1 <- data.table::fread("yield_data/plate1.1.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate1.1_pdt <- plate1.1$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate1.1_pdt))
dim(plate.data) <- c(24,16)
plate.data1.1 <- t(plate.data)
# Plate 1.2
plate1.2 <- data.table::fread("yield_data/plate1.2.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate1.2_pdt <- plate1.2$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate1.2_pdt))
dim(plate.data) <- c(24,16)
plate.data1.2 <- t(plate.data)
# Plate 1.3
plate1.3 <- data.table::fread("yield_data/plate1.3.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate1.3_pdt <- plate1.3$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate1.3_pdt))
dim(plate.data) <- c(24,16)
plate.data1.3 <- t(plate.data)
# Plate 1.4
plate1.4 <- data.table::fread("yield_data/plate1.4.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate1.4_pdt <- plate1.4$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate1.4_pdt))
## Warning in as.matrix(as.numeric(plate1.4_pdt)): NAs introduced by coercion
dim(plate.data) <- c(24,16)
plate.data1.4 <- t(plate.data)
# stitch Plate 1 together into one 32x48 matrix
plate1.top <- cbind(plate.data1.1, plate.data1.2)
plate1.bottom <- cbind(plate.data1.3, plate.data1.4)
plate1 <- rbind(plate1.top, plate1.bottom)
# Plate 2.1
plate2.1 <- data.table::fread("yield_data/plate2.1.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate2.1_pdt <- plate2.1$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate2.1_pdt))
dim(plate.data) <- c(24,16)
plate.data2.1 <- t(plate.data)
# Plate 2.2
plate2.2 <- data.table::fread("yield_data/plate2.2.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate2.2_pdt <- plate2.2$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate2.2_pdt))
dim(plate.data) <- c(24,16)
plate.data2.2 <- t(plate.data)
# Plate 2.3
plate2.3 <- data.table::fread("yield_data/plate2.3.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate2.3_pdt <- plate2.3$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate2.3_pdt))
dim(plate.data) <- c(24,16)
plate.data2.3 <- t(plate.data)
# Plate 2.4
plate2.4 <- data.table::fread("yield_data/plate2.4.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate2.4_pdt <- plate2.4$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate2.4_pdt))
dim(plate.data) <- c(24,16)
plate.data2.4 <- t(plate.data)
# stitch Plate 2 together into one 32x48 matrix
plate2.top <- cbind(plate.data2.1, plate.data2.2)
plate2.bottom <- cbind(plate.data2.3, plate.data2.4)
plate2 <- rbind(plate2.top, plate2.bottom)
# Plate 3.1
plate3.1 <- data.table::fread("yield_data/plate3.1.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate3.1_pdt <- plate3.1$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate3.1_pdt))
dim(plate.data) <- c(24,16)
plate.data3.1 <- t(plate.data)
# Plate 3.2
plate3.2 <- data.table::fread("yield_data/plate3.2.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate3.2_pdt <- plate3.2$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate3.2_pdt))
dim(plate.data) <- c(24,16)
plate.data3.2 <- t(plate.data)
# Plate 3.3
plate3.3 <- data.table::fread("yield_data/plate3.3.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate3.3_pdt <- plate3.3$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate3.3_pdt))
dim(plate.data) <- c(24,16)
plate.data3.3 <- t(plate.data)
# Plate 3.4
plate3.4 <- data.table::fread("yield_data/plate3.4.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate3.4_pdt <- plate3.4$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate3.4_pdt))
dim(plate.data) <- c(24,16)
plate.data3.4 <- t(plate.data)
# stitch Plate 3 together into one 32x48 matrix
plate3.top <- cbind(plate.data3.1, plate.data3.2)
plate3.bottom <- cbind(plate.data3.3, plate.data3.4)
plate3 <- rbind(plate3.top, plate3.bottom)
Simple diagnostics plots for yield. Based on heatmaps and histograms.
# ============================================================================
# Make heatmaps and histograms
# ============================================================================
# Plate 1 heatmap and histogram
plate1.txt <- format(round(plate1, 0), nsmall = 0)
makeHeatmap1536(plate1, plate1.txt, "yield_data/plate1_heatmap.png") %>% tryCatch(error = function(e) NULL)
## quartz_off_screen
## 2
plate1 <- as.data.frame(plate1)
ggHist(plate1, "yield_data/plate1_histogram.png") %>% tryCatch(error = function(e) NULL)
## NULL
# Plate 2 heatmap and histogram
plate2.txt <- format(round(plate2, 0), nsmall = 0)
makeHeatmap1536(plate2, plate2.txt, "yield_data/plate2_heatmap.png") %>% tryCatch(error = function(e) NULL)
## quartz_off_screen
## 2
plate2 <- as.data.frame(plate2)
ggHist(plate2, "yield_data/plate2_histogram.png") %>% tryCatch(error = function(e) NULL)
## NULL
# Plate 3 heatmap and histogram
plate3.txt <- format(round(plate3, 0), nsmall = 0)
makeHeatmap1536(plate3, plate3.txt, "yield_data/plate3_heatmap.png") %>% tryCatch(error = function(e) NULL)
## quartz_off_screen
## 2
plate3 <- as.data.frame(plate3)
ggHist(plate3, "yield_data/plate3_histogram.png") %>% tryCatch(error = function(e) NULL)
## NULL
# All plates histogram
top.two <- rbind(plate1, plate2)
all.plates <- rbind(top.two, plate3)
all.plates <- as.data.frame(all.plates)
ggHist(all.plates, "yield_data/allplates_histogram.png") %>% tryCatch(error = function(e) NULL)
## NULL
Quality control based on chemical considerations. Removed manually, rather than based on descriptor file (idk?).
# Remove reactions without additive and reactions with additive 7
plate1_nocontrols <- plate1[c(-1,-5,-9,-13,-20,-24,-28,-32), c(-16,-32,-48)]
# Remove reactions without aryl halide
plate2_nocontrols <- plate2[, c(-16,-32,-48)]
plate3_nocontrols <- plate3[, c(-16,-32,-48)]
plate1_nocontrols_v <- as.vector(t(plate1_nocontrols))
plate2_nocontrols_v <- as.vector(t(plate2_nocontrols))
plate3_nocontrols_v <- as.vector(t(plate3_nocontrols))
yield_data <- c(plate1_nocontrols_v, plate2_nocontrols_v, plate3_nocontrols_v)
Chemical descriptors generated off-platform (Spartan GUI via Python). Original data is provided, but then re-scaled in an unsupervised fashion (column-wise normalization?). * Rescaling most probably done based on early prototypes or prior experience wrt direct use of nominal data
Assumed that stratified sample at (chloride, bromide, iodide) level is sufficient. Dataset manually traversed, via indices and prior knowledge wrt table structure.
Create three “clean” datasets for further processing. 1. yield = NA removed and split according to plates (1-3) 2. yield = NA only as control 3. yield = NA removed * order of operation relevant dt (untouched) output.scaled dependency for 1 and 2
Results exammined per histograms.
# ============================================================================
# Remove reactions without yield data and make histogram
# ============================================================================
# separate into plates 1,2,3 and remove NA
output.plate1 <- output.scaled[1:1080, ] ## plates manually selected via indices
output.plate1 <- output.plate1[!(is.na(output.plate1$yield)), ]
output.plate2 <- output.scaled[1081:2520, ] ## plates manually selected via indices
output.plate2 <- output.plate2[!(is.na(output.plate2$yield)), ]
output.plate3 <- output.scaled[2521:3960, ] ## plates manually selected via indices
output.plate3 <- output.plate3[!(is.na(output.plate3$yield)), ]
# control; keep only rows where yield=NA
output.control.scaled <- output.scaled[(is.na(output.scaled$yield)), ]
# remove rows where yield=NA
output.scaled <- output.scaled[!(is.na(output.scaled$yield)), ]
# Histogram for modeling yields (removed controls and additive 7)
ggHist(output.scaled$yield, "yield_data/allplates_nocontrols_histogram.png") %>% tryCatch(error = function(e) NULL)
## NULL
# Histogram for modeling yields (controls only)
ggHist(output.control.scaled$yield, "yield_data/allplates_controlsonly_histogram.png") %>% tryCatch(error = function(e) NULL)
## NULL
CV applied to ever-increasing data set sizes.
# ============================================================================
# Data splitting for modeling
# ============================================================================
# Set sample ratio for 70/30 split
sample.ratio = 0.7
# Split into training and test set (70/30)
set.seed(1084)
size <- round(sample.ratio*nrow(output.scaled))
training <- sample(nrow(output.scaled), size=size, replace=FALSE)
training.scaled <- output.scaled[training,]
test.scaled <- output.scaled[-training,]
# Create smaller partitions within training set (equal to 10, 20, etc. % of TOTAL data)
size2.5 <- round(0.025*nrow(output.scaled))
size5 <- round(0.05*nrow(output.scaled))
size10 <- round(0.10*nrow(output.scaled))
size20 <- round(0.20*nrow(output.scaled))
size30 <- round(0.30*nrow(output.scaled))
size40 <- round(0.40*nrow(output.scaled))
size50 <- round(0.50*nrow(output.scaled))
size60 <- round(0.60*nrow(output.scaled))
# Sampled from within training set to avoid using test set data
training2.5rows <- sample(nrow(training.scaled),size=size2.5,replace=FALSE)
training2.5 <- training.scaled[training2.5rows, ]
training5rows <- sample(nrow(training.scaled),size=size5,replace=FALSE)
training5 <- training.scaled[training5rows, ]
training10rows <- sample(nrow(training.scaled),size=size10,replace=FALSE)
training10 <- training.scaled[training10rows, ]
training20rows <- sample(nrow(training.scaled),size=size20,replace=FALSE)
training20 <- training.scaled[training20rows, ]
training30rows <- sample(nrow(training.scaled),size=size30,replace=FALSE)
training30 <- training.scaled[training30rows, ]
training40rows <- sample(nrow(training.scaled),size=size40,replace=FALSE)
training40 <- training.scaled[training40rows, ]
training50rows <- sample(nrow(training.scaled),size=size50,replace=FALSE)
training50 <- training.scaled[training50rows, ]
training60rows <- sample(nrow(training.scaled),size=size60,replace=FALSE)
training60 <- training.scaled[training60rows, ]
# 10-fold cross-validation
train_control <- trainControl(method="cv", number=10, savePredictions=TRUE)
Previously trained models may be loaded manually here.
# ============================================================================
# Read in previously trained models saved as .rds files
# ============================================================================
# Run to read in previously trained models
lmFit.reduced <- readRDS("rds/lmFit_reduced.rds")
lmFit_top5 <- readRDS("rds/lmFit_top5.rds")
knnFit <- readRDS("rds/knnFit.rds")
svmFit <- readRDS("rds/svmFit.rds")
bayesglmFit <- readRDS("rds/bayesglmFit.rds")
lmFit <- readRDS("rds/lmFit.rds")
nnetFit.100nodes <- readRDS("rds/nnetFit_100nodes.rds")
rfFit <- readRDS("rds/rfFit.rds")
rfFit2.5 <- readRDS("rds/rfFit2_5.rds")
rfFit5 <- readRDS("rds/rfFit5.rds")
rfFit10 <- readRDS("rds/rfFit10.rds")
rfFit20 <- readRDS("rds/rfFit20.rds")
rfFit30 <- readRDS("rds/rfFit30.rds")
rfFit40 <- readRDS("rds/rfFit40.rds")
rfFit50 <- readRDS("rds/rfFit50.rds")
rfFit60 <- readRDS("rds/rfFit60.rds")
rfFit70 <- readRDS("rds/rfFit70.rds")
rfFit.LOO <- readRDS("rds/rfFit_LOO.rds")
rfFit.ArCl <- readRDS("rds/rfFit_ArCl.rds")
rfFit.ArBr <- readRDS("rds/rfFit_ArBr.rds")
rfFit.ArI <- readRDS("rds/rfFit_ArI.rds")
rfFit.ArBr.all <- readRDS("rds/rfFit_ArBr_all.rds")
rfFit.nonpyridyl <- readRDS("rds/rfFit_nonpyridyl.rds")
rfFit.under80 <- readRDS("rds/rfFit_under80.rds")
Benchmarking and qualitative comparison of alternative models, both linear and machine learning. ___________________
First, up regularized linear regression: Lasso, Ridge, and Elastic Net
# ============================================================================
# Regularized linear regression: Lasso, Ridge, and Elastic Net
# ============================================================================
set.seed(1533)
x <- as.matrix(training.scaled[, 1:120])
x.test <- as.matrix(test.scaled[, 1:120])
y <- training.scaled[, 121]
y.test <- test.scaled[, 121]
# model type determined by the 'alpha' setting; lasso(alpha=1), elastic_net(alpha<1 AND >0), ridge(alpha=0)
fit.lasso <- glmnet(x, y, family="gaussian", alpha=1) # lasso
fit.elnet <- glmnet(x, y, family="gaussian", alpha=0.5) # elastic net
fit.ridge <- glmnet(x, y, family="gaussian", alpha=0) # ridge
# create and execute 10-step spectrum wrt model type
for (i in 0:10) {
assign(paste("fit", i, sep=""), cv.glmnet(x, y, type.measure="mse",
alpha=i/10,family="gaussian"))
}
# Plot solution paths
png(filename="R/plots/regularization_solution_paths.png", width = 800, height = 1000)
par(mfrow=c(3,2))
plot(fit.lasso, xvar="lambda")
plot(fit10, main="LASSO")
plot(fit.ridge, xvar="lambda")
plot(fit0, main="Ridge")
plot(fit.elnet, xvar="lambda")
plot(fit2, main="Elastic Net")
dev.off()
## quartz_off_screen
## 2
# Predict y values and calculate RMSE and R^2 for range of alpha values
yhat0 <- predict(fit0, s=fit0$lambda.1se, newx=x.test) # ridge
yhat1 <- predict(fit1, s=fit1$lambda.1se, newx=x.test) # elastic net
yhat2 <- predict(fit2, s=fit2$lambda.1se, newx=x.test) # elastic net
yhat3 <- predict(fit3, s=fit3$lambda.1se, newx=x.test) # elastic net
yhat4 <- predict(fit4, s=fit4$lambda.1se, newx=x.test) # elastic net
yhat5 <- predict(fit5, s=fit5$lambda.1se, newx=x.test) # elastic net
yhat6 <- predict(fit6, s=fit6$lambda.1se, newx=x.test) # elastic net
yhat7 <- predict(fit7, s=fit7$lambda.1se, newx=x.test) # elastic net
yhat8 <- predict(fit8, s=fit8$lambda.1se, newx=x.test) # elastic net
yhat9 <- predict(fit9, s=fit9$lambda.1se, newx=x.test) # elastic net
yhat10 <- predict(fit10, s=fit10$lambda.1se, newx=x.test) # lasso
# Calculate RMSE
rmse0 <- rmse(yhat0, y.test)
rmse1 <- rmse(yhat1, y.test)
rmse2 <- rmse(yhat2, y.test)
rmse3 <- rmse(yhat3, y.test)
rmse4 <- rmse(yhat4, y.test)
rmse5 <- rmse(yhat5, y.test)
rmse6 <- rmse(yhat6, y.test)
rmse7 <- rmse(yhat7, y.test)
rmse8 <- rmse(yhat8, y.test)
rmse9 <- rmse(yhat9, y.test)
rmse10 <- rmse(yhat10, y.test)
# Calculate R^2
rsquared0 <- cor(yhat0, y.test)**2
rsquared1 <- cor(yhat1, y.test)**2
rsquared2 <- cor(yhat2, y.test)**2
rsquared3 <- cor(yhat3, y.test)**2
rsquared4 <- cor(yhat4, y.test)**2
rsquared5 <- cor(yhat5, y.test)**2
rsquared6 <- cor(yhat6, y.test)**2
rsquared7 <- cor(yhat7, y.test)**2
rsquared8 <- cor(yhat8, y.test)**2
rsquared9 <- cor(yhat9, y.test)**2
rsquared10 <- cor(yhat10, y.test)**2
# generate metrix for plotting
mfit1 <- cv.glmnet(x, y, type.measure="mse", alpha=0.01, family="gaussian")
mfit2 <- cv.glmnet(x, y, type.measure="mse", alpha=0.1, family="gaussian")
mfit3 <- cv.glmnet(x, y, type.measure="mse", alpha=0.2, family="gaussian")
mfit4 <- cv.glmnet(x, y, type.measure="mse", alpha=0.5, family="gaussian")
mfit5 <- cv.glmnet(x, y, type.measure="mse", alpha=1, family="gaussian")
mfit1.pred <- predict(mfit1, s=mfit1$lambda.1se, newx=x.test)
mfit2.pred <- predict(mfit2, s=mfit2$lambda.1se, newx=x.test)
mfit3.pred <- predict(mfit3, s=mfit3$lambda.1se, newx=x.test)
mfit4.pred <- predict(mfit4, s=mfit4$lambda.1se, newx=x.test)
mfit5.pred <- predict(mfit5, s=mfit5$lambda.1se, newx=x.test)
mfit1.rmse <- rmse(mfit1.pred, y.test)
mfit2.rmse <- rmse(mfit2.pred, y.test)
mfit3.rmse <- rmse(mfit3.pred, y.test)
mfit4.rmse <- rmse(mfit4.pred, y.test)
mfit5.rmse <- rmse(mfit5.pred, y.test)
mfit1.r2 <- cor(mfit1.pred, y.test)**2
mfit2.r2 <- cor(mfit2.pred, y.test)**2
mfit3.r2 <- cor(mfit3.pred, y.test)**2
mfit4.r2 <- cor(mfit4.pred, y.test)**2
mfit5.r2 <- cor(mfit5.pred, y.test)**2
# Plot RMSE and R^2 for different values of alpha
df <- data.frame(alpha = c('0 (LASSO)', '0.01', '0.1', '0.2', '0.5', '1 (Ridge)'),
rmse = c(rmse0, mfit1.rmse, mfit2.rmse, mfit3.rmse, mfit4.rmse, mfit5.rmse),
r2 = c(rsquared0, mfit1.r2, mfit2.r2, mfit3.r2, mfit4.r2, mfit5.r2))
rmse.plot <- ggplot(df, aes(x=rmse, y=alpha)) +
geom_point() +
geom_text(label=round(df$rmse, 2), vjust=-1, size=3) +
labs(x='RMSE', y='alpha') +
xlim(15,16.5)
r2.plot <- ggplot(df, aes(x=r2, y=alpha)) +
geom_point() +
geom_text(label=round(df$r2, 4), vjust=-1, size=3) +
labs(x='Rsquared', y='alpha') +
xlim(0.64, 0.68)
plots <- arrangeGrob(rmse.plot, r2.plot, ncol=2)
ggsave(plots, file="R/plots/regularized_models.png", width=8, height=3)
# print plots
print(plots)
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
Feature selection via a priori removal of correlated features
# ============================================================================
# Dimension reduction by removing correlated descriptors
# ============================================================================
# Read in data for each reaction component
# These tables contain 1 row per different molecule (e.g., 22 rows in additive.csv)
additive <- read.csv("R/additive.csv", header=TRUE)
aryl.halide <- read.csv("R/aryl_halide.csv", header=TRUE)
base <- read.csv("R/base.csv", header=TRUE)
ligand <- read.csv("R/ligand.csv", header=TRUE)
# Correlation plots
ggcorr(additive, label=TRUE, alpha=0)
## Warning in ggcorr(additive, label = TRUE, alpha = 0): data in column(s) 'name'
## are not numeric and were ignored
ggsave(file="R/plots/additive_corrplot.png", width=10, height=10)
ggcorr(aryl.halide, label=TRUE, alpha=0)
## Warning in ggcorr(aryl.halide, label = TRUE, alpha = 0): data in column(s)
## 'name' are not numeric and were ignored
ggsave(file="R/plots/aryl_halide_corrplot.png", width=10, height=10)
ggcorr(base, label=TRUE, alpha=0)
## Warning in ggcorr(base, label = TRUE, alpha = 0): data in column(s) 'name' are
## not numeric and were ignored
ggsave(file="R/plots/base_corrplot.png", width=5, height=5)
ggcorr(ligand, label=FALSE, alpha=0)
## Warning in ggcorr(ligand, label = FALSE, alpha = 0): data in column(s) 'name'
## are not numeric and were ignored
ggsave(file="R/plots/ligand_corrplot.png", width=25, height=25)
# Feature selection by removing correlated features
# remove name column (need to do for cor function)
additive.num <- additive[, -which(names(additive)=="name")]
aryl.halide.num <- aryl.halide[, -which(names(aryl.halide)=="name")]
base.num <- base[, -which(names(base)=="name")]
ligand.num <- ligand[, -which(names(ligand)=="name")]
# define correlation cut.off
corr_cutoff = 0.5
# additive
additive.bad <- findCorrelation(cor(additive.num), cutoff = corr_cutoff, exact = TRUE)
additive.good <- names(additive.num[, -additive.bad])
additive.reduced <- additive.num[, additive.good]
pca.additive.reduced <- prcomp(additive.reduced)
plot(pca.additive.reduced)
ggcorr(additive.reduced, label=TRUE, alpha=0)
ggsave(file="R/plots/additive_reduced_corrplot.png", width=6, height=6)
# aryl halide
aryl.halide.bad <- findCorrelation(cor(aryl.halide.num), cutoff = corr_cutoff, exact = TRUE)
aryl.halide.good <- names(aryl.halide.num[, -aryl.halide.bad])
aryl.halide.reduced <- aryl.halide.num[, aryl.halide.good]
pca.aryl.halide.reduced <- prcomp(aryl.halide.reduced)
plot(pca.aryl.halide.reduced)
ggcorr(aryl.halide.reduced, label=TRUE, alpha=0)
ggsave(file="R/plots/aryl_halide_reduced_corrplot.png", width=6, height=6)
# base
base.bad <- findCorrelation(cor(base.num), cutoff = corr_cutoff, exact = TRUE)
base.good <- names(base.num[, -base.bad])
base.reduced <- base.num[, base.good]
pca.base.reduced <- prcomp(base.reduced)
plot(pca.base.reduced)
ggcorr(base.reduced, label=TRUE, alpha=0)
ggsave(file="R/plots/base_reduced_corrplot.png", width=3, height=3)
# ligand
ligand.bad <- findCorrelation(cor(ligand.num), cutoff = corr_cutoff, exact = TRUE)
ligand.good <- names(ligand.num[, -ligand.bad])
ligand.reduced <- ligand.num[, ligand.good]
pca.ligand.reduced <- prcomp(ligand.reduced)
plot(pca.ligand.reduced)
ggcorr(ligand.reduced, label=TRUE, alpha=0)
ggsave(file="R/plots/ligand_reduced_corrplot.png", width=4, height=4)
# Subset to smaller number of dimensions
reduced.vars = c(names(additive.reduced),
names(aryl.halide.reduced),
names(base.reduced),
names(ligand.reduced),
"yield")
training.scaled.reduced <- training.scaled[, reduced.vars]
test.scaled.reduced <- test.scaled[, reduced.vars]
# Train linear model using reduced set of descriptors
lmFit.reduced <- train(yield ~ ., data=training.scaled.reduced, trControl=train_control, method="lm")
saveRDS(lmFit.reduced, "rds/lmFit_reduced.rds")
lmFit.reduced.pred <- predict(lmFit.reduced, test.scaled.reduced)
lmFit.reduced.rmse <- rmse(lmFit.reduced.pred, test.scaled.reduced$yield)
lmFit.reduced.r2 <- cor(lmFit.reduced.pred, test.scaled.reduced$yield)**2
df <- data.frame(x = lmFit.reduced.pred,
y = test.scaled.reduced$yield,
type = as.factor('Linear Model'))
p <- ggplot(df, aes(x = x, y = y)) +
geom_point(alpha = 0.4) +
scale_x_continuous(breaks = seq(-25,100,25), lim=c(-30, 100)) +
labs(x='Predicted Yield', y='Observed Yield') +
geom_segment(aes(x=0, xend=100, y=0, yend=100), linetype="dashed") +
geom_smooth(method="loess", se=FALSE)
#save plot
ggsave(file="R/plots/lm_reduced.png", width=5, height=4)
## `geom_smooth()` using formula 'y ~ x'
# display plot
print(p)
## `geom_smooth()` using formula 'y ~ x'
Various models are configured and trained. Including: KNN, SVM, Bayes GLM, linear regression, neural net, and random forest.
# ============================================================================
# Training various machine learning models (models are saved as .rds files)
# ============================================================================
# k-nearest neighbor (kNN)
knnFit <- train(yield ~ ., data=training.scaled, trControl=train_control, method="knn")
png(filename="R/plots/knn.png", width = 1000, height = 600)
predVals <- extractPrediction(list(knnFit))
plotObsVsPred(predVals)
dev.off()
## quartz_off_screen
## 2
saveRDS(knnFit, "rds/knnFit.rds")
# support vector machine (SVM)
png(filename="R/plots/svm.png", width = 1000, height = 600)
predVals <- extractPrediction(list(svmFit))
plotObsVsPred(predVals)
dev.off()
## quartz_off_screen
## 2
saveRDS(svmFit, "rds/svmFit.rds")
# Bayes generalized linear model (GLM)
bayesglmFit <- train(yield ~ ., data=training.scaled, trControl=train_control, method="bayesglm")
png(filename="R/plots/bayesglm.png", width = 1000, height = 600)
predVals <- extractPrediction(list(bayesglmFit))
plotObsVsPred(predVals)
dev.off()
## quartz_off_screen
## 2
saveRDS(bayesglmFit, "rds/bayesglmFit.rds")
# linear model
lmFit <- train(yield ~ ., data=training.scaled, trControl=train_control, method="lm")
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
png(filename="R/plots/lm.png", width = 1000, height = 600)
predVals <- extractPrediction(list(lmFit))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
plotObsVsPred(predVals)
dev.off()
## quartz_off_screen
## 2
saveRDS(lmFit, "rds/lmFit.rds")
# neural network (neuralnet package)
n <- names(training.scaled)
f <- as.formula(paste("yield ~", paste(n[!n %in% "yield"], collapse = " + ")))
set.seed(3288)
nnetFit.100nodes <- neuralnet(f, data=training.scaled, hidden=c(100), linear.output=TRUE, threshold=1, stepmax=1e7)
saveRDS(nnetFit.100nodes, "rds/nnetFit_100nodes.rds")
# random forest
set.seed(8915)
rfFit <- train(yield ~ ., data=training.scaled, trControl=train_control, method="rf", importance=TRUE)
png(filename="R/plots/rf.png", width = 1000, height = 600)
predVals <- extractPrediction(list(rfFit))
plotObsVsPred(predVals)
dev.off()
## quartz_off_screen
## 2
saveRDS(rfFit, "rds/rfFit.rds")
Here, the predictive accuracy wrt top features selected by RF model are examined via linear model.
# ============================================================================
# Calibration plot for linear model using 5 top RF descriptors
# ============================================================================
# linear model (top 5 descriptors from final RF model)
lmFit_top5 <- train(yield ~ additive_.C3_NMR_shift +
additive_E_LUMO +
aryl_halide_.C3_NMR_shift +
additive_.O1_electrostatic_charge +
additive_.C5_electrostatic_charge,
data=training.scaled, trControl=train_control, method="lm")
png(filename="R/plots/lm_top5.png", width = 1000, height = 600)
predVals <- extractPrediction(list(lmFit))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
plotObsVsPred(predVals)
dev.off()
## quartz_off_screen
## 2
saveRDS(lmFit, "rds/lmFit_top5.rds")
# Predict for testing set
lm.top5.pred <- predict(lmFit_top5, test.scaled)
# R^2 values
lm.top5.pred.r2 <- cor(lm.top5.pred, test.scaled$yield)**2
# RMSE
lm.top5.pred.rmse <- rmse(lm.top5.pred, test.scaled$yield)
# Create data frames with predicted and observed values for test set
df1 <- data.frame(x = lm.top5.pred, y = test.scaled$yield)
# Make calibration plot
p <- ggplot(df1, aes(x = x, y = y)) +
geom_point(alpha = 0.3, color="dodgerblue3", size=0.8) +
scale_x_continuous(breaks = seq(-25,100,25), lim=c(-25, 100)) +
geom_segment(aes(x=0, xend=100, y=0, yend=100), linetype="dashed", size=0.3) +
geom_smooth(method="loess", se=FALSE, size=0.5, color="black") +
labs(x='Predicted Yield', y='Observed Yield')
ggsave(file="R/plots/lm_top5_calibration_plot.png", width=5, height=3)
## `geom_smooth()` using formula 'y ~ x'
# view plot
plot(p)
## `geom_smooth()` using formula 'y ~ x'
Comparison of goodness of fit measures, R^2 and RMSE. Negative yield predictions set to zero for calculation of R^2/RMSE performance.
# ============================================================================
# RMSE and R^2 plot for different machine learning models
# ============================================================================
# Predict for testing set
lm.pred <- predict(lmFit, test.scaled)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
svm.pred <- predict(svmFit, test.scaled)
knn.pred <- predict(knnFit, test.scaled)
nnet.pred <- predict(nnetFit.100nodes, test.scaled[, 1:120])
bayesglm.pred <- predict(bayesglmFit, test.scaled)
rf.pred <- predict(rfFit, test.scaled)
# Set negative predictions to zero
lm.pred[lm.pred<0] <- 0
svm.pred[svm.pred<0] <- 0
knn.pred[knn.pred<0] <- 0
nnet.pred[nnet.pred<0] <- 0
bayesglm.pred[bayesglm.pred<0] <- 0
rf.pred[rf.pred<0] <- 0
# R^2 values
lm.pred.r2 <- cor(lm.pred, test.scaled$yield)**2
svm.pred.r2 <- cor(svm.pred, test.scaled$yield)**2
knn.pred.r2 <- cor(knn.pred, test.scaled$yield)**2
nnet.pred.r2 <- cor(nnet.pred, test.scaled$yield)**2
bayesglm.pred.r2 <- cor(bayesglm.pred, test.scaled$yield)**2
rf.pred.r2 <- cor(rf.pred, test.scaled$yield)**2
# RMSE
lm.pred.rmse <- rmse(lm.pred, test.scaled$yield)
svm.pred.rmse <- rmse(svm.pred, test.scaled$yield)
knn.pred.rmse <- rmse(knn.pred, test.scaled$yield)
nnet.pred.rmse <- rmse(nnet.pred, test.scaled$yield)
bayesglm.pred.rmse <- rmse(bayesglm.pred, test.scaled$yield)
rf.pred.rmse <- rmse(rf.pred, test.scaled$yield)
# Plot RMSE and R^2
df <- data.frame(rmse = c(lm.pred.rmse, svm.pred.rmse, knn.pred.rmse, nnet.pred.rmse, bayesglm.pred.rmse, rf.pred.rmse),
r2 = c(lm.pred.r2, svm.pred.r2, knn.pred.r2, nnet.pred.r2, bayesglm.pred.r2, rf.pred.r2))
row.names(df) <- c('Linear Model', 'SVM', 'kNN', 'Neural Network', 'Bayes GLM', 'Random Forest')
rmse.plot <- ggplot(df, aes(y=reorder(rownames(df), rmse), x=rmse)) +
geom_point() +
geom_text(label=round(df$rmse, 1), vjust=-1, size=3) +
labs(x='RMSE', y='') +
xlim(0,20)
r2.plot <- ggplot(df, aes(y=reorder(rownames(df), rmse), x=r2)) +
geom_point() +
geom_text(label=round(df$r2, 2), vjust=-1, size=3) +
labs(x='Rsquared', y='') +
xlim(0.6,1)
plots <- arrangeGrob(rmse.plot, r2.plot, ncol=2)
ggsave(plots, file="R/plots/ml_models.png", width=7, height=2.5)
# view plots
plot(plots)
All machine learning models are simultaneoulsy evaluated and plotted, predicted vs observed.
# ============================================================================
# Calibration plots for different machine learning models
# ============================================================================
# Create data frames with predicted and observed values for test set
df1 <- data.frame(x = lm.pred, y = test.scaled$yield, type = as.factor('Linear Model'))
df2 <- data.frame(x = knn.pred, y = test.scaled$yield, type = as.factor('kNN'))
df3 <- data.frame(x = svm.pred, y = test.scaled$yield, type = as.factor('SVM'))
df4 <- data.frame(x = bayesglm.pred, y = test.scaled$yield, type = as.factor('Bayes GLM'))
df5 <- data.frame(x = nnet.pred, y = test.scaled$yield, type = as.factor('Neural Network'))
df6 <- data.frame(x = rf.pred, y = test.scaled$yield, type = as.factor('Random Forest'))
# Make calibration plots
facet.df <- do.call(rbind, list(df1, df2, df3, df4, df5, df6))
facet.plot <- ggplot(facet.df, aes(x = x, y = y)) +
geom_point(alpha = 0.3, color="dodgerblue3", size=0.8) +
scale_x_continuous(breaks = seq(-25,100,25), lim=c(-25, 100)) +
geom_segment(aes(x=0, xend=100, y=0, yend=100), linetype="dashed", size=0.3) +
geom_smooth(method="loess", se=FALSE, size=0.5, color="black") +
facet_wrap(~type, ncol=3) +
labs(x='Predicted Yield', y='Observed Yield')
ggsave(file="R/plots/ml_calibration_plots.png", width=8, height=5)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).
#view plots
(facet.plot)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).
# Make calibration plots (Supporting Information)
facet.df <- do.call(rbind, list(df1, df2, df3, df4, df5, df6))
facet.plot <- ggplot(facet.df, aes(x = x, y = y)) +
geom_point(alpha = 0.3, color="dodgerblue3", size=0.8) +
scale_x_continuous(breaks = seq(-25,100,25), lim=c(-25, 100)) +
geom_segment(aes(x=0, xend=100, y=0, yend=100), linetype="dashed", size=0.3) +
geom_smooth(method="loess", se=FALSE, size=0.5, color="black") +
facet_wrap(~type, ncol=2) +
labs(x='Predicted Yield', y='Observed Yield')
ggsave(file="R/plots/ml_calibration_plots_SI.png", width=6, height=8)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).
#view plots
(facet.plot)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).
OOS prediction task is conducted on a stratified basis, per additive (idk?). * Superiority and mechanistic consistency of results confirmed, stability of RF results now explored on a stratified basis. * Large variation noted along this dimension wrt successful OOS yield prediction. * However, even poorest performing strata out-perform other, higher information models (except for neural network)
# ============================================================================
# Random forest: Predicting out-of-sample additives
# ============================================================================
# double check that row numbers are correct by testing number of unique additives
# length(unique(output.scaled$additive_.C3_NMR_shift[1:1075])) == 6 # TRUE (one more row is 7)
# length(unique(output.scaled$additive_.C3_NMR_shift[1:2515])) == 14 # TRUE (one more row is 15)
# length(unique(output.scaled$additive_.C3_NMR_shift)) == 22
# plates 1 and 2 (after NA's removed)
plate12 <- output.scaled[1:2515, ]
# plate 3 (after NA's removed)
plate3 <- output.scaled[2516:3955, ]
# leave-one-out by additive
by.additive <- split(seq_along(plate12$additive_.C3_NMR_shift), plate12$additive_.C3_NMR_shift)
tc.additive <- trainControl(method="cv", indexOut=by.additive, savePredictions = TRUE)
# leave-one-out random forest (70% of data)
set.seed(8915)
rfFit.LOO <- train(yield ~ ., data=plate12, trControl=tc.additive, method="rf", importance=TRUE)
png(filename="R/plots/rf_LOO.png", width = 1000, height = 600)
predVals <- extractPrediction(list(rfFit.LOO))
plotObsVsPred(predVals)
dev.off()
## quartz_off_screen
## 2
saveRDS(rfFit.LOO, "rds/rfFit_LOO.rds")
rf.predTrain.LOO <- predict(rfFit.LOO, plate12)
rf.predTrain.LOO.rmse <- rmse(rf.predTrain.LOO, plate12$yield)
rf.predTrain.LOO.r2 <- cor(rf.predTrain.LOO, plate12$yield)**2
rf.pred.LOO <- predict(rfFit.LOO, plate3)
rf.pred.LOO.rmse <- rmse(rf.pred.LOO, plate3$yield)
rf.pred.LOO.r2 <- cor(rf.pred.LOO, plate3$yield)**2
# Create calibration plots for additives
plate3$additive_id <- as.factor(plate3$additive_.C3_NMR_shift)
levels(plate3$additive_id) <- c('Additive 16', 'Additive 18', 'Additive 20', 'Additive 21',
'Additive 22', 'Additive 17', 'Additive 19', 'Additive 23')
plate3$additive_id <- sortLvls.fnc(plate3$additive_id, c(1, 6, 2, 7, 3, 4, 5, 8))
plate3$additive_id = factor(plate3$additive_id,levels(plate3$additive_id)[c(1, 6, 2, 7, 3, 4, 5, 8)])
df <- cbind(plate3, rf.pred.LOO)
p <- ggplot(df, aes(x=rf.pred.LOO, y=yield)) +
geom_point(alpha=0.4, aes(col=additive_id), size=1) +
labs(x='Predicted Yield', y='Observed Yield') +
xlim(0, 100) +
ylim(0, 100) +
geom_smooth(method='lm', se=FALSE, color="black", size=0.5) +
facet_wrap(~additive_id, nrow=2, ncol=4) +
geom_segment(aes(x=0, xend=100, y=0, yend=100), linetype="dashed", size=0.3) +
theme(legend.position="none")
ggsave(file="R/plots/additive_out_of_sample.png", width=8, height=4.5)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 8 rows containing missing values (geom_smooth).
# Calculate RMSE and R^2 for additives 16-23
additive.16 <- plate3[plate3$additive_id=='Additive 16', ]
rf.pred.16 <- predict(rfFit.LOO, additive.16)
rf.pred.16.rmse <- rmse(rf.pred.16, additive.16$yield)
rf.pred.16.r2 <- cor(rf.pred.16, additive.16$yield)**2
additive.17 <- plate3[plate3$additive_id=='Additive 17', ]
rf.pred.17 <- predict(rfFit.LOO, additive.17)
rf.pred.17.rmse <- rmse(rf.pred.17, additive.17$yield)
rf.pred.17.r2 <- cor(rf.pred.17, additive.17$yield)**2
additive.18 <- plate3[plate3$additive_id=='Additive 18', ]
rf.pred.18 <- predict(rfFit.LOO, additive.18)
rf.pred.18.rmse <- rmse(rf.pred.18, additive.18$yield)
rf.pred.18.r2 <- cor(rf.pred.18, additive.18$yield)**2
additive.19 <- plate3[plate3$additive_id=='Additive 19', ]
rf.pred.19 <- predict(rfFit.LOO, additive.19)
rf.pred.19.rmse <- rmse(rf.pred.19, additive.19$yield)
rf.pred.19.r2 <- cor(rf.pred.19, additive.19$yield)**2
additive.20 <- plate3[plate3$additive_id=='Additive 20', ]
rf.pred.20 <- predict(rfFit.LOO, additive.20)
rf.pred.20.rmse <- rmse(rf.pred.20, additive.20$yield)
rf.pred.20.r2 <- cor(rf.pred.20, additive.20$yield)**2
additive.21 <- plate3[plate3$additive_id=='Additive 21', ]
rf.pred.21 <- predict(rfFit.LOO, additive.21)
rf.pred.21.rmse <- rmse(rf.pred.21, additive.21$yield)
rf.pred.21.r2 <- cor(rf.pred.21, additive.21$yield)**2
additive.22 <- plate3[plate3$additive_id=='Additive 22', ]
rf.pred.22 <- predict(rfFit.LOO, additive.22)
rf.pred.22.rmse <- rmse(rf.pred.22, additive.22$yield)
rf.pred.22.r2 <- cor(rf.pred.22, additive.22$yield)**2
additive.23 <- plate3[plate3$additive_id=='Additive 23', ]
rf.pred.23 <- predict(rfFit.LOO, additive.23)
rf.pred.23.rmse <- rmse(rf.pred.23, additive.23$yield)
rf.pred.23.r2 <- cor(rf.pred.23, additive.23$yield)**2
# Print RMSE and R^2 to console
paste0("Additive 16: RMSE = ", rf.pred.16.rmse, ", R^2 = ", rf.pred.16.r2)
## [1] "Additive 16: RMSE = 6.85714247158009, R^2 = 0.906918790463747"
paste0("Additive 17: RMSE = ", rf.pred.17.rmse, ", R^2 = ", rf.pred.17.r2)
## [1] "Additive 17: RMSE = 10.5023995287875, R^2 = 0.854707343209958"
paste0("Additive 18: RMSE = ", rf.pred.18.rmse, ", R^2 = ", rf.pred.18.r2)
## [1] "Additive 18: RMSE = 13.4236762606032, R^2 = 0.742601724418996"
paste0("Additive 19: RMSE = ", rf.pred.19.rmse, ", R^2 = ", rf.pred.19.r2)
## [1] "Additive 19: RMSE = 14.8502121063655, R^2 = 0.717121476001657"
paste0("Additive 20: RMSE = ", rf.pred.20.rmse, ", R^2 = ", rf.pred.20.r2)
## [1] "Additive 20: RMSE = 8.53998933905872, R^2 = 0.889427343355382"
paste0("Additive 21: RMSE = ", rf.pred.21.rmse, ", R^2 = ", rf.pred.21.r2)
## [1] "Additive 21: RMSE = 11.9777277089739, R^2 = 0.853207037730673"
paste0("Additive 22: RMSE = ", rf.pred.22.rmse, ", R^2 = ", rf.pred.22.r2)
## [1] "Additive 22: RMSE = 13.0215459116134, R^2 = 0.780450732957853"
paste0("Additive 23: RMSE = ", rf.pred.23.rmse, ", R^2 = ", rf.pred.23.r2)
## [1] "Additive 23: RMSE = 9.18615146344195, R^2 = 0.897543482139672"
Sensitivity analysis: RF predictive performance versus sparsity/completeness of training data
# ============================================================================
# Random forest under sparsity: Training models
# ============================================================================
# Random forest (2.5% of data)
set.seed(8915)
rfFit2.5 <- train(yield ~ ., data=training2.5, trControl=train_control, method="rf")
saveRDS(rfFit2.5, "rds/rfFit2_5.rds")
# Random forest (5% of data)
set.seed(8915)
rfFit5 <- train(yield ~ ., data=training5, trControl=train_control, method="rf")
saveRDS(rfFit5, "rds/rfFit5.rds")
# Random forest (10% of data)
set.seed(8915)
rfFit10 <- train(yield ~ ., data=training10, trControl=train_control, method="rf")
saveRDS(rfFit10, "rds/rfFit10.rds")
# Random forest (20% of data)
set.seed(8915)
rfFit20 <- train(yield ~ ., data=training20, trControl=train_control, method="rf")
saveRDS(rfFit20, "rds/rfFit20.rds")
# Random forest (30% of data)
set.seed(8915)
rfFit30 <- train(yield ~ ., data=training30, trControl=train_control, method="rf")
saveRDS(rfFit30, "rds/rfFit30.rds")
# Random forest (40% of data)
set.seed(8915)
rfFit40 <- train(yield ~ ., data=training40, trControl=train_control, method="rf")
saveRDS(rfFit40, "rds/rfFit40.rds")
# Random forest (50% of data)
set.seed(8915)
rfFit50 <- train(yield ~ ., data=training50, trControl=train_control, method="rf")
saveRDS(rfFit50, "rds/rfFit50.rds")
# Random forest (60% of data)
set.seed(8915)
rfFit60 <- train(yield ~ ., data=training60, trControl=train_control, method="rf")
saveRDS(rfFit60, "rds/rfFit60.rds")
# Random forest (70% of data - all training data - rfFit70 is identical to rfFit)
set.seed(8915)
rfFit70 <- train(yield ~ ., data=training.scaled, trControl=train_control, method="rf", importance=TRUE)
saveRDS(rfFit70, "rds/rfFit70.rds")
Plots for sensitivity analysis. * Note: excellent performance even at 20/80. What does this say about out-of-the-box “intelligence” of boosted trees? * I would suggest this is due to boosted trees employing the same logic as trained experimentalists or diagnosticians.
# ============================================================================
# Random forest under sparsity: Making calibration plots
# ============================================================================
# Predict for testing set
rf.pred2.5 <- predict(rfFit2.5, test.scaled)
rf.pred5 <- predict(rfFit5, test.scaled)
rf.pred10 <- predict(rfFit10, test.scaled)
rf.pred20 <- predict(rfFit20, test.scaled)
rf.pred30 <- predict(rfFit30, test.scaled)
rf.pred40 <- predict(rfFit40, test.scaled)
rf.pred50 <- predict(rfFit50, test.scaled)
rf.pred60 <- predict(rfFit60, test.scaled)
rf.pred70 <- predict(rfFit70, test.scaled)
# Plot expected vs. observed
# Create data frames
df1 <- data.frame(x = rf.pred2.5, y = test.scaled$yield, type = as.factor('2.5%'))
df2 <- data.frame(x = rf.pred5, y = test.scaled$yield, type = as.factor('5%'))
df3 <- data.frame(x = rf.pred10, y = test.scaled$yield, type = as.factor('10%'))
df4 <- data.frame(x = rf.pred20, y = test.scaled$yield, type = as.factor('20%'))
df5 <- data.frame(x = rf.pred30, y = test.scaled$yield, type = as.factor('30%'))
df6 <- data.frame(x = rf.pred40, y = test.scaled$yield, type = as.factor('40%'))
df7 <- data.frame(x = rf.pred50, y = test.scaled$yield, type = as.factor('50%'))
df8 <- data.frame(x = rf.pred60, y = test.scaled$yield, type = as.factor('60%'))
df9 <- data.frame(x = rf.pred70, y = test.scaled$yield, type = as.factor('70%'))
facet.df <- do.call(rbind, list(df1, df2, df3,
df4, df5, df6,
df7, df8, df9))
facet.plot <- ggplot(facet.df, aes(x = x, y = y)) +
geom_point(alpha = 0.3, color="dodgerblue3", size=1) +
scale_x_continuous(breaks = seq(0,100,25), lim=c(0, 100)) +
geom_segment(aes(x=0, xend=100, y=0, yend=100), linetype="dashed", size=0.3) +
geom_smooth(method="loess", se=FALSE, size=0.5, color="black") +
facet_wrap(~type, ncol=3) +
labs(x='Predicted Yield', y='Observed Yield')
ggsave(file="R/plots/rf_sparsity.png", width=8, height=9)
## `geom_smooth()` using formula 'y ~ x'
# ============================================================================
# Random forest under sparsity: Plotting R^2 and RMSE
# ============================================================================
# R^2 values
rf.pred2.5.r2 <- cor(rf.pred2.5, test.scaled$yield)**2
rf.pred5.r2 <- cor(rf.pred5, test.scaled$yield)**2
rf.pred10.r2 <- cor(rf.pred10, test.scaled$yield)**2
rf.pred20.r2 <- cor(rf.pred20, test.scaled$yield)**2
rf.pred30.r2 <- cor(rf.pred30, test.scaled$yield)**2
rf.pred40.r2 <- cor(rf.pred40, test.scaled$yield)**2
rf.pred50.r2 <- cor(rf.pred50, test.scaled$yield)**2
rf.pred60.r2 <- cor(rf.pred60, test.scaled$yield)**2
rf.pred70.r2 <- cor(rf.pred70, test.scaled$yield)**2
# RMSE
rf.pred2.5.rmse <- rmse(rf.pred2.5, test.scaled$yield)
rf.pred5.rmse <- rmse(rf.pred5, test.scaled$yield)
rf.pred10.rmse <- rmse(rf.pred10, test.scaled$yield)
rf.pred20.rmse <- rmse(rf.pred20, test.scaled$yield)
rf.pred30.rmse <- rmse(rf.pred30, test.scaled$yield)
rf.pred40.rmse <- rmse(rf.pred40, test.scaled$yield)
rf.pred50.rmse <- rmse(rf.pred50, test.scaled$yield)
rf.pred60.rmse <- rmse(rf.pred60, test.scaled$yield)
rf.pred70.rmse <- rmse(rf.pred70, test.scaled$yield)
# create data frame containing RMSE and R^2 data for sparsity models
df <- data.frame(rmse = c(rf.pred2.5.rmse,
rf.pred5.rmse,
rf.pred10.rmse,
rf.pred20.rmse,
rf.pred30.rmse,
rf.pred50.rmse,
rf.pred70.rmse),
r2 = c(rf.pred2.5.r2,
rf.pred5.r2,
rf.pred10.r2,
rf.pred20.r2,
rf.pred30.r2,
rf.pred50.r2,
rf.pred70.r2))
row.names(df) <- c('2.5%', '5%', '10%', '20%', '30%', '50%', '70%')
# Plot RMSE and R^2 data
rmse.plot <- ggplot(df, aes(y=reorder(rownames(df), -rmse), x=rmse)) +
geom_point() +
geom_text(label=round(df$rmse, 1), vjust=-1, size=2.5) +
labs(x='RMSE', y='Training Set Data') +
xlim(0, 20) +
coord_flip()
r2.plot <- ggplot(df, aes(y=reorder(rownames(df), -rmse), x=r2)) +
geom_point() +
geom_text(label=round(df$r2, 2), vjust=-1, size=2.5) +
labs(x='Rsquared', y='Training Set Data') +
xlim(0.5, 1) +
coord_flip()
plots <- arrangeGrob(r2.plot, rmse.plot, ncol=2)
ggsave(plots, file="R/plots/rf_sparsity_r2_rmse.png", width=6, height=2.5)
Feature importance plots for random forest
# ============================================================================
# Random forest: Plotting variable importance
# ============================================================================
# read in variable importance from trained random forest model
rf_imp <- importance(rfFit70$finalModel)
rf.imp.df <- cbind(as.data.frame(rf_imp), names(rf_imp[, 1]))
colnames(rf.imp.df)[1] <- "IncMSE"
colnames(rf.imp.df)[3] <- "descriptor"
# for descriptor names, replace "_" with " " and "." with "*"
rf.imp.df$descriptor <- gsub("_", " ", rf.imp.df$descriptor)
rf.imp.df$descriptor <- gsub("[.]", "*", rf.imp.df$descriptor)
# capitalize descriptor names
simpleCap <- function(x) {
s <- strsplit(x, " ")[[1]]
paste(toupper(substring(s, 1, 1)), substring(s, 2),
sep="", collapse=" ")
}
rf.imp.df$descriptor <- sapply(rf.imp.df$descriptor, simpleCap)
# plot variable importance (top 10 descriptors)
p1 <- ggplot(rf.imp.df[rf.imp.df$IncMSE>22, ], aes(x=reorder(descriptor, IncMSE), y=IncMSE)) +
geom_bar(stat="identity") +
scale_y_continuous(labels = comma) +
labs(x="", y="Increase in Mean Squared Error (%)") +
coord_flip()
ggsave(p1, file="R/plots/rf_importance.png", width=6, height=2.25)
# plot variable importance (descriptors above 15% IncMSE) - Supporting Information
p1 <- ggplot(rf.imp.df[rf.imp.df$IncMSE>15, ], aes(x=reorder(descriptor, IncMSE), y=IncMSE)) +
geom_bar(stat="identity") +
scale_y_continuous(labels = comma) +
labs(x="", y="Increase in Mean Squared Error (%)") +
coord_flip()
ggsave(p1, file="R/plots/rf_importance_SI.png", width=6, height=4)
‘###### SKIP ##################################################################’# ============================================================================ ’# Plotting *C3 NMR Shift vs Yield ’# ============================================================================
‘###### SKIP ##################################################################’# ============================================================================ ‘# Random forest: Train and test each aryl halide individually’# ============================================================================
‘###### CONTINUE ###############################################################’# ============================================================================ ‘# Random forest: Predict ArCl and ArI from ArBr’# ============================================================================
Ability for model trained on one set of compounds to predict another is tested. This can be seen as a basic test of cross-compound generalizeability. * Will be very interesting to observe/compare drivers via SHAP. * What does SHAP suggest regarding drivers of cross-compatibility? * How do these drivers compare to drivers in standard/general model?
# ============================================================================
# Random forest: Predict ArCl and ArI from ArBr
# ============================================================================
# Random forest (ArBr)
set.seed(8915)
rfFit.ArBr.all <- train(yield ~ ., data=ArBr.scaled, trControl=train_control, method="rf", importance=TRUE)
saveRDS(rfFit.ArBr.all, "rds/rfFit_ArBr_all.rds")
# Test on ArCl
ClfromBr <- predict(rfFit.ArBr.all, ArCl.scaled)
ClfromBr.r2 <- cor(ClfromBr, ArCl.scaled$yield)**2
ClfromBr.rmse <- rmse(ClfromBr, ArCl.scaled$yield)
df1 <- data.frame(x = ClfromBr,
y = ArCl.scaled$yield)
# Create calibration plot (predict Cl from Br random forest model)
p1 <- ggplot(df1, aes(x = x, y = y)) +
geom_point(alpha = 0.4) +
scale_x_continuous(breaks = seq(0,100,25), lim=c(0, 100)) +
labs(x='Predicted Yield', y='Observed Yield') +
geom_segment(aes(x=0, xend=100, y=0, yend=100), linetype="dashed") +
geom_smooth(method="loess", se=FALSE)
ggsave(file="R/plots/ClfromBr.png", width=5, height=4)
## `geom_smooth()` using formula 'y ~ x'
# Test on ArI
IfromBr <- predict(rfFit.ArBr.all, ArI.scaled)
IfromBr.r2 <- cor(IfromBr, ArI.scaled$yield)**2
IfromBr.rmse <- rmse(IfromBr, ArI.scaled$yield)
df2 <- data.frame(x = IfromBr,
y = ArI.scaled$yield)
# Create calibration plot (predict I from Br random forest model)
p2 <- ggplot(df2, aes(x = x, y = y)) +
geom_point(alpha = 0.4) +
scale_x_continuous(breaks = seq(0,100,25), lim=c(0, 100)) +
labs(x='Predicted Yield', y='Observed Yield') +
geom_segment(aes(x=0, xend=100, y=0, yend=100), linetype="dashed") +
geom_smooth(method="loess", se=FALSE)
ggsave(file="R/plots/IfromBr.png", width=5, height=4)
## `geom_smooth()` using formula 'y ~ x'
‘###### SKIP ##################################################################’# ============================================================================ ‘# Random forest: Predicting pyridyl substrates from nonpyridyl ones’# ============================================================================
‘###### CONTINUE ###############################################################’# ============================================================================ ‘# Random forest: Predicting yields > 80% from yields < 80%’# ============================================================================
The ability for the model to infer beyond its original training scope is tested further. Here, the challenging – nigh impossible – task of testing whether the model generalizes beyond assumed physical limitations. * Perhaps something requested by a reviewer, as it does not seem to make much sense in the regression context * Might be interesting to compare these results with those from another RF model trainied using SHAP values.
# ============================================================================
# Random forest: Predicting yields > 80% from yields < 80%
# ============================================================================
# Train random forest model using yields < 80%
set.seed(8915)
rfFit.under80 <- train(yield ~ ., data=output.under80, trControl=train_control, method="rf", importance=TRUE)
saveRDS(rfFit.under80, "rds/rfFit_under80.rds")
# Predict yields for reactions > 80%
over80pred <- predict(rfFit.under80, output.over80)
over80pred.r2 <- cor(over80pred, output.over80$yield)**2
over80pred.rmse <- rmse(over80pred, output.over80$yield)
# Generate calibration plot
df <- data.frame(x = over80pred,
y = output.over80$yield)
p1 <- ggplot(df, aes(x = x, y = y)) +
geom_point(alpha = 0.4) +
scale_x_continuous(breaks = seq(0, 100, 25), lim=c(0, 100)) +
labs(x='Predicted Yield', y='Observed Yield') +
geom_segment(aes(x=0, xend=100, y=0, yend=100), linetype="dashed") +
geom_smooth(method="loess", se=FALSE)
ggsave(file="R/plots/over80pred.png", width=5, height=4)
## `geom_smooth()` using formula 'y ~ x'
XGBoost/SHAP
#install.packages('magick')
img <- magick::image_read("yield_data/allplates_histogram.png")
plot(img) # or print(img)
Load predictor (descriptor) data from external files * Same as original, but with data.table
# load output table generated by python script
output.table <- fread("R/output_table.csv", header=TRUE)
# scale the descriptor data
output.scaled <- as.data.table(scale(output.table))
# create the same for nominal data
output.nominal <- output.table
Refers to previous chunk and should execute via reference, but doesn’t (Rstudio bug?). Need to tab up to original chunk and execute manually.
# Plate 1.1
plate1.1 <- data.table::fread("yield_data/plate1.1.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate1.1_pdt <- plate1.1$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate1.1_pdt))
dim(plate.data) <- c(24,16)
plate.data1.1 <- t(plate.data)
# Plate 1.2
plate1.2 <- data.table::fread("yield_data/plate1.2.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate1.2_pdt <- plate1.2$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate1.2_pdt))
dim(plate.data) <- c(24,16)
plate.data1.2 <- t(plate.data)
# Plate 1.3
plate1.3 <- data.table::fread("yield_data/plate1.3.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate1.3_pdt <- plate1.3$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate1.3_pdt))
dim(plate.data) <- c(24,16)
plate.data1.3 <- t(plate.data)
# Plate 1.4
plate1.4 <- data.table::fread("yield_data/plate1.4.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate1.4_pdt <- plate1.4$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate1.4_pdt))
## Warning in as.matrix(as.numeric(plate1.4_pdt)): NAs introduced by coercion
dim(plate.data) <- c(24,16)
plate.data1.4 <- t(plate.data)
# stitch Plate 1 together into one 32x48 matrix
plate1.top <- cbind(plate.data1.1, plate.data1.2)
plate1.bottom <- cbind(plate.data1.3, plate.data1.4)
plate1 <- rbind(plate1.top, plate1.bottom)
# Plate 2.1
plate2.1 <- data.table::fread("yield_data/plate2.1.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate2.1_pdt <- plate2.1$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate2.1_pdt))
dim(plate.data) <- c(24,16)
plate.data2.1 <- t(plate.data)
# Plate 2.2
plate2.2 <- data.table::fread("yield_data/plate2.2.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate2.2_pdt <- plate2.2$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate2.2_pdt))
dim(plate.data) <- c(24,16)
plate.data2.2 <- t(plate.data)
# Plate 2.3
plate2.3 <- data.table::fread("yield_data/plate2.3.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate2.3_pdt <- plate2.3$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate2.3_pdt))
dim(plate.data) <- c(24,16)
plate.data2.3 <- t(plate.data)
# Plate 2.4
plate2.4 <- data.table::fread("yield_data/plate2.4.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate2.4_pdt <- plate2.4$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate2.4_pdt))
dim(plate.data) <- c(24,16)
plate.data2.4 <- t(plate.data)
# stitch Plate 2 together into one 32x48 matrix
plate2.top <- cbind(plate.data2.1, plate.data2.2)
plate2.bottom <- cbind(plate.data2.3, plate.data2.4)
plate2 <- rbind(plate2.top, plate2.bottom)
# Plate 3.1
plate3.1 <- data.table::fread("yield_data/plate3.1.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate3.1_pdt <- plate3.1$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate3.1_pdt))
dim(plate.data) <- c(24,16)
plate.data3.1 <- t(plate.data)
# Plate 3.2
plate3.2 <- data.table::fread("yield_data/plate3.2.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate3.2_pdt <- plate3.2$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate3.2_pdt))
dim(plate.data) <- c(24,16)
plate.data3.2 <- t(plate.data)
# Plate 3.3
plate3.3 <- data.table::fread("yield_data/plate3.3.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate3.3_pdt <- plate3.3$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate3.3_pdt))
dim(plate.data) <- c(24,16)
plate.data3.3 <- t(plate.data)
# Plate 3.4
plate3.4 <- data.table::fread("yield_data/plate3.4.csv", header=TRUE, stringsAsFactors=FALSE, na.strings = "#DIV/0!")
plate3.4_pdt <- plate3.4$product_scaled[1:384]
plate.data <- as.matrix(as.numeric(plate3.4_pdt))
dim(plate.data) <- c(24,16)
plate.data3.4 <- t(plate.data)
# stitch Plate 3 together into one 32x48 matrix
plate3.top <- cbind(plate.data3.1, plate.data3.2)
plate3.bottom <- cbind(plate.data3.3, plate.data3.4)
plate3 <- rbind(plate3.top, plate3.bottom)
Generate yield data from loaded and processed yield
# Remove reactions without additive and reactions with additive 7
plate1_nocontrols <- plate1[c(-1,-5,-9,-13,-20,-24,-28,-32), c(-16,-32,-48)]
# Remove reactions without aryl halide
plate2_nocontrols <- plate2[, c(-16,-32,-48)]
plate3_nocontrols <- plate3[, c(-16,-32,-48)]
plate1_nocontrols_v <- as.vector(t(plate1_nocontrols))
plate2_nocontrols_v <- as.vector(t(plate2_nocontrols))
plate3_nocontrols_v <- as.vector(t(plate3_nocontrols))
yield_data <- c(plate1_nocontrols_v, plate2_nocontrols_v, plate3_nocontrols_v)
Create binary labels for yield
# based on histogram, scaled yield = c(0, <=.05, and <=.10) will be included
yield_label <- as.data.table(yield_data)
yield_label <- yield_label[, `:=` (`yield=0%` = if_else(yield_data == 0,0,1),
`yield<=5%` = if_else(yield_data <= 5,0,1),
`yield<=10%` = if_else(yield_data <= 10,0,1))]
Add mutliple label data to predictor data
#remove original yield column, rearrange to place at start of table/s
output.nominal <- cbind(yield_label,output.nominal)
output.scaled <- cbind(yield_label,output.scaled)
# remove rows where yield=NA
output.nominal <- output.nominal[!(is.na(output.nominal$yield_data)), ]
output.scaled <- output.scaled[!(is.na(output.scaled$yield_data)), ]
# control; keep only rows where yield=NA
#output.control.scaled <- output.scaled[(is.na(output.scaled$yield)), ]
Data split for initial CV
# Set sample ratio for 70/30 split
sample.ratio = 0.7
# Split scaled into training and test set (70/30)
set.seed(1084)
size <- round(sample.ratio*nrow(output.scaled))
training <- sample(nrow(output.scaled), size=size, replace=FALSE)
training.scaled <- output.scaled[training,]
test.scaled <- output.scaled[-training,]
# Split nominal into training and test set (70/30)
set.seed(1084)
size <- round(sample.ratio*nrow(output.nominal))
training <- sample(nrow(output.nominal), size=size, replace=FALSE)
training.nominal <- output.nominal[training,]
test.nominal <- output.nominal[-training,]
Generate individual datasets to be tested
## data nominal, yield <= 0
dtrain.nominal.01 <- xgb.DMatrix(data = as.matrix(training.nominal[, -(1:4)]),
label = as.matrix(training.nominal[, 2])) ## yield = 0
## data nominal, yield <= 5%
dtrain.nominal.05 <- xgb.DMatrix(data = as.matrix(training.nominal[, -(1:4)]),
label = as.matrix(training.nominal[, 3])) ## yield <= 5%
## data nominal, yield <= 10%
dtrain.nominal.10 <- xgb.DMatrix(data = as.matrix(training.nominal[, -(1:4)]),
label = as.matrix(training.nominal[, 4])) ## yield <= 10%
Yield = 0 is technically the most correct classification criteria; however, here we imgaine this a three-way classification task aggregated back down to two: 1) no rxn, 2) insignificant rxn, and 3) good rxn -> 1) no/insignificant rxn and 2) good rxn. This ensures predictive structure sensitive to yield-relevant considerations that may or may not function independently from factors determining rxn success. This is implied by original manuscript wrt the resasoning behind including and testing various additives known to suppress reactivity: suppression of reactivity and possibility of reaction are statistically different.
XGBoost cross validations setup
And now we check performance…
# model AUC
plot(pROC::roc(response = as.matrix(training.nominal[, 4]),
predictor = XGB_Model_1_CV.10$pred,
levels=c(0, 1)),
lwd=1.5,
auc.polygon=TRUE,
print.auc=TRUE)
## Warning in roc.default(response = as.matrix(training.nominal[, 4]), predictor
## = XGB_Model_1_CV.10$pred, : Deprecated use a matrix as response. Unexpected
## results may be produced, please pass a vector or factor.
## Setting direction: controls < cases
# plot, predicted probability vs actual yield
plot(XGB_Model_1_CV.10$pred, unlist(training.nominal[, 1]))
# correlations
paste0("correlation, rxn probability vs yield = ",
cor(XGB_Model_1_CV.10$pred, unlist(training.nominal[, 1]) %>% as.numeric()) %>% round(3))
## [1] "correlation, rxn probability vs yield = 0.704"
Cross validation suggests a model with state-of-the-art predictive power out-of-the-box, with no a posteriori “tuning” of parameters: AUC = 98.4%. This results indicate that regardless of how the model is calibrate wrt classification threshold, risk associated with false negs and false pos is minimal. The degree to which results hold up under generalized contexts will be explored shortly; however, we first examine model output in terms of emperical validity.
Cross validation produces preliminary probability predictions; results here suggest our hypothesis wrt yield being a function of rxn probability has merit. Correlation between the two valies is reasonably high, at 70.5%, and a simple linear regression produces a respectable R^2 of .497.
That said, these results are far away from state of the art wrt emperical alignment.
A quick plot demonstrates an m-shaped probability distribution, which, given the binary nature of our response function, is reasonable albeit undesirable. In our experience, the SHAP transformation – from probabilities to emperically “recentered” (conditioned) likelihoods (log-RR) – potentially address this.
Accordingly, next steps are as follows – 1) generate predictive xgboost model, 2) generate explanatory SHAP model, and 3) evaluate/characterize SHAP model
Full XGBoost model seems to be fit perfectly, which has traditionally been a cause for alarm. We now take a peek at the test data…
# generate diagnostic predictions for training data
XGB_model_1.10.Predict.Test <- xgboost:::predict.xgb.Booster(XGB_model_1.10, as.matrix(test.nominal[, -(1:4)]))
# model AUC
plot(pROC::roc(response = as.matrix(test.nominal[, 4]),
predictor = XGB_model_1.10.Predict.Test,
levels=c(0, 1)),
lwd=1.5,
auc.polygon=TRUE,
print.auc=TRUE)
## Warning in roc.default(response = as.matrix(test.nominal[, 4]), predictor =
## XGB_model_1.10.Predict.Test, : Deprecated use a matrix as response. Unexpected
## results may be produced, please pass a vector or factor.
## Setting direction: controls < cases
With an AUC of 98.2%, OOS results – representing 30% of the full dataset – are definitively state-of-the-art.
As previously noted, this chart represents model selectivity across all possible classification thresholds. The question being answered can be see as “to what degree is the underlying model sound with respect to mechanistic implications?”. As suggestd by these results, predictive performance no longer needs to be our primary concern – given sufficient data, these models perform remarkably well. We therefore, prior to assessing classification performance, take a closer look at the predictive substrucutre informing the performance of this model.
Table and Bubble Plot of Feature Performance metrics
# chart data
XGB_model_1.10.Importance <- xgb.importance(model = XGB_model_1.10)
XGB_model_1.10.Importance$Gain <- round(XGB_model_1.10.Importance$Gain, 5)*100
XGB_model_1.10.Importance$Frequency <- round(XGB_model_1.10.Importance$Frequency, 5)*100
XGB_model_1.10.Importance$Cover <- round(XGB_model_1.10.Importance$Cover, 5)*100
XGB_model_1.10.Importance <- XGB_model_1.10.Importance[(Cover+Gain+Frequency) > 0,]
(XGB_model_1.10.Importance)
#define and save chart
XGB_model_1.10.Bubble <- XGB_model_1.10.Importance %>%
ggplot(aes(x=Cover, y=Frequency, size=Gain, colors = Feature)) +
geom_point(alpha=0.5) +
scale_size(range = c(.1, 24), name="Gain")
#plotly
ggplotly(XGB_model_1.10.Bubble)
This plot combines three complementary performance metrics to visually provided a better understanding of how the various features inform the predictions produced by XGBoost (compared to traditional statistical models). Here, we argue that the two remaining, generally overlooked metrics – frequency and cover – deserve to be considered in a new light. Cover, which indicates the percentage of cases for which a given feature is relevant with respect to prediction, can be seen as a heuristic measure of overall significance. And frequency, which is a measure of how often the model tries to fit a given variable into predictive iterations, can be seen as a heuristic measure of predictive ambiguity. The standard, well known metric indicative of the degree to which omission of a given feature reduces the degree of variation represented by your model, is referred to as “gain” and indicated by bubble size.
The predictive substructure revealed above demonstrates a key, but overlooked reality regarding machine learning: the source of it’s predictive performance. Note that about 50% of classification performance comes from ~10 features. This is consistent with the common statistical wisdom and results form Doyle et al. Regardless of the methodology used, basic linear regression or state-of-the-art neural network, performance is a function of feature set size and diversity. We argue that the primary difference between models is in their ability to separably extract information from increasingly larger and complex sets of data. And machine learning, particularly tree-based methods, provides for mathematically & statistically tractable way to exploit sources of “subprime” information. * refer to XX for a discussion and proof demonstrating the mathematical equivalence of boosted classification trees and stagewise additive logistic regression.
As shown above, the bottom-left quadrant (low cover, low frequency) – features that would have been ignored or tossed under traditional statistical regimes – is home to a substantial proportion of the predictive information driving this model. Absent that, this model performs no better than standard LR. Up to now, estimation of feature effects (wrt magnitude and direction) has been exclusive to linear models. Recent developments have no turned this paradigm on its head.
The next step involves generation of effect estimates akin to those produced by traditional linear models; this is new to the world of machine learning. However, Shapely takes this a step further by providing local, case-wise effect estimations which allows for a new dimension of exploratory validation.
Shapley values
# generate SHAP model
XGB_model_1.10.SHAP <- shap.values(xgb_model = XGB_model_1.10, X_train = as.matrix(training.nominal[, -(1:4)]))
# eval classification performance
# generate prediction
XGB_model_1.10.SHAP.Bin <- rowSums(XGB_model_1.10.SHAP$shap_score) ## generate prediction
XGB_model_1.10.SHAP.Bin <- ifelse(XGB_model_1.10.SHAP.Bin < 0, "YieldInsig", "YieldPositive") %>% factor()
# evaluate
XGB_model_1.10.SHAP.Eval <- caret::confusionMatrix(XGB_model_1.10.SHAP.Bin, as.matrix(training.nominal[, 4]) %>%
factor(labels = c("YieldInsig", "YieldPositive")), positive = "YieldPositive")
(XGB_model_1.10.SHAP.Eval)
## Confusion Matrix and Statistics
##
## Reference
## Prediction YieldInsig YieldPositive
## YieldInsig 754 0
## YieldPositive 0 2014
##
## Accuracy : 1
## 95% CI : (0.9987, 1)
## No Information Rate : 0.7276
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.7276
## Detection Rate : 0.7276
## Detection Prevalence : 0.7276
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : YieldPositive
##
# generate SHAP visualization
shap.plot.summary.wrap2(shap_score = XGB_model_1.10.SHAP$shap_score,
X = as.matrix(training.nominal[, -(1:4)]),
top_n = 20)
# generate SHAP clusters
shap.plot.force_plot_bygroup(shap.prep.stack.data(shap_contrib = XGB_model_1.10.SHAP$shap_score, top_n = 20, n_groups = 6) )
## The SHAP values of the Rest 100 features were summed into variable 'rest_variables'.
SHAP generates an effect matrix consisiting of the predicted marginal contribution of each feature wrt a given predictive outcome. These values are additive wrt the log relative risk of event (reaction); in addition to effect estimations, SHAP also calculates a “bias” term. Bias is the intercept term and is, by definition, the same for all cases; however, in this context, the bias terms also indicates “base probability of event”. Mathematically, this is the value separating positive and negative results – in other words, SHAP inadvertently generates a classification threshold. This classification threshold is empirically grounded: SHAP generates this effect matrix based on an optimization model derived from cooperative game theory.
Classification models have always suffered dt lack of a priori, emperically-grounded thresholds for classifying cases based on predicted probabilites. Classification thresholds have either been determined arbitrarily (50% cutoff) or according to the inherently biasing practice of AUC optimization.
Accordingly, up to this point, we have refrained from arbitrarily assigning or generating a classification threshold in order to interpret the XGBoost results. Note (below) the ability to convert the SHAP-generated bias back to a probability and set as a the classification threshold for the original XGBoost model. Predictive output from XGBoost and SHAP are mathematically equivalent {XGBoost_Prob(Yield) = exp(SHAP)/(1+exp(SHAP))} – we will therefore henceforth ignore the former.
Summary: XGBoost was used – not to generate a prediction, but… – to, in a high-throughput fashion, develop a high-fidelity predictive substructure to be extracted by SHAP.
# eval classification performance
XGB_model_1.10.SHAP.Threshold <- exp(XGB_model_1.10.SHAP$BIAS0)/(1+exp(XGB_model_1.10.SHAP$BIAS0))
# generate prediction
XGB_model_1.10.Test.Bin <- ifelse(XGB_model_1.10.Predict.Test < XGB_model_1.10.SHAP.Threshold$BIAS, "YieldInsig", "YieldPositive") %>% factor()
# evaluate
XGB_model_1.10.Test.Eval <- caret::confusionMatrix(XGB_model_1.10.Test.Bin, as.matrix(test.nominal[, 4]) %>%
factor(labels = c("YieldInsig", "YieldPositive")), positive = "YieldPositive")
(XGB_model_1.10.Test.Eval)
## Confusion Matrix and Statistics
##
## Reference
## Prediction YieldInsig YieldPositive
## YieldInsig 316 23
## YieldPositive 48 800
##
## Accuracy : 0.9402
## 95% CI : (0.9251, 0.953)
## No Information Rate : 0.6933
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8566
##
## Mcnemar's Test P-Value : 0.004396
##
## Sensitivity : 0.9721
## Specificity : 0.8681
## Pos Pred Value : 0.9434
## Neg Pred Value : 0.9322
## Prevalence : 0.6933
## Detection Rate : 0.6740
## Detection Prevalence : 0.7144
## Balanced Accuracy : 0.9201
##
## 'Positive' Class : YieldPositive
##
Here too, OOS results are exceptional – especially when considerting that the classification threshold was determined a priori via SHAP – not after-the-fact to optimize performance. ___________
Now lets test correlation of SHAP vs yield – recall that SHAP values are the log relative risk of yield > 10% {P(yield > 10% | BIAS_BaselineProbability)} – we can suppose there to be many arguments as to why this comparision is or is not appropriate; however, looking to our DFT forefathers for inspiration, we unabashedly declare this to be “just an approximation”.
# make correlation dataset
XGB_model_1.10.SHAP.Corr <- cbind(rowSums(XGB_model_1.10.SHAP$shap_score), unlist(training.nominal[, 1])) %>% data.table()
colnames(XGB_model_1.10.SHAP.Corr) <- c("SHAP", "Yield")
# plot, log-RR (SHAP) vs actual yield
plot(XGB_model_1.10.SHAP.Corr$SHAP, XGB_model_1.10.SHAP.Corr$Yield)
# correlations, log-RR (SHAP) vs actual yield
cor(XGB_model_1.10.SHAP.Corr$SHAP, XGB_model_1.10.SHAP.Corr$Yield)
## [1] 0.8060548
#[1] 0.8059208
# create SHAP0, where neg values are set to zero
XGB_model_1.10.SHAP.Corr[, SHAP0 := ifelse(SHAP<=0,0,SHAP)]
# rerun correlations
plot(XGB_model_1.10.SHAP.Corr$SHAP0, XGB_model_1.10.SHAP.Corr$Yield)
cor(XGB_model_1.10.SHAP.Corr$SHAP0, XGB_model_1.10.SHAP.Corr$Yield)
## [1] 0.8344013
#[1] 0.8348504
# lets try linear model
#shap
lm(XGB_model_1.10.SHAP.Corr$Yield ~ XGB_model_1.10.SHAP.Corr$SHAP) %>% summary()
##
## Call:
## lm(formula = XGB_model_1.10.SHAP.Corr$Yield ~ XGB_model_1.10.SHAP.Corr$SHAP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.602 -12.552 -0.579 9.282 55.926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.53067 0.36652 53.29 <2e-16 ***
## XGB_model_1.10.SHAP.Corr$SHAP 3.74683 0.05231 71.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.21 on 2766 degrees of freedom
## Multiple R-squared: 0.6497, Adjusted R-squared: 0.6496
## F-statistic: 5131 on 1 and 2766 DF, p-value: < 2.2e-16
#Adjusted R-squared: 0.6494
#shap0
lm(XGB_model_1.10.SHAP.Corr$Yield ~ XGB_model_1.10.SHAP.Corr$SHAP0) %>% summary()
##
## Call:
## lm(formula = XGB_model_1.10.SHAP.Corr$Yield ~ XGB_model_1.10.SHAP.Corr$SHAP0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.137 -8.604 -1.232 6.681 63.565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.64282 0.48470 5.452 5.41e-08 ***
## XGB_model_1.10.SHAP.Corr$SHAP0 6.01544 0.07555 79.621 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.1 on 2766 degrees of freedom
## Multiple R-squared: 0.6962, Adjusted R-squared: 0.6961
## F-statistic: 6339 on 1 and 2766 DF, p-value: < 2.2e-16
#Adjusted R-squared: 0.6969
Substantial correlation is clearly evident, particular when the negative SHAP values transformed to zero (per Doyle et al). Simple LM worked well, yield as a function of aggregate SHAP score; R^2 has substantially improved vs the untransformed probabilities generated by XGBoost (.64 and .69 vs .50, resp.).
Next, we continue this direction by expanding to the context of multiple regression; yield as a function of the top 20 features. It must be noted that aggregate SHAP score and individual SHAP feature values are both independent to and additive with respect to each other. By definition, the SHAP value is heuristically calculated in such a way (deduction) that the association between the predictive outcome and each feature (grouping) is maximally valid.
# define data for lm that sorts and includes only the top features, per mean shap
# create subset for lm
XGB_model_1.10.SHAP.lm <- cbind(Yield = XGB_model_1.10.SHAP.Corr$Yield,
XGB_model_1.10.SHAP$shap_score[, .SD, .SDcols = head(names(XGB_model_1.10.SHAP$mean_shap_score), n=20)])
# calling by names errors, so wont do
# execute with top 20
# Multiple R-squared: 0.7087, Adjusted R-squared: 0.7066
lm(XGB_model_1.10.SHAP.lm$Yield ~
XGB_model_1.10.SHAP.lm[[2]] +
XGB_model_1.10.SHAP.lm[[3]] +
XGB_model_1.10.SHAP.lm[[4]] +
XGB_model_1.10.SHAP.lm[[5]] +
XGB_model_1.10.SHAP.lm[[6]] +
XGB_model_1.10.SHAP.lm[[7]] +
XGB_model_1.10.SHAP.lm[[8]] +
XGB_model_1.10.SHAP.lm[[9]] +
XGB_model_1.10.SHAP.lm[[10]] +
XGB_model_1.10.SHAP.lm[[11]] +
XGB_model_1.10.SHAP.lm[[12]] +
XGB_model_1.10.SHAP.lm[[13]] +
XGB_model_1.10.SHAP.lm[[14]] +
XGB_model_1.10.SHAP.lm[[15]] +
XGB_model_1.10.SHAP.lm[[16]] +
XGB_model_1.10.SHAP.lm[[17]] +
XGB_model_1.10.SHAP.lm[[18]] +
XGB_model_1.10.SHAP.lm[[19]] +
XGB_model_1.10.SHAP.lm[[20]] +
XGB_model_1.10.SHAP.lm[[21]]
) %>% summary()
##
## Call:
## lm(formula = XGB_model_1.10.SHAP.lm$Yield ~ XGB_model_1.10.SHAP.lm[[2]] +
## XGB_model_1.10.SHAP.lm[[3]] + XGB_model_1.10.SHAP.lm[[4]] +
## XGB_model_1.10.SHAP.lm[[5]] + XGB_model_1.10.SHAP.lm[[6]] +
## XGB_model_1.10.SHAP.lm[[7]] + XGB_model_1.10.SHAP.lm[[8]] +
## XGB_model_1.10.SHAP.lm[[9]] + XGB_model_1.10.SHAP.lm[[10]] +
## XGB_model_1.10.SHAP.lm[[11]] + XGB_model_1.10.SHAP.lm[[12]] +
## XGB_model_1.10.SHAP.lm[[13]] + XGB_model_1.10.SHAP.lm[[14]] +
## XGB_model_1.10.SHAP.lm[[15]] + XGB_model_1.10.SHAP.lm[[16]] +
## XGB_model_1.10.SHAP.lm[[17]] + XGB_model_1.10.SHAP.lm[[18]] +
## XGB_model_1.10.SHAP.lm[[19]] + XGB_model_1.10.SHAP.lm[[20]] +
## XGB_model_1.10.SHAP.lm[[21]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.889 -9.899 -0.416 8.526 48.844
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.2079 0.4295 44.726 < 2e-16 ***
## XGB_model_1.10.SHAP.lm[[2]] 9.7228 0.5782 16.814 < 2e-16 ***
## XGB_model_1.10.SHAP.lm[[3]] 2.3969 0.8930 2.684 0.007317 **
## XGB_model_1.10.SHAP.lm[[4]] 4.0785 0.8580 4.753 2.10e-06 ***
## XGB_model_1.10.SHAP.lm[[5]] 2.7968 2.2473 1.245 0.213411
## XGB_model_1.10.SHAP.lm[[6]] 7.6897 0.8982 8.562 < 2e-16 ***
## XGB_model_1.10.SHAP.lm[[7]] 6.2144 0.8803 7.060 2.10e-12 ***
## XGB_model_1.10.SHAP.lm[[8]] 46.6806 8.9614 5.209 2.04e-07 ***
## XGB_model_1.10.SHAP.lm[[9]] 6.9008 0.6818 10.121 < 2e-16 ***
## XGB_model_1.10.SHAP.lm[[10]] 8.1232 1.2232 6.641 3.74e-11 ***
## XGB_model_1.10.SHAP.lm[[11]] 6.3545 1.4912 4.261 2.10e-05 ***
## XGB_model_1.10.SHAP.lm[[12]] -0.6581 2.0587 -0.320 0.749234
## XGB_model_1.10.SHAP.lm[[13]] 2.1121 0.9320 2.266 0.023517 *
## XGB_model_1.10.SHAP.lm[[14]] -0.1161 1.8074 -0.064 0.948790
## XGB_model_1.10.SHAP.lm[[15]] -6.2013 1.7318 -3.581 0.000348 ***
## XGB_model_1.10.SHAP.lm[[16]] 12.9568 1.3198 9.817 < 2e-16 ***
## XGB_model_1.10.SHAP.lm[[17]] 10.0746 2.1898 4.601 4.40e-06 ***
## XGB_model_1.10.SHAP.lm[[18]] -57.0491 17.8005 -3.205 0.001367 **
## XGB_model_1.10.SHAP.lm[[19]] -9.4594 3.0218 -3.130 0.001764 **
## XGB_model_1.10.SHAP.lm[[20]] 0.6646 1.8704 0.355 0.722386
## XGB_model_1.10.SHAP.lm[[21]] 2.5938 3.0711 0.845 0.398410
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.77 on 2747 degrees of freedom
## Multiple R-squared: 0.7113, Adjusted R-squared: 0.7092
## F-statistic: 338.4 on 20 and 2747 DF, p-value: < 2.2e-16
# execute with top 12
# Multiple R-squared: 0.6922, Adjusted R-squared: 0.691
lm(XGB_model_1.10.SHAP.lm$Yield ~
XGB_model_1.10.SHAP.lm[[2]] +
XGB_model_1.10.SHAP.lm[[3]] +
XGB_model_1.10.SHAP.lm[[4]] +
XGB_model_1.10.SHAP.lm[[5]] +
XGB_model_1.10.SHAP.lm[[6]] +
XGB_model_1.10.SHAP.lm[[7]] +
XGB_model_1.10.SHAP.lm[[8]] +
XGB_model_1.10.SHAP.lm[[9]] +
XGB_model_1.10.SHAP.lm[[10]] +
XGB_model_1.10.SHAP.lm[[11]] +
XGB_model_1.10.SHAP.lm[[12]]
) %>% summary()
##
## Call:
## lm(formula = XGB_model_1.10.SHAP.lm$Yield ~ XGB_model_1.10.SHAP.lm[[2]] +
## XGB_model_1.10.SHAP.lm[[3]] + XGB_model_1.10.SHAP.lm[[4]] +
## XGB_model_1.10.SHAP.lm[[5]] + XGB_model_1.10.SHAP.lm[[6]] +
## XGB_model_1.10.SHAP.lm[[7]] + XGB_model_1.10.SHAP.lm[[8]] +
## XGB_model_1.10.SHAP.lm[[9]] + XGB_model_1.10.SHAP.lm[[10]] +
## XGB_model_1.10.SHAP.lm[[11]] + XGB_model_1.10.SHAP.lm[[12]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.467 -9.941 -0.290 8.670 56.671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.9400 0.3721 53.583 < 2e-16 ***
## XGB_model_1.10.SHAP.lm[[2]] 9.2680 0.5644 16.421 < 2e-16 ***
## XGB_model_1.10.SHAP.lm[[3]] 2.6301 0.9036 2.911 0.00364 **
## XGB_model_1.10.SHAP.lm[[4]] 4.9339 0.8389 5.881 4.56e-09 ***
## XGB_model_1.10.SHAP.lm[[5]] 1.3655 2.2026 0.620 0.53535
## XGB_model_1.10.SHAP.lm[[6]] 8.0409 0.6626 12.136 < 2e-16 ***
## XGB_model_1.10.SHAP.lm[[7]] 9.4614 0.7308 12.946 < 2e-16 ***
## XGB_model_1.10.SHAP.lm[[8]] 17.3946 0.6591 26.393 < 2e-16 ***
## XGB_model_1.10.SHAP.lm[[9]] 10.4512 0.5927 17.633 < 2e-16 ***
## XGB_model_1.10.SHAP.lm[[10]] 7.8851 0.8967 8.793 < 2e-16 ***
## XGB_model_1.10.SHAP.lm[[11]] 6.1426 1.4508 4.234 2.37e-05 ***
## XGB_model_1.10.SHAP.lm[[12]] 2.1465 1.9192 1.118 0.26347
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.18 on 2756 degrees of freedom
## Multiple R-squared: 0.694, Adjusted R-squared: 0.6928
## F-statistic: 568.3 on 11 and 2756 DF, p-value: < 2.2e-16
# execute with top 5
# Multiple R-squared: 0.5286, Adjusted R-squared: 0.5279
lm(XGB_model_1.10.SHAP.lm$Yield ~
XGB_model_1.10.SHAP.lm[[2]] +
XGB_model_1.10.SHAP.lm[[3]] +
XGB_model_1.10.SHAP.lm[[4]] +
XGB_model_1.10.SHAP.lm[[5]]
) %>% summary()
##
## Call:
## lm(formula = XGB_model_1.10.SHAP.lm$Yield ~ XGB_model_1.10.SHAP.lm[[2]] +
## XGB_model_1.10.SHAP.lm[[3]] + XGB_model_1.10.SHAP.lm[[4]] +
## XGB_model_1.10.SHAP.lm[[5]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.126 -14.172 0.125 12.865 64.398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.3259 0.4293 61.324 < 2e-16 ***
## XGB_model_1.10.SHAP.lm[[2]] 9.7461 0.6126 15.910 < 2e-16 ***
## XGB_model_1.10.SHAP.lm[[3]] 5.8968 1.0006 5.893 4.24e-09 ***
## XGB_model_1.10.SHAP.lm[[4]] 6.8048 1.0576 6.434 1.46e-10 ***
## XGB_model_1.10.SHAP.lm[[5]] 4.0848 2.0270 2.015 0.044 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.11 on 2763 degrees of freedom
## Multiple R-squared: 0.4617, Adjusted R-squared: 0.4609
## F-statistic: 592.4 on 4 and 2763 DF, p-value: < 2.2e-16
Next, we recall and compare results from Doyle et al…
ml_models.png <- magick::image_read("R/plots/ml_models.png")
plot(ml_models.png) # or print(img)
Interim Summary: So, without much work, SHAP was able to produce – from a binary classification task – yield prediction that rivals classical machine learned models trained on a regression task (kNN, SVM, linear model, Bayes GLM). Results are still far below those neural network and random forest, but…. this is to be expected.
A more interesting test will be to substitue the scaled data used by Doyle et al, with the SHAP data and rerun the RandomForest. However, prior to that, we will take a closer look at the SHAP data to better understand its statistical characteristics as well as applications wrt mechanistic explanation/exploration. ________________________
First, let’s examine one of my favorite use cases: stratified means tests for exploratory hypothesis testing via tables and 3D/4D visualizations. For stratification, we will go with the Halogen and Halide groupings that Doyle et all tested by adding them to the SHAP data as categorical variables. For diagnostic purposes, we’ll conclude this by quantifying the differentiating effect of these categoricals by evaluating their marginal effect in a LR task. ___________________
First, we set up the data to include the Halide and Halogen stratification variables.
# generate indices for halogens
ArCl <- seq(1, nrow(output.table), by=3)
ArBr <- seq(2, nrow(output.table), by=3)
ArI <- seq(3, nrow(output.table), by=3)
# generate indices for halides
CF3.Cl <- seq(1, nrow(output.table), by=15)
CF3.Br <- seq(2, nrow(output.table), by=15)
CF3.I <- seq(3, nrow(output.table), by=15)
OMe.Cl <- seq(4, nrow(output.table), by=15)
OMe.Br <- seq(5, nrow(output.table), by=15)
OMe.I <- seq(6, nrow(output.table), by=15)
Et.Cl <- seq(7, nrow(output.table), by=15)
Et.Br <- seq(8, nrow(output.table), by=15)
Et.I <- seq(9, nrow(output.table), by=15)
pyr2.Cl <- seq(10, nrow(output.table), by=15)
pyr2.Br <- seq(11, nrow(output.table), by=15)
pyr2.I <- seq(12, nrow(output.table), by=15)
pyr3.Cl <- seq(13, nrow(output.table), by=15)
pyr3.Br <- seq(14, nrow(output.table), by=15)
pyr3.I <- seq(15, nrow(output.table), by=15)
# define new table
nominal.table.labeled <- output.table
# create new nominal table with labels
nominal.table.labeled[ArCl, Halogen := "ArCl"]
nominal.table.labeled[ArBr, Halogen := "ArBr"]
nominal.table.labeled[ArI, Halogen := "ArI"]
# add halides
nominal.table.labeled[CF3.Cl, Halide := "CF3.Cl"]
nominal.table.labeled[CF3.Br, Halide := "CF3.Br"]
nominal.table.labeled[CF3.I, Halide := "CF3.I"]
nominal.table.labeled[OMe.Cl, Halide := "OMe.Cl"]
nominal.table.labeled[OMe.Br, Halide := "OMe.Br"]
nominal.table.labeled[OMe.I, Halide := "OMe.I"]
nominal.table.labeled[Et.Cl, Halide := "Et.Cl"]
nominal.table.labeled[Et.Br, Halide := "Et.Br"]
nominal.table.labeled[Et.I, Halide := "Et.I"]
nominal.table.labeled[pyr2.Cl, Halide := "pyr2.Cl"]
nominal.table.labeled[pyr2.Br, Halide := "pyr2.Br"]
nominal.table.labeled[pyr2.I, Halide := "pyr2.I"]
nominal.table.labeled[pyr3.Cl, Halide := "pyr3.Cl"]
nominal.table.labeled[pyr3.Br, Halide := "pyr3.Br"]
nominal.table.labeled[pyr3.I, Halide := "pyr3.I"]
# add yield
nominal.table.labeled <- cbind(yield_label, nominal.table.labeled)
# drop yield is na
nominal.table.labeled <- nominal.table.labeled[!is.na(yield_data)]
# separate into test and training
labeled.test <- nominal.table.labeled[-training,]
labeled.training <- nominal.table.labeled[training,]
# confirm equivalence
XGB_model_1.10.SHAP$shap_score %>% nrow() == labeled.training %>% nrow()
## [1] TRUE
# [1] TRUE
# add labels to SHAP
XGB_model_1.10.SHAP$shap_score[, Halogen := labeled.training$Halogen]
XGB_model_1.10.SHAP$shap_score[, Halide := labeled.training$Halide]
Let’s do means test to evaluate differences between chemical groups wrt SHAP-estimated feature effects.
# create indices for test variables
means_indx <- XGB_model_1.10.SHAP$shap_score[, 1:ncol(XGB_model_1.10.SHAP$shap_score)] %>% names()
# setup and execute stats tables
XGB_model_1.10.SHAP.Means <- CreateTableOne(vars = means_indx,
strata = "Halogen" ,
data = XGB_model_1.10.SHAP$shap_score)
#extract data matrix
XGB_model_1.10.SHAP.Means <- print(XGB_model_1.10.SHAP.Means$ContTable, smd=TRUE, quote = FALSE, noSpaces = TRUE)
## Stratified by Halogen
## ArBr ArCl
## n 915 909
## additive_*C3_NMR_shift (mean (SD)) 0.21 (0.52) 0.11 (0.35)
## additive_*C3_electrostatic_charge (mean (SD)) 0.06 (0.14) 0.02 (0.16)
## additive_*C4_NMR_shift (mean (SD)) 0.11 (0.18) 0.07 (0.14)
## additive_*C4_electrostatic_charge (mean (SD)) 0.15 (0.25) 0.10 (0.19)
## additive_*C5_NMR_shift (mean (SD)) 0.05 (0.15) 0.03 (0.14)
## additive_*C5_electrostatic_charge (mean (SD)) 0.11 (0.17) 0.06 (0.14)
## additive_*N1_electrostatic_charge (mean (SD)) 0.03 (0.07) 0.00 (0.08)
## additive_*O1_electrostatic_charge (mean (SD)) 0.23 (0.54) 0.12 (0.50)
## additive_E_HOMO (mean (SD)) 0.16 (0.43) 0.09 (0.26)
## additive_E_LUMO (mean (SD)) 0.05 (0.29) 0.00 (0.48)
## additive_V1_frequency (mean (SD)) 0.04 (0.15) 0.02 (0.15)
## additive_V1_intensity (mean (SD)) 0.04 (0.10) 0.01 (0.13)
## additive_dipole_moment (mean (SD)) 0.15 (0.27) 0.09 (0.20)
## additive_electronegativity (mean (SD)) 0.01 (0.03) 0.01 (0.03)
## additive_hardness (mean (SD)) 0.02 (0.07) 0.01 (0.10)
## additive_molecular_volume (mean (SD)) 0.09 (0.19) 0.04 (0.22)
## additive_molecular_weight (mean (SD)) 0.04 (0.08) 0.01 (0.08)
## additive_ovality (mean (SD)) 0.02 (0.08) 0.01 (0.07)
## additive_surface_area (mean (SD)) 0.02 (0.05) 0.00 (0.09)
## aryl_halide_*C1_NMR_shift (mean (SD)) 0.05 (0.11) -0.02 (0.34)
## aryl_halide_*C1_electrostatic_charge (mean (SD)) 0.19 (0.09) -0.01 (0.19)
## aryl_halide_*C2_NMR_shift (mean (SD)) 0.09 (0.06) -0.01 (0.06)
## aryl_halide_*C2_electrostatic_charge (mean (SD)) 0.08 (0.10) 0.02 (0.09)
## aryl_halide_*C3_NMR_shift (mean (SD)) 0.42 (0.98) -0.17 (1.08)
## aryl_halide_*C3_electrostatic_charge (mean (SD)) 0.04 (0.08) 0.00 (0.05)
## aryl_halide_*C4_NMR_shift (mean (SD)) 0.00 (0.06) -0.03 (0.03)
## aryl_halide_*C4_electrostatic_charge (mean (SD)) 0.04 (0.09) 0.00 (0.07)
## aryl_halide_*H2_NMR_shift (mean (SD)) 0.31 (0.27) -0.37 (0.50)
## aryl_halide_*H2_electrostatic_charge (mean (SD)) 0.24 (0.30) -0.25 (0.13)
## aryl_halide_*H3_NMR_shift (mean (SD)) 0.00 (0.04) -0.02 (0.05)
## aryl_halide_*H3_electrostatic_charge (mean (SD)) 0.05 (0.09) 0.02 (0.07)
## aryl_halide_E_HOMO (mean (SD)) 0.10 (0.08) -0.04 (0.15)
## aryl_halide_E_LUMO (mean (SD)) 0.12 (0.20) -0.15 (0.43)
## aryl_halide_V1_frequency (mean (SD)) 0.03 (0.05) -0.01 (0.05)
## aryl_halide_V1_intensity (mean (SD)) 0.09 (0.07) 0.01 (0.12)
## aryl_halide_V2_frequency (mean (SD)) 0.37 (0.55) -0.45 (0.84)
## aryl_halide_V2_intensity (mean (SD)) 0.02 (0.03) 0.00 (0.01)
## aryl_halide_V3_frequency (mean (SD)) 0.01 (0.12) -0.08 (0.10)
## aryl_halide_V3_intensity (mean (SD)) 0.02 (0.03) 0.00 (0.03)
## aryl_halide_dipole_moment (mean (SD)) 0.05 (0.09) 0.01 (0.05)
## aryl_halide_electronegativity (mean (SD)) 0.01 (0.02) 0.00 (0.02)
## aryl_halide_hardness (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## aryl_halide_molecular_volume (mean (SD)) 0.01 (0.01) -0.01 (0.01)
## aryl_halide_molecular_weight (mean (SD)) 0.64 (0.27) -0.80 (0.63)
## aryl_halide_ovality (mean (SD)) 0.09 (0.05) -0.04 (0.18)
## aryl_halide_surface_area (mean (SD)) 0.04 (0.02) -0.01 (0.07)
## base_*N1_electrostatic_charge (mean (SD)) 0.13 (0.53) 0.07 (0.46)
## base_E_HOMO (mean (SD)) 0.09 (0.29) 0.05 (0.23)
## base_E_LUMO (mean (SD)) 0.04 (0.14) 0.02 (0.11)
## base_dipole_moment (mean (SD)) 0.02 (0.07) 0.01 (0.06)
## base_electronegativity (mean (SD)) 0.01 (0.04) 0.01 (0.03)
## base_hardness (mean (SD)) 0.05 (0.21) 0.03 (0.18)
## base_molecular_volume (mean (SD)) 0.03 (0.12) 0.02 (0.10)
## base_molecular_weight (mean (SD)) 0.02 (0.06) 0.01 (0.05)
## base_ovality (mean (SD)) 0.01 (0.03) 0.00 (0.02)
## base_surface_area (mean (SD)) 0.00 (0.01) 0.00 (0.01)
## ligand_*C10_NMR_shift (mean (SD)) 0.05 (0.29) 0.04 (0.43)
## ligand_*C10_electrostatic_charge (mean (SD)) 0.02 (0.13) 0.02 (0.18)
## ligand_*C11_NMR_shift (mean (SD)) 0.14 (0.46) 0.07 (0.32)
## ligand_*C11_electrostatic_charge (mean (SD)) 0.07 (0.22) 0.04 (0.15)
## ligand_*C12_NMR_shift (mean (SD)) 0.03 (0.10) 0.01 (0.07)
## ligand_*C12_electrostatic_charge (mean (SD)) 0.01 (0.04) 0.01 (0.08)
## ligand_*C13_NMR_shift (mean (SD)) 0.00 (0.02) 0.00 (0.04)
## ligand_*C13_electrostatic_charge (mean (SD)) 0.01 (0.04) 0.01 (0.03)
## ligand_*C14_NMR_shift (mean (SD)) 0.01 (0.02) 0.00 (0.01)
## ligand_*C14_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C15_NMR_shift (mean (SD)) 0.01 (0.04) 0.00 (0.04)
## ligand_*C15_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C16_NMR_shift (mean (SD)) 0.01 (0.06) 0.01 (0.06)
## ligand_*C16_electrostatic_charge (mean (SD)) 0.01 (0.03) 0.00 (0.03)
## ligand_*C17_NMR_shift (mean (SD)) 0.00 (0.01) 0.00 (0.01)
## ligand_*C17_electrostatic_charge (mean (SD)) 0.00 (0.01) 0.00 (0.00)
## ligand_*C1_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.01)
## ligand_*C1_electrostatic_charge (mean (SD)) 0.00 (0.01) 0.00 (0.01)
## ligand_*C2_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C2_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C3_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C3_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C4_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C4_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C5_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C5_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C6_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C6_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C7_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C7_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C8_NMR_shift (mean (SD)) 0.00 (0.01) 0.00 (0.01)
## ligand_*C8_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C9_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C9_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*H11_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*H11_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*H3_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*H3_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*H4_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*H4_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*H9_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*H9_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*P1_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V10_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V10_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V1_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V1_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V2_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V2_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V3_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V3_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V4_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V4_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V5_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V5_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V6_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V6_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V7_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V7_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V8_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V8_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V9_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V9_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_dipole_moment (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## Stratified by Halogen
## ArI p test
## n 944
## additive_*C3_NMR_shift (mean (SD)) 0.21 (0.50) <0.001
## additive_*C3_electrostatic_charge (mean (SD)) 0.06 (0.14) <0.001
## additive_*C4_NMR_shift (mean (SD)) 0.11 (0.16) <0.001
## additive_*C4_electrostatic_charge (mean (SD)) 0.16 (0.25) <0.001
## additive_*C5_NMR_shift (mean (SD)) 0.05 (0.14) <0.001
## additive_*C5_electrostatic_charge (mean (SD)) 0.13 (0.20) <0.001
## additive_*N1_electrostatic_charge (mean (SD)) 0.03 (0.05) <0.001
## additive_*O1_electrostatic_charge (mean (SD)) 0.22 (0.56) <0.001
## additive_E_HOMO (mean (SD)) 0.15 (0.38) <0.001
## additive_E_LUMO (mean (SD)) 0.04 (0.19) 0.007
## additive_V1_frequency (mean (SD)) 0.04 (0.11) 0.002
## additive_V1_intensity (mean (SD)) 0.04 (0.08) <0.001
## additive_dipole_moment (mean (SD)) 0.17 (0.31) <0.001
## additive_electronegativity (mean (SD)) 0.01 (0.03) <0.001
## additive_hardness (mean (SD)) 0.02 (0.06) <0.001
## additive_molecular_volume (mean (SD)) 0.10 (0.18) <0.001
## additive_molecular_weight (mean (SD)) 0.05 (0.08) <0.001
## additive_ovality (mean (SD)) 0.01 (0.07) 0.005
## additive_surface_area (mean (SD)) 0.02 (0.05) <0.001
## aryl_halide_*C1_NMR_shift (mean (SD)) 0.06 (0.12) <0.001
## aryl_halide_*C1_electrostatic_charge (mean (SD)) 0.12 (0.10) <0.001
## aryl_halide_*C2_NMR_shift (mean (SD)) 0.03 (0.09) <0.001
## aryl_halide_*C2_electrostatic_charge (mean (SD)) 0.06 (0.10) <0.001
## aryl_halide_*C3_NMR_shift (mean (SD)) 0.97 (0.86) <0.001
## aryl_halide_*C3_electrostatic_charge (mean (SD)) 0.03 (0.08) <0.001
## aryl_halide_*C4_NMR_shift (mean (SD)) 0.10 (0.11) <0.001
## aryl_halide_*C4_electrostatic_charge (mean (SD)) 0.02 (0.11) <0.001
## aryl_halide_*H2_NMR_shift (mean (SD)) 0.49 (0.13) <0.001
## aryl_halide_*H2_electrostatic_charge (mean (SD)) 0.22 (0.21) <0.001
## aryl_halide_*H3_NMR_shift (mean (SD)) 0.06 (0.02) <0.001
## aryl_halide_*H3_electrostatic_charge (mean (SD)) 0.03 (0.09) <0.001
## aryl_halide_E_HOMO (mean (SD)) 0.11 (0.06) <0.001
## aryl_halide_E_LUMO (mean (SD)) 0.19 (0.08) <0.001
## aryl_halide_V1_frequency (mean (SD)) 0.00 (0.04) <0.001
## aryl_halide_V1_intensity (mean (SD)) 0.08 (0.15) <0.001
## aryl_halide_V2_frequency (mean (SD)) 0.65 (0.16) <0.001
## aryl_halide_V2_intensity (mean (SD)) 0.01 (0.03) <0.001
## aryl_halide_V3_frequency (mean (SD)) 0.16 (0.08) <0.001
## aryl_halide_V3_intensity (mean (SD)) 0.01 (0.01) <0.001
## aryl_halide_dipole_moment (mean (SD)) 0.05 (0.08) <0.001
## aryl_halide_electronegativity (mean (SD)) 0.01 (0.02) <0.001
## aryl_halide_hardness (mean (SD)) 0.00 (0.00) <0.001
## aryl_halide_molecular_volume (mean (SD)) 0.02 (0.01) <0.001
## aryl_halide_molecular_weight (mean (SD)) 0.87 (0.22) <0.001
## aryl_halide_ovality (mean (SD)) 0.10 (0.04) <0.001
## aryl_halide_surface_area (mean (SD)) 0.04 (0.02) <0.001
## base_*N1_electrostatic_charge (mean (SD)) 0.14 (0.50) 0.004
## base_E_HOMO (mean (SD)) 0.10 (0.27) <0.001
## base_E_LUMO (mean (SD)) 0.05 (0.13) <0.001
## base_dipole_moment (mean (SD)) 0.03 (0.07) <0.001
## base_electronegativity (mean (SD)) 0.01 (0.03) <0.001
## base_hardness (mean (SD)) 0.05 (0.20) 0.032
## base_molecular_volume (mean (SD)) 0.03 (0.11) 0.011
## base_molecular_weight (mean (SD)) 0.02 (0.06) 0.008
## base_ovality (mean (SD)) 0.01 (0.03) 0.009
## base_surface_area (mean (SD)) 0.00 (0.01) 0.007
## ligand_*C10_NMR_shift (mean (SD)) 0.05 (0.21) 0.754
## ligand_*C10_electrostatic_charge (mean (SD)) 0.02 (0.12) 0.579
## ligand_*C11_NMR_shift (mean (SD)) 0.16 (0.53) <0.001
## ligand_*C11_electrostatic_charge (mean (SD)) 0.08 (0.26) <0.001
## ligand_*C12_NMR_shift (mean (SD)) 0.03 (0.11) <0.001
## ligand_*C12_electrostatic_charge (mean (SD)) 0.01 (0.04) 0.717
## ligand_*C13_NMR_shift (mean (SD)) 0.00 (0.02) 0.679
## ligand_*C13_electrostatic_charge (mean (SD)) 0.02 (0.05) <0.001
## ligand_*C14_NMR_shift (mean (SD)) 0.01 (0.03) <0.001
## ligand_*C14_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.229
## ligand_*C15_NMR_shift (mean (SD)) 0.00 (0.03) 0.543
## ligand_*C15_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.692
## ligand_*C16_NMR_shift (mean (SD)) 0.01 (0.06) 0.094
## ligand_*C16_electrostatic_charge (mean (SD)) 0.01 (0.04) 0.006
## ligand_*C17_NMR_shift (mean (SD)) 0.00 (0.01) 0.658
## ligand_*C17_electrostatic_charge (mean (SD)) 0.00 (0.01) <0.001
## ligand_*C1_NMR_shift (mean (SD)) 0.00 (0.00) 0.006
## ligand_*C1_electrostatic_charge (mean (SD)) 0.00 (0.01) 0.027
## ligand_*C2_NMR_shift (mean (SD)) 0.00 (0.00) 0.009
## ligand_*C2_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.993
## ligand_*C3_NMR_shift (mean (SD)) 0.00 (0.00) 0.281
## ligand_*C3_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.601
## ligand_*C4_NMR_shift (mean (SD)) 0.00 (0.00) 0.228
## ligand_*C4_electrostatic_charge (mean (SD)) 0.00 (0.00) <0.001
## ligand_*C5_NMR_shift (mean (SD)) 0.00 (0.00) 0.023
## ligand_*C5_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.502
## ligand_*C6_NMR_shift (mean (SD)) 0.00 (0.00) 0.260
## ligand_*C6_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.008
## ligand_*C7_NMR_shift (mean (SD)) 0.00 (0.00) NaN
## ligand_*C7_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.009
## ligand_*C8_NMR_shift (mean (SD)) 0.00 (0.01) 0.094
## ligand_*C8_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.033
## ligand_*C9_NMR_shift (mean (SD)) 0.00 (0.00) 0.025
## ligand_*C9_electrostatic_charge (mean (SD)) 0.00 (0.00) NaN
## ligand_*H11_NMR_shift (mean (SD)) 0.00 (0.00) 0.032
## ligand_*H11_electrostatic_charge (mean (SD)) 0.00 (0.00) NaN
## ligand_*H3_NMR_shift (mean (SD)) 0.00 (0.00) 0.513
## ligand_*H3_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.179
## ligand_*H4_NMR_shift (mean (SD)) 0.00 (0.00) 0.461
## ligand_*H4_electrostatic_charge (mean (SD)) 0.00 (0.00) NaN
## ligand_*H9_NMR_shift (mean (SD)) 0.00 (0.00) 0.077
## ligand_*H9_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.109
## ligand_*P1_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.254
## ligand_V10_frequency (mean (SD)) 0.00 (0.00) 0.053
## ligand_V10_intensity (mean (SD)) 0.00 (0.00) 0.404
## ligand_V1_frequency (mean (SD)) 0.00 (0.00) 0.379
## ligand_V1_intensity (mean (SD)) 0.00 (0.00) 0.071
## ligand_V2_frequency (mean (SD)) 0.00 (0.00) NaN
## ligand_V2_intensity (mean (SD)) 0.00 (0.00) NaN
## ligand_V3_frequency (mean (SD)) 0.00 (0.00) 0.001
## ligand_V3_intensity (mean (SD)) 0.00 (0.00) NaN
## ligand_V4_frequency (mean (SD)) 0.00 (0.00) 0.004
## ligand_V4_intensity (mean (SD)) 0.00 (0.00) NaN
## ligand_V5_frequency (mean (SD)) 0.00 (0.00) NaN
## ligand_V5_intensity (mean (SD)) 0.00 (0.00) NaN
## ligand_V6_frequency (mean (SD)) 0.00 (0.00) NaN
## ligand_V6_intensity (mean (SD)) 0.00 (0.00) NaN
## ligand_V7_frequency (mean (SD)) 0.00 (0.00) NaN
## ligand_V7_intensity (mean (SD)) 0.00 (0.00) 0.732
## ligand_V8_frequency (mean (SD)) 0.00 (0.00) 0.413
## ligand_V8_intensity (mean (SD)) 0.00 (0.00) 0.718
## ligand_V9_frequency (mean (SD)) 0.00 (0.00) NaN
## ligand_V9_intensity (mean (SD)) 0.00 (0.00) NaN
## ligand_dipole_moment (mean (SD)) 0.00 (0.00) NaN
## Stratified by Halogen
## SMD
## n
## additive_*C3_NMR_shift (mean (SD)) 0.151
## additive_*C3_electrostatic_charge (mean (SD)) 0.202
## additive_*C4_NMR_shift (mean (SD)) 0.201
## additive_*C4_electrostatic_charge (mean (SD)) 0.193
## additive_*C5_NMR_shift (mean (SD)) 0.113
## additive_*C5_electrostatic_charge (mean (SD)) 0.285
## additive_*N1_electrostatic_charge (mean (SD)) 0.261
## additive_*O1_electrostatic_charge (mean (SD)) 0.142
## additive_E_HOMO (mean (SD)) 0.135
## additive_E_LUMO (mean (SD)) 0.084
## additive_V1_frequency (mean (SD)) 0.097
## additive_V1_intensity (mean (SD)) 0.141
## additive_dipole_moment (mean (SD)) 0.203
## additive_electronegativity (mean (SD)) 0.174
## additive_hardness (mean (SD)) 0.103
## additive_molecular_volume (mean (SD)) 0.166
## additive_molecular_weight (mean (SD)) 0.300
## additive_ovality (mean (SD)) 0.099
## additive_surface_area (mean (SD)) 0.145
## aryl_halide_*C1_NMR_shift (mean (SD)) 0.237
## aryl_halide_*C1_electrostatic_charge (mean (SD)) 0.947
## aryl_halide_*C2_NMR_shift (mean (SD)) 1.009
## aryl_halide_*C2_electrostatic_charge (mean (SD)) 0.404
## aryl_halide_*C3_NMR_shift (mean (SD)) 0.774
## aryl_halide_*C3_electrostatic_charge (mean (SD)) 0.341
## aryl_halide_*C4_NMR_shift (mean (SD)) 1.065
## aryl_halide_*C4_electrostatic_charge (mean (SD)) 0.267
## aryl_halide_*H2_NMR_shift (mean (SD)) 1.628
## aryl_halide_*H2_electrostatic_charge (mean (SD)) 1.630
## aryl_halide_*H3_NMR_shift (mean (SD)) 1.358
## aryl_halide_*H3_electrostatic_charge (mean (SD)) 0.230
## aryl_halide_E_HOMO (mean (SD)) 0.852
## aryl_halide_E_LUMO (mean (SD)) 0.804
## aryl_halide_V1_frequency (mean (SD)) 0.497
## aryl_halide_V1_intensity (mean (SD)) 0.501
## aryl_halide_V2_frequency (mean (SD)) 1.228
## aryl_halide_V2_intensity (mean (SD)) 0.463
## aryl_halide_V3_frequency (mean (SD)) 1.586
## aryl_halide_V3_intensity (mean (SD)) 0.619
## aryl_halide_dipole_moment (mean (SD)) 0.420
## aryl_halide_electronegativity (mean (SD)) 0.507
## aryl_halide_hardness (mean (SD)) 1.076
## aryl_halide_molecular_volume (mean (SD)) 1.301
## aryl_halide_molecular_weight (mean (SD)) 2.488
## aryl_halide_ovality (mean (SD)) 0.811
## aryl_halide_surface_area (mean (SD)) 0.835
## base_*N1_electrostatic_charge (mean (SD)) 0.100
## base_E_HOMO (mean (SD)) 0.139
## base_E_LUMO (mean (SD)) 0.143
## base_dipole_moment (mean (SD)) 0.133
## base_electronegativity (mean (SD)) 0.134
## base_hardness (mean (SD)) 0.078
## base_molecular_volume (mean (SD)) 0.089
## base_molecular_weight (mean (SD)) 0.092
## base_ovality (mean (SD)) 0.091
## base_surface_area (mean (SD)) 0.093
## ligand_*C10_NMR_shift (mean (SD)) 0.021
## ligand_*C10_electrostatic_charge (mean (SD)) 0.032
## ligand_*C11_NMR_shift (mean (SD)) 0.128
## ligand_*C11_electrostatic_charge (mean (SD)) 0.138
## ligand_*C12_NMR_shift (mean (SD)) 0.143
## ligand_*C12_electrostatic_charge (mean (SD)) 0.026
## ligand_*C13_NMR_shift (mean (SD)) 0.027
## ligand_*C13_electrostatic_charge (mean (SD)) 0.166
## ligand_*C14_NMR_shift (mean (SD)) 0.153
## ligand_*C14_electrostatic_charge (mean (SD)) 0.053
## ligand_*C15_NMR_shift (mean (SD)) 0.033
## ligand_*C15_electrostatic_charge (mean (SD)) 0.024
## ligand_*C16_NMR_shift (mean (SD)) 0.066
## ligand_*C16_electrostatic_charge (mean (SD)) 0.090
## ligand_*C17_NMR_shift (mean (SD)) 0.026
## ligand_*C17_electrostatic_charge (mean (SD)) 0.144
## ligand_*C1_NMR_shift (mean (SD)) 0.092
## ligand_*C1_electrostatic_charge (mean (SD)) 0.077
## ligand_*C2_NMR_shift (mean (SD)) 0.094
## ligand_*C2_electrostatic_charge (mean (SD)) 0.003
## ligand_*C3_NMR_shift (mean (SD)) 0.047
## ligand_*C3_electrostatic_charge (mean (SD)) 0.031
## ligand_*C4_NMR_shift (mean (SD)) 0.051
## ligand_*C4_electrostatic_charge (mean (SD)) 0.217
## ligand_*C5_NMR_shift (mean (SD)) 0.079
## ligand_*C5_electrostatic_charge (mean (SD)) 0.033
## ligand_*C6_NMR_shift (mean (SD)) 0.050
## ligand_*C6_electrostatic_charge (mean (SD)) 0.077
## ligand_*C7_NMR_shift (mean (SD)) <0.001
## ligand_*C7_electrostatic_charge (mean (SD)) 0.095
## ligand_*C8_NMR_shift (mean (SD)) 0.065
## ligand_*C8_electrostatic_charge (mean (SD)) 0.079
## ligand_*C9_NMR_shift (mean (SD)) 0.080
## ligand_*C9_electrostatic_charge (mean (SD)) <0.001
## ligand_*H11_NMR_shift (mean (SD)) 0.076
## ligand_*H11_electrostatic_charge (mean (SD)) <0.001
## ligand_*H3_NMR_shift (mean (SD)) 0.037
## ligand_*H3_electrostatic_charge (mean (SD)) 0.057
## ligand_*H4_NMR_shift (mean (SD)) 0.039
## ligand_*H4_electrostatic_charge (mean (SD)) <0.001
## ligand_*H9_NMR_shift (mean (SD)) 0.068
## ligand_*H9_electrostatic_charge (mean (SD)) 0.065
## ligand_*P1_electrostatic_charge (mean (SD)) 0.050
## ligand_V10_frequency (mean (SD)) 0.074
## ligand_V10_intensity (mean (SD)) 0.037
## ligand_V1_frequency (mean (SD)) 0.043
## ligand_V1_intensity (mean (SD)) 0.071
## ligand_V2_frequency (mean (SD)) <0.001
## ligand_V2_intensity (mean (SD)) <0.001
## ligand_V3_frequency (mean (SD)) 0.114
## ligand_V3_intensity (mean (SD)) <0.001
## ligand_V4_frequency (mean (SD)) 0.100
## ligand_V4_intensity (mean (SD)) <0.001
## ligand_V5_frequency (mean (SD)) <0.001
## ligand_V5_intensity (mean (SD)) <0.001
## ligand_V6_frequency (mean (SD)) <0.001
## ligand_V6_intensity (mean (SD)) <0.001
## ligand_V7_frequency (mean (SD)) <0.001
## ligand_V7_intensity (mean (SD)) 0.025
## ligand_V8_frequency (mean (SD)) 0.042
## ligand_V8_intensity (mean (SD)) 0.023
## ligand_V9_frequency (mean (SD)) <0.001
## ligand_V9_intensity (mean (SD)) <0.001
## ligand_dipole_moment (mean (SD)) <0.001
#clean up and convert to data table
colnames(XGB_model_1.10.SHAP.Means) <- paste0(colnames(XGB_model_1.10.SHAP.Means), " (n=", as.numeric(XGB_model_1.10.SHAP.Means[1,]), ")")
colnames(XGB_model_1.10.SHAP.Means) <- c(colnames(XGB_model_1.10.SHAP.Means)[1:3], "p", "test", "SMD")
XGB_model_1.10.SHAP.Means <- XGB_model_1.10.SHAP.Means[-1,]
XGB_model_1.10.SHAP.Means <- XGB_model_1.10.SHAP.Means %>% as.data.table(keep.rownames = TRUE)
# drop parentheticals
XGB_model_1.10.SHAP.Means[, 1:4 := lapply(.SD, function(x){tstrsplit(x=x, split=" ", fixed=TRUE, keep = 1L) %>% unlist() }), .SDcols=1:4]
# drop text column
XGB_model_1.10.SHAP.Means[, test := NULL]
# convert to numeric
XGB_model_1.10.SHAP.Means[, 2:6 := lapply(.SD, as.numeric), .SDcols=2:6]
## Warning in lapply(.SD, as.numeric): NAs introduced by coercion
## Warning in lapply(.SD, as.numeric): NAs introduced by coercion
# fix NA and NaN
XGB_model_1.10.SHAP.Means[is.nan(p), p := 999]
XGB_model_1.10.SHAP.Means[is.na(p), p := -0.001]
XGB_model_1.10.SHAP.Means[is.na(SMD), SMD := -0.001]
#view
(XGB_model_1.10.SHAP.Means)
Take a moment to consider what has just been done: a “black box” model (XGBoost) was used to generate an effect matrix of “white box” models for each individual case. This effect matrix was then used as a “synthetic data” for the purpose of predicting reaction yields – and here now, for exploratory hypothesis testing via stratified means tests. _______ Let’s now chart: mean(SHAP) vs corr(SHAPvYield) vs SMD(ArCl vs ArBr vs ArIx)
## data required for chart
# SHAP SMD
# XGB_model_1.10.SHAP.Means$SMD
# SHAM mean score
# XGB_model_1.10.SHAP$mean_shap_score
# SHAP vs Yield Correclations
sapply(1:(ncol(XGB_model_1.10.SHAP$shap_score)-2), function(i){
c(names(XGB_model_1.10.SHAP$shap_score)[[i]],
cor(as.matrix(training.nominal[, 1]), XGB_model_1.10.SHAP$shap_score[[i]]))
}) %>% t() %>% as.data.table() %>%
setnames(c("rn", "corr")) %>%
.[, "corr" := as.numeric(corr)] -> XGB_model_1.10.SHAP.Corr2
## Warning in cor(as.matrix(training.nominal[, 1]),
## XGB_model_1.10.SHAP$shap_score[[i]]): the standard deviation is zero
## Warning in cor(as.matrix(training.nominal[, 1]),
## XGB_model_1.10.SHAP$shap_score[[i]]): the standard deviation is zero
## Warning in cor(as.matrix(training.nominal[, 1]),
## XGB_model_1.10.SHAP$shap_score[[i]]): the standard deviation is zero
## Warning in cor(as.matrix(training.nominal[, 1]),
## XGB_model_1.10.SHAP$shap_score[[i]]): the standard deviation is zero
## Warning in cor(as.matrix(training.nominal[, 1]),
## XGB_model_1.10.SHAP$shap_score[[i]]): the standard deviation is zero
## Warning in cor(as.matrix(training.nominal[, 1]),
## XGB_model_1.10.SHAP$shap_score[[i]]): the standard deviation is zero
## Warning in cor(as.matrix(training.nominal[, 1]),
## XGB_model_1.10.SHAP$shap_score[[i]]): the standard deviation is zero
## Warning in cor(as.matrix(training.nominal[, 1]),
## XGB_model_1.10.SHAP$shap_score[[i]]): the standard deviation is zero
## Warning in cor(as.matrix(training.nominal[, 1]),
## XGB_model_1.10.SHAP$shap_score[[i]]): the standard deviation is zero
## Warning in cor(as.matrix(training.nominal[, 1]),
## XGB_model_1.10.SHAP$shap_score[[i]]): the standard deviation is zero
## Warning in cor(as.matrix(training.nominal[, 1]),
## XGB_model_1.10.SHAP$shap_score[[i]]): the standard deviation is zero
## Warning in cor(as.matrix(training.nominal[, 1]),
## XGB_model_1.10.SHAP$shap_score[[i]]): the standard deviation is zero
## Warning in cor(as.matrix(training.nominal[, 1]),
## XGB_model_1.10.SHAP$shap_score[[i]]): the standard deviation is zero
## Warning in cor(as.matrix(training.nominal[, 1]),
## XGB_model_1.10.SHAP$shap_score[[i]]): the standard deviation is zero
## Warning in cor(as.matrix(training.nominal[, 1]),
## XGB_model_1.10.SHAP$shap_score[[i]]): the standard deviation is zero
## Warning in cor(as.matrix(training.nominal[, 1]),
## XGB_model_1.10.SHAP$shap_score[[i]]): the standard deviation is zero
# merge data sets for plot
XGB_model_1.10.SHAP$mean_shap_score %>%
as.data.table(keep.rownames = TRUE) %>% setnames(c("rn", "SHAP")) %>% setorder(rn) %>%
.[XGB_model_1.10.SHAP.Corr2, on="rn"] %>%
.[XGB_model_1.10.SHAP.Means[, c("rn", "SMD")], on="rn"] -> XGB_model_1.10.SHAP.Diagnostics
# generate plot
XGB_model_1.10.SHAP.Diagnostics %>%
#.[SMD > .5] %>%
ggplot(aes(x=SHAP, y=corr, size=SMD, colors = rn)) +
geom_point(alpha=0.5) +
scale_size(range = c(.1, 24), name="Geometric") -> XGB_model_1.10.SHAP.Diagnostics.Plot
ggplotly(XGB_model_1.10.SHAP.Diagnostics.Plot)
x: SHAP - mean SHAP-estimated feature effect y: corr - correlation between Yield and SHAP-estimated feature effect w (bubble size): standardized difference between I-Br-Cl strata wrt SHAP-estimated feature effects * SMD is a unitless measure of standard deviations; it controls for scale, therefore providing a global picture of discriminatory relevance wrt each feature
Next (as done by Doyle et al) we construct and compare Halogen-specific models wrt binary prediction. _____________________
First, the data is prepared…
# create training/test data; separate by chloride, bromide, iodide
labeled.training.stratified <- nominal.table.labeled[training,]
labeled.test.stratified <- nominal.table.labeled[-training,]
# generate input data - training
labeled.training.stratified.Cl <- labeled.training.stratified[Halogen=="ArCl"]
labeled.training.stratified.Br <- labeled.training.stratified[Halogen=="ArBr"]
labeled.training.stratified.Ix <- labeled.training.stratified[Halogen=="ArI"]
# generate input data - test
labeled.test.stratified.Cl <- labeled.test.stratified[Halogen=="ArCl"]
labeled.test.stratified.Br <- labeled.test.stratified[Halogen=="ArBr"]
labeled.test.stratified.Ix <- labeled.test.stratified[Halogen=="ArI"]
Next, our models are executed and saved…
Then, OOS performance is evaluated for each.
# generate diagnostic predictions for training data
#Cl
XGB_model_1.11_stratified.Cl.Predict.Test <- xgboost:::predict.xgb.Booster(XGB_model_1.11_stratified.Cl,
labeled.test.stratified.Cl[, 5:(ncol(labeled.test.stratified.Cl)-2)] %>% as.matrix())
#Br
XGB_model_1.11_stratified.Br.Predict.Test <- xgboost:::predict.xgb.Booster(XGB_model_1.11_stratified.Br,
labeled.test.stratified.Br[, 5:(ncol(labeled.test.stratified.Br)-2)] %>% as.matrix())
#Ix
XGB_model_1.11_stratified.Ix.Predict.Test <- xgboost:::predict.xgb.Booster(XGB_model_1.11_stratified.Ix,
labeled.test.stratified.Ix[, 5:(ncol(labeled.test.stratified.Ix)-2)] %>% as.matrix())
# model AUC
#Cl
plot(pROC::roc(predictor = XGB_model_1.11_stratified.Cl.Predict.Test,
response = labeled.test.stratified.Cl[,4] %>% as.matrix(),
levels=c(0, 1)),
lwd=1.5,
auc.polygon=TRUE,
print.auc=TRUE)
## Warning in roc.default(predictor = XGB_model_1.11_stratified.Cl.Predict.Test, :
## Deprecated use a matrix as response. Unexpected results may be produced, please
## pass a vector or factor.
## Setting direction: controls < cases
#Br
plot(pROC::roc(predictor = XGB_model_1.11_stratified.Br.Predict.Test,
response = labeled.test.stratified.Br[,4] %>% as.matrix(),
levels=c(0, 1)),
lwd=1.5,
auc.polygon=TRUE,
print.auc=TRUE)
## Warning in roc.default(predictor = XGB_model_1.11_stratified.Br.Predict.Test, :
## Deprecated use a matrix as response. Unexpected results may be produced, please
## pass a vector or factor.
## Setting direction: controls < cases
#Ix
plot(pROC::roc(predictor = XGB_model_1.11_stratified.Ix.Predict.Test,
response = labeled.test.stratified.Ix[,4] %>% as.matrix(),
levels=c(0, 1)),
lwd=1.5,
auc.polygon=TRUE,
print.auc=TRUE)
## Warning in roc.default(predictor = XGB_model_1.11_stratified.Ix.Predict.Test, :
## Deprecated use a matrix as response. Unexpected results may be produced, please
## pass a vector or factor.
## Setting direction: controls < cases
Once again, OOS AUC is absurdly high – with no special tuning or tweaking of parameters. ___________________________
Next, we generate the SHAP effect matrices and evaluate classification performance.
# generate SHAP model
#Cl
XGB_model_1.11_stratified.Cl.SHAP <- shap.values(xgb_model = XGB_model_1.11_stratified.Cl,
X_train = labeled.test.stratified.Cl[, 5:(ncol(labeled.test.stratified.Cl)-2)] %>% as.matrix())
#Br
XGB_model_1.11_stratified.Br.SHAP <- shap.values(xgb_model = XGB_model_1.11_stratified.Br,
X_train = labeled.test.stratified.Br[, 5:(ncol(labeled.test.stratified.Br)-2)] %>% as.matrix())
#Ix
XGB_model_1.11_stratified.Ix.SHAP <- shap.values(xgb_model = XGB_model_1.11_stratified.Ix,
X_train = labeled.test.stratified.Ix[, 5:(ncol(labeled.test.stratified.Ix)-2)] %>% as.matrix())
# eval classification performance
# generate prediction
#Cl
XGB_model_1.11_stratified.Cl.SHAP.Bin <- rowSums(XGB_model_1.11_stratified.Cl.SHAP$shap_score) ## generate prediction
XGB_model_1.11_stratified.Cl.SHAP.Bin <- ifelse(XGB_model_1.11_stratified.Cl.SHAP.Bin < 0, "YieldInsig", "YieldPositive") %>% factor()
XGB_model_1.11_stratified.Cl.SHAP.Eval <- caret::confusionMatrix(XGB_model_1.11_stratified.Cl.SHAP.Bin,
labeled.test.stratified.Cl[, 4] %>% as.matrix() %>% factor(labels = c("YieldInsig", "YieldPositive")), positive = "YieldPositive")
#Br
XGB_model_1.11_stratified.Br.SHAP.Bin <- rowSums(XGB_model_1.11_stratified.Br.SHAP$shap_score) ## generate prediction
XGB_model_1.11_stratified.Br.SHAP.Bin <- ifelse(XGB_model_1.11_stratified.Br.SHAP.Bin < 0, "YieldInsig", "YieldPositive") %>% factor()
XGB_model_1.11_stratified.Br.SHAP.Eval <- caret::confusionMatrix(XGB_model_1.11_stratified.Br.SHAP.Bin,
labeled.test.stratified.Br[, 4] %>% as.matrix() %>% factor(labels = c("YieldInsig", "YieldPositive")), positive = "YieldPositive")
#Ix
XGB_model_1.11_stratified.Ix.SHAP.Bin <- rowSums(XGB_model_1.11_stratified.Ix.SHAP$shap_score) ## generate prediction
XGB_model_1.11_stratified.Ix.SHAP.Bin <- ifelse(XGB_model_1.11_stratified.Ix.SHAP.Bin < 0, "YieldInsig", "YieldPositive") %>% factor()
XGB_model_1.11_stratified.Ix.SHAP.Eval <- caret::confusionMatrix(XGB_model_1.11_stratified.Ix.SHAP.Bin,
labeled.test.stratified.Ix[, 4] %>% as.matrix() %>% factor(labels = c("YieldInsig", "YieldPositive")), positive = "YieldPositive")
#view
#Cl
(XGB_model_1.11_stratified.Cl.SHAP.Eval)
## Confusion Matrix and Statistics
##
## Reference
## Prediction YieldInsig YieldPositive
## YieldInsig 229 10
## YieldPositive 29 141
##
## Accuracy : 0.9046
## 95% CI : (0.872, 0.9313)
## No Information Rate : 0.6308
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8005
##
## Mcnemar's Test P-Value : 0.003948
##
## Sensitivity : 0.9338
## Specificity : 0.8876
## Pos Pred Value : 0.8294
## Neg Pred Value : 0.9582
## Prevalence : 0.3692
## Detection Rate : 0.3447
## Detection Prevalence : 0.4156
## Balanced Accuracy : 0.9107
##
## 'Positive' Class : YieldPositive
##
#Br
(XGB_model_1.11_stratified.Br.SHAP.Eval)
## Confusion Matrix and Statistics
##
## Reference
## Prediction YieldInsig YieldPositive
## YieldInsig 55 15
## YieldPositive 16 318
##
## Accuracy : 0.9233
## 95% CI : (0.8929, 0.9473)
## No Information Rate : 0.8243
## P-Value [Acc > NIR] : 7.254e-09
##
## Kappa : 0.7337
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9550
## Specificity : 0.7746
## Pos Pred Value : 0.9521
## Neg Pred Value : 0.7857
## Prevalence : 0.8243
## Detection Rate : 0.7871
## Detection Prevalence : 0.8267
## Balanced Accuracy : 0.8648
##
## 'Positive' Class : YieldPositive
##
#Ix
(XGB_model_1.11_stratified.Ix.SHAP.Eval)
## Confusion Matrix and Statistics
##
## Reference
## Prediction YieldInsig YieldPositive
## YieldInsig 27 13
## YieldPositive 8 326
##
## Accuracy : 0.9439
## 95% CI : (0.9154, 0.9649)
## No Information Rate : 0.9064
## P-Value [Acc > NIR] : 0.005574
##
## Kappa : 0.689
##
## Mcnemar's Test P-Value : 0.382733
##
## Sensitivity : 0.9617
## Specificity : 0.7714
## Pos Pred Value : 0.9760
## Neg Pred Value : 0.6750
## Prevalence : 0.9064
## Detection Rate : 0.8717
## Detection Prevalence : 0.8930
## Balanced Accuracy : 0.8665
##
## 'Positive' Class : YieldPositive
##
OOS classification results are moderately impressive, with balanced accuracy in the 86-91% range.
# generate SHAP visualization
#Cl
shap.plot.summary.wrap2(shap_score = XGB_model_1.11_stratified.Cl.SHAP$shap_score,
X = labeled.test.stratified.Cl[, 5:(ncol(labeled.test.stratified.Cl)-2)] %>% as.matrix(),
top_n = 20)
#Br
shap.plot.summary.wrap2(shap_score = XGB_model_1.11_stratified.Br.SHAP$shap_score,
X = labeled.test.stratified.Br[, 5:(ncol(labeled.test.stratified.Br)-2)] %>% as.matrix(),
top_n = 20)
#Ix
shap.plot.summary.wrap2(shap_score = XGB_model_1.11_stratified.Ix.SHAP$shap_score,
X = labeled.test.stratified.Ix[, 5:(ncol(labeled.test.stratified.Ix)-2)] %>% as.matrix(),
top_n = 20)
Quick visualization shows meaningful differences between Halogens wrt feature effects driving classification. _________________
Next, we will explore this more directly…
#2. Cl, I, Br: SHAP vs SHAP vs SHAP - color=sig: IvCl, IvBr, ClvBr, All
# generate merged data set
rbind(XGB_model_1.11_stratified.Cl.SHAP$shap_score[, "Halogen" := "ArCl"],
XGB_model_1.11_stratified.Br.SHAP$shap_score[, "Halogen" := "ArBr"],
XGB_model_1.11_stratified.Ix.SHAP$shap_score[, "Halogen" := "ArIx"]) -> XGB_model_1.11_stratified.SHAP2
# generate means test table
# create indices for test variables
means_indx <-XGB_model_1.11_stratified.SHAP2[, 1:(ncol(XGB_model_1.11_stratified.SHAP2)-1)] %>% names()
# setup and execute stats tables
XGB_model_1.11_stratified.SHAP.Means <- CreateTableOne(vars = means_indx,
strata = "Halogen" ,
data = XGB_model_1.11_stratified.SHAP2)
#extract data matrix
XGB_model_1.11_stratified.SHAP.Means <- print(XGB_model_1.11_stratified.SHAP.Means$ContTable, smd=TRUE, quote = FALSE, noSpaces = TRUE)
## Stratified by Halogen
## ArBr ArCl
## n 404 409
## additive_*C3_NMR_shift (mean (SD)) 0.32 (0.83) -0.04 (0.19)
## additive_*C3_electrostatic_charge (mean (SD)) 0.06 (0.19) -0.05 (0.23)
## additive_*C4_NMR_shift (mean (SD)) 0.11 (0.16) 0.02 (0.16)
## additive_*C4_electrostatic_charge (mean (SD)) 0.19 (0.44) 0.03 (0.24)
## additive_*C5_NMR_shift (mean (SD)) 0.05 (0.08) 0.00 (0.14)
## additive_*C5_electrostatic_charge (mean (SD)) 0.10 (0.24) -0.01 (0.13)
## additive_*N1_electrostatic_charge (mean (SD)) 0.07 (0.10) -0.01 (0.09)
## additive_*O1_electrostatic_charge (mean (SD)) 0.23 (0.66) 0.04 (0.43)
## additive_E_HOMO (mean (SD)) 0.16 (0.36) 0.00 (0.27)
## additive_E_LUMO (mean (SD)) 0.09 (0.19) -0.04 (0.65)
## additive_V1_frequency (mean (SD)) 0.14 (0.24) -0.03 (0.14)
## additive_V1_intensity (mean (SD)) 0.05 (0.07) 0.00 (0.12)
## additive_dipole_moment (mean (SD)) 0.13 (0.41) 0.00 (0.20)
## additive_electronegativity (mean (SD)) 0.03 (0.07) -0.01 (0.09)
## additive_hardness (mean (SD)) 0.03 (0.06) 0.00 (0.10)
## additive_molecular_volume (mean (SD)) 0.10 (0.20) -0.02 (0.22)
## additive_molecular_weight (mean (SD)) 0.05 (0.09) -0.02 (0.09)
## additive_ovality (mean (SD)) 0.04 (0.11) 0.00 (0.06)
## additive_surface_area (mean (SD)) 0.03 (0.05) -0.02 (0.10)
## aryl_halide_*C1_NMR_shift (mean (SD)) 0.07 (0.41) 0.09 (1.77)
## aryl_halide_*C1_electrostatic_charge (mean (SD)) 0.05 (0.29) 0.09 (0.50)
## aryl_halide_*C2_NMR_shift (mean (SD)) 0.03 (0.14) 0.05 (0.28)
## aryl_halide_*C2_electrostatic_charge (mean (SD)) 0.01 (0.07) 0.02 (0.14)
## aryl_halide_*C3_NMR_shift (mean (SD)) 0.03 (0.16) 0.00 (0.10)
## aryl_halide_*C3_electrostatic_charge (mean (SD)) 0.00 (0.02) 0.01 (0.06)
## aryl_halide_*C4_NMR_shift (mean (SD)) 0.01 (0.10) 0.00 (0.06)
## aryl_halide_*C4_electrostatic_charge (mean (SD)) 0.01 (0.05) 0.00 (0.02)
## aryl_halide_*H2_NMR_shift (mean (SD)) 0.03 (0.07) -0.01 (0.15)
## aryl_halide_*H2_electrostatic_charge (mean (SD)) 0.00 (0.01) 0.00 (0.00)
## aryl_halide_*H3_NMR_shift (mean (SD)) 0.04 (0.06) 0.01 (0.04)
## aryl_halide_*H3_electrostatic_charge (mean (SD)) 0.00 (0.01) 0.00 (0.00)
## aryl_halide_E_HOMO (mean (SD)) 0.08 (0.18) -0.04 (0.49)
## aryl_halide_E_LUMO (mean (SD)) 0.04 (0.08) -0.02 (0.24)
## aryl_halide_V1_frequency (mean (SD)) 0.01 (0.02) 0.00 (0.00)
## aryl_halide_V1_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.02)
## aryl_halide_V2_frequency (mean (SD)) 0.02 (0.04) 0.00 (0.01)
## aryl_halide_V2_intensity (mean (SD)) 0.00 (0.01) 0.01 (0.03)
## aryl_halide_V3_frequency (mean (SD)) 0.04 (0.18) 0.00 (0.01)
## aryl_halide_V3_intensity (mean (SD)) 0.01 (0.04) 0.00 (0.00)
## aryl_halide_dipole_moment (mean (SD)) 0.02 (0.04) 0.00 (0.01)
## aryl_halide_electronegativity (mean (SD)) 0.00 (0.00) -0.01 (0.11)
## aryl_halide_hardness (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## aryl_halide_molecular_volume (mean (SD)) 0.02 (0.09) 0.00 (0.00)
## aryl_halide_molecular_weight (mean (SD)) 0.01 (0.05) 0.00 (0.00)
## aryl_halide_ovality (mean (SD)) 0.01 (0.03) 0.00 (0.00)
## aryl_halide_surface_area (mean (SD)) 0.00 (0.01) 0.00 (0.00)
## base_*N1_electrostatic_charge (mean (SD)) 0.14 (0.49) -0.05 (0.38)
## base_E_HOMO (mean (SD)) 0.11 (0.30) -0.02 (0.20)
## base_E_LUMO (mean (SD)) 0.06 (0.15) -0.01 (0.10)
## base_dipole_moment (mean (SD)) 0.03 (0.08) 0.00 (0.05)
## base_electronegativity (mean (SD)) 0.01 (0.04) 0.00 (0.03)
## base_hardness (mean (SD)) 0.04 (0.20) -0.01 (0.15)
## base_molecular_volume (mean (SD)) 0.03 (0.11) -0.01 (0.08)
## base_molecular_weight (mean (SD)) 0.01 (0.06) 0.00 (0.04)
## base_ovality (mean (SD)) 0.01 (0.02) 0.00 (0.02)
## base_surface_area (mean (SD)) 0.00 (0.01) 0.00 (0.01)
## ligand_*C10_NMR_shift (mean (SD)) 0.05 (0.17) -0.01 (0.55)
## ligand_*C10_electrostatic_charge (mean (SD)) 0.01 (0.10) -0.01 (0.18)
## ligand_*C11_NMR_shift (mean (SD)) 0.13 (0.62) -0.05 (0.22)
## ligand_*C11_electrostatic_charge (mean (SD)) 0.06 (0.31) -0.03 (0.08)
## ligand_*C12_NMR_shift (mean (SD)) 0.02 (0.10) -0.01 (0.05)
## ligand_*C12_electrostatic_charge (mean (SD)) 0.01 (0.03) 0.00 (0.11)
## ligand_*C13_NMR_shift (mean (SD)) 0.00 (0.02) 0.00 (0.05)
## ligand_*C13_electrostatic_charge (mean (SD)) 0.01 (0.05) 0.00 (0.02)
## ligand_*C14_NMR_shift (mean (SD)) 0.01 (0.02) 0.00 (0.01)
## ligand_*C14_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C15_NMR_shift (mean (SD)) 0.00 (0.03) -0.01 (0.08)
## ligand_*C15_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C16_NMR_shift (mean (SD)) 0.01 (0.09) -0.01 (0.05)
## ligand_*C16_electrostatic_charge (mean (SD)) 0.01 (0.05) 0.00 (0.02)
## ligand_*C17_NMR_shift (mean (SD)) 0.00 (0.01) 0.00 (0.02)
## ligand_*C17_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.01)
## ligand_*C1_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.01)
## ligand_*C1_electrostatic_charge (mean (SD)) 0.00 (0.02) 0.00 (0.01)
## ligand_*C2_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C2_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C3_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C3_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C4_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C4_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C5_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C5_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C6_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C6_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C7_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C7_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C8_NMR_shift (mean (SD)) 0.00 (0.01) 0.00 (0.00)
## ligand_*C8_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C9_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*C9_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*H11_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*H11_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*H3_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*H3_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*H4_NMR_shift (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*H4_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*H9_NMR_shift (mean (SD)) 0.00 (0.01) 0.00 (0.00)
## ligand_*H9_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_*P1_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V10_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V10_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V1_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V1_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V2_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V2_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V3_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V3_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V4_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V4_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V5_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V5_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V6_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V6_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V7_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V7_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V8_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V8_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V9_frequency (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_V9_intensity (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## ligand_dipole_moment (mean (SD)) 0.00 (0.00) 0.00 (0.00)
## Stratified by Halogen
## ArIx p test
## n 374
## additive_*C3_NMR_shift (mean (SD)) 0.44 (0.78) <0.001
## additive_*C3_electrostatic_charge (mean (SD)) 0.09 (0.17) <0.001
## additive_*C4_NMR_shift (mean (SD)) 0.05 (0.06) <0.001
## additive_*C4_electrostatic_charge (mean (SD)) 0.10 (0.16) <0.001
## additive_*C5_NMR_shift (mean (SD)) 0.07 (0.18) <0.001
## additive_*C5_electrostatic_charge (mean (SD)) 0.17 (0.24) <0.001
## additive_*N1_electrostatic_charge (mean (SD)) 0.06 (0.09) <0.001
## additive_*O1_electrostatic_charge (mean (SD)) 0.30 (0.52) <0.001
## additive_E_HOMO (mean (SD)) 0.26 (0.35) <0.001
## additive_E_LUMO (mean (SD)) 0.09 (0.13) <0.001
## additive_V1_frequency (mean (SD)) 0.03 (0.15) <0.001
## additive_V1_intensity (mean (SD)) 0.05 (0.06) <0.001
## additive_dipole_moment (mean (SD)) 0.28 (0.47) <0.001
## additive_electronegativity (mean (SD)) 0.03 (0.03) <0.001
## additive_hardness (mean (SD)) 0.07 (0.10) <0.001
## additive_molecular_volume (mean (SD)) 0.22 (0.25) <0.001
## additive_molecular_weight (mean (SD)) 0.10 (0.10) <0.001
## additive_ovality (mean (SD)) 0.04 (0.09) <0.001
## additive_surface_area (mean (SD)) 0.05 (0.06) <0.001
## aryl_halide_*C1_NMR_shift (mean (SD)) 0.07 (0.14) 0.938
## aryl_halide_*C1_electrostatic_charge (mean (SD)) 0.23 (0.60) <0.001
## aryl_halide_*C2_NMR_shift (mean (SD)) 0.11 (0.30) <0.001
## aryl_halide_*C2_electrostatic_charge (mean (SD)) 0.06 (0.15) <0.001
## aryl_halide_*C3_NMR_shift (mean (SD)) 0.03 (0.12) <0.001
## aryl_halide_*C3_electrostatic_charge (mean (SD)) 0.03 (0.12) <0.001
## aryl_halide_*C4_NMR_shift (mean (SD)) 0.00 (0.01) 0.073
## aryl_halide_*C4_electrostatic_charge (mean (SD)) 0.02 (0.09) <0.001
## aryl_halide_*H2_NMR_shift (mean (SD)) 0.00 (0.01) <0.001
## aryl_halide_*H2_electrostatic_charge (mean (SD)) 0.02 (0.05) <0.001
## aryl_halide_*H3_NMR_shift (mean (SD)) 0.01 (0.08) <0.001
## aryl_halide_*H3_electrostatic_charge (mean (SD)) 0.01 (0.04) <0.001
## aryl_halide_E_HOMO (mean (SD)) 0.02 (0.04) <0.001
## aryl_halide_E_LUMO (mean (SD)) 0.01 (0.02) <0.001
## aryl_halide_V1_frequency (mean (SD)) 0.01 (0.02) <0.001
## aryl_halide_V1_intensity (mean (SD)) 0.00 (0.01) <0.001
## aryl_halide_V2_frequency (mean (SD)) 0.01 (0.02) <0.001
## aryl_halide_V2_intensity (mean (SD)) 0.00 (0.00) 0.001
## aryl_halide_V3_frequency (mean (SD)) 0.00 (0.01) <0.001
## aryl_halide_V3_intensity (mean (SD)) 0.00 (0.00) <0.001
## aryl_halide_dipole_moment (mean (SD)) 0.00 (0.00) <0.001
## aryl_halide_electronegativity (mean (SD)) 0.00 (0.01) 0.001
## aryl_halide_hardness (mean (SD)) 0.00 (0.00) <0.001
## aryl_halide_molecular_volume (mean (SD)) 0.00 (0.00) <0.001
## aryl_halide_molecular_weight (mean (SD)) 0.00 (0.01) <0.001
## aryl_halide_ovality (mean (SD)) 0.00 (0.00) <0.001
## aryl_halide_surface_area (mean (SD)) 0.00 (0.00) <0.001
## base_*N1_electrostatic_charge (mean (SD)) 0.11 (0.38) <0.001
## base_E_HOMO (mean (SD)) 0.10 (0.27) <0.001
## base_E_LUMO (mean (SD)) 0.05 (0.13) <0.001
## base_dipole_moment (mean (SD)) 0.03 (0.07) <0.001
## base_electronegativity (mean (SD)) 0.01 (0.03) <0.001
## base_hardness (mean (SD)) 0.05 (0.18) <0.001
## base_molecular_volume (mean (SD)) 0.03 (0.09) <0.001
## base_molecular_weight (mean (SD)) 0.02 (0.05) <0.001
## base_ovality (mean (SD)) 0.01 (0.02) <0.001
## base_surface_area (mean (SD)) 0.00 (0.01) <0.001
## ligand_*C10_NMR_shift (mean (SD)) 0.07 (0.12) 0.002
## ligand_*C10_electrostatic_charge (mean (SD)) 0.08 (0.21) <0.001
## ligand_*C11_NMR_shift (mean (SD)) 0.26 (0.58) <0.001
## ligand_*C11_electrostatic_charge (mean (SD)) 0.14 (0.30) <0.001
## ligand_*C12_NMR_shift (mean (SD)) 0.07 (0.17) <0.001
## ligand_*C12_electrostatic_charge (mean (SD)) 0.01 (0.05) 0.019
## ligand_*C13_NMR_shift (mean (SD)) 0.01 (0.02) 0.010
## ligand_*C13_electrostatic_charge (mean (SD)) 0.03 (0.08) <0.001
## ligand_*C14_NMR_shift (mean (SD)) 0.02 (0.04) <0.001
## ligand_*C14_electrostatic_charge (mean (SD)) 0.00 (0.01) <0.001
## ligand_*C15_NMR_shift (mean (SD)) 0.00 (0.02) 0.012
## ligand_*C15_electrostatic_charge (mean (SD)) 0.00 (0.00) <0.001
## ligand_*C16_NMR_shift (mean (SD)) 0.03 (0.07) <0.001
## ligand_*C16_electrostatic_charge (mean (SD)) 0.02 (0.04) <0.001
## ligand_*C17_NMR_shift (mean (SD)) 0.00 (0.01) 0.011
## ligand_*C17_electrostatic_charge (mean (SD)) 0.00 (0.01) <0.001
## ligand_*C1_NMR_shift (mean (SD)) 0.00 (0.01) <0.001
## ligand_*C1_electrostatic_charge (mean (SD)) 0.01 (0.02) <0.001
## ligand_*C2_NMR_shift (mean (SD)) 0.00 (0.00) <0.001
## ligand_*C2_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.010
## ligand_*C3_NMR_shift (mean (SD)) 0.00 (0.00) <0.001
## ligand_*C3_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.188
## ligand_*C4_NMR_shift (mean (SD)) 0.00 (0.00) 0.002
## ligand_*C4_electrostatic_charge (mean (SD)) 0.00 (0.00) <0.001
## ligand_*C5_NMR_shift (mean (SD)) 0.00 (0.00) <0.001
## ligand_*C5_electrostatic_charge (mean (SD)) 0.00 (0.00) <0.001
## ligand_*C6_NMR_shift (mean (SD)) 0.00 (0.00) <0.001
## ligand_*C6_electrostatic_charge (mean (SD)) 0.00 (0.00) <0.001
## ligand_*C7_NMR_shift (mean (SD)) 0.00 (0.00) 0.339
## ligand_*C7_electrostatic_charge (mean (SD)) 0.00 (0.00) <0.001
## ligand_*C8_NMR_shift (mean (SD)) 0.00 (0.01) <0.001
## ligand_*C8_electrostatic_charge (mean (SD)) 0.00 (0.00) <0.001
## ligand_*C9_NMR_shift (mean (SD)) 0.00 (0.00) <0.001
## ligand_*C9_electrostatic_charge (mean (SD)) 0.00 (0.00) NaN
## ligand_*H11_NMR_shift (mean (SD)) 0.00 (0.00) <0.001
## ligand_*H11_electrostatic_charge (mean (SD)) 0.00 (0.00) NaN
## ligand_*H3_NMR_shift (mean (SD)) 0.00 (0.00) 0.574
## ligand_*H3_electrostatic_charge (mean (SD)) 0.00 (0.00) 0.001
## ligand_*H4_NMR_shift (mean (SD)) 0.00 (0.00) 0.002
## ligand_*H4_electrostatic_charge (mean (SD)) 0.00 (0.00) NaN
## ligand_*H9_NMR_shift (mean (SD)) 0.00 (0.00) <0.001
## ligand_*H9_electrostatic_charge (mean (SD)) 0.00 (0.00) <0.001
## ligand_*P1_electrostatic_charge (mean (SD)) 0.00 (0.00) <0.001
## ligand_V10_frequency (mean (SD)) 0.00 (0.00) <0.001
## ligand_V10_intensity (mean (SD)) 0.00 (0.00) 0.058
## ligand_V1_frequency (mean (SD)) 0.00 (0.00) <0.001
## ligand_V1_intensity (mean (SD)) 0.00 (0.00) <0.001
## ligand_V2_frequency (mean (SD)) 0.00 (0.00) NaN
## ligand_V2_intensity (mean (SD)) 0.00 (0.00) NaN
## ligand_V3_frequency (mean (SD)) 0.00 (0.00) <0.001
## ligand_V3_intensity (mean (SD)) 0.00 (0.00) NaN
## ligand_V4_frequency (mean (SD)) 0.00 (0.00) <0.001
## ligand_V4_intensity (mean (SD)) 0.00 (0.00) NaN
## ligand_V5_frequency (mean (SD)) 0.00 (0.00) 0.018
## ligand_V5_intensity (mean (SD)) 0.00 (0.00) NaN
## ligand_V6_frequency (mean (SD)) 0.00 (0.00) NaN
## ligand_V6_intensity (mean (SD)) 0.00 (0.00) NaN
## ligand_V7_frequency (mean (SD)) 0.00 (0.00) NaN
## ligand_V7_intensity (mean (SD)) 0.00 (0.00) <0.001
## ligand_V8_frequency (mean (SD)) 0.00 (0.00) <0.001
## ligand_V8_intensity (mean (SD)) 0.00 (0.00) <0.001
## ligand_V9_frequency (mean (SD)) 0.00 (0.00) 0.006
## ligand_V9_intensity (mean (SD)) 0.00 (0.00) NaN
## ligand_dipole_moment (mean (SD)) 0.00 (0.00) NaN
## Stratified by Halogen
## SMD
## n
## additive_*C3_NMR_shift (mean (SD)) 0.534
## additive_*C3_electrostatic_charge (mean (SD)) 0.464
## additive_*C4_NMR_shift (mean (SD)) 0.419
## additive_*C4_electrostatic_charge (mean (SD)) 0.353
## additive_*C5_NMR_shift (mean (SD)) 0.337
## additive_*C5_electrostatic_charge (mean (SD)) 0.615
## additive_*N1_electrostatic_charge (mean (SD)) 0.575
## additive_*O1_electrostatic_charge (mean (SD)) 0.340
## additive_E_HOMO (mean (SD)) 0.544
## additive_E_LUMO (mean (SD)) 0.194
## additive_V1_frequency (mean (SD)) 0.610
## additive_V1_intensity (mean (SD)) 0.318
## additive_dipole_moment (mean (SD)) 0.507
## additive_electronegativity (mean (SD)) 0.361
## additive_hardness (mean (SD)) 0.528
## additive_molecular_volume (mean (SD)) 0.718
## additive_molecular_weight (mean (SD)) 0.826
## additive_ovality (mean (SD)) 0.335
## additive_surface_area (mean (SD)) 0.612
## aryl_halide_*C1_NMR_shift (mean (SD)) 0.012
## aryl_halide_*C1_electrostatic_charge (mean (SD)) 0.238
## aryl_halide_*C2_NMR_shift (mean (SD)) 0.231
## aryl_halide_*C2_electrostatic_charge (mean (SD)) 0.235
## aryl_halide_*C3_NMR_shift (mean (SD)) 0.179
## aryl_halide_*C3_electrostatic_charge (mean (SD)) 0.208
## aryl_halide_*C4_NMR_shift (mean (SD)) 0.102
## aryl_halide_*C4_electrostatic_charge (mean (SD)) 0.278
## aryl_halide_*H2_NMR_shift (mean (SD)) 0.294
## aryl_halide_*H2_electrostatic_charge (mean (SD)) 0.429
## aryl_halide_*H3_NMR_shift (mean (SD)) 0.355
## aryl_halide_*H3_electrostatic_charge (mean (SD)) 0.363
## aryl_halide_E_HOMO (mean (SD)) 0.330
## aryl_halide_E_LUMO (mean (SD)) 0.355
## aryl_halide_V1_frequency (mean (SD)) 0.361
## aryl_halide_V1_intensity (mean (SD)) 0.301
## aryl_halide_V2_frequency (mean (SD)) 0.520
## aryl_halide_V2_intensity (mean (SD)) 0.219
## aryl_halide_V3_frequency (mean (SD)) 0.303
## aryl_halide_V3_intensity (mean (SD)) 0.312
## aryl_halide_dipole_moment (mean (SD)) 0.515
## aryl_halide_electronegativity (mean (SD)) 0.191
## aryl_halide_hardness (mean (SD)) 0.306
## aryl_halide_molecular_volume (mean (SD)) 0.320
## aryl_halide_molecular_weight (mean (SD)) 0.343
## aryl_halide_ovality (mean (SD)) 0.332
## aryl_halide_surface_area (mean (SD)) 0.314
## base_*N1_electrostatic_charge (mean (SD)) 0.309
## base_E_HOMO (mean (SD)) 0.355
## base_E_LUMO (mean (SD)) 0.354
## base_dipole_moment (mean (SD)) 0.358
## base_electronegativity (mean (SD)) 0.370
## base_hardness (mean (SD)) 0.256
## base_molecular_volume (mean (SD)) 0.298
## base_molecular_weight (mean (SD)) 0.285
## base_ovality (mean (SD)) 0.299
## base_surface_area (mean (SD)) 0.289
## ligand_*C10_NMR_shift (mean (SD)) 0.165
## ligand_*C10_electrostatic_charge (mean (SD)) 0.340
## ligand_*C11_NMR_shift (mean (SD)) 0.447
## ligand_*C11_electrostatic_charge (mean (SD)) 0.471
## ligand_*C12_NMR_shift (mean (SD)) 0.434
## ligand_*C12_electrostatic_charge (mean (SD)) 0.137
## ligand_*C13_NMR_shift (mean (SD)) 0.142
## ligand_*C13_electrostatic_charge (mean (SD)) 0.425
## ligand_*C14_NMR_shift (mean (SD)) 0.423
## ligand_*C14_electrostatic_charge (mean (SD)) 0.412
## ligand_*C15_NMR_shift (mean (SD)) 0.110
## ligand_*C15_electrostatic_charge (mean (SD)) 0.358
## ligand_*C16_NMR_shift (mean (SD)) 0.402
## ligand_*C16_electrostatic_charge (mean (SD)) 0.433
## ligand_*C17_NMR_shift (mean (SD)) 0.118
## ligand_*C17_electrostatic_charge (mean (SD)) 0.406
## ligand_*C1_NMR_shift (mean (SD)) 0.270
## ligand_*C1_electrostatic_charge (mean (SD)) 0.426
## ligand_*C2_NMR_shift (mean (SD)) 0.194
## ligand_*C2_electrostatic_charge (mean (SD)) 0.147
## ligand_*C3_NMR_shift (mean (SD)) 0.201
## ligand_*C3_electrostatic_charge (mean (SD)) 0.078
## ligand_*C4_NMR_shift (mean (SD)) 0.144
## ligand_*C4_electrostatic_charge (mean (SD)) 0.314
## ligand_*C5_NMR_shift (mean (SD)) 0.265
## ligand_*C5_electrostatic_charge (mean (SD)) 0.241
## ligand_*C6_NMR_shift (mean (SD)) 0.450
## ligand_*C6_electrostatic_charge (mean (SD)) 0.199
## ligand_*C7_NMR_shift (mean (SD)) 0.050
## ligand_*C7_electrostatic_charge (mean (SD)) 0.332
## ligand_*C8_NMR_shift (mean (SD)) 0.420
## ligand_*C8_electrostatic_charge (mean (SD)) 0.335
## ligand_*C9_NMR_shift (mean (SD)) 0.343
## ligand_*C9_electrostatic_charge (mean (SD)) <0.001
## ligand_*H11_NMR_shift (mean (SD)) 0.285
## ligand_*H11_electrostatic_charge (mean (SD)) <0.001
## ligand_*H3_NMR_shift (mean (SD)) 0.047
## ligand_*H3_electrostatic_charge (mean (SD)) 0.124
## ligand_*H4_NMR_shift (mean (SD)) 0.121
## ligand_*H4_electrostatic_charge (mean (SD)) <0.001
## ligand_*H9_NMR_shift (mean (SD)) 0.414
## ligand_*H9_electrostatic_charge (mean (SD)) 0.420
## ligand_*P1_electrostatic_charge (mean (SD)) 0.316
## ligand_V10_frequency (mean (SD)) 0.406
## ligand_V10_intensity (mean (SD)) 0.080
## ligand_V1_frequency (mean (SD)) 0.415
## ligand_V1_intensity (mean (SD)) 0.246
## ligand_V2_frequency (mean (SD)) <0.001
## ligand_V2_intensity (mean (SD)) <0.001
## ligand_V3_frequency (mean (SD)) 0.381
## ligand_V3_intensity (mean (SD)) <0.001
## ligand_V4_frequency (mean (SD)) 0.199
## ligand_V4_intensity (mean (SD)) <0.001
## ligand_V5_frequency (mean (SD)) 0.127
## ligand_V5_intensity (mean (SD)) <0.001
## ligand_V6_frequency (mean (SD)) <0.001
## ligand_V6_intensity (mean (SD)) <0.001
## ligand_V7_frequency (mean (SD)) <0.001
## ligand_V7_intensity (mean (SD)) 0.210
## ligand_V8_frequency (mean (SD)) 0.231
## ligand_V8_intensity (mean (SD)) 0.277
## ligand_V9_frequency (mean (SD)) 0.106
## ligand_V9_intensity (mean (SD)) <0.001
## ligand_dipole_moment (mean (SD)) <0.001
#clean up and convert to data table
colnames(XGB_model_1.11_stratified.SHAP.Means) <- paste0(colnames(XGB_model_1.11_stratified.SHAP.Means), " (n=", as.numeric(XGB_model_1.11_stratified.SHAP.Means[1,]), ")")
colnames(XGB_model_1.11_stratified.SHAP.Means) <- c(colnames(XGB_model_1.11_stratified.SHAP.Means)[1:3], "p", "test", "SMD")
XGB_model_1.11_stratified.SHAP.Means <- XGB_model_1.11_stratified.SHAP.Means[-1,]
XGB_model_1.11_stratified.SHAP.Means <- XGB_model_1.11_stratified.SHAP.Means %>% as.data.table(keep.rownames = TRUE)
# drop parentheticals
XGB_model_1.11_stratified.SHAP.Means[, 1:4 := lapply(.SD, function(x){tstrsplit(x=x, split=" ", fixed=TRUE, keep = 1L) %>% unlist() }), .SDcols=1:4]
# drop text column
XGB_model_1.11_stratified.SHAP.Means[, test := NULL]
# convert to numeric
XGB_model_1.11_stratified.SHAP.Means[, 2:6 := lapply(.SD, as.numeric), .SDcols=2:6]
## Warning in lapply(.SD, as.numeric): NAs introduced by coercion
## Warning in lapply(.SD, as.numeric): NAs introduced by coercion
# fix NA and NaN
XGB_model_1.11_stratified.SHAP.Means[is.nan(p), p := 999]
XGB_model_1.11_stratified.SHAP.Means[is.na(p), p := -0.001]
XGB_model_1.11_stratified.SHAP.Means[is.na(SMD), SMD := -0.001]
## now do charts
XGB_model_1.11_stratified.SHAP.Means %>%
plot_ly(x= ~`ArBr (n=404)`,
y= ~`ArIx (n=374)`,
z= ~`ArCl (n=409)`,
sizemode='area', size= ~SMD*SMD, #sizeref=.001,
color= ~SMD, colorscale = 'Viridis',
type='scatter3d', mode='markers',
hoverinfo = 'text',
text = ~paste(rn, '</br></br>',
"ArBr=", `ArBr (n=404)`, " (Δlog-RR)", '</br>',
"ArIx=", `ArIx (n=374)`, " (Δlog-RR)", '</br>',
"ArCl=", `ArCl (n=409)`, " (Δlog-RR)", '</br>',
"SMD=", SMD))
## Warning: `line.width` does not currently support multiple values.
## Warning: 'scatter3d' objects don't have these attributes: 'sizemode', 'colorscale'
## Valid attributes include:
## 'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'meta', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'x', 'y', 'z', 'text', 'texttemplate', 'hovertext', 'hovertemplate', 'mode', 'surfaceaxis', 'surfacecolor', 'projection', 'connectgaps', 'line', 'marker', 'textposition', 'textfont', 'hoverinfo', 'error_x', 'error_y', 'error_z', 'xcalendar', 'ycalendar', 'zcalendar', 'scene', 'idssrc', 'customdatasrc', 'metasrc', 'xsrc', 'ysrc', 'zsrc', 'textsrc', 'texttemplatesrc', 'hovertextsrc', 'hovertemplatesrc', 'textpositionsrc', 'hoverinfosrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
Here, SMD is presented as both bubble size and color: the larger the size and more intense the color, the larger the significance of differences. _______________
Finally, a diagnostic regression to quantify the differential impact of each Halogen wrt reaction probability.
#data
XGB_model_1.11_stratified.SHAP.lm <- data.table(Yield = labeled.training.stratified$yield_data,
SHAP = rowSums(XGB_model_1.10.SHAP$shap_score[, 1:(ncol(XGB_model_1.10.SHAP$shap_score)-2)]),
Halide = labeled.training.stratified$Halide,
Halogen = labeled.training.stratified$Halogen)
#lm test - null
lm(data = XGB_model_1.11_stratified.SHAP.lm, "Yield ~ SHAP") %>% summary()
##
## Call:
## lm(formula = "Yield ~ SHAP", data = XGB_model_1.11_stratified.SHAP.lm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.602 -12.552 -0.579 9.282 55.926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.53067 0.36652 53.29 <2e-16 ***
## SHAP 3.74683 0.05231 71.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.21 on 2766 degrees of freedom
## Multiple R-squared: 0.6497, Adjusted R-squared: 0.6496
## F-statistic: 5131 on 1 and 2766 DF, p-value: < 2.2e-16
#lm test - Halogen
lm(data = XGB_model_1.11_stratified.SHAP.lm, "Yield ~ SHAP+Halogen") %>% summary()
##
## Call:
## lm(formula = "Yield ~ SHAP+Halogen", data = XGB_model_1.11_stratified.SHAP.lm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.004 -12.410 -0.453 9.053 55.780
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.23856 0.64300 28.365 <2e-16 ***
## SHAP 3.80427 0.06528 58.280 <2e-16 ***
## HalogenArCl 2.02133 0.87324 2.315 0.0207 *
## HalogenArI 1.20339 0.75809 1.587 0.1125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.2 on 2764 degrees of freedom
## Multiple R-squared: 0.6505, Adjusted R-squared: 0.6501
## F-statistic: 1715 on 3 and 2764 DF, p-value: < 2.2e-16
#lm test - Halide
lm(data = XGB_model_1.11_stratified.SHAP.lm, "Yield ~ SHAP+Halide") %>% summary()
##
## Call:
## lm(formula = "Yield ~ SHAP+Halide", data = XGB_model_1.11_stratified.SHAP.lm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.372 -8.362 0.632 8.445 46.277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.3710 1.1162 7.499 8.60e-14 ***
## SHAP 3.8037 0.0733 51.889 < 2e-16 ***
## HalideCF3.Cl 3.6931 1.5579 2.371 0.01783 *
## HalideCF3.I -1.3331 1.5042 -0.886 0.37557
## HalideEt.Br 12.5797 1.5318 8.213 3.29e-16 ***
## HalideEt.Cl 14.5574 1.6914 8.607 < 2e-16 ***
## HalideEt.I 13.1761 1.5186 8.676 < 2e-16 ***
## HalideOMe.Br 6.4480 1.5391 4.189 2.88e-05 ***
## HalideOMe.Cl 18.4050 1.7297 10.640 < 2e-16 ***
## HalideOMe.I 4.7457 1.5183 3.126 0.00179 **
## Halidepyr2.Br 15.9268 1.5086 10.557 < 2e-16 ***
## Halidepyr2.Cl 15.1312 1.5242 9.927 < 2e-16 ***
## Halidepyr2.I 16.4447 1.5463 10.635 < 2e-16 ***
## Halidepyr3.Br 14.8383 1.5432 9.615 < 2e-16 ***
## Halidepyr3.Cl 7.7631 1.5648 4.961 7.44e-07 ***
## Halidepyr3.I 22.4757 1.5025 14.959 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.73 on 2752 degrees of freedom
## Multiple R-squared: 0.7122, Adjusted R-squared: 0.7106
## F-statistic: 454 on 15 and 2752 DF, p-value: < 2.2e-16
Aggregate SHAP value (log-RR of reaction yield > .10) alone performs remarkably well. Even more remarkably, at this level Halogen classification has little impact on predictive performance. This change, however, once things are broken down to the more granular Halide level: R-squared = .71, beating some ML models. _____________________________
Summary: We have now established that that the effect matrix generated by SHAP based on a binary classification task adequately encodes the mechanistic information contained in and implied by the descriptor data. This is demonstrated via several feature effect analyses and confirmed via linear regression wrt yield. In the latter, the SHAP-derived effect matrix is shown to provide robust prediction under various scenarios, including aggregation. Based on these results, next steps will be to apply the SHAP effect matrix to the original task championed by Doyle et al – random forest regression. Recall, that Doyle et al replaced all nominal descriptor data, with scaled (normalized) data. Doyle et al are not the only group who has found discriptor data troublesome absent normalization. Here, we now evaluate the degree to which SHAP data may potentially serve as “synthetic data” to resolve such issues. _____________________________
RandomForest, Revisted First, we setup and execute the model.
# we replace "training.scaled" with SHAP effect matrix generated in first run
# setup input vars
rfFit.y <- training.nominal[, 1] %>% as.matrix() %>% as.vector()
rfFit.x <- XGB_model_1.10.SHAP$shap_score[, 1:(ncol(XGB_model_1.10.SHAP$shap_score)-2)] %>% as.matrix()
# setup dummy
# plotObsVsPred(extractPrediction(list(train(rfFit.y ~ ., data=rfFit.x, trControl=train_control, method="lm"))))
# random forest
set.seed(8915)
rfFit_SHAP <- train(y = rfFit.y, x=rfFit.x, trControl=train_control, method="rf", importance=TRUE)
# generate and display dx plot
png(filename="R/plots/rf_SHAP.png", width = 1000, height = 600)
predVals <- extractPrediction(list(rfFit_SHAP))
plotObsVsPred(predVals)
dev.off()
## quartz_off_screen
## 2
#dispaly
magick::image_read("R/plots/rf_SHAP.png") %>% plot()
#save
saveRDS(rfFit, "rds/rfFit_SHAP.rds")
Diagnostic chart looks promising. Lets now compare performance to prior models by Doyle et al.
# generate test data for SHAP
XGB_model_1.10.SHAP.test_data <- shap.values(xgb_model = XGB_model_1.10, X_train = as.matrix(test.nominal[, -(1:4)]))
# ============================================================================
# Read in previously trained models saved as .rds files
# ============================================================================
# Run to read in previously trained models
knnFit <- readRDS("rds/knnFit.rds")
#svmFit <- readRDS("rds/svmFit.rds")
bayesglmFit <- readRDS("rds/bayesglmFit.rds")
lmFit <- readRDS("rds/lmFit.rds")
nnetFit.100nodes <- readRDS("rds/nnetFit_100nodes.rds")
rfFit <- readRDS("rds/rfFit.rds")
# ============================================================================
# RMSE and R^2 plot for different machine learning models
# ============================================================================
# Predict for testing set
# lm.pred <- predict(lmFit, test.scaled[, -(2:4)])
# #svm.pred <- predict(svmFit, test.scaled)
# knn.pred <- predict(knnFit, test.scaled)
# nnet.pred <- compute(nnetFit.100nodes, test.scaled[, 4:nrow(test.scaled)])$net.result
# bayesglm.pred <- predict(bayesglmFit, test.scaled)
# rf.pred <- predict(rfFit, test.scaled)
rf.pred.SHAP <- predict(rfFit_SHAP, XGB_model_1.10.SHAP.test_data$shap_score)
# Set negative predictions to zero
# lm.pred[lm.pred<0] <- 0
# svm.pred[svm.pred<0] <- 0
# knn.pred[knn.pred<0] <- 0
# nnet.pred[nnet.pred<0] <- 0
# bayesglm.pred[bayesglm.pred<0] <- 0
# rf.pred[rf.pred<0] <- 0
rf.pred.SHAP[rf.pred.SHAP<0] <- 0
# R^2 values
# lm.pred.r2 <- cor(lm.pred, test.scaled$yield)**2
# #svm.pred.r2 <- cor(svm.pred, test.scaled$yield)**2
# knn.pred.r2 <- cor(knn.pred, test.scaled$yield)**2
# nnet.pred.r2 <- cor(nnet.pred, test.scaled$yield)**2
# bayesglm.pred.r2 <- cor(bayesglm.pred, test.scaled$yield)**2
# rf.pred.r2 <- cor(rf.pred, test.scaled$yield)**2
rf.pred.SHAP.r2 <- cor(rf.pred.SHAP, test.nominal$yield_data)**2
# RMSE
# lm.pred.rmse <- rmse(lm.pred, test.scaled$yield)
# #svm.pred.rmse <- rmse(svm.pred, test.scaled$yield)
# knn.pred.rmse <- rmse(knn.pred, test.scaled$yield)
# nnet.pred.rmse <- rmse(nnet.pred, test.scaled$yield)
# bayesglm.pred.rmse <- rmse(bayesglm.pred, test.scaled$yield)
# rf.pred.rmse <- rmse(rf.pred, test.scaled$yield)
rf.pred.SHAP.rmse <- rmse(rf.pred.SHAP, test.nominal$yield_data)
# Plot RMSE and R^2, dropping SVN
df <- data.frame(rmse = c(lm.pred.rmse, knn.pred.rmse, nnet.pred.rmse, bayesglm.pred.rmse, rf.pred.rmse, rf.pred.SHAP.rmse), # svm.pred.rmse,
r2 = c(lm.pred.r2, knn.pred.r2, nnet.pred.r2, bayesglm.pred.r2, rf.pred.r2, rf.pred.SHAP.r2)) # svm.pred.r2,
row.names(df) <- c('Linear Model', 'kNN', 'Neural Network', 'Bayes GLM', 'Random Forest (Scaled)', "Random Forest (SHAP)") #'SVM'
rmse.plot <- ggplot(df, aes(y=reorder(rownames(df), rmse), x=rmse)) +
geom_point() +
geom_text(label=round(df$rmse, 1), vjust=-1, size=3) +
labs(x='RMSE', y='') +
xlim(0,20)
r2.plot <- ggplot(df, aes(y=reorder(rownames(df), rmse), x=r2)) +
geom_point() +
geom_text(label=round(df$r2, 2), vjust=-1, size=3) +
labs(x='Rsquared', y='') +
xlim(0.6,1)
plots <- arrangeGrob(rmse.plot, r2.plot, ncol=2)
ggsave(plots, file="R/plots/ml_models_SHAP.png", width=7, height=2.5)
# dpslay
magick::image_read("R/plots/ml_models_SHAP.png") %>% plot()
Most impressively, we find that RF using the SHAP data – which was derived from a binary predictive task – performs almost as well as the champion RF model Keep in mind that the XGBoost model used to generate the SHAP data was not “tuned” for performance and the predictive objective (yield < 10%) was arbitrary.
Taken together, this suggests a bright future for SHAP-derived “synthetic data”.
FIN _________
Acknowledgements.
PSCRIPT _________
Given certain generous assumptions, the significance of the SHAP experiment above can be reframed as follows: 1. Chemical descriptors generated via theoretical calculations are sufficient for mapping the associative structure of a chemical system wrt lab-measured yield. 2. XGBoost, an automated mechanism for performing additive logistic regression on a stagewise basis, can be used to reliably predict yield on a binary basis (0,!0). 3. Any mathematical structure (equation or matrix of equations) that reliably predicts an outcome can be reliably assumed to represent a mechanistically valid map. 4. SHAP, which generates an additive log-likelihood model based on the deductive Shapley heuristic, maps a probabilistic hypersurface that is valid at any level of differentiation. 5. Predictive robustness alone is insufficient to assume validity wrt all features and all feature combinations; however, this is (proven) guaranteed by the Shapley value. 6. For each local case, SHAP encodes the full scope of associative hierarchies into a compact 1D linear structure that represents the log-RR effect of each feature wrt each local case. 7. Ergo: the “effect matrix” generated by SHAP can be treated as an independently viable system for elucidation of synthetic mechanisms via secondary transformations and analyses.
PPSCRIPT _________
We tease this now a bit by examineb various alternative aggregations of the feature effect matrix wrt prediction of yield. 1. We assume that negative causation is independent from and mathematically orthogonal to positive causation. 2. We therefore, for each case, square each feature effect and sum according to its original sign – negatives grouped with negatives, positives with positives. 3. We then take the square root of each such sum and collect together with other intermediate/alternative metrics: 1) original SHAP sum, 2) sum of negatives, 3) sum of positives, 4) sum of negatives, 5) squared sum of negatives, 6) squared sum of positives, and 7) difference between squared sums of positives and negatives. 4. Finally, for each of the above, we run simple regressions wrt yield and compare R-squared and RMSE.
# First, define qubit matrix per 1-7 above
XGB_model_1.10.SHAP.Qubit <- data.table()
# SumTotal
XGB_model_1.10.SHAP.Qubit$SumTotal <- XGB_model_1.10.SHAP$shap_score[, rowSums(.SD), .SDcols=1:(ncol(XGB_model_1.10.SHAP$shap_score)-2)]
# SumPos
sapply(1:nrow(XGB_model_1.10.SHAP$shap_score), function(ii){
XGB_model_1.10.SHAP$shap_score[, 1:120][ii,.SD, .SDcols = as.vector(XGB_model_1.10.SHAP$shap_score[ii, 1:120]>0)] %>%
as.numeric() %>%
sum()
}) -> XGB_model_1.10.SHAP.Qubit$SumPos
# SumNeg
sapply(1:nrow(XGB_model_1.10.SHAP$shap_score), function(ii){
XGB_model_1.10.SHAP$shap_score[, 1:120][ii,.SD, .SDcols = as.vector(XGB_model_1.10.SHAP$shap_score[ii, 1:120]<0)] %>%
as.numeric() %>%
sum()
}) -> XGB_model_1.10.SHAP.Qubit$SumNeg
# SqSumNeg
sapply(1:nrow(XGB_model_1.10.SHAP$shap_score), function(ii){
XGB_model_1.10.SHAP$shap_score[, 1:120][ii,sapply(.SD, function(x){as.numeric(x)^2}), .SDcols = XGB_model_1.10.SHAP$shap_score[ii, 1:120]<0] %>%
sum() %>%
sqrt()
}) -> XGB_model_1.10.SHAP.Qubit$SqSumNeg
# SqSumPos
sapply(1:nrow(XGB_model_1.10.SHAP$shap_score), function(ii){
XGB_model_1.10.SHAP$shap_score[, 1:120][ii,sapply(.SD, function(x){as.numeric(x)^2}), .SDcols = XGB_model_1.10.SHAP$shap_score[ii, 1:120]>0] %>%
sum() %>%
sqrt()
}) -> XGB_model_1.10.SHAP.Qubit$SqSumPos
# DiffSqSum
XGB_model_1.10.SHAP.Qubit[, "DiffSqSum" := as.numeric(SqSumPos) - as.numeric(SqSumNeg)]
#add yield
XGB_model_1.10.SHAP.Qubit <- cbind(Yield = XGB_model_1.10.SHAP.Corr$Yield, XGB_model_1.10.SHAP.Qubit)
#process
XGB_model_1.10.SHAP.Qubit <- XGB_model_1.10.SHAP.Qubit[, lapply(.SD, as.numeric)]
#run lm
lm(Yield ~ SumTotal, data = XGB_model_1.10.SHAP.Qubit) %>% summary()
##
## Call:
## lm(formula = Yield ~ SumTotal, data = XGB_model_1.10.SHAP.Qubit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.602 -12.552 -0.579 9.282 55.926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.53067 0.36652 53.29 <2e-16 ***
## SumTotal 3.74683 0.05231 71.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.21 on 2766 degrees of freedom
## Multiple R-squared: 0.6497, Adjusted R-squared: 0.6496
## F-statistic: 5131 on 1 and 2766 DF, p-value: < 2.2e-16
lm(Yield ~ SumPos, data = XGB_model_1.10.SHAP.Qubit) %>% summary()
##
## Call:
## lm(formula = Yield ~ SumPos, data = XGB_model_1.10.SHAP.Qubit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.133 -11.759 -0.688 8.643 61.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.70937 0.73026 -18.77 <2e-16 ***
## SumPos 6.74607 0.09413 71.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.21 on 2766 degrees of freedom
## Multiple R-squared: 0.65, Adjusted R-squared: 0.6499
## F-statistic: 5137 on 1 and 2766 DF, p-value: < 2.2e-16
lm(Yield ~ SumNeg, data = XGB_model_1.10.SHAP.Qubit) %>% summary()
##
## Call:
## lm(formula = Yield ~ SumNeg, data = XGB_model_1.10.SHAP.Qubit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.001 -12.536 -1.588 10.533 50.939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.8350 0.5131 114.67 <2e-16 ***
## SumNeg 7.7422 0.1211 63.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.4 on 2766 degrees of freedom
## Multiple R-squared: 0.5966, Adjusted R-squared: 0.5964
## F-statistic: 4090 on 1 and 2766 DF, p-value: < 2.2e-16
lm(Yield ~ SqSumNeg, data = XGB_model_1.10.SHAP.Qubit) %>% summary()
##
## Call:
## lm(formula = Yield ~ SqSumNeg, data = XGB_model_1.10.SHAP.Qubit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.882 -13.300 -1.835 11.081 62.093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.3079 0.5857 106.38 <2e-16 ***
## SqSumNeg -26.7186 0.4443 -60.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.03 on 2766 degrees of freedom
## Multiple R-squared: 0.5666, Adjusted R-squared: 0.5665
## F-statistic: 3616 on 1 and 2766 DF, p-value: < 2.2e-16
lm(Yield ~ SqSumPos, data = XGB_model_1.10.SHAP.Qubit) %>% summary()
##
## Call:
## lm(formula = Yield ~ SqSumPos, data = XGB_model_1.10.SHAP.Qubit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.797 -11.660 -1.416 9.703 65.193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.3182 0.8932 -16.03 <2e-16 ***
## SqSumPos 26.9090 0.4605 58.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.33 on 2766 degrees of freedom
## Multiple R-squared: 0.5525, Adjusted R-squared: 0.5523
## F-statistic: 3415 on 1 and 2766 DF, p-value: < 2.2e-16
lm(Yield ~ DiffSqSum, data = XGB_model_1.10.SHAP.Qubit) %>% summary()
##
## Call:
## lm(formula = Yield ~ DiffSqSum, data = XGB_model_1.10.SHAP.Qubit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.031 -12.066 -0.894 10.030 53.472
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.8682 0.3525 64.87 <2e-16 ***
## DiffSqSum 15.1677 0.2196 69.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.59 on 2766 degrees of freedom
## Multiple R-squared: 0.6331, Adjusted R-squared: 0.6329
## F-statistic: 4772 on 1 and 2766 DF, p-value: < 2.2e-16
lm(Yield ~ ., data = XGB_model_1.10.SHAP.Qubit) %>% summary()
##
## Call:
## lm(formula = Yield ~ ., data = XGB_model_1.10.SHAP.Qubit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.337 -11.253 -0.471 8.301 56.922
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.3277 2.7571 -2.295 0.0218 *
## SumTotal -4.3281 0.6786 -6.378 2.1e-10 ***
## SumPos 11.7002 0.9873 11.850 < 2e-16 ***
## SumNeg NA NA NA NA
## SqSumNeg -17.5367 1.8093 -9.693 < 2e-16 ***
## SqSumPos -3.9552 1.2848 -3.078 0.0021 **
## DiffSqSum NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.8 on 2763 degrees of freedom
## Multiple R-squared: 0.6675, Adjusted R-squared: 0.667
## F-statistic: 1387 on 4 and 2763 DF, p-value: < 2.2e-16
lm(Yield ~ SumTotal+DiffSqSum, data = XGB_model_1.10.SHAP.Qubit) %>% summary()
##
## Call:
## lm(formula = Yield ~ SumTotal + DiffSqSum, data = XGB_model_1.10.SHAP.Qubit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.419 -12.413 -0.543 9.424 55.274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.0553 0.4190 47.869 <2e-16 ***
## SumTotal 3.0849 0.2622 11.766 <2e-16 ***
## DiffSqSum 2.7703 1.0752 2.577 0.01 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.2 on 2765 degrees of freedom
## Multiple R-squared: 0.6506, Adjusted R-squared: 0.6503
## F-statistic: 2574 on 2 and 2765 DF, p-value: < 2.2e-16
# # Plot RMSE and R^2, dropping SVN
# df <- data.frame(rmse = c(lm.pred.rmse, knn.pred.rmse, nnet.pred.rmse, bayesglm.pred.rmse, rf.pred.rmse, rf.pred.SHAP.rmse), # svm.pred.rmse,
# r2 = c(lm.pred.r2, knn.pred.r2, nnet.pred.r2, bayesglm.pred.r2, rf.pred.r2, rf.pred.SHAP.r2)) # svm.pred.r2,
# row.names(df) <- c('Linear Model', 'kNN', 'Neural Network', 'Bayes GLM', 'Random Forest (Scaled)', "Random Forest (SHAP)") #'SVM'
#
# rmse.plot <- ggplot(df, aes(y=reorder(rownames(df), rmse), x=rmse)) +
# geom_point() +
# geom_text(label=round(df$rmse, 1), vjust=-1, size=3) +
# labs(x='RMSE', y='') +
# xlim(0,20)
# r2.plot <- ggplot(df, aes(y=reorder(rownames(df), rmse), x=r2)) +
# geom_point() +
# geom_text(label=round(df$r2, 2), vjust=-1, size=3) +
# labs(x='Rsquared', y='') +
# xlim(0.6,1)
# plots <- arrangeGrob(rmse.plot, r2.plot, ncol=2)
# ggsave(plots, file="R/plots/ml_models_SHAP.png", width=7, height=2.5)
#
# # dpslay
# magick::image_read("R/plots/ml_models_SHAP.png") %>% plot()
#
Wild. Geometric mean holds up well compared to other aggregations, but loses out to the additive mean in a straight comparison. Experimentation along this direction worthwhile, assuming an automated means for separating postive and negative causal structures can be identified. For example, perhaps given unknown uncertainties wrt SHAP effect esitmations (local), it might be better to determine positive or negative causal bucket based on overall aggregate score then apply uniformly to all cases
PPPSCRIPT _________
The previous note begged the question as to how one should go about performing feature-wise disaggregation. Recall that the justification for such an approach is based on the way in which the Shapley value is mathematically defined: Every possible constellation of features is assumed to be a valid “game” and the overall system is accordingly optimized: Data for this optimization is provided to SHAP within the structure of the XGBoost decision tree itself, where all valid constellations and hiercharcies are tested: The deductive nature of the Shapely calculation (likely effect of exclusion wrt predictive outcome) guarantees that, even if a given feature or feature set is itself not meaningful wrt the (additive) predictive outcome, that the variability encoded within a given feauture column, feature set, or feature set aggregation should perform well in a linear modelling task. Put differently, the effect matrix produced by SHAP represents the effect (wrt a local outcome) of NOT having a given feature present – this mode of reasoning is identical to that taught to physicians (“differntial diagnosis”) and forces the model to assume that all sources of variability are reflected within the data. As a result, the imputed local effect of a feature with low-but-consistent predictive power will incorporate all known/unknown sources of variability impacting the prediction upon which the marginal effect calculation is based.
This was hinted when we showed that the aggregate SHAP score alone performed just as well as the top 20 features wrt yield prediction. Lets now rerun that analysis per the above.
Luckily, Doyle et al make this task easier (operatinally) by having added clear labels to each of the 120 features included in this data set… yay!
#create aggregate SHAP cols
XGB_model_1.10.SHAP$shap_score[, "additive" := rowSums(.SD), .SDcols = names(XGB_model_1.10.SHAP$shap_score) %like% "additive_" ]
XGB_model_1.10.SHAP$shap_score[, "aryl_halide" := rowSums(.SD), .SDcols = names(XGB_model_1.10.SHAP$shap_score) %like% "aryl_halide_" ]
XGB_model_1.10.SHAP$shap_score[, "base" := rowSums(.SD), .SDcols = names(XGB_model_1.10.SHAP$shap_score) %like% "base_" ]
XGB_model_1.10.SHAP$shap_score[, "ligand" := rowSums(.SD), .SDcols = names(XGB_model_1.10.SHAP$shap_score) %like% "ligand_" ]
#build dataset for testing
XGB_model_1.10.SHAP.Qubit[, c("additive", "aryl_halide", "base", "ligand") := XGB_model_1.10.SHAP$shap_score[, c("additive", "aryl_halide", "base", "ligand")]]
#run linear models, one for each, one for all, and sumtotal too
lm(Yield ~ additive, data = XGB_model_1.10.SHAP.Qubit) %>% summary()
##
## Call:
## lm(formula = Yield ~ additive, data = XGB_model_1.10.SHAP.Qubit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.991 -19.128 -4.377 15.234 67.703
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.4355 0.5175 51.09 <2e-16 ***
## additive 5.4286 0.1842 29.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.9 on 2766 degrees of freedom
## Multiple R-squared: 0.2391, Adjusted R-squared: 0.2388
## F-statistic: 869 on 1 and 2766 DF, p-value: < 2.2e-16
lm(Yield ~ aryl_halide, data = XGB_model_1.10.SHAP.Qubit) %>% summary()
##
## Call:
## lm(formula = Yield ~ aryl_halide, data = XGB_model_1.10.SHAP.Qubit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.394 -14.690 -0.782 12.318 59.710
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.49332 0.41900 60.84 <2e-16 ***
## aryl_halide 4.68327 0.09662 48.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.14 on 2766 degrees of freedom
## Multiple R-squared: 0.4593, Adjusted R-squared: 0.4591
## F-statistic: 2349 on 1 and 2766 DF, p-value: < 2.2e-16
lm(Yield ~ base, data = XGB_model_1.10.SHAP.Qubit) %>% summary()
##
## Call:
## lm(formula = Yield ~ base, data = XGB_model_1.10.SHAP.Qubit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.975 -21.204 -5.132 20.789 68.762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.9463 0.5070 61.03 <2e-16 ***
## base 7.9557 0.4005 19.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.63 on 2766 degrees of freedom
## Multiple R-squared: 0.1249, Adjusted R-squared: 0.1246
## F-statistic: 394.7 on 1 and 2766 DF, p-value: < 2.2e-16
lm(Yield ~ ligand, data = XGB_model_1.10.SHAP.Qubit) %>% summary()
##
## Call:
## lm(formula = Yield ~ ligand, data = XGB_model_1.10.SHAP.Qubit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.566 -21.692 -3.376 18.954 68.058
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.3654 0.5072 61.84 <2e-16 ***
## ligand 7.1316 0.3832 18.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.82 on 2766 degrees of freedom
## Multiple R-squared: 0.1113, Adjusted R-squared: 0.111
## F-statistic: 346.4 on 1 and 2766 DF, p-value: < 2.2e-16
lm(Yield ~ additive+aryl_halide+base+ligand, data = XGB_model_1.10.SHAP.Qubit) %>% summary()
##
## Call:
## lm(formula = Yield ~ additive + aryl_halide + base + ligand,
## data = XGB_model_1.10.SHAP.Qubit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.367 -12.650 -0.549 9.485 56.209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.50389 0.36818 52.97 <2e-16 ***
## additive 3.27772 0.12980 25.25 <2e-16 ***
## aryl_halide 3.65632 0.08165 44.78 <2e-16 ***
## base 5.08253 0.25472 19.95 <2e-16 ***
## ligand 4.79013 0.24102 19.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.04 on 2763 degrees of freedom
## Multiple R-squared: 0.6574, Adjusted R-squared: 0.6569
## F-statistic: 1326 on 4 and 2763 DF, p-value: < 2.2e-16
lm(Yield ~ SumTotal, data = XGB_model_1.10.SHAP.Qubit) %>% summary()
##
## Call:
## lm(formula = Yield ~ SumTotal, data = XGB_model_1.10.SHAP.Qubit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.602 -12.552 -0.579 9.282 55.926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.53067 0.36652 53.29 <2e-16 ***
## SumTotal 3.74683 0.05231 71.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.21 on 2766 degrees of freedom
## Multiple R-squared: 0.6497, Adjusted R-squared: 0.6496
## F-statistic: 5131 on 1 and 2766 DF, p-value: < 2.2e-16
While I do not yet have the background to comment on the meaning of the individual models, comparision of the all-inclusive model to the base model reveals overall predictive power to be identifical wrt Yield. Each aggregated feature set is statistically significant, in both individual and all-inclusive models, providing evidence to support prior statments wrt SHAP and the implications of how the Shapley value is defined/calculated wrt expected relationship between hierarchical predictive structure in the XGBoost model and SHAP encoding of such.