Proof of concept for passing in an additional vector to a custom loss function. The basis for this is as follows…
I have a highly skewed binary classification outcome - the problem here can be thought of as 'yield'. There is a gatekeeper funnel that an observation must pass through before it can yield, and not every observation that makes it through the funnel yields. Since the prediction in this scenario will be made prior to the funnel, I cannot include the funnel as a feature. I would like to penalize the model for predicting 'yield' in the post-funnel population when the observation did not yield.
Approach…
1. Predict using the built-in binary_crossentropy loss from Keras (no funnel in the cost function).
2. Predict using a custom loss function that replicates binary_crossentropy (no funnel in the cost function). Compare results with step 1 to confirm the custom loss function is sound before incorporating the funnel.
3. Predict using a modification of step 2 that adds a penalty when the funnel is 1, the actual is 0, and the prediction is 1 (see the sketch after this list). Compare to the results from steps 1 and 2 to ensure it is better (or at least not worse).
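As a rough illustration of step 3 (a plain-R sketch with toy values and hypothetical names, not the Keras code used later): standard cross-entropy is kept as-is, and one extra term is added that only activates for observations that passed the funnel but did not yield, so a high predicted probability is penalized twice for exactly those cases.
yToy <- c(0, 0, 1)       # actual yield
pToy <- c(.3, .3, .9)    # predicted probability of yield
funnelToy <- c(1, 0, 1)  # passed the gatekeeper funnel
# standard binary cross-entropy per observation...
ceToy <- -(yToy*log(pToy) + (1 - yToy)*log(1 - pToy))
# ...plus an extra copy of the -log(1 - p) term only where funnel == 1 and y == 0
penToy <- -funnelToy*(1 - yToy)*log(1 - pToy)
sum(ceToy + penToy)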
Note: I will be using as similar an architecture as possible across all three steps. My initial thought to help ensure funnel lines up with (y_true, y_pred) in the loss function of step 3 is to turn shuffle off in fit() and run a single full-batch pass per epoch (batch_size = nrow(train_x)). This approach will be applied to each of the three models. I will use a large number of epochs (1000) to aid convergence. This setup requires further research to confirm it is sufficient.
library(keras)
set.seed(123)
# Simulate three continuous features and a heavily skewed binary outcome (~10% yield)
df <- data.frame(a = rnorm(1000),
                 b = rnorm(1000),
                 c = rnorm(1000),
                 y = sample(c(0, 1), size = 1000, prob = c(.9, .1), replace = TRUE))
# Funnel indicator: always 1 when y == 1; otherwise 1 with probability .2
funnel <- apply(as.matrix(df$y), MARGIN = 1, FUN = function(y) {
  ifelse(y == 1, 1, sample(c(0, 1), size = 1, prob = c(.8, .2)))
})
# The population passing through the funnel is exceptional, so add 1 to each feature
# when the funnel is 1. The post-funnel population is much less variable, but the yield
# population should have slightly better feature values, so yield adds a small
# additional shift to make those observations stand out marginally.
df$a <- df$a + funnel + .3*df$y
df$b <- df$b + funnel + .4*df$y
df$c <- df$c + funnel + .35*df$y
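A quick, optional check of the simulated data (a sketch, not part of the original steps) can confirm that the funnel contains every yield case and that the feature means shift upward for funnel and yield observations.
# Cross-tabulate funnel vs yield: the funnel == 0 / yield == 1 cell should be empty
table(funnel = funnel, yield = df$y)
# Feature means by funnel/yield group
aggregate(df[, c("a", "b", "c")], by = list(funnel = funnel, yield = df$y), FUN = mean)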
# Define train/test sets
test_ind <- seq(1, length(df$y), 5)   # 1 point in 5 used for testing
train_x <- data.matrix(df[-test_ind, -4])
train_y <- data.matrix(df[-test_ind, 4])
test_x  <- data.matrix(df[test_ind, -4])
test_y  <- data.matrix(df[test_ind, 4])
# Keep only the training rows of funnel, in the same row order as train_x/train_y
funnel  <- data.matrix(funnel[-test_ind])
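Since everything that follows depends on funnel lining up with the training rows, a minimal sanity check (a simple size check only, added here as a suggestion) might be:
# funnel must have one row per training observation, in the same row order as train_x/train_y
stopifnot(nrow(funnel) == nrow(train_x))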
F1 Score
# Custom F1 metric written with backend (k_*) ops so it can be evaluated on tensors
f1 <- function(y_true, y_pred) {
  y_pred <- k_round(y_pred)
  precision <- k_sum(y_pred*y_true)/(k_sum(y_pred) + k_epsilon())
  recall <- k_sum(y_pred*y_true)/(k_sum(y_true) + k_epsilon())
  (2*precision*recall)/(precision + recall + k_epsilon())
}
# set keras seed
use_session_with_seed(345)
# defining a keras sequential model
modelBuiltIn <- keras_model_sequential()
# model architecture
modelBuiltIn %>%
  layer_dense(units = 20, input_shape = ncol(train_x)) %>%
  layer_dropout(rate = 0.25) %>%
  layer_activation(activation = 'relu') %>%
  layer_dense(units = 1) %>%
  layer_activation(activation = 'sigmoid')
# compile with the adam optimizer, tracking binary accuracy and the custom F1 metric
modelBuiltIn %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('binary_accuracy', f1)
)
# fitting the model on the training dataset
modelBuiltIn %>% fit(train_x, train_y,
                     epochs = 1000,
                     batch_size = nrow(train_x),
                     shuffle = FALSE)
# score and produce predictions
scoreBuiltIn <- modelBuiltIn %>% evaluate(test_x, test_y)
predBuiltIn <- modelBuiltIn %>% predict(test_x)
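As an optional cross-check of the custom F1 metric (a sketch; the Keras value can differ slightly because evaluate() averages metrics across its evaluation batches), accuracy and F1 can be recomputed in plain R from the raw test predictions:
# Recompute accuracy and F1 at a 0.5 threshold from the raw predictions
predClass <- as.integer(predBuiltIn > 0.5)
accR <- mean(predClass == test_y)
precR <- sum(predClass*test_y)/max(sum(predClass), 1)
recR <- sum(predClass*test_y)/max(sum(test_y), 1)
data.frame(accR = accR, f1R = 2*precR*recR/(precR + recR))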
# reset keras seed
use_session_with_seed(345)
# Custom loss: binary cross-entropy written as a sum over the (single, full) batch.
# y_pred is clipped away from 0 and 1, as the built-in loss does, to avoid log(0);
# the built-in averages rather than sums, which only changes the scale of the gradient.
loss <- function(y_true, y_pred) {
  y_pred <- k_clip(y_pred, k_epsilon(), 1 - k_epsilon())
  k_sum(-((y_true*k_log(y_pred)) + (1 - y_true)*k_log(1 - y_pred)))
}
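Before training with it, the custom loss can be spot-checked against a plain-R calculation of summed cross-entropy on made-up values (a small sketch; the two numbers should agree to float precision):
# Plain-R summed binary cross-entropy on a toy example
ytToy <- c(0, 1, 1, 0)
ypToy <- c(.1, .8, .6, .3)
sum(-(ytToy*log(ypToy) + (1 - ytToy)*log(1 - ypToy)))
# The same values pushed through the custom Keras loss
k_get_value(loss(k_constant(matrix(ytToy, ncol = 1)),
                 k_constant(matrix(ypToy, ncol = 1))))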
# defining a keras sequential model
modelCust <- keras_model_sequential()
# model architecture
modelCust %>%
  layer_dense(units = 20, input_shape = ncol(train_x)) %>%
  layer_dropout(rate = 0.25) %>%
  layer_activation(activation = 'relu') %>%
  layer_dense(units = 1) %>%
  layer_activation(activation = 'sigmoid')
# compile with the adam optimizer, tracking binary accuracy and the custom F1 metric
modelCust %>% compile(
  loss = loss,
  optimizer = 'adam',
  metrics = c('binary_accuracy', f1)
)
# fitting the model on the training dataset
modelCust %>% fit(train_x, train_y,
                  epochs = 1000,
                  batch_size = nrow(train_x),
                  shuffle = FALSE)
# score and produce predictions
scoreCust <- modelCust %>% evaluate(test_x, test_y)
predCust <- modelCust %>% predict(test_x)
The differences are very near zero, suggesting the custom loss function successfully replicates the built-in binary_crossentropy.
dif <- predBuiltIn - predCust
maxDif <- max(abs(dif))
meanDif <- mean(dif)
data.frame(maxDif = maxDif, meanDif = meanDif)
## maxDif meanDif
## 1 0.0006842911 5.613498e-05
# reset keras seed
use_session_with_seed(345)
# turn the funnel into a Keras constant; its (nrow(train_x), 1) shape matches y_true/y_pred
# in a single full-batch pass
funnelConst <- k_constant(funnel, dtype = 'float32')
# Custom loss (cross-entropy) with funnel: when funnel == 1 and y_true == 0, the
# -log(1 - y_pred) term is counted twice, penalizing predicted yield for post-funnel
# non-yield observations. funnelConst already matches the (n, 1) shape of y_true, so
# the product is elementwise per observation (no transpose, which would broadcast to n x n).
lossFil <- function(y_true, y_pred) {
  y_pred <- k_clip(y_pred, k_epsilon(), 1 - k_epsilon())
  k_sum(-((y_true*k_log(y_pred)) + (1 - y_true)*k_log(1 - y_pred) +
            funnelConst*(1 - y_true)*k_log(1 - y_pred)))
}
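Because the extra term has to apply per observation, it may be worth confirming that the funnel constant has the shape y_true and y_pred will take in a single full-batch pass (a quick check, not in the original write-up):
# Should report (nrow(train_x), 1), i.e. (800, 1) with this train/test split
k_int_shape(funnelConst)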
# defining a keras sequential model
modelCustFil <- keras_model_sequential()
# model architecture
modelCustFil %>%
  layer_dense(units = 20, input_shape = ncol(train_x)) %>%
  layer_dropout(rate = 0.25) %>%
  layer_activation(activation = 'relu') %>%
  layer_dense(units = 1) %>%
  layer_activation(activation = 'sigmoid')
# compile with the adam optimizer, tracking binary accuracy and the custom F1 metric
modelCustFil %>% compile(
  loss = lossFil,
  optimizer = 'adam',
  metrics = c('binary_accuracy', f1)
)
# fitting the model on the training dataset
modelCustFil %>% fit(train_x, train_y,
                     epochs = 1000,
                     batch_size = nrow(train_x),
                     shuffle = FALSE)
# score and produce predictions
scoreCustFil <- modelCustFil %>% evaluate(test_x, test_y)
predCustFil <- modelCustFil %>% predict(test_x)
A mild increase in both accuracy and F1 score suggests there may be value in including the funnel in the loss function; the code above demonstrates one way this might be done.
Accuracy
data.frame(BuiltIn = scoreBuiltIn$binary_accuracy,
           Custom = scoreCust$binary_accuracy,
           CustomWithFunnel = scoreCustFil$binary_accuracy)
## BuiltIn Custom CustomWithFunnel
## 1 0.92 0.92 0.935
F1 Score
# evaluate() reports the custom R metric under the generic name 'python_function'
data.frame(BuiltIn = scoreBuiltIn$python_function,
           Custom = scoreCust$python_function,
           CustomWithFunnel = scoreCustFil$python_function)
## BuiltIn Custom CustomWithFunnel
## 1 0.3123809 0.3123809 0.3230476
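To see where the improvement comes from, the confusion matrices at a 0.5 threshold can be compared across models (a sketch; output not reproduced here). The hope is that the funnel-penalized model trades away a few false positives at little or no cost elsewhere.
# Confusion matrices on the test set for the built-in and funnel-penalized models
table(actual = test_y, predicted = as.integer(predBuiltIn > 0.5))
table(actual = test_y, predicted = as.integer(predCustFil > 0.5))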
Is shuffle = FALSE together with batch_size = nrow(train_x) in fit() sufficient to ensure that funnel and (y_true, y_pred) actually line up to the same observations? Performance suggests they may be aligned correctly, but I would like to dig in further to confirm this is the case.
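One way to probe that question without changing the training setup (a sketch, not something run above) is to add a metric that depends only on y_true and the funnel constant. With a single unshuffled full batch it should report the same value as the corresponding plain-R sum on every epoch; a smaller batch would fail on the shape mismatch, and a shuffled full batch would generally report a different value. This is a necessary rather than sufficient check, but it would catch gross misalignment.
# Alignment probe: compare this metric's reported value against sum(funnel*train_y)
alignCheck <- function(y_true, y_pred) {
  k_sum(funnelConst*y_true)
}
sum(funnel*train_y)   # plain-R reference value
# e.g. add alignCheck to the metrics vector when compiling modelCustFil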