Proof of concept for passing in an additional vector to a custom loss function. The basis for this is as follows…
I have a highly skewed binary classification outcome - the problem here can be thought of as 'yield'. There is a gatekeeper funnel that an observation must pass through before it can yield, and not every observation that makes it through the funnel yields. Since the prediction in this scenario will be made prior to the funnel, I cannot include the funnel as a feature. I would like to penalize the model for predicting 'yield' in the post-funnel population when the observation did not yield.
Approach…
1. Predict using the built-in binary_crossentropy loss from Keras (no funnel in the cost function).
2. Predict using a custom loss function that replicates binary_crossentropy (no funnel in the cost function). Compare results with step 1 to confirm the custom loss function is sound before incorporating the funnel.
3. Predict using a modification of step 2 that adds a penalty when the funnel is 1, the actual is 0, and the prediction is 1 (see the sketch after this list). Compare to the results from steps 1 and 2 to ensure it is better (or at least not worse).
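As a rough illustration of step 3 (a plain-R sketch with toy values and hypothetical names, not the Keras code used later): standard cross-entropy is kept as-is, and one extra term is added that only activates for observations that passed the funnel but did not yield, so a high predicted probability is penalized twice for exactly those cases.
yToy <- c(0, 0, 1)       # actual yield
pToy <- c(.3, .3, .9)    # predicted probability of yield
funnelToy <- c(1, 0, 1)  # passed the gatekeeper funnel
# standard binary cross-entropy per observation...
ceToy <- -(yToy*log(pToy) + (1 - yToy)*log(1 - pToy))
# ...plus an extra copy of the -log(1 - p) term only where funnel == 1 and y == 0
penToy <- -funnelToy*(1 - yToy)*log(1 - pToy)
sum(ceToy + penToy)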
Note: I will be using as similar an architecture as possible across all three steps. My initial thought to help ensure funnel lines up with (y_true, y_pred) in the loss function of step 3 is to turn shuffle off in fit() and run a single full-batch pass per epoch (batch_size = nrow(train_x)). This approach will be applied to each of the three models. I will use a large number of epochs (1000) to aid convergence. This setup requires further research to confirm it is sufficient.
library(keras)
set.seed(123)
# Simulate three continuous features and a heavily skewed binary outcome (~10% yield)
df <- data.frame(a = rnorm(1000),
                 b = rnorm(1000),
                 c = rnorm(1000),
                 y = sample(c(0, 1), size = 1000, prob = c(.9, .1), replace = TRUE))
# Funnel indicator: always 1 when y == 1; otherwise 1 with probability .2
funnel <- apply(as.matrix(df$y), MARGIN = 1, FUN = function(y) {
  ifelse(y == 1, 1, sample(c(0, 1), size = 1, prob = c(.8, .2)))
})
# The population passing through the funnel is exceptional, so add 1 to each feature
# when the funnel is 1. The post-funnel population is much less variable, but the yield
# population should have slightly better feature values, so yield adds a small
# additional shift to make those observations stand out marginally.
df$a <- df$a + funnel + .3*df$y
df$b <- df$b + funnel + .4*df$y
df$c <- df$c + funnel + .35*df$y
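A quick, optional check of the simulated data (a sketch, not part of the original steps) can confirm that the funnel contains every yield case and that the feature means shift upward for funnel and yield observations.
# Cross-tabulate funnel vs yield: the funnel == 0 / yield == 1 cell should be empty
table(funnel = funnel, yield = df$y)
# Feature means by funnel/yield group
aggregate(df[, c("a", "b", "c")], by = list(funnel = funnel, yield = df$y), FUN = mean)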
# Define train/test sets
test_ind <- seq(1, length(df$y), 5)   # 1 point in 5 used for testing
train_x <- data.matrix(df[-test_ind, -4])
train_y <- data.matrix(df[-test_ind, 4])
test_x  <- data.matrix(df[test_ind, -4])
test_y  <- data.matrix(df[test_ind, 4])
# Keep only the training rows of funnel, in the same row order as train_x/train_y
funnel  <- data.matrix(funnel[-test_ind])
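Since everything that follows depends on funnel lining up with the training rows, a minimal sanity check (a simple size check only, added here as a suggestion) might be:
# funnel must have one row per training observation, in the same row order as train_x/train_y
stopifnot(nrow(funnel) == nrow(train_x))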
F1 Score
# Custom F1 metric written with backend (k_*) ops so it can be evaluated on tensors
f1 <- function(y_true, y_pred) {
  y_pred <- k_round(y_pred)
  precision <- k_sum(y_pred*y_true)/(k_sum(y_pred) + k_epsilon())
  recall <- k_sum(y_pred*y_true)/(k_sum(y_true) + k_epsilon())
  (2*precision*recall)/(precision + recall + k_epsilon())
}
# set keras seed
use_session_with_seed(345)
# defining a keras sequential model
modelBuiltIn <- keras_model_sequential()
# model architecture
modelBuiltIn %>%
  layer_dense(units = 20, input_shape = ncol(train_x)) %>%
  layer_dropout(rate = 0.25) %>%
  layer_activation(activation = 'relu') %>%
  layer_dense(units = 1) %>%
  layer_activation(activation = 'sigmoid')
# compile with the adam optimizer, tracking binary accuracy and the custom F1 metric
modelBuiltIn %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('binary_accuracy', f1)
)
# fitting the model on the training dataset
modelBuiltIn %>% fit(train_x, train_y,
                     epochs = 1000,
                     batch_size = nrow(train_x),
                     shuffle = FALSE)
# score and produce predictions
scoreBuiltIn <- modelBuiltIn %>% evaluate(test_x, test_y)
predBuiltIn <- modelBuiltIn %>% predict(test_x)
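As an optional cross-check of the custom F1 metric (a sketch; the Keras value can differ slightly because evaluate() averages metrics across its evaluation batches), accuracy and F1 can be recomputed in plain R from the raw test predictions:
# Recompute accuracy and F1 at a 0.5 threshold from the raw predictions
predClass <- as.integer(predBuiltIn > 0.5)
accR <- mean(predClass == test_y)
precR <- sum(predClass*test_y)/max(sum(predClass), 1)
recR <- sum(predClass*test_y)/max(sum(test_y), 1)
data.frame(accR = accR, f1R = 2*precR*recR/(precR + recR))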
# reset keras seed
use_session_with_seed(345)
# Custom loss: binary cross-entropy written as a sum over the (single, full) batch.
# y_pred is clipped away from 0 and 1, as the built-in loss does, to avoid log(0);
# the built-in averages rather than sums, which only changes the scale of the gradient.
loss <- function(y_true, y_pred) {
  y_pred <- k_clip(y_pred, k_epsilon(), 1 - k_epsilon())
  k_sum(-((y_true*k_log(y_pred)) + (1 - y_true)*k_log(1 - y_pred)))
}
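Before training with it, the custom loss can be spot-checked against a plain-R calculation of summed cross-entropy on made-up values (a small sketch; the two numbers should agree to float precision):
# Plain-R summed binary cross-entropy on a toy example
ytToy <- c(0, 1, 1, 0)
ypToy <- c(.1, .8, .6, .3)
sum(-(ytToy*log(ypToy) + (1 - ytToy)*log(1 - ypToy)))
# The same values pushed through the custom Keras loss
k_get_value(loss(k_constant(matrix(ytToy, ncol = 1)),
                 k_constant(matrix(ypToy, ncol = 1))))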
# defining a keras sequential model
modelCust <- keras_model_sequential()
# model architecture
modelCust %>%
  layer_dense(units = 20, input_shape = ncol(train_x)) %>%
  layer_dropout(rate = 0.25) %>%
  layer_activation(activation = 'relu') %>%
  layer_dense(units = 1) %>%
  layer_activation(activation = 'sigmoid')
# compile with the adam optimizer, tracking binary accuracy and the custom F1 metric
modelCust %>% compile(
  loss = loss,
  optimizer = 'adam',
  metrics = c('binary_accuracy', f1)
)
# fitting the model on the training dataset
modelCust %>% fit(train_x, train_y,
                  epochs = 1000,
                  batch_size = nrow(train_x),
                  shuffle = FALSE)
# score and produce predictions
scoreCust <- modelCust %>% evaluate(test_x, test_y)
predCust <- modelCust %>% predict(test_x)
The differences are very near zero, suggesting the custom loss function successfully replicates the built-in binary_crossentropy.
dif <- predBuiltIn - predCust
maxDif <- max(abs(dif))
meanDif <- mean(dif)
data.frame(maxDif = maxDif, meanDif = meanDif)
## maxDif meanDif
## 1 0.0006842911 5.613498e-05
# reset keras seed
use_session_with_seed(345)
# turn the funnel into a Keras constant; its (nrow(train_x), 1) shape matches y_true/y_pred
# in a single full-batch pass
funnelConst <- k_constant(funnel, dtype = 'float32')
# Custom loss (cross-entropy) with funnel: when funnel == 1 and y_true == 0, the
# -log(1 - y_pred) term is counted twice, penalizing predicted yield for post-funnel
# non-yield observations. funnelConst already matches the (n, 1) shape of y_true, so
# the product is elementwise per observation (no transpose, which would broadcast to n x n).
lossFil <- function(y_true, y_pred) {
  y_pred <- k_clip(y_pred, k_epsilon(), 1 - k_epsilon())
  k_sum(-((y_true*k_log(y_pred)) + (1 - y_true)*k_log(1 - y_pred) +
            funnelConst*(1 - y_true)*k_log(1 - y_pred)))
}
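Because the extra term has to apply per observation, it may be worth confirming that the funnel constant has the shape y_true and y_pred will take in a single full-batch pass (a quick check, not in the original write-up):
# Should report (nrow(train_x), 1), i.e. (800, 1) with this train/test split
k_int_shape(funnelConst)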
# defining a keras sequential model
modelCustFil <- keras_model_sequential()
# model architecture
modelCustFil %>%
  layer_dense(units = 20, input_shape = ncol(train_x)) %>%
  layer_dropout(rate = 0.25) %>%
  layer_activation(activation = 'relu') %>%
  layer_dense(units = 1) %>%
  layer_activation(activation = 'sigmoid')
# compile with the adam optimizer, tracking binary accuracy and the custom F1 metric
modelCustFil %>% compile(
  loss = lossFil,
  optimizer = 'adam',
  metrics = c('binary_accuracy', f1)
)
# fitting the model on the training dataset
modelCustFil %>% fit(train_x, train_y,
                     epochs = 1000,
                     batch_size = nrow(train_x),
                     shuffle = FALSE)
# score and produce predictions
scoreCustFil <- modelCustFil %>% evaluate(test_x, test_y)
predCustFil <- modelCustFil %>% predict(test_x)
A mild increase in both accuracy and F1 score suggests there may be value in including the funnel in the loss function; the code above demonstrates one way this might be done.
Accuracy
data.frame(BuiltIn = scoreBuiltIn$binary_accuracy,
           Custom = scoreCust$binary_accuracy,
           CustomWithFunnel = scoreCustFil$binary_accuracy)
## BuiltIn Custom CustomWithFunnel
## 1 0.92 0.92 0.935
F1 Score
# evaluate() reports the custom R metric under the generic name 'python_function'
data.frame(BuiltIn = scoreBuiltIn$python_function,
           Custom = scoreCust$python_function,
           CustomWithFunnel = scoreCustFil$python_function)
## BuiltIn Custom CustomWithFunnel
## 1 0.3123809 0.3123809 0.3230476
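To see where the improvement comes from, the confusion matrices at a 0.5 threshold can be compared across models (a sketch; output not reproduced here). The hope is that the funnel-penalized model trades away a few false positives at little or no cost elsewhere.
# Confusion matrices on the test set for the built-in and funnel-penalized models
table(actual = test_y, predicted = as.integer(predBuiltIn > 0.5))
table(actual = test_y, predicted = as.integer(predCustFil > 0.5))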
Is shuffle = FALSE together with batch_size = nrow(train_x) in fit() sufficient to ensure that funnel and (y_true, y_pred) actually line up to the same observations? Performance suggests they may be aligned correctly, but I would like to dig in further to confirm this is the case.
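One way to probe that question without changing the training setup (a sketch, not something run above) is to add a metric that depends only on y_true and the funnel constant. With a single unshuffled full batch it should report the same value as the corresponding plain-R sum on every epoch; a smaller batch would fail on the shape mismatch, and a shuffled full batch would generally report a different value. This is a necessary rather than sufficient check, but it would catch gross misalignment.
# Alignment probe: compare this metric's reported value against sum(funnel*train_y)
alignCheck <- function(y_true, y_pred) {
  k_sum(funnelConst*y_true)
}
sum(funnel*train_y)   # plain-R reference value
# e.g. add alignCheck to the metrics vector when compiling modelCustFil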