Variational autoencoders for anomaly detection

Sigrid Keydana, Trivadis
2017/09/16

 

 

Variational autoencoders: Setting the scene

Welcome to the world of deep neural networks

 

Source: [1]

No magic involved: autoencoders

 

  • neither a supervised nor an unsupervised technique, but a self-supervised one
  • not conducive to learning interesting features / abstractions
  • useful for applications such as image denoising and dimensionality reduction for visualization
Source: [2]

 

Anomalies

 

An outlier is… an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism [3]

==> we need a probabilistic approach

Enter: generative stochastic models

Latent variable models

 

Maximize:

\( P(X) = \int P(X|z;\theta) P(z) dz \)

 

Source: [4]

Questions

 

  • How do we get a mapping from easy-to-sample-from \( z \)'s to the empirical output (\( X \))?

  • How do we make sure our \( z \)'s are in the appropriate range to yield the \( X \)'s?

Variational autoencoder

 

  • Sample from the prior \( P(z) = \mathcal N(z|0,I) \)

  • Let the network learn the decoder: \( P(X|z;\theta) = \mathcal N(X|f(z,\theta), \sigma^2 I) \)

  • Let the network learn the encoder: \( Q(z|X) \)

Variational autoencoder objective

 

Maximize the variational lower bound

\[ E_{z \sim Q}[\log P(X|z)] - \mathcal D[Q(z|X)\,||\,P(z)] \]

where the terms are

1) the reconstruction probability

2) the Kullback-Leibler divergence between the approximate posterior and the prior for \( z \)
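
With the usual Gaussian choices, an approximate posterior \( Q(z|X) = \mathcal N(z \mid \mu(X), \sigma^2(X) I) \) learned by the encoder and the standard normal prior \( P(z) \), the divergence term has a closed form; this is what the kl_loss code further below computes (using the mean instead of the sum over latent dimensions):

\[ \mathcal D[Q(z|X)\,||\,P(z)] = -\frac{1}{2} \sum_{j} \left( 1 + \log \sigma_j^2(X) - \mu_j^2(X) - \sigma_j^2(X) \right) \]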

VAE in a nutshell

 

Source: [4]

How can we use this?

 

Generate stuff (e.g., images)

Source: [4]

How can we use this for ANOMALY DETECTION?

 

  • VAE models the distribution, not the values

  • anomalies are seen as coming from a different process / distribution, so…

  • can diagnose as anomalies those cases that have high reconstruction error / low reconstruction probability [5]
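
As a minimal sketch of that idea, using reconstruction error (the trained vae model, the data matrix x_data and the 99% cut-off are illustrative assumptions, not the code used later):

# hypothetical scoring sketch: flag observations with unusually high reconstruction error
reconstructed <- predict(vae, x_data, batch_size = batch_size)
reconstruction_error <- rowMeans((x_data - reconstructed)^2)

# choose a cut-off, e.g. a high quantile of the errors observed on normal data
cutoff <- quantile(reconstruction_error, 0.99)
anomalies <- which(reconstruction_error > cutoff)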

Let's see this in practice!

 

  • MNIST, what else ;-)

  • fraud detection

  • network intrusion detection

Technology employed

 

 

 

Variational-autoencoding MNIST

Ubiquitous MNIST

 

MNIST handwritten digits database

The encoder

 

library(keras)
K <- backend()   # Keras backend, used for the tensor operations below

# latent_dim = 2 and original_dim = 28*28 follow from the slides; the rest are typical example values
original_dim <- 784L
intermediate_dim <- 256L
latent_dim <- 2L
batch_size <- 100L
epsilon_std <- 1.0

# input layer
x <- layer_input(batch_shape = c(batch_size, original_dim))
# hidden intermediate, lower-res
h <- layer_dense(x, intermediate_dim, activation = "relu")
# latent var 1, 2-dim (mainly for plotting!): mean
z_mean <- layer_dense(h, latent_dim)
# latent var 2, 2-dim: log variance
z_log_var <- layer_dense(h, latent_dim)

# reparameterization: draw epsilon from a standard normal, then shift and scale it
sampling <- function(arg) {
  # 1-based slicing of the concatenated (z_mean, z_log_var) tensor
  z_mean <- arg[, 1:latent_dim]
  z_log_var <- arg[, (latent_dim + 1):(2 * latent_dim)]

  epsilon <- K$random_normal(
    shape = c(batch_size, latent_dim),
    mean = 0.,
    stddev = epsilon_std
  )
  z_mean + K$exp(z_log_var / 2) * epsilon
}

# latent vars are sampled: nondeterministic!
z <- layer_concatenate(list(z_mean, z_log_var)) %>%
  layer_lambda(sampling)

The decoder

 

# the latent layer from above
z <- layer_concatenate(list(z_mean, z_log_var)) %>%                            
  layer_lambda(sampling)

# hidden intermediate, higher-res
decoder_h <- layer_dense(units = intermediate_dim, activation = "relu") 
# decoder for the mean, high-res again
decoder_mean <- layer_dense(units = original_dim, activation = "sigmoid")      

h_decoded <- decoder_h(z)
x_decoded_mean <- decoder_mean(h_decoded)

The model

 

# the complete model, from input to decoded output
vae <- keras_model(x, x_decoded_mean)                                          

# cross-entropy: the reconstruction part of the loss function
xent_loss <- function(target, reconstruction) {
  as.double(original_dim) * loss_binary_crossentropy(target, reconstruction)   
}

# Kullback-Leibler divergence: the regularization part of the loss function
kl_loss <- function(target, reconstruction) {
  -0.5*K$mean(1 + z_log_var - K$square(z_mean) - K$exp(z_log_var), axis = -1L)
}

# total VAE loss: reconstruction term plus KL regularizer
vae_loss <- function(x, x_decoded_mean) {
  xent_loss(x, x_decoded_mean) + kl_loss(x, x_decoded_mean)
}

# compile the model with (hopefully) adequate optimizer and learning rate
learning_rate <- 0.001   # example value; not given on the slides
vae %>% compile(
  optimizer = optimizer_rmsprop(lr = learning_rate),
  loss = vae_loss,
  metrics = c(xent_loss, kl_loss)
)
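
Training itself is a standard Keras fit step; a sketch, with MNIST flattened to 784-dimensional rows and scaled to [0, 1] (the number of epochs is an illustrative value):

# load and prepare MNIST: flatten each 28 x 28 image into a 784-dim row, scale to [0, 1]
mnist <- dataset_mnist()
x_train <- array_reshape(mnist$train$x, c(nrow(mnist$train$x), original_dim)) / 255
x_test  <- array_reshape(mnist$test$x, c(nrow(mnist$test$x), original_dim)) / 255

# the VAE is trained to reproduce its own input
vae %>% fit(
  x_train, x_train,
  epochs = 50,
  batch_size = batch_size,
  validation_data = list(x_test, x_test)
)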

MNIST latent space
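
Such a latent-space visualization can be produced from a standalone encoder model that maps inputs to their latent means; a minimal sketch (not necessarily the exact code behind the original figure):

# encoder-only model: from the input to the 2-dim latent means
encoder <- keras_model(x, z_mean)
x_test_encoded <- predict(encoder, x_test, batch_size = batch_size)

# scatterplot of the latent space, colored by digit class
plot(x_test_encoded[, 1], x_test_encoded[, 2],
     col = mnist$test$y + 1, pch = 19,
     xlab = "latent dim 1", ylab = "latent dim 2")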

 

Best and worst reconstructions

 

using reconstruction probability

So which are the anomalies?

 

  • have to define cut-off points
  • as with other machine learning / statistics-based algorithms
  • so is this a problem?
  • the more urgent question really is…

How well does this work with real-world datasets?

 

  • MNIST is 28*28 pixels, grayscale ==> super homogeneous data!
  • best supervised learning error rate as of today (test set): 0.23% ==> super simple!

How about real datasets with different types of variables of different scale?

 

 

Real-world example: fraud detection

Fraudulent Transactions [6]

 

  • data “from an undisclosed source”, anonymized
  • transactions reported by salespeople
    • salesman id
    • product id
    • quantity sold
    • total value of sale
    • fraud yes/no/unknown
  • total transactions 401,146 - of these, 385,414 unknown!

Before we start tweaking model parameters...

 

How to best preprocess the data?

  • product ids will have to be one-hot encoded (but we won't be able to use them all, to avoid an overly sparse matrix)
  • quantity and total value should both be kept (not reduced to unit price!) because each of them may be informative alone
  • however, we need all data to be on similar scale to be able to pick an adequate loss function
  • thus, quantity and total value have to be normalized
  • nonetheless, we have a mix of 0/1 and continuous data - let's see how this goes…
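
A sketch of what this preprocessing could look like (the sales data frame, its column names and the number of product ids kept are illustrative assumptions):

# hypothetical preprocessing sketch for the sales data
# keep only the most frequent product ids, lumping the rest into "other" to limit sparsity
prod <- as.character(sales$Prod)
top_products <- names(sort(table(prod), decreasing = TRUE))[1:50]
prod[!prod %in% top_products] <- "other"
prod_onehot <- model.matrix(~ p - 1, data = data.frame(p = factor(prod)))

# min-max normalize quantity and total value so all columns live on a similar scale
normalize <- function(v) (v - min(v, na.rm = TRUE)) / (max(v, na.rm = TRUE) - min(v, na.rm = TRUE))
X <- cbind(prod_onehot, quant = normalize(sales$Quant), val = normalize(sales$Val))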

Parameterizing the model for fraud detection

 

  • in theory, binary crossentropy should work fine for data between 0 and 1
  • in practice, mean squared error worked better
  • using MSE, also had to change activation of the output (decoded mean) layer to linear
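
Relative to the MNIST setup, only the output activation and the reconstruction term change; a sketch (fraud_original_dim stands for the number of preprocessed input columns, and scaling the MSE term like the cross-entropy term before is an assumption):

# decoded mean now uses a linear activation instead of a sigmoid
decoder_mean <- layer_dense(units = fraud_original_dim, activation = "linear")

# reconstruction part of the loss is now mean squared error
# (the rest of the model definition stays as in the MNIST case)
vae_loss <- function(x, x_decoded_mean) {
  mse_loss <- fraud_original_dim * loss_mean_squared_error(x, x_decoded_mean)
  kl_loss <- -0.5 * K$mean(1 + z_log_var - K$square(z_mean) - K$exp(z_log_var), axis = -1L)
  mse_loss + kl_loss
}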

Training procedure

 

  • not evident how to proceed with the unlabeled (“unknown”) data (= 96% of whole dataset!)
  • decision: train on “unknown” set, use labeled data (fraud or non-fraud) for testing
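
A sketch of this split and of how the per-group reconstruction errors can then be compared (x_unknown, x_fraud and x_ok stand for the preprocessed matrices of the respective labels; a flexible batch dimension, i.e. layer_input(shape = ...), and the number of epochs are assumptions):

# train only on the "unknown" transactions
vae %>% fit(x_unknown, x_unknown, epochs = 50, batch_size = batch_size)

# score labeled transactions by per-row reconstruction error (MSE)
mse_per_row <- function(x) {
  rec <- predict(vae, x)
  rowMeans((x - rec)^2)
}
mean(mse_per_row(x_fraud))  # fraud test set
mean(mse_per_row(x_ok))     # non-fraud test set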

Fraud latent space:

 

Scores on test set

 

Reconstruction error (MSE, in this case):

  • final MSE on training set: 3.50
  • MSE on fraud test set: 3.57
  • MSE on non-fraud test set: 3.49

Problem: small sample sizes (non-fraud: 6013, fraud: 180)

 

 

Real-world example: network intrusion detection

UNSW-NB15 intrusion detection dataset [7,8]

 

  • synthetic dataset
  • 9 types of attack: fuzzers, analysis, backdoors, DoS, exploits, generic, reconnaissance, shellcode, worms
  • 49 features created from raw network traffic captured with tcpdump

Again - what shall we do with the data?

 

  • tried: one-hot encoding protocol, service and state categorical variables
  • tried: binning and one-hot-encoding continuous variables
  • tried: different ways of normalizing continuous variables
  • in the end, what worked best was getting rid of the categorical variables and scaling all others
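
A sketch of the variant that worked best (train and test stand for the data frames read from the dataset's CSV files; the exact columns dropped are an assumption):

# drop the categorical columns, scale all remaining numeric features;
# the test data is scaled with the means / standard deviations estimated on the training data
numeric_cols <- setdiff(names(train), c("id", "proto", "service", "state", "attack_cat", "label"))
x_train <- scale(train[, numeric_cols])
x_test  <- scale(test[, numeric_cols],
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))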

How should we parameterize the model?

 

  • tried different loss functions appropriate for normally distributed data, MSE worked best
  • also used batch normalization and \( l2 \) regularization for all hidden layers, e.g.
h <- layer_dense(x, intermediate_dim)
if (use_batch_normalization) h <- h %>% layer_batch_normalization()
h <- h %>% layer_activation("relu") %>% layer_activity_regularization(l1 = l1, l2 = l2)

Training procedure

 

  • dataset already comes with a training set and a test set
  • use all non-attack records from the training set (= 37,000) for training - that's not too much given the complexity of the data

Network intrusions latent space:

 

Scores on test set

 

Reconstruction error (MSE):

type                   number of cases    MSE
normal (no attack)              56,000   40.4
analysis                         2,000   81.1
DoS                             12,264   55.3
exploits                        33,393   50.4
fuzzers                         18,184   37.8
generic                         40,000   48.8
reconnaissance                  10,491   20.5
shellcode                        1,133   19.3
worms                              130   38.6

Conclusion

 

  • not sure we're ready for a conclusion yet…
  • definitely too much preprocessing involved, given we want to use DL for automation
  • conceptually, VAE is a very pleasing fit for the task because of its inherent probabilistic nature
  • in practice, this may need further experimentation, e.g., use embeddings on categorical variables
  • desirable: follow up on model performance (e.g., for the intrusion detection model) together with a domain expert

Questions?

 

Thanks for your attention!

Sources

 

[1] Asimov Institute, The Neural Network Zoo.

[2] Keras blog, Building autoencoders in Keras.

[3] Hawkins, D. (1980), Identification of outliers. Chapman and Hall, London.

[4] Doersch, C., Tutorial on Variational Autoencoders.

[5] An, J. & Cho, S., Variational Autoencoder based Anomaly Detection using Reconstruction Probability.

[6] Torgo, L. (2017), Data Mining with R, 2nd ed.

[7] Moustafa, Nour, and Jill Slay. “UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set).” Military Communications and Information Systems Conference (MilCIS), 2015. IEEE, 2015.

[8] Moustafa, Nour, and Jill Slay. “The evaluation of Network Anomaly Detection Systems: Statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set.” Information Security Journal: A Global Perspective (2016): 1-14.