Variational autoencoders for anomaly detection

Sigrid Keydana, Trivadis
2017/09/16

 

 

Variational autoencoders: Setting the scene

Welcome to the world of deep neural networks

 

Source: [1]

No magic involved: autoencoders

 

  • neither a supervised nor an unsupervised technique, but a self-supervised one
  • not conducive to learning interesting features / abstractions
  • useful for applications such as image denoising and dimensionality reduction for visualization
Source: [2]

 

Anomalies

 

An outlier is… an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism [3]

==> we need a probabilistic approach

Enter: generative stochastic models

Latent variable models

 

Maximize:

\( P(X) = \int P(X|z;\theta) P(z) dz \)

 

Source: [4]

Questions

 

  • How do we get a mapping from easy-to-sample-from \( z \)'s to the empirical output (\( X \))?

  • How do we make sure our \( z \)'s are in the appropriate range to yield the \( X \)'s?

Variational autoencoder

 

  • Sample from the prior \( P(z) = \mathcal N(z|0,I) \)

  • Let the network learn the decoder: \( P(X|z;\theta) = \mathcal N(X|f(z,\theta), \sigma^2 I) \)

  • Let the network learn the encoder: \( Q(z|X) \)

Variational autoencoder objective

 

Maximize the variational lower bound

\[ E_{z \sim Q}[\log P(X|z)] - \mathcal D[Q(z|X)\,||\,P(z)] \]

where the terms are

1) the reconstruction probability

2) the Kullback-Leibler divergence between the approximate posterior and the prior for \( z \)
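
With the usual Gaussian choices, an approximate posterior \( Q(z|X) = \mathcal N(z \mid \mu(X), \sigma^2(X) I) \) learned by the encoder and the standard normal prior \( P(z) \), the divergence term has a closed form; this is what the kl_loss code further below computes (using the mean instead of the sum over latent dimensions):

\[ \mathcal D[Q(z|X)\,||\,P(z)] = -\frac{1}{2} \sum_{j} \left( 1 + \log \sigma_j^2(X) - \mu_j^2(X) - \sigma_j^2(X) \right) \]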

VAE in a nutshell

 

Source: [4]

How can we use this?

 

Generate stuff (e.g., images)

Source: [4]

How can we use this for ANOMALY DETECTION?

 

  • VAE models the distribution, not the values

  • anomalies are seen as coming from a different process / distribution, so…

  • can diagnose as anomalies those cases that have high reconstruction error / low reconstruction probability [5]
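
As a minimal sketch of that idea, using reconstruction error (the trained vae model, the data matrix x_data and the 99% cut-off are illustrative assumptions, not the code used later):

# hypothetical scoring sketch: flag observations with unusually high reconstruction error
reconstructed <- predict(vae, x_data, batch_size = batch_size)
reconstruction_error <- rowMeans((x_data - reconstructed)^2)

# choose a cut-off, e.g. a high quantile of the errors observed on normal data
cutoff <- quantile(reconstruction_error, 0.99)
anomalies <- which(reconstruction_error > cutoff)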

Let's see this in practice!

 

  • MNIST, what else ;-)

  • fraud detection

  • network intrusion detection

Technology employed

 

 

 

Variational-autoencoding MNIST

Ubiquitous MNIST

 

MNIST handwritten digits database

The encoder

 

library(keras)
K <- backend()   # Keras backend, used for the tensor operations below

# latent_dim = 2 and original_dim = 28*28 follow from the slides; the rest are typical example values
original_dim <- 784L
intermediate_dim <- 256L
latent_dim <- 2L
batch_size <- 100L
epsilon_std <- 1.0

# input layer
x <- layer_input(batch_shape = c(batch_size, original_dim))
# hidden intermediate, lower-res
h <- layer_dense(x, intermediate_dim, activation = "relu")
# latent var 1, 2-dim (mainly for plotting!): mean
z_mean <- layer_dense(h, latent_dim)
# latent var 2, 2-dim: log variance
z_log_var <- layer_dense(h, latent_dim)

# reparameterization: draw epsilon from a standard normal, then shift and scale it
sampling <- function(arg) {
  # 1-based slicing of the concatenated (z_mean, z_log_var) tensor
  z_mean <- arg[, 1:latent_dim]
  z_log_var <- arg[, (latent_dim + 1):(2 * latent_dim)]

  epsilon <- K$random_normal(
    shape = c(batch_size, latent_dim),
    mean = 0.,
    stddev = epsilon_std
  )
  z_mean + K$exp(z_log_var / 2) * epsilon
}

# latent vars are sampled: nondeterministic!
z <- layer_concatenate(list(z_mean, z_log_var)) %>%
  layer_lambda(sampling)

The decoder

 

# the latent layer from above
z <- layer_concatenate(list(z_mean, z_log_var)) %>%                            
  layer_lambda(sampling)

# hidden intermediate, higher-res
decoder_h <- layer_dense(units = intermediate_dim, activation = "relu") 
# decoder for the mean, high-res again
decoder_mean <- layer_dense(units = original_dim, activation = "sigmoid")      

h_decoded <- decoder_h(z)
x_decoded_mean <- decoder_mean(h_decoded)

The model

 

# the complete model, from input to decoded output
vae <- keras_model(x, x_decoded_mean)                                          

# cross-entropy: the reconstruction part of the loss function
xent_loss <- function(target, reconstruction) {
  as.double(original_dim) * loss_binary_crossentropy(target, reconstruction)   
}

# Kullback-Leibler divergence: the regularization part of the loss function
kl_loss <- function(target, reconstruction) {
  -0.5*K$mean(1 + z_log_var - K$square(z_mean) - K$exp(z_log_var), axis = -1L)
}

# total VAE loss: reconstruction term plus KL regularizer
vae_loss <- function(x, x_decoded_mean) {
  xent_loss(x, x_decoded_mean) + kl_loss(x, x_decoded_mean)
}

# compile the model with (hopefully) adequate optimizer and learning rate
learning_rate <- 0.001   # example value; not given on the slides
vae %>% compile(
  optimizer = optimizer_rmsprop(lr = learning_rate),
  loss = vae_loss,
  metrics = c(xent_loss, kl_loss)
)
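
Training itself is a standard Keras fit step; a sketch, with MNIST flattened to 784-dimensional rows and scaled to [0, 1] (the number of epochs is an illustrative value):

# load and prepare MNIST: flatten each 28 x 28 image into a 784-dim row, scale to [0, 1]
mnist <- dataset_mnist()
x_train <- array_reshape(mnist$train$x, c(nrow(mnist$train$x), original_dim)) / 255
x_test  <- array_reshape(mnist$test$x, c(nrow(mnist$test$x), original_dim)) / 255

# the VAE is trained to reproduce its own input
vae %>% fit(
  x_train, x_train,
  epochs = 50,
  batch_size = batch_size,
  validation_data = list(x_test, x_test)
)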

MNIST latent space
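
Such a latent-space visualization can be produced from a standalone encoder model that maps inputs to their latent means; a minimal sketch (not necessarily the exact code behind the original figure):

# encoder-only model: from the input to the 2-dim latent means
encoder <- keras_model(x, z_mean)
x_test_encoded <- predict(encoder, x_test, batch_size = batch_size)

# scatterplot of the latent space, colored by digit class
plot(x_test_encoded[, 1], x_test_encoded[, 2],
     col = mnist$test$y + 1, pch = 19,
     xlab = "latent dim 1", ylab = "latent dim 2")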

 

Best and worst reconstructions

 

using reconstruction probability

So which are the anomalies?

 

  • have to define cut-off points
  • as with other machine learning / statistics-based algorithms
  • so is this a problem?
  • the more urgent question really is…

How well does this work with real-world datasets?

 

  • MNIST is 28*28 pixels, grayscale ==> super homogeneous data!
  • best supervised learning error rate as of today (test set): 0.23% ==> super simple!

How about real datasets with different types of variables of different scale?

 

 

Real-world example: fraud detection

Fraudulent Transactions [6]

 

  • data “from an undisclosed source”, anonymized
  • transactions reported by salespeople
    • salesman id
    • product id
    • quantity sold
    • total value of sale
    • fraud yes/no/unknown
  • total transactions 401,146 - of these, 385,414 unknown!

Before we start tweaking model parameters...

 

How to best preprocess the data?

  • product ids will have to be one-hot encoded (but we won't be able to use them all, to avoid an overly sparse matrix)
  • quantity and total value should both be kept (not reduced to unit price!) because each of them may be informative alone
  • however, we need all data to be on similar scale to be able to pick an adequate loss function
  • thus, quantity and total value have to be normalized
  • nonetheless, we have a mix of 0/1 and continuous data - let's see how this goes…
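
A sketch of what this preprocessing could look like (the sales data frame, its column names and the number of product ids kept are illustrative assumptions):

# hypothetical preprocessing sketch for the sales data
# keep only the most frequent product ids, lumping the rest into "other" to limit sparsity
prod <- as.character(sales$Prod)
top_products <- names(sort(table(prod), decreasing = TRUE))[1:50]
prod[!prod %in% top_products] <- "other"
prod_onehot <- model.matrix(~ p - 1, data = data.frame(p = factor(prod)))

# min-max normalize quantity and total value so all columns live on a similar scale
normalize <- function(v) (v - min(v, na.rm = TRUE)) / (max(v, na.rm = TRUE) - min(v, na.rm = TRUE))
X <- cbind(prod_onehot, quant = normalize(sales$Quant), val = normalize(sales$Val))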

Parameterizing the model for fraud detection

 

  • in theory, binary crossentropy should work fine for data between 0 and 1
  • in practice, mean squared error worked better
  • using MSE, also had to change activation of the output (decoded mean) layer to linear
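
Relative to the MNIST setup, only the output activation and the reconstruction term change; a sketch (fraud_original_dim stands for the number of preprocessed input columns, and scaling the MSE term like the cross-entropy term before is an assumption):

# decoded mean now uses a linear activation instead of a sigmoid
decoder_mean <- layer_dense(units = fraud_original_dim, activation = "linear")

# reconstruction part of the loss is now mean squared error
# (the rest of the model definition stays as in the MNIST case)
vae_loss <- function(x, x_decoded_mean) {
  mse_loss <- fraud_original_dim * loss_mean_squared_error(x, x_decoded_mean)
  kl_loss <- -0.5 * K$mean(1 + z_log_var - K$square(z_mean) - K$exp(z_log_var), axis = -1L)
  mse_loss + kl_loss
}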

Training procedure

 

  • not evident how to proceed with the unlabeled (“unknown”) data (= 96% of whole dataset!)
  • decision: train on “unknown” set, use labeled data (fraud or non-fraud) for testing
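
A sketch of this split and of how the per-group reconstruction errors can then be compared (x_unknown, x_fraud and x_ok stand for the preprocessed matrices of the respective labels; a flexible batch dimension, i.e. layer_input(shape = ...), and the number of epochs are assumptions):

# train only on the "unknown" transactions
vae %>% fit(x_unknown, x_unknown, epochs = 50, batch_size = batch_size)

# score labeled transactions by per-row reconstruction error (MSE)
mse_per_row <- function(x) {
  rec <- predict(vae, x)
  rowMeans((x - rec)^2)
}
mean(mse_per_row(x_fraud))  # fraud test set
mean(mse_per_row(x_ok))     # non-fraud test set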

Fraud latent space:

 

Scores on test set

 

Reconstruction error (MSE, in this case):

  • final MSE on training set: 3.50
  • MSE on fraud test set: 3.57
  • MSE on non-fraud test set: 3.49

Problem: small sample sizes (non-fraud: 6013, fraud: 180)

 

 

Real-world example: network intrusion detection

UNSW-NB15 intrusion detection dataset [7,8]

 

  • synthetic dataset
  • 9 types of attack: fuzzers, analysis, backdoors, DoS, exploits, generic, reconnaissance, shellcode, worms
  • 49 features created from raw network traffic captured with tcpdump

Again - what shall we do with the data?

 

  • tried: one-hot encoding protocol, service and state categorical variables
  • tried: binning and one-hot-encoding continuous variables
  • tried: different ways of normalizing continuous variables
  • in the end, what worked best was getting rid of the categorical variables and scaling all others
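
A sketch of the variant that worked best (train and test stand for the data frames read from the dataset's CSV files; the exact columns dropped are an assumption):

# drop the categorical columns, scale all remaining numeric features;
# the test data is scaled with the means / standard deviations estimated on the training data
numeric_cols <- setdiff(names(train), c("id", "proto", "service", "state", "attack_cat", "label"))
x_train <- scale(train[, numeric_cols])
x_test  <- scale(test[, numeric_cols],
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))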

How should we parameterize the model?

 

  • tried different loss functions appropriate for normally distributed data, MSE worked best
  • also used batch normalization and \( l2 \) regularization for all hidden layers, e.g.
h <- layer_dense(x, intermediate_dim)
if (use_batch_normalization) h <- h %>% layer_batch_normalization()
h <- h %>% layer_activation("relu") %>% layer_activity_regularization(l1 = l1, l2 = l2)

Training procedure

 

  • dataset already comes with a training set and a test set
  • use all non-attack records from the training set (= 37,000) for training - that's not too much given the complexity of the data

Network intrusions latent space:

 

Scores on test set

 

Reconstruction error (MSE):

type                   number of cases    MSE
normal (no attack)              56,000   40.4
analysis                         2,000   81.1
DoS                             12,264   55.3
exploits                        33,393   50.4
fuzzers                         18,184   37.8
generic                         40,000   48.8
reconnaissance                  10,491   20.5
shellcode                        1,133   19.3
worms                              130   38.6

Conclusion

 

  • not sure we're ready for a conclusion yet…
  • definitely too much preprocessing involved, given we want to use DL for automation
  • conceptually, VAE is a very pleasing fit for the task because of its inherent probabilistic nature
  • in practice, this may need further experimentation, e.g., use embeddings on categorical variables
  • desirable: follow up on model performance (e.g., for the intrusion detection model) together with a domain expert

Questions?

 

Thanks for your attention!

Sources

 

[1] Asimov Institute, The Neural Network Zoo.

[2] Keras blog, Building autoencoders in Keras.

[3] Hawkins, D. (1980), Identification of outliers. Chapman and Hall, London.

[4] Doersch, C., Tutorial on Variational Autoencoders.

[5] An, J. & Cho, S., Variational Autoencoder based Anomaly Detection using Reconstruction Probability.

[6] Torgo, L. (2017), Data Mining with R, 2nd ed.

[7] Moustafa, Nour, and Jill Slay. “UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set).” Military Communications and Information Systems Conference (MilCIS), 2015. IEEE, 2015.

[8] Moustafa, Nour, and Jill Slay. “The evaluation of Network Anomaly Detection Systems: Statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set.” Information Security Journal: A Global Perspective (2016): 1-14.