Generative adversarial networks (GANs), as originally proposed by Ian Goodfellow et al. in 2014, are deep neural network architectures composed of two neural networks competing against each other:
The discriminator and the generator play the following two-player minimax game:
\[ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log (1 - D(G(z)))] \]
where:
p_z : prior distribution over the noise input z
p_data : distribution of the real data x
G(z) : the generator's output
D(x) : the discriminator's output, i.e. the probability that x came from the real data
For the discriminator, D(x) should be maximized and D(G(z)) should be minimized. For the generator, D(G(z)) should be maximized: the discriminator's goal is to better distinguish real images from generated ones, while the generator tries to fool the discriminator into considering the generated output as real. Thus, training GANs consists in finding a Nash equilibrium of a two-player minimax game between the generator and the discriminator.
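To make the game concrete, here is a minimal training-loop sketch in PyTorch. The toy networks, optimizer settings, and the synthetic “real” data are placeholder assumptions for illustration, not taken from any of the papers discussed here:

```python
import torch
import torch.nn as nn

# Hypothetical toy networks; real architectures are task-specific.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))  # z -> sample
D = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 1))   # sample -> logit

opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(32, 2) + 5.0        # stand-in for a batch of real data
    z = torch.randn(32, 64)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z))).
    fake = G(z).detach()
    loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: the commonly used non-saturating variant
    # maximizes log D(G(z)) instead of minimizing log(1 - D(G(z))).
    loss_g = bce(D(G(z)), torch.ones(32, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```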
Since their introduction, the training of generative adversarial networks has been associated with many problems. “Mode collapse”, also known as the “Helvetica scenario”, is one of the key underlying problems of the GAN architecture. Mode collapse occurs when the generator outputs a single sample or a limited diversity of samples. In this situation, the generated output doesn’t cover the multiple modes of the real data distribution: the outputs come from the same specific part of the underlying data distribution, and as a consequence, the generator collapses to one of the data modes.
Mode collapse occurs because the generated output converges to a single optimum x*, the output the discriminator considers most realistic. In this case, x* is independent of z, because x* = argmax_x D(x). As a consequence, the gradient with respect to z approaches 0 and the generator collapses to a single point. Once the discriminator learns to flag this single point as generated, the generator exploits the discriminator by switching to another mode. Given that the generator is not incentivized to cover multiple modes, this scenario repeats again and again: the model never converges and the generated output lacks diversity.
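A cheap way to spot this failure in practice is to monitor the spread of a batch of generated samples. The snippet below is a simple illustrative diagnostic of my own, not a method from the literature:

```python
import torch

def sample_diversity(G, z_dim=64, n=256):
    """Mean pairwise L2 distance between n generated samples.
    A value near zero suggests the generator has collapsed to a single point."""
    with torch.no_grad():
        x = G(torch.randn(n, z_dim)).flatten(1)  # flatten each sample to a vector
    return torch.pdist(x).mean().item()
```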
Researchers have recently suggested using multiple generators to help avoid some of the problems related to GANs and prevent mode collapse. This method of “Mixture Generative Adversarial Networks”, abbreviated as MGAN, is formulated as a minimax game between a discriminator, many generators, and a classifier that performs multi-class classification (it specifies which generator a sample comes from).
MGAN architecture. Image source: the MGAN paper by Q. Hoang et al.
The idea behind this approach is to use a mixture of many distributions to approximate the data distribution and cover different data modes. Each generator focuses on one region of the data space, so the mixture improves mode coverage, since each distribution captures a subset of data modes separately from the others: the MGAN paper by Q. Hoang et al. shows that, at equilibrium, the JS divergence among the generators’ distributions is maximized, while the JS divergence between the mixture of the generators’ distributions and the empirical data distribution is minimized.
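A simplified sketch of the generator-side objective, assuming toy network sizes, a uniform mixture weight of 1/K, and a diversity weight beta. This is an illustration of the idea, not the authors’ implementation (the paper shares most weights across generators and trains the discriminator and classifier jointly):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K, z_dim, beta = 4, 64, 1.0   # number of generators and diversity weight (assumed)

# K toy generators; the paper shares all layers except the input layer.
gens = nn.ModuleList(
    nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, 2))
    for _ in range(K)
)
# Shared feature body with two heads: real/fake logit (D) and K-way classifier (C).
body = nn.Sequential(nn.Linear(2, 128), nn.ReLU())
d_head, c_head = nn.Linear(128, 1), nn.Linear(128, K)

def generator_loss(batch_size):
    k = torch.randint(K, (1,)).item()              # sample a generator uniformly
    fake = gens[k](torch.randn(batch_size, z_dim))
    h = body(fake)
    # Standard GAN term: fool the discriminator into labeling the batch real.
    gan = F.binary_cross_entropy_with_logits(d_head(h), torch.ones(batch_size, 1))
    # Diversity term: generator k wants the classifier to recognize its samples,
    # which pushes the K generator distributions away from each other.
    div = F.cross_entropy(c_head(h), torch.full((batch_size,), k))
    return gan + beta * div
```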
Using multiple generators with GANs has proven very effective. Empirically, this approach demonstrated the ability to learn multimodal data successfully, with a fast convergence rate. Evaluating MGAN on datasets like CIFAR-10, STL-10, and ImageNet, the aforementioned paper reported impressive results: MGAN achieved state-of-the-art Inception Scores, a metric that rewards samples that are both high-quality and varied.
Salimans et al. suggest other alternatives to address GAN training issues, such as minibatch discrimination. The idea is for the discriminator to take into account all the information present in each batch of data, instead of looking at a single input. The discriminator is thus able to consider the relationships between the training points in one batch, instead of processing each point independently.
This procedure aims at improving GAN training by using the other examples in the batch as side information: it compares intra-batch similarities and adds them as features for the discriminator. When all the samples in a batch are very close to each other, the discriminator can infer that the data is fake. Mode collapse becomes easy to detect, which forces the generator to output more diverse samples. Unfortunately, this procedure is computationally expensive: the cost and time of its implementation are high.
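Here is a sketch of a minibatch discrimination layer following the construction in Salimans et al. (project each sample through a learned tensor, take L1 distances across the batch per kernel, and sum negative exponentials into per-sample features); the layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MinibatchDiscrimination(nn.Module):
    """Minibatch discrimination layer (after Salimans et al., 2016).
    Appends per-sample similarity statistics so the discriminator can
    see whether a batch lacks diversity. Sizes here are illustrative."""
    def __init__(self, in_features, out_kernels=32, kernel_dim=8):
        super().__init__()
        self.T = nn.Parameter(torch.randn(in_features, out_kernels * kernel_dim) * 0.1)
        self.out_kernels, self.kernel_dim = out_kernels, kernel_dim

    def forward(self, x):                                   # x: (N, in_features)
        m = (x @ self.T).view(-1, self.out_kernels, self.kernel_dim)  # (N, B, C)
        # L1 distance between every pair of samples, per kernel.
        diff = (m.unsqueeze(0) - m.unsqueeze(1)).abs().sum(dim=3)     # (N, N, B)
        o = torch.exp(-diff).sum(dim=1) - 1.0   # drop self-similarity exp(0) = 1
        return torch.cat([x, o], dim=1)         # (N, in_features + B)
```

The O(N²) pairwise comparison inside every batch is exactly where the extra computational cost mentioned above comes from.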
Batch normalization is another important technique to address mode collapse. It normalizes the feature vectors to have zero mean and unit variance. Batch normalization helps reduce internal covariate shift and allows higher learning rates, making the network converge faster.
The same paper by Salimans et al. suggests using virtual batch normalization (VBN), a technique where each data sample is normalized based on a fixed reference batch. This differs from batch normalization, where each sample is normalized within its own minibatch. Contrary to batch normalization, VBN doesn’t suffer from the interdependence of the samples inside each minibatch. However, VBN is computationally expensive because it requires running forward propagation on two minibatches of data.
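A single-layer sketch of the idea, assuming the reference batch feeds this layer directly. In a deep network the reference batch must be forward-propagated again at every step to refresh the per-layer statistics, which is exactly the extra cost mentioned above; the original VBN also mixes in the current example’s own statistics, omitted here for simplicity:

```python
import torch
import torch.nn as nn

class VirtualBatchNorm(nn.Module):
    """Simplified virtual batch normalization sketch.
    Statistics come from a fixed reference batch chosen once at the start
    of training, so a sample's normalization does not depend on whichever
    minibatch it happens to land in."""
    def __init__(self, ref_batch, eps=1e-5):
        super().__init__()
        self.register_buffer("mu", ref_batch.mean(dim=0, keepdim=True))
        self.register_buffer("var", ref_batch.var(dim=0, unbiased=False, keepdim=True))
        self.gamma = nn.Parameter(torch.ones(1, ref_batch.size(1)))   # learned scale
        self.beta = nn.Parameter(torch.zeros(1, ref_batch.size(1)))   # learned shift
        self.eps = eps

    def forward(self, x):
        return self.gamma * (x - self.mu) / torch.sqrt(self.var + self.eps) + self.beta
```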
In the multiple-generator MGAN model, as presented in the paper by Q. Hoang et al., the authors applied batch normalization to all layers of the networks except the output layer.
Some researchers believe that the original GAN loss is the cause of mode collapse in many scenarios, and some suggest using a different loss function, such as the Wasserstein distance.
The Wasserstein distance is a measure of the distance between two probability distributions. It is an interesting measure because, even for two distributions supported on low-dimensional manifolds without overlap, it still provides, contrary to the KL or JS divergence, a smooth and meaningful measure of the distance between them. This is very helpful for a stable learning process using gradient descent.
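The standard example from the WGAN paper makes this concrete. Take two distributions supported on parallel vertical lines in the plane, with U ~ Uniform[0, 1]:

\[ P_0 = (0, U), \quad P_\theta = (\theta, U) \]

\[ W(P_0, P_\theta) = |\theta|, \qquad JS(P_0, P_\theta) = \begin{cases} \log 2 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases} \]

The JS divergence jumps from 0 to log 2 the moment the supports stop overlapping, so its gradient with respect to θ is zero almost everywhere, while the Wasserstein distance decreases smoothly as the lines approach, giving the generator a usable gradient.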
WGANs (Wasserstein GANs) were introduced by Arjovsky et al. as an alternative to traditional GANs, to improve GAN training. Using a new cost function based on the Wasserstein distance, WGANs have empirically shown improved training stability and have been reported to prevent mode collapse.
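In code, the change from the original GAN loss is small. A minimal sketch of the critic objective, with the weight clipping used in the original paper (function names are mine):

```python
import torch

def wgan_critic_loss(critic, real, fake):
    """WGAN critic objective (Arjovsky et al.): approximate the Wasserstein
    distance by maximizing E[critic(real)] - E[critic(fake)].
    Returned negated, as a loss to minimize with a standard optimizer."""
    return critic(fake).mean() - critic(real).mean()

def clip_weights(critic, c=0.01):
    # Weight clipping enforces the Lipschitz constraint the theory requires;
    # the later WGAN-GP variant replaces this with a gradient penalty.
    for p in critic.parameters():
        p.data.clamp_(-c, c)
```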
When tested empirically, the WGAN-GP model (a variant of WGAN with several improvements) did not show better results than MGAN. However, it remains an interesting idea to experiment with the Wasserstein distance instead of the JS divergence in MGAN, in order to enforce divergence among the multiple generators.
In section 5.1.1 of “NIPS 2016 Tutorial: Generative Adversarial Networks”, Goodfellow explains that GANs systematically generate a small number of modes due to a defect in the training procedure.
I am convinced that MGAN has shown great success because the authors understood what was principally wrong with traditional GANs: their approach was conceived to fundamentally change the way GANs are trained, not merely to introduce some improvements to the objective function.
Solving mode collapse for a single-generator GAN was out of the scope of the MGAN model. To prevent mode collapse, the MGAN architecture did not aim at improving within-generator diversity: the goal was instead to improve among-generator diversity.