This markdown is a collection of thoughts and tests to:
There are three commonly agreed upon failure modes which we must consider:
This is a potential problem with very deep networks and scales with the size of the learnable parameter space.
In GANs, this means the discriminator has become too good and no longer provides the generator with enough gradient information to learn.
Potential solutions include using the Wasserstein loss function, which was designed to mitigate this.
However, I have found the literature (especially the StyleGAN work) very cautious about changing loss functions. The 2017 Google Brain paper "Are GANs Created Equal?" states, when referring to different cost functions:
“We did not find evidence that any of the tested algorithms consistently outperforms the original one.”
Here they are referring to testing WGAN loss functions against the traditional loss functions employed in StyleGAN.
Moreover, we don't know whether other loss functions will work with their custom Path Length Regularization term for the generator.
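For reference, here is a minimal PyTorch-style sketch contrasting the non-saturating logistic loss (the form StyleGAN's default loss takes, minus its regularizers) with the Wasserstein loss discussed above. The function and tensor names are mine, not identifiers from the StyleGAN code base:

```python
import torch.nn.functional as F

# d_real, d_fake: raw discriminator outputs (logits / critic scores) for a batch.
# Placeholder names, not from the StyleGAN implementation.

def nonsaturating_losses(d_real, d_fake):
    """Non-saturating logistic GAN losses (StyleGAN's default, without regularizers)."""
    d_loss = F.softplus(d_fake).mean() + F.softplus(-d_real).mean()
    g_loss = F.softplus(-d_fake).mean()
    return d_loss, g_loss

def wasserstein_losses(d_real, d_fake):
    """WGAN critic/generator losses; requires a Lipschitz constraint on the critic
    (e.g. weight clipping or a gradient penalty) to be meaningful."""
    d_loss = d_fake.mean() - d_real.mean()
    g_loss = -d_fake.mean()
    return d_loss, g_loss
```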
Other potential fixes include adding noise to the discriminator input and decreasing the learning rate.

From our image output we can rule out mode collapse, as we are still producing a relatively wide array of mugshots with varying features. We now consider additional metrics to probe these potential failures.
We consider a set of additional performance metrics, computed on the previously completed training run, to get a better picture of what may have caused the network to fail to train. For this, we compute kNN-Precision and kNN-Recall metrics based on the work in https://arxiv.org/abs/1904.06991.
These are slightly different from the usual definitions used for binary classifiers. In this setting:
Precision is defined as: the probability that a random image sampled from the GAN-generated distribution falls within the support of the true data distribution. This captures the average sample quality of images produced by the GAN.

Recall is defined as: the probability that a random image from the real distribution falls within the support of the generated/learned distribution. This captures the coverage of the sample distribution.

The following visual makes these definitions a little easier to understand and highlights the importance of "distribution support" w.r.t. generated image realism.
A point to note here is that k-NN refers to the methodology used to approximate the manifold of the true data distribution. This is a super neat approach that uses a set of feature vectors to:
“Obtain the [feature space manifold] estimate by calculating pairwise Euclidean distances between all feature vectors in the set and, for each feature vector, forming a hypersphere with radius equal to the distance to its kth nearest neighbor.”
This allows for a non-parametric representation of the real and generated data manifolds at the feature level. The metrics are then computed by sampling points from these and asking simple "does point x lie in the support of distribution P?" questions.
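To make this concrete, here is a minimal NumPy sketch of the kNN-based precision/recall estimate described in the paper above. It assumes feature vectors for the real and generated images have already been extracted (the authors use a pretrained feature network for this); function and variable names are mine, not from the reference implementation, and the brute-force pairwise distances are only practical for small sample sets:

```python
import numpy as np

def knn_radii(features, k=3):
    """Radius of the hypersphere around each feature vector: the distance to its
    k-th nearest neighbour (index 0 of the sorted row is the point itself)."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]

def in_support(points, manifold_feats, manifold_radii):
    """For each point, check whether it falls inside at least one hypersphere
    of the approximated manifold."""
    d = np.linalg.norm(points[:, None, :] - manifold_feats[None, :, :], axis=-1)
    return (d <= manifold_radii[None, :]).any(axis=1)

def knn_precision_recall(real_feats, fake_feats, k=3):
    # Precision: fraction of generated samples lying on the real manifold.
    precision = in_support(fake_feats, real_feats, knn_radii(real_feats, k)).mean()
    # Recall: fraction of real samples lying on the generated manifold.
    recall = in_support(real_feats, fake_feats, knn_radii(fake_feats, k)).mean()
    return precision, recall
```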
These metrics give us another way to check for a convergence failure of the GAN. Below are the precision and recall metrics over the training epochs for our first attempt:
Both metrics show a steep decrease after an initial increase, indicating that sample quality and coverage both collapse and supporting the idea that we are looking at a simple convergence failure. We now consider possible next steps.
We now outline a few potential next steps. These are the options we have for changing the network's training characteristics without fundamentally altering StyleGAN itself.
Another parameter for us to change is the overall network configuration. Each of these comes with some fundamental changes to the actual G-D architecture and requires different regularization considerations. A list of the available configurations is given below:
- config-a → Baseline StyleGAN
- config-b → Weight demodulation
- config-c → Lazy regularization
- config-d → Path length regularization
- config-e → No growing, new G & D arch.
- config-f → Large networks (default)

The most promising candidate for a smaller network is config-e, which comes with a set of options for an altered generator (G) and discriminator (D) architecture. This configuration, as outlined in the StyleGAN2 appendix, is meant for images without high-resolution features and represents an approximate 20% decrease in the number of trainable parameters relative to the larger config-f. As such, for our dataset, it seems appropriate to use this smaller configuration.
This configuration does not alter the network structure relative to config-f, as both implement a residual-network discriminator and a skip-connection generator.
We have two parameters that we can potentially tune: the learning rate and the regularization weight.
Learning Rate: This is currently fixed at 0.002, but we can reduce it manually. Reducing the learning rate may allow for a more stable training run, and I have found other people using alpha=0.001 with 512x512 images in StyleGAN with config-e.
Regularization weight: StyleGAN2 uses the R1 regularization function from this paper: https://arxiv.org/abs/1801.04406. We can tune the gamma parameter, which is essentially a weight on the regularization term. For StyleGAN2 this term varies between 10 for config-f and 100 for config-e, depending on the dataset these are applied to. This may take some trial and error, and I can't say that I have a fully formed opinion here.
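For context on what this weight actually multiplies, here is a minimal PyTorch-style sketch of R1 regularization (a gradient penalty applied to real images only); `discriminator`, `real_images`, and `gamma` are placeholder names, and the official implementation additionally applies this lazily, i.e. only every few minibatches:

```python
import torch

def r1_penalty(discriminator, real_images, gamma=10.0):
    """R1 regularization: penalize the squared gradient norm of the discriminator
    on real data. gamma is the regularization weight discussed above."""
    real_images = real_images.detach().requires_grad_(True)
    d_real = discriminator(real_images)
    grads, = torch.autograd.grad(outputs=d_real.sum(), inputs=real_images,
                                 create_graph=True)
    penalty = grads.pow(2).sum(dim=[1, 2, 3]).mean()
    return 0.5 * gamma * penalty  # added to the discriminator loss
```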
Truncation Psi: This is a parameter, fixed for the generator network, that pushes it towards generating images which could be considered 'common'; essentially, it truncates the distribution the generator samples from. Here I think we should leave this set at the values designed for the network, as it's not entirely clear to me how exactly it impacts the actual training process.

This one is kind of simple. Using the face-detection script from the cropping pipeline, I was able to exclude all non-face images in our data pre-processing steps. So, in our next training run we'll only have faces :)
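As a rough illustration of that filtering step (the actual script in our cropping pipeline may differ), here is a minimal OpenCV-based sketch that keeps only images in which a Haar-cascade detector finds at least one face; the paths and detector settings are placeholders:

```python
import shutil
from pathlib import Path

import cv2

# Placeholder paths; the real pipeline uses its own cropping/face-detection script.
SRC_DIR = Path("data/raw")
DST_DIR = Path("data/faces_only")
DST_DIR.mkdir(parents=True, exist_ok=True)

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

for img_path in SRC_DIR.glob("*.png"):
    img = cv2.imread(str(img_path))
    if img is None:
        continue  # unreadable file, skip it
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) > 0:  # keep only images with at least one detected face
        shutil.copy(img_path, DST_DIR / img_path.name)
```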