StyleGAN 2 - Failure Modes, Tests, and Configurations

This markdown is a collection of thoughts and tests to diagnose why the last training run failed and to plan the next one.

Common GAN failure modes:

There are three commonly agreed-upon failure modes which we must consider:

  1. Mode Collapse
    • This implies the GAN concentrates its learning on a small proportion of the data distribution
    • This would imply a lack of variation in the generated images, which we do not observe
    • I believe this is quite unlikely to be the problem
  2. Vanishing gradients
    • This is a potential problem with very deep networks and scales with the size of the learnable parameter space

    • In GANs this implies that the discriminator is too good and does not provide the Generator with enough information to learn

    • Potential solutions include using the Wasserstein Loss Function which was designed to prevent this

    • However, I have found the literature (especially StyleGAN) to be very cautious about changing loss functions. The 2017 Google Brain paper “Are GANs created equal?” states, when referring to different cost functions:

      “We did not find evidence that any of the tested algorithms consistently outperforms the original one.”

    • Here they are referring to testing WGAN loss functions against the traditional loss functions employed in StyleGAN

    • Moreover, we don’t know whether other loss functions will work with their custom Path Length Regularization loss function for the Generator

  3. Failure to converge
    • This is just something that happens with GANs … as far as I can tell
    • I have not found a convincing general explanation for this happening, other than “they’re just big networks so it’s hard sometimes”
    • With this, the network would reach a suboptimal point characterized by very low discriminator loss as the quality of generated images declines steadily
    • Potential solutions include adding noise to the discriminator input (sketched after this list) and decreasing the learning rate
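
As a concrete illustration of the first of those fixes, the sketch below adds annealed Gaussian noise to the images the discriminator sees. This is a generic technique rather than anything from the StyleGAN2 codebase, and the starting sigma and annealing length are made-up values.

```python
import numpy as np

def noisy_discriminator_input(images, kimg_seen, sigma_start=0.1, anneal_kimg=2000.0):
    """Add Gaussian noise to a batch of images before they reach the discriminator.

    images:    batch of images scaled to [-1, 1]
    kimg_seen: thousands of real images shown so far (training progress)
    The noise strength decays linearly to zero over `anneal_kimg`, so the real and
    generated distributions overlap early on and the trick vanishes later.
    """
    sigma = sigma_start * max(0.0, 1.0 - kimg_seen / anneal_kimg)
    return images + sigma * np.random.randn(*images.shape)
```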

From our image output we can discard mode collapse, as we are still producing a relatively wide array of mugshots with varying features. We now consider additional metrics to probe the remaining potential failures.

Additional Performance Metrics - Precision and Recall

We consider a set of additional performance metrics, computed on the previously completed training run, in order to get a better picture of what may have caused the network to fail to train successfully. For this, we compute kNN-Precision and kNN-Recall metrics based on the work in https://arxiv.org/abs/1904.06991

These are slightly different from the usual definitions used for binary classifiers. In this setting (a more formal statement follows the list):

  • Precision is defined as: The probability that a random image sampled from the GAN generated distribution falls within the support of the true data distribution
    • This measures the average sample quality of images produced by the GAN
    • A higher probability of sampling from the part of the generated distribution that overlaps the true data implies that a higher proportion of images look ‘real’
  • Recall is defined as: The probability that a random image from the real distribution falls within the support of the generated/learned distribution
    • This measures the coverage of the sample distribution
    • Higher probability implies the support of the learned distribution overlaps with a larger proportion of the true data distribution
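
Written a little more formally (this is my reading of the paper linked above, with Φ_r and Φ_g the sets of real and generated feature vectors, and f(φ, Φ) equal to 1 when φ falls inside the estimated manifold of Φ and 0 otherwise):

```latex
\mathrm{precision}(\Phi_r, \Phi_g) = \frac{1}{|\Phi_g|} \sum_{\phi_g \in \Phi_g} f(\phi_g, \Phi_r)
\qquad
\mathrm{recall}(\Phi_r, \Phi_g) = \frac{1}{|\Phi_r|} \sum_{\phi_r \in \Phi_r} f(\phi_r, \Phi_g)
```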

The following visual makes these definitions a little easier to understand and highlights the importance of “distribution support” w.r.t generated image realism.

A point to note here is that k-NN refers to the methodology used to approximate the manifold of the true data distribution. This is a super neat approach that uses a set of feature vectors to:

“Obtain the [feature space manifold] estimate by calculating pairwise Euclidean distances between all feature vectors in the set and, for each feature vector, forming a hypersphere with radius equal to the distance to its kth nearest neighbor.”

This allows for a non-parametric representation of the real and generated data manifolds at the feature level. The metrics are then computed by sampling points from these manifolds and asking simple “does point x lie in the support of distribution P?” questions, as sketched below.
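
A minimal NumPy sketch of that procedure (illustrative only: in practice the feature vectors come from a pre-trained image classifier, as in the paper, and the pairwise-distance matrices below are O(n²) in memory):

```python
import numpy as np

def knn_radii(feats, k=3):
    """Radius of the k-th nearest-neighbour hypersphere around each feature vector."""
    # Pairwise Euclidean distances between all feature vectors in the set.
    dists = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    # Column 0 of each sorted row is the distance to itself (0), so column k is the k-th neighbour.
    return np.sort(dists, axis=1)[:, k]

def manifold_coverage(query, ref, k=3):
    """Fraction of `query` vectors that fall inside the estimated manifold of `ref`."""
    radii = knn_radii(ref, k)                                              # (n_ref,)
    dists = np.linalg.norm(query[:, None, :] - ref[None, :, :], axis=-1)   # (n_query, n_ref)
    return float(np.mean(np.any(dists <= radii[None, :], axis=1)))

# precision: generated features that land inside the real manifold
# recall:    real features that land inside the generated manifold
# precision = manifold_coverage(gen_feats, real_feats)
# recall    = manifold_coverage(real_feats, gen_feats)
```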

The hypothesis is this:

  • If precision and recall are both decreasing over the training period, the quality of generated images is falling on a macro level
  • This should allow the discriminator to discern between true and fake images with lower loss
  • Quickly falling discriminator loss indicates that we are in a convergence failure of the GAN
    • Potential solutions are then the ones listed under failure mode 3 above: adding noise to the discriminator input and decreasing the learning rate

What do the metrics show?

These are plots of the computed precision and recall metrics over the training epochs for our first attempt:

Both indicate a steep decrease in general image quality after an initial increase, supporting the idea that we are in a simple convergence failure. We now consider possible next steps.

Next Steps

We now outline a few potential next steps. These are the options we have for changing the network’s training characteristics without fundamentally altering the StyleGAN architecture.

1. Network Configurations

One parameter for us to change is the overall network configuration. Each configuration comes with some fundamental changes to the actual G-D architecture and requires different regularization considerations. A list of the available configurations is given below:

  • config-a → Baseline StyleGAN
  • config-b → Weight demodulation
  • config-c → Lazy regularization
  • config-d → Path length regularization
  • config-e → No growing, new G & D arch.
  • config-f → Large networks (default)

The most promising candidate for a smaller network is config-e, which comes with a set of options for an altered generator (G) and discriminator (D) architecture. This configuration, as outlined in the StyleGAN2 appendix, is meant for images without high-resolution features and represents an approximate 20% decrease in the number of trainable parameters from the larger config-f. As such, it seems appropriate to use this smaller configuration for our dataset.

This configuration does not alter the network structure relative to config-f, as both implement a residual-network discriminator and a skip-connection generator.
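
For reference, a launch along these lines would look roughly like the sketch below. The flag names are taken from the public NVlabs/stylegan2 README and should be checked against our local run_training.py; the dataset name and kimg budget are placeholders, and the regularization weight discussed in the next section would be passed similarly if the script exposes a flag for it.

```python
import subprocess

# Rough sketch of a config-e training launch. Flags follow the NVlabs/stylegan2
# README; "mugshots" and the kimg budget are placeholders for our actual setup.
subprocess.run([
    "python", "run_training.py",
    "--num-gpus=1",
    "--data-dir=datasets",       # directory containing the prepared TFRecords
    "--dataset=mugshots",        # placeholder name for our dataset
    "--config=config-e",         # smaller G/D architecture discussed above
    "--total-kimg=5000",         # placeholder training length
    "--mirror-augment=true",
], check=True)
```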

2. Hyperparameter Tuning

We have two parameters that we can potentially tune: the learning rate and the regularization weight. A third parameter, the truncation psi, is discussed below but left at its default.

  1. Learning Rate: This is currently a fixed parameter at 0.002, but we are able to reduce it manually. Reducing the learning rate may allow for a more stable training run, and I have found other people using alpha=0.001 with 512x512 images in StyleGAN and config-e.
    • The most convincing argument for trying to stabilize training is the massive immediate spike in Precision and Recall
    • Also, the FID decrease seen in the markdown here (https://rpubs.com/JonasKnecht/665744)
  2. Regularization weight: StyleGAN2 uses an R1-regularization function from this paper: https://arxiv.org/abs/1801.04406 . We can tune the delta parameter, which is essentially a weight on the regularization term. For StyleGAN2 this weight varies between 10 (config-f) and 100 (config-e), depending on the dataset it is applied to. This may take some trial and error, and I can’t say that I have a fully formed opinion here.
    • The idea is to compare our dataset to the ones for which StyleGAN2 has delta parameters assigned
    • For FFHQ they set it to 10
    • For CARS they set it to 100
    • The question here is … “do our mugshots behave more like CARS or like FFHQ?”, so that we can narrow down the set of possible values for delta
  3. Truncation Psi: This is a parameter fixed for the generator network to push it towards generating images which could be considered ‘common’. Essentially, this truncates the latent distribution for the generator (sketched below). Here I think we should leave this set at the values designed for the network, as it’s not entirely clear to me how exactly this impacts the actual training process.
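
For reference, the truncation trick itself is simple. The sketch below is illustrative only (it is the formulation described in the StyleGAN papers, not the repo’s code): the intermediate latent w is pulled towards the average latent w_avg by the factor psi.

```python
import numpy as np

def truncate_latent(w, w_avg, psi=0.7):
    """Truncation trick: pull an intermediate latent w towards the average latent w_avg.

    psi = 1.0 leaves w unchanged; psi = 0.0 collapses every sample onto the
    'average' image, trading sample variety for more typical-looking outputs.
    """
    return w_avg + psi * (w - w_avg)

# Example with random stand-ins for the 512-dimensional dlatents:
w_avg = np.zeros(512)
w = np.random.randn(512)
w_truncated = truncate_latent(w, w_avg, psi=0.5)
```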

3. Data cleaning

This is kind of simple. Using the face-detection script from the cropping pipeline, I was able to exclude all non-face images during our data pre-processing steps. So, in our next training run we’ll only have faces :)
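
The actual cropping-pipeline script is not reproduced here; as a stand-in, a minimal version of this kind of filter could look like the following, assuming OpenCV’s bundled Haar cascade as the detector.

```python
import cv2

# Keep an image only if the Haar cascade finds at least one frontal face.
# This is an illustrative stand-in for the cropping pipeline's own detector.
detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def contains_face(path):
    img = cv2.imread(path)
    if img is None:          # unreadable / corrupt file
        return False
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return len(faces) > 0

# Example: keep_paths = [p for p in all_image_paths if contains_face(p)]
```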