StyleGAN2 k-NN Precision and Recall Evaluation

Introduction and Outline

In this markdown I list some of the key findings from a literature review and discussion around precision and recall as they relate to our model objectives.

A quick summary of the current situation and what we are trying to figure out:

The GAN is training very well and is performing well on all metrics we are tracking
We are focusing specifically on kNN-Precision and kNN-Recall which measure average image quality and image variety respectively
There is a trade-off/ceiling for these metrics, which is well reported in the literature and in the paper that introduced them: https://arxiv.org/abs/1904.06991
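As a reminder of what the two metrics measure, here is a minimal NumPy sketch of the k-NN construction described in https://arxiv.org/abs/1904.06991. It assumes the images have already been embedded as feature vectors (the paper uses a pre-trained VGG-16 and k = 3). The function names are my own and this is not the code used in our evaluation.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distances between every row of a and every row of b."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.sqrt(np.maximum(sq, 0.0))

def knn_radii(feats, k=3):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    d = pairwise_distances(feats, feats)
    # After sorting, column 0 is the point itself, so column k is the k-th neighbour.
    return np.sort(d, axis=1)[:, k]

def coverage(manifold_feats, query_feats, k=3):
    """Fraction of query points inside at least one k-NN hypersphere of the manifold set."""
    radii = knn_radii(manifold_feats, k)
    d = pairwise_distances(query_feats, manifold_feats)
    return float((d <= radii[None, :]).any(axis=1).mean())

def knn_precision_recall(real_feats, fake_feats, k=3):
    precision = coverage(real_feats, fake_feats, k)  # generated samples inside the real manifold
    recall = coverage(fake_feats, real_feats, k)     # real samples inside the generated manifold
    return precision, recall
```

Precision asks how many generated samples land on the real manifold, while recall asks how many real samples land on the generated manifold, which is why the two can move in opposite directions.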

The discussion this markdown deals with concerns the following questions:

What direct impact does the model's performance on both metrics have?
How does this impact our face editing pipeline?
What potential steps can we take to influence precision and recall for our model?

I have tried to approach this topic as systematically as possible. As such, I start with an outline of the current state, the literature, what the implications are, and then address the open questions of importance to our project.

Current k-NN Precision and Recall performance

Below is a set of plots which outline the current performance of our model. These are meant to highlight the trajectory across training epochs for both measures, as well as their combined training progress.

As you can see above, both precision and recall are converging quite monotonically to approximately 0.8 and 0.48 respectively. Some important points to consider here are:

Precision scores higher because it is easier for the GAN to concentrate ‘learning power’ on a subset of the true data and produce those almost perfectly
From a review of the dynamics outlined in https://arxiv.org/abs/1904.06991, this is tied to the weight \(\gamma\) of the R1 regularization term introduced in https://arxiv.org/abs/1801.04406
- Lower \(\gamma\) implies lower gradient penalty and leads to higher precision
- From what I can tell, this highlights the connection between the R1 term and trying to force the GAN to spread its learned distribution
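For reference, the R1 term from https://arxiv.org/abs/1801.04406 penalizes the discriminator's gradient on real data only, scaled by \(\gamma\). Written in my own notation, with \(D\) the discriminator and \(p_{\text{data}}\) the real data distribution:

\[
R_1 = \frac{\gamma}{2}\, \mathbb{E}_{x \sim p_{\text{data}}}\left[ \lVert \nabla_x D(x) \rVert^2 \right]
\]

A smaller \(\gamma\) shrinks this penalty, which is the setting associated with higher precision in the points above.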

In plot number 3 I provide a view of their combined progress. Here I think it is interesting to note a few things about training dynamics:

It appears the GAN progresses in Precision before Recall
I.e. the GAN quickly reaches the point where about 50% of its generated images are expected to be realistic
As recall is low, however, it appears the GAN is concentrating its ‘learning power’ on a small part of the total data distribution
As training progresses, this process becomes much more linear and both precision and recall grow at slightly less than 1:1

Another reason for this plot is to investigate the relative speed of improvement. We can see that as training progresses, the gains have all but stalled for both metrics. This gives us an idea of what we can realistically expect from these metrics over the next 15 million iterations.
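The relative speed of improvement can also be computed directly from the logged metrics rather than read off the plot. The sketch below is illustrative only: `improvement_rates` and its arguments are hypothetical names, and it assumes precision and recall are logged at known iteration counts.

```python
import numpy as np

def improvement_rates(iterations, precision, recall):
    """Change in each metric per million iterations, plus the recall:precision slope."""
    it = np.asarray(iterations, dtype=float)
    p = np.asarray(precision, dtype=float)
    r = np.asarray(recall, dtype=float)
    d_it = np.diff(it) / 1e6           # interval lengths in millions of iterations
    dp = np.diff(p) / d_it             # precision gain per million iterations
    dr = np.diff(r) / d_it             # recall gain per million iterations
    # Slope of the precision-recall trajectory; NaN where precision did not move.
    with np.errstate(divide="ignore", invalid="ignore"):
        slope = np.where(np.diff(p) != 0, np.diff(r) / np.diff(p), np.nan)
    return dp, dr, slope
```

A slope slightly below 1 in the later intervals would correspond to the "slightly less than 1:1" growth described above, and `dp` and `dr` approaching zero would confirm that both metrics have stalled.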

Literature on Precision and Recall for GANs

Here I am just outlining what the literature “standard” is for these models. We can obviously think about what we would ideally want from our model, but I think it's important to frame this discussion in the context of what is realistically possible.

There are a few important facts to consider here:

StyleGAN1 sits around (precision, recall) = (0.73, 0.4) in empirical testing, and we are already outperforming these values
The highest recall metric achieved by any StyleGAN1 network configuration is at 0.46
- This is without instance normalization
The highest precision recorded is 0.81, in a model with \(\gamma\) reduced by a factor of 100

These points are made especially clear in the following plot. Getting data for StyleGAN2 on FFHQ involves running their pre-trained model, which (I believe) would take a bit more effort than it's worth, given that this plot gives us a good idea of what we can realistically expect.

The yellow point is where our model is currently sitting. Some important dynamics include:

Truncation of generated images

Truncating the distribution of the generator increases precision and lowers recall
On the other hand, we can increase the ‘spread’ of the distribution by moving away from the average face and truncating less
The following plot makes this point quite well; notice the clear trade-off:

R1 regularization

With higher regularization the GAN is forced to learn more of the distribution
Average image quality is reduced, but spread is increased
As we discussed in previous iterations, the R1 weight \(\gamma\) is config-dependent and already at a high value (100) for config-e

Implications for Face-editing

Given that we know how our model is performing, and what the literature shows as a baseline, we can now think about the impact this has on our model and our goal of editing mugshots in a systematic way. This is a very complicated problem without a “right” answer!

High Precision at 0.8 and Low Recall at 0.48 will mean that:
- Approximately 8/10 generated images fall within the real-image manifold, so for most mugshots we project, the GAN counterpart should look very realistic
- We will only be covering about 48% of the available mugshot space with these realistic projections
  - I.e. for any randomly sampled point from the real data distribution we expect a 48% probability of it coinciding with the learned GAN distribution
  - How much variation is in these 48% is unclear (to me at least, for now)
The low recall score is potentially problematic for face editing if:
- The GAN is concentrating its ‘learning’ power on an important subset of features
- For example, if the part of the data distribution it is covering contains only white faces or only males
This would mean that:
- We are not able to project a large enough variety of mugshots to control for these confounding factors
- I.e. we would only be able to ‘realistically’ edit white faces
- Note I don’t think this is really the case, just because the images I’ve been tracking contain a large enough variety and have no noticeable variation in quality across primary features
- I am planning to project a large set of images (which takes a very long time) to look into this using SIIM
- Note I am also working on a strategy to test exactly which parts of the data-distribution the GAN is covering (this is kind of very difficult)
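One possible way to probe which parts of the data distribution are covered is to keep the per-image result of the recall test and break it down by attribute. The sketch below is a suggestion only, not the strategy mentioned above: it assumes feature vectors for real and generated images are already available, and that each real image has an attribute label (from metadata or a classifier), which we may not have. All names are hypothetical.

```python
import numpy as np

def covered_mask(real_feats, fake_feats, k=3):
    """True for each real sample that lies inside some generated sample's k-NN hypersphere."""
    def dists(a, b):
        sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
        return np.sqrt(np.maximum(sq, 0.0))
    # Radius of each generated sample = distance to its k-th neighbour among generated samples.
    radii = np.sort(dists(fake_feats, fake_feats), axis=1)[:, k]
    return (dists(real_feats, fake_feats) <= radii[None, :]).any(axis=1)

def recall_by_group(real_feats, fake_feats, labels, k=3):
    """Recall computed separately for each attribute value in `labels` (assumed given)."""
    mask = covered_mask(real_feats, fake_feats, k)
    labels = np.asarray(labels)
    return {str(g): float(mask[labels == g].mean()) for g in np.unique(labels)}
```

A group whose recall is far below the overall 0.48 would point to a part of the mugshot distribution that the GAN is not covering, and the real images with a `False` mask entry are concrete examples to inspect.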

Potential next steps

Here is what I think we may want to do in the next step:

Instead of editing every mugshot, we take a well-learned subset and ensure that we have enough variation along the axes we care about
We then use this subset for editing instead of relying on perhaps lower quality images
Here we can use the truncation parameter \(\psi\) which truncates the generator distribution
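For reference, the truncation trick moves each intermediate latent towards the average latent by a factor \(\psi\). A minimal sketch, assuming latents are plain arrays and `w_avg` is the running average latent tracked during training (the function name and the default value of `psi` are illustrative, not taken from our pipeline):

```python
import numpy as np

def truncate(w, w_avg, psi=0.7):
    """Pull latents towards the average latent.

    psi=1 leaves the latents unchanged; psi=0 collapses every latent to w_avg.
    """
    w = np.asarray(w, dtype=float)
    w_avg = np.asarray(w_avg, dtype=float)
    return w_avg + psi * (w - w_avg)
```

Smaller values of \(\psi\) keep samples closer to the average face, which trades recall for precision as discussed in the truncation section.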

We need to be careful about the features we’re excluding:

In order to do this, we need to ensure that we have a general understanding of which images the GAN is not covering as well
For this we’d need to devise some sort of sampling technique to generate images with low recall scores (not sure how this could be done yet)
We need to ensure that the part of the distribution the GAN has learned contains enough variation in features we want to control for (ethnicity, gender, age, etc.) so that potential face-change-gradients we discover are generalizable enough