The score distribution of real images is summarized below:

Score Proportion
1 0.030
2 0.273
3 0.515
4 0.182
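
As a quick check on the distribution above, the following minimal sketch verifies that the proportions sum to one and computes a few summaries that matter when two images are compared pairwise on their scores. It assumes, hypothetically, that the comparison is win/loss/tie on the ordinal score with independent draws; the variable names are illustrative and not part of the original analysis.

```python
# Score distribution of real images, copied from the table above.
real = {1: 0.030, 2: 0.273, 3: 0.515, 4: 0.182}

# The proportions should form a valid probability distribution.
total = sum(real.values())

# Probability that two independent draws receive the same score (a tie).
p_tie = sum(p * p for p in real.values())

# With identical distributions, the non-tie mass splits evenly
# between wins and losses.
p_win = p_loss = (1 - p_tie) / 2

# Mean score, for reference.
mean_score = sum(score * p for score, p in real.items())

print(f"sum={total:.3f} tie={p_tie:.3f} win={p_win:.3f} mean={mean_score:.3f}")
```

A sizeable tie probability reduces the information each pair carries, which is one reason the required sample sizes grow quickly as the tolerance level shrinks.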

Based on this distribution, we calculate the total number of images (real and synthetic combined, allocated 1:1) needed for a two-sided test to reach significance when synthetic images are worse than real ones by a given proportion (i.e., the tolerance level). The calculated sample sizes, as a function of the tolerance level at 80%, 70%, and 60% power, are plotted below.

The exact numbers for certain tolerance levels are tabulated below.

Table 1. Total sample size needed for different tolerance levels
Tolerance level   60% power   70% power   80% power
0.08              291         367         466
0.10              158         199         253
0.12              90          113         143
0.14              50          63          80
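
The relationship between the power columns of Table 1 can be checked with a short sketch. It assumes that the sample size is proportional to (z_{1-alpha/2} + z_power)^2, as is typical of normal-approximation sample size formulas, and that the two-sided significance level is alpha = 0.05; neither is stated explicitly above. The function name `z_factor` is illustrative.

```python
from statistics import NormalDist


def z_factor(power, alpha=0.05):
    """Return (z_{1-alpha/2} + z_power)^2 for a two-sided test."""
    nd = NormalDist()
    return (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)) ** 2


# Table 1: tolerance level -> (n at 60%, n at 70%, n at 80% power).
table = {
    0.08: (291, 367, 466),
    0.10: (158, 199, 253),
    0.12: (90, 113, 143),
    0.14: (50, 63, 80),
}

# Rescale the 80%-power column to 60% and 70% power and compare
# with the tabulated values.
rescaled = {}
for tol, (n60, n70, n80) in table.items():
    est60 = n80 * z_factor(0.60) / z_factor(0.80)
    est70 = n80 * z_factor(0.70) / z_factor(0.80)
    rescaled[tol] = (est60, est70)
    print(f"{tol:.2f}: 60% ~ {est60:.1f} (table {n60}), "
          f"70% ~ {est70:.1f} (table {n70})")
```

Under these assumptions the rescaled values agree with the tabulated ones to within one image, with the small gaps attributable to rounding of the 80%-power column.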

References

  1. Mao, L., Kim, K., and Miao, X. (2022). Sample size formula for general win ratio analysis. Biometrics, 78, 1257-1268.