Comparison of real vs synthetic images

Author

Lu Mao

Statistical analysis

Scores on a Likert scale of 1-4 (see below) were summarized by frequency and percentage, as well as mean and standard deviation (SD), by image type (real or fake) and rater (GT or JW). Inter-rater agreement was assessed by the intraclass correlation coefficient (ICC; Bartko 1966) for the original scores and by Cohen’s kappa (Cohen 1960) for the dichotomized scores (1-2 vs 3-4). For each rater, real and fake images were compared on the original scores using the Wilcoxon rank sum test and on the dichotomized scores using the odds ratio (OR) and chi-square test. P values < 0.05 were considered statistically significant. All analyses were performed in R version 4.3.2 (R Foundation for Statistical Computing, Vienna, Austria).

Scores on a 4-point Likert scale.
Score 1 2 3 4
Label Poor Suboptimal Adequate Optimal
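
As a minimal sketch of how these summaries could be produced in R, assuming a long-format data frame `scores` with one row per image-rater pair and columns `type` ("Real"/"Fake"), `rater` ("GT"/"JW"), and `score` (1-4); the data frame and column names are illustrative, not code from the original analysis:

```r
# Frequency and percentage of each score level by image type and rater
freq <- with(scores, table(type, rater, score))
prop <- prop.table(freq, margin = c(1, 2)) * 100

# Mean and SD of the scores by image type and rater
aggregate(score ~ type + rater, data = scores,
          FUN = function(x) c(mean = mean(x), sd = sd(x)))
```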

Inter-rater agreement

The frequency and percentage of scores are summarized by image type and rater in Table 1, with score distributions plotted in Figure 1. The two raters show similar score distributions for fake images but not for real ones: GT tends to give real images lower scores, whereas JW tends to give them higher scores.

Table 1: Frequency (percentage) of scores by image type and rater.
Type Rater 1 2 3 4 Overall
Real GT 4 (3.4%) 24 (20.3%) 56 (47.5%) 34 (28.8%) 118 (100%)
Real JW 0 (0%) 5 (4.2%) 47 (39.8%) 66 (55.9%) 118 (100%)
Fake GT 9 (6.9%) 22 (16.8%) 77 (58.8%) 23 (17.6%) 131 (100%)
Fake JW 2 (1.5%) 26 (19.8%) 89 (67.9%) 14 (10.7%) 131 (100%)

Figure 1: Score distributions by image type and rater.

The ICCs for inter-rater agreement and their 95% confidence intervals (CI) are tabulated in Table 2. Agreement is stronger for fake images (ICC = 0.34) than for real images (ICC ≈ 0).

Table 2: Intraclass correlation coefficients (ICC) between GT and JW.
Type ICC (95% CI) P (ICC=0)
Real -0.049 (-0.227, 0.132) 0.702
Fake 0.339 (0.178, 0.481) <0.001
Overall 0.201 (0.079, 0.317) <0.001
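
A sketch of how these ICCs might be computed with the `irr` package, assuming a wide-format data frame `wide` with one row per image and columns `type`, `GT`, and `JW` (names are illustrative); the one-way model is our reading of the Bartko (1966) reference, not code from the original analysis:

```r
library(irr)

# One ICC per image type; `wide` has one row per image with the two raters'
# scores in columns GT and JW (illustrative names)
for (tp in c("Real", "Fake")) {
  ratings <- wide[wide$type == tp, c("GT", "JW")]
  print(icc(ratings, model = "oneway", unit = "single"))
}

# Overall ICC pooling both image types
icc(wide[, c("GT", "JW")], model = "oneway", unit = "single")
# Each call reports the ICC estimate, its 95% CI, and a test of H0: ICC = 0,
# matching the entries in Table 2.
```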

Real vs fake

We compare the mean scores of real versus fake images for each rater and for the two-rater average; the results are shown in Table 3 and Figure 2. GT gives similar scores to real and fake images, whereas JW rates real images significantly higher than fake ones.

Table 3: Mean (SD) scores by rater and rater-average; P values are from Wilcoxon rank sum tests.
Rater Real (N=118) Fake (N=131) P
GT 3 (0.8) 2.9 (0.8) 0.161
JW 3.5 (0.6) 2.9 (0.6) <0.001
Avg 3.3 (0.5) 2.9 (0.6) <0.001

Figure 2: Bar plot of mean scores by image type and rater.
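
A sketch of the corresponding tests, again assuming the long-format `scores` data frame above plus an image identifier column `id` (an assumption) for the two-rater average:

```r
# Real vs fake on the original 1-4 scores, separately for each rater
wilcox.test(score ~ type, data = subset(scores, rater == "GT"))
wilcox.test(score ~ type, data = subset(scores, rater == "JW"))

# Two-rater average: average the two scores per image, then compare types
avg <- aggregate(score ~ id + type, data = scores, FUN = mean)
wilcox.test(score ~ type, data = avg)
```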

Dichotomized scores

We dichotomize the scores as 1-2 (Bad) vs 3-4 (Good). The dichotomized score distributions can be read directly from Figure 1 (using the cutoff between 2 and 3). Table 4 lists Cohen’s kappa coefficients for inter-rater agreement by image type. As with the original scores, agreement is stronger for fake images than for real ones.

Table 4: Cohen’s kappa for binary scores between GT and JW.
Type Kappa P (kappa=0)
Real -0.077 0.202
Fake 0.191 0.028
Overall 0.083 0.162
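
A sketch of the kappa computation with `irr::kappa2`, reusing the wide-format `wide` data frame assumed above:

```r
library(irr)

# Dichotomize each rater's scores: TRUE = Good (3-4), FALSE = Bad (1-2)
good <- data.frame(GT = wide$GT >= 3, JW = wide$JW >= 3)

kappa2(good[wide$type == "Real", ])  # unweighted Cohen's kappa, real images
kappa2(good[wide$type == "Fake", ])  # fake images
kappa2(good)                         # overall
# Each call reports the kappa estimate and a test of H0: kappa = 0 (Table 4).
```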

Table 5 below summarizes the frequency and percentage of “Good” (3-4) scores by image type and rater, together with the odds ratio (OR) comparing fake versus real images. For GT, the two image types have identical proportions of good scores (OR = 1); for JW, however, the odds of a good score for fake images are only 0.16 times those for real images.

Table 5: Frequency and percentage of Good (3-4) scores, with OR comparing fake vs real images; P values are from chi-square tests.
Rater Real (N=118) Fake (N=131) OR (95% CI) P
GT 90 (76.3%) 100 (76.3%) 1 (0.56, 1.8) 0.99
JW 113 (95.8%) 103 (78.6%) 0.16 (0.06, 0.44) <0.001

(The two raters could also be combined, but we did not do so because their rating patterns differ substantially.)
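
A sketch of the OR and chi-square computation behind Table 5, again assuming the long-format `scores` data frame (the helper function below is illustrative):

```r
# OR (with Wald 95% CI) and chi-square test for Good (3-4) scores,
# comparing fake vs real images within one rater
or_test <- function(rater_id) {
  d   <- subset(scores, rater == rater_id)
  tab <- table(d$type, d$score >= 3)         # rows: image type; cols: Bad/Good
  or  <- (tab["Fake", "TRUE"] / tab["Fake", "FALSE"]) /
         (tab["Real", "TRUE"] / tab["Real", "FALSE"])
  ci  <- exp(log(or) + c(-1, 1) * qnorm(0.975) * sqrt(sum(1 / tab)))
  list(OR = or, CI = ci, test = chisq.test(tab))
}

or_test("GT")
or_test("JW")
```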

References

Bartko, John J. 1966. “The Intraclass Correlation Coefficient as a Measure of Reliability.” Psychological Reports 19 (1): 3–11. https://doi.org/10.2466/pr0.1966.19.1.3.
Cohen, Jacob. 1960. “A Coefficient of Agreement for Nominal Scales.” Educational and Psychological Measurement 20 (1): 37–46. https://doi.org/10.1177/001316446002000104.