Comparison of real vs synthetic images
Statistical analysis
Scores on a 1-4 Likert scale (see below) were summarized by frequency and percentage, as well as mean and standard deviation (SD), by image type (real or fake) and rater (GT or JW). Inter-rater agreement was assessed with the intraclass correlation coefficient (ICC; Bartko 1966) for the original scores and with Cohen’s kappa (Cohen 1960) for the dichotomized scores (1-2 vs 3-4). Within each rater, real and fake images were compared on the original scores using the Wilcoxon rank sum test and on the dichotomized scores using the odds ratio (OR) and chi-square test. P values < 0.05 were considered statistically significant. All analyses were performed in R version 4.3.2 (R Foundation for Statistical Computing, Vienna, Austria).
1 | 2 | 3 | 4 |
---|---|---|---|
Poor | Suboptimal | Adequate | Optimal |
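As a concrete illustration of the descriptive summaries, the sketch below assumes the ratings are stored in a long-format data frame `dat` with one row per image-rater pair and columns `image_id`, `type` (Real/Fake), `rater` (GT/JW), and `score` (1-4); the file name and column names are hypothetical rather than the actual analysis code.

```r
## Minimal sketch of the descriptive summaries, assuming a hypothetical
## long-format data frame with columns image_id, type, rater, score.
dat <- read.csv("scores.csv")  # hypothetical input file

## Frequency and percentage of each score by image type and rater (cf. Table 1)
tab <- with(dat, table(type, rater, score))
tab                                                 # counts
round(100 * prop.table(tab, margin = c(1, 2)), 1)   # percentages within each type x rater

## Mean and SD of the scores by image type and rater (cf. Table 3)
aggregate(score ~ type + rater, data = dat,
          FUN = function(x) c(mean = mean(x), sd = sd(x)))
```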
Inter-rater agreement
The frequency and percentage of scores are summarized by image type and rater in Table 1, with score distributions plotted in Figure 1. The two raters have similar score distributions for fake images but not for real ones: GT tends to score real images lower, whereas JW tends to score them higher.
Type | Rater | 1 | 2 | 3 | 4 | Overall |
---|---|---|---|---|---|---|
Real | GT | 4 (3.4%) | 24 (20.3%) | 56 (47.5%) | 34 (28.8%) | 118 (100%) |
Real | JW | 0 (0%) | 5 (4.2%) | 47 (39.8%) | 66 (55.9%) | 118 (100%) |
Fake | GT | 9 (6.9%) | 22 (16.8%) | 77 (58.8%) | 23 (17.6%) | 131 (100%) |
Fake | JW | 2 (1.5%) | 26 (19.8%) | 89 (67.9%) | 14 (10.7%) | 131 (100%) |
The ICCs for inter-rater agreement and their 95% confidence intervals (CI) are tabulated in Table 2. Agreement is stronger for fake images (ICC = 0.34, P < 0.001) than for real images, for which the ICC is essentially zero (ICC = -0.05, P = 0.702).
Type | ICC (95% CI) | P (ICC=0) |
---|---|---|
Real | -0.049 (-0.227, 0.132) | 0.702 |
Fake | 0.339 (0.178, 0.481) | <0.001 |
Overall | 0.201 (0.079, 0.317) | <0.001 |
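For reference, the ICCs above could be computed along the following lines. The sketch assumes the `irr` package (the report states only that the Bartko 1966 ICC was used, not which implementation) and reuses the hypothetical `dat` data frame from the earlier sketch.

```r
## Sketch of the per-type ICC under a one-way model (Bartko 1966), assuming
## the irr package; ratings are reshaped to one score column per rater.
library(irr)

wide <- reshape(dat, idvar = c("image_id", "type"), timevar = "rater",
                direction = "wide")   # creates columns score.GT and score.JW

for (tp in c("Real", "Fake")) {
  ratings <- wide[wide$type == tp, c("score.GT", "score.JW")]
  print(icc(ratings, model = "oneway", unit = "single"))  # ICC, 95% CI, P value
}
```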
Real vs fake
We compare the mean scores of real versus fake images for each rater and for the two-rater average. Table 3 and Figure 2 show the results. GT gives similar scores to real and fake images (P = 0.161), whereas JW rates real images significantly higher than fake ones (P < 0.001).
Rater | Real (N=118), mean (SD) | Fake (N=131), mean (SD) | P |
---|---|---|---|
GT | 3.0 (0.8) | 2.9 (0.8) | 0.161 |
JW | 3.5 (0.6) | 2.9 (0.6) | <0.001 |
Avg | 3.3 (0.5) | 2.9 (0.6) | <0.001 |
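The P values in Table 3 correspond to Wilcoxon rank sum tests on the original 1-4 scores; a sketch using the same hypothetical `dat` (the "Avg" row would apply the same test to the two-rater average):

```r
## Sketch of the within-rater comparison of real vs fake images on the
## original 1-4 scores (Wilcoxon rank sum test). With tied ranks, R falls
## back to the normal approximation and emits a warning.
for (rt in c("GT", "JW")) {
  sub <- dat[dat$rater == rt, ]
  print(wilcox.test(score ~ type, data = sub))
}
```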
Dichotomized scores
We dichotomize the scores as 1-2 (“Bad”) vs 3-4 (“Good”). The dichotomized score distributions can be read directly from Figure 1 (using the cutoff between 2 and 3). Table 4 lists Cohen’s kappa coefficients for inter-rater agreement by image type. As with the original scores, agreement is stronger on fake images than on real ones.
Type | Kappa | P (kappa=0) |
---|---|---|
Real | -0.077 | 0.202 |
Fake | 0.191 | 0.028 |
Overall | 0.083 | 0.162 |
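A sketch of the dichotomized agreement analysis, reusing the hypothetical `wide` data frame from the ICC sketch and again assuming the `irr` package for Cohen’s kappa:

```r
## Dichotomize: scores 1-2 = "Bad" (0), 3-4 = "Good" (1), then compute the
## unweighted Cohen's kappa per image type (package choice is an assumption).
wide$good.GT <- as.integer(wide$score.GT >= 3)
wide$good.JW <- as.integer(wide$score.JW >= 3)

for (tp in c("Real", "Fake")) {
  ratings <- wide[wide$type == tp, c("good.GT", "good.JW")]
  print(kappa2(ratings))  # kappa estimate and test of kappa = 0
}
```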
Table 5 below summarizes the number and percentage of “Good” (3-4) scores by image type and rater, together with the odds ratio (OR) for a good score comparing fake vs real images. For GT, the odds of a good score are essentially identical for the two image types (OR = 1); for JW, however, the odds of a good score for a fake image are only 0.16 times those for a real image.
Rater | Real (N=118): Good, n (%) | Fake (N=131): Good, n (%) | OR (95% CI) | P |
---|---|---|---|---|
GT | 90 (76.3%) | 100 (76.3%) | 1 (0.56, 1.8) | 0.99 |
JW | 113 (95.8%) | 103 (78.6%) | 0.16 (0.06, 0.44) | <0.001 |
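The ORs and chi-square tests could be reproduced roughly as below, reusing the hypothetical `dat`. Note that `fisher.test()` reports a conditional maximum-likelihood OR, so its estimate and CI may differ slightly from the Table 5 values depending on the estimator actually used; the explicit factor levels make the OR read as fake vs real.

```r
## Sketch: odds ratio for a "Good" (3-4) score, fake vs real, within each rater.
dat$good <- as.integer(dat$score >= 3)

for (rt in c("GT", "JW")) {
  sub <- dat[dat$rater == rt, ]
  tab <- table(factor(sub$type, levels = c("Real", "Fake")), sub$good)
  print(fisher.test(tab))  # OR estimate (fake vs real) with 95% CI
  print(chisq.test(tab))   # chi-square test P value
}
```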
(The two raters could also be combined, but we did not pool them because their rating patterns are so different.)