Evaluation of real vs synthetic images by two readers

Author

Lu Mao

Published

March 27, 2025

Statistical analysis

Scores on a 5-point Likert scale (1 to 5) were summarized by frequency and percentage, and by mean and standard deviation (SD), for each image type (real or synthetic) and rater (MS or GVT). Inter-rater agreement was assessed with the intraclass correlation coefficient (ICC; Bartko 1966). For each rater and for the two-rater average, scores of real and synthetic images were compared using the Wilcoxon rank sum test. Peak signal-to-noise ratio (PSNR) scores of dsSNICT were compared with those of each other model on the same slices using the Wilcoxon signed rank test. P values < 0.05 were considered statistically significant. All analyses were performed in R version 4.4.2 (R Foundation for Statistical Computing, Vienna, Austria).

Inter-rater agreement

Rater 1: MS
Rater 2: GVT

The frequency and percentage of scores are summarized by image type and rater in Table 1, with score distributions plotted in Figure 1.

Table 1: Score summaries.

Type	Rater	1	2	3	4	5	Overall
Real	Rater 1	0 (0.0%)	3 (2.4%)	37 (29.6%)	68 (54.4%)	17 (13.6%)	125 (100%)
Real	Rater 2	11 (8.8%)	17 (13.6%)	46 (36.8%)	35 (28.0%)	16 (12.8%)	125 (100%)
Synthetic	Rater 1	0 (0.0%)	6 (4.8%)	32 (25.6%)	57 (45.6%)	30 (24.0%)	125 (100%)
Synthetic	Rater 2	5 (4.0%)	29 (23.2%)	58 (46.4%)	28 (22.4%)	5 (4.0%)	125 (100%)

Figure 1: Score distributions by rater.
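As a consistency check, the means and SDs reported later in Table 3 can be recomputed from the frequency counts in Table 1. The following is a minimal sketch, not the author's R code; the names `COUNTS` and `mean_sd` are illustrative, and the SD is assumed to use the usual n - 1 denominator.

```python
import math

# Frequency counts for scores 1..5, copied from Table 1.
COUNTS = {
    ("Real", "Rater 1"): [0, 3, 37, 68, 17],
    ("Real", "Rater 2"): [11, 17, 46, 35, 16],
    ("Synthetic", "Rater 1"): [0, 6, 32, 57, 30],
    ("Synthetic", "Rater 2"): [5, 29, 58, 28, 5],
}


def mean_sd(counts):
    """Mean and sample SD (n - 1 denominator) of scores 1..5 given their counts."""
    n = sum(counts)
    mean = sum(score * c for score, c in enumerate(counts, start=1)) / n
    ss = sum(c * (score - mean) ** 2 for score, c in enumerate(counts, start=1))
    return mean, math.sqrt(ss / (n - 1))


for (image_type, rater), counts in COUNTS.items():
    m, s = mean_sd(counts)
    print(f"{image_type:9s} {rater}: {m:.1f} ({s:.1f})")
```

Rounded to one decimal place, the four mean (SD) pairs agree with the per-rater entries of Table 3. The two-rater average cannot be recovered this way, because it depends on how the two raters' scores pair up on each image.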

The ICCs for inter-rater agreement and their 95% confidence intervals (CI) are tabulated in Table 2.

Table 2: Intraclass correlation coefficients (ICC) between Rater 1 (MS) and Rater 2 (GVT).

Type	ICC (95% CI)	P (ICC=0)
Synthetic	-0.050 (-0.222, 0.126)	0.709
Real	0.087 (-0.089, 0.258)	0.166
Overall	0.018 (-0.106, 0.142)	0.386
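The report does not state which form of the ICC was used. For illustration only, the sketch below computes a one-way random-effects ICC for single ratings from ANOVA mean squares, which is one of the forms discussed in the reliability literature. The function name `icc_oneway` and the toy scores are hypothetical and are not the study data.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC for single ratings.

    ratings: list of per-image tuples, one score per rater.
    This is an assumed ICC form; the report does not specify its variant.
    """
    n = len(ratings)          # number of images
    k = len(ratings[0])       # number of raters
    grand = sum(sum(r) for r in ratings) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    # Between-image and within-image mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for r, m in zip(ratings, row_means) for x in r
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical (Rater 1, Rater 2) scores for five images.
toy = [(4, 3), (3, 3), (5, 2), (4, 4), (3, 5)]
print(round(icc_oneway(toy), 3))
```

Under this form the estimate is negative whenever the within-image mean square exceeds the between-image mean square, which is how values below zero such as those in Table 2 can arise.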

Real vs synthetic

Scores of real and synthetic images were compared for each rater and for the two-rater average using the Wilcoxon rank sum test. Table 3 and Figure 2 show the results.

Table 3: Mean (SD) scores by rater and rater-average.

Rater	Real (N=125)	Synthetic (N=125)	P
Rater 1	3.8 (0.7)	3.9 (0.8)	0.257
Rater 2	3.2 (1.1)	3.0 (0.9)	0.041
Average	3.5 (0.7)	3.4 (0.7)	0.500

Figure 2: Bar plot of mean scores by image type and rater.
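The per-rater P values in Table 3 can be approximately recomputed from the counts in Table 1. The sketch below is not the author's R code. It assumes a two-sided Wilcoxon rank sum test using the normal approximation with tie and continuity corrections; the report does not say which variant was used, and the function name `rank_sum_p` is illustrative.

```python
import math


def rank_sum_p(counts_x, counts_y):
    """Two-sided Wilcoxon rank sum P value for two samples given as counts
    per ordered category (normal approximation, tie and continuity corrected)."""
    n_x, n_y = sum(counts_x), sum(counts_y)
    n = n_x + n_y
    rank_sum_x = 0.0
    tie_term = 0
    below = 0  # number of observations in lower categories
    for cx, cy in zip(counts_x, counts_y):
        t = cx + cy
        midrank = below + (t + 1) / 2  # average rank shared by this tie group
        rank_sum_x += cx * midrank
        tie_term += t**3 - t
        below += t
    u = rank_sum_x - n_x * (n_x + 1) / 2
    diff = u - n_x * n_y / 2
    var_u = n_x * n_y / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    # Continuity correction: move the difference 0.5 toward zero.
    z = (diff - math.copysign(0.5, diff)) / math.sqrt(var_u) if diff != 0 else 0.0
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


# Counts for scores 1..5 from Table 1: real vs synthetic, per rater.
p_rater1 = rank_sum_p([0, 3, 37, 68, 17], [0, 6, 32, 57, 30])
p_rater2 = rank_sum_p([11, 17, 46, 35, 16], [5, 29, 58, 28, 5])
print(f"Rater 1: {p_rater1:.3f}")
print(f"Rater 2: {p_rater2:.3f}")
```

Under these assumptions both values round to the per-rater entries of Table 3. The P value for the two-rater average cannot be checked this way, since it requires the paired per-image scores.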

Comparison of PSNR scores between models

Table 4: P values from Wilcoxon signed rank tests comparing per-slice PSNR scores of dsSNICT with those of each other model.

DDPM	VQVAE	AEKL	CycleGAN
<0.001	<0.001	<0.001	<0.001

Figure 3: Box plots of PSNR scores comparing dsSNICT with the other models. Gray lines connect the two scores on the same slice.

Although the mean PSNR scores of dsSNICT and VQVAE appear close, dsSNICT scores higher in 54 of 71 slices (76.1%).
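The slice count above can be checked with a few lines of arithmetic. The sketch below also runs an exact two-sided sign test on the same count. This is an illustrative alternative and not the Wilcoxon signed rank test reported in Table 4; it assumes that none of the 71 slices has tied PSNR scores, and the function name `sign_test_p` is hypothetical.

```python
from math import comb

n_slices = 71   # slices compared, from the text
n_higher = 54   # slices where dsSNICT has the higher PSNR, from the text

share = 100 * n_higher / n_slices
print(f"{share:.1f}% of slices favor dsSNICT")


def sign_test_p(k, n):
    """Exact two-sided sign test P value for k 'wins' out of n untied pairs."""
    extreme = max(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2**n
    return min(1.0, 2 * tail)


p = sign_test_p(n_higher, n_slices)
print(f"sign test P < 0.001: {p < 0.001}")
```

The sign test uses only the direction of each paired difference, so it discards the magnitudes that the signed rank test takes into account.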

References

Bartko, John J. 1966. “The Intraclass Correlation Coefficient as a Measure of Reliability.” Psychological Reports 19 (1): 3–11. https://doi.org/10.2466/pr0.1966.19.1.3.