Type | Rater | Score 1 | Score 2 | Score 3 | Score 4 | Score 5 | Overall |
---|---|---|---|---|---|---|---|
Real | Rater 1 | 0 (0%) | 3 (2.4%) | 37 (29.6%) | 68 (54.4%) | 17 (13.6%) | 125 (100%) |
Real | Rater 2 | 11 (8.8%) | 17 (13.6%) | 46 (36.8%) | 35 (28%) | 16 (12.8%) | 125 (100%) |
Synthetic | Rater 1 | 0 (0%) | 6 (4.8%) | 32 (25.6%) | 57 (45.6%) | 30 (24%) | 125 (100%) |
Synthetic | Rater 2 | 5 (4%) | 29 (23.2%) | 58 (46.4%) | 28 (22.4%) | 5 (4%) | 125 (100%) |
Evaluation of real vs synthetic images by two readers
Statistical analysis
Scores on a Likert scale of 1 to 5 were summarized by frequency and percentage, as well as by mean and standard deviation (SD), for each image type (real or synthetic) and rater (MS or GVT). Inter-rater agreement was assessed with the intraclass correlation coefficient (ICC; Bartko 1966). Scores for real and synthetic images were compared using the Wilcoxon rank sum test, separately for each rater and for the two-rater average. P values < 0.05 were considered statistically significant. All analyses were performed in R version 4.4.2 (R Foundation for Statistical Computing, Vienna, Austria).
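The descriptive summary can be reproduced from a frequency table alone. The following is a minimal Python sketch for illustration only (the analysis itself was run in R). It uses the Rater 2 counts for real images from Table 1; `summarize_counts` is a hypothetical helper name, not part of the original analysis.

```python
import statistics

def summarize_counts(counts):
    """Summarize Likert scores from counts for scores 1..5 (hypothetical helper)."""
    total = sum(counts)
    # Expand the frequency table back into one score per image.
    scores = [score for score, n in enumerate(counts, start=1) for _ in range(n)]
    return {
        "n": total,
        "percent": [round(100 * n / total, 1) for n in counts],
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),  # sample SD (n - 1 denominator)
    }

# Rater 2, real images: counts for scores 1 to 5, taken from Table 1.
summary = summarize_counts([11, 17, 46, 35, 16])
print(summary["percent"], round(summary["mean"], 1), round(summary["sd"], 1))
```

Rounded to one decimal, the mean and SD from this sketch agree with the 3.2 (1.1) reported for Rater 2 on real images in Table 3.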
Inter-rater agreement
- Rater 1: MS
- Rater 2: GVT
The frequency and percentage of scores are summarized by image type and rater in Table 1, with score distributions plotted in Figure 1.
The ICCs for inter-rater agreement and their 95% confidence intervals (CI) are tabulated in Table 2.
Type | ICC (95% CI) | P (ICC=0) |
---|---|---|
Synthetic | -0.05 (-0.222, 0.126) | 0.709 |
Real | 0.087 (-0.089, 0.258) | 0.166 |
Overall | 0.018 (-0.106, 0.142) | 0.386 |
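To illustrate how an ICC near zero, or even below zero, can arise, here is a minimal Python sketch of a one-way random-effects, single-rater ICC computed from ANOVA mean squares. The choice of this ICC form is an assumption, since the text does not state which model was used, and the ratings below are hypothetical, not study data.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC for single ratings (assumed form).

    ratings: list of per-image tuples, one score per rater.
    Undefined (division by zero) if every score in the data is identical.
    """
    n = len(ratings)      # number of images
    k = len(ratings[0])   # number of raters
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-image and within-image sums of squares.
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2 for row, m in zip(ratings, row_means) for x in row)

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical ratings: (rater 1 score, rater 2 score) per image.
perfect_agreement = [(4, 4), (3, 3), (5, 5)]
systematic_disagreement = [(1, 5), (5, 1), (3, 3)]
print(icc_oneway(perfect_agreement), icc_oneway(systematic_disagreement))
```

When the two raters differ within images more than the images differ from one another, the within-image mean square exceeds the between-image mean square and the estimate becomes negative, as in the Synthetic row of Table 2.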
Real vs synthetic
Scores for real and synthetic images were compared for each rater and for the two-rater average. Table 3 and Figure 2 show the results, with scores reported as mean (SD).
Rater | Real (N=125) | Synthetic (N=125) | P |
---|---|---|---|
Rater 1 | 3.8 (0.7) | 3.9 (0.8) | 0.257 |
Rater 2 | 3.2 (1.1) | 3 (0.9) | 0.041 |
Average | 3.5 (0.7) | 3.4 (0.7) | 0.5 |
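Because each rater's scores are fully described by the counts in Table 1, the single-rater comparisons can be sketched without the raw data. The Python sketch below implements a two-sided Wilcoxon rank sum test using midranks and a normal approximation with tie and continuity corrections. That approximation is an assumption about how the reported P values were obtained, and `rank_sum_test` is a hypothetical helper name.

```python
import math

def rank_sum_test(counts_a, counts_b):
    """Two-sided Wilcoxon rank sum test from per-score counts.

    Uses midranks for ties and a normal approximation with tie and
    continuity corrections (assumed, not confirmed by the source).
    Returns (U statistic for group a, z, p value).
    """
    n_a, n_b = sum(counts_a), sum(counts_b)
    n = n_a + n_b
    rank_sum_a = 0.0
    tie_term = 0.0
    ranked = 0  # observations already assigned ranks
    for c_a, c_b in zip(counts_a, counts_b):
        t = c_a + c_b  # size of this tie group
        if t == 0:
            continue
        midrank = ranked + (t + 1) / 2
        rank_sum_a += c_a * midrank
        tie_term += t ** 3 - t
        ranked += t

    u = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    var_u = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))

    diff = u - mean_u
    correction = 0.5 if diff > 0 else -0.5 if diff < 0 else 0.0
    z = (diff - correction) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return u, z, p

# Rater 2 counts for scores 1 to 5, taken from Table 1.
real_r2 = [11, 17, 46, 35, 16]
synthetic_r2 = [5, 29, 58, 28, 5]
u, z, p = rank_sum_test(real_r2, synthetic_r2)
print(u, round(z, 2), round(p, 3))
```

Under these assumptions, the Rater 2 comparison should give a P value close to the 0.041 reported in Table 3. The two-rater average cannot be checked this way, because it needs the paired scores for each image.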
Comparison of PSNR scores between models
DDPM | VQVAE | AEKL | CycleGAN |
---|---|---|---|
<0.001 | <0.001 | <0.001 | <0.001 |
- Although the mean PSNR scores of dsSNICT and VQVAE look close, dsSNICT is higher in 54 of 71 slices (76.1%).