ID | Question |
---|---|
Q1 | Presence of artifacts |
Q2 | Signal to noise ratio |
Q3 | Image sharpness/resolution |
Q4 | Overall image quality |
Q5 | Diagnostic image quality vs standard |
Q6 | Lesion detectability vs standard |
Q7 | Which (unknown) data set do you prefer |
Q8 | Was MRI helpful for finding lesion on PET |
Evaluation of simulated low dose 18F-FDG breast PET/MRI with deep learning methods
Statistical analysis
Ordinal responses (Q1–Q6; see below) were summarized by mean and standard deviation (SD). Reader-averaged scores were compared between simulated low dose (SLD) and SLD with deep-learning-based denoising (DN) using the Wilcoxon signed rank test. Between-reader agreement was assessed by the intra-class correlation coefficient (ICC) (Shrout and Fleiss 1979) for ordinal responses and by Fleiss’ kappa (Fleiss 1971) for nominal responses (e.g., Q7). Standardized uptake values (SUVs; mean and max) for lesion and contralateral fibroglandular tissue (CL-FGT) were compared across modalities using repeated measures ANOVA. Between-modality agreement was assessed by Bland-Altman analysis and ICC. P-values < 0.05 were considered statistically significant. All analyses were performed in R version 4.3.2 (R Foundation for Statistical Computing, Vienna, Austria).
Results
Questionnaire responses
Q1–Q6
Responses to Q1–Q6 are graded on 5-point Likert scales, with 1 the worst and 5 the best. The following bar charts show the mean scores (with \(\pm\) SD error bars) comparing SLD vs DN by reader and reader-average. For all questions, significantly higher reader-average scores are recorded for DN as compared to SLD.
In particular, the reader-average scores are summarized numerically in Table 1.
ID | SLD | DN | P |
---|---|---|---|
Q1 | 3.8 (0.5) | 4.3 (0.4) | <0.001 |
Q2 | 3.2 (0.5) | 4.0 (0.4) | <0.001 |
Q3 | 3.7 (0.4) | 4.0 (0.4) | <0.001 |
Q4 | 3.4 (0.5) | 4.0 (0.3) | <0.001 |
Q5 | 1.7 (0.5) | 2.5 (0.4) | <0.001 |
Q6 | 2.7 (0.3) | 2.9 (0.3) | 0.002 |
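To make the paired comparison behind these P-values concrete, the following is a minimal Python sketch of a two-sided Wilcoxon signed rank test. It is an illustration under stated assumptions, not the analysis code used for this report (which was run in R): the function name and the scores are hypothetical, and the p-value uses the normal approximation with a tie correction and no continuity correction, so it can differ slightly from the output of statistical software.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed rank test for paired samples (normal approximation)."""
    # Paired differences; rounding guards against floating-point noise in ties.
    d = [round(b - a, 9) for a, b in zip(x, y)]
    d = [v for v in d if v != 0]  # zero differences are dropped
    n = len(d)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    # Sum of ranks of positive differences, compared with its null distribution.
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return w_plus, p

# Hypothetical reader-averaged scores for eight patients (not study data).
sld = [3.0, 3.3, 2.7, 3.7, 3.0, 3.3, 2.3, 3.0]
dn = [4.0, 4.0, 3.7, 4.3, 3.7, 4.3, 3.3, 4.0]
w_plus, p_value = wilcoxon_signed_rank(sld, dn)
print(w_plus, p_value)
```

Because the test works on the signs and ranks of the within-patient differences, it makes no normality assumption about the Likert scores themselves.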
The between-reader ICC and 95% confidence interval (CI) for each question are tabulated below. They suggest generally low agreement between the three readers, except for Q2. (This could also be due to low between-patient variation.)
ID | ICC (95% CI) |
---|---|
Q1 | -0.032 (-0.173, 0.153) |
Q2 | 0.471 (0.295, 0.636) |
Q3 | -0.026 (-0.168, 0.160) |
Q4 | 0.195 (0.020, 0.393) |
Q5 | 0.310 (0.129, 0.500) |
Q6 | 0.005 (-0.143, 0.195) |
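For readers unfamiliar with how an ICC arises from a subjects-by-readers table, here is a minimal Python sketch. It assumes the two-way random effects, single-rating form, ICC(2,1) in the notation of Shrout and Fleiss (1979); the report does not state which ICC form was used, so this choice and the function name are assumptions. The example data are the six-target, four-judge ratings from that paper, for which it reports ICC(2,1) = 0.29.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rating.

    ratings[i][j] is the score given to subject i by rater j.
    """
    n = len(ratings)       # subjects
    k = len(ratings[0])    # raters
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)

    bms = ss_rows / (n - 1)                                     # between subjects
    jms = ss_cols / (k - 1)                                     # between raters
    ems = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))  # residual
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)

# Worked example data from Shrout and Fleiss (1979): 6 targets rated by 4 judges.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(example), 2))
```

The numerator is the between-subject mean square minus the residual mean square, so the estimate falls to zero or below whenever patients differ from each other no more than the residual noise, which is consistent with the slightly negative estimates for Q1 and Q3.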
Q7
Responses to Q7 about dataset preference are summarized by reader and overall in Table 2. Readers overwhelmingly prefer DN to SLD.
Response | AF | QK | SP | Overall |
---|---|---|---|---|
SLD | 0 (0%) | 2 (8.7%) | 1 (4.3%) | 4.3% |
DN | 16 (69.6%) | 17 (73.9%) | 20 (87%) | 76.8% |
Equivalent | 2 (8.7%) | 3 (13%) | 2 (8.7%) | 10.1% |
Neither | 5 (21.7%) | 1 (4.3%) | 0 (0%) | 8.7% |
The win ratio (Mao, Kim, and Miao 2021) for preference of DN over SLD, treating Equivalent/Neither as ties, is 17.67 (95% CI, 5.52 – 56.53).
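The point estimate and the overall percentages can be checked directly from the counts in Table 2. The Python sketch below assumes the win ratio is the pooled number of readings preferring DN divided by the pooled number preferring SLD, with ties excluded; variable names are illustrative. The confidence interval is not reproduced, since it needs per-patient data and a variance estimate that accounts for three readings per patient.

```python
# Counts from Table 2, one entry per reader (AF, QK, SP).
counts = {
    "SLD": [0, 2, 1],
    "DN": [16, 17, 20],
    "Equivalent": [2, 3, 2],
    "Neither": [5, 1, 0],
}

total = sum(sum(v) for v in counts.values())  # 3 readers x 23 patients
overall_pct = {k: round(100 * sum(v) / total, 1) for k, v in counts.items()}

wins = sum(counts["DN"])     # readings that prefer DN
losses = sum(counts["SLD"])  # readings that prefer SLD
win_ratio = wins / losses    # Equivalent/Neither are ties and drop out

print(overall_pct)
print(round(win_ratio, 2))
```

With 53 readings favoring DN and 3 favoring SLD, the ratio is 53/3, matching the reported 17.67.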
The percentages of chosen options are plotted below.
The between-reader Fleiss’ kappa (95%) is 0.034. (Again, this is because patient-to-patient variation is low due to the predominance of “DN”.) The percentage of cases with consensus among the three readers is 11/23 = 47.8%. The percentage of cases with two readers agreeing and the third differing by 1 point (e.g., (2, 2, 3)) is 18/23 = 78.3%.
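The effect of one dominant option on kappa can be illustrated with a small Python sketch of Fleiss' kappa. The ratings below are hypothetical and are not the study data: 23 patients, three readers and two categories, with all readers choosing the first category for 17 patients and two of three choosing it for the remaining 6. The function name is illustrative.

```python
def fleiss_kappa(table):
    """Fleiss' kappa; table[i][j] = number of raters placing subject i in category j."""
    n_subjects = len(table)
    n_raters = sum(table[0])
    n_cats = len(table[0])

    # Share of all ratings that fall in each category.
    p_cat = [
        sum(row[j] for row in table) / (n_subjects * n_raters)
        for j in range(n_cats)
    ]
    # Proportion of agreeing rater pairs for each subject.
    p_subj = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_obs = sum(p_subj) / n_subjects   # observed agreement
    p_exp = sum(p * p for p in p_cat)  # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical ratings: 17 unanimous patients and 6 with a 2-vs-1 split.
hypothetical = [[3, 0]] * 17 + [[2, 1]] * 6
print(round(fleiss_kappa(hypothetical), 3))
```

In this made-up example, about 74% of patients have full consensus, yet kappa is slightly negative because chance agreement is already about 84% when one option receives more than 90% of the ratings.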
Q8
Responses to Q8 about whether MRI was helpful for locating/detecting the lesion on PET are summarized by reader and overall in Table 3.
Response | AF | QK | SP | Overall |
---|---|---|---|---|
No | 11 (47.8%) | 0 (0%) | 0 (0%) | 15.9% |
Somewhat | 5 (21.7%) | 0 (0%) | 16 (69.6%) | 30.4% |
Very | 7 (30.4%) | 23 (100%) | 7 (30.4%) | 53.6% |
There are marked differences between the readers: QK chooses “very helpful” for all subjects; SP divides choice between “very” and “somewhat helpful”; AF considers nearly half of the cases as “unhelpful”. The data are visualized below.
SUV
Bland-Altman analysis
Bland-Altman plots of SUVs for each pair of modalities are shown in Figure 1 below.
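As a reminder of what each Bland-Altman plot displays, the Python sketch below computes the plotted quantities for one pair of modalities: the per-lesion mean and difference, the bias (mean difference), and the 95% limits of agreement (bias ± 1.96 SD of the differences). The SUVs and the function name are hypothetical and are not taken from the study.

```python
import statistics

def bland_altman(a, b):
    """Return the quantities shown in a Bland-Altman plot for paired measurements."""
    diffs = [x - y for x, y in zip(a, b)]        # y-axis: difference per pair
    means = [(x + y) / 2 for x, y in zip(a, b)]  # x-axis: average per pair
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)                 # sample SD of the differences
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "lower": bias - 1.96 * sd,  # lower 95% limit of agreement
        "upper": bias + 1.96 * sd,  # upper 95% limit of agreement
    }

# Hypothetical SUVmax values for five lesions (not study data).
sld_suv = [4.1, 2.3, 6.8, 3.5, 9.2]
standard_suv = [3.9, 2.2, 6.1, 3.4, 8.0]
result = bland_altman(sld_suv, standard_suv)
print(result["bias"], result["lower"], result["upper"])
```

A bias away from zero indicates a systematic offset between modalities, while wide limits indicate poor agreement for individual lesions even when the bias is small.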
Repeated measures ANOVA
Boxplots of SUVs by tissue type and modality are shown in Figure 2.
Repeated measures ANOVA is used to test the between-modality differences, followed by pairwise (paired) t-tests with Bonferroni correction. Table 4 summarizes the results.
Tissue | Variable | SLD | DN | Standard | Overall | SLD vs Standard | DN vs Standard | SLD vs DN |
---|---|---|---|---|---|---|---|---|
Lesion | SUVmean | 4.08 (2.1) | 4.06 (2.14) | 3.87 (2.08) | P<0.001 | P<0.001 | P<0.001 | P=1 |
Lesion | SUVmax | 10.52 (8.9) | 8.94 (8.15) | 7.58 (6.31) | P<0.001 | P<0.001 | P=0.012 | P<0.001 |
CL-FGT | SUVmean | 1.34 (0.4) | 1.36 (0.4) | 1.33 (0.4) | P=0.168 | P=1 | P=0.351 | P=0.119 |
CL-FGT | SUVmax | 2.04 (0.52) | 1.72 (0.44) | 1.56 (0.44) | P<0.001 | P<0.001 | P<0.001 | P<0.001 |
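The pairwise P-values reported as exactly 1 are consistent with how Bonferroni correction works: each raw p-value is multiplied by the number of comparisons and capped at 1. A minimal Python sketch follows, assuming the correction is applied across the three pairwise comparisons within each row; the raw p-values and the function name are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical unadjusted p-values for the three pairwise paired t-tests.
raw = {
    "SLD vs Standard": 0.004,
    "DN vs Standard": 0.20,
    "SLD vs DN": 0.45,
}
adjusted = dict(zip(raw, bonferroni(list(raw.values()))))
for comparison, p in adjusted.items():
    print(comparison, round(p, 3))
```

Here the third comparison would be reported as P=1, because three times its raw p-value exceeds 1.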