Evaluation of simulated low dose 18F-FDG breast PET/MRI with deep learning methods

Author

Lu Mao

Statistical analysis

Ordinal responses (Q1–Q6; see below) were summarized by mean and standard deviation (SD). Reader-averaged scores were compared between simulated low dose (SLD) and SLD with deep-learning-based denoising (DN) using the Wilcoxon signed-rank test. Between-reader agreement was assessed by the intra-class correlation coefficient (ICC) (Shrout and Fleiss 1979) for ordinal responses and by Fleiss’ kappa (Fleiss 1971) for nominal responses (e.g., Q7). Standardized uptake values (SUVs; mean and max) for lesion and contralateral fibroglandular tissue (CL-FGT) were compared across modalities using repeated measures ANOVA. Between-modality agreement was assessed by Bland-Altman analysis and ICC. P-values < 0.05 were considered statistically significant. All analyses were performed in R version 4.3.2 (R Foundation for Statistical Computing, Vienna, Austria).

ID Question
Q1 Presence of artifacts
Q2 Signal to noise ratio
Q3 Image sharpness/resolution
Q4 Overall image quality
Q5 Diagnostic image quality vs standard
Q6 Lesion detectability vs standard
Q7 Which (unknown) data set do you prefer
Q8 Was MRI helpful for finding lesion on PET

Results

Questionnaire responses

Q1 - Q6

Responses to Q1–Q6 are graded on 5-point Likert scales, with 1 the worst and 5 the best. The following bar charts show the mean scores (with ±SD error bars) comparing SLD vs DN by reader and reader-average. For all questions, reader-average scores are significantly higher for DN than for SLD.

The reader-average scores are summarized numerically in Table 1.

Table 1. Mean (SD) of reader-average scores for Q1–Q6.
ID SLD DN P
Q1 3.8 (0.5) 4.3 (0.4) <0.001
Q2 3.2 (0.5) 4.0 (0.4) <0.001
Q3 3.7 (0.4) 4.0 (0.4) <0.001
Q4 3.4 (0.5) 4.0 (0.3) <0.001
Q5 1.7 (0.5) 2.5 (0.4) <0.001
Q6 2.7 (0.3) 2.9 (0.3) 0.002
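
As an illustration of the paired comparison behind the P-values in Table 1, the sketch below implements a Wilcoxon signed-rank test with a normal approximation in Python. It is not the R code used for this report: the function name `wilcoxon_signed_rank` and the example scores are hypothetical, and an exact or continuity-corrected test can give somewhat different P-values.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Zero differences are dropped, tied absolute differences receive
    average ranks, and no continuity correction is applied.
    Returns (W+, z, p).
    """
    # Round to avoid spurious ties/non-ties from floating-point noise.
    diffs = [round(b - a, 6) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Average ranks of |d|.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    # Under H0 each rank enters W+ with probability 1/2, so
    # Var(W+) = sum(rank^2) / 4; this handles ties automatically.
    var = sum(r * r for r in ranks) / 4
    z = (w_plus - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return w_plus, z, p

# Hypothetical reader-average scores for eight patients (not study data).
sld = [3.0, 3.3, 3.7, 3.0, 2.7, 3.3, 3.7, 3.0]
dn = [4.0, 4.0, 4.3, 3.7, 4.0, 4.3, 4.0, 3.7]
w_plus, z, p = wilcoxon_signed_rank(sld, dn)
print(f"W+ = {w_plus}, z = {z:.2f}, p = {p:.4f}")
```

With only a handful of patients the normal approximation is rough, so an exact test would usually be preferred in that setting.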

The between-reader ICC and 95% confidence interval (CI) for each question are tabulated below. They suggest generally low agreement among the three readers, except for Q2. (This could also be due to low between-patient variation.)

ID ICC (95% CI)
Q1 -0.032 (-0.173, 0.153)
Q2 0.471 (0.295, 0.636)
Q3 -0.026 (-0.168, 0.160)
Q4 0.195 (0.020, 0.393)
Q5 0.310 (0.129, 0.500)
Q6 0.005 (-0.143, 0.195)
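
To make the ICC concrete, the following Python sketch computes one of the Shrout and Fleiss (1979) forms, ICC(2,1) (two-way random effects, absolute agreement, single rater), from the two-way ANOVA mean squares. Which ICC form was used for the table above is not stated, so the choice of ICC(2,1) is an assumption, as is the function name `icc_2_1`. The ratings are the six-subject, four-judge worked example commonly reproduced from Shrout and Fleiss, not data from this study.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a subjects-by-raters table (list of equal-length rows)."""
    n = len(ratings)        # subjects
    k = len(ratings[0])     # raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    subj_means = [sum(row) / k for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_rater = n * sum((m - grand) ** 2 for m in rater_means)
    ss_err = ss_total - ss_subj - ss_rater

    bms = ss_subj / (n - 1)               # between-subject mean square
    jms = ss_rater / (k - 1)              # between-rater mean square
    ems = ss_err / ((n - 1) * (k - 1))    # residual mean square

    # When BMS < EMS the estimate is negative, as for some questions above.
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)

# Worked example data (6 subjects x 4 judges), not study data.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(f"ICC(2,1) = {icc_2_1(example):.2f}")
```

Because the numerator is BMS minus EMS, the ICC shrinks toward zero whenever subjects differ little from one another, which is consistent with the remark about low between-patient variation.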

Q7

Responses to Q7 about the preference of dataset are summarized by reader and overall in Table 2. Readers overwhelmingly prefer DN to SLD.

Table 2. Summary responses (N (%)) to preference of dataset (Q7).
Response AF QK SP Overall
SLD 0 (0%) 2 (8.7%) 1 (4.3%) 3 (4.3%)
DN 16 (69.6%) 17 (73.9%) 20 (87%) 53 (76.8%)
Equivalent 2 (8.7%) 3 (13%) 2 (8.7%) 7 (10.1%)
Neither 5 (21.7%) 1 (4.3%) 0 (0%) 6 (8.7%)

The win ratio (Mao, Kim, and Miao 2021) for preference of DN over SLD, treating Equivalent/Neither as ties, is 17.67 (95% CI, 5.52 – 56.53).
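
The point estimate follows directly from the counts in Table 2: 53 responses prefer DN and 3 prefer SLD, so the win ratio is 53/3. The Python sketch below reproduces this arithmetic and adds a simple log-scale Wald interval. The interval formula is an assumption (the report does not state how the CI was obtained) and it treats the 56 non-tied responses as independent, ignoring that each patient was rated by three readers; under that assumption it gives values close to those reported. The function name `win_ratio_ci` is hypothetical.

```python
import math
from statistics import NormalDist

def win_ratio_ci(wins, losses, level=0.95):
    """Win ratio with a Wald CI on the log scale (ties excluded).

    Assumes independent comparisons, so SE(log WR) = sqrt(1/wins + 1/losses).
    """
    wr = wins / losses
    se_log = math.sqrt(1 / wins + 1 / losses)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(math.log(wr) - z * se_log)
    upper = math.exp(math.log(wr) + z * se_log)
    return wr, lower, upper

# Counts pooled over the three readers in Table 2.
wins_dn = 16 + 17 + 20    # responses preferring DN
wins_sld = 0 + 2 + 1      # responses preferring SLD

wr, lower, upper = win_ratio_ci(wins_dn, wins_sld)
print(f"win ratio = {wr:.2f} (95% CI, {lower:.2f} to {upper:.2f})")
```

With only three responses favoring SLD, the interval is wide and driven almost entirely by the 1/3 term in the standard error.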

The percentages of chosen options are plotted below.

The between-reader Fleiss’ kappa is 0.034. (Again, this is likely because patient-to-patient variation is low owing to the predominance of “DN”.) The percent of consensus among the three readers is 11/23 = 47.8%. The percent of two readers agreeing and the third differing by 1 point (e.g., (2, 2, 3)) is 18/23 = 78.3%.
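
A near-zero kappa alongside frequent consensus can look contradictory, so the Python sketch below shows how Fleiss’ kappa (Fleiss 1971) behaves when one category dominates. The per-patient responses are hypothetical (the report gives only marginal counts), and the function name `fleiss_kappa` is an assumption, not the R routine used here.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for rows of category labels (same raters per subject).

    Undefined (division by zero) if every rating falls in one category.
    """
    n_subj = len(ratings)
    m = len(ratings[0])                       # raters per subject
    counts = [Counter(row) for row in ratings]
    categories = sorted({c for row in ratings for c in row})

    # Observed agreement: proportion of agreeing rater pairs per subject.
    p_i = [(sum(c[cat] ** 2 for cat in categories) - m) / (m * (m - 1))
           for c in counts]
    p_bar = sum(p_i) / n_subj

    # Chance agreement from the pooled category proportions.
    p_j = [sum(c[cat] for c in counts) / (n_subj * m) for cat in categories]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical data: 7 patients with full consensus on "DN",
# 3 patients where one reader chose "SLD" instead.
toy = [["DN", "DN", "DN"]] * 7 + [["DN", "DN", "SLD"]] * 3
consensus = sum(len(set(row)) == 1 for row in toy) / len(toy)
print(f"consensus = {consensus:.0%}, kappa = {fleiss_kappa(toy):.3f}")
```

In this toy case the readers agree fully on 70% of patients, yet kappa is slightly negative because the chance-expected agreement is already very high when nearly all responses are “DN”.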

Q8

Responses to Q8 about whether MRI was helpful for locating/detecting the lesion on PET are summarized by reader and overall in Table 3.

Table 3. Summary responses (N (%)) to question about helpfulness of MRI on PET (Q8).
Response AF QK SP Overall
No 11 (47.8%) 0 (0%) 0 (0%) 11 (15.9%)
Somewhat 5 (21.7%) 0 (0%) 16 (69.6%) 21 (30.4%)
Very 7 (30.4%) 23 (100%) 7 (30.4%) 37 (53.6%)

There are marked differences between the readers: QK chooses “very helpful” for all subjects; SP divides choices between “very” and “somewhat helpful”; AF considers MRI not helpful in nearly half of the cases. The data are visualized below.

SUV

Bland-Altman analysis

Bland-Altman plots of SUVs for each pair of modalities are shown in Figure 1 below.

Figure 1: Bland-Altman analysis of SUV values by modality.
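
For reference, the quantities drawn in a Bland-Altman plot are the mean of the paired differences (the bias) and the 95% limits of agreement, bias ± 1.96 SD of the differences, plotted against the pairwise means. A minimal Python sketch follows; the SUV values and the function name `bland_altman` are hypothetical and are not taken from the study data.

```python
import statistics

def bland_altman(a, b):
    """Return bias and 95% limits of agreement for paired measurements."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)          # sample SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical lesion SUVmax values for six patients (not study data).
suv_dn = [8.1, 3.4, 12.9, 5.6, 7.2, 4.0]
suv_standard = [7.0, 3.1, 11.2, 5.1, 6.5, 3.8]

bias, loa_low, loa_high = bland_altman(suv_dn, suv_standard)
# x-coordinates of the plot: the mean of each pair.
pair_means = [(x + y) / 2 for x, y in zip(suv_dn, suv_standard)]
print(f"bias = {bias:.2f}, limits of agreement = ({loa_low:.2f}, {loa_high:.2f})")
```

A bias far from zero indicates a systematic offset between modalities, while wide limits indicate poor patient-level agreement even if the bias is small.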

Repeated measures ANOVA

Boxplots of SUVs by tissue type and modality are shown in Figure 2.

Figure 2: Boxplots of SUVs by tissue type and modality.

Repeated measures ANOVA is used to test the between-modality differences, followed by pairwise paired t-tests with Bonferroni correction. Table 4 summarizes the results.

Table 4. Mean (SD) of SUVs by tissue type and modality with overall and pairwise tests.
Tissue  Variable  SLD           DN           Standard     Overall  SLD v Standard  DN v Standard  SLD v DN
Lesion  SUVmean   4.08 (2.10)   4.06 (2.14)  3.87 (2.08)  P<0.001  P<0.001         P<0.001        P=1
Lesion  SUVmax    10.52 (8.90)  8.94 (8.15)  7.58 (6.31)  P<0.001  P<0.001         P=0.012        P<0.001
CL-FGT  SUVmean   1.34 (0.40)   1.36 (0.40)  1.33 (0.40)  P=0.168  P=1             P=0.351        P=0.119
CL-FGT  SUVmax    2.04 (0.52)   1.72 (0.44)  1.56 (0.44)  P<0.001  P<0.001         P<0.001        P<0.001
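
The two steps behind the table above can be sketched in Python as follows: a one-way repeated measures ANOVA F statistic (patients as blocks, modality as the within-patient factor) and a Bonferroni adjustment of the three pairwise P-values. The function names and the toy data are hypothetical, and the sketch stops at the F statistic because the Python standard library has no F distribution. The adjustment caps values at 1, which is why entries such as P=1 appear in the pairwise columns.

```python
def rm_anova_f(data):
    """One-way repeated measures ANOVA.

    data: one row per patient, one column per modality.
    Returns (F, df_modality, df_error).
    """
    n = len(data)          # patients
    k = len(data[0])       # modalities
    grand = sum(sum(row) for row in data) / (n * k)
    patient_means = [sum(row) / k for row in data]
    modality_means = [sum(row[j] for row in data) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_modality = n * sum((m - grand) ** 2 for m in modality_means)
    ss_patient = k * sum((m - grand) ** 2 for m in patient_means)
    # Removing between-patient variation is what makes the test "repeated".
    ss_error = ss_total - ss_modality - ss_patient

    df_modality = k - 1
    df_error = (n - 1) * (k - 1)
    f_stat = (ss_modality / df_modality) / (ss_error / df_error)
    return f_stat, df_modality, df_error

def bonferroni(p_values):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Toy SUVs for three patients under (SLD, DN, Standard); not study data.
toy_suv = [[4.0, 2.0, 1.0], [5.0, 3.0, 2.0], [7.0, 4.0, 3.0]]
f_stat, df1, df2 = rm_anova_f(toy_suv)
print(f"F({df1}, {df2}) = {f_stat:.1f}")

# Hypothetical unadjusted pairwise P-values.
print(bonferroni([0.5, 0.01, 0.04]))
```

The F statistic would then be referred to an F distribution with the returned degrees of freedom, assuming sphericity holds.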

References

Fleiss, Joseph L. 1971. “Measuring Nominal Scale Agreement Among Many Raters.” Psychological Bulletin 76 (5): 378–82. https://doi.org/10.1037/h0031619.
Mao, Lu, KyungMann Kim, and Xinran Miao. 2021. “Sample Size Formula for General Win Ratio Analysis.” Biometrics 78 (3): 1257–68. https://doi.org/10.1111/biom.13501.
Shrout, Patrick E., and Joseph L. Fleiss. 1979. “Intraclass Correlations: Uses in Assessing Rater Reliability.” Psychological Bulletin 86 (2): 420–28. https://doi.org/10.1037/0033-2909.86.2.420.