Evaluation of simulated low dose 18F-FDG breast PET/MRI with deep learning methods

Author

Lu Mao

Statistical analysis

Ordinal responses (Q1–Q6; see below) were summarized by mean and standard deviation (SD). Reader-averaged scores were compared between simulated low dose (SLD) and SLD with deep-learning-based denoising (DN) using the Wilcoxon signed rank test. Between-reader agreement was assessed by the intra-class correlation coefficient (ICC) (Shrout and Fleiss 1979) for ordinal responses and by Fleiss’ kappa (Fleiss 1971) for nominal responses (e.g., Q7). SUVs (mean and max) for lesion and contralateral fibroglandular tissue (CL-FGT) were compared across modalities using repeated measures ANOVA. Between-modality agreement was assessed by Bland-Altman analysis and ICC. P-values < 0.05 were considered statistically significant. All analyses were performed in R version 4.3.2 (R Foundation for Statistical Computing, Vienna, Austria).

ID	Question
Q1	Presence of artifacts
Q2	Signal to noise ratio
Q3	Image sharpness/resolution
Q4	Overall image quality
Q5	Diagnostic image quality vs standard
Q6	Lesion detectability vs standard
Q7	Which (unknown) data set do you prefer
Q8	Was MRI helpful for finding lesion on PET

Results

Questionnaire responses

Q1–Q6

Responses to Q1–Q6 are graded on 5-point Likert scales, with 1 the worst and 5 the best. The following bar charts show the mean scores (with ± SD error bars) for SLD vs DN, by reader and reader-averaged. For all six questions, reader-average scores are significantly higher for DN than for SLD.

In particular, the reader-average scores are summarized numerically in Table 1.

Table 1. Mean (SD) of reader-average scores for Q1–Q6.
ID	SLD	DN	P
Q1	3.8 (0.5)	4.3 (0.4)	<0.001
Q2	3.2 (0.5)	4.0 (0.4)	<0.001
Q3	3.7 (0.4)	4.0 (0.4)	<0.001
Q4	3.4 (0.5)	4.0 (0.3)	<0.001
Q5	1.7 (0.5)	2.5 (0.4)	<0.001
Q6	2.7 (0.3)	2.9 (0.3)	0.002
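
The P-values in Table 1 come from the Wilcoxon signed rank test on paired reader-average scores. As a rough illustration of the mechanics only (the report's analyses were run in R, and this is not that code), the Python sketch below computes the signed-rank statistic with a two-sided normal-approximation P-value. The function name and the scores in the usage lines are hypothetical, and no continuity or tie correction is applied to the variance.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed rank test (normal approximation).

    Zero differences are dropped; tied absolute differences share
    their average rank. Returns (W_plus, z, p_value).
    """
    # Round to guard against floating-point noise when detecting ties.
    diffs = [round(b - a, 6) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, z, p_value

# Hypothetical paired scores (not the study data).
sld_scores = [3, 3, 4, 2, 3]
dn_scores = [4, 5, 4, 5, 2]
w_plus, z, p_value = wilcoxon_signed_rank(sld_scores, dn_scores)
print(w_plus, round(z, 3), round(p_value, 3))
```

With samples as small as the hypothetical one here, an exact test would be preferable to the normal approximation.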

The between-reader ICC and 95% confidence interval (CI) for each question are tabulated below. They suggest generally low agreement between the three readers, except for Q2. (This could also be due to low between-patient variation.)

ID	ICC (95% CI)
Q1	-0.032 (-0.173, 0.153)
Q2	0.471 (0.295, 0.636)
Q3	-0.026 (-0.168, 0.160)
Q4	0.195 (0.020, 0.393)
Q5	0.310 (0.129, 0.500)
Q6	0.005 (-0.143, 0.195)
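
To show how an ICC near or below zero can arise when patients differ little from one another, here is a minimal Python sketch of one Shrout and Fleiss (1979) form, ICC(2,1) (two-way random effects, single rater, absolute agreement). Which ICC form the report used is not stated, so the choice of ICC(2,1) is an assumption, and both ratings matrices are illustrative rather than study data.

```python
def icc_2_1(ratings):
    """ICC(2,1): rows are subjects, columns are raters."""
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols
    bms = ss_rows / (n - 1)               # between-subject mean square
    jms = ss_cols / (k - 1)               # between-rater mean square
    ems = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)

# Illustrative data with clear between-subject differences.
varied = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
# Illustrative data where every subject has the same mean rating.
homogeneous = [
    [4, 5, 4],
    [5, 4, 4],
    [4, 4, 5],
    [4, 5, 4],
]
print(round(icc_2_1(varied), 3))
print(round(icc_2_1(homogeneous), 3))
```

In the second matrix the between-subject mean square is zero, so the estimate is negative even though the raters never differ by more than one point. This is the situation the parenthetical remark above describes.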

Q7

Responses to Q7 on dataset preference are summarized by reader and overall in Table 2. Readers overwhelmingly prefer DN to SLD.

Table 2. Summary responses (N (%)) to preference of dataset (Q7).
Response	AF	QK	SP	Overall
SLD	0 (0%)	2 (8.7%)	1 (4.3%)	3 (4.3%)
DN	16 (69.6%)	17 (73.9%)	20 (87%)	53 (76.8%)
Equivalent	2 (8.7%)	3 (13%)	2 (8.7%)	7 (10.1%)
Neither	5 (21.7%)	1 (4.3%)	0 (0%)	6 (8.7%)

The win ratio (Mao, Kim, and Miao 2021) for preference of DN over SLD, treating Equivalent/Neither as ties, is 17.67 (95% CI, 5.52–56.53).
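
The point estimate can be checked directly from the pooled counts in Table 2: wins are the responses preferring DN, losses are those preferring SLD, and ties drop out. The Python sketch below does this arithmetic and adds a simple log-scale normal interval. The interval formula is an assumption made for illustration (it treats all responses as independent and ignores that three readers rate the same patient); the report does not state how its interval was computed, although this simple version comes out close to the reported one.

```python
import math

# Counts pooled over the three readers, taken from Table 2.
wins = 16 + 17 + 20    # responses preferring DN
losses = 0 + 2 + 1     # responses preferring SLD
# "Equivalent" and "Neither" are ties and do not enter the ratio.

win_ratio = wins / losses

# Assumption: independent responses, so the variance of the log win
# ratio is approximated by 1/wins + 1/losses.
se_log = math.sqrt(1 / wins + 1 / losses)
z = 1.959964  # 97.5th percentile of the standard normal distribution
ci_lower = math.exp(math.log(win_ratio) - z * se_log)
ci_upper = math.exp(math.log(win_ratio) + z * se_log)

print(f"win ratio = {win_ratio:.2f} (95% CI, {ci_lower:.2f} to {ci_upper:.2f})")
```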

The percentages of chosen options are plotted below.

The between-reader Fleiss’ kappa is 0.034. (Again, this low value likely reflects low patient-to-patient variation due to the predominance of “DN”.) The percentage of cases with consensus among the three readers is 11/23 = 47.8%. The percentage of cases with two readers agreeing and the third differing by 1 point (e.g., (2, 2, 3)) is 18/23 = 78.3%.
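
The gap between a kappa near zero and a fairly high consensus rate can look contradictory. A minimal Python sketch of Fleiss' kappa (Fleiss 1971), applied to hypothetical counts rather than the study data, shows how the predominance of one category pushes chance-expected agreement up and kappa down. The function name and the example matrices are illustrative assumptions.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa.

    counts[i][j] is the number of raters who put subject i in
    category j; every row must sum to the same number of raters.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    total = n_subjects * n_raters
    # Overall proportion of ratings falling in each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    # Observed pairwise agreement for each subject.
    p_subj = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_obs = sum(p_subj) / n_subjects
    p_exp = sum(p * p for p in p_cat)  # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical: 10 subjects, 3 raters, categories [DN, other].
# Seven subjects are unanimous for DN; three have a 2-to-1 split.
skewed = [[3, 0]] * 7 + [[2, 1]] * 3
# Hypothetical: unanimous ratings split evenly across categories.
balanced = [[3, 0], [0, 3]]

print(round(fleiss_kappa(skewed), 3))
print(round(fleiss_kappa(balanced), 3))
```

In the skewed example 70% of subjects have unanimous ratings, yet kappa is slightly negative because 90% of all ratings fall in one category, which makes the chance-expected agreement (0.82) exceed the observed agreement (0.80).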

Q8

Responses to Q8 about whether MRI was helpful for locating/detecting the lesion on PET are summarized by reader and overall in Table 3.

Table 3. Summary responses (N (%)) to question about helpfulness of MRI on PET (Q8).
Response	AF	QK	SP	Overall
No	11 (47.8%)	0 (0%)	0 (0%)	11 (15.9%)
Somewhat	5 (21.7%)	0 (0%)	16 (69.6%)	21 (30.4%)
Very	7 (30.4%)	23 (100%)	7 (30.4%)	37 (53.6%)

There are marked differences between the readers: QK chooses “very helpful” for all subjects; SP divides choices between “very” and “somewhat helpful”; AF rates MRI as not helpful in nearly half of the cases. The data are visualized below.

SUV

Bland-Altman analysis

Bland-Altman plots of SUVs for each pair of modalities are shown in Figure 1 below.
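
For readers unfamiliar with the method, a Bland-Altman plot places the pairwise mean of two measurements on the horizontal axis and their difference on the vertical axis, with horizontal lines at the mean difference (bias) and at the limits of agreement, bias ± 1.96 SD of the differences. The Python sketch below computes these quantities. The function name and the SUV values are hypothetical and are not taken from the study.

```python
import statistics

def bland_altman(a, b):
    """Return per-pair means and differences, the bias, and the
    95% limits of agreement for paired measurements a and b."""
    means = [(x + y) / 2 for x, y in zip(a, b)]
    diffs = [x - y for x, y in zip(a, b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the differences
    return means, diffs, bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical SUVs from two modalities for four lesions.
suv_a = [4.0, 5.0, 6.0, 7.0]
suv_b = [3.5, 5.0, 5.5, 6.0]
means, diffs, bias, loa_lower, loa_upper = bland_altman(suv_a, suv_b)
print(f"bias = {bias:.2f}, limits of agreement = ({loa_lower:.2f}, {loa_upper:.2f})")
```

A bias far from zero indicates a systematic offset between modalities, while wide limits of agreement indicate large case-to-case disagreement.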

Repeated measures ANOVA

Boxplots of SUVs by tissue type and modality are shown in Figure 2.

Figure 2: Boxplots of SUVs by tissue type and modality.

Repeated measures ANOVA is used to test for between-modality differences, followed by pairwise paired t-tests with Bonferroni correction. Table 4 summarizes the results.

Table 4. Mean (SD) of SUVs by tissue type and modality, with overall and pairwise tests.
Tissue	Variable	SLD	DN	Standard	Overall	SLD v Standard	DN v Standard	SLD v DN
Lesion	SUVmean	4.08 (2.10)	4.06 (2.14)	3.87 (2.08)	P<0.001	P<0.001	P<0.001	P=1
Lesion	SUVmax	10.52 (8.90)	8.94 (8.15)	7.58 (6.31)	P<0.001	P<0.001	P=0.012	P<0.001
CL-FGT	SUVmean	1.34 (0.40)	1.36 (0.40)	1.33 (0.40)	P=0.168	P=1	P=0.351	P=0.119
CL-FGT	SUVmax	2.04 (0.52)	1.72 (0.44)	1.56 (0.44)	P<0.001	P<0.001	P<0.001	P<0.001
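
The pairwise P-values above are Bonferroni-adjusted. The adjustment multiplies each raw P-value by the number of comparisons and caps the result at 1, which is consistent with the entries reported as P=1. A minimal Python sketch follows, using hypothetical raw P-values because the unadjusted values are not given in the report.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each P-value by the number
    of comparisons and cap the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical raw P-values for three pairwise comparisons.
raw = [0.004, 0.04, 0.5]
adjusted = bonferroni(raw)
print([round(p, 3) for p in adjusted])
```

Here the third raw value would become 1.5 after multiplication, so it is reported as 1.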

References

Fleiss, Joseph L. 1971. “Measuring Nominal Scale Agreement Among Many Raters.” Psychological Bulletin 76 (5): 378–82. https://doi.org/10.1037/h0031619.

Mao, Lu, KyungMann Kim, and Xinran Miao. 2021. “Sample Size Formula for General Win Ratio Analysis.” Biometrics 78 (3): 1257–68. https://doi.org/10.1111/biom.13501.

Shrout, Patrick E., and Joseph L. Fleiss. 1979. “Intraclass Correlations: Uses in Assessing Rater Reliability.” Psychological Bulletin 86 (2): 420–28. https://doi.org/10.1037/0033-2909.86.2.420.