Last run: 2019-05-07
This analysis reviews inter-rater agreement of DWI QC ratings of the POND dataset, exclusively on the basis of eddy QUAD QC reports. The list of participants was non-randomly selected to include participants for whom a prior QC comparison generated disagreement (QC on the basis of volume-to-volume visual review, completed by Hajer, and tensor residual plots, completed by Nat). The ratings summarized here, based on eddy QUAD QC reports, were provided by John and Navona.
This plot displays ratings (y axis) and absolute difference in rating (colour). Here, 1=FAIL, 2=INDETERMINATE, 3=PASS.
In total, the two raters disagreed in 4 of 30 instances. John was stricter (gave a lower rating) than Navona in 3 instances (participants 1050065, 880043, 1050402), and Navona was stricter than John in the remaining instance (participant 880138). In no case did one rater give PASS and the other FAIL (i.e., an absolute difference of 2).
We can calculate inter-rater agreement as a percentage, that is, the number of instances of agreement divided by the total number of instances rated:
## Percentage agreement (Tolerance=0)
##
## Subjects = 30
## Raters = 2
## %-agree = 86.7
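As a cross-check, the percentage can be recomputed by hand. The following is a minimal Python sketch, not the code that produced the output above; the two lists are transcribed from the John and Navona columns of the ratings table at the end of this report (in table order), and the function name `percent_agreement` is illustrative.

```python
# Ratings transcribed from the ratings table in this report (1=FAIL, 2=INDETERMINATE, 3=PASS).
john = [int(c) for c in "2231331123" "2111112311" "1111111113"]
navona = [int(c) for c in "3221331123" "2111113311" "1111111213"]


def percent_agreement(a, b, tolerance=0):
    """Percentage of paired ratings that differ by at most `tolerance`."""
    agree = sum(abs(x - y) <= tolerance for x, y in zip(a, b))
    return 100.0 * agree / len(a)


print(round(percent_agreement(john, navona), 1))  # 86.7 (26 of 30 pairs agree)
```

With `tolerance=1` the same function returns 100.0, reflecting that no pair of ratings differs by more than one category.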
However, percentage agreement does not account for agreement expected by chance. Cohen’s kappa corrects for this. Below, we see that agreement remains good after accounting for chance agreement:
## Cohen's Kappa for 2 Raters (Weights: unweighted)
##
## Subjects = 30
## Raters = 2
## Kappa = 0.756
##
## z = 5.6
## p-value = 2.2e-08
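To make the chance correction concrete, kappa can be recomputed from the 3x3 cross-tabulation of the two raters. This is a minimal sketch under the assumption that the counts below, tabulated from the John and Navona columns of the ratings table in this report, are transcribed correctly; `cohen_kappa` is an illustrative name, not the function used to generate the output above.

```python
# Rows: John's rating (1, 2, 3); columns: Navona's rating (1, 2, 3).
# Counts tabulated from the ratings table in this report.
table = [
    [18, 1, 0],
    [0, 3, 2],
    [0, 1, 5],
]


def cohen_kappa(table):
    """Unweighted Cohen's kappa from a square table of counts."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Observed agreement: proportion of cases on the diagonal.
    p_o = sum(table[i][i] for i in range(k)) / n
    # Chance agreement: product of the raters' marginal proportions, per category.
    p_e = sum(row_tot[i] * col_tot[i] for i in range(k)) / n**2
    return (p_o - p_e) / (1 - p_e)


print(round(cohen_kappa(table), 3))  # 0.756
```

Here the observed agreement is 26/30 and the chance agreement is 409/900 (about 0.454), so roughly half of the raw agreement would be expected by chance alone.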
We should also compute a weighted kappa, which takes into account that our ratings are ordinal: FAIL is closer to INDETERMINATE than it is to PASS.
## Cohen's Kappa for 2 Raters (Weights: equal)
##
## Subjects = 30
## Raters = 2
## Kappa = 0.836
##
## z = 5.38
## p-value = 7.45e-08
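The reported value is consistent with linear weights, where a disagreement of two categories counts twice as much as a disagreement of one. A minimal sketch, assuming the cross-tabulation below (tabulated from the John and Navona columns of the ratings table in this report) and using an illustrative function name:

```python
# Rows: John's rating (1, 2, 3); columns: Navona's rating (1, 2, 3).
table = [
    [18, 1, 0],
    [0, 3, 2],
    [0, 1, 5],
]


def linear_weighted_kappa(table):
    """Cohen's kappa with linear disagreement weights |i - j| / (k - 1)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)
            observed += w * table[i][j] / n
            expected += w * row_tot[i] * col_tot[j] / n**2
    return 1 - observed / expected


print(round(linear_weighted_kappa(table), 3))  # 0.836
```

The weighted value exceeds the unweighted one because all 4 disagreements are between adjacent categories, which linear weighting penalizes only half as much as a PASS/FAIL split.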
This analysis reviews inter-rater agreement of DWI QC ratings of the POND dataset, on the basis of eddy QUAD QC reports (above) as well as volume-to-volume visual review, completed by Hajer, and tensor residual plots, completed by Nat.
As above, this plot displays ratings (y axis) and absolute difference in rating (colour), and 1=FAIL, 2=INDETERMINATE, 3=PASS.
For visual review, the 30 participants are listed below in order of absolute difference in ratings, from smallest (high agreement) to largest (low agreement):
## [1] 880418 880464 880533 880601 1050015 1050019 1050090 1050349
## [9] 1050378 1050431 880138 1050065 880107 880624 880703 1050032
## [17] 1050054 1050179 1050429 880397 880473 880558 1050105 1050235
## [25] 1050250 1050253 1050353 880043 1050402 1050131
In total, we have perfect agreement across all four raters in only 10 of 30 instances. We see a trend whereby the ratings based on eddy QC (John, average=1.57; Navona, average=1.63) are lower, i.e., more conservative and more likely to be FAIL, than those based on visual review of volumes (Hajer, average=2.1) or on review of the residual plots (Nat, average=2.23).
We can calculate inter-rater agreement for our 4 raters using Fleiss’ kappa, which treats the ratings as nominal categories:
## Fleiss' Kappa for m Raters
##
## Subjects = 30
## Raters = 4
## Kappa = 0.335
##
## z = 6.03
## p-value = 1.59e-09
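Fleiss' kappa compares the average within-participant pairwise agreement against the agreement expected from the pooled category proportions. A minimal sketch, assuming the four rating vectors below are transcribed correctly from the ratings table at the end of this report (in table order); `fleiss_kappa` is an illustrative name, not the function behind the output above.

```python
from collections import Counter

# Ratings transcribed from the ratings table in this report (1=FAIL, 2=INDETERMINATE, 3=PASS).
john = [int(c) for c in "2231331123" "2111112311" "1111111113"]
navona = [int(c) for c in "3221331123" "2111113311" "1111111213"]
hajer = [int(c) for c in "3331333113" "3211223313" "2131111323"]
nat = [int(c) for c in "1333331133" "3211223333" "2313131123"]
ratings = list(zip(john, navona, hajer, nat))  # one tuple per participant


def fleiss_kappa(rows):
    """Fleiss' kappa for complete data with the same number of raters per row."""
    n_subjects = len(rows)
    m = len(rows[0])  # raters per participant
    totals = Counter()
    p_bar = 0.0
    for row in rows:
        counts = Counter(row)
        totals.update(counts)
        # Proportion of agreeing rater pairs for this participant.
        p_bar += (sum(c * c for c in counts.values()) - m) / (m * (m - 1))
    p_bar /= n_subjects
    # Chance agreement from the pooled category proportions.
    p_e = sum((c / (n_subjects * m)) ** 2 for c in totals.values())
    return (p_bar - p_e) / (1 - p_e)


print(round(fleiss_kappa(ratings), 3))  # 0.335
```

Because this statistic ignores the ordering of the categories, a FAIL/INDETERMINATE split is penalized as heavily as a FAIL/PASS split.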
Compare this to Krippendorff’s alpha for ordinal ratings of two or more raters:
## Krippendorff's alpha
##
## Subjects = 30
## Raters = 4
## alpha = 0.444
We can also calculate intra-class correlation (ICC):
## Call: ICC(x = df.group[, c("john", "navona", "hajer", "nat")])
##
## Intraclass correlation coefficients
## type ICC F df1 df2 p lower bound
## Single_raters_absolute ICC1 0.46 4.3 29 90 4.5e-08 0.27
## Single_random_raters ICC2 0.47 5.5 29 87 2.6e-10 0.28
## Single_fixed_raters ICC3 0.53 5.5 29 87 2.6e-10 0.35
## Average_raters_absolute ICC1k 0.77 4.3 29 90 4.5e-08 0.60
## Average_random_raters ICC2k 0.78 5.5 29 87 2.6e-10 0.60
## Average_fixed_raters ICC3k 0.82 5.5 29 87 2.6e-10 0.68
## upper bound
## Single_raters_absolute 0.65
## Single_random_raters 0.66
## Single_fixed_raters 0.71
## Average_raters_absolute 0.88
## Average_random_raters 0.89
## Average_fixed_raters 0.91
##
## Number of subjects = 30 Number of Judges = 4
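The single-rater rows of this table can be cross-checked with a classical ANOVA decomposition. The sketch below is an assumption-laden illustration rather than the implementation used above: it assumes complete data, uses the standard mean-square formulas for ICC1, ICC2, and ICC3, and takes its ratings from the table at the end of this report. The function name `icc_single` is illustrative.

```python
# Ratings transcribed from the ratings table in this report (1=FAIL, 2=INDETERMINATE, 3=PASS).
john = [int(c) for c in "2231331123" "2111112311" "1111111113"]
navona = [int(c) for c in "3221331123" "2111113311" "1111111213"]
hajer = [int(c) for c in "3331333113" "3211223313" "2131111323"]
nat = [int(c) for c in "1333331133" "3211223333" "2313131123"]
ratings = list(zip(john, navona, hajer, nat))  # one tuple per participant


def icc_single(rows):
    """Single-rater ICC1, ICC2, ICC3 from one-way and two-way ANOVA mean squares."""
    n, k = len(rows), len(rows[0])
    grand = sum(sum(row) for row in rows) / (n * k)
    ss_total = sum((x - grand) ** 2 for row in rows for x in row)
    ss_subj = k * sum((sum(row) / k - grand) ** 2 for row in rows)
    ss_rater = n * sum((sum(col) / n - grand) ** 2 for col in zip(*rows))
    ms_subj = ss_subj / (n - 1)
    ms_within = (ss_total - ss_subj) / (n * (k - 1))
    ms_rater = ss_rater / (k - 1)
    ms_err = (ss_total - ss_subj - ss_rater) / ((n - 1) * (k - 1))
    icc1 = (ms_subj - ms_within) / (ms_subj + (k - 1) * ms_within)
    icc2 = (ms_subj - ms_err) / (
        ms_subj + (k - 1) * ms_err + k * (ms_rater - ms_err) / n
    )
    icc3 = (ms_subj - ms_err) / (ms_subj + (k - 1) * ms_err)
    return icc1, icc2, icc3


print([round(v, 2) for v in icc_single(ratings)])  # [0.46, 0.47, 0.53]
```

ICC3 is the largest of the three because it removes systematic differences between raters (here, the eddy-based raters scoring lower on average) from the error term.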
Below is a table that visually summarizes all raters’ ratings for the 30 participants, alongside 5 quantitative metrics extracted from various eddy QUAD reports: (1) percent outliers, (2) average signal-to-noise ratio, (3) average contrast-to-noise ratio, (4) average absolute motion, and (5) average relative motion. We have also included a summary value indicating the count of problematic metrics, and estimating a PASS, CAUTION, or FAIL value.
The thresholds for these variables were set as follows:
Metric | Threshold for FAIL |
---|---|
Percent_Outliers | > 0.2 |
Average_SNR | < 20 |
Average_CNR | < 1.4 |
Abs_Motion | > 1 mm |
Rel_Motion | > 0.4 mm |
Multiple_Issues | simple count |
Weighted_Score | <= 2 PASS; 3 CAUTION; > 3 FAIL |
The thresholds for the first 5 eddy-extracted metrics were set by John based on review of this sample. They can be adjusted on the basis of a larger subset or following discussion.
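The Multiple_Issues count can be reproduced directly from the five thresholds. A minimal sketch follows, with illustrative names (`FAIL_CHECKS`, `count_issues`) and example rows copied from the ratings table in this report. It covers only the simple count: the weighting behind Weighted_Score is not stated here, so it is deliberately left out.

```python
# One check per metric; each returns True when the value crosses the FAIL threshold.
FAIL_CHECKS = {
    "Percent_Outliers": lambda v: v > 0.2,
    "Average_SNR": lambda v: v < 20,
    "Average_CNR": lambda v: v < 1.4,
    "Abs_Motion": lambda v: v > 1.0,  # mm
    "Rel_Motion": lambda v: v > 0.4,  # mm
}


def count_issues(metrics):
    """Number of metrics that cross their FAIL threshold (the Multiple_Issues column)."""
    return sum(check(metrics[name]) for name, check in FAIL_CHECKS.items())


# Example rows copied from the ratings table in this report.
p880043 = {"Percent_Outliers": 0.02, "Average_SNR": 20.11, "Average_CNR": 1.37,
           "Abs_Motion": 0.55, "Rel_Motion": 0.21}
p880397 = {"Percent_Outliers": 2.21, "Average_SNR": 12.9, "Average_CNR": 0.89,
           "Abs_Motion": 1.77, "Rel_Motion": 0.66}
p1050090 = {"Percent_Outliers": 0.2, "Average_SNR": 26.08, "Average_CNR": 1.59,
            "Abs_Motion": 0.41, "Rel_Motion": 0.22}

print(count_issues(p880043), count_issues(p880397), count_issues(p1050090))  # 1 5 0
```

Note that the comparisons are strict, so participant 1050090, with Percent_Outliers exactly at 0.2, is not flagged on that metric.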
The purpose of this visualization is to discuss which of these metrics could or should be used to inform our QC ratings, and to see how each aligns with our 4 raters’ ratings, which were made independently of a review of these metrics.
Participant | John | Navona | Hajer | Nat | Percent_Outliers | Average_SNR | Average_CNR | Abs_Motion | Rel_Motion | Multiple_Issues | Weighted_Score |
---|---|---|---|---|---|---|---|---|---|---|---|
880043 | 2 | 3 | 3 | 1 | 0.02 | 20.11 | 1.37 | 0.55 | 0.21 | 1 | Pass |
880107 | 2 | 2 | 3 | 3 | 1.35 | 23.98 | 1.25 | 0.86 | 0.23 | 2 | Caution |
880138 | 3 | 2 | 3 | 3 | 0.1 | 26 | 1.48 | 1.23 | 0.24 | 1 | Pass |
880397 | 1 | 1 | 1 | 3 | 2.21 | 12.9 | 0.89 | 1.77 | 0.66 | 5 | Fail |
880418 | 3 | 3 | 3 | 3 | 0.02 | 22.05 | 1.49 | 0.68 | 0.24 | 0 | Pass |
880464 | 3 | 3 | 3 | 3 | 0.14 | 24.98 | 1.57 | 1.01 | 0.37 | 1 | Pass |
880473 | 1 | 1 | 3 | 1 | 1.98 | 9.76 | 0.45 | 0.82 | 0.42 | 4 | Fail |
880533 | 1 | 1 | 1 | 1 | 1.31 | 5.14 | 0.25 | 1.07 | 0.64 | 5 | Fail |
880558 | 2 | 2 | 1 | 3 | 0.02 | 18.72 | 1.02 | 0.96 | 0.39 | 2 | Caution |
880601 | 3 | 3 | 3 | 3 | 0.56 | 21.16 | 1.26 | 1.46 | 0.46 | 4 | Fail |
880624 | 2 | 2 | 3 | 3 | 0.27 | 16.25 | 1.07 | 2.71 | 0.62 | 5 | Fail |
880703 | 1 | 1 | 2 | 2 | 2.94 | 11.68 | 0.96 | 2.17 | 0.47 | 5 | Fail |
1050015 | 1 | 1 | 1 | 1 | 2.31 | 10.73 | 1.04 | 1.13 | 0.67 | 5 | Fail |
1050019 | 1 | 1 | 1 | 1 | 2.89 | 9.44 | 0.65 | 2.39 | 1.03 | 5 | Fail |
1050032 | 1 | 1 | 2 | 2 | 1.69 | 7.8 | 0.76 | 1.43 | 0.65 | 5 | Fail |
1050054 | 1 | 1 | 2 | 2 | 1.98 | 9.49 | 1.03 | 2.31 | 0.94 | 5 | Fail |
1050065 | 2 | 3 | 3 | 3 | 0.23 | 20.22 | 1.25 | 0.62 | 0.32 | 2 | Caution |
1050090 | 3 | 3 | 3 | 3 | 0.2 | 26.08 | 1.59 | 0.41 | 0.22 | 0 | Pass |
1050105 | 1 | 1 | 1 | 3 | 1.3 | 14.3 | 0.79 | 1.21 | 0.92 | 5 | Fail |
1050131 | 1 | 1 | 3 | 3 | 1.62 | 23 | 1.26 | 0.65 | 0.46 | 3 | Fail |
1050179 | 1 | 1 | 2 | 2 | 4.27 | 12.05 | 1.04 | 1.02 | 0.52 | 5 | Fail |
1050235 | 1 | 1 | 1 | 3 | 0.92 | 15.02 | 1.08 | 0.73 | 0.41 | 4 | Fail |
1050250 | 1 | 1 | 3 | 1 | 2.17 | 15.54 | 1.05 | 1.28 | 0.55 | 5 | Fail |
1050253 | 1 | 1 | 1 | 3 | 5.5 | 13.26 | 0.85 | 1.29 | 0.64 | 5 | Fail |
1050349 | 1 | 1 | 1 | 1 | 2.58 | 8.24 | 0.53 | 1.47 | 0.86 | 5 | Fail |
1050353 | 1 | 1 | 1 | 3 | 5.7 | 12.78 | 0.65 | 1.97 | 1.07 | 5 | Fail |
1050378 | 1 | 1 | 1 | 1 | 6.06 | 10.59 | 0.84 | 2.86 | 0.7 | 5 | Fail |
1050402 | 1 | 2 | 3 | 1 | 0.6 | 16.08 | 1.02 | 0.89 | 0.41 | 4 | Fail |
1050429 | 1 | 1 | 2 | 2 | 3.35 | 16.73 | 1.09 | 1.3 | 0.58 | 5 | Fail |
1050431 | 3 | 3 | 3 | 3 | 0.1 | 23.74 | 1.47 | 1.55 | 0.23 | 1 | Pass |