DWI QC: inter-rater agreement review

Comparison between two raters on the basis of eddy QUAD QC reports

This analysis reviews inter-rater agreement of DWI QC ratings of the POND dataset, based exclusively on eddy QUAD QC reports. Participants were selected non-randomly to include those for whom a prior QC comparison produced disagreement (QC based on volume-to-volume visual review, completed by Hajer, and on tensor residual plots, completed by Nat). The eddy QUAD ratings summarized here were provided by John and Navona.

This plot displays ratings (y axis) and absolute difference in rating (colour). Here, 1=FAIL, 2=INDETERMINATE, 3=PASS.

In total, the two raters disagreed in 4 of 30 instances. John was stricter (gave a lower rating) than Navona in 3 instances (participants 1050065, 880043, 1050402), and Navona was stricter than John in the remaining instance (participant 880138). In no case did one rater indicate PASS and the other FAIL (i.e., an absolute difference of 2).

We can calculate inter-rater agreement, that is, the number of instances of agreement divided by the number of participants rated:

##  Percentage agreement (Tolerance=0)
## 
##  Subjects = 30 
##    Raters = 2 
##   %-agree = 86.7

However, this calculation does not account for agreement expected by chance. For this, we calculate Cohen’s kappa. Below, we see that agreement remains good after accounting for chance:

##  Cohen's Kappa for 2 Raters (Weights: unweighted)
## 
##  Subjects = 30 
##    Raters = 2 
##     Kappa = 0.756 
## 
##         z = 5.6 
##   p-value = 2.2e-08
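
As a cross-check, the percentage agreement and unweighted kappa can be recomputed by hand. The sketch below is an independent Python recomputation, not the code that produced the output above. It assumes the John and Navona ratings are those listed in the summary table at the end of this report, and the names `JOHN`, `NAVONA`, and `cohen_kappa` are illustrative.

```python
from collections import Counter

# Ratings (1=FAIL, 2=INDETERMINATE, 3=PASS), transcribed from the summary
# table at the end of this report, one digit per participant in table order.
JOHN = [int(c) for c in "2231331123" "2111112311" "1111111113"]
NAVONA = [int(c) for c in "3221331123" "2111113311" "1111111213"]


def cohen_kappa(a, b):
    """Return (observed agreement, unweighted Cohen's kappa) for two raters."""
    n = len(a)
    # Observed agreement: share of participants given identical ratings.
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: for each category, the product of the two raters'
    # marginal proportions, summed over categories.
    count_a, count_b = Counter(a), Counter(b)
    p_chance = sum(count_a[k] * count_b[k] for k in count_a) / n**2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa


p_observed, kappa = cohen_kappa(JOHN, NAVONA)
print(f"%-agree = {100 * p_observed:.1f}")  # 86.7
print(f"Kappa   = {kappa:.3f}")  # 0.756
```

Here the observed agreement is 26/30 and the chance agreement is 409/900, which gives the kappa of 0.756 reported above.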

We should also compare a weighted kappa, which takes into account that our ratings are ordinal: FAIL is closer to INDETERMINATE than it is to PASS.

##  Cohen's Kappa for 2 Raters (Weights: equal)
## 
##  Subjects = 30 
##    Raters = 2 
##     Kappa = 0.836 
## 
##         z = 5.38 
##   p-value = 7.45e-08
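
The weighted value can be reproduced the same way. The sketch below is an independent Python recomputation that assumes "Weights: equal" means linear weights (full credit for identical ratings, half credit for ratings one step apart, none for PASS versus FAIL). The ratings are transcribed from the summary table at the end of this report, and the function name is illustrative.

```python
# Ratings (1=FAIL, 2=INDETERMINATE, 3=PASS), one digit per participant,
# in the row order of the summary table at the end of this report.
JOHN = [int(c) for c in "2231331123" "2111112311" "1111111113"]
NAVONA = [int(c) for c in "3221331123" "2111113311" "1111111213"]
CATEGORIES = (1, 2, 3)


def linear_weighted_kappa(a, b, categories):
    """Cohen's kappa with linear (equally spaced) agreement weights."""
    n = len(a)
    span = max(categories) - min(categories)

    def weight(x, y):
        # 1 for identical ratings, falling linearly to 0 at the extremes.
        return 1 - abs(x - y) / span

    # Weighted observed agreement across participants.
    p_observed = sum(weight(x, y) for x, y in zip(a, b)) / n
    # Weighted agreement expected from the raters' marginal counts alone.
    p_chance = sum(
        weight(i, j) * a.count(i) * b.count(j)
        for i in categories
        for j in categories
    ) / n**2
    return (p_observed - p_chance) / (1 - p_chance)


weighted_kappa = linear_weighted_kappa(JOHN, NAVONA, CATEGORIES)
print(f"Kappa = {weighted_kappa:.3f}")  # 0.836
```

Because all four disagreements are one step apart, each earns half credit, which is why the weighted kappa is higher than the unweighted one.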

Comparison of four raters: eddy QUAD (2), visual volume (1), residual plots (1)

This analysis reviews inter-rater agreement of DWI QC ratings of the POND dataset, on the basis of eddy QUAD QC reports (above) as well as volume-to-volume visual review, completed by Hajer, and tensor residual plots, completed by Nat.

As above, this plot displays ratings (y axis) and absolute difference in rating (colour), and 1=FAIL, 2=INDETERMINATE, 3=PASS.

For visual review, the 30 participants are listed below in order of absolute difference in ratings, from smallest (high agreement) to largest (low agreement):

##  [1]  880418  880464  880533  880601 1050015 1050019 1050090 1050349
##  [9] 1050378 1050431  880138 1050065  880107  880624  880703 1050032
## [17] 1050054 1050179 1050429  880397  880473  880558 1050105 1050235
## [25] 1050250 1050253 1050353  880043 1050402 1050131

In total, all four raters agree perfectly in only 10 of 30 instances. Ratings based on eddy QC (John, average=1.57; Navona, average=1.63) tend to be lower (i.e., more conservative, more likely to rate FAIL) than those based on visual review of volumes (Hajer, average=2.1) or review of the residual plots (Nat, average=2.23).

We can calculate inter-rater agreement for our 4 raters using Fleiss’s Kappa:

##  Fleiss' Kappa for m Raters
## 
##  Subjects = 30 
##    Raters = 4 
##     Kappa = 0.335 
## 
##         z = 6.03 
##   p-value = 1.59e-09
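
For reference, Fleiss' kappa treats the ratings as nominal categories and compares the share of agreeing rater pairs per participant against the share expected from the pooled category frequencies. The sketch below is an independent Python recomputation, with ratings transcribed from the summary table at the end of this report. The names `RATINGS` and `fleiss_kappa` are illustrative.

```python
# Ratings (1=FAIL, 2=INDETERMINATE, 3=PASS), one digit per participant,
# in the row order of the summary table at the end of this report.
RATINGS = {
    "john": "2231331123" "2111112311" "1111111113",
    "navona": "3221331123" "2111113311" "1111111213",
    "hajer": "3331333113" "3211223313" "2131111323",
    "nat": "1333331133" "3211223333" "2313131123",
}
CATEGORIES = "123"


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of raters per participant."""
    raters = list(ratings.values())
    n_raters = len(raters)
    subjects = list(zip(*raters))  # one tuple of ratings per participant
    n_subjects = len(subjects)
    # counts[i][j]: number of raters who gave participant i category j.
    counts = [[row.count(c) for c in categories] for row in subjects]
    # Mean share of rater pairs that agree within a participant.
    pair_total = n_raters * (n_raters - 1)
    p_observed = sum(
        sum(k * (k - 1) for k in row) / pair_total for row in counts
    ) / n_subjects
    # Chance agreement from the pooled share of each category.
    total = n_subjects * n_raters
    shares = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_chance = sum(s * s for s in shares)
    return (p_observed - p_chance) / (1 - p_chance)


kappa = fleiss_kappa(RATINGS, CATEGORIES)
print(f"Kappa = {kappa:.3f}")  # 0.335
```

Because this statistic ignores the ordering of FAIL, INDETERMINATE, and PASS, a one-step disagreement is penalized as heavily as a two-step one.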

Compare this to Krippendorff’s alpha for ordinal ratings of two or more raters:

##  Krippendorff's alpha
## 
##  Subjects = 30 
##    Raters = 4 
##     alpha = 0.444

We can also calculate intra-class correlation (ICC):

## Call: ICC(x = df.group[, c("john", "navona", "hajer", "nat")])
## 
## Intraclass correlation coefficients 
##                          type  ICC   F df1 df2       p lower bound upper bound
## Single_raters_absolute   ICC1 0.46 4.3  29  90 4.5e-08        0.27        0.65
## Single_random_raters     ICC2 0.47 5.5  29  87 2.6e-10        0.28        0.66
## Single_fixed_raters      ICC3 0.53 5.5  29  87 2.6e-10        0.35        0.71
## Average_raters_absolute ICC1k 0.77 4.3  29  90 4.5e-08        0.60        0.88
## Average_random_raters   ICC2k 0.78 5.5  29  87 2.6e-10        0.60        0.89
## Average_fixed_raters    ICC3k 0.82 5.5  29  87 2.6e-10        0.68        0.91
## 
##  Number of subjects = 30     Number of Judges =  4
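
The single-rater coefficients can be checked against the classical ANOVA mean-square formulas. The sketch below is an independent Python recomputation, assuming complete data and the standard one-way and two-way definitions of ICC1, ICC2, and ICC3. It is not necessarily the estimation route taken by the `ICC` call above. Ratings are transcribed from the summary table at the end of this report, and the function name is illustrative.

```python
# Ratings (1=FAIL, 2=INDETERMINATE, 3=PASS), one digit per participant,
# in the row order of the summary table at the end of this report.
RATINGS = {
    "john": "2231331123" "2111112311" "1111111113",
    "navona": "3221331123" "2111113311" "1111111213",
    "hajer": "3331333113" "3211223313" "2131111323",
    "nat": "1333331133" "3211223333" "2313131123",
}


def single_rater_icc(ratings):
    """Single-rater ICC1, ICC2, ICC3 from ANOVA mean squares."""
    columns = [[int(c) for c in digits] for digits in ratings.values()]
    k = len(columns)  # number of raters
    n = len(columns[0])  # number of participants
    rows = list(zip(*columns))
    grand_mean = sum(map(sum, columns)) / (n * k)

    # Sums of squares: total, between participants, between raters.
    ss_total = sum((x - grand_mean) ** 2 for col in columns for x in col)
    ss_subjects = k * sum((sum(r) / k - grand_mean) ** 2 for r in rows)
    ss_raters = n * sum((sum(c) / n - grand_mean) ** 2 for c in columns)

    ms_subjects = ss_subjects / (n - 1)
    ms_raters = ss_raters / (k - 1)
    ms_error = (ss_total - ss_subjects - ss_raters) / ((n - 1) * (k - 1))
    ms_within = (ss_total - ss_subjects) / (n * (k - 1))

    return {
        "ICC1": (ms_subjects - ms_within) / (ms_subjects + (k - 1) * ms_within),
        "ICC2": (ms_subjects - ms_error)
        / (ms_subjects + (k - 1) * ms_error + k * (ms_raters - ms_error) / n),
        "ICC3": (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error),
    }


icc = single_rater_icc(RATINGS)
for name, value in icc.items():
    print(f"{name} = {value:.2f}")  # ICC1 = 0.46, ICC2 = 0.47, ICC3 = 0.53
```

ICC3 is the highest of the three because it removes systematic differences between raters (here, the lower average ratings from the eddy QC raters) from the error term.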

Exploratory cut-off on basis of extracted eddy QC metrics

Below is a table that visually summarizes all raters’ ratings for the 30 participants, alongside 5 quantitative metrics extracted from the eddy QUAD reports: (1) percent outliers, (2) average signal-to-noise ratio, (3) average contrast-to-noise ratio, (4) average absolute motion, and (5) average relative motion. We have also included a count of problematic metrics and a summary score that estimates a PASS, CAUTION, or FAIL value.

The thresholds for these variables were set as follows:

Metric	Threshold for FAIL
Percent_Outliers	> 0.2
Average_SNR	< 20
Average_CNR	< 1.4
Abs_Motion	> 1 mm
Rel_Motion	> 0.4 mm
Multiple_Issues	simple count
Weighted_Score	<= 2 PASS | 3 CAUTION | > 3 FAIL

The thresholds for the 5 eddy-extracted metrics were set by John based on review of this sample. They can be adjusted after review of a larger subset or following discussion.
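
To make the Multiple_Issues column concrete, the sketch below applies the five FAIL thresholds to a handful of participants copied from the summary table and counts how many metrics cross their cut-off. It is a hypothetical Python illustration, not the code used for this report. The weighting behind Weighted_Score is not described here, so only the simple count is reproduced.

```python
import operator

# (metric, comparison, cut-off) copied from the threshold table above.
FAIL_RULES = [
    ("Percent_Outliers", operator.gt, 0.2),
    ("Average_SNR", operator.lt, 20),
    ("Average_CNR", operator.lt, 1.4),
    ("Abs_Motion", operator.gt, 1.0),  # mm
    ("Rel_Motion", operator.gt, 0.4),  # mm
]

# A few participants from the summary table, with metric values in the
# same order as FAIL_RULES.
SAMPLE = {
    "880043": (0.02, 20.11, 1.37, 0.55, 0.21),
    "880107": (1.35, 23.98, 1.25, 0.86, 0.23),
    "880601": (0.56, 21.16, 1.26, 1.46, 0.46),
    "1050090": (0.2, 26.08, 1.59, 0.41, 0.22),
    "1050131": (1.62, 23, 1.26, 0.65, 0.46),
}


def count_issues(values):
    """Count how many metrics cross their FAIL cut-off."""
    return sum(
        compare(value, cutoff)
        for value, (_, compare, cutoff) in zip(values, FAIL_RULES)
    )


issues = {pid: count_issues(values) for pid, values in SAMPLE.items()}
for pid, n_issues in issues.items():
    print(pid, n_issues)
```

For these five participants the counts are 1, 2, 4, 0, and 3, matching the Multiple_Issues column. The comparisons are strict, so participant 1050090, whose percent outliers equal 0.2 exactly, is not flagged on that metric.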

The purpose of this visualization is to discuss which of these metrics could/should be used to inform our QC ratings, and to see how each aligns with our 4 raters’ ratings, which were done independently of their review.

Participant	John	Navona	Hajer	Nat	Percent_Outliers	Average_SNR	Average_CNR	Abs_Motion	Rel_Motion	Multiple_Issues	Weighted_Score
880043	2	3	3	1	0.02	20.11	1.37	0.55	0.21	1	Pass
880107	2	2	3	3	1.35	23.98	1.25	0.86	0.23	2	Caution
880138	3	2	3	3	0.1	26	1.48	1.23	0.24	1	Pass
880397	1	1	1	3	2.21	12.9	0.89	1.77	0.66	5	Fail
880418	3	3	3	3	0.02	22.05	1.49	0.68	0.24	0	Pass
880464	3	3	3	3	0.14	24.98	1.57	1.01	0.37	1	Pass
880473	1	1	3	1	1.98	9.76	0.45	0.82	0.42	4	Fail
880533	1	1	1	1	1.31	5.14	0.25	1.07	0.64	5	Fail
880558	2	2	1	3	0.02	18.72	1.02	0.96	0.39	2	Caution
880601	3	3	3	3	0.56	21.16	1.26	1.46	0.46	4	Fail
880624	2	2	3	3	0.27	16.25	1.07	2.71	0.62	5	Fail
880703	1	1	2	2	2.94	11.68	0.96	2.17	0.47	5	Fail
1050015	1	1	1	1	2.31	10.73	1.04	1.13	0.67	5	Fail
1050019	1	1	1	1	2.89	9.44	0.65	2.39	1.03	5	Fail
1050032	1	1	2	2	1.69	7.8	0.76	1.43	0.65	5	Fail
1050054	1	1	2	2	1.98	9.49	1.03	2.31	0.94	5	Fail
1050065	2	3	3	3	0.23	20.22	1.25	0.62	0.32	2	Caution
1050090	3	3	3	3	0.2	26.08	1.59	0.41	0.22	0	Pass
1050105	1	1	1	3	1.3	14.3	0.79	1.21	0.92	5	Fail
1050131	1	1	3	3	1.62	23	1.26	0.65	0.46	3	Fail
1050179	1	1	2	2	4.27	12.05	1.04	1.02	0.52	5	Fail
1050235	1	1	1	3	0.92	15.02	1.08	0.73	0.41	4	Fail
1050250	1	1	3	1	2.17	15.54	1.05	1.28	0.55	5	Fail
1050253	1	1	1	3	5.5	13.26	0.85	1.29	0.64	5	Fail
1050349	1	1	1	1	2.58	8.24	0.53	1.47	0.86	5	Fail
1050353	1	1	1	3	5.7	12.78	0.65	1.97	1.07	5	Fail
1050378	1	1	1	1	6.06	10.59	0.84	2.86	0.7	5	Fail
1050402	1	2	3	1	0.6	16.08	1.02	0.89	0.41	4	Fail
1050429	1	1	2	2	3.35	16.73	1.09	1.3	0.58	5	Fail
1050431	3	3	3	3	0.1	23.74	1.47	1.55	0.23	1	Pass