I think one of the hindrances to improving the analyses in papers, and statistical practice generally, is the lack of canonical examples of real experimental data with associated commentary about why certain methods were or were not used.
My work is on applying AI within clinical practice, so I thought I would see if the datamethods.org hivemind could give me some hints.
Method comparison studies in medicine + AI
A common use for AI within medical imaging is to replace the part of reporting that involves experts making laborious manual measurements (e.g. size of a tumor, volume of an organ) from images (X-ray / CT / MRI / ultrasound). When you observe what experts actually do, you quickly realize there is a fair bit of variability in their clinical practice: they measure to subtly different boundaries, the boundaries themselves are indistinct, or where they should measure is poorly defined in guidelines.
Consequently, in most AI-vs-expert validation studies you end up with multiple expert reads on each image versus a single AI measurement.
Despite this common experimental set-up for medical AI:
There is remarkably little consistency in how papers report the performance of their model against multiple experts.
There seem to be hundreds of different published methods extending Bland-Altman - but no advice on which to use.
I think you want three measures:
A measure of the difference between AI and the expert consensus.
A measure of the difference between the experts, to help contextualize (1).
A comparison of (1) and (2), in the hope:
that (1) is definitely better than (2),
or at least that (1) is no worse than (2) within some margin (a sketch of this comparison follows this list).
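To make (3) concrete, here is a minimal sketch, assuming the data have already been reduced to one row per image with a diff_ai column (|AI - consensus|) and a diff_expert column (mean |pairwise difference| among the experts). The column names, the margin, and the image-level bootstrap are all my assumptions, not a recommendation:

# Minimal sketch: compare AI-vs-consensus error with expert-vs-expert error.
# `d` is a hypothetical data frame with one row per image and columns
# diff_ai and diff_expert as described above.
compare_ai_vs_experts <- function(d, margin = 0.1, n_boot = 2000) {
  obs <- mean(d$diff_ai) - mean(d$diff_expert)
  # Nonparametric bootstrap over images for a CI on the difference
  boot <- replicate(n_boot, {
    i <- sample(nrow(d), replace = TRUE)
    mean(d$diff_ai[i]) - mean(d$diff_expert[i])
  })
  ci <- quantile(boot, c(0.025, 0.975))
  list(
    difference   = obs,
    ci_95        = ci,
    better       = unname(ci[2]) < 0,       # (1) definitely better than (2)
    non_inferior = unname(ci[2]) < margin   # (1) no worse than (2) within margin
  )
}

Here margin is in the measurement units (cm) and would need to be pre-specified on clinical grounds.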
What not to do
Correlation coefficient / ICC etc.
Arbitrarily picking a number as a threshold and working out sens/spec.
Classic Bland-Altman starts off well, but I'm not sure how to extend it into a method comparison study with multiple readers.
Data
Some real data: this is an excerpt (50 images out of a few hundred) from a study in which 13 experts measured the length of something in each image. There is a single AI measurement for each image as well.
Compared to other public datasets used in some of the biostatistics papers - this is quite large - so hopefully it will be of interest to readers.
head(d_human)
# A tibble: 6 × 3
  image_code user       len
  <chr>      <chr>    <dbl>
1 img-001    user-001 NA
2 img-001    user-002  4.53
3 img-001    user-003 NA
4 img-001    user-004  4.19
5 img-001    user-005  4.04
6 img-001    user-006  4.60
head(d_ai)
# A tibble: 6 × 3
  image_code user    len
  <chr>      <chr> <dbl>
1 img-001    ai     4.81
2 img-002    ai     5.32
3 img-003    ai     4.29
4 img-004    ai     2.87
5 img-005    ai     3.63
6 img-006    ai     3.29
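Before comparing anything it may be worth checking the completeness of the read matrix, since the NAs above show that not every expert measured every image. A quick sketch (assuming d_human as printed above):

library(dplyr)

# How many non-missing human reads does each image have?
d_human %>%
  group_by(image_code) %>%
  summarise(n_reads = sum(!is.na(len))) %>%
  count(n_reads, name = "n_images")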
Let's visualise the humans and the AI case by case.
library(tidyverse)

# Data table with the mean human value for each case + an ordering variable for plotting
d_human_sum <- d_human %>%
  group_by(image_code) %>%
  summarise(mean_human = mean(len, na.rm = TRUE)) %>%
  arrange(mean_human) %>%
  mutate(order = row_number())

d_plot <- bind_rows(d_human, d_ai) %>%
  left_join(d_human_sum) %>%
  mutate(type = ifelse(user == "ai", "ai", "human"))
Joining with `by = join_by(image_code)`
p1 <- ggplot(d_plot, aes(x = order, y = len, col = type)) +
  geom_point(alpha = 0.7, shape = 16) +
  scale_color_manual(values = c("ai" = "#AA3333", "human" = "#AAAAAA")) +
  scale_x_continuous("Images ordered by mean of humans") +
  scale_y_continuous("Length (cm)") +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

print(p1)
Looks quite nice: the AI falls within the spread of the human measurements for virtually all cases.
I like the concept of making an expert reference standard from the multiple humans (take the mean, or should it be the median?), and then summarising the AI-versus-reference differences with the mean, or perhaps the median and some quantiles.
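A sketch of that idea, with the mean as the reference (swap in median() if preferred); the choice of summaries and quantiles is mine:

library(dplyr)

# Expert reference standard: mean of the available human reads per image
d_ref <- d_human %>%
  group_by(image_code) %>%
  summarise(ref = mean(len, na.rm = TRUE))  # or median(len, na.rm = TRUE)

# Distribution of AI error against the reference
d_ai %>%
  left_join(d_ref, by = "image_code") %>%
  mutate(diff = len - ref) %>%
  summarise(
    bias       = mean(diff),        # signed, picks up systematic offset
    mean_abs   = mean(abs(diff)),
    median_abs = median(abs(diff)),
    q90_abs    = quantile(abs(diff), 0.9)
  )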
The issue with this is that you will then want to compare the AI's agreement with the reference to the experts' agreement with each other, and since each expert contributes to the expert reference standard, the comparison will be slightly biased in favour of the experts. The AI beats them anyway, and there are 13 humans in this experiment, so the effect should be small; but with fewer readers this bias would be bigger.
Following https://hbiostat.org/bbr/obsvar, what we should perhaps calculate instead is the difference between all possible pairs of human reads, and between the AI and each human.
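A sketch of how that all-pairs calculation might look; the self-join and the use of absolute differences are my reading of the approach, not necessarily the chapter's:

library(dplyr)

# All |differences| between every pair of human reads on the same image
d_pairs_human <- d_human %>%
  inner_join(d_human, by = "image_code", suffix = c("_a", "_b")) %>%
  filter(user_a < user_b) %>%                 # keep each unordered pair once
  transmute(image_code, abs_diff = abs(len_a - len_b)) %>%
  filter(!is.na(abs_diff))

# All |differences| between the AI and each human read on the same image
d_pairs_ai <- d_ai %>%
  select(image_code, len_ai = len) %>%
  inner_join(d_human, by = "image_code") %>%
  transmute(image_code, abs_diff = abs(len_ai - len)) %>%
  filter(!is.na(abs_diff))

# Summarise the two distributions side by side
bind_rows(human_vs_human = d_pairs_human,
          ai_vs_human    = d_pairs_ai,
          .id = "comparison") %>%
  group_by(comparison) %>%
  summarise(median_abs_diff = median(abs_diff),
            q90_abs_diff    = quantile(abs_diff, 0.9))

Note that pairs sharing an image or a reader are correlated, so for inference it would seem safer to bootstrap over images (as in the earlier sketch) rather than treat the pairs as independent.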