Method comparison studies: AI vs multiple experts

I think one of the hindrances to improving analysis in papers / statistical practice is the lack of canonical examples of real experimental data with associated commentary about why certain methods were / were not used.

My work is on applying AI within clinical practice - so I thought I would see if the datamethods.org community could give me some advice.

In part this is motivated by being asked to follow this paper in a review of a manuscript https://web1.sph.emory.edu/observeragreement/CIA%20-%20manuscript.pdf - and to calculate the “Individual Equivalence Coefficient” - which to me seems overly complex.

Can the datamethods.org hivemind give me some hints?

Method comparison studies in medicine + AI

A common use for AI within medical imaging is to replace the part of reporting that involves experts making laborious manual measurements (e.g. size of a tumor, volume of an organ) from images (X-rays / CT / MRI / Ultrasound). When you observe what experts do you quickly realize that there is a fair bit of variability in their clinical practice: they measure to subtly different boundaries, the boundaries themselves are indistinct, or where they should measure is poorly defined in guidelines.

Consequently, in most AI-vs-expert validation studies you end up with multiple expert reads on each image versus a single AI measurement.

Despite this common experimental set-up for medical AI:

  1. There is remarkably little consistency in how papers report the performance of their model against multiple experts.
  2. There seem to be hundreds of different published methods extending Bland-Altman - but no advice on which to use.

I think you want three measures:

  1. A measure of the difference between AI and the expert consensus.
  2. A measure of the difference between the experts to help contextualize (1)
  3. A comparison of (1) and (2), in the hope:
  • that (1) is clearly better than (2),
  • or, failing that, that (1) is at least no worse than (2) (within some margin).

What not to do

  1. Correlation coefficient / ICC etc.
  2. Arbitrarily picking a number as a threshold and working out sens/spec.

Possible approaches

  1. Methods of Bland and Altman, with worked example here: https://www-users.york.ac.uk/~mb55/meas/bland2007.pdf
  • Only describes limits of agreement between two methods - so I don’t get the three measures I would like (a quick version against the expert mean is sketched below, after the reference-standard calculation).
  2. https://web1.sph.emory.edu/observeragreement/CIA%20-%20manuscript.pdf
  • IEC - Seems quite complex and fiddly, and I can’t find any public code for it, so I would like some opinions before I try to implement it.
  3. https://cran.r-project.org/web/packages/MethodCompare/index.html
  • I like this, especially for the precision plot, but it doesn’t provide an estimate and CI of the difference in precision.
  4. https://hbiostat.org/bbr/obsvar
  • Starts off well, but I’m not sure how to extend it into a method comparison study.

Data

Some real data - this is an excerpt (50 images out of a few hundred) of a study where 13 experts measured the length of something in each image. There is a single AI measurement for each image as well.

Compared to the public datasets used in some of the biostatistics papers, this is quite large, so hopefully it will be of interest to readers.

suppressPackageStartupMessages(library("tidyverse"))
d_human <- readRDS("./output/d_human_new.rds")

d_ai <- readRDS("./output/d_ai_new.rds")

head(d_human)
# A tibble: 6 × 3
  image_code user       len
  <chr>      <chr>    <dbl>
1 img-001    user-001 NA   
2 img-001    user-002  4.53
3 img-001    user-003 NA   
4 img-001    user-004  4.19
5 img-001    user-005  4.04
6 img-001    user-006  4.60
head(d_ai)
# A tibble: 6 × 3
  image_code user    len
  <chr>      <chr> <dbl>
1 img-001    ai     4.81
2 img-002    ai     5.32
3 img-003    ai     4.29
4 img-004    ai     2.87
5 img-005    ai     3.63
6 img-006    ai     3.29
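
Not every expert measured every image (note the NAs in d_human above). As a quick sanity check - my own sketch, not part of any of the approaches above - here is a count of how many of the 13 experts contributed a measurement to each image:

# Count the non-missing expert reads per image, then tabulate
d_human%>%
  group_by(image_code)%>%
  summarise(n_reads = sum(!is.na(len)))%>%
  count(n_reads)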

Let’s visualise the humans and the AI case by case.

# Data table with the mean human value for each case and an ordering variable for plotting
d_human_sum <- d_human%>%
  group_by(image_code)%>%
  summarise(mean_human = mean(len, na.rm=T))%>%
  arrange(mean_human)%>%
  mutate(order = row_number())

d_plot <- bind_rows(d_human, d_ai)%>%
  left_join(d_human_sum)%>%
  mutate(type = ifelse(user=="ai", "ai", "human"))
Joining with `by = join_by(image_code)`
p1 <- ggplot(aes(x=order, y=len, col=type), data=d_plot)+
  geom_point(alpha=0.7, shape=16)+
  scale_color_manual(values=c("ai"="#AA3333", "human"="#AAAAAA"))+
  scale_x_continuous("Images ordered by mean of humans")+
  scale_y_continuous("Length (cm)")+
  theme_minimal()+
  theme(panel.grid.major.x=element_blank(),
        panel.grid.minor.x=element_blank())

print(p1)
Warning: Removed 32 rows containing missing values (`geom_point()`).

Looks quite nice: the AI falls within the spread of human measurements for virtually all cases.
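
To put a rough number on “virtually all”: a sketch (my own addition - it assumes the min-max range of the available human reads is a reasonable envelope) of the proportion of images where the AI lies within the range of the human measurements.

# Proportion of images where the AI falls inside the min-max range of the
# human reads (a crude coverage check)
d_range_check <- d_human%>%
  group_by(image_code)%>%
  summarise(min_human = min(len, na.rm=TRUE),
            max_human = max(len, na.rm=TRUE))%>%
  left_join(d_ai%>%select(image_code, ai_len=len), by="image_code")%>%
  mutate(within_range = ai_len >= min_human & ai_len <= max_human)

mean(d_range_check$within_range, na.rm=TRUE)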

I like the concept of making an expert reference standard from the multiple humans (take the mean, or should it be the median?), and then summarising the AI’s absolute differences from that reference with the mean, or perhaps the median and some quantiles.

d_ai_performance <- d_ai%>%
  full_join(d_human_sum)%>%
  mutate(abs_diff = abs(len-mean_human))
Joining with `by = join_by(image_code)`
print("Mean Absolute Error")
[1] "Mean Absolute Error"
print(mean(d_ai_performance$abs_diff, na.rm=T))
[1] 0.2375188
print("Median Absolute Error")
[1] "Median Absolute Error"
print(quantile(d_ai_performance$abs_diff,c(0.5,0.75,0.8,0.9,0.95), na.rm=T))
      50%       75%       80%       90%       95% 
0.1696177 0.3160877 0.4031930 0.4929074 0.6949484 
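
For completeness, approach 1 from the list above against the same reference: a quick Bland-Altman style bias and 95% limits of agreement for AI vs the mean of the experts. This is only a sketch - it treats the expert mean as if it were an error-free reference, which is a known limitation of this shortcut.

# Bland-Altman style summary: mean difference (bias) and 95% limits of
# agreement between the AI and the mean-of-experts reference
d_ba <- d_ai_performance%>%
  mutate(diff = len - mean_human)

ba_bias <- mean(d_ba$diff, na.rm=TRUE)
ba_loa  <- ba_bias + c(-1.96, 1.96)*sd(d_ba$diff, na.rm=TRUE)
print(c(bias=ba_bias, loa_lower=ba_loa[1], loa_upper=ba_loa[2]))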

The issue with this is that you will then want to compare it with how well the experts agree with each other, and since each expert contributes to the expert reference standard, that comparison will be slightly biased in favour of the experts (a leave-one-out reference, sketched after the code below, would remove this bias). The AI beats them anyway, and there are 13 humans in this experiment, so the effect should be small - but with fewer experts the bias would be bigger.

d_human_performance <- d_human%>%
  full_join(d_human_sum)%>%
  mutate(abs_diff = abs(len-mean_human))%>%
  group_by(image_code)%>%
  summarise(mean_abs_diff = mean(abs_diff, na.rm=T))
Joining with `by = join_by(image_code)`
print("Mean Absolute Error")
[1] "Mean Absolute Error"
print(mean(d_human_performance$mean_abs_diff, na.rm=T))
[1] 0.3273542
print("Median Absolute Error")
[1] "Median Absolute Error"
print(quantile(d_human_performance$mean_abs_diff,c(0.5,0.75,0.8,0.9,0.95), na.rm=T))
      50%       75%       80%       90%       95% 
0.2809962 0.3885107 0.4368667 0.5810151 0.6445229 
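
A rough leave-one-out version of the same calculation (my own sketch, not taken from any of the references): compare each expert read to the mean of the other experts on that image, so that no reader contributes to their own reference.

# Leave-one-out reference: for each read, the mean of the *other* experts'
# reads on the same image
d_human_loo <- d_human%>%
  group_by(image_code)%>%
  mutate(loo_mean = (sum(len, na.rm=TRUE) - coalesce(len, 0)) /
                    (sum(!is.na(len)) - !is.na(len)))%>%
  ungroup()%>%
  mutate(abs_diff = abs(len - loo_mean))

d_human_loo_performance <- d_human_loo%>%
  group_by(image_code)%>%
  summarise(mean_abs_diff = mean(abs_diff, na.rm=TRUE))

print(mean(d_human_loo_performance$mean_abs_diff, na.rm=TRUE))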

3 - https://cran.r-project.org/web/packages/MethodCompare/index.html

# Slightly odd way it likes the dataframe...
library("MethodCompare")
Loading required package: nlme

Attaching package: 'nlme'
The following object is masked from 'package:dplyr':

    collapse
d_meth <- bind_rows(d_ai%>%select(image_code, ai=len), d_human%>%select(image_code,human=len))

measure_model <- measure_compare(d_meth, new="ai", Ref="human", ID="image_code")

bias_plot(measure_model)

precision_plot(measure_model)

compare_plot(measure_model)

4 - https://hbiostat.org/bbr/obsvar

Following https://hbiostat.org/bbr/obsvar, what we should perhaps calculate instead is the differences between all possible human pairs, and between the AI and each human.

d_human_expand <- d_human%>%
  group_by(image_code)%>%
  arrange(user)%>%
  reframe(expand_grid(user1=user, user2=user))%>%
  group_by(image_code)%>%
  filter(user1 < user2)%>%
  arrange(image_code, user1, user2)%>%
  left_join(d_human, by=join_by(image_code==image_code, user1==user))%>%
  rename(len1=len)%>%
  left_join(d_human, by=join_by(image_code==image_code, user2==user))%>%
  rename(len2=len)%>%
  mutate(len_diff = len1 - len2)%>%
  mutate(abs_len_diff = abs(len_diff))%>%
  mutate(type="human")

d_ai_expand <- d_human%>%
  rename(user1=user, len1=len)%>%
  left_join(d_ai%>%rename(len2=len))%>%
  mutate(len_diff = len1-len2)%>%
  mutate(abs_len_diff = abs(len_diff))%>%
  mutate(user2="ai", type="ai")
Joining with `by = join_by(image_code)`
d_final <- bind_rows(d_human_expand, d_ai_expand)

d_result <- d_final%>%
  group_by(image_code, type)%>%
  summarise(mae = mean(abs_len_diff, na.rm=T))%>%
  pivot_wider(names_from=type, values_from=mae)%>%
  mutate(diff_mae = ai-human)
`summarise()` has grouped output by 'image_code'. You can override using the
`.groups` argument.
print("AI MAE - diff from randomly selected expert")
[1] "AI MAE - diff from randomly selected expert"
print(mean(d_result$ai, na.rm=T))
[1] 0.407708
print("Human MAE - diff between two randomly selected experts")
[1] "Human MAE - diff between two randomly selected experts"
print(mean(d_result$human, na.rm=T))
[1] 0.4843156

I don’t like these numbers for a few reasons:

  1. They are bigger (so my AI looks worse). I know they are measuring a different thing (difference from a randomly selected expert rather than from the consensus), but still…
  2. When comparing the AI vs the humans, the ratio also looks worse: 0.238/0.327 ≈ 73% vs 0.408/0.484 ≈ 84%.

Anyway, let’s subtract the human MAE from the AI MAE for each case and put a confidence interval on the mean difference.

library("Hmisc")

Attaching package: 'Hmisc'
The following objects are masked from 'package:dplyr':

    src, summarize
The following objects are masked from 'package:base':

    format.pval, units
smean.cl.boot(d_result$diff_mae)
       Mean       Lower       Upper 
-0.07530355 -0.11831816 -0.02768779 

OK - the AI is closer to a randomly selected expert than another randomly selected expert is.
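
And a sketch of the third item on the wish-list at the top (the “no worse than, within some margin” check), in non-inferiority form: compare the upper confidence limit of the difference in MAE against a pre-specified margin. The margin used here (0.1 cm) is purely illustrative - in practice it would need clinical justification.

# Non-inferiority style check: is the upper confidence limit of the
# AI-minus-human difference in MAE below a pre-specified margin?
# (the bootstrap CI will vary slightly from run to run)
delta <- 0.1  # illustrative margin only
ci <- smean.cl.boot(d_result$diff_mae)
print(ci["Upper"] < delta)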

It seems too simple, though, especially compared to the IEC method I was asked to use.

Any thoughts on how the datamethods.org community would like the results of an AI vs multirater study to be presented?