DISCLAIMER: I am not an expert statistician. I am a professional data scientist with background in mathematics, probability, and statistics. This is not an expert opinion and should not be used as evidence or basis to take or refrain from action against anyone.
Using statistical analysis, medical fraud researcher Kyle Sheldrick has alleged strong evidence of fraud in a trial of Vitamin C for treatment of sepsis. This short note presents a re-analysis of the data flagged by Sheldrick using a simple significance test developed from first principles. Based on this analysis, the evidence for fraud seems significantly weaker than Sheldrick’s analysis indicates, and suggests there may be cause for reconsidering the allegation.
I copied the data from Dr. Sheldrick’s website into an Excel workbook.
library(tidyverse)
dat <- readxl::read_xlsx('data.xlsx', range = 'A1:E23') %>%
mutate(category = str_c(str_pad(row_number(), width = 2), '. ', category))
dat %>% print(n=100)
## # A tibble: 22 × 5
## category treatment_yes control_yes treatment_no control_no
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 " 1. Male" 27 23 20 24
## 2 " 2. No Comorbidity" 2 1 45 46
## 3 " 3. Diabetes" 16 20 31 27
## 4 " 4. Hypertension" 20 25 27 22
## 5 " 5. heart failure" 15 16 32 31
## 6 " 6. Malignancy" 5 7 42 40
## 7 " 7. COPD" 8 7 39 40
## 8 " 8. Cirrhosis" 6 3 41 44
## 9 " 9. CVA" 8 5 39 42
## 10 "10. CRF" 7 8 40 39
## 11 "11. Morbid Obesity" 6 8 41 39
## 12 "12. Immunocompromised" 6 4 41 43
## 13 "13. Drug Addiction" 5 5 42 42
## 14 "14. Primary Diagnosis: Pn… 18 19 29 28
## 15 "15. Primary Diagnosis: Ur… 11 10 36 37
## 16 "16. Primary Diagnosis: Pr… 7 7 40 40
## 17 "17. Primary Diagnosis: GI… 6 6 41 41
## 18 "18. Primary Diagnosis: Ot… 5 5 42 42
## 19 "19. Ventilation" 22 26 25 21
## 20 "20. Vasopressors" 22 22 25 25
## 21 "21. AKI" 31 30 16 17
## 22 "22. Positive blood cultur… 13 13 34 34
The property of the data that Sheldrick finds suspicious is the distribution of yeses between the treatment and control group - it seems too “equitable”. Here, equitable means the treatment and control group contain an equal or nearly-equal share of the subjects with a certain condition. For example, there are 44 people on vasopressors in the study, and they are perfectly balanced between the treatment and control group, each of which contain 22 people. When the total with a condition is odd, the treatment and control groups can’t have the same number of subjects, so a difference of one in the group sizes is the maximally equitable distribution. Sheldrick notes that, in 13 of the 22 dimensions, the distribution is maximally equitable, and uses a Fisher exact test to show that this is very unlikely.
I do not attempt to evaluate whether Sheldrick’s analysis is correct. Instead, I attempt to re-evaluate equitability by a different and more direct method.
Suppose we are distributing n balls into two urns at random: or each ball we flip an unbiased coin and if heads, it goes to urn A, otherwise urn B. We call the balls equitably distributed when the difference in ball count between urn A and B is small. If urn A contains \(k\) balls, things would be at least as equitable if it instead contained a number of balls between \(k\) and \(n-k\). Thus, the probability of observing an outcome at least as equitable as the one observed is \[p = \frac{\sum_{i=k}^{n-k} {n \choose i}}{2^n}\]
This formula can be used for significance testing. For example, under the null hypothesis of random unbiased distribution described above, the probability of n balls being distributed perfectly equitably is about 25% for 10 balls, 8% for 100 balls, and 2% for 1,000 balls. At the conventional 5% significance threshold, the n=1,000 case would be statistically significant, while the others would not be. This test is implemented in the R code below.
# sum of (n k) + ... + (n n-k)
band_choose <- function(n, k) {
k <- min(k, n-k)
total <- 0
for (i in seq(k, n-k)) total <- total + choose(n, i)
total
}
# Equitable distribution test p-value
test_prob <- function(n, k) band_choose(n, k) / 2^n
For exploratory analysis, it may be helpful to look at the expected value of the test under the null hypothesis, which can be calculated as follows.
expected_test_prob <- function(n0) {
tmp <- tibble(k = 0:n0) %>%
mutate(n = n0) %>%
mutate(prob = map2_dbl(n, k, choose) / 2^n,
test_p = map2_dbl(n, k, test_prob))
tmp %>% summarize(x = sum(prob * test_p)) %>% pull(x)
}
Returning to the sepsis study, Sheldrick’s contention is that the
distribution of people between the treatment and control groups is too
equitable. For each of the 22 conditions, I apply the equitable
distribution test to the way the people with the condition are
distributed between the treatment and control group. For example, when
applied to the 44 people on vasopressors who are perfectly
equi-distributed between treatment and control, I find a test value of
test_prob(44, 22) = 0.12, meaning this perfectly equitable
result would be seen 12% of the time. NOTE: I am assuming that
the binomial distribution is a good approximation to the actual, which
has not been rigorously demonstrated!
The observed and expected test values are plotted below for each dimension.
dat <- dat %>%
mutate(total_yes = treatment_yes + control_yes,
observed = map2_dbl(total_yes, treatment_yes, test_prob),
expected = map_dbl(total_yes, expected_test_prob))
dat %>%
gather(key = 'test_value', value = 'p', c(observed, expected)) %>%
ggplot(aes(x = category, y = p, group = test_value, color = test_value)) +
geom_point() +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
coord_cartesian(ylim = c(0, 1)) +
ggtitle('Equitable distribution significance testing',
'Vitamin C sepsis study data') +
ylab('Probability')
At a glance, the observed p-values tend to be lower than the expected, but not very consistently so. If we take the test value as a measure of equitability, 17 of the 22 dimensions are more equitable than expected and 5 are less equitable than expected.
Unlike Sheldrick’s Fisher Exact Test, this analysis does not seem to indicate a spectacularly unlikely level of equi-distribution between the treatment and control subgroups with each of the 22 conditions. The highest observed p-value is \(0.82\), and the probability of seeing all values below this threshold is \(0.82^{22} = 0.013\), unlikely but not spectacularly so.
Focusing on reasonable, non-“cherry picked” subsets, such as the primary diagnosis distribution, gives similarly underwhelming results. All p-values for the five primary diagnosis dimensions are below \(0.35\), and \(0.35^5 = 0.005\). Certainly the data looks unlikely, but with perhaps tens of thousands of small clinical studies being performed each year, dozens to hundreds of innocent studies might show features like this by chance and be falsely flagged by this approach.
I do not claim to have resolved the question of fraud in the sepsis study in question. I do not even claim that Sheldrick’s analysis is wrong, although others have raised concerns since the first draft of this post. However, if there were a compelling case for fraud, I would expect a simple analysis like this to come to the same conclusion as Sheldrick’s analysis did. The fact that it doesn’t seems to be reason for pause and careful reconsideration of the allegation.