DISCLAIMER: I am not an expert statistician. I am a professional data scientist with a background in mathematics, probability, and statistics. This is not an expert opinion and should not be used as evidence or a basis to take or refrain from action against anyone.
This is a continuation of my review of Kyle Sheldrick’s fraud allegation against a Vitamin C sepsis study. The materials needed from the last part are carried over in the folded code block below, along with some minor enhancements.
library(tidyverse)
dat <- readxl::read_xlsx('data.xlsx', range = 'A1:E23') %>%
  mutate(category = str_c(str_pad(row_number(), width = 2), '. ', category)) %>%
  mutate(total_yes = treatment_yes + control_yes)

# sum of the binomial coefficients choose(n, k) + choose(n, k+1) + ... + choose(n, n-k)
band_choose <- function(n, k) {
  k <- min(k, n - k)
  total <- 0
  for (i in seq(k, n - k)) total <- total + choose(n, i)
  total
}

# probability, under a fair 50/50 allocation, of a split at least as balanced
# as k yeses out of n (small when the observed split is unusually even)
test_prob_scalar <- function(n, k) band_choose(n, k) / 2^n
test_prob <- function(n, k) map2_dbl(n, k, test_prob_scalar)

# expected value of the test p-value under the null for a category with n0 total yeses
expected_test_prob_scalar <- function(n0) {
  tmp <- tibble(k = 0:n0) %>%
    mutate(n = n0) %>%
    mutate(prob = map2_dbl(n, k, choose) / 2^n,
           test_p = map2_dbl(n, k, test_prob))
  tmp %>% summarize(x = sum(prob * test_p)) %>% pull(x)
}
expected_test_prob <- function(n0) map_dbl(n0, expected_test_prob_scalar)
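As a quick reminder of what the carried-over test measures: the p-value for a category is the probability, under a fair 50/50 allocation, of a split at least as balanced as the one observed, so very even splits get small values. For example:

test_prob(10, 5)   # perfectly even 5/5 split out of 10: p = 252/1024, about 0.25
test_prob(10, 1)   # lopsided 1/9 split: p = 1022/1024, essentially 1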
My initial exploration suggested that the data looked unusual based on an equitable distribution test, but I did not get to an overall frequentist evaluation of significance, i.e. how infrequently data of this level of equitability would occur under the null hypothesis. To do that, I combine the 22 p-values from my equitable distribution test into a single final value by multiplying them (actually I average their logarithms, which is equivalent for our purposes but gives more nicely scaled values).
overall_test_score <- function(df) {
  df %>%
    mutate(test_p = test_prob(total_yes, treatment_yes)) %>%
    summarize(test = mean(log(test_p))) %>%
    pull(test)
}
dat %>% overall_test_score()
## [1] -0.9803446
This value does not by itself tell us how unlikely the data is. To get at that, we run 100,000 simulations in which each category's total number of yeses is randomly allocated between the treatment and control groups, then calculate the overall score for each simulated dataset.
# seed for reproducibility
set.seed(12345)
test <- dat %>% select(category, total_yes)
nrep <- 1e5
result <- numeric(nrep)
ncat <- nrow(dat)
for (i in 1:nrep) {
  # progress indicator
  if (i %% (nrep/100) == 0) cat(paste0(i/(nrep/100), '%, '))
  result[i] <- test %>%
    mutate(treatment_yes = rbinom(ncat, total_yes, 0.5)) %>%
    overall_test_score()
}
## 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25%, 26%, 27%, 28%, 29%, 30%, 31%, 32%, 33%, 34%, 35%, 36%, 37%, 38%, 39%, 40%, 41%, 42%, 43%, 44%, 45%, 46%, 47%, 48%, 49%, 50%, 51%, 52%, 53%, 54%, 55%, 56%, 57%, 58%, 59%, 60%, 61%, 62%, 63%, 64%, 65%, 66%, 67%, 68%, 69%, 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, 100%,
Here’s a histogram showing the distribution of the simulation results.
hist(result)
Here are quantiles of the score from the 100,000 simulation runs.
quantile(result, (0:50) / 1e5)
## 0% 0.001% 0.002% 0.003% 0.004% 0.005% 0.006%
## -1.0751206 -1.0685830 -1.0420288 -1.0365957 -1.0291300 -1.0279158 -1.0256176
## 0.007% 0.008% 0.009% 0.01% 0.011% 0.012% 0.013%
## -1.0213779 -1.0155896 -1.0105652 -1.0101937 -1.0089381 -1.0040682 -1.0034077
## 0.014% 0.015% 0.016% 0.017% 0.018% 0.019% 0.02%
## -0.9999611 -0.9927603 -0.9927281 -0.9914833 -0.9912075 -0.9909895 -0.9883487
## 0.021% 0.022% 0.023% 0.024% 0.025% 0.026% 0.027%
## -0.9883120 -0.9882540 -0.9879542 -0.9871529 -0.9852697 -0.9852592 -0.9823512
## 0.028% 0.029% 0.03% 0.031% 0.032% 0.033% 0.034%
## -0.9810451 -0.9800988 -0.9796822 -0.9788774 -0.9782178 -0.9781907 -0.9771943
## 0.035% 0.036% 0.037% 0.038% 0.039% 0.04% 0.041%
## -0.9761788 -0.9737235 -0.9734116 -0.9727133 -0.9726701 -0.9709615 -0.9700440
## 0.042% 0.043% 0.044% 0.045% 0.046% 0.047% 0.048%
## -0.9691795 -0.9679213 -0.9676544 -0.9674069 -0.9670152 -0.9667076 -0.9665323
## 0.049% 0.05%
## -0.9656286 -0.9638792
The observed score falls between the 0.028% and 0.029% percentiles, meaning we would see data this equitable about 1 in 3,400 times under the null hypothesis. A second simulation with seed 1234567 returned about 1 in 4,500, suggesting that 100,000 replications give a decent, if noisy, approximation of the p-value we would obtain from an infinite number of replications.
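As an aside, the same p-value can be read directly off the simulation rather than scanned from the quantile table, and its rough Monte Carlo standard error shows why a different seed shifts the "1 in X" figure a bit:

# share of simulated scores at or below the observed score
obs <- dat %>% overall_test_score()
p_hat <- mean(result <= obs)
p_hat
# rough Monte Carlo standard error of this estimate
sqrt(p_hat * (1 - p_hat) / nrep)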
This is a tough call. These data are pretty unlikely to arise by random chance, but enough studies happen per year that you might see one or two that look like this. The utility of a test like this for detecting fraud would depend on the false discovery rate, which in turn depends on how frequently people fraudulently create overly equitable data. If that happens dozens of times a year, or in much more extreme cases (say, the overall test returns a p-value of 1 in a million rather than 1 in 3,400), I suspect the test would be useful on balance. But there would certainly be a non-negligible false positive rate, and additional evidence would need to be gathered carefully before any final judgment.
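To make the false-discovery-rate reasoning concrete, here is a back-of-the-envelope Bayes calculation. All of the numbers are invented for illustration; nothing here is estimated from real data:

# Hypothetical inputs: the flagging threshold from the simulation above,
# invented base rates of this kind of fabrication, and an invented chance
# that the test catches such a fabrication when it exists.
flag_p          <- 1 / 3400
prior_fraud     <- c(1e-4, 1e-3, 1e-2)
detection_power <- 0.5

false_flags <- flag_p * (1 - prior_fraud)        # innocent studies flagged by chance
true_flags  <- detection_power * prior_fraud     # fabricated studies correctly flagged
tibble(prior_fraud,
       false_discovery_rate = false_flags / (false_flags + true_flags))

With these invented numbers the false discovery rate ranges from a few percent up to well over half, which is the sense in which the test's practical value hinges on how common this kind of fabrication actually is.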
Here are some other considerations that might tip the balance toward an explanation other than fraud:

- This approach assumes a binomial distribution for the treatment and control group yeses. This is an approximation. A possibly more exact approach is given by Daniel Tausk here; Tausk obtains a less significant p-value, 1 in 490. (A sketch of what this kind of adjustment might look like appears after this list.)
- We did not account for correlations between the 22 variables. If people tend to come in with particular complexes of conditions, there are fewer effective degrees of freedom in the data generation process, leading to a wider spread of possible scores and to the observed data being less unlikely than our simulation above suggests. The primary diagnosis variables are definitely correlated, since each person has only one primary diagnosis. (The second sketch after this list illustrates the effect.)
- The study may have been improperly reported as enrolling patients consecutively. Perhaps, in fact, the second group was matched somewhat to the characteristics of the first group. But it is unclear to me how this matching could have been performed, and reporting the enrollment as "consecutive" while completely omitting the matching procedure would be a serious concern in itself.
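To illustrate the first point, here is a minimal sketch of a fixed-arm-size version of the simulation, drawing each category's treatment count from a hypergeometric distribution (conditioning on how many patients are in each arm) instead of Binomial(total_yes, 0.5). The arm sizes are placeholders I made up, and this is only my guess at the general flavour of such an adjustment, not necessarily Tausk's method:

# Sketch only: condition on fixed arm sizes instead of assuming each yes
# lands in either arm with probability 1/2. Arm sizes are placeholders,
# not taken from the study; they must be at least as large as the largest
# category total for rhyper() to be valid.
n_treat   <- 75
n_control <- 75
set.seed(12345)
result_hyper <- replicate(1e4, {
  test %>%
    mutate(treatment_yes = map_dbl(total_yes, ~ rhyper(1, n_treat, n_control, .x))) %>%
    overall_test_score()
})
# share of simulated scores at or below the observed score
mean(result_hyper <= (dat %>% overall_test_score()))

For the second point, here is a toy demonstration, with made-up category sizes, of why correlation widens the spread of the overall score: if the 22 categories were secretly 11 conditions each counted twice, the score would effectively be an average over 11 independent p-values rather than 22, and would therefore vary more from run to run.

# Toy illustration with hypothetical totals (30 yeses per category, not the
# study's real numbers): compare the spread of the overall score when the 22
# categories are independent versus perfectly correlated in pairs.
set.seed(1)
toy_totals <- rep(30, 22)
score_indep <- function() {
  yes <- rbinom(22, toy_totals, 0.5)
  mean(log(test_prob(toy_totals, yes)))
}
score_paired <- function() {
  yes <- rep(rbinom(11, toy_totals[1:11], 0.5), 2)  # each draw counted twice
  mean(log(test_prob(toy_totals, yes)))
}
c(sd_independent = sd(replicate(1e4, score_indep())),
  sd_paired      = sd(replicate(1e4, score_paired())))

The paired version has a noticeably larger standard deviation, which is the sense in which correlation makes an unusually even overall split less surprising.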