Libraries and functions setup
library(readxl)
library(effectsize)
library(pwr)
# Colquhoun's R function from 2019 supplementary materials
calc.FPR = function(nsamp,pval,delta,sigma,prior)
{
df=2*(nsamp-1)
# # calculate power, for below
# myp=power.t.test(n=nsamp,sd=sigma,delta=delta,sig.level=pval,type="two.sample",alternative="two.sided",power=NULL)
# power = myp$power
# Note FPR doesn't need calculation of power for p-equals case
#
# #under H0, use central t distribution
tcrit=qt((1-pval/2),df,ncp=0)
x0=tcrit
y0=dt(x0,df,0)
#
# under H1 use non-central t distribution
sdiff=sigma*sqrt(2/nsamp) # standard error of the difference between the two group means (sigma is the within-group SD)
ncp1=delta/sdiff # non-centrality parameter
x1=x0 #tcrit
y1=dt(x1,df,ncp=ncp1)
# Calc false positive risk
p0=2*y0
p1=y1
LR=y1/(2*y0)
FPR=((1-prior)*p0)/(((1-prior)*p0) + prior*p1)
output=c(FPR,LR)
return(output)
}
# Example
## Data
myn=10
mysampA=c(0.7,-1.6,-0.2,-1.2,-0.1,3.4,3.7,0.8,0.0,2)
mysampB=c(1.9,0.8,1.1,0.1,-0.1,4.4,5.5,1.6,4.6,3.4)
#myobsdif= -1.58
## Prepare input variables for function
meanA=mean(mysampA)
meanB=mean(mysampB)
myobsdif= meanA-meanB
sdA=sd(mysampA)
sdB=sd(mysampB)
#do t test "by hand"
mynA=myn
mynB=myn
varpool= ((mynA-1)*sdA^2 +(mynB-1)*sdB^2)/(mynA+mynB-2)
sdpool=sqrt(varpool)
normeffectsize=abs(myobsdif)/sdpool
sdiff=sqrt(varpool/mynA + varpool/mynB)
tval=abs(myobsdif)/sdiff
df=mynA+mynB-2
pval=2*(1-pt(tval,df,0))
pow1=power.t.test(n = myn,
delta = abs(myobsdif),
sd = sdpool, sig.level = 0.05,power = NULL,
type = "two.sample",alternative = "two.sided",
strict = FALSE)
powerobs=pow1$power
# end of t test "by hand"
prior=0.5
nsamp=myn
delta=abs(myobsdif)
sigma=sdpool
myfpr=calc.FPR(nsamp,pval,delta,sigma,prior)
myfpr # returns minimum FPR for p(H1) = 0.5 and likelihood ratio for H1/H0
# While Colquhoun's website calculator handles uneven sample sizes, the function in the supplementary materials does not
test.fpr = function(sample_a,
sample_b,
pval, # observed p_value
delta, # absolute observed difference in means
sigma, # pooled SD
prior) # prior prob of real effect
{
df = t.test(sample_a, sample_b)$parameter
meanA = mean(sample_a)
meanB = mean(sample_b)
obsdif = meanA - meanB
sdA = sd(sample_a)
sdB = sd(sample_b)
mynA = length(sample_a)
mynB = length(sample_b)
varpool = ((mynA-1)*sdA^2 +(mynB-1)*sdB^2)/(mynA+mynB-2)
sdpool = sqrt(varpool)
normeffectsize = abs(obsdif)/sdpool # use the difference computed from the function arguments, not the global myobsdif
sdiff = sqrt(varpool/mynA + varpool/mynB)
#tval = abs(myobsdif)/sdiff
#pval = 2*(1 - pt(tval,df,0))
# (power calculation here)
tcrit = qt((1-pval/2), df, ncp=0)
x0 = tcrit
y0 = dt(x0,df,0)
ncp1 = delta/sdiff
x1 = x0 #tcrit
y1 = dt(x1,df,ncp=ncp1)
# Calculate false positive risk
p0 = 2*y0
p1 = y1
LR = y1/(2*y0)
FPR = ((1-prior)*p0)/(((1-prior)*p0) + prior*p1)
output = c(FPR, LR)
return(output)
}
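As a quick sanity check (not part of Colquhoun's materials), the modified function can be run on the toy example above. It should closely agree with calc.FPR; any small difference comes from test.fpr using the Welch degrees of freedom from t.test() rather than 2*(nsamp - 1).
# Sanity check: compare the modified function with calc.FPR on the toy example.
# Small differences are expected because of the Welch degrees of freedom.
test.fpr(mysampA, mysampB, pval, delta, sigma, prior)
myfpr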
Motivation: report p-values with the context of a False Positive Risk for the single test conducted, i.e., understand an NHST and its result like a screening / binary classification problem. Why?
A note on the name: confusingly, False Positive Risk is not the same as the False Positive Rate of the confusion matrix literature. Colquhoun’s False Positive Risk is analogous to what screening terminology calls the False Discovery Rate, FP / (FP + TP), the proportion of all positive results that are false. The False Positive Rate, FP / (FP + TN), is what the p-value itself speaks to: the rate of false positives among only those cases where there is no real effect.
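To make the distinction concrete, here is a small illustrative calculation with assumed numbers (not from the study): imagine 1000 tests of which half have a real effect, a significance level of 0.05, and power of 0.8.
# Illustrative screening arithmetic with assumed values (not study data)
n_tests    = 1000
prevalence = 0.5   # proportion of tests where a real effect exists (the prior)
alpha      = 0.05  # false positive rate: P(significant | no effect)
power_h1   = 0.8   # P(significant | real effect)
false_pos  = n_tests * (1 - prevalence) * alpha # 25 false positives
true_pos   = n_tests * prevalence * power_h1    # 400 true positives
false_pos / (false_pos + true_pos)              # false positive risk, about 0.06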
Slide from David Colquhoun’s talk on YouTube.
Tenório, T, Bittencourt, II, Isotani, S, Pedro, A, Ospina, P & Tenório, D 2017, ‘Dataset of two experiments of the application of gamified peer assessment model into online learning environment MeuTutor’, Data in Brief, vol. 12, no. C, pp. 433–437, doi: 10.1016/j.dib.2017.04.032.
Experiment 1: a time-metric study across three conditions. The specific comparison from the study we will use for this notebook is the time (in seconds) to correct one performed activity, comparing conditions T2 and T3.
The study data is loaded and pre-processed below.
meututor = read_xlsx("data/tenorio_dataset/Experiment_one.xlsx", sheet = 3)
meututor = meututor[-1,] # remove header row
meututor$...2 <- as.numeric(meututor$...2) # convert chr to numeric
meututor$...3 <- as.numeric(meututor$...3)
meututor_t2 <- meututor$...2 # assign to separate vectors
meututor_t3 <- meututor$...3
meututor_t3 <- meututor_t3[!is.na(meututor_t3)] # remove broadcasted NA values
They report using a Welch t-test for unequal variances in R for the T2 vs T3 comparison, so let’s reproduce the result:
reproduce_result <- t.test(meututor_t2,
meututor_t3)
reproduce_result
##
## Welch Two Sample t-test
##
## data: meututor_t2 and meututor_t3
## t = -2.4605, df = 21.938, p-value = 0.02222
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -308.58038 -26.28629
## sample estimates:
## mean of x mean of y
## 501.6667 669.1000
# note Welch degrees of freedom stored in:
#reproduce_result$parameter
This reproduces the p-value of 0.022 originally reported in the study / provided files. In terms of the estimated effect size, the 95% CI suggests it takes the teacher between about 26 seconds and 5 minutes longer to mark a completed task than the peer feedback system.
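As a quick check of those numbers, the CI limits (in seconds) can be read off the t-test object:
# Absolute 95% CI limits for the difference in means, in seconds
abs(reproduce_result$conf.int)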
Next, we need a prior probability of a real effect: a theorised proportion of hypothesis tests of this kind in which there would be a real effect to detect. This is analogous to disease prevalence in the medical screening context.
Colquhoun suggests 50% as a reasonable choice absent any other information. If you are less confident in the plausibility of H1 (e.g., nothing like the hypothesis appears in the literature, or a previous study returned a null result), you might choose a lower prior probability. There is no valid reason to set a higher one.
In the study, conditions T2 and T3 have different sample sizes; both numbers are needed for the pooled standard deviation calculation.
Pooled sample standard deviation: \[ S_p=\sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1 + n_2 - 2}} \] We will use this in the calculation of the effect size; the pooled estimate weights each sample’s standard deviation by its degrees of freedom.
Effect size:
The effect size is the difference in means between the conditions. When the samples have different standard deviations, the effect size needs to be expressed in terms of the pooled standard deviation (this normalised effect is also known as Hedges’ g).
std_dev_pooled = function(sample_a, sample_b){
  n_a = length(sample_a)
  n_b = length(sample_b)
  pooled_sd = sqrt(((n_a - 1)*sd(sample_a)^2 + (n_b - 1)*sd(sample_b)^2)/(n_a + n_b - 2))
  return(pooled_sd)
}
obs_difference = mean(meututor_t2) - mean(meututor_t3)
poolsd = std_dev_pooled(meututor_t2, meututor_t3)
stopifnot(round(std_dev_pooled(meututor_t2, meututor_t3), 3) == round(effectsize::sd_pooled(meututor_t2, meututor_t3), 3))
meu_normeffect = abs(obs_difference)/poolsd
meu_normeffect
## [1] 0.9174353
Cohen, the statistician whose work on power analysis is seminal, proposed heuristics for what count as small, medium, and large effects, but researchers are cautioned to frame these within their own discipline.
As this value is greater than 0.8, it constitutes a large effect size according to Cohen.
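If preferred, the effectsize package (already loaded) can apply Cohen’s 1988 thresholds programmatically; this is just a convenience check of the same heuristic, assuming a recent version of the package.
# Optional check of the effect size against Cohen's (1988) thresholds
effectsize::interpret_cohens_d(meu_normeffect)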
There is also a recommended bias correction for studies where total n is < 50.
grand_n = length(meututor_t2) + length(meututor_t3)
meu_normeffect * ((grand_n - 3) / (grand_n - 2.25)) * sqrt(((grand_n - 2) / grand_n))
## [1] 0.8583134
With the bias correction applied, this is still a large effect.
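For comparison, the effectsize package also provides a bias-corrected estimate via hedges_g(); note it applies a different small-sample correction factor than the formula above, so expect a similar but not identical value.
# Bias-corrected effect size from the effectsize package (different correction
# factor than the manual formula above, so values may not match exactly)
effectsize::hedges_g(meututor_t2, meututor_t3)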
Currently, the function test.fpr does not use the corrected Hedges’ g in the calculation of the FPR, but this can / should be implemented (one possible approach is sketched below).
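A minimal sketch of one possible approach, assumed rather than taken from Colquhoun’s materials: shrink the raw mean difference by the same correction factor before passing it as delta, which scales the non-centrality parameter in the same proportion.
# Hypothetical sketch (not part of the original materials): apply the
# small-sample correction to the raw difference before passing it as delta
correction_factor = ((grand_n - 3) / (grand_n - 2.25)) * sqrt((grand_n - 2) / grand_n)
corrected_delta   = abs(obs_difference) * correction_factor
# test.fpr(meututor_t2, meututor_t3, reproduce_result$p.value,
#          corrected_delta, poolsd, 0.5)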
Post-hoc power analyses are frowned upon when used to justify or review the study design (they invite fallacies like ‘the study was under-powered and still found an effect, which supports the strength of the effect’). Generally, power should be analysed beforehand to determine an appropriate sample size given the anticipated effect size. Here it serves a different purpose, since we are modelling the probability of a false positive result for the study that has already happened.
The other thing to keep in mind is that delta, or Cohen’s d, properly uses a population standard deviation (e.g., one estimated from a meta-analysis of past studies), whereas this post-hoc power calculation uses the pooled standard deviation of the observed data. This is again appropriate, given the goal is to model an FPR for this one study.
# Note: for unequal sample sizes this needs to be the `pwr.t2n.test` function
meu_power_summary <- pwr.t2n.test(n1 = length(meututor_t2),
                                  n2 = length(meututor_t3),
                                  d = abs(obs_difference)/poolsd,
                                  sig.level = 0.05, power = NULL,
                                  alternative = "two.sided")
meu_power <- meu_power_summary$power
meu_power_summary
##
## t test power calculation
##
## n1 = 18
## n2 = 10
## d = 0.9174353
## sig.level = 0.05
## power = 0.6101538
## alternative = two.sided
The post-hoc power does not need to be directly used in the FPR calculation but should be reported with it.
Why is it a minimum? Because the ‘true’ P(H1) can only be lower than (or equal to) the 0.5 prior given. This means the ‘real effect’ arm of the screening tree can only contain a smaller share of the NHST results, and the power arm can only detect effects from within that smaller share, so the FPR can only be higher (the sweep below makes this concrete).
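The sweep below (not in the original materials) evaluates test.fpr at a few priors at or below 0.5; the FPR only rises as the prior drops.
# FPR for priors at and below 0.5; the first element of each result is the FPR
sapply(c(0.5, 0.3, 0.1), function(p) {
  test.fpr(meututor_t2, meututor_t3, reproduce_result$p.value,
           abs(obs_difference), poolsd, p)[1]
})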
What does ‘p-equals’ mean? The calculation conditions on the p-value being (essentially) equal to the observed value, rather than on it being the observed value or anything smaller (the ‘p-less-than’ case). Detail on the justification for this is in Colquhoun’s article; a sketch of the p-less-than alternative follows below.
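For contrast, here is a sketch of the ‘p-less-than’ version, an assumed formulation following Colquhoun’s framework rather than his code: treat the observed p-value as the significance threshold and use the power to reach that threshold under H1.
# 'p-less-than' sketch: observed p-value as the threshold, power computed at that threshold
p_obs = reproduce_result$p.value
power_at_p = pwr.t2n.test(n1 = length(meututor_t2), n2 = length(meututor_t3),
                          d = abs(obs_difference)/poolsd, sig.level = p_obs,
                          power = NULL, alternative = "two.sided")$power
prior_h1 = 0.5
((1 - prior_h1) * p_obs) / ((1 - prior_h1) * p_obs + prior_h1 * power_at_p)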
test.fpr(meututor_t2, meututor_t3, reproduce_result$p.value, abs(obs_difference), poolsd, 0.5)
## [1] 0.1164229 7.5893740
Colquhoun’s function returns the FPR and the likelihood ratio of the probabilities of the hypotheses.
Per the first value returned here, there is roughly a 12% risk that this significant result is a false positive (i.e., that there is no real effect despite the significant p-value).
Interpretation: apart from directly considering what is practically tolerable given the study and its consequences, the FPR can be used to benchmark against other studies. Colquhoun’s articles demonstrate through substantial simulation that an FPR of 20-30% is common among published ‘significant’ findings. While it is not as simple as saying that 12% is lower than the norm in others’ work and therefore ‘good’, that comparison is a reasonable starting point. Another way of thinking about it is that we would like the FPR to be somewhere near 0.05, not because that threshold is particularly important in itself, but because it is around the level of risk we are usually willing to tolerate.
We can check against Colquhoun’s web calculator for FPR at http://fpr-calc.ucl.ac.uk/
FPR from web calculator with MeuTutor data: 0.1179
The modified FPR function appears to be working correctly, with the small difference explainable by floating-point handling / rounding in the two implementations. It would not be meaningful to report the percentage FPR to any decimal places anyway.
References:
Colquhoun, D 2019, ‘The False Positive Risk: A Proposal Concerning What to Do About p-Values’, The American Statistician, vol. 73, sup. 1, pp. 192–201, doi: 10.1080/00031305.2018.1529622.
Tenório, T, Bittencourt, II, Isotani, S, Pedro, A, Ospina, P & Tenório, D 2017, ‘Dataset of two experiments of the application of gamified peer assessment model into online learning environment MeuTutor’, Data in Brief, vol. 12, no. C, pp. 433–437, doi: 10.1016/j.dib.2017.04.032.