Libraries and functions setup
library(readxl)
library(effectsize)
library(pwr)
# Colquhoun's R function from 2019 supplementary materials
calc.FPR = function(nsamp,pval,delta,sigma,prior)
{
df=2*(nsamp-1)
# # calculate power, for below
# myp=power.t.test(n=nsamp,sd=sigma,delta=delta,sig.level=pval,type="two.sample",alternative="two.sided",power=NULL)
# power = myp$power
# Note FPR doesn't need calculation of power for p-equals case
#
# #under H0, use central t distribution
tcrit=qt((1-pval/2),df,ncp=0)
x0=tcrit
y0=dt(x0,df,0)
#
# under H1 use non-central t distribution
sdiff=sigma*sqrt(2/nsamp) # standard error of the difference between the two group means (sigma is the within-group SD)
ncp1=delta/sdiff # non-centrality parameter
x1=x0 #tcrit
y1=dt(x1,df,ncp=ncp1)
# Calc false positive risk
p0=2*y0
p1=y1
LR=y1/(2*y0)
FPR=((1-prior)*p0)/(((1-prior)*p0) + prior*p1)
output=c(FPR,LR)
return(output)
}
# Example
## Data
myn=10
mysampA=c(0.7,-1.6,-0.2,-1.2,-0.1,3.4,3.7,0.8,0.0,2)
mysampB=c(1.9,0.8,1.1,0.1,-0.1,4.4,5.5,1.6,4.6,3.4)
#myobsdif= -1.58
## Prepare input variables for function
meanA=mean(mysampA)
meanB=mean(mysampB)
myobsdif= meanA-meanB
sdA=sd(mysampA)
sdB=sd(mysampB)
#do t test "by hand"
mynA=myn
mynB=myn
varpool= ((mynA-1)*sdA^2 +(mynB-1)*sdB^2)/(mynA+mynB-2)
sdpool=sqrt(varpool)
normeffectsize=abs(myobsdif)/sdpool
sdiff=sqrt(varpool/mynA + varpool/mynB)
tval=abs(myobsdif)/sdiff
df=mynA+mynB-2
pval=2*(1-pt(tval,df,0))
pow1=power.t.test(n = myn,
delta = abs(myobsdif),
sd = sdpool, sig.level = 0.05,power = NULL,
type = "two.sample",alternative = "two.sided",
strict = FALSE)
powerobs=pow1$power
# end of t test "by hand"
prior=0.5
nsamp=myn
delta=abs(myobsdif)
sigma=sdpool
myfpr=calc.FPR(nsamp,pval,delta,sigma,prior)
myfpr # returns minimum FPR for p(H1) = 0.5 and likelihood ratio for H1/H0
# While Colquhoun's website calculator handles uneven sample sizes, the function in the supplementary materials does not
test.fpr = function(sample_a,
sample_b,
pval, # observed p_value
delta, # absolute observed difference in means
sigma, # pooled SD
prior) # prior prob of real effect
{
df = t.test(sample_a, sample_b)$parameter
meanA = mean(sample_a)
meanB = mean(sample_b)
obsdif = meanA - meanB
sdA = sd(sample_a)
sdB = sd(sample_b)
mynA = length(sample_a)
mynB = length(sample_b)
varpool = ((mynA-1)*sdA^2 +(mynB-1)*sdB^2)/(mynA+mynB-2)
sdpool = sqrt(varpool)
normeffectsize = abs(obsdif)/sdpool # use the difference computed from the function arguments, not the global myobsdif
sdiff = sqrt(varpool/mynA + varpool/mynB)
#tval = abs(myobsdif)/sdiff
#pval = 2*(1 - pt(tval,df,0))
# (power calculation here)
tcrit = qt((1-pval/2), df, ncp=0)
x0 = tcrit
y0 = dt(x0,df,0)
ncp1 = delta/sdiff
x1 = x0 #tcrit
y1 = dt(x1,df,ncp=ncp1)
# Calculate false positive risk
p0 = 2*y0
p1 = y1
LR = y1/(2*y0)
FPR = ((1-prior)*p0)/(((1-prior)*p0) + prior*p1)
output = c(FPR, LR)
return(output)
}
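As a quick sanity check (not part of Colquhoun's materials), the modified function can be run on the toy example above. It should closely agree with calc.FPR; any small difference comes from test.fpr using the Welch degrees of freedom from t.test() rather than 2*(nsamp - 1).
# Sanity check: compare the modified function with calc.FPR on the toy example.
# Small differences are expected because of the Welch degrees of freedom.
test.fpr(mysampA, mysampB, pval, delta, sigma, prior)
myfpr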
Motivation: report p-values with the context of a False Positive Risk for the single test conducted, i.e., understand an NHST and its result like a screening / binary classification problem. Why?
A note on the name: confusingly, False Positive Risk is not the same as the False Positive Rate of the confusion matrix literature. Colquhoun’s False Positive Risk is analogous to what screening terminology calls the False Discovery Rate, FP / (FP + TP), the proportion of all positive results that are false. The False Positive Rate, FP / (FP + TN), is what the p-value itself speaks to: the rate of false positives among only those cases where there is no real effect.
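To make the distinction concrete, here is a small illustrative calculation with assumed numbers (not from the study): imagine 1000 tests of which half have a real effect, a significance level of 0.05, and power of 0.8.
# Illustrative screening arithmetic with assumed values (not study data)
n_tests    = 1000
prevalence = 0.5   # proportion of tests where a real effect exists (the prior)
alpha      = 0.05  # false positive rate: P(significant | no effect)
power_h1   = 0.8   # P(significant | real effect)
false_pos  = n_tests * (1 - prevalence) * alpha # 25 false positives
true_pos   = n_tests * prevalence * power_h1    # 400 true positives
false_pos / (false_pos + true_pos)              # false positive risk, about 0.06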
Slide from David Colquhoun’s talk on YouTube.
Tenório, T, Bittencourt, II, Isotani, S, Pedro, A, Ospina, P & Tenório, D 2017, ‘Dataset of two experiments of the application of gamified peer assessment model into online learning environment MeuTutor’, Data in Brief, vol. 12, no. C, pp. 433–437, doi: 10.1016/j.dib.2017.04.032.
Experiment 1: a time-metric study across three conditions. The specific comparison from the study we will use for this notebook is the time (in seconds) to correct one performed activity, comparing conditions T2 and T3.
The study data is loaded and pre-processed below.
meututor = read_xlsx("data/tenorio_dataset/Experiment_one.xlsx", sheet = 3)
meututor = meututor[-1,] # remove header row
meututor$...2 <- as.numeric(meututor$...2) # convert chr to numeric
meututor$...3 <- as.numeric(meututor$...3)
meututor_t2 <- meututor$...2 # assign to separate vectors
meututor_t3 <- meututor$...3
meututor_t3 <- meututor_t3[!is.na(meututor_t3)] # remove broadcasted NA values
They report using a Welch t-test for unequal variances in R for the T2 vs T3 comparison, so let’s reproduce the result:
reproduce_result <- t.test(meututor_t2,
meututor_t3)
reproduce_result
##
## Welch Two Sample t-test
##
## data: meututor_t2 and meututor_t3
## t = -2.4605, df = 21.938, p-value = 0.02222
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -308.58038 -26.28629
## sample estimates:
## mean of x mean of y
## 501.6667 669.1000
# note Welch degrees of freedom stored in:
#reproduce_result$parameter
This reproduces the p-value of 0.022 originally reported in the study / provided files. In terms of the estimated effect size, the 95% CI suggests it takes the teacher between about 26 seconds and 5 minutes longer to mark a completed task than the peer feedback system.
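As a quick check of those numbers, the CI limits (in seconds) can be read off the t-test object:
# Absolute 95% CI limits for the difference in means, in seconds
abs(reproduce_result$conf.int)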
Next, we need a prior probability of a real effect: a theorised proportion of hypothesis tests of this kind in which there would be a real effect to detect. This is analogous to disease prevalence in the medical screening context.
Colquhoun suggests 50% as a reasonable choice absent any other information. If you are less confident in the plausibility of H1 (e.g., nothing like the hypothesis appears in the literature, or a previous study returned a null result), you might choose a lower prior probability. There is no valid reason to set a higher one.
In the study, conditions T2 and T3 have different sample sizes; both numbers are needed for the pooled standard deviation calculation.
Pooled sample standard deviation: \[ S_p=\sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1 + n_2 - 2}} \] We will use this in the calculation of the effect size; the pooled estimate weights each sample’s standard deviation by its degrees of freedom.
Effect size:
The effect size is the difference in means between the conditions. When the samples have different standard deviations, the effect size needs to be expressed in terms of the pooled standard deviation (this normalised effect is also known as Hedges’ g).
std_dev_pooled = function(sample_a, sample_b){
  n_a = length(sample_a)
  n_b = length(sample_b)
  pooled_sd = sqrt(((n_a - 1)*sd(sample_a)^2 + (n_b - 1)*sd(sample_b)^2)/(n_a + n_b - 2))
  return(pooled_sd)
}
obs_difference = mean(meututor_t2) - mean(meututor_t3)
poolsd = std_dev_pooled(meututor_t2, meututor_t3)
stopifnot(round(std_dev_pooled(meututor_t2, meututor_t3), 3) == round(effectsize::sd_pooled(meututor_t2, meututor_t3), 3))
meu_normeffect = abs(obs_difference)/poolsd
meu_normeffect
## [1] 0.9174353
Cohen, the statistician whose work on power analysis is seminal, proposed heuristics for what count as small, medium, and large effects, but researchers are cautioned to frame these within their own discipline.
As this value is greater than 0.8, it constitutes a large effect size according to Cohen.
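If preferred, the effectsize package (already loaded) can apply Cohen’s 1988 thresholds programmatically; this is just a convenience check of the same heuristic, assuming a recent version of the package.
# Optional check of the effect size against Cohen's (1988) thresholds
effectsize::interpret_cohens_d(meu_normeffect)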
There is also a recommended bias correction for studies where total n is < 50.
grand_n = length(meututor_t2) + length(meututor_t3)
meu_normeffect * ((grand_n - 3) / (grand_n - 2.25)) * sqrt(((grand_n - 2) / grand_n))
## [1] 0.8583134
With the bias correction applied, this is still a large effect.
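For comparison, the effectsize package also provides a bias-corrected estimate via hedges_g(); note it applies a different small-sample correction factor than the formula above, so expect a similar but not identical value.
# Bias-corrected effect size from the effectsize package (different correction
# factor than the manual formula above, so values may not match exactly)
effectsize::hedges_g(meututor_t2, meututor_t3)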
Currently, the function test.fpr does not use the corrected Hedges’ g in the calculation of the FPR, but this can / should be implemented (one possible approach is sketched below).
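A minimal sketch of one possible approach, assumed rather than taken from Colquhoun’s materials: shrink the raw mean difference by the same correction factor before passing it as delta, which scales the non-centrality parameter in the same proportion.
# Hypothetical sketch (not part of the original materials): apply the
# small-sample correction to the raw difference before passing it as delta
correction_factor = ((grand_n - 3) / (grand_n - 2.25)) * sqrt((grand_n - 2) / grand_n)
corrected_delta   = abs(obs_difference) * correction_factor
# test.fpr(meututor_t2, meututor_t3, reproduce_result$p.value,
#          corrected_delta, poolsd, 0.5)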
Post-hoc power analyses are frowned upon when used to justify or review the study design (they invite fallacies like ‘the study was under-powered and still found an effect, which supports the strength of the effect’). Generally, power should be analysed beforehand to determine an appropriate sample size given the anticipated effect size. Here it serves a different purpose, since we are modelling the probability of a false positive result for the study that has already happened.
The other thing to keep in mind is that delta, or Cohen’s d, properly uses a population standard deviation (e.g., one estimated from a meta-analysis of past studies), whereas this post-hoc power calculation uses the pooled standard deviation of the observed data. This is again appropriate, given the goal is to model an FPR for this one study.
# Note: for unequal sample sizes this needs to be the `pwr.t2n.test` function
meu_power_summary <- pwr.t2n.test(n1 = length(meututor_t2),
                                  n2 = length(meututor_t3),
                                  d = abs(obs_difference)/poolsd,
                                  sig.level = 0.05, power = NULL,
                                  alternative = "two.sided")
meu_power <- meu_power_summary$power
meu_power_summary
##
## t test power calculation
##
## n1 = 18
## n2 = 10
## d = 0.9174353
## sig.level = 0.05
## power = 0.6101538
## alternative = two.sided
The post-hoc power does not need to be directly used in the FPR calculation but should be reported with it.
Why is it a minimum? Because the ‘true’ P(H1) can only be lower than (or equal to) the 0.5 prior given. This means the ‘real effect’ arm of the screening tree can only contain a smaller share of the NHST results, and the power arm can only detect effects from within that smaller share, so the FPR can only be higher (the sweep below makes this concrete).
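The sweep below (not in the original materials) evaluates test.fpr at a few priors at or below 0.5; the FPR only rises as the prior drops.
# FPR for priors at and below 0.5; the first element of each result is the FPR
sapply(c(0.5, 0.3, 0.1), function(p) {
  test.fpr(meututor_t2, meututor_t3, reproduce_result$p.value,
           abs(obs_difference), poolsd, p)[1]
})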
What does ‘p-equals’ mean? The calculation conditions on the p-value being (essentially) equal to the observed value, rather than on it being the observed value or anything smaller (the ‘p-less-than’ case). Detail on the justification for this is in Colquhoun’s article; a sketch of the p-less-than alternative follows below.
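For contrast, here is a sketch of the ‘p-less-than’ version, an assumed formulation following Colquhoun’s framework rather than his code: treat the observed p-value as the significance threshold and use the power to reach that threshold under H1.
# 'p-less-than' sketch: observed p-value as the threshold, power computed at that threshold
p_obs = reproduce_result$p.value
power_at_p = pwr.t2n.test(n1 = length(meututor_t2), n2 = length(meututor_t3),
                          d = abs(obs_difference)/poolsd, sig.level = p_obs,
                          power = NULL, alternative = "two.sided")$power
prior_h1 = 0.5
((1 - prior_h1) * p_obs) / ((1 - prior_h1) * p_obs + prior_h1 * power_at_p)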
test.fpr(meututor_t2, meututor_t3, reproduce_result$p.value, abs(obs_difference), poolsd, 0.5)
## [1] 0.1164229 7.5893740
Colquhoun’s function returns the FPR and the likelihood ratio of the probabilities of the hypotheses.
Per the first value returned here, there is roughly a 12% risk that this significant result is a false positive (i.e., that there is no real effect despite the significant p-value).
Interpretation: apart from directly considering what is practically tolerable given the study and its consequences, the FPR can be used to benchmark against other studies. Colquhoun’s articles demonstrate through substantial simulation that an FPR of 20-30% is common among published ‘significant’ findings. While it is not as simple as saying that 12% is lower than the norm in others’ work and therefore ‘good’, that comparison is a reasonable starting point. Another way of thinking about it is that we would like the FPR to be somewhere near 0.05, not because that threshold is particularly important in itself, but because it is around the level of risk we are usually willing to tolerate.
We can check against Colquhoun’s web calculator for FPR at http://fpr-calc.ucl.ac.uk/
FPR from web calculator with MeuTutor data: 0.1179
The modified FPR function appears to be working correctly, with the small difference explainable by floating-point handling / rounding in the two implementations. It would not be meaningful to report the percentage FPR to any decimal places anyway.
References:
Colquhoun, D 2019, ‘The False Positive Risk: A Proposal Concerning What to Do About p-Values’, The American Statistician, vol. 73, sup. 1, pp. 192–201, doi: 10.1080/00031305.2018.1529622.
Tenório, T, Bittencourt, II, Isotani, S, Pedro, A, Ospina, P & Tenório, D 2017, ‘Dataset of two experiments of the application of gamified peer assessment model into online learning environment MeuTutor’, Data in Brief, vol. 12, no. C, pp. 433–437, doi: 10.1016/j.dib.2017.04.032.