Experiments on unconscious cognition (e.g., studies of masked priming or continuous flash suppression) frequently include a supplemental “awareness test”, which is designed to assess whether participants were actually aware of stimuli that were supposedly presented subliminally. For example, in an ancillary awareness test for a masked priming study, participants might complete a set of trials in which they have to guess whether a temporally masked word is present. If a participant cannot perform this task above chance, researchers will typically assume that the participant was not aware of the stimuli and will retain their data in the analysis.
Here, I test the utility of a Bayesian approach to analyzing data from awareness tests. The key message is that, if participants are aware of the stimuli but not strongly so, then a naive Bayesian approach will still frequently classify participants as unaware. As such, an unthinking use of Bayes factors could severely contaminate the scientific literature on unconscious cognition.
There are two major problems with current approaches to awareness tests. First, most assessments of awareness use a statistical framework based on null hypothesis testing. Null hypothesis testing has many virtues, but in this case it is flawed because it cannot quantify evidence that participants are not aware of stimuli. It can only quantify evidence that participants are aware of stimuli (or more strictly, quantify whether the evidence is inconsistent with the null hypothesis that participants are not aware of the stimuli).
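As a minimal illustration (mine, not part of the analyses below): imagine a participant who scores 12 out of 20 on a present/absent awareness check. A standard binomial test against chance is nowhere near significant, but that non-significant p-value only tells us that the data are unsurprising under the null; it does not quantify support for the claim that the participant was at chance.
# Illustration: a non-significant frequentist test is not evidence of unawareness
binom.test(12, 20, p = 0.5, alternative = "greater")
# The one-sided p-value is well above .05, but it measures surprise under H0 (chance performance);
# it says nothing about how much the data support H0 over, say, a true accuracy of 0.55.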
The second problem is discussed in recent papers by Vadillo, Shanks, and colleagues: awareness tests are typically underpowered. That is, the typical design of these tests means that, even when participants are aware of the stimuli, an analysis of the test will fail to produce a significant result more often than researchers would normally accept. While most researchers aim to power their studies to detect a true effect 80% of the time, Vadillo et al. show that awareness tests typically have only about 20% power. This critique interacts with the null hypothesis testing framework in an important way: when a hypothesis test is not statistically significant, it is unclear whether the effect is truly null, and underpowered tests will frequently fail to reach significance even when the null hypothesis is false.
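As a rough check on this point (my own sketch, not a calculation from Vadillo et al.), the exact power of a one-sided binomial test against chance can be computed directly for a participant whose true accuracy is 0.55; with the trial counts typical of awareness checks, power falls far short of the conventional 80% target.
# Sketch: exact power of a one-sided binomial test of H0: p = 0.5 at alpha = .05,
# for a participant whose true accuracy is 0.55. The trial counts are illustrative.
binom_power <- function(n, p_true = 0.55, p_null = 0.5, alpha = 0.05) {
  crit <- qbinom(1 - alpha, n, p_null) + 1 # smallest number of correct responses that is significant
  1 - pbinom(crit - 1, n, p_true)          # probability of reaching that count when accuracy is p_true
}
round(sapply(c(20, 50, 100, 200, 400, 800), binom_power), 2)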
Bayesian statistics present a potential solution to the first problem as they are able to quantify evidence for the null hypothesis. In the context of awareness tests, this means that they can not only assess whether participants are aware of a stimulus, but can also assess whether participants are not aware of a stimulus. Bayesian tests thus seem ideal.
However, the unthinking use of Bayesian tests will still interact with statistical power. Bayesian tests, like frequentist tests, can produce false positives (incorrectly claiming that there is strong evidence for the alternative hypothesis) but, unlike frequentist tests, they can also produce false “anti-positives”: incorrectly claiming that there is good evidence for the null (and, of course, they can also produce more traditional null results, in which there is no good evidence either way). The likelihood of producing false positives and false anti-positives depends upon a variety of factors, but particularly upon the size of the underlying effect and the size of the assessed sample; that is, it depends upon statistical power.
Here, I used simulations to assess the efficacy of using Bayesian tests to assess awareness. I wanted to know what sort of sample size was necessary to ensure a high rate of true positives and a low rate of false “anti-positives”.
I simulated the binomial responses of thousands of participants taking part in a simple correct/incorrect awareness task. The probability that a participant would respond correctly on any trial was always 0.55, and I varied the number of trials that each simulated participant completed, from 20 to 800. For each number of trials (i.e., each power level), I simulated 2000 participants.
I analyzed each simulated participant’s responses using a Bayesian binomial test against chance (0.5). For each number of trials, I then calculated the proportion of tests that returned a false anti-positive result (i.e., incorrectly claimed that there was substantial support for an underlying accuracy of 0.5 [B.F. < 0.33]), the proportion of tests that returned a true positive result (i.e., correctly claimed that there was strong support for an underlying accuracy different from 0.5 [B.F. > 3]), and the ratio of false anti-positives to true positives. I also calculated the proportion of tests that reported no evidential value in the sample (i.e., B.F. between 0.33 and 3).
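For concreteness, here is how a single simulated participant would be scored under this scheme (a sketch of mine; the counts are illustrative, and the full simulation follows below). One detail worth flagging: proportionBF, from the BayesFactor package, stores the Bayes factor on the log scale in its bayesFactor slot, which is why the simulation code exponentiates it; extractBF() is an equivalent way to obtain the Bayes factor on its natural scale.
# One simulated participant with 55% true accuracy completing 100 trials (illustrative numbers)
library(BayesFactor)
set.seed(1) # seed only so this example is reproducible
n_trials <- 100
correct <- sum(rbinom(n_trials, 1, 0.55)) # number of correct responses
bf_obj <- proportionBF(correct, n_trials, p = 0.5)
bf <- exp(bf_obj@bayesFactor$bf) # the slot holds the natural-log Bayes factor
# bf <- extractBF(bf_obj)$bf     # equivalent, already on the natural scale
if (bf > 3) {
  classification <- "aware"      # strong support for above-chance accuracy
} else if (bf < 0.33) {
  classification <- "unaware"    # substantial support for chance accuracy
} else {
  classification <- "uncertain"  # no good evidence either way
}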
# Packages used throughout: BayesFactor for proportionBF, ggplot2 and cowplot for the plots, dplyr for the summaries
# (stat_summary_bin's mean_cl_boot summary also requires the Hmisc package to be installed)
library(BayesFactor)
library(ggplot2)
library(cowplot)
library(dplyr)
acc = 0.55 # Participants' true accuracy (probability of a correct response on each trial)
iter = 2000 # Number of simulated participants per trial count
trials = c(20,40,60,80,100,150,200,250,300,350,400,500,800) # Number of trials each participant produces
null_power_sims = data.frame(trials = rep(trials,each = iter), iteration = rep(1:iter,length(trials)), bf = NA)
for (x in trials){
for (y in 1:iter){
null_power_sims[null_power_sims$trials == x & null_power_sims$iteration == y,]$bf <-
exp(proportionBF(sum(rbinom(x,1,acc)),x,p=0.5)@bayesFactor$bf)
}
}
null_power_sims$Test_null <- ifelse(null_power_sims$bf < 0.33,1,0) # substantial evidence for chance performance
null_power_sims$Test_pos <- ifelse(null_power_sims$bf > 3,1,0) # strong evidence for above-chance performance
null_power_sims$Test_uncertain <- ifelse(null_power_sims$Test_null == 0 & null_power_sims$Test_pos == 0,1,0) # no clear evidence either way
test_unc_null <- ggplot(null_power_sims, aes(x = trials, y = Test_uncertain))+
ylim(c(0,1))+
stat_summary_bin(fun.data = mean_cl_boot)+
theme_cowplot()+
ggtitle("Probability of an uncertain result")
test_null <- ggplot(null_power_sims, aes(x = trials, y = Test_null))+
ylim(c(0,1))+
stat_summary_bin(fun.data = mean_cl_boot)+
theme_cowplot()+
ggtitle("Probability of a false anti-positive result")
test_pos <- ggplot(null_power_sims, aes(x = trials, y = Test_pos))+
ylim(c(0,1))+
stat_summary_bin(fun.data = mean_cl_boot)+
theme_cowplot()+
ggtitle("Probability of a true positive result")
null_power_sum <- null_power_sims %>%
dplyr::select(trials,Test_null,Test_pos)%>%
dplyr::group_by(trials) %>%
dplyr::summarize(sum_null = sum(Test_null), sum_pos = sum(Test_pos)) %>%
dplyr::mutate(ratio_results = sum_null/sum_pos)
ratio_plot <- ggplot(null_power_sum, aes(x = trials, y = ratio_results))+
geom_point(cex = 3)+
theme_cowplot()+
ggtitle("Ratio of false anti-positives to true positives")
In this first analysis, I conducted a “cookbook” Bayesian test that weighed a null hypothesis in which participants were at chance on the awareness test against an alternative hypothesis in which participants were not at chance (so that their true accuracy could lie anywhere from 0 to 1).
The results of this analysis are quite unnerving.
plot_grid(test_null,test_pos,ratio_plot,test_unc_null,labels = "AUTO")
Panel A shows that for low sample sizes, there is a very low probability of a false anti-positive, but Panel B shows that for those same low sample sizes, there is also a very low probability of a true positive. That is to say, awareness tests that only use 20 to 40 trials are essentially useless, and cannot distinguish whether or not participants were aware.
Intuitively, one might think that increasing the sample size would increase the probability of a true positive while having little effect on the probability of a false anti-positive. In fact, Panel A shows that as the sample size increases, there is also a drastic increase in *false anti-positives* (i.e., the test frequently concludes that participants were at chance on the task). For example, when participants complete a task with 200 or so trials, the probability of a false anti-positive approaches 0.5. That is to say, almost half of the participants in this set of simulations would have been classified as “unaware of the stimuli”, even though the design of the simulation meant that they were somewhat aware of the stimuli.
As sample size increases above approximately 150 trials, the probability of being incorrectly classified as “unaware” does decrease somewhat, and the probability of being correctly classified as “aware” does increase. But these changes are slow: even when participants complete 800 trials, approximately 20% are still classified as “unaware”, while only about 55% are classified as “aware”.
Note that for most sample sizes, the Bayesian test concludes that there is no good evidence either way (Panel D). That is to say, it is uncertain as to whether or not participants were aware of the stimuli.
One potential handicap for the Bayesian test in this first case is that it has to compare a very well-specified null hypothesis (participants are at chance, 0.5) to an under-specified alternative hypothesis (participants are not at chance, and their performance could lie anywhere from 0 to 1). Does the situation improve if we use a more constrained alternative hypothesis?
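The code below implements this via proportionBF's nullInterval argument. When nullInterval is supplied, the returned object contains two Bayes factors against the point null: the first for the alternative restricted to the specified interval, and the second for its complement; indexing the bayesFactor slot with [1] therefore selects the one-sided comparison. A small illustration (the counts are made up):
# Illustration of nullInterval: restrict the alternative to above-chance accuracy
library(BayesFactor)
bf_both <- proportionBF(60, 100, p = 0.5, nullInterval = c(0.5, 1))
exp(bf_both@bayesFactor$bf[1]) # accuracy in [0.5, 1] versus the point null of 0.5
exp(bf_both@bayesFactor$bf[2]) # the complementary interval versus the point null of 0.5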
onesided_power_sims = data.frame(trials = rep(trials,each = iter), iteration = rep(1:iter,length(trials)), bf = NA)
for (x in trials){
for (y in 1:iter){
onesided_power_sims[onesided_power_sims$trials == x & onesided_power_sims$iteration == y,]$bf <-
exp(proportionBF(sum(rbinom(x,1,acc)),x,p=0.5,nullInterval = c(0.5,1))@bayesFactor$bf[1])
}
}
onesided_power_sims$Test_null <- ifelse(onesided_power_sims$bf < 0.33,1,0)
onesided_power_sims$Test_pos <- ifelse(onesided_power_sims$bf > 3,1,0)
onesided_power_sims$Test_uncertain <- ifelse(onesided_power_sims$Test_null == 0 & onesided_power_sims$Test_pos == 0,1,0)
test_unc_onesided <- ggplot(onesided_power_sims, aes(x = trials, y = Test_uncertain))+
ylim(c(0,1))+
stat_summary_bin(fun.data = mean_cl_boot)+
theme_cowplot()+
ggtitle("Probability of an uncertain result")
test_null_onesided <- ggplot(onesided_power_sims, aes(x = trials, y = Test_null))+
ylim(c(0,1))+
stat_summary_bin(fun.data = mean_cl_boot)+
theme_cowplot()+
ggtitle("Probability of a false anti-positive result")
test_pos_onesided <- ggplot(onesided_power_sims, aes(x = trials, y = Test_pos))+
ylim(c(0,1))+
stat_summary_bin(fun.data = mean_cl_boot)+
theme_cowplot()+
ggtitle("Probability of a true positive result")
null_power_sum_onesided <- onesided_power_sims %>%
dplyr::select(trials,Test_null,Test_pos)%>%
dplyr::group_by(trials) %>%
dplyr::summarize(sum_null = sum(Test_null), sum_pos = sum(Test_pos)) %>%
dplyr::mutate(ratio_results = sum_null/sum_pos)
ratio_plot_onesided <- ggplot(null_power_sum_onesided, aes(x = trials, y = ratio_results))+
geom_point(cex = 3)+
theme_cowplot()+
ggtitle("Ratio of false anti-positives to true positives")
Here, we compare the null hypothesis that performance is at chance against an alternative hypothesis that performance is above chance (specifically, that it falls between 0.5 and 1).
plot_grid(test_null_onesided,test_pos_onesided,ratio_plot_onesided,test_unc_onesided,labels = "AUTO")
The results are still not good! Although the overall rate of false anti-positives has declined, it is still the case that, at many power levels, nearly one quarter of participants are classified as unaware of the stimuli. In addition, for our smallest sample sizes (20 and 40 trials), some participants are now incorrectly classified as unaware of the stimuli (which was not the case in the first simulation).
More positively, the probability of a true positive result has also increased somewhat, but the increase is small. At smaller sample sizes (fewer than 200 trials per participant), participants are more likely to be incorrectly classified as unaware of the stimuli than to be correctly classified as aware of them.
As before, for most sample sizes, the Bayesian test concludes that there is no good evidence either way. That is to say, it is uncertain as to whether or not participants were aware of the stimuli.
In an awareness test, it is quite unlikely that participants would perform above 75% correct. How do things change if we assess the null against an even narrower hypothesis, where performance is above chance but not likely to be greater than 75% correct?
narrow_power_sims = data.frame(trials = rep(trials,each = iter), iteration = rep(1:iter,length(trials)), bf = NA)
for (x in trials){
for (y in 1:iter){
narrow_power_sims[narrow_power_sims$trials == x & narrow_power_sims$iteration == y,]$bf <-
exp(proportionBF(sum(rbinom(x,1,acc)),x,p=0.5,nullInterval = c(0.5,0.75))@bayesFactor$bf[1])
}
}
narrow_power_sims$Test_null <- ifelse(narrow_power_sims$bf < 0.33,1,0)
narrow_power_sims$Test_pos <- ifelse(narrow_power_sims$bf > 3,1,0)
narrow_power_sims$Test_uncertain <- ifelse(narrow_power_sims$Test_null == 0 & narrow_power_sims$Test_pos == 0,1,0)
test_unc_narrow <- ggplot(narrow_power_sims, aes(x = trials, y = Test_uncertain))+
ylim(c(0,1))+
stat_summary_bin(fun.data = mean_cl_boot)+
theme_cowplot()+
ggtitle("Probability of an uncertain result")
test_null_narrow <- ggplot(narrow_power_sims, aes(x = trials, y = Test_null))+
ylim(c(0,1))+
stat_summary_bin(fun.data = mean_cl_boot)+
theme_cowplot()+
ggtitle("Probability of a false anti-positive result")
test_pos_narrow <- ggplot(narrow_power_sims, aes(x = trials, y = Test_pos))+
ylim(c(0,1))+
stat_summary_bin(fun.data = mean_cl_boot)+
theme_cowplot()+
ggtitle("Probability of a true positive result")
null_power_sum_narrow <- narrow_power_sims %>%
dplyr::select(trials,Test_null,Test_pos)%>%
dplyr::group_by(trials) %>%
dplyr::summarize(sum_null = sum(Test_null), sum_pos = sum(Test_pos)) %>%
dplyr::mutate(ratio_results = sum_null/sum_pos)
ratio_plot_narrow <- ggplot(null_power_sum_narrow, aes(x = trials, y = ratio_results))+
geom_point(cex = 3)+
theme_cowplot()+
ggtitle("Ratio of false anti-positives to true positives")
Even in this case, with a very narrow hypothesis, participants are still incorrectly classified as “unaware” almost 25% of the time, particularly when they complete a moderately large number of trials (between about 60 and 300 trials per participant).
With this narrow hypothesis, the ratio of false anti-positives to true positives is much better than in our previous simulations (more participants are classified as aware than unaware), but that fact does not provide a great deal of comfort because it is still the case that the rate of false anti-positives is too high, and the rate of true positives is too low.
Instead, as in the previous analyses, the Bayesian test most often concludes that there is no good evidence either way. That is to say, it is uncertain as to whether or not participants were aware of the stimuli.
plot_grid(test_null_narrow,test_pos_narrow,ratio_plot_narrow,test_unc_narrow,labels = "AUTO")
What happens with a very strong prior, under which performance is not expected to be greater than 60% correct (i.e., 10 percentage points above chance, and only 5 above the true accuracy of 0.55)?
really_narrow_power_sims = data.frame(trials = rep(trials,each = iter), iteration = rep(1:iter,length(trials)), bf = NA)
for (x in trials){
for (y in 1:iter){
really_narrow_power_sims[really_narrow_power_sims$trials == x & really_narrow_power_sims$iteration == y,]$bf <-
exp(proportionBF(sum(rbinom(x,1,acc)),x,p=0.5,nullInterval = c(0.5,0.6))@bayesFactor$bf[1])
}
}
really_narrow_power_sims$Test_null <- ifelse(really_narrow_power_sims$bf < 0.33,1,0)
really_narrow_power_sims$Test_pos <- ifelse(really_narrow_power_sims$bf > 3,1,0)
really_narrow_power_sims$Test_uncertain <- ifelse(really_narrow_power_sims$Test_null == 0 & really_narrow_power_sims$Test_pos == 0,1,0)
test_unc_really_narrow <- ggplot(really_narrow_power_sims, aes(x = trials, y = Test_uncertain))+
ylim(c(0,1))+
stat_summary_bin(fun.data = mean_cl_boot)+
theme_cowplot()+
ggtitle("Probability of an uncertain result")
test_null_really_narrow <- ggplot(really_narrow_power_sims, aes(x = trials, y = Test_null))+
ylim(c(0,1))+
stat_summary_bin(fun.data = mean_cl_boot)+
theme_cowplot()+
ggtitle("Probability of a false anti-positive result")
test_pos_really_narrow <- ggplot(really_narrow_power_sims, aes(x = trials, y = Test_pos))+
ylim(c(0,1))+
stat_summary_bin(fun.data = mean_cl_boot)+
theme_cowplot()+
ggtitle("Probability of a true positive result")
null_power_sum_really_narrow <- really_narrow_power_sims %>%
dplyr::select(trials,Test_null,Test_pos)%>%
dplyr::group_by(trials) %>%
dplyr::summarize(sum_null = sum(Test_null), sum_pos = sum(Test_pos)) %>%
dplyr::mutate(ratio_results = sum_null/sum_pos)
ratio_plot_really_narrow <- ggplot(null_power_sum_really_narrow, aes(x = trials, y = ratio_results))+
geom_point(cex = 3)+
theme_cowplot()+
ggtitle("Ratio of false anti-positives to true positives")
plot_grid(test_null_really_narrow,test_pos_really_narrow,ratio_plot_really_narrow,test_unc_really_narrow,labels = "AUTO")
While it is unlikely that participants in awareness tests are ever completely unaware of the stimuli, it is still interesting to see how this approach fares in that case: how often will it correctly classify people as not aware of the stimuli, and how often will it incorrectly classify them as aware?
acc = 0.5 # Now simulate truly unaware participants, whose responses are at chance
unaware_power_sims = data.frame(trials = rep(trials,each = iter), iteration = rep(1:iter,length(trials)), bf = NA)
for (x in trials){
for (y in 1:iter){
unaware_power_sims[unaware_power_sims$trials == x & unaware_power_sims$iteration == y,]$bf <-
exp(proportionBF(sum(rbinom(x,1,acc)),x,p=0.5,nullInterval = c(0.5,0.75))@bayesFactor$bf[1])
}
}
unaware_power_sims$Test_null <- ifelse(unaware_power_sims$bf < 0.33,1,0)
unaware_power_sims$Test_pos <- ifelse(unaware_power_sims$bf > 3,1,0)
unaware_power_sims$Test_uncertain <- ifelse(unaware_power_sims$Test_null == 0 & unaware_power_sims$Test_pos == 0,1,0)
test_unc_unaware <- ggplot(unaware_power_sims, aes(x = trials, y = Test_uncertain))+
ylim(c(0,1))+
stat_summary_bin(fun.data = mean_cl_boot)+
theme_cowplot()+
ggtitle("Probability of an uncertain result")
test_null_unaware <- ggplot(unaware_power_sims, aes(x = trials, y = Test_null))+
ylim(c(0,1))+
stat_summary_bin(fun.data = mean_cl_boot)+
theme_cowplot()+
ggtitle("Probability of a true anti-positive result")
test_pos_unaware <- ggplot(unaware_power_sims, aes(x = trials, y = Test_pos))+
ylim(c(0,1))+
stat_summary_bin(fun.data = mean_cl_boot)+
theme_cowplot()+
ggtitle("Probability of a false positive result")
null_power_sum_unaware <- unaware_power_sims %>%
dplyr::select(trials,Test_null,Test_pos)%>%
dplyr::group_by(trials) %>%
dplyr::summarize(sum_null = sum(Test_null), sum_pos = sum(Test_pos)) %>%
dplyr::mutate(ratio_results = sum_null/sum_pos)
ratio_plot_unaware <- ggplot(null_power_sum_unaware, aes(x = trials, y = ratio_results))+
geom_point(cex = 3)+
theme_cowplot()+
ggtitle("Ratio of true anti-positives to false positives")
Importantly, when there truly is no effect, we still need a large number of trials from each participant to be confident in classifying them as performing at chance. The rate of correct “unaware” classifications only reaches 80% when each participant completes hundreds and hundreds of trials.
plot_grid(test_null_unaware,test_pos_unaware,ratio_plot_unaware,test_unc_unaware,labels = "AUTO")