Rationale and Functions

While reviewing a paper on palliative care for Indigenous and non-Indigenous Australians for the International Journal of Environmental Research and Public Health (IJERPH; Woods et al., 2020), I noticed that the authors claimed that confidence in their results was bolstered by their large sample size. I pointed out in my review that this is not necessarily the case: given that their results were only near-significant, a large sample size made them less, rather than more, convincing, because of the Jeffreys-Lindley-Bartlett (or, more commonly, Lindley’s) paradox. Specifically, I wrote:

Because of the near-significance of the differences, a large overall sample size weakens the conclusions. The reason for this is Lindley’s Paradox, or that as sample size increases, a greater proportion of hypotheses become ‘significant’. In order to account for it, please scale the p values accordingly. If you don’t know how to do this, consult Good (1988) or Naaman (2016). A large sample size - since (with assumptions) it decreases sampling error - only increases the convincingness of a result with this in mind.

They responded by saying:

We accept in principle the validity of the Reviewer’s comment on the hazards of face-value p value interpretation, and (as detailed above) upon the recommendation of the Reviewer we have formally corrected the p-values for multiple tests. Specifically, we acknowledge the problems in this regard arising from large sample size, as identified by Lindley. We appreciate the Reviewer bringing to our attention the Naaman paper that provides a formal method to address this problem. However, this is not common practice, and having searched publications within Int J Environ Res Public Health and across the peer-reviewed health science literature more broadly, we have been unable to identify any instance in which such p-value scaling has been undertaken using the method developed by Naaman. We have been unable to find applicable codes in standard statistical software packages (including the Stata v.15 package used by the first author) that would expedite the implementation of Naaman’s (or equivalent) method.

In respect of this issue, we emphasise that in essence our study produced ‘null’ principal findings. We have concluded our Abstract by stating that our results ‘provide reassurance of reasonable equivalence’ of outcomes between the Indigenous and non-Indigenous patient populations under investigation. Consequently, the potential problem of spurious ‘significance’ (i.e., Type 1 error) of the type noted by the Reviewer is not relevant to the ‘take-home message’ of the study.

In consequence, we respectfully contend that the scaling procedure recommended by the Reviewer is not necessary.

I think this response is fine and I do not take much issue with it: it is respectfully worded, and the authors acknowledge the issue in their reply, if not in the paper itself, which is better than most. Because they implemented the corrections for multiple comparisons that I suggested, and they evidently went searching for an implementation of the scaling method, I believe they were not trying to avoid implementing it; they simply did not know how, which may be the normal situation for most researchers. This is OK. However, it is an example of a broader problem (i.e., one that is not confined to these authors; mentioning them is only illustrative): the difficulty of scripting functions and implementing a method is a barrier to progressive science that most researchers are unable to surmount, even when the task is very simple. Before I get to the meat of this comment, I want to explain what Lindley’s paradox is. For this purpose, I quote Sprenger (2013, pp. 734-735; I recommend reading the whole issue in which it was published) for the standard definition:

Lindley’s Paradox In a Gaussian model \(N(\theta, \sigma^2)\) with known variance \(\sigma^2\), \(H_0: \theta = \theta_0\), \(H_1: \theta \neq \theta_0\), assuming \(p(H_0) > 0\) and any regular proper prior distribution on \(\left \{\theta \neq \theta_0 \right \}\). Then, for any testing level \(\alpha\), we can find a sample size \(N(\alpha)\) and independent, identically distributed data \(x = (x_1, ..., x_N)\) such that

  1. The sample mean \(\bar{x}\) is significantly different from \(\theta_0\) at level \(\alpha\);
  2. \(p(H_0|x)\), that is, the posterior probability that \(\theta = \theta_0\), is at least as big as \(1-\alpha\).

Or put differently by Meehl (1967, abstract) nearly fifty years earlier:

Because physical theories typically predict numerical values, an improvement in experimental precision reduces the tolerance range and hence increases corroborability. In most psychological research, improved power of a statistical design leads to a prior probability approaching \(\frac{1}{2}\) of finding a significant difference in the theoretically predicted direction. Hence the corroboration yielded by “success” is very weak, and becomes weaker with increased precision.
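
To make the paradox concrete, here is a minimal numerical sketch of my own (it is not from Sprenger or Meehl, and the unit-normal prior on \(\theta\) under \(H_1\) and the equal prior odds are assumptions): hold the observed \(Z\) at the two-sided 5% cutoff and let \(n\) grow; the p-value stays at 0.05 while the posterior probability of the point null climbs toward 1.

#Lindley's paradox, numerically: x_i ~ N(theta, 1); H0: theta = 0 vs. H1: theta ~ N(0, tau2); p(H0) = 0.5

lindley_demo <- function(n, z = 1.96, tau2 = 1) {
  xbar <- z/sqrt(n) #a sample mean sitting exactly at the two-sided 0.05 cutoff
  m0 <- dnorm(xbar, mean = 0, sd = sqrt(1/n)) #marginal likelihood of xbar under H0
  m1 <- dnorm(xbar, mean = 0, sd = sqrt(tau2 + 1/n)) #marginal likelihood of xbar under H1
  m0/(m0 + m1)} #posterior probability of H0

round(sapply(c(10, 100, 10000, 1000000), lindley_demo), 2) #approximately 0.37, 0.60, 0.94, 0.99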

One way around this issue is to use Bayes factors (as notably recommended by E. J. Wagenmakers, among others); however, the dishonest use of priors can make this problematic. A related issue is correcting for multiple comparisons, but this can also allow researchers too much flexibility, because they do not have to disclose the actual number of comparisons they performed, only as many as are required to maintain significance for their headline conclusion with their chosen correction method (which may itself be incorrect). Good (1988, p. 391) suggested that, since “the Bayes factor against a sharp null hypothesis is roughly proportional to \(1/(P\sqrt{N})\)”, if you use a p-value, standardize it “to a fixed sample size, say \(N = 100\), by replacing \(P\) by \(\min(\frac{1}{2}, P\sqrt{N/100})\)”. Continuing, “Even a fixed standardized p-value does not correspond to the same weight of evidence for all occasions, but it is better in this respect than an ordinary p-value. I guess that standardized p-values will not become standard before the year 2000 [emphasis mine]. Of course if you are sure that there are only two simple statistical hypotheses, then there is little point in using p-values instead of Bayes factors…. [Such s]tandardized p-values exemplify the concept of a Bayes/non-Bayes compromise.”

I have suggested in a publication in review (as of May 11, 2020) that this method can incorporate common rules-of-thumb found in whatever discipline it is applied to. For example, Richard Gorsuch - the late esteemed factor analyst - suggested that the minimum sample size for a factor analysis should be \(5 \times K\), where \(K\) is the number of variables used to indicate the common factors being modeled. An interested psychometrician analyzing twelve variables would thus scale their p-value threshold according to \(P/\sqrt{N/60}\). But rules-of-thumb are ultimately arbitrary, and so is the choice of the reference value by which \(N\) is divided.
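
To make Good’s compromise concrete, here is a small R sketch of it (the function name good_std_p and the option of a reference size other than 100 - e.g., 60 for the Gorsuch example - are my own additions):

#Good's (1988) standardized p-value: min(1/2, P*sqrt(N/n_ref)), with n_ref = 100 by default

good_std_p <- function(p, n, n_ref = 100) {
  pmin(0.5, p*sqrt(n/n_ref))}

good_std_p(0.04, 1000) #a nominally significant p = 0.04 from N = 1000 standardizes to about 0.13
good_std_p(0.04, 1000, n_ref = 60) #the same p against the 5K reference for twelve variables (n_ref = 60)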

A similarly-minded compromise was offered by Naaman (2016), called “almost sure hypothesis testing”. The idea, briefly, is to replace the fixed significance threshold \(\alpha\) with a bandwidth that shrinks with the sample size, and which can also be adjusted to account for multiple comparisons. The bandwidth he found to be empirically satisfactory via simulations was \(h_1 = n^{-\frac{6}{5}}\); to incorporate multiple comparisons corrections like the Bonferroni, use the form \(h \propto \frac{n^{-p}}{m}\), where \(p>1\) and \(m\) is the number of comparisons to be corrected for. \(h_2 = n^{-2}\) was noted to be better for small-sample studies with many comparisons (i.e., \(m \approx n\)), since a probability of type-I error still exists, albeit at roughly \(mh \approx n^{-1}\). Since I like this method of scaling and I recommend using it, here are functions for it; the bandwidth itself takes the place of the conventional significance threshold (i.e., the usual \(\alpha = 0.05\)):

#Naaman-adjusted Z cutoff (N = sample size; S = sidedness, 1 or 2)

NZ <- function(N, S) {
  NZ = qnorm(1-(N^-(6/5))/S) #'1-' included instead of 'abs()'; change it if you don't like it
  return(NZ)}

#Naaman-adjusted p-value

NP <- function(N, S) {
  NP = 1-pnorm(qnorm(1-(N^(-6/5))/S))
  return(NP)}
  
#Test using Naaman's N's from Table 1

NT <- c(5, 10, 15, 25, 50, 100, 1000, 10000, 100000, 1000000)
SN <- NT; SZ1 <- NZ(NT, 1); SP1 <- NP(NT, 1); SZ2 <- NZ(NT, 2); SP2 <- NP(NT, 2)

print(df <- data.frame("n" = SN, "One-sided Z" = SZ1, "Two-sided Z" = SZ2, "One-sided p" = SP1, "Two-sided p" = SP2, "Digit Rule" = c(2, 3, 3, 3, 3, 4, 5, 6, 7, 8)))
##          n One.sided.Z Two.sided.Z One.sided.p Two.sided.p Digit.Rule
## 1        5        1.06        1.46    1.45e-01    7.25e-02          2
## 2       10        1.53        1.86    6.31e-02    3.15e-02          3
## 3       15        1.76        2.07    3.88e-02    1.94e-02          3
## 4       25        2.03        2.31    2.10e-02    1.05e-02          3
## 5       50        2.36        2.61    9.15e-03    4.57e-03          3
## 6      100        2.65        2.88    3.98e-03    1.99e-03          4
## 7     1000        3.48        3.66    2.51e-04    1.26e-04          5
## 8    10000        4.16        4.32    1.58e-05    7.92e-06          6
## 9   100000        4.75        4.89    1.00e-06    5.00e-07          7
## 10 1000000        5.28        5.41    6.31e-08    3.15e-08          8
#Naaman small sample-adjusted Z

NSZ <- function(N, S) {
  NSZ = qnorm(1-(N^-2)/S)
  return(NSZ)}

#Naaman small sample-adjusted p-value

NSP <- function(N, S) {
  NSP = 1-pnorm(qnorm(1-(N^-2)/S))
  return(NSP)}

#Test using provided N's

message(paste("The critical Z value using Naaman's small-sample bandwidth with n = 150 for a two-sided test is", NSZ(150, 2), "which corresponds to a p-value of", NSP(150, 2)))
## The critical Z value using Naaman's small-sample bandwidth with n = 150 for a two-sided test is 4.08307096003905 which corresponds to a p-value of 0.0000222222222222568
#Approximate Bonferroni Z cutoff

NMZ <- function(N, S, M){ #M = number of comparisons
  NMZ = qnorm(1-((N^-(6/5)/S))/M)
  return(NMZ)}

#Approximate Bonferroni p-value

NMP <- function(N, S, M){
  NMP = 1-pnorm(qnorm(1-((N^-(6/5)/S))/M))
  return(NMP)}

#Test using Naaman's provided N's

message(paste("The critical Z value using Naaman's multiple comparisons-adjusted bandwidth with n = 150 for a two-sided test with 100 multiple comparisons is", NMZ(150, 2, 100), "which corresponds to a p-value of", NMP(150, 2, 100)))
## The critical Z value using Naaman's multiple comparisons-adjusted bandwidth with n = 150 for a two-sided test with 100 multiple comparisons is 4.21960396403773 which corresponds to a p-value of 0.0000122365923861389

How this scaling behaves can be shown graphically, and in comparison to Good’s (1988) recommended scaling (standardized to \(N = 100\)), both for interest and to allay concerns about the critical value diverging to \(\infty\).

#Good's scaling for N = 100, baseline Naaman bandwidth, and multiple comparisons bandwidth with m = 5
#Reasonable convergence between adjustments occurs with a large enough n.

G88 <- function(x)(0.05/sqrt(x/100))
plot(G88(1:1000), type = 'l', main = "p-Value Scaling by N", xlab = "N", ylab = "Equivalent p-value to p = 0.05", col = "gold", lwd = 2)
lines(NP(1:1000, 2), type = "l", lwd = 3, lty = 4, col = "orangered")
lines(NMP(1:1000, 2, 5), type = "l", lwd = 3, lty = 3, col = "green")
#To compare against a static significance threshold, add abline(h = 0.05, lty = 2, col = "steelblue"),
#or lines(CS(1:1000), type = "l", lty = 2, col = "steelblue") with CS <- function(x)(0.05*(x/x))

\(N\) and \(S\) in the Naaman functions are the sample size and the sidedness, so if you want the significant \(Z\) or \(p\) for a one- or two-sided test, use 1 or 2, respectively, for \(S\). There is a rule-of-thumb baseline for this scaling: the number of digits in \(N\), plus one, is an appropriate critical \(Z\), although this is imprecise, and it obviously becomes increasingly conservative, and thus less powerful, as \(N\) increases. Since type-I error is worse than type-II error (insofar as it is more difficult to substantively address and may be associated with a greater number of, potentially deleterious, second-order effects in the real world), this may not matter much; in any case, scaling methods should be applied more often (my preference would be Bayes factors computed under multiple priors, reported in either the main text or the supplement of a paper, with the conclusions adjusted appropriately).

The call to adjust the default significance threshold is doubly compatible with this proposal, since a newly selected level can itself be scaled with \(N\), and scaling the old threshold can stand in for a hard-and-fast new one. Where small samples would end up with adjusted thresholds less stringent than the current ones, I would avoid adjustment: the sampling error associated with small samples is typically too great to justify making it easier to achieve a significant result. Finally, because the adjusted threshold is simply the bandwidth divided by \(S\), the scaling appropriate to a different baseline \(p\) or \(Z\) can be obtained from the same functions, without modification, by an arithmetically trivial rescaling of \(S\) (though obviously this is a shortcut to avoid writing a greater number of, or more complex, functions); a small sketch follows.
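
As a sketch of those last two points (the digit_rule_Z helper is hypothetical and mine; NZ and NP are the functions defined above):

#Digit rule-of-thumb: an appropriate critical Z is roughly the number of digits in N, plus one

digit_rule_Z <- function(N) {
  nchar(format(N, scientific = FALSE, trim = TRUE)) + 1}

digit_rule_Z(c(100, 10000)) #4 and 6, matching the 'Digit Rule' column in the table above

#Rescaling S: the adjusted threshold is N^(-6/5)/S, so a baseline five times stricter than the
#default (e.g., p = 0.01 rather than 0.05) is obtained by multiplying S by five

NP(1000, 2) #default two-sided threshold
NP(1000, 2*5) #the same threshold, divided by five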

I might make this into a Shiny app later, but I would have to put a modicum of effort in to host it somewhere, so I’ll decline doing that for now. Anyone else may do it. Now that these functions are online, there is no excuse not to use this p-value scaling if I ask for it in a review, or if anyone else does, for that matter.

Turning back to the analysis which inspired this post, Woods et al. (2020) reported a total of three significant first-episode results and one significant pre-death-episode result. These had p-values of 0.005, 0.003, 0.006, and 0.004, with \(N = 109,132\) and \(N = 68,020\) for the two sets of episode analyses; whether treated as one- or two-sided tests, none of these results is significant under almost-sure scaling.

NP(109132, c(1, 2)); NP(68020, c(1, 2))
## [1] 9.0e-07 4.5e-07
## [1] 1.59e-06 7.94e-07

References

Good, I. J. (1988). The Interface Between Statistics and Philosophy of Science. Statistical Science, 3(4), 386-397. https://doi.org/10.1214/ss/1177012754

Meehl, P. E. (1967). Theory-Testing in Psychology and Physics: A Methodological Paradox. Philosophy of Science, 34(2), 103-115. https://doi.org/10.1086/288135

Naaman, M. (2016). Almost sure hypothesis testing and a resolution of the Jeffreys-Lindley paradox. Electronic Journal of Statistics, 10(1), 1526-1550. https://doi.org/10.1214/16-EJS1146

Sprenger, J. (2013). Testing a Precise Null Hypothesis: The Case of Lindley’s Paradox. Philosophy of Science, 80(5), 733-744. https://doi.org/10.1086/673730

Woods, J. A., Johnson, C. E., Ngo, H. T., Katzenellenbogen, J. M., Murray, K., & Thompson, S. C. (2020). Symptom-Related Distress among Indigenous Australians in Specialist End-of-Life Care: Findings from the Multi-Jurisdictional Palliative Care Outcomes Collaboration Data. International Journal of Environmental Research and Public Health, 17(9), 3131. https://doi.org/10.3390/ijerph17093131