Suppose that \(\theta\) is a parameter, \(\theta^*\) is a “SESOI” on that parameter, \(1 - \beta(\theta)\) is the power function for \(\theta\), \(\alpha = 1 - \beta(0)\), \(T\) is a test statistic relevant to inferences about \(\theta\), and \(t_\alpha\) is the \(\alpha\)-level critical value on the test statistic.

We choose \(\alpha=0.05\) for the sake of the example.

The SESOI can be understood in two important ways:

  1. Significant results. The SESOI is the effect size chosen so that we can reliably, correctly infer that \(\theta>0\) when \(\theta\geq \theta^*\) (where reliable means that the “error” rate \(\beta(\theta^*)\) is suitably low). We infer \(\theta>0\) when the test statistic \(T>t_\alpha\). This decision has maximum error rate \(\alpha\) (the error being an inference of \(\theta>0\) when, in fact, \(\theta\leq0\)).

  2. Nonsignificant results. The SESOI is the effect size chosen so that we can infer that \(\theta\leq\theta^*\) with a small maximum error rate: when \(\theta\geq \theta^*\), the error rate of this inference is at most \(\beta(\theta^*)\), which should be low. We infer that \(\theta\leq\theta^*\) when \(T\leq t_\alpha\).

We often talk about (1), but (2) is just as important, if not more so. A significant effect lets me infer that \(\theta\) is greater than 0; a nonsignificant effect lets me infer that \(\theta\) is at most \(\theta^*\). The second inference is essential to the definition of the SESOI because we want to make sure that when we fail to find a significant result, we can be assured that we almost surely haven’t missed something important. \(\theta\) might be larger than 0, but the fact that we had a good chance (at least \(1-\beta(\theta^*)\), the lower bound on the power) of rejecting when \(\theta>\theta^*\) leads us to believe that \(\theta\) is not, in fact, as large as \(\theta^*\) when we fail to reject.

Suppose we read a paper in which a one-sample design with \(N=25\) participants is used, and suppose a one-sided test is desired, with \(H_0: \delta\leq0\). Further suppose they report using \(\alpha = 0.05\). The authors report neither a power analysis nor a minimally interesting effect size.

The figure below shows the power curve for the design. Whether the authors have reported a SESOI or not is irrelevant; this power curve shows which effect sizes the design is sensitive to, and which it is not.

Figure 1. Power curve for the design described above. The interval above the plotting area shows the \(100(1 - 2\alpha)\)% CI when the \(t\) test statistic equals the critical value, \(T=t_{.05}\). The bent (red) segment shows the CESSI for the estimator \(\hat{\delta}=T/\sqrt{N}\).

We cannot read the authors’ minds, but we can note several things:

  1. The authors report \(\alpha=.05\), but no desired “reliability” for the power. Obviously, if they don’t define an SESOI, they have not gone through the process of thinking about their tolerance for error when \(T\leq t_\alpha\) (that is, \(\beta\)). It is thus reasonable to assume that all tests they would perform would be at level \(\alpha\).

  2. At level \(\alpha\), a nonsignificant result would imply an inference that \(\delta<0.692\). We can see this through the power curve, or through the \(100(1 - 2\alpha)\)% CI when the \(t\) test statistic is at the critical value, shown at the top of Figure 1. The CI\(_{90\%}\) is \([0,0.692]\). This is consistent with the logic of equivalence testing.

For this design, \(\delta^*=0.692\) has the properties of an SESOI, whether or not the authors explicitly chose it. We can look directly at the design and, using either the logic of confidence intervals or the power curve, arrive at the (implied) SESOI. We call the effect size for which the design has \(1-\alpha\) power the SESOI\(_\alpha\).
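
To make this concrete, here is a minimal sketch of the computation, assuming a one-sample, one-sided \(t\) test with \(N=25\) and \(\alpha=.05\), and assuming the effect size is \(\delta=\mu/\sigma\) so that the noncentrality parameter is \(\delta\sqrt{N}\). The code (Python/scipy) is my own illustration, not anything reported in the hypothetical paper.

```python
# Minimal sketch: one-sample, one-sided t test, N = 25, alpha = .05.
# Assumes effect size delta = mu/sigma, so the noncentrality parameter is delta*sqrt(N).
import numpy as np
from scipy import stats
from scipy.optimize import brentq

N, alpha = 25, 0.05
df = N - 1
t_crit = stats.t.ppf(1 - alpha, df)   # critical t, about 1.711

def power(delta):
    # P(T > t_crit) under a noncentral t with ncp = delta * sqrt(N)
    return stats.nct.sf(t_crit, df, delta * np.sqrt(N))

# Effect size with power 1 - alpha: the implied SESOI_alpha (about 0.692)
sesoi_alpha = brentq(lambda d: power(d) - (1 - alpha), 0.01, 2)

# Upper limit of the 100(1 - 2*alpha)% = 90% CI when T is exactly t_crit:
# the delta whose sampling distribution puts probability alpha below t_crit
ci_upper = brentq(lambda d: stats.nct.cdf(t_crit, df, d * np.sqrt(N)) - alpha, 0.01, 2)

print(round(sesoi_alpha, 3), round(ci_upper, 3))   # both about 0.692
```

The two solutions coincide because the upper limit of the \(100(1-2\alpha)\)% CI at \(T=t_\alpha\) is, by construction, the effect size at which the power is \(1-\alpha\).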

We now turn to Lakens’s (2017) suggestion for using a previous design to infer an “implicit” SESOI:

“I think it is reasonable to assume that if you decide to collect data for a study where you plan to perform a null-hypothesis significance test, you are not interested in effect sizes that will never be statistically significant. …Unless you state otherwise…any effects smaller than this effect size [the critical effect size] are considered too small to be interesting. Obviously, you are free to explicitly state [that some larger effect size] is already too small to matter for theoretical or practical purposes. But without such an explicit statement about what your SESOI is, we can infer it from your power analysis.” (Lakens, 2017)

In Lakens et al (2018), a similar paragraph suggests this in the context of equivalence testing (though the distinction is not important):

“Another justifiable option when choosing the SESOI on the basis of earlier work is to use the smallest observed effect size that could have been statistically significant in a previous study…The assumption here is that the original authors were interested in observing a significant effect, and thus were not interested in observed effect sizes that could not have yielded a significant result…With the SESOI set as the critical effect size, an equivalence test can reject all observed effect sizes that could have been detected in the earlier study.” (Lakens et al, 2018, p. 262)

We abbreviate this suggestion as the “CESSI (Critical Effect Size as Smallest of Interest) approach”.

The following assumes that we want SESOIs that are more than just guesses about what was in people’s minds. If that’s all the CESSI is, then neither Lakens (2017) nor Lakens et al (2018) justify it empirically. I assume that we want CESSIs to act as SESOIs in some positive way: that they guide good statistical thinking and behaviour. Unfortunately, they do not serve this purpose. There are several problems with CESSIs.

Use of the power curve

Lakens (2017) says “…without such an explicit statement about what your SESOI is, we can infer it from your power analysis.” We see above that indeed we can; if a researcher is performing tests at level \(\alpha\), then the effect size for which the design has power \(1-\alpha\) acts as a sort of implicit SESOI. We can see this from the power curve or from the confidence interval.

According to Lakens (2017), the CESSI is to be used when a researcher does not state a SESOI. But:

  1. When a researcher does not state a SESOI, we can still compute the effect size to which there is \(1-\alpha\) power. What a researcher says about their SESOI is irrelevant to an analysis of their design, and
  2. There would be nothing stopping us, even if they did state a SESOI, from claiming that their implicit SESOI was something else.

Neither Lakens (2017) nor Lakens et al (2018) give a justification for computing the CESSI only in certain cases, and the CESSI and the SESOI\(_\alpha\) will, in general, differ by a large factor.

Only one of these values, the SESOI\(_\alpha\), is computable from the power curve. The CESSI requires specification of an effect size estimator (to be discussed later). If we are to reason from the power curve itself, then the SESOI\(_\alpha\) is the only choice.

If there is no justification for the distinction, what is to stop people from computing CESSIs for their own designs after they’ve run them? This would be disastrous, as they would be led to believe their designs are much better than they are.

CESSIs are poor SESOIs

Suppose that a person reads a paper using a particular design, computes the CESSI, and decides that it corresponds to their SESOI. They then replicate the design just as the previous authors ran it.

Will their experiment be a good one? No. For a median-unbiased estimator, the CESSI will be numerically the same as the true effect size for which the design has 50% power, as noted in a footnote of Lakens et al (2018). This, however, is the sign that something has gone wrong; the CESSI is not acting as a SESOI. We neither have a reliable design for detecting effect sizes of the numerical size of the CESSI, nor does a nonsignificant result assure us (at level \(\alpha\)) that the effect is numerically less than the CESSI.

Statistic vs parameter

There is nothing about a significance test that rests on an observed effect size. In fact, even mentioning observed effect sizes is confusing: all that matters is the sampling distribution of a relevant test statistic, assuming a true effect size. The observed test statistic may, or may not, be an effect size estimate. The test doesn’t care whether we regard it as an estimate or not; it only cares about its sampling distribution.

With this in mind, we first focus on the important part of the first sentence: “use [as an SESOI] the smallest observed effect size that could have been statistically significant in a previous study”. Let us replace this with formal language: “use [as an SESOI] the smallest observed test statistic that would yield an inference of \(\theta>0\)”. Notice that now that we’ve removed the equivocation of “effect size” (meaning both a statistic and a parameter), there is no logical connection between the two for the purposes of the test. How do I “use” a test statistic as a SESOI? I can’t. They’re not the same kind of number.

The confusion is most clear in the last sentence of the quote: “With the SESOI [parameter] set as the critical effect size [statistic], an equivalence test can reject [we reject hypotheses about parameters] all observed effect sizes [statistics] that could have been detected [we detect true effect sizes: parameters] in the earlier study.”

To be clear: a test does not “reject” or “detect” observed effect sizes. None of the justification for the substitution of a statistic for a parameter in this paragraph works. There is no particular reason why, just because a test statistic would not lead to an inference that \(\theta>0\), we should then substitute its numerical value into a set of true effect sizes that are not “meaningful”.

Non-unique SESOIs

There is no line that separates an “estimator” from other functions of the data in the theory of point estimation. There are ideas that help us evaluate functions of data as estimators (e.g., bias, variance, efficiency, loss, risk, admissibility, consistency, etc.). But there is no single, canonical estimator of a given parameter.

Suppose I were to decide, despite the previous section, to substitute a statistic for a parameter. The question then becomes: which one? We would have to take for granted that there is a single estimator we would choose. If we only know of one estimator, this might seem reasonable.

But from the perspective of our test, the choice is irrelevant; all we need is a good test statistic. Any estimate that preserves the information in our test statistic will yield precisely the same inference, because all inferences flow through its sampling distribution.

Suppose that we have a one-way design with \(J=3\) groups and \(N=9\) in each group. We choose \(\alpha=0.05\) and a one-way ANOVA test. The critical \(F\) statistic is \(f_{0.05}=3.403\).

The power curve is shown in the figure below. Note that this power curve does not depend at all on what effect size estimator we might choose. If we are to “infer [the SESOI] from your power analysis,” we cannot depend on an estimator, because the power analysis does not depend on the choice of estimator.

Figure 2. Power curve for the one-way ANOVA design. The CI at the top is the \(100(1 - 2\alpha)\)% CI that would result from a just-significant test statistic.

Suppose we follow Lakens et al and base the SESOI on the “just significant” test statistics (in this case, statistics we also use as estimators). We could use any of several estimators; we consider two:

\[\begin{eqnarray*} \eta^2 &=& \frac{F}{F+\nu_2/\nu_1}\\ \hat{\omega}^2&=&\frac{F - 1}{F + (\nu_2+1)/\nu_1} \end{eqnarray*}\]

where \(\nu_1\) and \(\nu_2\) are the numerator and denominator degrees of freedom, respectively. \(F=3.403\) implies that \(\eta^2=0.221\) and \(\hat{\omega}^2=0.151\). This is quite a difference, due to the differing properties of these estimators.
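
These numbers are easy to verify. Here is a minimal sketch (Python/scipy) using the design values given above; the code is mine, purely for checking the arithmetic.

```python
# Minimal sketch: critical F and the two "critical" effect size estimates
# for a one-way ANOVA with J = 3 groups, n = 9 per group, alpha = .05.
from scipy import stats

J, n, alpha = 3, 9, 0.05
nu1, nu2 = J - 1, J * (n - 1)                              # 2 and 24
f_crit = stats.f.ppf(1 - alpha, nu1, nu2)                  # about 3.403

eta_sq = f_crit / (f_crit + nu2 / nu1)                     # about 0.221
omega_hat_sq = (f_crit - 1) / (f_crit + (nu2 + 1) / nu1)   # about 0.151
print(round(f_crit, 3), round(eta_sq, 3), round(omega_hat_sq, 3))
```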

Of course, \(\eta^2\) is a biased estimator. The sampling distributions reveal three things: first, the median bias of the two statistics (\(\eta^2\)’s sampling distribution, in particular, is shifted high); second, the difference in which values of each statistic would lead to the inference that \(\omega^2>0\); and third, that this difference arises because the test automatically adjusts for the statistic’s sampling distribution.

Figure 3. Sampling distributions of \(\eta^2\) (top, red) and \(\hat{\omega}^2\) (bottom, blue) when \(\omega^2=0\) (solid) and \(\omega^2=0.394\) (dashed). The vertical dotted line shows the estimate associated with the critical \(F\) value.
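
A small sketch conveys the same point as Figure 3 using medians. It assumes the conventional mapping from a true \(\omega^2\) to the \(F\) noncentrality parameter, \(\lambda = N_{\mathrm{tot}}\,\omega^2/(1-\omega^2)\); this mapping is my assumption, not necessarily the one used to produce the figure. Because \(\eta^2\) and \(\hat{\omega}^2\) are increasing functions of \(F\), their medians are simply transforms of the median of \(F\).

```python
# Sketch of the point in Figure 3, using medians of the sampling distributions.
# Assumes noncentrality lambda = N_tot * omega^2 / (1 - omega^2).
from scipy import stats

nu1, nu2, N_tot = 2, 24, 27

def medians(omega_sq_true):
    lam = N_tot * omega_sq_true / (1 - omega_sq_true)
    F_med = stats.ncf.ppf(0.5, nu1, nu2, lam)              # median of the F statistic
    eta_med = F_med / (F_med + nu2 / nu1)
    omega_hat_med = (F_med - 1) / (F_med + (nu2 + 1) / nu1)
    return round(eta_med, 3), round(omega_hat_med, 3)

print(medians(0.0))    # eta^2's median sits above 0; omega-hat^2's is near 0
print(medians(0.394))  # omega-hat^2's median is near 0.394; eta^2's is noticeably higher
```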

One might object to including the horribly biased \(\eta^2\), but the bias is beside the point; the point is that something that does not matter to the test (the choice of effect size estimator) does matter to the CESSI. A test using \(\eta^2\) as the test statistic is just as good as one using \(\hat{\omega}^2\). Furthermore, the upward bias in \(\eta^2\) actually improves the CESSI as an SESOI (raising its corresponding power).

The power curve below shows that the two CESSIs are different, and by a substantial factor. That they are different at all is a problem; that the \(\eta^2\) CESSI is 46.215% larger than the \(\hat{\omega}^2\) CESSI is very problematic. The corresponding powers for the two SESOIs are 0.638 (\(\eta^2\)) and 0.437 (\(\hat{\omega}^2\)). The odds of concluding that \(\omega^2>0\) at the \(\eta^2\) CESSI are 2.271 times as high as for the \(\hat{\omega}^2\) CESSI. The less-biased \(\hat{\omega}^2\) leads to a particularly terrible SESOI, at substantially less than 50% power.

Figure 4. Power curve for the one-way ANOVA design. The bent segments show the power for the two CESSIs (red: \(\eta^2\); blue: \(\hat{\omega}^2\)). The vertical green line shows the SESOI\(_\alpha\). The CI at the top is the \(100(1 - 2\alpha)\)% CI that would result from a just-significant test statistic.
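
The powers quoted above can be reproduced, at least approximately, with the same kind of sketch; again, the \(\omega^2\)-to-noncentrality mapping is my assumption.

```python
# Minimal sketch: power of the alpha = .05 F test at the two CESSIs,
# treating each CESSI as a true omega^2 value.
# Assumes noncentrality lambda = N_tot * omega^2 / (1 - omega^2).
from scipy import stats

nu1, nu2, N_tot, alpha = 2, 24, 27, 0.05
f_crit = stats.f.ppf(1 - alpha, nu1, nu2)

def power(omega_sq):
    lam = N_tot * omega_sq / (1 - omega_sq)
    return stats.ncf.sf(f_crit, nu1, nu2, lam)

print(round(power(0.221), 3))   # eta^2 CESSI: roughly 0.64
print(round(power(0.151), 3))   # omega-hat^2 CESSI: roughly 0.44
```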

The SESOI\(_\alpha\), on the other hand, does not depend on the chosen estimator. It can be read directly off the power curve. It is based on widely-agreed principles (that the test should be at level \(\alpha\)), and corresponds to how you would determine whether a design is good in an a priori power analysis.

It also reveals what we suspect from the small sample sizes: this design is terrible, because we can only reliably detect that \(\omega^2>0\) when \(\omega^2\geq0.394\)! A nonsignificant result would only let us infer that \(\omega^2<0.394\) (at level \(\alpha\)), which is not a terribly helpful inference. Using the \(\hat{\omega}^2\) CESSI as an SESOI would lead us to believe that the design is better than it is.
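
Finally, the SESOI\(_\alpha\) for this design can be found by solving for the \(\omega^2\) at which the power reaches \(1-\alpha\), under the same assumed mapping as the sketches above.

```python
# Minimal sketch: solve for the omega^2 with power 1 - alpha (the implied SESOI_alpha).
# Same assumed omega^2-to-noncentrality mapping as above.
from scipy import stats
from scipy.optimize import brentq

nu1, nu2, N_tot, alpha = 2, 24, 27, 0.05
f_crit = stats.f.ppf(1 - alpha, nu1, nu2)

def power(omega_sq):
    lam = N_tot * omega_sq / (1 - omega_sq)
    return stats.ncf.sf(f_crit, nu1, nu2, lam)

sesoi_alpha = brentq(lambda w: power(w) - (1 - alpha), 0.01, 0.9)
print(round(sesoi_alpha, 3))    # roughly 0.39
```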