In this note, we show that
- The conditional MLE (MCLE) can be derived using sufficient statistics only;
- Instead of solving a two-parameter system, the MCLE can be obtained from
a one-dimensional equation;
- The MCLE of \(\mu\) diverges to \(-\infty\) when the observed phase 2 mean barely
exceeds the go/no-go threshold;
- We give an approximation to \(\widehat\mu\) when that happens;
- A fast simulation method exists;
- The MCLE is more likely to be ill-posed when the true effect is small;
- The MCLE is applicable in practice when a go decision is made, even if
the true effect is unknown.
1. Problem setup
Here we consider the one-sample problem. The results carry over to
the two-sample problem, as in a two-arm randomized trial, by replacing
\(n\) with \(n/2\), where \(n\) then denotes the per-arm sample size.
Suppose
\[
X_1,\ldots,X_n \overset{iid}{\sim} N(\mu,\sigma^2).
\]
Only the summary statistics are available:
\[
Y=\bar X,\qquad
S^2=\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar X)^2.
\]
Let the observed values be
\[
y=\bar x,\qquad s^2=S^2_{\text{obs}}.
\]
The phase 2 trial moves to phase 3 only if
\[
Y>c,
\]
where \(c\) is the go/no-go
threshold. The objective is to estimate \((\mu,\sigma^2)\) conditional on the
selection event
\[
A=\{Y>c\}.
\]
Assume throughout that
\[
y>c,\qquad s^2>0,\qquad n>1.
\]
2. Conditional likelihood based on \((\bar
X,S^2)\)
Under the normal model,
\[
Y\sim N\left(\mu,\frac{\sigma^2}{n}\right),
\qquad
\frac{(n-1)S^2}{\sigma^2}\sim \chi^2_{n-1},
\]
and \(Y\) and \(S^2\) are independent.
Because the selection event \(A=\{Y>c\}\) depends only on \(Y\), the conditional likelihood based on
\((y,s^2)\) is
\[
L_c(\mu,\sigma\mid y,s^2,Y>c)
=
\frac{
f_Y(y\mid \mu,\sigma) f_{S^2}(s^2\mid \sigma)
}{
P_{\mu,\sigma}(Y>c)
}.
\]
Define
\[
a=\frac{\sqrt n(c-\mu)}{\sigma}.
\]
Then
\[
P_{\mu,\sigma}(Y>c)=1-\Phi(a).
\]
Ignoring constants not involving \((\mu,\sigma)\),
\[
L_c(\mu,\sigma)
\propto
\frac{
\sigma^{-n}
\exp\left[
-\frac{(n-1)s^2+n(y-\mu)^2}{2\sigma^2}
\right]
}{
1-\Phi(a)
}.
\]
Equivalently,
\[
\ell_c(\mu,\sigma)
=
-n\log\sigma
-
\frac{(n-1)s^2+n(y-\mu)^2}{2\sigma^2}
-
\log\{1-\Phi(a)\}
+
\text{constant}.
\]
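The log-likelihood above can be evaluated directly from \((y,s^2)\). The following is a minimal Python sketch, not the note's own implementation; the function name `cond_loglik` and the numerical inputs are illustrative assumptions, and the additive constant is dropped.

```python
import numpy as np
from scipy.stats import norm


def cond_loglik(mu, sigma, y, s2, n, c):
    """Conditional log-likelihood ell_c(mu, sigma), up to an additive constant.

    Hypothetical helper: the name and signature are assumptions of this sketch.
    """
    a = np.sqrt(n) * (c - mu) / sigma                 # standardized threshold
    quad = ((n - 1) * s2 + n * (y - mu) ** 2) / (2 * sigma ** 2)
    # norm.logsf(a) = log{1 - Phi(a)}, evaluated stably in the upper tail
    return -n * np.log(sigma) - quad - norm.logsf(a)


# Illustrative inputs only (not taken from the note's simulation output)
print(cond_loglik(mu=0.2, sigma=1.0, y=0.5, s2=1.0, n=25, c=0.33))
```

Working with `norm.logsf` rather than `np.log(1 - norm.cdf(a))` is a design choice of the sketch: it keeps the selection term finite when \(a\) is large and \(1-\Phi(a)\) underflows.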
3. Fast simulation of summary statistics conditional on \(A=\{Y>c\}\)
To simulate selected phase 2 trials efficiently, there is no need to
generate individual observations \(X_1,\ldots,X_n\). It is enough to simulate
the sufficient statistics \((Y,S^2)\)
directly.
Under the normal model,
\[
Y\sim N\left(\mu,\frac{\sigma^2}{n}\right),
\qquad
\frac{(n-1)S^2}{\sigma^2}\sim \chi^2_{n-1},
\]
and \(Y\) is independent of \(S^2\). Since the selection event
\[
A=\{Y>c\}
\]
depends only on \(Y\), conditioning
on \(A\) only changes the distribution
of \(Y\). The conditional distribution
of \(S^2\) remains unchanged.
Therefore,
\[
\boxed{
Y\mid A
\sim
N\left(\mu,\frac{\sigma^2}{n}\right)
\text{ truncated below at } c
}
\tag{1}
\]
and
\[
\boxed{
\frac{(n-1)S^2}{\sigma^2}\mid A
\sim
\chi^2_{n-1}.
}
\tag{2}
\]
Moreover,
\[
\boxed{
Y\mid A
\quad\text{and}\quad
S^2\mid A
\text{ are independent.}
}
\tag{3}
\]
Thus, selected summary statistics can be simulated as follows:
- Simulate
\[
Y \sim N\left(\mu,\frac{\sigma^2}{n}\right)
\]
conditional on \(Y>c\).
- Independently simulate
\[
W\sim \chi^2_{n-1}.
\]
- Set
\[
S^2=\frac{\sigma^2 W}{n-1}.
\]
Equivalently, with \(a=\sqrt n(c-\mu)/\sigma\) as defined in Section 2,
if \(U\sim
\mathrm{Uniform}(\Phi(a),1)\), then
\[
\boxed{
Y
=
\mu+\frac{\sigma}{\sqrt n}\Phi^{-1}(U)
}
\tag{4}
\]
has the desired conditional distribution \(Y\mid Y>c\). Independently,
\[
\boxed{
S^2
=
\frac{\sigma^2}{n-1}\chi^2_{n-1}.
}
\tag{5}
\]
This directly simulates \((Y,S^2)\mid
Y>c\). It avoids rejection sampling and is much faster when
\(P(Y>c)\) is small.
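The sampling scheme in (4) and (5) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used for the note's simulations: the function name `simulate_selected` and the seed handling are hypothetical, and the uniform draw is taken on the upper tail, \(V=1-U\sim\mathrm{Uniform}(0,1-\Phi(\cdot))\), which is equivalent to (4) but loses less precision when the selection probability is small.

```python
import numpy as np
from scipy.stats import norm


def simulate_selected(mu, sigma, n, c, size, seed=None):
    """Draw (Y, S^2) conditional on Y > c by inverse-CDF sampling.

    Hypothetical helper: name, signature, and seeding are assumptions.
    """
    rng = np.random.default_rng(seed)
    a = np.sqrt(n) * (c - mu) / sigma
    # V = 1 - U with U ~ Uniform(Phi(a), 1), so V ~ Uniform(0, 1 - Phi(a)).
    v = rng.uniform(0.0, norm.sf(a), size=size)
    # norm.isf(v) = Phi^{-1}(1 - v), so this is equation (4).
    y = mu + sigma / np.sqrt(n) * norm.isf(v)
    # Equation (5): S^2 = sigma^2 * chi2_{n-1} / (n - 1), independent of Y.
    w = rng.chisquare(n - 1, size=size)
    s2 = sigma ** 2 * w / (n - 1)
    return y, s2


y_sim, s2_sim = simulate_selected(mu=0.0, sigma=1.0, n=25, c=0.33, size=1000, seed=0)
print(y_sim.min(), s2_sim.mean())
```

Every draw is accepted, so the cost does not depend on \(P(Y>c)\), in contrast to rejection sampling.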
Using the estimation method derived in the sections below, we simulate the following
scenarios to assess the performance of the MCLE:
- \(n = 25\), equivalent to 50
patients per arm in the two-sample problem
- \(\sigma^2 = 1\)
- \(c = 0.33\)
- \(\mu = 0, 0.05, 0.1, \ldots, 0.95,
1.0\)
- 1000 replicates for each \(\mu\)
The results are summarized below.

A few observations:
Left: An ill-posed MCLE is defined as an MCLE \(< -10\). The proportion of ill-posed
estimates increases sharply as the true effect moves away from the go
threshold \(c = 0.33\) toward the null. Intuitively,
the closer the true effect is to the null, the higher the chance that the observed effect
\(Y\) is close to the threshold, given that
a go decision is made. Conditional on a go decision, \(Y\) is more likely to lie near \(c\) than far above it, which leads to an ill-posed MCLE.
- The percentage of ill-posed estimates is below 50% in every simulated
scenario. Thus, the calculation of the median bias of the MCLE is not affected
by ill-posed estimates.
Middle: As expected, the observed effect is highly biased when
the true effect is close to the null.
Right: A few conclusions can be drawn for the MCLE.
- The MCLE tends to underestimate the effect.
- The MCLE is less biased when the true effect is larger.
- The MCLE has low bias (relative bias within about -14% = -0.05/0.35) when the true effect is
greater than the go threshold. This makes the MCLE usable when planning
confirmatory trials.
- The MCLE is highly biased when the true effect is close to the null. In
practice, however, this is acceptable, because a “go” decision in that case is a false positive.
The plot of the percentage of ill-posed MCLE reflects an interesting
observation that we will discuss further in section 4.
4. Conditional Density Ratio of \(Y\) After Selection
Recall that
\[
Y\sim N\left(\mu,\frac{\sigma^2}{n}\right).
\]
The phase 2 trial is selected only if
\[
Y>c.
\]
For a fixed value of \(\mu\), the
conditional density of \(Y\) given
\(Y>c\) is
\[
f_\mu(y\mid Y>c)
=
\frac{
\frac{\sqrt n}{\sigma}
\phi\left(\frac{\sqrt n(y-\mu)}{\sigma}\right)
}{
1-\Phi\left(\frac{\sqrt n(c-\mu)}{\sigma}\right)
},
\qquad y>c.
\]
We compare the conditional density under \(\mu=0\) with that under \(\mu>0\), while keeping \(n\), \(\sigma\), and \(c\) fixed.
Main conclusion
Define the density ratio
\[
R(y)
=
\frac{
f_0(y\mid Y>c)
}{
f_\mu(y\mid Y>c)
},
\qquad y>c,\quad \mu>0.
\]
Then \(R(y)\) is strictly decreasing
in \(y\). Moreover, \(R(c)>1\) and \(R(y)\to 0\) as \(y\to\infty\). Therefore, there exists a
unique crossing point \(y^\star>c\)
such that
\[
f_0(y\mid Y>c)>f_\mu(y\mid Y>c),
\qquad c<y<y^\star,
\]
while
\[
f_0(y\mid Y>c)<f_\mu(y\mid Y>c),
\qquad y>y^\star.
\]
In words, after conditioning on \(Y>c\), the model with \(\mu=0\) puts relatively more mass near the
threshold \(c\), while the model with
\(\mu>0\) puts relatively more mass
farther into the right tail. This is consistent with the left panel of the
plot above.
Proof
From the conditional density formula,
\[
R(y)
=
\frac{
1-\Phi\left(\frac{\sqrt n(c-\mu)}{\sigma}\right)
}{
1-\Phi\left(\frac{\sqrt n c}{\sigma}\right)
}
\exp\left\{
-\frac{n\mu y}{\sigma^2}
+
\frac{n\mu^2}{2\sigma^2}
\right\}.
\]
Hence
\[
\log R(y)
=
\text{constant}
-
\frac{n\mu}{\sigma^2}y.
\]
Since \(\mu>0\),
\[
\frac{d}{dy}\log R(y)
=
-\frac{n\mu}{\sigma^2}
<0.
\]
Therefore, \(R(y)\) is strictly
decreasing in \(y\).
Next, evaluate the ratio at \(y=c\).
We have
\[
\log R(c)
=
\log\left\{1-\Phi\left(\frac{\sqrt n(c-\mu)}{\sigma}\right)\right\}
-
\log\left\{1-\Phi\left(\frac{\sqrt n c}{\sigma}\right)\right\}
-
\frac{n\mu c}{\sigma^2}
+
\frac{n\mu^2}{2\sigma^2}.
\]
Using
\[
\lambda(a)=\frac{\phi(a)}{1-\Phi(a)},
\]
and
\[
\frac{d}{d\mu}
\log\left\{
1-\Phi\left(\frac{\sqrt n(c-\mu)}{\sigma}\right)
\right\}
=
\frac{\sqrt n}{\sigma}
\lambda\left(\frac{\sqrt n(c-\mu)}{\sigma}\right),
\]
we can write
\[
\begin{aligned}
\log R(c)
&=
\int_0^\mu
\frac{\sqrt n}{\sigma}
\lambda\left(\frac{\sqrt n(c-u)}{\sigma}\right)
\,du
-
\int_0^\mu
\frac{n(c-u)}{\sigma^2}
\,du \\
&=
\int_0^\mu
\frac{\sqrt n}{\sigma}
\left[
\lambda\left(\frac{\sqrt n(c-u)}{\sigma}\right)
-
\frac{\sqrt n(c-u)}{\sigma}
\right]
\,du.
\end{aligned}
\]
For the standard normal inverse Mills ratio,
\[
\lambda(a)>a
\qquad \text{for all } a\in\mathbb R.
\]
Thus the integrand is positive for every \(u\in[0,\mu]\). Therefore,
\[
\log R(c)>0,
\]
which implies
\[
R(c)>1.
\]
Finally, since
\[
\log R(y)
=
\text{constant}
-
\frac{n\mu}{\sigma^2}y,
\]
we have
\[
R(y)\to 0
\qquad \text{as } y\to\infty.
\]
Since \(R(y)\) is continuous and
strictly decreasing, with \(R(c)>1\)
and \(R(y)\to 0\), there is a unique
crossing point \(y^\star>c\).
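Because \(\log R(y)\) is linear in \(y\), setting it to zero gives the crossing point in closed form, \(y^\star=\frac{\sigma^2}{n\mu}\log K+\frac{\mu}{2}\), where \(K\) denotes the ratio of the two selection probabilities appearing in \(R(y)\). The Python sketch below evaluates this numerically; the function names and the inputs (\(\mu=0.5\), \(\sigma=1\), \(n=25\), \(c=0.33\)) are illustrative assumptions, and \(K\) is notation introduced here only for the example.

```python
import numpy as np
from scipy.stats import norm


def log_k(mu, sigma, n, c):
    """log of K = {1 - Phi(sqrt(n)(c - mu)/sigma)} / {1 - Phi(sqrt(n) c/sigma)}."""
    rn = np.sqrt(n)
    return norm.logsf(rn * (c - mu) / sigma) - norm.logsf(rn * c / sigma)


def log_ratio(y, mu, sigma, n, c):
    """log R(y) = log f_0(y | Y > c) - log f_mu(y | Y > c), for y > c, mu > 0."""
    return log_k(mu, sigma, n, c) - n * mu * y / sigma ** 2 + n * mu ** 2 / (2 * sigma ** 2)


def crossing_point(mu, sigma, n, c):
    """Solve log R(y) = 0, which is linear in y."""
    return sigma ** 2 * log_k(mu, sigma, n, c) / (n * mu) + mu / 2


y_star = crossing_point(mu=0.5, sigma=1.0, n=25, c=0.33)
print(y_star, log_ratio(0.33, 0.5, 1.0, 25, 0.33))  # second value is log R(c)
```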
Interpretation
If \(\mu=0\), passing the threshold
\(Y>c\) is relatively surprising.
Conditional on this event, \(Y\) is
more likely to be just above \(c\),
resulting in a higher chance of an ill-posed MCLE.
If \(\mu>0\), passing the
threshold is less surprising, and the conditional distribution puts
relatively more mass farther above \(c\).
This provides a useful intuition for selection bias adjustment: a
phase 2 result that barely exceeds the go/no-go threshold is more
consistent with a smaller true effect combined with selection-induced
upward fluctuation than with a genuinely larger true effect.
5. Score equations
Now we illustrate how to obtain MCLE using sufficient statistics
\((Y, S^2)\). Let
\[
\lambda(a)=\frac{\phi(a)}{1-\Phi(a)}
\]
be the inverse Mills ratio.
Setting the partial derivatives of \(\ell_c\) with respect to \(\mu\) and \(\sigma\) to zero gives
\[
\boxed{
\frac{\sqrt n(y-\mu)}{\sigma}
=
\lambda(a)
}
\tag{6}
\]
and
\[
\boxed{
\sigma^2
=
\frac{(n-1)s^2}{n+a\lambda(a)-\lambda(a)^2}
}
\tag{7}
\]
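As a sketch of the intermediate step behind (7), which is not spelled out above, differentiate \(\ell_c\) with respect to \(\sigma\), using \(\partial a/\partial\sigma=-a/\sigma\):
\[
\begin{aligned}
\frac{\partial \ell_c}{\partial\sigma}
&=
-\frac{n}{\sigma}
+\frac{(n-1)s^2+n(y-\mu)^2}{\sigma^3}
-\frac{a\lambda(a)}{\sigma}
=0 \\
\Longrightarrow\quad
\frac{(n-1)s^2}{\sigma^2}
&=
n+a\lambda(a)-\frac{n(y-\mu)^2}{\sigma^2}
=
n+a\lambda(a)-\lambda(a)^2,
\end{aligned}
\]
where the last equality substitutes (6). Solving for \(\sigma^2\) gives (7).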
Therefore, once \(a\) is known,
\(\mu\) and \(\sigma^2\) can be recovered directly.
6. One-dimensional equation for \(a\)
Equations (6) and (7) involve the two unknown parameters \((\mu, \sigma^2)\). Here we show that the MCLE
can instead be obtained from a one-dimensional equation.
From
\[
a=\frac{\sqrt n(c-\mu)}{\sigma}
\]
and the first score equation,
\[
\lambda(a)=\frac{\sqrt n(y-\mu)}{\sigma},
\]
subtracting gives
\[
\lambda(a)-a
=
\frac{\sqrt n(y-c)}{\sigma}.
\]
Thus
\[
\sigma^2
=
\frac{n(y-c)^2}{\{\lambda(a)-a\}^2}.
\]
Combining this with the score-based expression for \(\sigma^2\) yields the one-dimensional
equation
\[
\boxed{
n(y-c)^2\{n+a\lambda(a)-\lambda(a)^2\}
-
(n-1)s^2\{\lambda(a)-a\}^2
=
0
}
\tag{8}
\]
Let \(\widehat{a}\) be the solution.
Then the conditional MLEs are
\[
\boxed{
\widehat{\sigma}^2
=
\frac{(n-1)s^2}{n+\widehat{a}\lambda(\widehat{a})-\lambda(\widehat{a})^2}
}
\tag{9}
\]
Equivalently,
\[
\boxed{
\widehat{\sigma}^2
=
\frac{n(y-c)^2}{\{\lambda(\widehat{a})-\widehat{a}\}^2}
}
\tag{10}
\]
The corresponding estimator of \(\mu\) is
\[
\boxed{
\widehat{\mu}
=
c-\frac{\widehat{a}\widehat{\sigma}}{\sqrt n}
}
\tag{11}
\]
Equivalently,
\[
\boxed{
\widehat{\mu}
=
y-\frac{\widehat{\sigma}}{\sqrt n}\lambda(\widehat{a})
}
\tag{12}
\]
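Equations (8), (10), and (11) translate into a short root-finding routine. The following Python sketch is one possible implementation, not necessarily the one used for the note's simulations: the names `inv_mills` and `mcle` are hypothetical, and the bracketing interval is an assumption of the sketch, chosen from the standard bounds \(0<\lambda(a)-a<1/a\) for \(a>0\) so that (8) changes sign between its endpoints. It returns a root inside that bracket and makes no claim about uniqueness.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm


def inv_mills(a):
    """lambda(a) = phi(a) / {1 - Phi(a)}, evaluated on the log scale."""
    return np.exp(norm.logpdf(a) - norm.logsf(a))


def mcle(y, s2, n, c):
    """Solve (8) for a, then recover (mu_hat, sigma2_hat) via (10) and (11).

    Hypothetical helper: name, signature, and bracket are assumptions.
    """
    if not (y > c and s2 > 0 and n > 1):
        raise ValueError("requires y > c, s2 > 0, n > 1")
    r = (n - 1) * s2 / (n * (y - c) ** 2)

    def g(a):
        # Equation (8) divided by n (y - c)^2
        lam = inv_mills(a)
        return n + a * lam - lam ** 2 - r * (lam - a) ** 2

    lo = -(np.sqrt(n / r) + 1.0)        # g(lo) < 0
    hi = np.sqrt(r / (n - 1)) + 1.0     # g(hi) > 0
    a_hat = brentq(g, lo, hi, xtol=1e-12)

    lam = inv_mills(a_hat)
    sigma_hat = np.sqrt(n) * (y - c) / (lam - a_hat)   # square root of (10)
    mu_hat = c - a_hat * sigma_hat / np.sqrt(n)        # (11)
    return mu_hat, sigma_hat ** 2, a_hat


# Illustrative inputs only
mu_hat, sigma2_hat, a_hat = mcle(y=0.5, s2=1.0, n=25, c=0.33)
print(mu_hat, sigma2_hat, a_hat)
```

When \(y\) is extremely close to \(c\), the root \(\widehat a\) becomes very large and the difference \(\lambda(a)-a\) suffers from floating-point cancellation, so this sketch should not be relied on in that regime.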
7. Behavior as \(y\to c^+\)
Now consider the case where the observed phase 2 mean barely exceeds
the go/no-go threshold:
\[
y-c\downarrow 0.
\]
Then
\[
r=
\frac{(n-1)s^2}{n(y-c)^2}
\to\infty.
\]
Let
\[
m(a)=\lambda(a)-a.
\]
Since \(\lambda(a)=a+m(a)\), dividing the
one-dimensional equation (8) by \(n(y-c)^2\) shows that it is equivalent to
\[
\boxed{
(r+1)m(a)^2+a m(a)-n=0
}
\tag{16}
\]
This form makes the boundary behavior transparent.
First, \(\widehat{a}\) cannot remain
bounded as \(r\to\infty\). If \(a\) were bounded, then \(m(a)=\lambda(a)-a\) would be bounded away
from zero, so the term \((r+1)m(a)^2\)
would diverge. Therefore, any solution must diverge.
It cannot diverge to \(-\infty\),
because when \(a\to-\infty\),
\[
\lambda(a)\to 0,
\qquad
m(a)=\lambda(a)-a\to\infty,
\]
which again makes the left-hand side of (16) diverge, because
\((r+1)m(a)^2+a\,m(a)=r\,m(a)^2+m(a)\lambda(a)\ge r\,m(a)^2\).
Hence the relevant solution satisfies
\[
\boxed{
\widehat{a}\to+\infty
}
\tag{17}
\]
For large positive \(a\),
\[
\lambda(a)-a
=
\frac{1}{a}+O(a^{-3}).
\]
Therefore,
\[
m(a)\sim \frac{1}{a},
\qquad
a m(a)\to 1.
\]
Substituting this into the equation above gives
\[
\frac{r+1}{a^2}+1-n\approx 0.
\]
Thus
\[
\widehat{a}^2
\sim
\frac{r}{n-1}.
\]
Using the definition of \(r\),
\[
\boxed{
\widehat{a}
\sim
\frac{s}{\sqrt n(y-c)}
}
\tag{18}
\]
Now,
\[
\widehat{\sigma}
=
\frac{\sqrt n(y-c)}{\lambda(\widehat{a})-\widehat{a}}.
\]
Since
\[
\lambda(\widehat{a})-\widehat{a}
\sim
\frac{1}{\widehat{a}},
\]
we obtain
\[
\widehat{\sigma}
\sim
\sqrt n(y-c)\widehat{a}
\sim
s.
\]
Therefore,
\[
\boxed{
\widehat{\sigma}^2\to s^2
}
\tag{19}
\]
Finally,
\[
\widehat{\mu}
=
c-\frac{\widehat{a}\widehat{\sigma}}{\sqrt n}.
\]
Using the asymptotic expression for \(\widehat{a}\) and \(\widehat{\sigma}\to s\), we obtain
\[
\boxed{
\widehat{\mu}
\sim
c-\frac{s^2}{n(y-c)}
}
\tag{20}
\]
Hence,
\[
\boxed{
\widehat{\mu}\to-\infty
\qquad \text{as } y\to c^+
}
\tag{21}
\]
In words, when the observed phase 2 mean barely passes the threshold,
the conditional MLE attributes most of the apparent success to the
conditioning event \(Y>c\). The
estimate of \(\sigma^2\) remains close
to the observed sample variance, but the estimate of \(\mu\) is pulled sharply downward and has no
finite limit as \(y\downarrow c\).
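The leading-order formulas (18) to (20) are cheap to evaluate. The Python sketch below is illustrative only: the function names are hypothetical, the inputs are made up, and the results are approximations that apply only when \(y-c\) is small. The second helper inverts (20) to give the approximate margin \(y-c\) below which \(\widehat\mu\) falls under a cutoff such as the \(-10\) used above to define an ill-posed MCLE.

```python
import numpy as np


def mcle_boundary_approx(y, s2, n, c):
    """Leading-order approximations (18)-(20) as y - c -> 0+.

    Hypothetical helper: name and signature are assumptions.
    """
    d = y - c
    a_approx = np.sqrt(s2) / (np.sqrt(n) * d)   # (18)
    sigma2_approx = s2                          # (19)
    mu_approx = c - s2 / (n * d)                # (20)
    return mu_approx, sigma2_approx, a_approx


def illposed_margin(s2, n, c, cutoff=-10.0):
    """Approximate d such that (20) gives mu_hat < cutoff whenever y - c < d."""
    return s2 / (n * (c - cutoff))


mu_approx, sigma2_approx, a_approx = mcle_boundary_approx(y=0.331, s2=1.0, n=25, c=0.33)
print(mu_approx, a_approx)                 # about -39.67 and 200
print(illposed_margin(s2=1.0, n=25, c=0.33))
```

With these inputs, \(y-c=0.001\) and (20) gives \(\widehat\mu\approx 0.33-1/(25\times 0.001)=-39.67\), which shows how quickly the estimate falls once the observed mean sits just above the threshold.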