In this note, we show that
- The maximum conditional likelihood estimator (MCLE) can be computed from the two-sample sufficient statistics
only;
- Under unequal variances, the likelihood depends on \((Y,S_T^2,S_C^2)\) rather than \((Y,S_p^2)\);
- Unlike the equal-variance case, the MCLE generally cannot be reduced to
a single one-dimensional equation in \(a\);
- A fast simulation method still exists because the selection event
depends only on \(Y\);
- The MCLE of the treatment effect \(\delta\) diverges to \(-\infty\) as the observed phase 2
treatment effect approaches the go/no-go threshold from above;
- MCLE is more likely to be ill-posed when the true treatment effect
is small.
1. Problem setup
Here we consider the two-sample problem without assuming equal
variances.
Suppose
\[
X_{Ti}\overset{iid}{\sim}N(\mu_T,\sigma_T^2),
\qquad
i=1,\ldots,n_T,
\]
and
\[
X_{Cj}\overset{iid}{\sim}N(\mu_C,\sigma_C^2),
\qquad
j=1,\ldots,n_C.
\]
The treatment effect is
\[
\delta=\mu_T-\mu_C.
\]
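Only the two-sample summary statistics are available: the sample means \(\bar X_T,\bar X_C\) and the unbiased sample variances \(S_T^2,S_C^2\) (with divisors \(n_T-1\) and \(n_C-1\)). Define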
\[
Y=\bar X_T-\bar X_C.
\]
Let the observed values be
\[
y=\bar x_T-\bar x_C,
\qquad
s_T^2=S_{T,\text{obs}}^2,
\qquad
s_C^2=S_{C,\text{obs}}^2.
\]
Define
\[
\nu_T=n_T-1,
\qquad
\nu_C=n_C-1.
\]
The phase 2 trial moves to phase 3 only if
\[
Y>c,
\]
where \(c\) is the go/no-go
threshold. The objective is to estimate
\[
(\delta,\sigma_T^2,\sigma_C^2)
\]
conditional on the selection event
\[
A=\{Y>c\}.
\]
Assume throughout that
\[
y>c,\qquad s_T^2>0,\qquad s_C^2>0,\qquad n_T>1,\qquad
n_C>1.
\]
2. Conditional likelihood based on \((Y,S_T^2,S_C^2)\)
Under the two-sample normal model,
\[
Y\sim N(\delta,V),
\]
where
\[
V=
\frac{\sigma_T^2}{n_T}
+
\frac{\sigma_C^2}{n_C}.
\]
Also,
\[
\frac{\nu_T S_T^2}{\sigma_T^2}\sim\chi^2_{\nu_T},
\qquad
\frac{\nu_C S_C^2}{\sigma_C^2}\sim\chi^2_{\nu_C}.
\]
Moreover,
\[
Y,\quad S_T^2,\quad S_C^2
\]
are mutually independent.
Because the selection event \(A=\{Y>c\}\) depends only on \(Y\), the conditional likelihood based on
\((y,s_T^2,s_C^2)\) is
\[
L_c(\delta,\sigma_T^2,\sigma_C^2\mid y,s_T^2,s_C^2,Y>c)
=
\frac{
f_Y(y\mid \delta,\sigma_T^2,\sigma_C^2)
f_{S_T^2}(s_T^2\mid \sigma_T^2)
f_{S_C^2}(s_C^2\mid \sigma_C^2)
}{
P_{\delta,\sigma_T,\sigma_C}(Y>c)
}.
\]
Define
\[
a=\frac{c-\delta}{\sqrt V}.
\]
Then
\[
P_{\delta,\sigma_T,\sigma_C}(Y>c)=1-\Phi(a).
\]
Ignoring constants not involving \((\delta,\sigma_T^2,\sigma_C^2)\),
\[
\boxed{
L_c(\delta,\sigma_T^2,\sigma_C^2)
\propto
\frac{
V^{-1/2}
\exp\left[
-\frac{(y-\delta)^2}{2V}
\right]
(\sigma_T^2)^{-\nu_T/2}
\exp\left[
-\frac{\nu_T s_T^2}{2\sigma_T^2}
\right]
(\sigma_C^2)^{-\nu_C/2}
\exp\left[
-\frac{\nu_C s_C^2}{2\sigma_C^2}
\right]
}{
1-\Phi(a)
}.
}
\tag{1}
\]
Equivalently,
\[
\boxed{
\begin{aligned}
\ell_c(\delta,\sigma_T^2,\sigma_C^2)
=&
-\frac12\log V
-\frac{(y-\delta)^2}{2V} \\
&-\frac{\nu_T}{2}\log\sigma_T^2
-\frac{\nu_Ts_T^2}{2\sigma_T^2} \\
&-\frac{\nu_C}{2}\log\sigma_C^2
-\frac{\nu_Cs_C^2}{2\sigma_C^2} \\
&-\log\{1-\Phi(a)\}
+\text{constant}.
\end{aligned}
}
\tag{2}
\]
This is the main difference from the equal-variance case: there is no
single pooled variance parameter. Instead, the likelihood depends
separately on \(\sigma_T^2\) and \(\sigma_C^2\).
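As an illustration, equation (2) can be evaluated directly from the summary statistics. The sketch below is a minimal Python version using only the standard library; the function names `log_upper_tail` and `cond_loglik` and the argument names are hypothetical and do not come from the accompanying R script. The switch to an asymptotic series for \(a>30\) is an added numerical safeguard (an assumption of this sketch), used because \(1-\Phi(a)\) underflows for very large \(a\).

```python
import math

def log_upper_tail(a):
    """Return log{1 - Phi(a)}.

    For a > 30 the standard asymptotic series
    1 - Phi(a) ~ phi(a)/a * (1 - 1/a^2 + 3/a^4) is used (sketch assumption),
    because erfc eventually underflows to zero.
    """
    if a > 30.0:
        return (-0.5 * a * a - math.log(a) - 0.5 * math.log(2.0 * math.pi)
                + math.log1p(-1.0 / a ** 2 + 3.0 / a ** 4))
    return math.log(0.5 * math.erfc(a / math.sqrt(2.0)))

def cond_loglik(delta, var_t, var_c, y, s2_t, s2_c, n_t, n_c, c):
    """Conditional log-likelihood of equation (2), up to an additive constant."""
    nu_t, nu_c = n_t - 1, n_c - 1
    v = var_t / n_t + var_c / n_c        # V, the variance of Y
    a = (c - delta) / math.sqrt(v)       # standardized threshold
    return (
        -0.5 * math.log(v) - (y - delta) ** 2 / (2.0 * v)
        - 0.5 * nu_t * math.log(var_t) - nu_t * s2_t / (2.0 * var_t)
        - 0.5 * nu_c * math.log(var_c) - nu_c * s2_c / (2.0 * var_c)
        - log_upper_tail(a)              # selection correction
    )
```

With a threshold far below the data (for example `c = -1e6`), the selection term vanishes and the function reduces to the unconditional log-likelihood of \((Y,S_T^2,S_C^2)\), which gives a simple sanity check.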
3. Fast simulation of summary statistics conditional on \(A=\{Y>c\}\)
To simulate selected phase 2 trials efficiently, there is no need to
generate individual observations. It is enough to simulate the
sufficient statistics
\[
(Y,S_T^2,S_C^2)
\]
directly.
Under the unequal-variance normal model,
\[
Y\sim N(\delta,V),
\qquad
V=
\frac{\sigma_T^2}{n_T}
+
\frac{\sigma_C^2}{n_C},
\]
and
\[
\frac{\nu_T S_T^2}{\sigma_T^2}\sim\chi^2_{\nu_T},
\qquad
\frac{\nu_C S_C^2}{\sigma_C^2}\sim\chi^2_{\nu_C}.
\]
Because \(Y,S_T^2,S_C^2\) are
mutually independent and the selection event depends only on \(Y\), conditioning on \(A=\{Y>c\}\) only changes the
distribution of \(Y\). The
distributions of \(S_T^2\) and \(S_C^2\) remain unchanged.
Therefore,
\[
\boxed{
Y\mid A
\sim
N(\delta,V)
\text{ truncated below at } c
}
\tag{3}
\]
and
\[
\boxed{
\frac{\nu_T S_T^2}{\sigma_T^2}\mid A
\sim
\chi^2_{\nu_T},
\qquad
\frac{\nu_C S_C^2}{\sigma_C^2}\mid A
\sim
\chi^2_{\nu_C}.
}
\tag{4}
\]
Moreover,
\[
\boxed{
Y\mid A,\quad S_T^2\mid A,\quad S_C^2\mid A
\text{ are mutually independent.}
}
\tag{5}
\]
Thus, selected summary statistics can be simulated as follows:
- Compute
\[
V=
\frac{\sigma_T^2}{n_T}
+
\frac{\sigma_C^2}{n_C},
\qquad
a=\frac{c-\delta}{\sqrt V}.
\]
- Simulate
\[
U\sim \mathrm{Uniform}(\Phi(a),1).
\]
- Set
\[
\boxed{
Y
=
\delta+\sqrt V\,\Phi^{-1}(U).
}
\tag{6}
\]
- Independently simulate
\[
W_T\sim\chi^2_{\nu_T},
\qquad
W_C\sim\chi^2_{\nu_C}.
\]
- Set
\[
\boxed{
S_T^2=\frac{\sigma_T^2}{\nu_T}W_T,
\qquad
S_C^2=\frac{\sigma_C^2}{\nu_C}W_C.
}
\tag{7}
\]
This directly simulates \((Y,S_T^2,S_C^2)\mid Y>c\). It avoids
rejection sampling and is much faster when \(P(Y>c)\) is small.
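The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the accompanying R script; the function name `simulate_selected` and its arguments are hypothetical. The clamping of \(U\) away from 0 and 1 is an added safeguard, and the inverse-CDF step can lose precision when \(\Phi(a)\) is extremely close to 1.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: provides cdf, pdf and inv_cdf

def simulate_selected(delta, var_t, var_c, n_t, n_c, c, rng=random):
    """Draw one (Y, S_T^2, S_C^2) conditional on Y > c, following equations (6)-(7)."""
    nu_t, nu_c = n_t - 1, n_c - 1
    v = var_t / n_t + var_c / n_c
    a = (c - delta) / math.sqrt(v)
    p_lo = STD_NORMAL.cdf(a)
    # U ~ Uniform(Phi(a), 1); clamp so inv_cdf receives a value strictly in (0, 1)
    u = p_lo + (1.0 - p_lo) * rng.random()
    u = min(max(u, 1e-300), 1.0 - 2.0 ** -53)
    y = delta + math.sqrt(v) * STD_NORMAL.inv_cdf(u)
    # chi-square with nu degrees of freedom = Gamma(shape nu/2, scale 2)
    w_t = rng.gammavariate(nu_t / 2.0, 2.0)
    w_c = rng.gammavariate(nu_c / 2.0, 2.0)
    return y, var_t * w_t / nu_t, var_c * w_c / nu_c
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible, for example `simulate_selected(0.0, 1.0, 1.0, 20, 20, 0.2, random.Random(1))`.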

4. Conditional density ratio of \(Y\) after selection
Recall that
\[
Y\sim N(\delta,V),
\qquad
V=
\frac{\sigma_T^2}{n_T}
+
\frac{\sigma_C^2}{n_C}.
\]
The phase 2 trial is selected only if
\[
Y>c.
\]
For fixed \((\delta,\sigma_T^2,\sigma_C^2)\), the
conditional density of \(Y\) given
\(Y>c\) is
\[
f_\delta(y\mid Y>c)
=
\frac{
V^{-1/2}\phi\left(\frac{y-\delta}{\sqrt V}\right)
}{
1-\Phi\left(\frac{c-\delta}{\sqrt V}\right)
},
\qquad y>c.
\]
For this comparison, keep \(V\)
fixed and compare \(\delta=0\) with
\(\delta>0\).
Main conclusion
Define the density ratio
\[
R(y)
=
\frac{
f_0(y\mid Y>c)
}{
f_\delta(y\mid Y>c)
},
\qquad y>c,\quad \delta>0.
\]
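Then \(R(y)\) is continuous and strictly decreasing
in \(y\). Moreover, \(R(c)>1\) and \(R(y)\to0\) as \(y\to\infty\). Therefore, there exists a
unique crossing point \(y^\star>c\) with \(R(y^\star)=1\): \(R(y)>1\) for \(c<y<y^\star\) and \(R(y)<1\) for \(y>y^\star\).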
In words, after conditioning on \(Y>c\), the model with \(\delta=0\) puts relatively more mass near
the threshold \(c\), while the model
with \(\delta>0\) puts relatively
more mass farther into the right tail.
Proof
From the conditional density formula,
\[
R(y)
=
\frac{
1-\Phi\left(\frac{c-\delta}{\sqrt V}\right)
}{
1-\Phi\left(\frac{c}{\sqrt V}\right)
}
\exp\left[
-\frac{\delta y}{V}
+
\frac{\delta^2}{2V}
\right].
\]
Hence
\[
\log R(y)
=
\text{constant}
-
\frac{\delta}{V}y.
\]
Since \(\delta>0\),
\[
\frac{d}{dy}\log R(y)
=
-\frac{\delta}{V}<0.
\]
Therefore, \(R(y)\) is strictly
decreasing in \(y\).
To show \(R(c)>1\), use
\[
\lambda(a)=\frac{\phi(a)}{1-\Phi(a)}
\]
and write
\[
\log R(c)
=
\int_0^\delta
\frac{1}{\sqrt V}
\left[
\lambda\left(\frac{c-u}{\sqrt V}\right)
-
\frac{c-u}{\sqrt V}
\right]
\,du.
\]
Since \(\lambda(a)>a\) for all
\(a\), the integrand is positive. Thus
\(R(c)>1\). Finally, since \(\log R(y)=\text{constant}-\delta y/V\), we
have \(R(y)\to0\) as \(y\to\infty\).
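Because \(\log R(y)\) is linear in \(y\), the crossing point can be written in closed form. The following worked step is a direct consequence of the expression for \(R(y)\) in the proof, obtained by setting \(\log R(y^\star)=0\) and solving for \(y^\star\):
\[
y^\star
=
\frac{\delta}{2}
+
\frac{V}{\delta}
\log\left\{
\frac{
1-\Phi\left(\frac{c-\delta}{\sqrt V}\right)
}{
1-\Phi\left(\frac{c}{\sqrt V}\right)
}
\right\}.
\]
The logarithm is positive because \(\delta>0\), and \(y^\star>c\) is equivalent to \(R(c)>1\).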
5. Score equations
Now we illustrate how to obtain MCLE using sufficient statistics
\((Y,S_T^2,S_C^2)\). Let
\[
\lambda(a)=\frac{\phi(a)}{1-\Phi(a)}
\]
be the inverse Mills ratio.
The score equation for \(\delta\)
is
\[
\boxed{
\frac{y-\delta}{\sqrt V}
=
\lambda(a).
}
\tag{8}
\]
This is the same form as in the one-sample and equal-variance
two-sample cases, except that
\[
V=
\frac{\sigma_T^2}{n_T}
+
\frac{\sigma_C^2}{n_C}.
\]
For the variance score equations, write
\[
\theta_T=\sigma_T^2,
\qquad
\theta_C=\sigma_C^2.
\]
For \(i\in\{T,C\}\), let \(n_i\), \(\nu_i\), \(s_i^2\), and \(\theta_i\) denote the corresponding
group-specific quantities. Then
\[
\boxed{
-\frac{\nu_i}{2\theta_i}
+
\frac{\nu_i s_i^2}{2\theta_i^2}
+
\frac{1}{2n_iV}
\left[
\frac{(y-\delta)^2}{V}
-1
-a\lambda(a)
\right]
=
0.
}
\tag{9}
\]
Using the \(\delta\)-score
equation,
\[
\frac{(y-\delta)^2}{V}
=
\lambda(a)^2.
\]
Therefore, the variance score equations can also be written as
\[
\boxed{
-\frac{\nu_i}{2\theta_i}
+
\frac{\nu_i s_i^2}{2\theta_i^2}
+
\frac{1}{2n_iV}
\left[
\lambda(a)^2
-1
-a\lambda(a)
\right]
=
0,
\qquad i=T,C.
}
\tag{10}
\]
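For reference, equations (8) and (9) follow from differentiating equation (2) with the chain rule. A sketch of the intermediate steps, using only the definitions above:
\[
\frac{\partial a}{\partial\delta}=-\frac{1}{\sqrt V},
\qquad
\frac{\partial a}{\partial V}=-\frac{a}{2V},
\qquad
\frac{\partial V}{\partial\theta_i}=\frac{1}{n_i},
\qquad
\frac{d}{da}\bigl[-\log\{1-\Phi(a)\}\bigr]=\lambda(a).
\]
Hence
\[
\frac{\partial\ell_c}{\partial\delta}
=
\frac{y-\delta}{V}-\frac{\lambda(a)}{\sqrt V},
\qquad
\frac{\partial\ell_c}{\partial\theta_i}
=
-\frac{\nu_i}{2\theta_i}
+
\frac{\nu_i s_i^2}{2\theta_i^2}
+
\frac{1}{n_i}
\left[
-\frac{1}{2V}
+
\frac{(y-\delta)^2}{2V^2}
-
\frac{a\lambda(a)}{2V}
\right].
\]
Setting both derivatives to zero gives equations (8) and (9).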
6. Reparameterization using \(a\)
As before,
\[
a=\frac{c-\delta}{\sqrt V}
\]
and the \(\delta\)-score equation
gives
\[
\lambda(a)=\frac{y-\delta}{\sqrt V}.
\]
Subtracting the definition of \(a\) from this equation gives
\[
\lambda(a)-a
=
\frac{y-c}{\sqrt V}.
\]
Thus
\[
\boxed{
V
=
\frac{(y-c)^2}{\{\lambda(a)-a\}^2}.
}
\tag{11}
\]
Given \(a\), the treatment effect is
recovered as
\[
\boxed{
\delta
=
c-a\sqrt V.
}
\tag{12}
\]
Equivalently,
\[
\boxed{
\delta
=
y-\sqrt V\,\lambda(a).
}
\tag{13}
\]
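Equations (11) to (13) can be checked numerically. The sketch below is a minimal Python illustration with hypothetical function names (`inv_mills`, `v_and_delta_from_a`); it assumes a moderate value of \(a\), since the direct ratio \(\phi(a)/\{1-\Phi(a)\}\) breaks down once \(1-\Phi(a)\) underflows (roughly \(a>37\)).

```python
import math

def inv_mills(a):
    """Inverse Mills ratio lambda(a) = phi(a) / {1 - Phi(a)} (moderate a only)."""
    phi = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    tail = 0.5 * math.erfc(a / math.sqrt(2.0))
    return phi / tail

def v_and_delta_from_a(a, y, c):
    """Recover V from equation (11) and delta from equation (12) for a given a."""
    gap = inv_mills(a) - a               # lambda(a) - a, always positive
    v = (y - c) ** 2 / gap ** 2          # equation (11)
    delta = c - a * math.sqrt(v)         # equation (12)
    return v, delta

# Example: equations (12) and (13) give the same delta
v, delta = v_and_delta_from_a(-1.0, 0.6, 0.2)
delta_alt = 0.6 - math.sqrt(v) * inv_mills(-1.0)   # equation (13)
```

The pair returned here satisfies the \(\delta\)-score equation (8) by construction; it says nothing yet about how \(V\) splits into \(\sigma_T^2/n_T\) and \(\sigma_C^2/n_C\).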
However, unlike the equal-variance case, \(V\) does not identify the two variance
components. It only imposes
\[
\frac{\sigma_T^2}{n_T}
+
\frac{\sigma_C^2}{n_C}
=
V.
\]
Therefore, the unequal-variance MCLE generally cannot be reduced to a
single one-dimensional equation in \(a\). The
safest implementation is to directly maximize the conditional
log-likelihood in equation (2), for example over
\[
(\delta,\log\sigma_T,\log\sigma_C).
\]
7. Behavior as \(y\to c^+\)
Now consider the case where the observed phase 2 treatment effect
barely exceeds the go/no-go threshold:
\[
y-c\downarrow 0.
\]
The same boundary mechanism still applies. The relevant solution
has
\[
\boxed{
\widehat a\to+\infty.
}
\tag{14}
\]
For large positive \(a\),
\[
\lambda(a)-a
=
\frac{1}{a}+O(a^{-3}).
\]
From equation (11),
\[
V
=
\frac{(y-c)^2}{\{\lambda(a)-a\}^2}.
\]
Therefore, near the boundary,
\[
V\sim (y-c)^2a^2.
\]
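In this regime, the variance adjustment due to selection vanishes
asymptotically, because the bracket in equation (10) satisfies
\(\lambda(a)^2-1-a\lambda(a)=\lambda(a)\{\lambda(a)-a\}-1=O(a^{-2})\). The variance estimates therefore approach the observed
sample variances: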
\[
\boxed{
\widehat\sigma_T^2\to s_T^2,
\qquad
\widehat\sigma_C^2\to s_C^2.
}
\tag{15}
\]
Define the observed Welch-type variance of \(Y\) as
\[
V_{\text{obs}}
=
\frac{s_T^2}{n_T}
+
\frac{s_C^2}{n_C}.
\]
Then
\[
\boxed{
\widehat V\to V_{\text{obs}}.
}
\tag{16}
\]
Since
\[
\widehat V
\sim
(y-c)^2\widehat a^2,
\]
we obtain
\[
\boxed{
\widehat a
\sim
\frac{\sqrt{V_{\text{obs}}}}{y-c}.
}
\tag{17}
\]
Finally,
\[
\widehat\delta
=
c-\widehat a\sqrt{\widehat V}.
\]
Using \(\widehat V\to
V_{\text{obs}}\), we obtain
\[
\boxed{
\widehat\delta
\sim
c-\frac{V_{\text{obs}}}{y-c}.
}
\tag{18}
\]
Hence,
\[
\boxed{
\widehat\delta\to-\infty
\qquad \text{as } y\to c^+.
}
\tag{19}
\]
In words, when the observed phase 2 treatment effect barely passes
the threshold, the conditional MLE attributes most of the apparent
success to the conditioning event \(Y>c\). The two variance estimates remain
close to the observed sample variances, but the treatment effect
estimate is pulled sharply downward and has no finite limit as \(y\downarrow c\).
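The approximations (17) and (18) can be illustrated numerically. The sketch below makes a simplifying assumption that is not part of the derivation: it holds \(V\) fixed at \(V_{\text{obs}}\) (the limiting value in equation (16)) and solves \(\lambda(a)-a=(y-c)/\sqrt{V_{\text{obs}}}\) by bisection, rather than computing the full MCLE. The function names and the bracketing interval are hypothetical choices.

```python
import math

def inv_mills(a):
    """Inverse Mills ratio lambda(a) = phi(a) / {1 - Phi(a)} (valid up to a of about 35)."""
    phi = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    return phi / (0.5 * math.erfc(a / math.sqrt(2.0)))

def solve_a(y, c, v, lo=-50.0, hi=35.0, iters=200):
    """Bisection root of lambda(a) - a = (y - c)/sqrt(v).

    lambda(a) - a is strictly decreasing in a, so the root is unique.
    The bracket [lo, hi] assumes the target lies between about 0.03 and 50.
    """
    target = (y - c) / math.sqrt(v)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if inv_mills(mid) - mid > target:
            lo = mid   # gap still too large: root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

c, v_obs = 0.2, 0.1    # illustrative values, e.g. s_T^2 = s_C^2 = 1, n_T = n_C = 20
rows = []
for gap in (0.2, 0.05, 0.02, 0.01):
    a_hat = solve_a(c + gap, c, v_obs)
    a_approx = math.sqrt(v_obs) / gap           # equation (17)
    delta_fixed_v = c - a_hat * math.sqrt(v_obs)
    delta_approx = c - v_obs / gap              # equation (18)
    rows.append((gap, a_hat, a_approx, delta_fixed_v, delta_approx))
```

Under these assumptions the ratio `a_hat / a_approx` moves toward 1 as the gap \(y-c\) shrinks, and both versions of \(\delta\) become large and negative, in line with equation (19).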
8. Practical implementation
The unequal-variance case is best implemented by directly maximizing
equation (2). A stable parameterization is
\[
(\delta,\log\sigma_T,\log\sigma_C),
\]
which automatically enforces \(\sigma_T>0\) and \(\sigma_C>0\).
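A minimal sketch of this direct maximization is given below in Python, using only the standard library. It is not the accompanying R script: the function names are hypothetical, the small hand-written Nelder-Mead routine is a stand-in for whatever general-purpose optimizer is actually used, and the starting point \((y,\log s_T,\log s_C)\) is an assumption of this sketch. It is intended for well-posed cases where \(y\) is comfortably above \(c\).

```python
import math

def log_upper_tail(a):
    """log{1 - Phi(a)}, with an asymptotic series where erfc would underflow."""
    if a > 30.0:
        return (-0.5 * a * a - math.log(a) - 0.5 * math.log(2.0 * math.pi)
                + math.log1p(-1.0 / a ** 2 + 3.0 / a ** 4))
    return math.log(0.5 * math.erfc(a / math.sqrt(2.0)))

def neg_cond_loglik(par, y, s2_t, s2_c, n_t, n_c, c):
    """Negative of equation (2) in the parameters (delta, log sigma_T, log sigma_C)."""
    delta, log_sd_t, log_sd_c = par
    nu_t, nu_c = n_t - 1, n_c - 1
    try:
        var_t, var_c = math.exp(2.0 * log_sd_t), math.exp(2.0 * log_sd_c)
        v = var_t / n_t + var_c / n_c
        a = (c - delta) / math.sqrt(v)
        ll = (-0.5 * math.log(v) - (y - delta) ** 2 / (2.0 * v)
              - 0.5 * nu_t * math.log(var_t) - nu_t * s2_t / (2.0 * var_t)
              - 0.5 * nu_c * math.log(var_c) - nu_c * s2_c / (2.0 * var_c)
              - log_upper_tail(a))
    except (OverflowError, ZeroDivisionError, ValueError):
        return float("inf")   # treat numerically invalid points as infeasible
    return -ll

def nelder_mead(f, x0, step=0.1, f_tol=1e-12, x_tol=1e-7, max_iter=5000):
    """Basic Nelder-Mead simplex minimizer (reflection, expansion, contraction, shrink)."""
    n = len(x0)
    simplex = [list(x0)] + [[x0[j] + (step if j == i else 0.0) for j in range(n)]
                            for i in range(n)]
    vals = [f(p) for p in simplex]
    for _ in range(max_iter):
        order = sorted(range(n + 1), key=lambda k: vals[k])
        simplex = [simplex[k] for k in order]
        vals = [vals[k] for k in order]
        size = max(abs(p[j] - simplex[0][j]) for p in simplex for j in range(n))
        if vals[-1] - vals[0] < f_tol and size < x_tol:
            break
        worst = simplex[-1]
        cen = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]
        refl = [2.0 * cen[j] - worst[j] for j in range(n)]
        f_refl = f(refl)
        if f_refl < vals[0]:
            expd = [3.0 * cen[j] - 2.0 * worst[j] for j in range(n)]
            f_expd = f(expd)
            simplex[-1], vals[-1] = (expd, f_expd) if f_expd < f_refl else (refl, f_refl)
        elif f_refl < vals[-2]:
            simplex[-1], vals[-1] = refl, f_refl
        else:
            con = [0.5 * (cen[j] + worst[j]) for j in range(n)]
            f_con = f(con)
            if f_con < vals[-1]:
                simplex[-1], vals[-1] = con, f_con
            else:  # shrink every vertex toward the current best one
                best = simplex[0]
                simplex = [best] + [[0.5 * (best[j] + p[j]) for j in range(n)]
                                    for p in simplex[1:]]
                vals = [vals[0]] + [f(p) for p in simplex[1:]]
    k = min(range(n + 1), key=lambda i: vals[i])
    return simplex[k], vals[k]

def fit_mcle(y, s2_t, s2_c, n_t, n_c, c):
    """Return (delta_hat, sigma_T^2_hat, sigma_C^2_hat) by maximizing equation (2)."""
    start = [y, 0.5 * math.log(s2_t), 0.5 * math.log(s2_c)]
    par, _ = nelder_mead(lambda p: neg_cond_loglik(p, y, s2_t, s2_c, n_t, n_c, c), start)
    return par[0], math.exp(2.0 * par[1]), math.exp(2.0 * par[2])
```

For example, `fit_mcle(1.0, 1.0, 1.0, 20, 20, 0.2)` returns an estimate of \(\delta\) slightly below the observed \(y=1\). When \(y\) is close to \(c\), the estimate of \(\delta\) can drift toward very negative values, as described in Section 7, and the optimizer output should then be inspected rather than trusted.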
The same fast simulation idea remains valid:
\[
Y
=
\delta+\sqrt V\,\Phi^{-1}(U),
\qquad
U\sim\mathrm{Uniform}\left(
\Phi\left(\frac{c-\delta}{\sqrt V}\right),
1
\right),
\]
where
\[
V=\frac{\sigma_T^2}{n_T}+\frac{\sigma_C^2}{n_C}.
\]
Independently,
\[
S_T^2=\frac{\sigma_T^2}{\nu_T}W_T,
\qquad
S_C^2=\frac{\sigma_C^2}{\nu_C}W_C,
\qquad
W_T\sim\chi^2_{\nu_T},
\quad
W_C\sim\chi^2_{\nu_C}.
\]
The R script accompanying this note implements this approach and
reports both the observed selected estimate and the unequal-variance
MCLE across simulation scenarios.