The main challenge in defining a “non-informative” prior is that the concept of uniformity (e.g., saying “all values of \(\theta\) are equally likely”) is not invariant to transformation.
Example:
Parameter 1: Probability of success, \(\theta\). If I say \(\pi(\theta)=1\) for \(\theta \in [0,1]\), this is a uniform prior.
Parameter 2: Log-odds, \(\phi = \log\left(\frac{\theta}{1-\theta}\right)\). If \(\theta\) is uniform, what is the prior for \(\phi\)? By the change-of-variables formula, \(\pi(\phi) = \pi(\theta) \left| \frac{d\theta}{d\phi} \right| = \frac{e^{\phi}}{(1+e^{\phi})^2}\), which is the standard logistic density, not a uniform one.
So, which parameter truly gets the “non-informative” uniform prior? Saying \(\pi(\theta) \propto 1\) is arbitrary. A different researcher, working naturally with \(\phi\), would get different results if they used \(\pi(\phi) \propto 1\). A good non-informative prior should not depend on the choice of parameterization.
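The non-invariance in this example can be checked numerically. The following is a minimal sketch using only the Python standard library: it draws \(\theta\) uniformly, maps each draw to log-odds, and compares the empirical CDF of \(\phi\) with the standard logistic CDF. The helper names (`logit`, `logistic_cdf`, `empirical_cdf`), the seed, and the sample size are illustrative choices, not part of the original argument.

```python
import math
import random

def logit(theta):
    """Log-odds transform: phi = log(theta / (1 - theta))."""
    return math.log(theta / (1.0 - theta))

def logistic_cdf(phi):
    """CDF of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-phi))

def empirical_cdf(samples, point):
    """Fraction of samples at or below `point`."""
    return sum(s <= point for s in samples) / len(samples)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Draw theta uniformly on (0, 1); drop an exact 0, where the logit is undefined.
thetas = [u for u in (rng.random() for _ in range(100_000)) if 0.0 < u < 1.0]
phis = [logit(t) for t in thetas]

# A uniform prior on theta should give a logistic (not flat) distribution on phi.
for point in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"phi <= {point:+.1f}: "
          f"empirical {empirical_cdf(phis, point):.3f}, "
          f"logistic {logistic_cdf(point):.3f}")
```

If the two columns agree closely, the draws of \(\phi\) pile up around 0 and thin out in the tails, which is exactly what a flat prior on \(\phi\) would not do.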
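Jeffreys’ prior, \(\pi(\theta) \propto \sqrt{\mathcal{I}(\theta)}\), where \(\mathcal{I}(\theta)\) is the Fisher information, is designed to solve this exact problem. The rule is invariant under reparameterization: applying it in any smooth one-to-one parameterization gives the same prior once the change of variables is accounted for.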
Here’s the “why” behind the formula:
We want a prior for \(\theta\) that, after we transform to some special parameter \(\phi\), becomes truly uniform (\(\pi(\phi) \propto 1\)). This special parameter \(\phi\) is the one for which the Fisher information is constant (i.e., \(\mathcal{I}(\phi) = \text{constant}\)).
Let’s define a new parameter \(\phi = g(\theta)\) such that the Fisher information for \(\phi\) is 1 (or any constant). It turns out that the function \(g(\theta) = \int \sqrt{\mathcal{I}(\theta)} \, d\theta\) achieves this.
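Why? Because the Fisher information transforms in a specific way. By the chain rule, the score satisfies \(\frac{\partial}{\partial\phi} \log f = \frac{\partial}{\partial\theta} \log f \cdot \frac{d\theta}{d\phi}\); squaring and taking expectations shows that if \(\phi = g(\theta)\), then: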
\[ \mathcal{I}(\phi) = \mathcal{I}(\theta) \left( \frac{d\theta}{d\phi} \right)^2 \]
If we choose \(\phi = \int \sqrt{\mathcal{I}(\theta)} \, d\theta\), then \(\frac{d\phi}{d\theta} = \sqrt{\mathcal{I}(\theta)}\), so \(\frac{d\theta}{d\phi} = 1 / \sqrt{\mathcal{I}(\theta)}\). Plugging this in:
\[ \mathcal{I}(\phi) = \mathcal{I}(\theta) \times \left( \frac{1}{\sqrt{\mathcal{I}(\theta)}} \right)^2 = \mathcal{I}(\theta) \times \frac{1}{\mathcal{I}(\theta)} = 1 \]
So, in the \(\phi\) parameterization, the Fisher information is constant (1).
Jeffreys’ prior for \(\phi\) is \(\pi(\phi) \propto \sqrt{\mathcal{I}(\phi)} = \sqrt{1} = 1\). This is a uniform (and therefore “non-informative” in the intuitive sense) prior for \(\phi\).
Now, what is the induced prior for \(\theta\)? Using the change-of-variables rule from \(\phi\) back to \(\theta\):
\[ \pi(\theta) = \pi(\phi) \cdot \left| \frac{d\phi}{d\theta} \right| \propto 1 \cdot \sqrt{\mathcal{I}(\theta)} \]
Thus, \(\pi(\theta) \propto \sqrt{\mathcal{I}(\theta)}\) is the prior for \(\theta\) that corresponds to a uniform prior for the “variance-stabilizing” parameter \(\phi\).
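The same transformation rule also gives the invariance property directly, for any smooth one-to-one reparameterization and not only the special \(\phi\). As a short sketch, write \(\psi = h(\theta)\) for an arbitrary such reparameterization (the symbols \(\psi\) and \(h\) are introduced here only for illustration). Starting from Jeffreys’ prior in \(\theta\) and changing variables:
\[ \pi(\psi) = \pi(\theta) \left| \frac{d\theta}{d\psi} \right| \propto \sqrt{\mathcal{I}(\theta)} \left| \frac{d\theta}{d\psi} \right| = \sqrt{\mathcal{I}(\theta) \left( \frac{d\theta}{d\psi} \right)^2} = \sqrt{\mathcal{I}(\psi)} \]
So the prior induced on \(\psi\) is the same one that Jeffreys’ rule would assign if it were applied directly in the \(\psi\) parameterization.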
For a binomial distribution, \(f(x \mid \theta) = \binom{n}{x} \theta^x (1-\theta)^{n-x}\).
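Fisher Information: For \(n\) trials, \(\mathcal{I}(\theta) = \frac{n}{\theta(1-\theta)}\). The factor \(n\) is a constant that drops out under proportionality, so we take \(n = 1\) below and write \(\mathcal{I}(\theta) = \frac{1}{\theta(1-\theta)}\).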
Jeffreys’ Prior: \(\pi(\theta) \propto \sqrt{\mathcal{I}(\theta)} = \sqrt{\frac{1}{\theta(1-\theta)}} = [\theta(1-\theta)]^{-1/2}\).
Identify the distribution: This matches the kernel of a Beta distribution with parameters \(\alpha = 1/2\) and \(\beta = 1/2\).
Find the special parameter \(\phi\): \(\phi = \int \sqrt{\mathcal{I}(\theta)} \, d\theta = \int \frac{1}{\sqrt{\theta(1-\theta)}} \, d\theta = 2\arcsin(\sqrt{\theta})\).
Here, \(\phi\) is twice the arcsine of the square root of \(\theta\), and it ranges over \([0, \pi]\) as \(\theta\) ranges over \([0, 1]\).
For \(\phi\), the prior is \(\pi(\phi) \propto 1\) on \([0, \pi]\). Under Jeffreys’ rule, saying “I know nothing about \(\theta\)” means placing a uniform prior on the transformed parameter \(\phi = 2\arcsin(\sqrt{\theta})\), not on \(\theta\) itself.
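The binomial steps above can be sanity-checked numerically. The sketch below, using only the Python standard library, does three things: it computes the Fisher information exactly as the expected squared score (for \(n\) trials this comes out as \(n/[\theta(1-\theta)]\), which reduces to \(1/[\theta(1-\theta)]\) for a single trial), it applies the transformation rule to confirm that the information in \(\phi\) is 1 for a single trial, and it draws \(\theta\) from Beta(1/2, 1/2) to see whether \(\phi\) looks uniform on \([0, \pi]\). The function names, the seed, the sample size, and the choice of four bins are illustrative assumptions.

```python
import math
import random

def binomial_fisher_information(n, theta):
    """Exact E[(d/dtheta log f(X | theta))^2] for X ~ Binomial(n, theta)."""
    total = 0.0
    for x in range(n + 1):
        pmf = math.comb(n, x) * theta**x * (1.0 - theta)**(n - x)
        score = x / theta - (n - x) / (1.0 - theta)  # derivative of the log-likelihood
        total += pmf * score**2
    return total

def to_phi(theta):
    """Transform phi = 2 * arcsin(sqrt(theta)), taking values in [0, pi]."""
    return 2.0 * math.asin(math.sqrt(theta))

def fisher_information_in_phi(n, theta):
    """Apply I(phi) = I(theta) * (dtheta/dphi)^2, where dtheta/dphi = sqrt(theta * (1 - theta))."""
    dtheta_dphi = math.sqrt(theta * (1.0 - theta))
    return binomial_fisher_information(n, theta) * dtheta_dphi**2

# Single trial (n = 1): information in theta varies, information in phi should stay at 1.
for theta in (0.1, 0.5, 0.9):
    print(theta,
          binomial_fisher_information(1, theta),
          fisher_information_in_phi(1, theta))

# Draw theta from Beta(1/2, 1/2), the Jeffreys prior, and map each draw to phi.
rng = random.Random(0)  # fixed seed so the run is repeatable
phis = [to_phi(rng.betavariate(0.5, 0.5)) for _ in range(100_000)]

# If phi is uniform on [0, pi], each of 4 equal-width bins holds about 25% of the draws.
bin_counts = [0, 0, 0, 0]
for phi in phis:
    bin_counts[min(int(phi / (math.pi / 4)), 3)] += 1
bin_fractions = [count / len(phis) for count in bin_counts]
print(bin_fractions)
```

Roughly equal bin fractions are consistent with the claim that a Beta(1/2, 1/2) prior on \(\theta\) and a uniform prior on \(\phi\) describe the same belief in two parameterizations.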