Frequentist Modelling: Estimators

Author

D.McCabe

Published

September 23, 2024

Overview

Estimators are the building blocks of Frequentist modelling. Estimators cannot exist in isolation; they are always tied to a statistical model. They are statistics that approximate model or population parameters, or the confidence intervals of those parameters. Estimation first involves sampling data from a hypothesised parametric distribution (the statistical model) with unknown parameter(s) \(\theta_0\). We then use an estimator \(\hat{\theta}\) to compute a point estimate. Because the estimate depends on the sample data, it is only an approximation, \(\hat{\theta}\approx\theta_0\).

The accuracy of the estimate depends on both the systematic bias and the variance of the chosen estimator. Generally a trade-off is required to balance the two and minimise the overall MSE.

  • Bias: estimators are generally asymptotically unbiased under common conditions (i.e. estimator consistency); this is an assumption upon which confidence interval theory rests.

  • Variance: The standard error of an estimator can be derived analytically or numerically (e.g., via bootstrapping) by estimating the sampling distribution of the estimator. The standard error (and associated p-values) derived from the sampling distribution tells us about the variability of the estimator, regardless of the specific estimated value.

  • Confidence Intervals: These are derived from the sampling distribution and provide a range of values that, with a specified confidence level (e.g., 95%), would contain the true parameter in repeated samples. Although we centre the interval on the estimated parameter, the frequentist interpretation is that over many repeated experiments, a given proportion (e.g., 95%) of such intervals will contain the true parameter value. Likewise, if we were to centre an interval of the same width on the true population value, repeated estimates would fall inside it with the same probability.

Note

An unbiased estimator of \(\mu\), i.e. \(E[\bar{x}]=\mu\), isn’t necessarily the best estimator, as in certain contexts unbiasedness is not the priority (generally we minimise MSE rather than bias or variance individually). In Bayesian statistics, where a single true population parameter is not defined, bias is likewise undefined.

  • Reducing bias often increases variance
  • Reducing variance often increases bias
Note

The sampling distribution of a statistic is the distribution of the statistic’s output value. It is distinct from the population distribution, which describes the random variable that generates the input data (the data-generating function or process). In this document, and in statistical practice generally, sampling distribution always refers to the distribution of the statistic (or estimator), not the distribution of the raw data.

Estimators in Statistical Models

A complete set of estimators defines a statistical model, collectively characterising the data-generating process (McCabe, 2024d).

The model functions as a consistency framework, where agreement with observed data imposes constraints among the estimators (McCabe, 2024b).

  • In regression modelling: In a second-order polynomial model, the intercept, linear, and quadratic estimators are jointly constrained by the data—changing one typically alters the others due to the model’s structure.
  • In distributional modelling: In a normal distribution model, the mean and standard deviation estimators are similarly interdependent, jointly determined by the data.

Residual and consumed degrees of freedom quantify model flexibility and constraint respectively. Typically, residual degrees of freedom are consumed incrementally, often one per parameter, during sequential estimation of model terms. However, additional model constraints, such as those imposed in regularised regression, can further reduce the effective degrees of freedom (McCabe, 2024e).
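As a small illustration (a sketch in base R with simulated data; the model and numbers are mine, not from the references): fitting a second-order polynomial consumes three degrees of freedom, one per estimated coefficient, leaving \(n-3\) residual degrees of freedom.

# Sketch: residual degrees of freedom in a quadratic regression (simulated data)
set.seed(1)
n <- 30
x <- runif(n)
y <- 1 + 2 * x - 3 * x^2 + rnorm(n, sd = 0.5)

fit <- lm(y ~ x + I(x^2))  # three estimated coefficients: intercept, linear, quadratic
df.residual(fit)           # n - 3 = 27 residual degrees of freedom remain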

Statistical Model Notation Examples

Basic i.i.d. Model

We express a basic i.i.d. model in terms of its estimators and typically box the estimators to group them as a model:

\[ \boxed{ \begin{aligned} \hat{\mu} &= \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i \\ \hat{\sigma}^2 &= \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2 \quad \text{(MLE)} \\ S^2 &= \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2 \quad \text{(Unbiased)} \end{aligned} } \]
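For example (a minimal sketch with simulated data of my choosing), the three estimators above can be computed directly in base R; note that var() uses the \(n-1\) divisor:

# Sketch: point estimates for the basic i.i.d. model (simulated data)
set.seed(1)
x <- rnorm(20, mean = 5, sd = 2)
n <- length(x)

mu_hat <- mean(x)                 # estimate of mu
s2_unb <- var(x)                  # unbiased estimator (divisor n - 1)
s2_mle <- var(x) * (n - 1) / n    # MLE (divisor n)
c(mu_hat = mu_hat, s2_mle = s2_mle, s2_unbiased = s2_unb)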

Regression Example

We express a basic model in terms of its parameterised constraint equation as well as its estimators and any assumptions:

\[ \boxed{ \begin{aligned} \text{Model:} \quad & Y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I) \\ \text{Estimator of } \beta: \quad & \hat{\beta} = (X^\top X)^{-1} X^\top Y \\ \text{Estimator of } \sigma^2: \quad & \hat{\sigma}^2 = \frac{1}{n - p} \| Y - X\hat{\beta} \|^2 \end{aligned} } \]
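A quick check (a sketch with simulated data; the design matrix and coefficients are my own choices) that the closed-form estimators above agree with R’s lm():

# Sketch: closed-form OLS estimators vs lm() (simulated data)
set.seed(1)
n <- 50; p <- 2
X <- cbind(1, runif(n))                       # design matrix including an intercept column
beta_true <- c(1, 3)
Y <- drop(X %*% beta_true + rnorm(n, sd = 0.7))

beta_hat   <- solve(t(X) %*% X, t(X) %*% Y)   # (X'X)^{-1} X'Y
sigma2_hat <- sum((Y - X %*% beta_hat)^2) / (n - p)

fit <- lm(Y ~ X - 1)                          # same model via lm (intercept already in X)
cbind(closed_form = beta_hat, lm = coef(fit))
c(closed_form = sigma2_hat, lm = summary(fit)$sigma^2)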

Properties of Estimators

A model’s output can vary significantly because its estimates depend not only on the sample data but also on the properties of the estimators themselves. Estimators are characterised by properties such as bias, variance, and higher-order moments and behaviours (Wikipedia contributors, 2024). No estimator is perfect; its performance characteristics involve trade-offs and must be suited to:

  • Data conditions: outliers, noise, sample size, etc.
  • Goals and priorities: prediction accuracy, low variance, unbiasedness, interpretability, etc.
  • Violated assumptions: an estimator that performs well under one model may fail under another.

An estimator \(\hat{\theta}\) is a random variable: a function with the sample data as input. The mean squared error (MSE) is the expected square of the estimation error: \[\text{MSE}(\hat{\theta})=E\left[\left(\hat{\theta}-\theta\right)^2\right]\]

The MSE measures estimator quality; furthermore, it can be decomposed into bias and variance: \[\text{MSE}(\hat{\theta}) = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta})\]

The bias is the difference between the expected value \(E[\hat{\theta}]\) of the estimator and the true parameter value \(\theta\): \[\text{Bias}(\hat{\theta})=E[\hat{\theta}]-\theta\]

As a random variable, every estimator has variance, regardless of bias. The variance of an estimator is the measure of how much the estimator would vary across different input samples: \[\text{Var}(\hat{\theta})=E[(\hat{\theta}-E[\hat{\theta}])^2]\]
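The decomposition above is easy to check by simulation (a sketch, assuming normal data and a small sample size of my choosing); it also illustrates the bias-variance trade-off, since the biased \(n\)-divisor variance estimator ends up with a lower MSE than the unbiased one here:

# Sketch: MSE = Bias^2 + Var for two variance estimators (simulated normal data)
set.seed(1)
n <- 5; sigma2 <- 1; n_trials <- 100000

est_mle <- est_unb <- numeric(n_trials)
for (i in 1:n_trials) {
  x <- rnorm(n, sd = sqrt(sigma2))
  est_unb[i] <- var(x)                 # divisor n - 1 (unbiased, higher variance)
  est_mle[i] <- var(x) * (n - 1) / n   # divisor n (biased, lower variance)
}

report <- function(est) {
  bias <- mean(est) - sigma2
  c(bias2_plus_var = bias^2 + var(est), mse = mean((est - sigma2)^2))
}
rbind(mle = report(est_mle), unbiased = report(est_unb))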

A consistent estimator converges on the true value with increasing sample size. The estimator \(\hat{\theta}\) of a model parameter \(\theta\) is consistent if \[\lim_{n\to\infty} \text{Bias}(\hat{\theta})=0\quad\text{and}\quad\lim_{n\to\infty} \text{Var}(\hat{\theta})=0\]

Consistency can be proven using laws of large numbers.

Example: consistency of the sample variance for a simple i.i.d. model

Let \(X_1, X_2, \ldots, X_n\) be i.i.d. random variables with:

  • mean \(\mu = \mathbb{E}[X_i]\)
  • variance \(\sigma^2 = \text{Var}(X_i)\), and define…
    • \(S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2\)
    • \(\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i\).

By the strong law of large numbers we have almost sure convergence:

  1. \(\bar{X}_n \xrightarrow{a.s.} \mu\), and likewise
  2. \(\frac{1}{n}\sum_{i=1}^n X_i^2 \xrightarrow{a.s.} \mathbb{E}[X^2]\).

Now by substituting this into the sample variance formula: \[ \begin{align} S_n^2 &= \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2 \\ &= \frac{1}{n-1} \sum_{i=1}^n \left( X_i^2 - 2X_i\bar{X}_n + \bar{X}_n^2 \right) \\ &= \frac{1}{n-1} \left( \sum_{i=1}^n X_i^2 - 2\bar{X}_n \sum_{i=1}^n X_i + n\bar{X}_n^2 \right) \\ &= \frac{1}{n-1} \left( \sum_{i=1}^n X_i^2 - 2n\bar{X}_n^2 + n\bar{X}_n^2 \right) \\ &= \frac{1}{n-1} \left( \sum_{i=1}^n X_i^2 - n\bar{X}_n^2 \right) \\ &= \frac{n}{n-1} \left( \frac{1}{n} \sum_{i=1}^n X_i^2 - \bar{X}_n^2 \right) \end{align} \]

Now, take the limit of each of the terms as \(n\) approaches infinity:

  • \(\frac{n}{n-1} \to 1\)
  • \(\frac{1}{n} \sum X_i^2 \xrightarrow{a.s.} \mathbb{E}[X^2]\)
  • \(\bar{X}_n^2 \xrightarrow{a.s.} \mu^2\)

Therefore: \(S_n^2 \xrightarrow{a.s.} \mathbb{E}[X^2] - \mu^2 = \sigma^2\)

…the sample variance is a consistent estimator of the population variance.
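A quick numerical illustration of this convergence (a sketch with simulated normal data and sample sizes of my choosing):

# Sketch: the sample variance converging to the true variance as n grows
set.seed(1)
sigma2 <- 4
for (n in c(10, 100, 10000, 1000000)) {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  cat(sprintf("n = %7.0f   S_n^2 = %.4f\n", n, var(x)))
}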


Among all unbiased estimators, the one with the smallest variance is called efficient. The relative efficiency of two estimators \(\hat{\theta}_1\) and \(\hat{\theta}_2\) is determined using: \[\text{Eff}(\hat{\theta}_1,\hat{\theta}_2)=\frac{\text{Var}(\hat{\theta}_1)}{\text{Var}(\hat{\theta}_2)}\qquad\text{where}\: \begin{cases} \text{Eff}>1,\:\text{then }\hat{\theta}_2\text{ is more efficient}\\ \text{Eff}<1,\:\text{then }\hat{\theta}_1\text{ is more efficient}\\ \end{cases}\]
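For example (a sketch by simulation; the sample size is my own choice), for normal data the sample median is a less efficient estimator of \(\mu\) than the sample mean; the ratio of their sampling variances is roughly \(2/\pi\approx0.64\):

# Sketch: relative efficiency of the sample mean vs the sample median (normal data)
set.seed(1)
n <- 50; n_trials <- 50000
means <- medians <- numeric(n_trials)
for (i in 1:n_trials) {
  x <- rnorm(n)
  means[i]   <- mean(x)
  medians[i] <- median(x)
}
var(means) / var(medians)   # Eff < 1 (about 2/pi), so the mean is the more efficient estimator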

Confidence Interval Estimation

Diatribe: Confidence intervals are well-defined, provide a way to compare results across studies, and are more informative than point estimates in NHST (McCabe (2024d)); however, they are conceptually misleading and IMO visually uninformative. The confidence interval is random (its limits are defined by a pair of random variables): a repeat experiment with different observed data will produce a new interval. Monte Carlo simulation offers an alternative and, IMO, for many practical purposes preferable way to gauge variance in a model.

Formal Definition: A \((1-\alpha)\) confidence interval for \(\theta\) is an interval statistic \(I_x\) such that: \[P\left(I_x\:\text{contains}\:\theta_0|\theta=\theta_0\right)=1-\alpha\qquad\text{for all possible values of}\:\theta_0\]

Practice Applet

This applet runs simulations:

  • Runs \(N\) trials (here \(N=1\)), each trial drawing \(n\) samples from a \(X_i\sim N(\mu,\sigma^2)\) distribution
  • for each trial it shows the sample mean \(\bar{x}\) with the trial’s level-\(c\) confidence interval computed from the \(z\) statistic (known variance)
  • and the sample mean \(\bar{x}\) with the trial’s level-\(c\) confidence interval computed from the \(t\) statistic (unknown variance)
  • I’m not exactly sure what is being shown when the number of trials is increased.

Key takeaway:

MIT CI Applet
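A rough re-creation of what I understand the applet to be doing (a sketch, assuming normal data, a known \(\sigma\) for the \(z\) interval and the sample \(S\) for the \(t\) interval; the parameter values are my own): over many repeated trials, roughly a proportion \(c\) of the intervals cover the true mean.

# Sketch: coverage of z and t confidence intervals over repeated trials
set.seed(1)
n <- 10; mu <- 5; sigma <- 2; conf <- 0.95; n_trials <- 100000
alpha <- 1 - conf

z_cover <- t_cover <- logical(n_trials)
for (i in 1:n_trials) {
  x <- rnorm(n, mean = mu, sd = sigma)
  xbar <- mean(x); s <- sd(x)
  z_half <- qnorm(1 - alpha/2) * sigma / sqrt(n)       # half-width, known variance
  t_half <- qt(1 - alpha/2, df = n - 1) * s / sqrt(n)  # half-width, estimated variance
  z_cover[i] <- abs(xbar - mu) <= z_half
  t_cover[i] <- abs(xbar - mu) <= t_half
}
c(z_coverage = mean(z_cover), t_coverage = mean(t_cover))  # both close to 0.95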

Construction Examples

Mean Estimation of i.i.d. Data with Known Variance

The confidence intervals are calculated using the z-distribution if the variance is known: \[ \mu\in\left[\:\bar{X}-\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}},\quad\bar{X}+\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\:\right] \] …more commonly the variance is estimated and we use the t-distribution with the sample standard deviation \(S\): \[ \mu\in\left[\:\bar{X}-\frac{t_{n-1,\alpha/2}\cdot S}{\sqrt{n}},\quad\bar{X}+\frac{t_{n-1,\alpha/2}\cdot S}{\sqrt{n}}\:\right] \]

Data \(x_1, x_2,\dots,x_n\) is i.i.d.; \(\mu\) is unknown (to be estimated by \(\hat{\mu}\)) and \(\sigma\) is known.

Calculate the sample mean as a random variable: \[\bar{X}=\frac{1}{n}\sum_{i=1}^n{X_i}\] In accordance with the CLT (see the statement of the central limit theorem, McCabe, 2024c):

  • the mean of the sum \(\mathbb{E}\left(\sum_{i=1}^n{X_i}\right) = n\mu\), therefore \(\mathbb{E}[\bar{X}]=\mu\)
  • variance scaling: \(\text{Var}\left(aX_i\right) = a^2\text{Var}(X_i)\) (See proof of “Scaling for independent events”), \(\Rightarrow\text{Var}\left(\frac{1}{n}X_i\right) = \frac{1}{n^2}\sigma^2\)
  • the variance of a sum of independent variables: \(\text{Var}\left(\sum{Y_i}\right) = \sum{\text{Var}(Y_i)}=n\cdot\text{Var}(Y_i)\) (See proof of “Linearity for independent events”)
  • scaling and summing… \[ \begin{align} \text{Var}\left(\bar{X}\right)&=\text{Var}\left(\frac{1}{n}\sum_{i=1}^n{X_i}\right)\\ \text{Var}\left(\bar{X}\right)&=\frac{1}{n^2}\text{Var}\left(\sum_{i=1}^n{X_i}\right)\quad\text{from scaling}\\ &=\frac{1}{n^2}\left(\sum_{i=1}^n\text{Var}\left(X_i\right)\right)\quad\text{from summing/linearity}\\ &=\frac{1}{n^2}\cdot\quad n\left(\sigma^2\right)\\ &=\frac{\sigma^2}{n}\\ \:\\ \therefore \text{SD}\left(\bar{X}\right)&=\frac{\sigma}{\sqrt{n}} \end{align} \] In conclusion, this follows a normal distribution, \(\bar{X}\sim N(\mu,\frac{\sigma^2}{n})\).

We standardise the \(\bar{X}\) distribution (note that for the confidence interval bounds we are simply converting an arbitrary normal statistic into a z statistic):

\[N(0,1)=\boxed{Z=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}}\] …and we can readily see that the expressions for the confidence interval bounds are simply an “arbitrary normal statistic” converted into a z statistic: \[\mu\in\left[\:\bar{X}-x_{\alpha/2},\quad\bar{X}+x_{\alpha/2}\:\right]\\ \Rightarrow \mu\in\left[\:\bar{X}-\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}},\quad\bar{X}+\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\:\right] \]


Now we calculate the critical values \(\pm z_{\alpha/2}\) for the two-tailed confidence interval using the desired significance level \(\alpha\), then isolate \(\mu\) in the middle of the inequality by algebraic pivoting: \[ \begin{align} P\left(-z_{\alpha/2}\le Z \le z_{\alpha/2}\right)&=1-\alpha \\ P\left(-z_{\alpha/2}\le \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right)&=1-\alpha\\ P\left(-\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\le\bar{X}-\mu \le \frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\right)&=1-\alpha\\ P\left(-\bar{X}-\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\le-\mu \le -\bar{X}+\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\right)&=1-\alpha\\ P\left(\bar{X}+\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\ge\mu \ge \bar{X}-\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\right)&=1-\alpha\\ P\left(\bar{X}-\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\le\mu \le \bar{X}+\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\right)&=1-\alpha \end{align} \]

In this final expression we clearly see the confidence interval:

\[\mu\in\left[\:\bar{X}-\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}},\quad\bar{X}+\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\:\right]\] or

\[\mu\in\bar{X}\pm\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\]
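Putting the formula to work (a sketch with simulated data and a known \(\sigma\); the numbers are my own):

# Sketch: 95% z-interval for the mean with known sigma (simulated data)
set.seed(1)
sigma <- 1.5; n <- 25; alpha <- 0.05
x <- rnorm(n, mean = 10, sd = sigma)

xbar <- mean(x)
half <- qnorm(1 - alpha/2) * sigma / sqrt(n)
c(lower = xbar - half, upper = xbar + half)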

Variance Estimation of i.i.d. Data

The confidence intervals are calculated using the chi squared distribution: \[\sigma_0^2\in\left[\:\frac{\nu\cdot \hat{\sigma}^2}{c_{\alpha/2}},\frac{\nu\cdot \hat{\sigma}^2}{c_{1-\alpha/2}}\:\right] \]


The sampling distributions of the variance estimators were derived in Study 1 of the Appendix:

\[\boxed{\hat{\sigma}^2_{\text{known mean}}\sim\frac{\sigma^2}{n}\chi^2_{n}}\qquad\boxed{\hat{\sigma}^2_{\text{unbiased}}\sim\frac{\sigma^2}{n-1}\chi^2_{n-1}}\qquad\hat{\sigma}^2_{general}\sim\frac{\sigma^2}{\nu}\chi^2_\nu\] So given a null hypothesis \(H_0:\sigma=\sigma_0\), the test statistic \(\boxed{c=\frac{\nu\cdot s^2}{\sigma_0^2}}\) leads us not to reject \(H_0\) when:

\[ c_{1-\alpha/2}<\frac{\nu\cdot s^2}{\sigma_0^2}<c_{\alpha/2}\\ \frac{c_{1-\alpha/2}}{\nu\cdot s^2}<\frac{1}{\sigma_0^2}<\frac{c_{\alpha/2}}{\nu\cdot s^2}\\ \frac{\nu\cdot s^2}{c_{1-\alpha/2}}>\sigma_0^2>\frac{\nu\cdot s^2}{c_{\alpha/2}} \] Therefore the non-rejection region is given by: \[\sigma_0^2\in\left[\:\frac{\nu\cdot \hat{\sigma}^2}{c_{\alpha/2}},\frac{\nu\cdot \hat{\sigma}^2}{c_{1-\alpha/2}}\:\right] \]
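As a sketch (simulated normal data of my own choosing, with \(\nu=n-1\)), the interval can be computed directly from the chi-squared quantiles:

# Sketch: 95% chi-squared confidence interval for the variance (simulated data)
set.seed(1)
n <- 20; alpha <- 0.05
x <- rnorm(n, mean = 0, sd = 2)          # true variance = 4

nu   <- n - 1
s2   <- var(x)                           # unbiased sample variance
c_hi <- qchisq(1 - alpha/2, df = nu)     # upper-tail quantile (gives the lower bound)
c_lo <- qchisq(alpha/2,     df = nu)     # lower-tail quantile (gives the upper bound)
c(lower = nu * s2 / c_hi, upper = nu * s2 / c_lo)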

Mean Estimation of i.i.d. Data with Unknown Variance

Warning

Shouldn’t this be \(n-1\) DoF? (It is: one degree of freedom is consumed estimating the mean, so both distributions below use \(n-1\).)

Here:

  • the sampling distribution of the sample mean follows a scaled and shifted Student’s t distribution: \[\bar{X}\sim \mu+\frac{S}{\sqrt{n}}t_{n-1}\]
  • the sampling distribution of the sample variance follows a scaled chi-squared distribution: \[S^2\sim\frac{\sigma^2}{(n-1)}\chi^2_{n-1}\]

The confidence intervals are calculated using Student’s t-distribution and the sample variance: \[\mu\in\left[\:\bar{X}-\frac{t_{n-1,\alpha/2}\cdot S}{\sqrt{n}},\quad\bar{X}+\frac{t_{n-1,\alpha/2}\cdot S}{\sqrt{n}}\:\right] \]

Data \(x_1, x_2,\dots,x_n\) is i.i.d.; both \(\mu\) and \(\sigma\) are unknown.

Calculate the sample mean as a random variable, which follows a normal distribution in accordance with the CLT (McCabe, 2024c): \[\bar{X}=\frac{1}{n}\sum_{i=1}^n{X_i}\] The sample variance is given by: \[ S^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2\] We can standardise the \(\bar{X}\) distribution, but it doesn’t quite follow a z statistic due to the \(S\) estimator in the denominator: \(Z\ne\frac{\bar{X}-\mu}{S/\sqrt{n}}\). The numerator alone gives us a z-distribution while the denominator gives us a chi-squared distribution; see the sampling distribution histogram of the sample variance.

Instead the normalised mean distribution follows a Student’s t distribution (McCabe, 2024a), that is, a normal-like distribution with heavier tails: \[\boxed{t_{n-1}=\frac{\bar{X}-\mu}{S/\sqrt{n}}}\]

Tip

We can readily see that the expressions for the confidence interval bounds are simply an “arbitrary normal statistic” converted into a t statistic: \[\mu\in\left[\:\bar{X}-x_{\alpha/2},\quad\bar{X}+x_{\alpha/2}\:\right]\\ \Rightarrow \mu\in\left[\:\bar{X}-\frac{t_{\alpha/2}\cdot S}{\sqrt{n}},\quad\bar{X}+\frac{t_{\alpha/2}\cdot S}{\sqrt{n}}\:\right]\]


Now we calculate the critical values \(\pm t_{\alpha/2}\) for the two-tailed confidence interval using the desired significance level \(\alpha\), then isolate \(\mu\) in the middle of the inequality: \[ \begin{align} P\left(-t_{\alpha/2}\le t_\nu \le t_{\alpha/2}\right)&=1-\alpha \\ P\left(-t_{\alpha/2}\le \frac{\bar{X}-\mu}{S/\sqrt{n}} \le t_{\alpha/2}\right)&=1-\alpha\\ P\left(\bar{X}-\frac{t_{\alpha/2}\cdot S}{\sqrt{n}}\le\mu \le \bar{X}+\frac{t_{\alpha/2}\cdot S}{\sqrt{n}}\right)&=1-\alpha\\ \end{align} \]

In this final expression we clearly see the confidence interval:

\[\mu\in\left[\:\bar{X}-\frac{t_{\alpha/2}\cdot S}{\sqrt{n}},\quad\bar{X}+\frac{t_{\alpha/2}\cdot S}{\sqrt{n}}\:\right]\] or

\[\mu\in\bar{X}\pm\frac{t_{\alpha/2}\cdot S}{\sqrt{n}}\]
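A sketch confirming that the hand-rolled interval matches t.test() on simulated data (sample values are my own):

# Sketch: 95% t-interval for the mean with unknown sigma (simulated data)
set.seed(1)
n <- 12; alpha <- 0.05
x <- rnorm(n, mean = 10, sd = 3)

xbar <- mean(x); s <- sd(x)
half <- qt(1 - alpha/2, df = n - 1) * s / sqrt(n)
c(lower = xbar - half, upper = xbar + half)
t.test(x, conf.level = 1 - alpha)$conf.int   # the same interval from base R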

Converting Test Statistics to CI

To test a hypothesis, e.g. \(\mu=\mu_0\), we pivot the \(1-\alpha\) confidence interval on the estimate \(\hat{\mu}\) calculated from the sample data. If \(\mu_0\) lies within the confidence interval we do not reject the null hypothesis; otherwise we reject it.

Algebraic pivoting inverts a test statistic inequality to form a confidence interval, and vice versa; the statements below are equivalent:

  1. Language of a test statistic: \(\bar{x}\) is in the interval \(\mu_0\pm\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\)
  2. Language of Confidence intervals: \(\mu_0\) is in the interval \(\bar{x}\pm\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\)

visualisation of pivoting as done in (Orloff & Kamrin, 2022)

\[ \begin{align} 1-\alpha&=P\left(-z_{\alpha/2}\le Z \le z_{\alpha/2}\right) \\ &=P\left(-z_{\alpha/2}\le \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right)\\ &=P\left(-\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\le\bar{X}-\mu \le \frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\right)\\ &=P\left(-\bar{X}-\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\le-\mu \le -\bar{X}+\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\right)\\ &=P\left(\bar{X}+\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\ge\mu \ge \bar{X}-\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\right)\\ &=P\left(\bar{X}-\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\le\mu \le \bar{X}+\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\right) \end{align} \] Plug in the numbers… \[ \begin{align} 1-\alpha&=P\left(\bar{X}-\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\le\mu \le \bar{X}+\frac{z_{\alpha/2}\cdot\sigma}{\sqrt{n}}\right)\\ 1-0.3&=P\left(1.48-\frac{1.04\cdot 1.5}{\sqrt{4}}\le\mu \le 1.48+\frac{1.04\cdot1.5}{\sqrt{4}}\right)\\ \:\\ \therefore\qquad 0.7&=P\left(0.7\le\mu \le 2.26\right)\\ \end{align} \]
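The plugged-in numbers can be reproduced in a couple of lines (a sketch using the values quoted above):

# Sketch: reproduce the 70% confidence interval from the worked numbers
xbar <- 1.48; sigma <- 1.5; n <- 4; alpha <- 0.30
z <- qnorm(1 - alpha/2)                  # ~ 1.04
c(lower = xbar - z * sigma / sqrt(n),
  upper = xbar + z * sigma / sqrt(n))    # ~ [0.70, 2.26]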

Caution

Always take care when interpreting: A \(70\)% confidence interval for \(\mu\), based on our sample with \(\bar{X}=1.48\), \(\sigma=1.5\), and \(n=4\) is: \(\left[0.7, 2.26\right]\). This interval was constructed using a method that, in repeated sampling, would capture the true \(\mu\) 70% of the time.

Appendix

Study 1: Sampling Distribution of Variance Estimator for i.i.d. Normal Data with Unknown Mean

Here I’ll model normally distributed i.i.d. data using:

  1. a model with an unbiased estimator for variance (McCabe, 2024b): \[\boxed{\begin{align} \hat{\mu}&=\bar{X}=\frac{1}{n}\sum_{i=1}^nX_i\\ \hat{\sigma}^2_{\text{unbiased}}&=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2 \end{align}}\]
  2. another model with a biased estimator for variance and unknown mean: \[\boxed{\begin{align} \hat{\mu}&=\bar{X}=\frac{1}{n}\sum_{i=1}^nX_i\\ \hat{\sigma}^2_{\text{biased}}&=\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2 \end{align}}\]
  3. another using the unbiased variance estimator for a known mean: \[\boxed{\hat{\sigma}^2_{\mu}=\frac{1}{n}\sum_{i=1}^n(X_i-\mu)^2}\]

I’ll get an idea of the continuous sampling distribution of each of the estimators by looking at discrete histograms of the simulation results…

Data \(x_1, x_2,\dots,x_n\) is i.i.d.; both \(\mu\) and \(\sigma\) are unknown.

The likelihood function for the data takes the form of the normal probability density: \[\mathcal{L}(\mu,\sigma^2)=\prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(X_i-\mu)^2}{2\sigma^2}\right)\] Taking the log-likelihood:

\[ \begin{align} \mathcal{l}(\mu,\sigma^2)&=\log\left(\Pi_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(X_i-\mu)^2}{2\sigma^2}\right)\right)\\ &=\sum_{i=1}^n\log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(X_i-\mu)^2}{2\sigma^2}\right)\right)\\ &=\sum_{i=1}^n\left(\log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)-\frac{(X_i-\mu)^2}{2\sigma^2}\right)\\ &=n\log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)-\sum_{i=1}^n\left(\frac{(X_i-\mu)^2}{2\sigma^2}\right)\\ &=\frac{n}{2}\log\left(\frac{1}{2\pi\sigma^2}\right)-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i-\mu)^2\\ &=-\frac{n}{2}\log\left(2\pi\sigma^2\right)-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i-\mu)^2\\ &=-\frac{n}{2}\log\left(2\pi\right)-\frac{n}{2}\log\left(\sigma^2\right)-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i-\mu)^2 \end{align} \] Find maximum likelihood estimate \(\mu\) by taking partial derivatives and setting to zero:

\[ \frac{\partial\mathcal{l}}{\partial\mu}=\frac{1}{\sigma^2}\sum_{i=1}^n(X_i-\mu)\\ 0=\frac{1}{\sigma^2}\sum_{i=1}^n(X_i-\mu)\\ \Rightarrow 0=\sum_{i=1}^n(X_i-\mu)\\ \Rightarrow \boxed{\hat{\mu}=\frac{1}{n}\sum_{i=1}^nX_i\equiv\bar{X}} \]

Find maximum likelihood estimate \(\hat{\sigma}\) by taking partial derivatives and setting to zero:

\[ \frac{\partial\mathcal{l}}{\partial\sigma}=-\frac{n}{\sigma}+\frac{1}{\sigma^3}\sum_{i=1}^n(X_i-\mu)^2\\ 0=-\frac{n}{\hat{\sigma}}+\frac{1}{\hat{\sigma}^3}\sum_{i=1}^n(X_i-\mu)^2\\ 0=-n\hat{\sigma}^2+\sum_{i=1}^n(X_i-\mu)^2\\ \Rightarrow \boxed{\hat{\sigma}^2(\mu)=\frac{1}{n}\sum_{i=1}^n(X_i-\mu)^2} \] Since the mean estimation has already consumed one degree of freedom, the divisor becomes \(n-1\) when \(\mu\) is replaced by \(\bar{X}\)… \[ \boxed{S^2(\bar{X})=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2}\\\]
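Before the simulation below, a quick numerical sanity check (a sketch with simulated data of my own): maximising the log-likelihood numerically with optim() recovers the closed-form MLEs \(\bar{x}\) and \(\frac{1}{n}\sum(x_i-\bar{x})^2\).

# Sketch: numerical MLE via optim() vs the closed-form estimates (simulated data)
set.seed(1)
x <- rnorm(30, mean = 2, sd = 1.5)
n <- length(x)

negloglik <- function(par) {             # par = c(mu, log(sigma)) for an unconstrained search
  mu <- par[1]; sigma <- exp(par[2])
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}
opt <- optim(c(0, 0), negloglik)

rbind(numerical   = c(mu = opt$par[1], sigma2 = exp(opt$par[2])^2),
      closed_form = c(mu = mean(x),    sigma2 = sum((x - mean(x))^2) / n))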
# Monte Carlo Simulation: Variance Estimator Sampling Distribution for i.i.d. Model with Unknown Mean

n <- 5                      # sample size
sigma_true <- 1             # true standard deviation
n_trials <- 500000           # number of simulations
alpha <- 0.05

chi2_lower <- qchisq(1 - alpha/2, df = n - 1) # upper-tail chi-squared quantile (gives the lower CI bound)
chi2_upper <- qchisq(alpha/2, df = n - 1)     # lower-tail chi-squared quantile (gives the upper CI bound)

mle_var <- numeric(n_trials)       # biased unknown mean
unbiased_var <- numeric(n_trials)  # unbiased unknown mean
mu_var <- numeric(n_trials)        # unbiased known mean
lower_ci <- numeric(n_trials)
upper_ci <- numeric(n_trials)
covered <- logical(n_trials)

# Monte Carlo loop
for (i in 1:n_trials) {
  sample <- rnorm(n, mean = 0, sd = sigma_true)
  sample_mean <- mean(sample)
  ssq <- sum((sample - sample_mean)^2)

  # Save Variance Estimates
  mle_var[i] <- ssq / n
  unbiased_var[i] <- ssq / (n - 1)
  mu_var[i] <- sum((sample - 0)^2)/n
  
  # Save CI Estimates
  lower_ci[i] <- (n - 1) * unbiased_var[i] / chi2_lower
  upper_ci[i] <- (n - 1) * unbiased_var[i] / chi2_upper

  # Check if true variance is within the CI
  covered[i] <- (sigma_true^2 >= lower_ci[i]) & (sigma_true^2 <= upper_ci[i])
}
True Variance: 1.0000
MLE Sample variance mean = 0.8001, MLE Variance variance = 0.3196, MSE = 0.3596
Unbiased Sample variance mean = 1.0002, Unbiased Variance variance = 0.4995, MSE = 0.4995
'Known mean' variance mean = 1.0007,'Known mean' variance variance = 0.3999, MSE = 0.3999
95% Confidence interval (Unbiased Sample variance) coverage rate: 0.9500

The results show, on average:

  • the unbiased estimator matches the true value (unbiased)
  • the biased estimator does not match the true variance
  • the sampling distribution of the \(\hat{\sigma}^2\) estimator is positively skewed… (in fact these are all scaled chi-squared distributions…).

Note

The general form of our variance estimate: \[ \text{Var}(X)\stackrel{\mathrm{def}}{=}\frac{1}{\nu}\sum_{i=1}^n(X_i-\hat{\mu})^2 \] Introducing a factor of \(1/\sigma^2\) to each side gives the distributions for our variance estimators:

\[ \begin{align} \text{when }&\hat{\mu}=\mu:\\\:\\ &\text{multiplying by }\nu\text{ dividing estimate by true }\sigma^2\dots\\ &\frac{\text{Var}(X)\cdot n}{\sigma^2}=\sum_{i=1}^n\left(\frac{X_i-\mu}{\sigma}\right)^2=\sum_{i=1}^nZ_i^2\\ &\quad\underbrace{\sum_{i=1}^nZ_i^2\sim\chi^2_{n}}_{\text{by definition}}\qquad\Rightarrow\:\boxed{\hat{\sigma}^2_{\text{known mean}}\sim\frac{\sigma^2}{n}\chi^2_{n}}\\ \:\\\:\\\:\\ \text{when }&\hat{\mu}=\bar{X}:\\\:\\ &\text{multiplying by }\nu\text{ dividing estimate by true }\sigma^2\dots\\ &\frac{\text{Var}(X)\cdot\nu}{\sigma^2}=\sum_{i=1}^n\left(\frac{X_i-\bar{X}}{\sigma}\right)^2\\ &\text{recognising }\chi^2\text{ and removing 1 dof to account for mean estimation}\dots\\ &\sum_{i=1}^n\left(\frac{X_i-\bar{X}}{\sigma}\right)^2\sim\chi^2_{n-1}\\ &\Rightarrow\:\boxed{\hat{\sigma}^2_{\text{biased}}\sim\frac{\sigma^2}{n}\chi^2_{n-1}}\quad\Rightarrow\:\boxed{\hat{\sigma}^2_{\text{unbiased}}\sim\frac{\sigma^2}{n-1}\chi^2_{n-1}} \end{align} \]

Reminder: the sum of squares of \(n\) i.i.d. standard normal variables follows a chi-squared distribution, \(\sum_{i=1}^nZ_i^2\sim\chi^2_{n}\); taking deviations from the sample mean instead costs one degree of freedom. Recall that the chi-squared statistic is also what NHST uses to test whether data matches a given curve/distribution.

More explicitly, we have \(X_1, X_2, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)\), i.i.d. and normally distributed:

Standardise the normal distribution and re-express \(X_i\) in terms of \(Z_i\): \[Z_i = \frac{X_i - \mu}{\sigma} \quad \Rightarrow \quad Z_i \sim \mathcal{N}(0, 1)\quad\therefore\:X_i = \mu + \sigma Z_i\]

…so the sample mean in terms of standard normal variables: \[\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i = \mu + \sigma \bar{Z}, \quad \text{where } \bar{Z} = \frac{1}{n} \sum_{i=1}^n Z_i\]

Now rewrite the sum of squares in terms of standard normal statistics:

\[\sum_{i=1}^n (X_i - \bar{X})^2 = \sum_{i=1}^n \left[ \sigma (Z_i - \bar{Z}) \right]^2 = \sigma^2 \sum_{i=1}^n (Z_i - \bar{Z})^2\\ \Rightarrow\quad\boxed{\frac{1}{\sigma^2} \sum_{i=1}^n (X_i - \bar{X})^2 = \sum_{i=1}^n (Z_i - \bar{Z})^2}\]

This sum of squared deviations follows the chi-squared distribution: \(\sum_{i=1}^n (Z_i - \bar{Z})^2 \sim \chi^2_{n-1}\)

\[\Rightarrow\quad\frac{1}{\sigma^2} \sum_{i=1}^n (X_i - \bar{X})^2 \sim \chi^2_{n-1}\]

Although there are \(n\) terms in the sum, the deviations from the mean satisfy: \[\sum_{i=1}^n (Z_i - \bar{Z}) = 0\] This is a linear constraint, so only \(n - 1\) of the \((Z_i - \bar{Z})\) are linearly independent (one value is decided by the others). Hence the distribution has \(n - 1\) degrees of freedom.
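A sketch confirming by simulation that \(\sum_{i=1}^n (Z_i - \bar{Z})^2\) behaves like a \(\chi^2_{n-1}\) variable (mean \(\approx n-1\), variance \(\approx 2(n-1)\)); the sample size is my own choice:

# Sketch: sum of squared deviations of standard normals vs chi-squared(n - 1)
set.seed(1)
n <- 6; n_trials <- 200000
ss <- replicate(n_trials, {
  z <- rnorm(n)
  sum((z - mean(z))^2)
})
c(mean = mean(ss), expected_mean = n - 1,
  var = var(ss),   expected_var = 2 * (n - 1))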

Study 2: Sampling Distribution of Normalised Mean Estimator for i.i.d. Normal Data with Known & Unknown Variance

Here I’ll randomly generate a number of normally distributed random variables:

  1. the mean statistic (the same regardless of whether the variance is known or unknown), which I’ll safely assume will give us a normal distribution in accordance with the CLT: \[\boxed{\hat{\mu} = \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i\quad\text{where}\:X_i\sim N(2,2^2)}\]
  2. I’ll generate a pseudo statistic \(T'\) which normalises the mean estimator using the sample variance \(S^2\) and the real population mean \(\mu\) (it can’t be a real statistic since it uses the known true \(\mu\)): \[ \boxed{ \begin{aligned} &T'=\frac{\bar{X}-\mu}{S}\quad\text{where}\dots\\ &\qquad\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i\\ &\qquad S^2 =\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2 \end{aligned} }\]
  3. I’ll generate another pseudo statistic \(Z'\) which normalises the mean estimator using the real population mean \(\mu\) and the real population standard deviation \(\sigma\) (think of \(Z'\) as an unscaled z-statistic for testing the null hypothesis \(\mu_0=\mu\)): \[\boxed{Z'=\frac{\bar{X}-\mu}{\sigma}\quad\text{where}\quad\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i}\]

Again I’ll get an idea of the continuous sampling distribution of each of the estimators by looking at discrete histograms of the simulation results…

# Monte Carlo Simulation: Mean Estimator Sampling Distribution for i.i.d. Model with Unknown Variance

# Parameters
n <- 3                     # sample size
mu_true <- 2.0             # true mean
sigma_true <- 2.0          # true standard deviation
n_trials <- 500000          # number of simulations
alpha <- 0.05


# Storage for estimators
sample_mean <- numeric(n_trials)
sample_stddev <- numeric(n_trials)
mean_normal_known_var <- numeric(n_trials)
mean_normal_unknown_var <- numeric(n_trials)


# Monte Carlo loop
for (i in 1:n_trials) {
  sample <- rnorm(n, mean = mu_true, sd = sigma_true)
  
  sample_mean[i] <- mean(sample)
  sample_stddev[i] <- sqrt(var(sample))
  
  mean_normal_known_var[i] <- (sample_mean[i]-mu_true)/sigma_true
  mean_normal_unknown_var[i] <- (sample_mean[i]-mu_true)/sample_stddev[i]
}
True Mean: 2.0000, True std.dev: 2.0000
Average Sample mean = 2.0021, Average Sample std.dev = 1.7742, MSE = 2.2411

Note

The general form of the mean estimate normalisation (the pivot) is: \[ Y=\frac{\bar{X}-\mu}{\hat{\sigma}/\sqrt{n}} \]

When we normalise using the true standard deviation, the probability distribution of \(Y\) is simply a standard normal distribution:

\[ \begin{align} \text{when }\hat{\sigma}=\sigma:&\\ &Y=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\quad\Rightarrow\quad\boxed{\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\sim Z} \end{align} \] When we normalise using the sample standard deviation, the probability distribution of \(Y\) gives the expression for Student’s t distribution exactly as we first presented it in McCabe (2024a); recall also that the unbiased variance estimates have a \(\frac{\sigma^2}{n-1}\chi^2_{n-1}\) distribution:

\[ \begin{align} \text{when }&\hat{\sigma}=S:\\ &\quad Y=\frac{\bar{X}-\mu}{S/\sqrt{n}}\\ &\text{the numerator follows a scaled z distribution}\\ &\quad\bar{X}-\mu\sim \frac{\sigma}{\sqrt{n}}Z\\ &\text{the denominator follows a scaled chi-squared distribution as seen in the first experiment}\\ &\quad S^2\sim \frac{\sigma^2}{n-1}\chi^2_{n-1}\quad\Rightarrow\quad \frac{S}{\sqrt{n}}\sim\frac{\sigma}{\sqrt{n}}\sqrt{\frac{\chi_{n-1}^2}{n-1}}\\\:\\ &\text{this gives Student's t distribution as presented in the reference:}\\ &\qquad \frac{\bar{X}-\mu}{S/\sqrt{n}}\sim\frac{Z}{\sqrt{\frac{\chi_{n-1}^2}{n-1}}}\qquad \text{where}\quad\underbrace{t_{n-1}=\frac{Z}{\sqrt{\frac{\chi_{n-1}^2}{n-1}}}}_{\text{by definition}}\\ &\therefore\qquad\boxed{\frac{\bar{X}-\mu}{S/\sqrt{n}}\sim t_{n-1}} \end{align} \] Note that the pseudo statistics simulated above omit the \(\sqrt{n}\) scaling, so \(T'=(\bar{X}-\mu)/S\sim t_{n-1}/\sqrt{n}\) and \(Z'=(\bar{X}-\mu)/\sigma\sim Z/\sqrt{n}\); the histogram shapes are the same up to this scale factor.
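A sketch checking this by simulation (parameter values are my own): the quantiles of the studentised mean line up with qt(·, n − 1).

# Sketch: quantiles of the studentised mean vs Student's t with n - 1 degrees of freedom
set.seed(1)
n <- 5; mu <- 2; sigma <- 2; n_trials <- 200000
t_stat <- replicate(n_trials, {
  x <- rnorm(n, mean = mu, sd = sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))
})
probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
rbind(simulated = quantile(t_stat, probs),
      t_dist    = qt(probs, df = n - 1))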

Study 3: Variance of the Sampling Distribution of Normalised Mean Estimator

Here I’ll randomly sample a number of standard normal distributed random variables \(Z_i\sim N(0,1)\) and repeatedly estimate the mean using three different sample sizes; 2, 4, 8.

Once again I’ll get an idea of the continuous sampling distribution of each of the estimators by looking at discrete histograms of the simulation results. We’ll look at the variance of each respective sampling distribution; no prizes for guessing that the variance of the sampling distribution decreases with sample size.

# Monte Carlo Simulation: Different Mean Estimator Sampling Distributions

# Parameters
n_trials <- 500000         # number of simulations
alpha <- 0.05

# Storage for estimators
sample_mean_2 <- numeric(n_trials)
sample_mean_4 <- numeric(n_trials)
sample_mean_8 <- numeric(n_trials)

# Monte Carlo loop
for (i in 1:n_trials) {
  sample <- rnorm(8)
  sample_mean_2[i] <- mean(sample[1:2])
  sample_mean_4[i] <- mean(sample[1:4])
  sample_mean_8[i] <- mean(sample[1:8])
}
Variances are as follows:
     2 sample mean estimator 0.5000
     4 sample mean estimator 0.2501
     8 sample mean estimator 0.1246

Takeaway: Increasing the sample size generally reduces the variability (standard error) of an estimate. This effect holds for well-behaved, consistent estimators under standard assumptions. However, it is not universally true - some estimators or data scenarios may not exhibit this property.

Note

Using properties of variance and assuming independence:

\[ \operatorname{Var}(\bar{X}_n) = \operatorname{Var}\left( \frac{1}{n} \sum_{i=1}^n X_i \right) = \frac{1}{n^2} \sum_{i=1}^n \operatorname{Var}(X_i) = \frac{1}{n^2} \cdot n \cdot \sigma^2 = \frac{\sigma^2}{n} \]

References

McCabe, D. (2024a). Catalogue of continuous probability distributions. https://rpubs.com/mccabe08/1264465.
McCabe, D. (2024b). Degrees of freedom in statistical modelling. https://rpubs.com/mccabe08/1312256.
McCabe, D. (2024c). Derivation - the central limit theorem. https://rpubs.com/mccabe08/1265303.
McCabe, D. (2024d). Null hypothesis significance testing (neyman-pearson) paradigm. https://rpubs.com/mccabe08/1285775.
McCabe, D. (2024e). Ridge and lasso regression DoF. https://rpubs.com/mccabe08/1317718.
Orloff, J., & Kamrin, J. F. (2022). MIT 18.05 discrete mathematics, probability and statistics lecture notes. https://www.zotero.org/groups/5801297/mit_ocw
Wikipedia contributors. (2024). Estimator – Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Estimator