Populations and Samples: An Introduction to Estimators

Author

Charles Eliot

Here Be Dragons. Beware.

Here’s something you would see in an introductory statistics course.

We start with this basic definition of the variance of a random variable [1]:

\[ var(X)=E\big[(X-E(X))^2\big] \]

Then we expand the square and split the result into separate expectations [2]:

\[ var(X)= E\big[X^2+E(X)^2-2E(X)X\big]= \\ E[X^2]+E[X]^2-2E[X]^2 = \\ E[X^2]-E[X]^2 \]

So far so good, but why does R’s built-in var() function return a different value?

dice <- c(1, 2, 3, 4, 5, 6)          # the sample space of a fair die
weights <- c(1, 1, 1, 1, 1, 1) / 6   # each face has probability 1/6

E_X <- sum(dice * weights)           # E[X]
E_X2 <- sum((dice^2) * weights)      # E[X^2]

cat(paste("Variance, method 1 = ", round(sum((weights) * (dice - E_X)^2), 3), "\n"))
cat(paste("Variance, method 2 = ", round(E_X2 - E_X^2, 3), "\n"))
cat(paste("Variance, var()    = ", round(var(dice), 3), "\n"))
Variance, method 1 =  2.917 
Variance, method 2 =  2.917 
Variance, var()    =  3.5 

For the answer, we’ll need to dig deeper into how we figure out the properties of populations from samples of their members.

Population Means and Sample Means

The expectation value for a discrete random variable \(X\) is \(E[X] = \sum_{i}{P(x_i)x_i}\), where \(P(x_i)\) is the probability that \(X\) takes the value \(x_i\), and \(i\) covers all the possible values of \(X\). There is also an integral version for continuous variables. This expectation value gets a special symbol \(\mu\) and is called the population mean.

If we take \(n\) samples \(x_i\) from \(X\), the average of the samples is \(\frac{1}{n}\sum_{i=1}^nx_i\). We call this the sample mean and give it the symbol \(\bar{x}\).
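To make the distinction concrete, here’s a minimal sketch using a fair six-sided die (the sample size of 20 draws is an arbitrary choice for illustration):

dice <- c(1, 2, 3, 4, 5, 6)

# Population mean: a probability-weighted sum over the whole sample space
mu <- sum(dice * (1 / 6))

# Sample mean: the plain average of a finite number of observed rolls
x_bar <- mean(sample(dice, 20, replace = TRUE))

c(mu = mu, x_bar = x_bar)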

Unbiased Estimators

Does calculating the sample mean \(\bar{x}\) for some sample provide a good estimate of the population mean \(\mu\)?

To answer this important question [3], we’ll need to say precisely what we mean when we say a calculation - like taking the average \(\bar{x}\) of some samples - is a “good” estimator for a parameter of the whole population, like \(\mu\).

Let’s say we want our estimators to be unbiased, which we define to mean that the expectation value of the estimator equals the true value of the population parameter. [4] [5]
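That definition suggests a brute-force way of probing bias by simulation: apply the estimator to many independent samples and check whether its long-run average lands on the population parameter. Here’s a minimal sketch of that idea (the helper name estimate_expectation and the 10,000 repetitions are just illustrative choices):

dice <- c(1, 2, 3, 4, 5, 6)

# Approximate the expectation of an estimator by averaging it over many repeated samples
estimate_expectation <- function(estimator, n, reps = 10000) {
  mean(sapply(1:reps, \(i) estimator(sample(dice, n, replace = TRUE))))
}

# For the sample mean and a fair die, the long-run average should land near 3.5
estimate_expectation(mean, n = 10)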

Is the Sample Mean an Unbiased Estimator for the Population Mean?

We’ll use our definition and compute the expectation value of \(\bar{X}\), the average of a set of \(n\) samples \(x_1 ... x_n\) drawn from \(X\).

\[ E[\bar{X}] = E\bigg[\frac{1}{n}\sum_{i=1}^nx_i\bigg] = \frac{1}{n}E\bigg[\sum_{i=1}^nx_i\bigg] = \frac{1}{n}\sum_{i=1}^nE[x_i] = \frac{1}{n}\sum_{i=1}^n\mu = \mu \]

Each \(x_i\) is a realization of a random variable from a distribution with a population mean of \(\mu\), so \(E[x_i]=\mu\).

As hoped, we’ve shown that a sample mean is an unbiased estimator for the population mean:

\[ E[\bar{X}]=\mu \]

Demonstration That the Sample Mean Is an Unbiased Estimator for the Population Mean

Let’s give this a try on a common distribution: the exponential distribution.

library(ggplot2)

# A placeholder data frame to initialize the plot; the curve itself is drawn by stat_function()
df <- data.frame(X = 1:10, Y = 1:10)

g <- ggplot(df, aes(x = X, y = Y))
g <- g + xlim(0, 4)
g <- g + stat_function(fun = dexp, n = 101, args = list(rate = 1))
g

\(\mu=\frac{1}{rate}\) for the exponential distribution, so let’s confirm that averaging the sample means of more and more \(n\)-fold samples gives us progressively better estimates of \(\mu\):

set.seed(314159)

rate <- 2
n <- 10

mean(sapply(1:10, \(i) {mean(rexp(n, rate = rate))}))
mean(sapply(1:100, \(i) {mean(rexp(n, rate = rate))}))
mean(sapply(1:1000, \(i) {mean(rexp(n, rate = rate))}))
mean(sapply(1:10000, \(i) {mean(rexp(n, rate = rate))}))
[1] 0.5687035
[1] 0.514899
[1] 0.5056512
[1] 0.4993165

An Unbiased Estimator for the Population Variance

The variance of a population of \(n\) equally likely values is \(\sigma^2=\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2\), so let’s start with the analogous formula - with the sample mean \(\bar{x}\) standing in for \(\mu\) - as our candidate estimator for the population variance \(\sigma^2\):

\[ s_{candidate}^2=\frac{1}{n}\sum_{i=1}^{n}{(x_i-\bar{x})^2} \]

To test if this is an unbiased estimator, we’re going to evaluate the expectation value of the candidate estimator:

\[ E[s_{candidate}^2] = E\bigg[\frac{1}{n}\sum{(x_i-\bar{x})^2}\bigg] = \frac{1}{n}E\bigg[\sum{(x_i-\bar{x})^2}\bigg] \]

\[ E[s_{candidate}^2] = \frac{1}{n}\bigg(E\bigg[\sum{(x_i^2-2\bar{x}x_i+\bar{x}^2)}\bigg]\bigg) \]

\[ E[s_{candidate}^2] = \frac{1}{n}\bigg(E\bigg[\sum{x_i^2}-2\bar{x}\sum{x_i}+\sum{\bar{x}^2}\bigg]\bigg) = \frac{1}{n}\bigg[\bigg(\sum{E[x_i^2]}\bigg)-nE[\bar{x}^2]\bigg] \]

In the last step we used \(\sum{x_i}=n\bar{x}\), so the second and third terms combine: \(-2\bar{x}\sum{x_i}+\sum{\bar{x}^2}=-2n\bar{x}^2+n\bar{x}^2=-n\bar{x}^2\), whose expectation is \(-nE[\bar{x}^2]\).

Let’s pause for a moment to derive some useful expressions:

\[ var(X) = \sigma^2 = E[X^2]-E[X]^2 = E[X^2]-\mu^2 \]

\[ E[X^2] = \sigma^2+\mu^2 \]
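As a quick numerical sanity check, here’s that identity evaluated for the fair die from the introduction (all base R, nothing assumed beyond the earlier definitions):

dice <- c(1, 2, 3, 4, 5, 6)

mu     <- mean(dice)             # population mean: 3.5
sigma2 <- mean((dice - mu)^2)    # population variance: 2.917
E_X2   <- mean(dice^2)           # E[X^2]: 15.167

E_X2 - (sigma2 + mu^2)           # zero, up to floating point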

Let’s now turn to understanding the variance of the sample mean.

\[ var(\bar{X}) = E[\bar{X}^2]-(E[\bar{X}])^2 \]

We already demonstrated that \(E[\bar{X}]=\mu\) when we showed that the sample average is an unbiased estimator for the population mean. And the variance of the sample mean - the square of its standard error [6] - is \(\sigma_{\bar{X}}^2=\frac{\sigma^2}{n}\) (derived below), so:

\[ \frac{\sigma^2}{n}=E[\bar{X}^2]-\mu^2 \]

or:

\[ E[\bar{X}^2]=\frac{\sigma^2}{n}+\mu^2 \]
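The \(\sigma_{\bar{X}}^2=\frac{\sigma^2}{n}\) result used above takes only a couple of steps, assuming the samples \(x_i\) are independent - so the variance of their sum is the sum of their variances - and using \(var(cX)=c^2var(X)\):

\[ var(\bar{X}) = var\bigg(\frac{1}{n}\sum_{i=1}^{n}x_i\bigg) = \frac{1}{n^2}\sum_{i=1}^{n}var(x_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \]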

Substituting these two expressions into the earlier formula for the expectation of our candidate estimator, we get:

\[ E[s_{candidate}^2]=\frac{1}{n}\bigg[\bigg(\sum{E[x_i^2]}\bigg)-nE[\bar{x}^2]\bigg] = \frac{1}{n}\bigg[n(\sigma^2+\mu^2)-n\bigg(\frac{\sigma^2}{n}+\mu^2\bigg)\bigg] = \frac{n-1}{n}\sigma^2 \]

This tells us our candidate estimator for the variance is not unbiased, because it systematically underestimates the actual variance \(\sigma^2\) by a factor of \(\frac{n-1}{n}\).

On the other hand, an appropriately scaled version of our candidate estimator is unbiased:

\[ s_{unbiased}^2=\frac{n}{n-1}\frac{1}{n}\sum_{i=1}^{n}{(x_i-\bar{x})^2} = \frac{1}{n-1}\sum_{i=1}^{n}{(x_i-\bar{x})^2} \]

The moral of the story is that the formula for the sample variance must have \(n-1\) in its denominator so it can be an unbiased estimator for the population variance.

Sampling Examples

dice <- c(1, 2, 3, 4, 5, 6)

cat(paste0("Population mean = ", (sum(1/6 * dice)), "\n"))
cat(paste0("Population variance = ", round(sum((dice - mean(dice))^2 / length(dice)), 3), "\n"))
Population mean = 3.5
Population variance = 2.917

Bigger sample sizes give better estimates for the population mean:

print(mean(sample(dice, 10, replace = TRUE)))
print(mean(sample(dice, 100, replace = TRUE)))
print(mean(sample(dice, 1000, replace = TRUE)))
print(mean(sample(dice, 10000, replace = TRUE)))
[1] 3.5
[1] 3.82
[1] 3.548
[1] 3.523

Dividing by N systematically underestimates the population variance:

f <- function(reps) {
  # Average, over reps repetitions, of the divide-by-N variance of a 10-roll sample
  mean(sapply(1:reps, \(i) {
    data <- sample(dice, 10, replace = TRUE)
    sum((data - mean(data))^2 / length(data))
  }))
}

print(f(10))
print(f(100))
print(f(1000))
print(f(10000))
[1] 2.71
[1] 2.5627
[1] 2.61851
[1] 2.632593

These averages are converging to \(\frac{n-1}{n}\sigma^2=\frac{9}{10}\times 2.917 \approx 2.625\) - exactly the bias we derived. Dividing by (N-1) instead produces an unbiased estimator for the population variance:

g <- function(reps) {
  # Average, over reps repetitions, of the divide-by-(N-1) variance of a 10-roll sample
  mean(sapply(1:reps, \(i) {
    data <- sample(dice, 10, replace = TRUE)
    sum((data - mean(data))^2 / (length(data) - 1))
  }))
}

print(g(10))
print(g(100))
print(g(1000))
print(g(10000))
[1] 2.942222
[1] 2.9
[1] 2.886433
[1] 2.908854

So finally we understand why R’s var() command gives the result it does: it’s computing the sample variance, not the population variance:

cat(paste0("var(dice) = ", round(var(dice), 3), "\n"))

cat("\n")

cat(paste0("Sample variance     = ", round(sum((dice - mean(dice)) ^ 2) / (length(dice) - 1), 3), "\n"))
cat(paste0("Population variance = ", round(sum((dice - mean(dice)) ^ 2) / length(dice), 3), "\n"))
var(dice) = 3.5

Sample variance     = 3.5
Population variance = 2.917

Degrees of Freedom

Say we already know the population mean \(\mu\) and don’t need to compute the sample mean \(\bar{x}\) as an intermediate step in our estimation process. Does this affect the choice of estimator?

Let’s start with the same candidate estimator we used before, but we’ll substitute the known \(\mu\) for the calculated \(\bar{x}\).

\[ s_{candidate}^2=\frac{1}{n}\sum_{i=1}^{n}{(x_i-\mu)^2} \]

As before we’ll test whether this is an unbiased estimator by evaluating its expectation value:

\[ E[s_{candidate}^2] = E\bigg[\frac{1}{n}\sum{(x_i-\mu)^2}\bigg] = \frac{1}{n}E\bigg[\sum{(x_i-\mu)^2}\bigg] \]

\[ E[s_{candidate}^2] = \frac{1}{n}\bigg(E\bigg[\sum{(x_i^2-2{\mu}x_i+\mu^2)}\bigg]\bigg) \]

\[ E[s_{candidate}^2] = \frac{1}{n}\bigg(E\bigg[\sum{x_i^2}-2\mu\sum{x_i}+\sum\mu^2\bigg]\bigg) = \frac{1}{n}\bigg(\bigg(\sum{E[x_i^2]}\bigg)-2n{\mu}E[\bar{x}]+nE[\mu^2]\bigg) \]

And since we know \(E(\bar{x})=\mu\) and \(E(\mu^2)=\mu^2\):

\[ E[s_{candidate}^2] = \frac{1}{n}\bigg[\bigg(\sum{E[x_i^2]}\bigg)-n\mu^2\bigg] \]

We also know

\[ E[X^2]=\sigma^2+\mu^2 \]

so

\[ E[s_{candidate}^2] = \frac{1}{n}\bigg[n(\sigma^2+\mu^2)-n\mu^2\bigg] = \sigma^2 \]

Conclusion: if the population mean is already known, \(\frac{1}{n}\sum_i(x_i-\mu)^2\) (denominator = \(n\), not \(n-1\)) is an unbiased estimator for the population variance.
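Here’s a minimal simulation sketch of that conclusion, in the spirit of the earlier sampling examples: with the known population mean \(\mu=3.5\) of a fair die plugged in, the divide-by-N estimator should settle near the population variance 2.917 (the function name h and the 10,000 repetitions are just illustrative choices):

dice <- c(1, 2, 3, 4, 5, 6)
mu <- 3.5   # the known population mean of a fair die

h <- function(reps) {
  # Average, over reps repetitions, of the divide-by-N estimator built around the known mu
  mean(sapply(1:reps, \(i) {
    data <- sample(dice, 10, replace = TRUE)
    sum((data - mu)^2 / length(data))
  }))
}

print(h(10000))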

This example illustrates how degrees-of-freedom (DF) impact statistical calculations. When we already know the population mean \(\mu\), the \(n\) samples constitute \(n\) independent variables, \(x_1 ... x_n\). If we need to compute the sample mean as an intermediate result, the \(n\) samples only account for \(n-1\) independent variables (because if we are given \(x_1 ... x_{n-1}\) and \(\bar{x}=\frac{1}{n}\sum{x_i}\) we can compute the final \(x_n\), so \(x_n\) is no longer an independent variable). I’ll write more about degrees-of-freedom in another article.
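A tiny sketch of that last point: once the sample mean is known, the final observation carries no new information, because it can be reconstructed from the others (the particular rolls below are just an example):

x <- c(4, 2, 6, 1, 5)            # a sample of n = 5 die rolls
x_bar <- mean(x)
n <- length(x)

# Given x_bar and the first n - 1 values, the last value is fully determined
n * x_bar - sum(x[1:(n - 1)])    # recovers x[5], i.e. 5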

Footnotes

  1. Except for this footnote, we’re going to skip by the question of what a random variable actually is. Let’s go with this: a random variable is a function that returns values from a set - the sample space - at a rate based on probabilities associated with the members of the set. The expectation value of a random variable is an idealized average value based on invoking the random variable function infinitely many times.↩︎

  2. Here we’re relying heavily on the linearity of expectations: \(E(A+B) = E(A) + E(B)\), and \(E(cA) = cE(A)\). The behavior of expectations is part of the general topic of the algebra of random variables. https://en.wikipedia.org/wiki/Algebra_of_random_variables recounts some of the basics, without going into the underlying mathematical foundation. The references in the Wikipedia article, especially Peter Whittle’s book “Probability via Expectation”, are good places to start deeper study. (Whittle, Peter (2000). Probability via Expectation (4th ed.). Springer. 978-0-387-98955-6.)↩︎

  3. You might say this is the most fundamental question in statistics because it goes to how we can compute something we can’t observe directly based on imperfect observations of things we can observe. The story of how investigators gradually learned that aggregate measures like averages were capable of pulling good answers out of messy data is fascinating. If you’re interested, I recommend Stigler, Stephen (2016), The Seven Pillars of Statistical Wisdom. Harvard. 978-0-674-08891-7.↩︎

  4. Note how goodness is a property of the process, not the individual estimates. When we refer to an “unbiased estimator”, we’re saying that a process of estimation produces statistically unbiased results over time, not that any particular estimate is good. This practice of focusing on the behaviors of measurement processes repeated over time is central to the frequentist approach to statistics.↩︎

  5. Being unbiased is only one characteristic an estimator might display, and it’s not necessarily the most valuable. Other characteristics of estimators we might consider include consistency, accuracy and efficiency (see https://en.wikipedia.org/wiki/Estimator, and especially the cited references, to learn more).↩︎

  6. See, for example, https://en.wikipedia.org/wiki/Standard_error.↩︎