Estimating with uncertainty

Alban Guillaumet, Troy University

“Absolute certainty is a privilege of uneducated minds and fanatics. It is, for scientific folk, an unattainable ideal.”

- Cassius J. Keyser

Objectives

  • Sampling distribution of an estimate
  • Standard error of an estimate
  • Confidence interval

Sampling distribution of an estimate

Estimation is the process of inferring a population parameter from sample data. The crucial question is: “In the face of chance, how much can we trust an estimate?” To answer this question, we need to learn about how the sampling process might affect the estimates we get.

Definition: The sampling distribution is the probability distribution of all values for an estimate that we might obtain when we sample a population.

Sampling distribution of an estimate

Probability distribution of gene lengths in the human genome (n = 20290, whole population! Mean = 2622.0, SD = 2036.9)

–> Probability of obtaining a gene of a given length when sampling a single gene at random


Sampling distribution of an estimate

In real life, we do NOT usually know the probability distribution of the population. Here we do, so let's take advantage of this opportunity to illustrate the process of sampling.

How?

  • All genes were listed on a file, one gene per line from line 1 to 20290
  • A computer program was used to generate 100 random integers between 1 and 20290 with no duplicates
  • Each random number was used to select the gene listed on the line with that number
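
The three steps above can be sketched in Python. This is a minimal illustration, not the program actually used: the real gene lengths are not reproduced here, so `gene_lengths` is filled with made-up values, and the seed and the names `gene_lengths`, `line_numbers` and `sample` are assumptions chosen for the example.

```python
import random

N_GENES = 20290      # number of lines (genes) in the file, from the text
SAMPLE_SIZE = 100    # number of genes to draw, from the text

rng = random.Random(1)  # fixed seed (an assumption) so the draw is repeatable

# Stand-in for the file: one made-up length per line (NOT the real data)
gene_lengths = [round(rng.lognormvariate(7.5, 0.8)) for _ in range(N_GENES)]

# 100 random integers between 1 and 20290 with no duplicates
line_numbers = rng.sample(range(1, N_GENES + 1), SAMPLE_SIZE)

# Line k of the file corresponds to index k - 1 of the list
sample = [gene_lengths[k - 1] for k in line_numbers]

print(len(sample), "genes drawn; first line numbers:", line_numbers[:5])
```

Sampling without replacement (`rng.sample`) is what guarantees that no gene is drawn twice.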

Sampling distribution

Probability distribution of gene lengths in the whole population of n = 20290 genes (Parameters: Mean = 2622.0, SD = 2036.9)


Frequency distribution of gene lengths in a single random sample of n = 100 genes (Estimates: Mean = 2411.8, SD = 1463.5)


Sampling distribution of an estimate

  • What would happen if you took a second random sample of 100 genes? And a third?

  • Each time you would obtain new estimates of the same parameters, each differing somewhat from the true values

  • If we were able to repeat this sampling an infinite number of times, we could create the probability distribution of our estimate, called the sampling distribution
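
The thought experiment above can be approximated by simulation: repeat the sampling many times (here 2000, standing in for "an infinite number") and look at the resulting collection of sample means. The sketch below assumes a made-up population of 20290 values rather than the real gene lengths; `population`, `sample_mean` and the seed are illustrative choices.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed (an assumption) for repeatability

# Hypothetical population of 20290 values (NOT the real gene lengths)
population = [rng.lognormvariate(7.5, 0.8) for _ in range(20290)]
mu = statistics.fmean(population)  # the population mean, a constant


def sample_mean(n):
    """Mean of one random sample of n values drawn without replacement."""
    return statistics.fmean(rng.sample(population, n))


# Each call gives one estimate; the 2000 estimates together
# approximate the sampling distribution of the mean for n = 100
means = [sample_mean(100) for _ in range(2000)]

print(f"population mean:      {mu:.1f}")
print(f"mean of sample means: {statistics.fmean(means):.1f}")
print(f"SD of sample means:   {statistics.stdev(means):.1f}")
```

A histogram of `means` would show the approximate sampling distribution; its centre should sit close to `mu`.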


Sampling distribution of an estimate

  • The sampling distribution represents the “population” of values for an estimate.

  • Taking a random sample of n observations from a population and calculating \( \bar{Y}\ \) is equivalent to randomly sampling a single value of \( \bar{Y}\ \) from its sampling distribution

Sampling distribution of an estimate

Sampling distribution of mean gene length when n = 100.

Although the population mean \( \mu\ \) is a constant (2622.0), its estimate \( \bar{Y}\ \) is a variable! We don't usually see the sampling distribution of \( \bar{Y}\ \) because we have a single sample, and thus a single \( \bar{Y}\ \).


Sampling distribution of an estimate

Sampling distribution of mean gene length when n = 100.

Notice also that the sampling distribution for \( \bar{Y}\ \) is centered exactly on the true mean \( \mu\ \), i.e. \( \bar{Y}\ \) is an unbiased estimate of \( \mu\ \).


Sampling distribution of an estimate

What happens when the sample size increases?

Sampling distribution of an estimate

Increasing sample size reduces the spread of the sampling distribution of an estimate, increasing precision.

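
The effect of sample size on the spread of the sampling distribution can be checked with a short simulation. As a sketch under stated assumptions, the population below is made up (it is not the gene-length data), and the sample sizes 20, 100 and 500 are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed (an assumption) for repeatability

# Hypothetical population (NOT the real gene lengths)
population = [rng.lognormvariate(7.5, 0.8) for _ in range(20290)]


def spread_of_sample_means(n, repeats=1000):
    """Standard deviation of the means of `repeats` random samples of size n."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)


spreads = {n: spread_of_sample_means(n) for n in (20, 100, 500)}
for n, s in spreads.items():
    print(f"n = {n:3d}: SD of sample means = {s:.1f}")
```

The printed spread should shrink as n grows, which is the gain in precision described above.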

Measuring the uncertainty of an estimate

How can the sampling distribution be used to measure the uncertainty of an estimate?

Definition: The standard error is the usual way to indicate the uncertainty of an estimate. The standard error of an estimate is the standard deviation of the estimate’s sampling distribution.

The smaller the standard error, the less uncertainty there is about the target parameter in the population.

Standard error of the mean

Definition: The standard error of the mean is estimated from the data as the sample standard deviation (s) divided by the square root of the sample size (n).

\[ \mathrm{SE}_{\bar{Y}} = \frac{s}{\sqrt{n}} \]
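
As a worked example, the definition can be applied to the summary of the random sample of 100 genes shown earlier (s = 1463.5, n = 100). The function name `standard_error` and the small data set `data` are illustrative assumptions.

```python
import math
import statistics


def standard_error(values):
    """Estimated standard error of the mean: s / sqrt(n)."""
    s = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return s / math.sqrt(len(values))


# From the summary of the gene sample in the text: s = 1463.5, n = 100
se_from_summary = 1463.5 / math.sqrt(100)
print(f"SE of mean gene length: {se_from_summary:.2f}")  # 146.35

# From raw data (made-up values, for illustration only)
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(f"SE of the mean of data: {standard_error(data):.3f}")
```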

Online Tutorial - Estimating with Uncertainty

Standard error and confidence interval

Next, we'll see how the standard error estimated from a sample can be used to calculate a range likely to contain the value of the true population parameter (the confidence interval).

Confidence interval

Definition: A confidence interval is a range of values surrounding the sample estimate that is likely to contain the population parameter.

The 95% confidence interval provides a most plausible range for a parameter.

Confidence interval

For a population having a normal distribution (bell curve), a rough approximation to the 95% confidence interval for a mean is calculated as the sample mean plus and minus 2 standard errors of the mean.

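
The 2SE rule can be tried on the sample of 100 genes summarized earlier (mean = 2411.8, SD = 1463.5). This is only an illustration of the arithmetic: the rule is stated for a normal population, and applying it to the gene sample is an assumption made for the example, as is the function name `approx_ci_95`.

```python
import math


def approx_ci_95(mean, sd, n):
    """Rough 95% confidence interval: mean plus and minus 2 standard errors."""
    se = sd / math.sqrt(n)
    return mean - 2 * se, mean + 2 * se


# Estimates from the random sample of n = 100 genes in the text
lower, upper = approx_ci_95(2411.8, 1463.5, 100)
print(f"approximate 95% CI: {lower:.1f} to {upper:.1f}")  # 2119.1 to 2704.5
```

In this case the interval happens to contain the known population mean of 2622.0.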

Confidence interval

The 2SE rule of thumb is not exact. Rigorously, SE must be multiplied by a value slightly different from 2: the “5% critical value” of a Student’s t-distribution with n-1 degrees of freedom, e.g., 2.78 (n = 5, df = 4), 2.26 (n = 10, df = 9), 2.00 (n = 61, df = 60), 1.96 (df = infinity).

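
The multipliers quoted above can be recovered numerically. The sketch below uses only the Python standard library: it integrates the density of the t-distribution with Simpson's rule and finds the critical value by bisection. Statistical packages provide this directly; the function names `t_pdf`, `central_area` and `t_critical` are assumptions made for this example.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def central_area(t, df, steps=2000):
    """P(-t < T < t), using Simpson's rule on [0, t] and symmetry."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3


def t_critical(df, level=0.95):
    """Value t such that P(-t < T < t) = level, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if central_area(mid, df) < level:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for n in (5, 10, 61):
    print(f"n = {n:2d}, df = {n - 1:2d}: multiplier = {t_critical(n - 1):.2f}")
```

For very large df the result approaches 1.96, the value for the normal distribution.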

The Student's t-distribution

Definition: For \( Y \sim N(\mu,\sigma^2) \), the statistic defined by
\[ t = \frac{\bar{Y}-\mu}{\mathrm{SE}_{\bar{Y}}} \] has a Student’s \( t \)-distribution with \( n-1 \) degrees of freedom.

The Student's t-distribution

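Therefore, writing \( t_{\mathrm{crit}} \) for the 5% critical value of the Student’s \( t \)-distribution with \( n-1 \) degrees of freedom, we have:

\[ \Pr\left( -t_{\mathrm{crit}} < \frac{\bar{Y}-\mu}{\mathrm{SE}_{\bar{Y}}} < t_{\mathrm{crit}} \right) = 0.95 \]

Therefore, the 95% confidence interval for the mean is:

\[ \bar{Y} - t_{\mathrm{crit}}\,\mathrm{SE}_{\bar{Y}} < \mu < \bar{Y} + t_{\mathrm{crit}}\,\mathrm{SE}_{\bar{Y}} \]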
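
One way to check what "95%" means is to simulate many samples from a normal population and count how often the interval built from the t-distribution contains the true mean. The sketch below assumes a hypothetical population with mean 50 and standard deviation 10, samples of n = 10, and the multiplier 2.26 quoted earlier for n = 10; all names are illustrative.

```python
import math
import random
import statistics

rng = random.Random(2024)  # fixed seed (an assumption) for repeatability

MU, SIGMA = 50.0, 10.0  # hypothetical normal population
N = 10                  # sample size
T_CRIT = 2.26           # 5% critical value for df = 9, as quoted in the text


def interval_covers_mu():
    """Draw one sample and report whether its 95% CI contains MU."""
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)
    return mean - T_CRIT * se < MU < mean + T_CRIT * se


REPEATS = 5000
coverage = sum(interval_covers_mu() for _ in range(REPEATS)) / REPEATS
print(f"fraction of intervals containing mu: {coverage:.3f}")
```

The fraction should come out close to 0.95: the parameter is fixed, and it is the interval that varies from sample to sample.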

Online Tutorials - Estimating with Uncertainty

Error bars

  • Standard deviation, standard error and confidence interval for the mean are often illustrated graphically with error bars.

  • Error bars are lines on a graph that extend outward from the sample estimate to illustrate the precision of estimates, reflecting uncertainty about the value of the parameter being estimated

  • Use error bars only to illustrate the precision of estimates, not the variability in the data

Error bars

  • Strip chart of locust serotonin data with error bars extending one SE above, and one SE below, the mean


Error bars (n = 100)