Alban Guillaumet, Troy University
“Absolute certainty is a privilege of uneducated minds and fanatics. It is, for scientific folk, an unattainable ideal.”
- Cassius J. Keyser
Estimation is the process of inferring a population parameter from sample data. The crucial question is: “In the face of chance, how much can we trust an estimate?” To answer this question, we need to learn about how the sampling process might affect the estimates we get.
Definition: The sampling distribution is the probability distribution of all values for an estimate that we might obtain when we sample a population.
→ Probability of obtaining a gene of a given length when sampling a single gene at random.
In real life, we do NOT usually know the probability distribution of the population. Here the whole population of gene lengths is known, so let's take advantage of this opportunity to illustrate the process of sampling.
How?
Probability distribution of gene lengths in the whole population of n = 20290 genes (Parameters: Mean = 2622.0, SD = 2036.9)
Frequency distribution of gene lengths in a unique random sample of n = 100 genes (Estimates: Mean = 2411.8, SD = 1463.5)
What would happen if you took a second random sample of 100 genes? And a third?
Each time, you would generate new estimates of the same parameters, each slightly different from the true values.
If we could repeat this sampling an infinite number of times, we could build the probability distribution of our estimate, called the sampling distribution.
The sampling distribution represents the “population” of values for an estimate.
Taking a random sample of n observations from a population and calculating \( \bar{Y}\ \) is equivalent to randomly sampling a single value of \( \bar{Y}\ \) from its sampling distribution.
Although the population mean \( \mu\ \) is a constant (2622.0), its estimate \( \bar{Y}\ \) is a variable! We don't usually see the sampling distribution of \( \bar{Y}\ \) because we have a single sample, and thus a single \( \bar{Y}\ \).
Notice also that the sampling distribution for \( \bar{Y}\ \) is centered exactly on the true mean \( \mu\ \), i.e. \( \bar{Y}\ \) is an unbiased estimate of \( \mu\ \).
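The repeated-sampling idea can be sketched in a few lines of Python. This is a minimal illustration, not the gene data itself: the population below is a hypothetical right-skewed stand-in, drawn from a gamma distribution whose shape and scale are assumptions chosen to give a mean and SD of roughly the same size as the gene lengths. Taking many samples of n = 100 and averaging their means shows the sample means clustering around the population mean.

```python
import random
import statistics

random.seed(1)

# Hypothetical stand-in population of 20,290 right-skewed "gene lengths".
# The gamma shape (1.66) and scale (1580.0) are assumptions, not the real data.
population = [random.gammavariate(1.66, 1580.0) for _ in range(20290)]
mu = statistics.fmean(population)  # the parameter: a constant for this population


def sample_mean(pop, n):
    """Mean of one random sample of n observations drawn without replacement."""
    return statistics.fmean(random.sample(pop, n))


# Approximate the sampling distribution of the mean with 2000 repeated samples.
means = [sample_mean(population, 100) for _ in range(2000)]

print(f"population mean (mu):     {mu:.1f}")
print(f"mean of the sample means: {statistics.fmean(means):.1f}")
print(f"SD of the sample means:   {statistics.stdev(means):.1f}")
```

Each value in `means` plays the role of one \( \bar{Y}\ \); a single study only ever sees one of them.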
What happens when the sample size increases?
Increasing sample size reduces the spread of the sampling distribution of an estimate, increasing precision.
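A small simulation can make the effect of sample size concrete. The sketch below again uses a hypothetical gamma-distributed population (an assumption, not the real gene data) and compares the spread of the sample means for three arbitrarily chosen sample sizes.

```python
import random
import statistics

random.seed(2)

# Hypothetical right-skewed population (an assumption, not the real gene data).
population = [random.gammavariate(1.66, 1580.0) for _ in range(20290)]


def spread_of_sample_means(pop, n, reps=1000):
    """Standard deviation of the means of `reps` random samples of size n."""
    means = [statistics.fmean(random.sample(pop, n)) for _ in range(reps)]
    return statistics.stdev(means)


# Illustrative sample sizes; any increasing sequence would show the same trend.
spreads = {n: spread_of_sample_means(population, n) for n in (20, 100, 500)}
for n, sd in spreads.items():
    print(f"n = {n:3d}: SD of sample means = {sd:.1f}")
```

The spread shrinks as n grows, which is the gain in precision described above.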
How can the sampling distribution be used to measure the uncertainty of an estimate?
Definition: The standard error is the usual way to indicate the uncertainty of an estimate. The standard error of an estimate is the standard deviation of the estimate’s sampling distribution.
The smaller the standard error, the less uncertainty there is about the target parameter in the population.
Definition: The standard error of the mean is estimated from the data as the sample standard deviation (s) divided by the square root of the sample size (n), i.e. \( \mathrm{SE}_{\bar{Y}} = s / \sqrt{n} \).
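As a worked illustration, the standard error of the mean can be computed either from raw observations or from summary statistics. The function name `standard_error_of_mean` and the four-value list are hypothetical, introduced only for this sketch; the summary values (s = 1463.5, n = 100) are those reported above for the sample of 100 genes.

```python
import math
import statistics


def standard_error_of_mean(values):
    """Estimate the SE of the mean as s / sqrt(n), where s is the sample SD."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to estimate s")
    return statistics.stdev(values) / math.sqrt(n)


# From summary statistics: the sample of 100 genes has s = 1463.5.
s, n = 1463.5, 100
se_genes = s / math.sqrt(n)
print(f"SE of the mean gene length: {se_genes:.2f}")

# From raw data: a tiny hypothetical sample, for illustration only.
se_toy = standard_error_of_mean([2, 4, 4, 6])
print(f"SE of the toy sample mean: {se_toy:.4f}")
```

For the gene sample this gives a standard error of about 146, far smaller than the sample standard deviation itself.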
Next, we'll see how the standard error estimated from a sample can be used to calculate a range likely to contain the value of the true population parameter (the confidence interval).
Definition: A confidence interval is a range of values surrounding the sample estimate that is likely to contain the population parameter.
The 95% confidence interval provides a most plausible range for a parameter.
For a population having a normal distribution (bell curve), a rough approximation to the 95% confidence interval for a mean is the sample mean plus or minus 2 standard errors of the mean.
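Applied to the sample of 100 genes (mean = 2411.8, SD = 1463.5), the 2SE rule of thumb can be sketched as follows. The helper name `approx_ci_2se` is hypothetical, and the result is only the rough approximation described here, not an exact interval.

```python
import math


def approx_ci_2se(mean, sd, n):
    """Rough 95% CI for a mean: mean +/- 2 * SE, with SE = sd / sqrt(n)."""
    se = sd / math.sqrt(n)
    return mean - 2 * se, mean + 2 * se


# Estimates from the single random sample of 100 genes.
low, high = approx_ci_2se(2411.8, 1463.5, 100)
print(f"approximate 95% CI: {low:.1f} to {high:.1f}")
```

In this particular case the interval happens to include the known population mean of 2622.0.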
In general, the 2SE rule of thumb is not exact; rigorously, we must multiply SE by a value slightly different from 2, corresponding to the “5% critical value” of a Student’s t-distribution with n-1 degrees of freedom: e.g., 2.78 (n = 5, df = 4), 2.26 (n = 10, df = 9), 2.00 (n = 61, df = 60), and 1.96 (as n approaches infinity).
Definition: For \( Y \sim N(\mu,\sigma^2) \), the statistic defined by
\[ t = \frac{\bar{Y}-\mu}{\mathrm{SE}_{\bar{Y}}} \] has a Student’s \( t \)-distribution with \( n-1 \) degrees of freedom.
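Therefore, the 95% confidence interval for the mean is:
\[ \bar{Y} - t_{0.05(2),\,n-1}\,\mathrm{SE}_{\bar{Y}} < \mu < \bar{Y} + t_{0.05(2),\,n-1}\,\mathrm{SE}_{\bar{Y}} \]
where \( t_{0.05(2),\,n-1} \) denotes the two-tailed 5% critical value of the Student’s \( t \)-distribution with \( n-1 \) degrees of freedom.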
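A t-based interval replaces the multiplier 2 with the critical value for n-1 degrees of freedom. The sketch below stores only the critical values quoted in these notes; the names `T_CRIT_05` and `t_ci`, and the small-sample summary (n = 5, mean = 10.0, SD = 2.0), are hypothetical and serve only to show how much wider the interval becomes at small n.

```python
import math

# Two-tailed 5% critical values of Student's t quoted in the notes, keyed by df.
T_CRIT_05 = {4: 2.78, 9: 2.26, 60: 2.00}


def t_ci(mean, sd, n, t_table=T_CRIT_05):
    """95% CI for a mean: mean +/- t * SE, with t looked up for df = n - 1."""
    df = n - 1
    if df not in t_table:
        raise KeyError(f"no critical value stored for df = {df}")
    se = sd / math.sqrt(n)
    half_width = t_table[df] * se
    return mean - half_width, mean + half_width


# Hypothetical small-sample summary, for illustration only.
low, high = t_ci(10.0, 2.0, 5)
se = 2.0 / math.sqrt(5)
print(f"t-based 95% CI: {low:.2f} to {high:.2f}")
print(f"2SE rule:       {10.0 - 2 * se:.2f} to {10.0 + 2 * se:.2f}")
```

With only five observations the t-based interval is noticeably wider than the 2SE approximation; by n = 61 the two coincide.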
Standard deviation, standard error and confidence interval for the mean are often illustrated graphically with error bars.
Error bars are lines on a graph that extend outward from the sample estimate to illustrate the precision of estimates, reflecting uncertainty about the value of the parameter being estimated.
Use error bars only to illustrate the precision of estimates, not the variability in the data.