“I believe that we do not know anything for certain, but everything probably.”
- Christiaan Huygens
Precision vs Accuracy
Random sampling
The main assumption of all statistical techniques is that your data come from a random sample.
Definition: In a random sample, each member of a population has an equal and independent chance of being selected.
Random sampling
minimizes bias (equal) and
makes it possible to measure the amount of sampling error, i.e. to quantify precision (independent)
Random sampling
Suppose we have 1000 households with 5 members per household. We measure two variables from each person (e.g. height and weight). Presumably, those two variables will be similar for members of the same household (i.e. members from the same household are dependent samples).
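The household example can be sketched as a small simulation. This is only an illustration under assumed numbers: the household-level and individual-level standard deviations (8 and 4) and the mean height of 170 are hypothetical, and only one of the two variables (height) is simulated.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

n_households <- 1000
n_members <- 5

# Assumed model: each household has its own mean height, and members
# vary around that household mean (all numbers are hypothetical).
household_id <- rep(seq_len(n_households), each = n_members)
household_mean <- rnorm(n_households, mean = 170, sd = 8)
height <- rnorm(n_households * n_members,
                mean = household_mean[household_id], sd = 4)

# First and second member of every household.
member1 <- height[seq(1, by = n_members, length.out = n_households)]
member2 <- height[seq(2, by = n_members, length.out = n_households)]

# Members of the same household are strongly correlated ...
cor_same_household <- cor(member1, member2)
# ... whereas pairing people from shuffled households removes the dependence.
cor_shuffled <- cor(member1, sample(member2))

c(same_household = cor_same_household, shuffled = cor_shuffled)
```

Under these assumptions, knowing one household member's height tells you a lot about the others, so five members of one household carry less information than five people drawn independently.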
Random sampling
Unbiased sample (n=10)
Pseudoreplicated sample (n=10): lack of independence
Biased sample (increased chance of selection for larger x values)
100 samples of size 10
TL;DR #1: Pseudoreplication (lack of independence) affects precision
100 samples of size 10
TL;DR #2: Bias (lack of equality) affects accuracy
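Both TL;DRs can be checked with a rough simulation of 100 samples of size 10 under three sampling designs. The population below is hypothetical (values clustered by household), and the helper names `draw_random`, `draw_pseudo`, and `draw_biased` are made up for this sketch.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical population: 1000 households of 5, values clustered by household.
n_households <- 1000
n_members <- 5
household_id <- rep(seq_len(n_households), each = n_members)
x <- rnorm(n_households, mean = 50, sd = 10)[household_id] +
  rnorm(n_households * n_members, mean = 0, sd = 3)
pop_mean <- mean(x)

# Random sample: every individual has an equal and independent chance.
draw_random <- function() mean(sample(x, 10))

# Pseudoreplicated sample: pick 2 households and take all 5 members of each.
draw_pseudo <- function() {
  chosen <- sample(n_households, 2)
  mean(x[household_id %in% chosen])
}

# Biased sample: chance of selection increases with x.
draw_biased <- function() mean(sample(x, 10, prob = x - min(x) + 1))

n_reps <- 100
means_random <- replicate(n_reps, draw_random())
means_pseudo <- replicate(n_reps, draw_pseudo())
means_biased <- replicate(n_reps, draw_biased())

results <- data.frame(
  design = c("random", "pseudoreplicated", "biased"),
  bias   = c(mean(means_random), mean(means_pseudo), mean(means_biased)) - pop_mean,
  spread = c(sd(means_random), sd(means_pseudo), sd(means_biased))
)
results
```

In this setup the pseudoreplicated design should show a larger `spread` (worse precision) with little bias, while the biased design should show sample means shifted above the population mean (worse accuracy).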
Language: Sampling Distributions
Definition: The sampling distribution is the probability distribution of all values for an estimate that we might obtain when we sample a population.
Definition: The standard error of an estimate is the standard deviation of the estimate’s sampling distribution.
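These two definitions apply to any estimate, not only the mean. As a minimal sketch, the code below approximates the sampling distribution of a sample median by repeated sampling and takes its standard deviation as the standard error. The population (right-skewed, exponential) and the sample size of 25 are assumptions chosen for illustration.

```r
set.seed(7)  # arbitrary seed, only for reproducibility

# Hypothetical population: 10,000 values from a right-skewed distribution.
population <- rexp(10000, rate = 1 / 20)

n <- 25          # size of each sample
n_reps <- 5000   # number of repeated samples

# Approximate the sampling distribution of the median:
# draw many samples and record the estimate from each one.
sample_medians <- replicate(n_reps, median(sample(population, n)))

# The standard error is the standard deviation of the sampling distribution.
se_median <- sd(sample_medians)

c(population_median = median(population),
  mean_of_estimates = mean(sample_medians),
  standard_error    = se_median)
```

Calling `hist(sample_medians)` would display the approximate sampling distribution itself.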
Language: Sampling Distributions
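Definition: The standard error of the mean is given by \[
\sigma_{\overline{Y}} = \frac{\sigma}{\sqrt{n}}
\] where \(\sigma\) is the population standard deviation and \(n\) is the sample size. Because \(\sigma\) is rarely known, the standard error of the mean is estimated from the sample standard deviation \(s\): \[
\mathrm{SE}_{\overline{Y}} = \frac{s}{\sqrt{n}}
\]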
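The formula can be compared against simulation. The sketch below assumes a normal population with hypothetical values for the mean, the standard deviation, and n, and compares three quantities: the theoretical standard error, the standard deviation of many simulated sample means, and the estimate from a single sample.

```r
set.seed(11)  # arbitrary seed, only for reproducibility

# Hypothetical population parameters and sample size.
mu <- 10
sigma <- 5
n <- 10

# Theoretical standard error of the mean: sigma / sqrt(n).
theoretical_se <- sigma / sqrt(n)

# Empirical check: standard deviation of many simulated sample means.
sample_means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))
empirical_se <- sd(sample_means)

# In practice we have one sample, so sigma is replaced by s.
y <- rnorm(n, mean = mu, sd = sigma)
estimated_se <- sd(y) / sqrt(n)

c(theoretical = theoretical_se,
  empirical   = empirical_se,
  estimated   = estimated_se)
```

The first two values should agree closely. The third varies from sample to sample because s is itself an estimate.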
“Chalk” talk - Sampling distributions and 95% confidence intervals
Language: Confidence Intervals
Definition: A confidence interval is a range of values surrounding the sample estimate that is likely to contain the population parameter.
Definition: A 95% confidence interval provides a most-plausible range for a parameter. Values lying within the interval are most plausible, whereas those outside are less plausible, based on the data.
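One common way to build a 95% confidence interval for a mean is the estimate plus or minus a t critical value times the standard error. The sketch below uses that approach (the slides do not state which method is intended) and then checks how often the interval contains the true mean when sampling from a hypothetical normal population. The function name `ci_mean` is made up for this example.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility

# t-based confidence interval for a mean:
# estimate +/- t critical value * standard error.
ci_mean <- function(y, level = 0.95) {
  n <- length(y)
  se <- sd(y) / sqrt(n)
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)
  c(lower = mean(y) - t_crit * se, upper = mean(y) + t_crit * se)
}

# Coverage check under an assumed population (mean 10, sd 5, n = 10).
true_mean <- 10
n_reps <- 2000
covered <- replicate(n_reps, {
  ci <- ci_mean(rnorm(10, mean = true_mean, sd = 5))
  ci[["lower"]] <= true_mean && true_mean <= ci[["upper"]]
})
coverage <- mean(covered)
coverage
```

Across repeated samples, roughly 95% of the intervals should contain the true mean, which is the sense in which the interval is "likely to contain the population parameter".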
First, calculate the statistics by group needed for the error bars: the mean and standard error. Here, summarize and group_by are used to obtain each quantity by treatment group.
library(dplyr)  # provides summarize(), group_by(), and n()

(locustStats <- summarize(
  group_by(locustData, treatmentTime),
  mean = mean(serotoninLevel),
  sd = sd(serotoninLevel),
  n = n(),
  se = sd / sqrt(n)
))
# A tibble: 3 × 5
  treatmentTime  mean    sd     n    se
          <int> <dbl> <dbl> <int> <dbl>
1             0  6.36  4.82    10  1.52
2             1  8.04  4.96    10  1.57
3             2 10.8   5.33    10  1.68
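The same summary values can be turned into approximate 95% confidence limits for each group. Because `locustData` itself is not shown here, the sketch rebuilds the table from the rounded values printed above, so results may differ slightly from those computed on the raw data. The name `locustStatsPrinted` and the t-based interval are assumptions of this example.

```r
# Summary values copied from the printed tibble (rounded, so approximate).
locustStatsPrinted <- data.frame(
  treatmentTime = c(0, 1, 2),
  mean = c(6.36, 8.04, 10.8),
  sd   = c(4.82, 4.96, 5.33),
  n    = c(10, 10, 10)
)

# Standard error and t critical value for each group.
locustStatsPrinted$se <- locustStatsPrinted$sd / sqrt(locustStatsPrinted$n)
t_crit <- qt(0.975, df = locustStatsPrinted$n - 1)

# 95% confidence limits: mean +/- t critical value * standard error.
locustStatsPrinted$lower <- locustStatsPrinted$mean - t_crit * locustStatsPrinted$se
locustStatsPrinted$upper <- locustStatsPrinted$mean + t_crit * locustStatsPrinted$se

locustStatsPrinted
```

The `lower` and `upper` columns could serve as error-bar limits in place of plus or minus one standard error.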