M. Drew LaMar
September 16, 2022
“I believe that we do not know anything for certain, but everything probably.”
- Christiaan Huygens
The main assumption of all statistical techniques is that your data come from a random sample.
Definition: In a random sample, each member of a population has an equal and independent chance of being selected.
Random sampling
Suppose we have 1000 households with 5 members per household. We measure two variables from each person (e.g. height and weight). Presumably, those two variables will be similar for members of the same household (i.e. members of the same household are dependent samples).
Unbiased sample (n=10)
Pseudoreplicated sample (n=10): Lack of independence
Biased sample (increased chance of selection for larger x values)
TL;DR #1: Pseudoreplication (lack of independence) affects precision
TL;DR #2: Bias (lack of equality) affects accuracy
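The two take-home points can be checked with a small simulation. The sketch below is only an illustration under assumed numbers: the single variable `x` (think of height in cm), the household and individual standard deviations, and the selection weights are all hypothetical and do not come from the slides. It compares the sample mean under a random sample, a pseudoreplicated sample (two whole households), and a biased sample (larger `x` values more likely to be chosen).

```r
# Hypothetical population: 1000 households with 5 members each.
set.seed(1)  # arbitrary seed, used only for reproducibility
nHouseholds <- 1000
nMembers <- 5
household <- rep(1:nHouseholds, each = nMembers)

# Assumed model: a shared household effect plus individual variation,
# so members of the same household have similar values of x.
householdEffect <- rnorm(nHouseholds, mean = 0, sd = 8)
x <- 170 + householdEffect[household] +
  rnorm(nHouseholds * nMembers, mean = 0, sd = 4)
popMean <- mean(x)

nReps <- 2000  # number of repeated samples per design

# Random sample: 10 individuals, equal and independent chance of selection.
randomMeans <- replicate(nReps, mean(sample(x, 10)))

# Pseudoreplicated sample: 2 whole households (2 x 5 = 10 individuals).
pseudoMeans <- replicate(nReps, {
  chosen <- sample(nHouseholds, 2)
  mean(x[household %in% chosen])
})

# Biased sample: chance of selection increases with x.
selectionWeight <- x - min(x) + 1
biasedMeans <- replicate(nReps, mean(sample(x, 10, prob = selectionWeight)))

# bias   = average estimate minus the true population mean (accuracy)
# spread = standard deviation of the estimates (precision)
simSummary <- data.frame(
  design = c("random", "pseudoreplicated", "biased"),
  bias = c(mean(randomMeans), mean(pseudoMeans), mean(biasedMeans)) - popMean,
  spread = c(sd(randomMeans), sd(pseudoMeans), sd(biasedMeans))
)
print(simSummary)
```

Under these assumptions, the pseudoreplicated design should show a noticeably larger `spread` than the random design with little bias, while the biased design should show a clearly positive `bias`.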
Definition: The sampling distribution is the probability distribution of all values for an estimate that we might obtain when we sample a population.
Definition: The standard error of an estimate is the standard deviation of the estimate’s sampling distribution.
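Definition: The standard error of the mean is given by
\[ \sigma_{\overline{Y}} = \frac{\sigma}{\sqrt{n}} \]
where \(\sigma\) is the population standard deviation and \(n\) is the sample size. Because \(\sigma\) is rarely known, the approximate standard error of the mean is computed from the sample standard deviation \(s\):
\[ \mathrm{SE}_{\overline{Y}} = \frac{s}{\sqrt{n}} \]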
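To connect the sampling distribution with the formula for the standard error of the mean, here is a minimal sketch. It assumes a normally distributed population with made-up values (mean 10, \(\sigma = 2\), \(n = 25\)); none of these numbers come from the slides. It approximates the sampling distribution by repeated sampling and compares its standard deviation with \(\sigma / \sqrt{n}\).

```r
set.seed(42)  # arbitrary seed, used only for reproducibility
sigma <- 2    # assumed population standard deviation
n <- 25       # assumed sample size

# Approximate the sampling distribution of the mean by drawing many samples.
sampleMeans <- replicate(10000, mean(rnorm(n, mean = 10, sd = sigma)))

empiricalSE <- sd(sampleMeans)     # SD of the simulated sampling distribution
theoreticalSE <- sigma / sqrt(n)   # sigma / sqrt(n) = 2 / 5 = 0.4

# In practice we have one sample and estimate the standard error from it.
y <- rnorm(n, mean = 10, sd = sigma)
estimatedSE <- sd(y) / sqrt(length(y))

print(c(empirical = empiricalSE,
        theoretical = theoreticalSE,
        estimated = estimatedSE))
```

The empirical and theoretical values should agree closely. The single-sample estimate will vary from sample to sample because \(s\) is itself an estimate of \(\sigma\).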
Definition: A confidence interval is a range of values surrounding the sample estimate that is likely to contain the population parameter.
Definition: A 95% confidence interval provides a most-plausible range for a parameter. Values lying within the interval are most plausible, whereas those outside are less plausible, based on the data.
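As a rough illustration of how a 95% confidence interval for a mean might be computed, the sketch below uses only the first six serotonin values printed by `head(locustData)` in these slides. They are not a full treatment group, so the numbers are purely illustrative. Two versions are shown: the rule-of-thumb interval of the mean plus or minus two standard errors, and an interval based on the t distribution, which assumes the values are a random sample from an approximately normal population.

```r
# Illustrative values only: the first six rows shown by head(locustData).
y <- c(5.3, 4.6, 4.5, 4.3, 4.2, 3.6)

n <- length(y)
yBar <- mean(y)
se <- sd(y) / sqrt(n)  # approximate standard error of the mean

# Rule of thumb: mean +/- 2 standard errors.
roughInterval <- c(yBar - 2 * se, yBar + 2 * se)

# t-based 95% interval (assumes a random sample from a normal population).
tCritical <- qt(0.975, df = n - 1)
tInterval <- c(yBar - tCritical * se, yBar + tCritical * se)

print(roughInterval)
print(tInterval)
```

With a sample this small, the t-based interval is wider than the two-standard-error rule, because the critical value from the t distribution exceeds 2. The built-in `t.test(y)$conf.int` returns the same t-based interval.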
Read and inspect the data.
locustData <- read.csv(here::here("Datasets/chapter02/chap02f1_2locustSerotonin.csv"))
head(locustData)
  serotoninLevel treatmentTime
1            5.3             0
2            4.6             0
3            4.5             0
4            4.3             0
5            4.2             0
6            3.6             0
str(locustData)
'data.frame': 30 obs. of 2 variables:
$ serotoninLevel: num 5.3 4.6 4.5 4.3 4.2 3.6 3.7 3.3 12.1 18 ...
$ treatmentTime : int 0 0 0 0 0 0 0 0 0 0 ...
First, calculate the statistics by group needed for the error bars: the mean and standard error. Here, tapply is used to obtain each quantity by treatment group.
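# Mean serotonin level in each treatment group
meanSerotonin <- tapply(locustData$serotoninLevel,
                        locustData$treatmentTime,
                        mean)
# Standard deviation in each treatment group
sdSerotonin <- tapply(locustData$serotoninLevel,
                      locustData$treatmentTime,
                      sd)
# Sample size in each treatment group
nSerotonin <- tapply(locustData$serotoninLevel,
                     locustData$treatmentTime,
                     length)
# Standard error of the mean: s / sqrt(n)
seSerotonin <- sdSerotonin / sqrt(nSerotonin)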
Draw the strip chart and then add the error bars.
\[ \bar{Y} \pm SE_{\bar{Y}} \]
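# Horizontal shift so the error bars sit beside the data points
offsetAmount <- 0.2
stripchart(serotoninLevel ~ treatmentTime,
           data = locustData,
           method = "jitter",
           vertical = TRUE)
# Error bars: mean minus one SE to mean plus one SE, for each of the 3 groups
segments(1:3 + offsetAmount,
         meanSerotonin - seSerotonin,
         1:3 + offsetAmount,
         meanSerotonin + seSerotonin)
# Filled circles at the group means
points(1:3 + offsetAmount,
       meanSerotonin,
       pch = 16,
       cex = 1.2)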
Different error bars!!!
\[ \begin{array}{l} \bar{Y} \pm \mathrm{sd} \\ \bar{Y} \pm SE_{\bar{Y}} \\ \bar{Y} \pm 2 \times SE_{\bar{Y}} \end{array} \]