Estimating with Uncertainty

M. Drew LaMar
September 16, 2022

“I believe that we do not know anything for certain, but everything probably.”

- Christiaan Huygens

Precision vs Accuracy

Random sampling

The main assumption of all statistical techniques is that your data come from a random sample.

Definition: In a random sample, each member of a population has an equal and independent chance of being selected.


Random sampling

  1. minimizes bias (equal chance of selection) and
  2. makes it possible to quantify sampling error, i.e. precision (independent selection)

Random sampling

Suppose we have 1000 households with 5 members per household. We measure two variables from each person (e.g. height and weight). Presumably, those two variables will be similar for members of the same household (i.e. members from the same household are dependent samples).


Unbiased sample


Unbiased sample (n=10)

Pseudoreplicated sample


Pseudoreplicated sample (n=10): Lack of independence

Biased sample


Biased sample (increased chance of selection for larger x values)
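The three sampling schemes can be sketched in R roughly as follows. This is an illustration only: the population is simulated, and the effect sizes, the seed, and the names `pop`, `household`, `x`, and `y` are assumptions rather than the code behind the original figures.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical population: 1000 households with 5 members each.
# Members share a household-level effect, so they are not independent.
nHouseholds <- 1000
nMembers <- 5
household <- rep(1:nHouseholds, each = nMembers)
hhX <- rnorm(nHouseholds)  # assumed household effect on x
hhY <- rnorm(nHouseholds)  # assumed household effect on y
pop <- data.frame(
  household = household,
  x = hhX[household] + rnorm(nHouseholds * nMembers, sd = 0.3),
  y = hhY[household] + rnorm(nHouseholds * nMembers, sd = 0.3)
)

n <- 10

# Unbiased sample: every individual has an equal and independent chance.
randomSample <- pop[sample(nrow(pop), n), ]

# Pseudoreplicated sample: choose 2 households, then take all 5 members of each.
chosen <- sample(nHouseholds, n / nMembers)
pseudoSample <- pop[pop$household %in% chosen, ]

# Biased sample: the chance of selection increases with x.
w <- rank(pop$x)
biasedSample <- pop[sample(nrow(pop), n, prob = w), ]
```

All three samples have n = 10 rows, but only the first satisfies both the "equal" and the "independent" parts of the definition.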

100 samples of size 10


TL;DR #1: Pseudoreplication (lack of independence) affects precision

100 samples of size 10


TL;DR #2: Bias (lack of equality) affects accuracy
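Both take-home points can be checked with a small simulation. The sketch below draws 100 samples of size 10 under each scheme from a simulated population (all numerical settings and names are assumptions), then compares the spread of the sample means (precision) and their average offset from the true mean (accuracy).

```r
set.seed(2)  # arbitrary seed

# Hypothetical population: 1000 households x 5 members, one variable x.
nHouseholds <- 1000
nMembers <- 5
n <- 10
household <- rep(1:nHouseholds, each = nMembers)
x <- rnorm(nHouseholds)[household] + rnorm(nHouseholds * nMembers, sd = 0.3)
w <- rank(x)  # selection weights for the biased scheme

# Each function draws one sample of size n and returns its mean.
drawRandom <- function() mean(sample(x, n))
drawPseudo <- function() mean(x[household %in% sample(nHouseholds, n / nMembers)])
drawBiased <- function() mean(sample(x, n, prob = w))

nSamples <- 100
est <- data.frame(
  random = replicate(nSamples, drawRandom()),
  pseudo = replicate(nSamples, drawPseudo()),
  biased = replicate(nSamples, drawBiased())
)

# Precision: spread of the estimates. Accuracy: offset from the true mean.
precision <- sapply(est, sd)
bias <- sapply(est, mean) - mean(x)
round(rbind(precision, bias), 2)
```

Under these assumptions the pseudoreplicated means should scatter more widely than the random ones while staying centered on the truth, and the biased means should be shifted away from it.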

Language: Sampling Distributions

Definition: The sampling distribution is the probability distribution of all values for an estimate that we might obtain when we sample a population.

Definition: The standard error of an estimate is the standard deviation of the estimate’s sampling distribution.

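Definition: The standard error of the mean is given by
\[ \sigma_{\overline{Y}} = \frac{\sigma}{\sqrt{n}} \]
where \( \sigma \) is the population standard deviation and \( n \) is the sample size. Because \( \sigma \) is rarely known, the standard error is estimated from the sample standard deviation \( s \):
\[ \mathrm{SE}_{\overline{Y}} = \frac{s}{\sqrt{n}} \]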
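A quick way to see the formula at work is to approximate a sampling distribution by brute force. The sketch below assumes a normal population with mean 10 and standard deviation 2 (arbitrary choices) and compares the standard deviation of many sample means with \( \sigma / \sqrt{n} \).

```r
set.seed(3)  # arbitrary seed

sigma <- 2   # assumed population standard deviation
n <- 25      # sample size

# Approximate the sampling distribution of the mean with 10,000 samples.
sampleMeans <- replicate(10000, mean(rnorm(n, mean = 10, sd = sigma)))

sd(sampleMeans)   # empirical standard error of the mean
sigma / sqrt(n)   # theoretical value: 2 / 5 = 0.4

# With a single sample, sigma is unknown and is replaced by s.
y <- rnorm(n, mean = 10, sd = sigma)
seY <- sd(y) / sqrt(n)
seY
```

The first two numbers should agree closely, while `seY` varies from sample to sample because it depends on `s`.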

Sampling distributions tutorial

"Chalk" talk - Sampling distributions and 95% confidence intervals

Language: Confidence Intervals

Definition: A confidence interval is a range of values surrounding the sample estimate that is likely to contain the population parameter.

Definition: A 95% confidence interval provides a most-plausible range for a parameter. Values lying within the interval are most plausible, whereas those outside are less plausible, based on the data.
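One common way to compute a 95% confidence interval for a mean uses a critical value from the t distribution, which is close to 2 for moderate sample sizes. The sketch below assumes that method and a simulated normal population; the helper name `ci95` is hypothetical. It also checks how often such intervals capture the true mean.

```r
set.seed(4)  # arbitrary seed

# Assumed population values, for illustration only.
mu <- 10
sigma <- 2
n <- 25

# Hypothetical helper: t-based 95% confidence interval for the mean.
ci95 <- function(y) {
  se <- sd(y) / sqrt(length(y))
  tCrit <- qt(0.975, df = length(y) - 1)
  c(lower = mean(y) - tCrit * se, upper = mean(y) + tCrit * se)
}

y <- rnorm(n, mu, sigma)
ci95(y)

# Long-run behavior: the fraction of intervals that contain mu.
covered <- replicate(2000, {
  ci <- ci95(rnorm(n, mu, sigma))
  ci[["lower"]] <= mu && mu <= ci[["upper"]]
})
mean(covered)   # should be close to 0.95
```

The built-in `t.test(y)$conf.int` returns the same interval, so in practice the helper is only needed to make the arithmetic visible.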

Confidence intervals tutorial

Error bars

How to do these in R?

Read and inspect the data.

locustData <- read.csv(here::here("Datasets/chapter02/chap02f1_2locustSerotonin.csv"))
head(locustData)
  serotoninLevel treatmentTime
1            5.3             0
2            4.6             0
3            4.5             0
4            4.3             0
5            4.2             0
6            3.6             0
str(locustData)
'data.frame':   30 obs. of  2 variables:
 $ serotoninLevel: num  5.3 4.6 4.5 4.3 4.2 3.6 3.7 3.3 12.1 18 ...
 $ treatmentTime : int  0 0 0 0 0 0 0 0 0 0 ...

Error bars

First, calculate the statistics by group needed for the error bars: the mean and standard error. Here, tapply is used to obtain each quantity by treatment group.

meanSerotonin <- tapply(locustData$serotoninLevel, 
                        locustData$treatmentTime, 
                        mean)
sdSerotonin <- tapply(locustData$serotoninLevel, 
                      locustData$treatmentTime, 
                      sd)
nSerotonin <- tapply(locustData$serotoninLevel, 
                     locustData$treatmentTime, 
                     length)
seSerotonin <- sdSerotonin / sqrt(nSerotonin)

Error bars

Draw the strip chart and then add the error bars.

\[ \bar{Y} \pm SE_{\bar{Y}} \]

offsetAmount <- 0.2
stripchart(serotoninLevel ~ treatmentTime, 
           data = locustData, 
           method = "jitter", 
           vertical = TRUE)

segments(1:3 + offsetAmount, 
         meanSerotonin - seSerotonin, 
         1:3 + offsetAmount, 
         meanSerotonin + seSerotonin)

points(1:3 + offsetAmount, 
       meanSerotonin, 
       pch = 16, 
       cex = 1.2)


Error bars can mean different things!!!


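Different error bars!!!

  1. \( \bar{Y} \pm s \): one standard deviation, showing the spread of the data
  2. \( \bar{Y} \pm \mathrm{SE}_{\bar{Y}} \): one standard error, showing the precision of the estimated mean
  3. \( \bar{Y} \pm 2 \times \mathrm{SE}_{\bar{Y}} \): approximately a 95% confidence interval for the mean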
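To see how much the three choices differ, the sketch below computes all of them for one vector and draws them side by side. The vector holds the eight serotonin values visible in the `str()` output above, used purely as an example and not as a complete treatment group. The object names are assumptions.

```r
# Example values only (the first eight shown by str(locustData)).
y <- c(5.3, 4.6, 4.5, 4.3, 4.2, 3.6, 3.7, 3.3)

yBar <- mean(y)
s <- sd(y)
se <- s / sqrt(length(y))

# Lower and upper limits for each kind of error bar.
bars <- rbind(
  sd    = c(yBar - s,      yBar + s),
  se    = c(yBar - se,     yBar + se),
  twoSE = c(yBar - 2 * se, yBar + 2 * se)
)
colnames(bars) <- c("lower", "upper")
bars

# Same mean, three different bars.
plot(1:3, rep(yBar, 3), pch = 16, xaxt = "n",
     xlim = c(0.5, 3.5), ylim = range(bars),
     xlab = "", ylab = "y")
axis(1, at = 1:3, labels = rownames(bars))
arrows(1:3, bars[, "lower"], 1:3, bars[, "upper"],
       angle = 90, code = 3, length = 0.05)
```

Because the bar lengths differ so much for the same data, a figure caption should always state which of the three is being shown.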