C. Donovan
“Experts can't agree how many beluga sturgeon are left in the sea. At stake is the future of one of the world's most sought-after fish and its coveted 'black gold'.” (New Scientist, 20 Sept 2003).
CITES says 11.6 million in 2002; the Wildlife Conservation Society says maybe fewer than 0.5 million.
Why do we sample?
We will progressively introduce notation for our ideas, which allows precise and concise mathematical representation. One example is Sigma notation, which is simply a compact sum (the \( \Sigma \) is just a big “S” for Sum):
\[ \sum_{i=1}^n x_i = x_1 + x_2 + ... + x_n \]
So we just increment \( i \), summing along the way - here over the \( x_i \), for \( i \) from 1 to \( n \). Or say this:
\[ \sum_{i=1}^4 i = 1 + 2 + 3 + 4 = 10 \]
Or the sample mean mentioned above
\[ \bar{x} = n^{-1}\sum_{i=1}^n x_i \]
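The same sums can be written out in R. This is only a minimal sketch: the vector x below is made-up data for illustration, not something from the lecture.

```r
# Hypothetical data, chosen only to illustrate the notation
x <- c(2, 4, 6, 8)
n <- length(x)

# Sigma notation as an explicit loop: increment i, summing along the way
total_loop <- 0
for (i in seq_len(n)) {
  total_loop <- total_loop + x[i]
}

# The built-in sum() does the same job in one call
total <- sum(x)

# The sum of i for i = 1, ..., 4
sum_1_to_4 <- sum(1:4)

# Sample mean: n^{-1} times the sum of the x_i
x_bar <- total / n
```

Here x_bar agrees with R's built-in mean(x), which is the version we would normally use.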
[drawing ensues]
Vapers rise 'to more than three million' in Britain (BBC, 14/09/2018)
https://www.bbc.co.uk/news/health-45513762
Basically data collection without a plan:
Pundits, Twitter, online surveys (& Fox News :))
For a particular question/estimate, we should design data collection to best achieve it. Principally we want:
(as per our bulls-eye earlier)
All good samples have two features:
Sampling is a field in itself; here we consider only 3 basic strategies:
In general, given \( N \) units in the population and \( n \) units in the sample, a simple random sample (SRS) has the property that every possible sample of size \( n \) is chosen with the same probability.
This is really easy - simple even. In R say - sampling things with equal probability:
# I can generate random integers easily enough
sample(1:10, 5, replace = FALSE)
[1] 6 4 3 8 10
# So I can sample rows of a dataset or subject identifiers similarly
IDs <- c("subject1", "subject2", "subject3", "subject4")
sample(IDs, 2, replace = FALSE)
[1] "subject2" "subject3"
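The defining property of an SRS, that every possible sample of size n is equally likely, can be made concrete by listing all the samples for a tiny population. This is a sketch under assumed values (N = 5 and n = 2 are chosen only to keep the list short), and it treats a sample as an unordered set of units.

```r
# Tiny hypothetical population: N = 5 units, samples of size n = 2
N <- 5
n <- 2

# Every possible unordered sample; each column of the matrix is one sample
all_samples <- combn(N, n)

# Number of possible samples, which equals choose(N, n)
n_possible <- ncol(all_samples)

# Under SRS each of these samples has the same chance of being drawn
prob_each <- 1 / n_possible
```

The count grows very quickly with N and n, which is why in practice we draw one sample with sample() rather than enumerate them all.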
Computers are not random. However, they can be effectively so.
# unpredictable - 4 numbers from a uniform (0,1) distribution
runif(4)
[1] 0.9706928 0.4710283 0.4226242 0.2470565
runif(4)
[1] 0.4785024 0.7913874 0.4597954 0.5372739
But - with a seed, they are predictable.
# knowing the "starting point", aka a seed, we can reproduce the draws
set.seed(2343)
runif(4)
[1] 0.20467634 0.09047926 0.61101041 0.17877428
set.seed(2343)
runif(4)
[1] 0.20467634 0.09047926 0.61101041 0.17877428
Useful for confirming someone's calculations that have a stochastic component.
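The same idea carries over to drawing a sample of rows. A minimal sketch, assuming a made-up data frame dat (its name and columns are hypothetical) and reusing the seed value from above:

```r
# Hypothetical dataset standing in for real data
dat <- data.frame(id = 1:20, value = (1:20)^2)

# Draw 5 rows at random, fixing the seed first
set.seed(2343)
first_draw <- dat[sample(nrow(dat), 5), ]

# Resetting the same seed reproduces exactly the same rows
set.seed(2343)
second_draw <- dat[sample(nrow(dat), 5), ]

same_rows <- identical(first_draw, second_draw)
```

Anyone given the seed and the code can then check they get the same sample.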
Suppose there are \( N \)=1000 individuals and we want to take a sample of size \( n \)=200. A systematic sample can be taken as follows:
For example:
# get a random start - we'll aim to get about 6 from 30 subjects
# i.e. we're going to step in 5s
x <- sample(1:5, 1)
# take regular steps
seq(x, 30, by = 5)
[1] 3 8 13 18 23 28
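The same recipe can be scaled up to the N = 1000, n = 200 case. This is a sketch only: it assumes the individuals are numbered 1 to N and that the step is taken as N / n, which works cleanly here because N is a multiple of n.

```r
N <- 1000          # population size
n <- 200           # desired sample size
k <- N / n         # step between selected units (5 here); assumes n divides N

# Random start somewhere in the first k units
start <- sample(1:k, 1)

# Then take every k-th unit from the start
selected <- seq(start, N, by = k)
```

Whichever start is drawn, selected contains exactly n = 200 units spread evenly through the list.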
Why bother with this method over SRS?
Divide the population into different categories or “strata”, then take a separate SRS from each stratum.
For example, divide the population of the university into 4 strata: undergraduate students (5408), postgraduate students (1065), academic and research staff (649), and support (1109).
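One way this could look in R is sketched below. The stratum sizes are the ones given above; the total sample size of 200 and the choice of proportional allocation (each stratum sampled in proportion to its size) are assumptions made purely for illustration, and other allocations are possible.

```r
# Stratum sizes from the university example
strata <- c(undergrad = 5408, postgrad = 1065, academic = 649, support = 1109)

# Assumed total sample size (not given in the example)
n_total <- 200

# Proportional allocation: sample each stratum in proportion to its size.
# Rounding need not always add up to n_total in general.
alloc <- round(n_total * strata / sum(strata))

# A separate SRS within each stratum, with units numbered 1 to N_h
samples <- Map(function(N_h, n_h) sample(N_h, n_h), strata, alloc)
```

Each element of samples holds the selected unit numbers for one stratum, so every stratum is guaranteed to appear in the overall sample.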
Why do this?
Very common in environmental studies e.g. fisheries stock assessment - different areas are sampled with different intensity.
These usually arise from not sampling the population we thought we were sampling, or from our means of measuring altering the result.
We've covered:
Next: