MT5762 Lecture 2

C. Donovan

Recall - Abundance of prized sturgeon

“Experts can't agree how many beluga sturgeon are left in the sea. At stake is the future of one of the world's most sought-after fish and its coveted 'black gold'.” (New Scientist, 20 Sept 2003).

CITES says 11.6 million in 2002; Wildlife Conservation Society says maybe less than 0.5 million.

  • Why would estimates from different sources vary so much?
  • How would you approach such a problem?
  • What if you had very limited resources?

General steps in a statistical investigation

  • Clearly state the questions to be answered. What are the objectives?
  • Collect or generate data. (How? How much?)
  • Screen, explore, manipulate and store data.
  • Calculate formal statistical summaries, carry out tests/fit models -> estimation, inference and prediction.
  • Explore sensitivity of previous analyses to assumptions, to individual observations.
  • Communicate findings, build into software, deliverables etc.

Sampling

Why do we sample?

  • because we must! Life would be easier without the uncertainty it introduces.

Preliminary terminology

  • Sampling unit: an individual thing or person on which measurements can be made
  • Population: the overall collection of units about which we want to present answers, e.g. all adult males or all first-year students at St. Andrews University
  • Sampling protocol or design: the procedure for selecting units from the population of interest
  • Sample: the subset of the population for which measurements on units are made
  • Census: when the entire population is sampled
  • Variable: a characteristic defined for each unit (e.g. age, sex, weight, blood type), the realisation of which is typically denoted by lower case Roman letters, e.g. \( y \), \( x \).

Preliminary terminology

  • Parameter: a numerical summary for the entire population of interest, e.g. the proportion of the UK voting population favouring a particular political party. It is customary to use Greek letters for population parameters, e.g. \( \mu \).
  • Estimate/Statistic: a numerical summary of a variable for the sample, e.g. the proportion of people surveyed favouring a political party. The notation is usually based on lower case Roman letters, e.g. \( \bar{x} \) for a sample mean. Note these are often estimates of a population parameter: \( \bar{x} \) estimates \( \mu \).
  • Bias (in an estimate): systematic error in one direction, positive or negative. Many causes are possible - poor data collection is one.
  • Imprecision (in an estimate): the magnitude of the chance or random error, which is not systematic
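To make the parameter/estimate distinction concrete, here is a minimal sketch in R. The population is simulated, so the 40% support level, the population size and the survey size are all made-up assumptions rather than real polling figures.

```r
set.seed(10)                        # arbitrary seed, only so the sketch is reproducible

# Hypothetical voting population: 1 = favours the party, 0 = does not
N     <- 100000
votes <- rbinom(N, size = 1, prob = 0.4)

# Parameter: the proportion in the whole population (normally unknown to us)
p_param <- mean(votes)

# Estimate: the same summary, but computed only on the people we surveyed
surveyed <- sample.int(N, 1000)     # a simple random sample of 1000 units
p_hat    <- mean(votes[surveyed])
```

Here `p_hat` will be close to, but not exactly equal to, `p_param`; the gap is the sampling error that the rest of the lecture is concerned with.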

Preliminary notation

We will progressively encounter notation for our ideas, which allows precise and concise mathematical representations. One example is Sigma notation - simply a compact way of writing a sum (the \( \Sigma \) is just a big “S” for Sum).

\[ \sum_{i=1}^n x_i = x_1 + x_2 + ... + x_n \]

So we just increment \( i \), summing along the way - here adding up the \( x_i \) for \( i \) from 1 to \( n \). Or say this:

\[ \sum_{i=1}^4 i = 1 + 2 + 3 + 4 = 10 \]

Or the sample mean mentioned above

\[ \bar{x} = n^{-1}\sum_{i=1}^n x_i \]
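The same sums can be checked in R. This is only a sketch: the vector `x` holds five made-up values chosen for illustration.

```r
# Hypothetical data: five made-up measurements
x <- c(2, 4, 6, 8, 10)
n <- length(x)

# Sigma notation written out as a loop: add x_i for i = 1, ..., n
total <- 0
for (i in 1:n) {
  total <- total + x[i]
}

total_builtin <- sum(x)    # the built-in sum() does the same job
sum_1_to_4    <- sum(1:4)  # the second example: 1 + 2 + 3 + 4
x_bar         <- total / n # sample mean: n^{-1} times the sum of the x_i
```

In practice we would just call `sum(x)` and `mean(x)`; the loop is there only to mirror the notation.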

Precision versus accuracy

[drawing ensues]
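The bulls-eye idea can also be mimicked numerically. The sketch below assumes a made-up true value of 50 and two hypothetical estimation procedures whose errors are normally distributed; none of these numbers come from a real study.

```r
set.seed(1)                # arbitrary seed, only so the sketch is reproducible
mu     <- 50               # hypothetical true population value (the bulls-eye)
n_reps <- 1000             # number of repeated "studies"

# Procedure A: accurate but imprecise (centred on mu, widely scattered)
est_a <- rnorm(n_reps, mean = mu, sd = 10)

# Procedure B: precise but inaccurate (tightly clustered, centred 5 units too high)
est_b <- rnorm(n_reps, mean = mu + 5, sd = 1)

bias_a   <- mean(est_a) - mu   # near 0: little systematic error
bias_b   <- mean(est_b) - mu   # near 5: clear systematic error
spread_a <- sd(est_a)          # near 10: large random error
spread_b <- sd(est_b)          # near 1: small random error
```

Ideally we want both the bias and the spread to be small.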

Example - in the papers today

Vapers rise 'to more than three million' in Britain (BBC, 14/09/2018)

  • A 10% rise in e-cigarette use between 2017 and 2018 - from 2.9m to 3.2m
  • Non-smokers are picking this up, and smokers are converting.

BBC website 14/09/2018

Example - in the papers today

  • Where are these numbers coming from and what do they mean?
  • Clearly it isn't a comprehensive census
  • Sampling is involved

Example - in the papers today

Vapers rise 'to more than three million' in Britain (BBC, 14/09/2018)

https://www.bbc.co.uk/news/health-45513762

  • What is the population?
  • What is the sample?
  • What is the variable?
  • What is the parameter?
  • What is the estimate?

Common unwise data collection strategies

Basically data collection without a plan:

  • Anecdotal evidence
  • Self-selected or voluntary response samples

Pundits, Twitter, online surveys (& Fox News :))

For a particular question/estimate, we should design data collection to best achieve it. Principally we want:

  • High precision/low uncertainty
  • High accuracy/low bias

(as per our bulls-eye earlier)

Good sampling practice

All good samples have two features:

  • planned randomness;
  • the probability of any given sample being selected can be calculated.

Sampling is a field in itself; here we consider only 3 basic strategies:

  1. Simple Random Samples
  2. Systematic Random Samples
  3. Stratified Random Samples

Simple Random Sample (SRS)

In general, given \( N \) units in the population and \( n \) units in the sample, a SRS has the property that each sample of size \( n \) is chosen with the same probability.
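For a small case the equal-probability property can be counted out directly. The sketch below assumes a toy population of \( N = 5 \) units and samples of size \( n = 2 \), taken without replacement and ignoring order; these sizes are chosen only to keep the enumeration short.

```r
N <- 5                                  # hypothetical (tiny) population size
n <- 2                                  # sample size

n_samples <- choose(N, n)               # number of distinct samples of size n
p_each    <- 1 / n_samples              # under a SRS each one is equally likely

all_samples <- combn(N, n)              # every possible sample, one per column

# Chance that one particular unit (say unit 1) ends up in the sample
p_unit       <- n / N
p_unit_check <- mean(apply(all_samples, 2, function(s) 1 %in% s))
```

With these values there are 10 possible samples, each chosen with probability 0.1, and any given unit is included with probability 0.4.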

Simple Random Sample (SRS)

This is really easy - simple even. In R say - sampling things with equal probability:

# I can generate random integers easily enough

  sample(1:10, 5, replace = FALSE)
[1]  6  4  3  8 10
# So I can sample rows of a dataset or subject identifiers similarly

  IDs <- c("subject1", "subject2", "subject3", "subject4")

  sample(IDs, 2, replace = FALSE)
[1] "subject2" "subject3"
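Sampling rows of a dataset works the same way: sample the row numbers, then subset. A sketch, where the data frame `dat` and its `weight` values are invented purely for illustration:

```r
set.seed(42)   # arbitrary seed, only so the sketch is reproducible

# Hypothetical dataset: 8 subjects with made-up weights (kg)
dat <- data.frame(
  id     = paste0("subject", 1:8),
  weight = c(61.2, 75.0, 68.4, 80.1, 55.9, 72.3, 66.7, 90.5)
)

# Draw 3 distinct row numbers with equal probability, then keep those rows
rows <- sample(nrow(dat), 3, replace = FALSE)
srs  <- dat[rows, ]
```

Note that `sample(nrow(dat), 3)` samples from `1:nrow(dat)`, so each row has the same chance of selection.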

A note on random numbers

Computers are not random. However, they can be effectively so.

  • We use pseudo-Random Number Generators (RNGs). There are many types, but all are effectively unpredictable unless you know the starting point.

# unpredictable - 4 numbers from a uniform (0,1) distribution
  runif(4)
[1] 0.9706928 0.4710283 0.4226242 0.2470565
  runif(4)
[1] 0.4785024 0.7913874 0.4597954 0.5372739

A note on random numbers

But - with a seed, they are predictable.

# if we know the "starting point" (aka the seed), we can reproduce the numbers
  set.seed(2343)
  runif(4)
[1] 0.20467634 0.09047926 0.61101041 0.17877428
  set.seed(2343)
  runif(4)
[1] 0.20467634 0.09047926 0.61101041 0.17877428

Useful for confirming someone's calculations that have a stochastic component.

Systematic Samples

Suppose there are \( N \)=1000 individuals and we want to take a sample of size \( n \)=200. A systematic sample can be taken as follows:

  • Calculate \( k=N/n \); here this is 1000/200 = 5.
  • Randomly pick a whole number between 1 and \( k \) and call it \( x \); e.g., 3.
  • Sample the \( x \)th individual, then the \( (x+k) \)th, the \( (x+2k) \)th, and so on; e.g., 3, 8, 13, 18, ...

Systematic Samples

For example:

# get a random start - we'll aim to get about 6 from 30 subjects
# i.e. we're going to step in 5s
  x <- sample(1:5, 1)

# take regular steps
  seq(x, 30, by = 5)
[1]  3  8 13 18 23 28
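The same steps can be wrapped in a small function and applied to the \( N \)=1000, \( n \)=200 case. This is a sketch only: `systematic_sample` is a name made up here, and it assumes \( N \) is an exact multiple of \( n \).

```r
# Hypothetical helper: positions of a systematic sample of size n from N units.
# Assumes N is an exact multiple of n, so the step k is a whole number.
systematic_sample <- function(N, n) {
  k     <- N %/% n               # step size, e.g. 1000 / 200 = 5
  start <- sample.int(k, 1)      # random starting point between 1 and k
  seq(from = start, by = k, length.out = n)
}

set.seed(7)                      # arbitrary seed, only for reproducibility
idx <- systematic_sample(N = 1000, n = 200)
```

Whatever the random start, `idx` contains 200 positions spaced 5 apart, all between 1 and 1000.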

Systematic Samples

Why bother with this method over SRS?

  • Often easier.
  • Often cheaper.
  • Often roughly same quality as SRS.
  • Will do better than a SRS if there is a trend in the values.
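The last point can be checked with a small simulation. The sketch below invents a population of 1000 values with a linear trend plus noise (the slope and noise level are arbitrary assumptions), then compares how much the sample mean varies under each scheme.

```r
set.seed(123)                       # arbitrary seed, only for reproducibility
N <- 1000
n <- 200
k <- N / n                          # step size for the systematic sample

# Hypothetical population: values drift upwards along the list, plus noise
y <- 0.05 * (1:N) + rnorm(N)

n_reps <- 2000

# Sample means from repeated simple random samples
srs_means <- replicate(n_reps, mean(y[sample.int(N, n)]))

# Sample means from repeated systematic samples (random start, fixed step)
sys_means <- replicate(n_reps, {
  start <- sample.int(k, 1)
  mean(y[seq(from = start, by = k, length.out = n)])
})

sd_srs <- sd(srs_means)             # spread of the SRS estimates
sd_sys <- sd(sys_means)             # spread of the systematic estimates
```

With a trend like this, `sd_sys` comes out smaller than `sd_srs`, because every systematic sample is forced to spread evenly along the gradient. This holds only for this simulated population and is not a general guarantee.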

Example

  • We want to take a sample of customers visiting a bank.
  • Much easier practically to pick every 5th person, say, than refer to a SRS of bank customers.
  • In addition, a SRS could by chance pick a lot of people in the morning, who may be a particular subset of the customers, due to a “gradient” with respect to time.

Stratified Random Samples

Divide the population into different categories or “strata”, then take a separate SRS from each stratum.

For example, divide the population of the university into 4 strata: undergraduate students (5408), postgraduate students (1065), academic and research staff (649), and support staff (1109).

Why do this?

  • Can get greater precision when estimating a parameter than for a SRS with the same sample size \( n \).
  • Often more convenient.

Very common in environmental studies e.g. fisheries stock assessment - different areas are sampled with different intensity.
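As a sketch, here is one way to draw a stratified sample using the university stratum sizes above. The overall sample size of 400 and the choice of proportional allocation are assumptions made for illustration; other allocations sample some strata more intensively.

```r
# Stratum sizes from the university example
strata_sizes <- c(undergraduate = 5408, postgraduate = 1065,
                  academic_research = 649, support = 1109)
N <- sum(strata_sizes)              # total population size

# Assumption: overall sample of about 400, allocated in proportion to stratum size
n_total <- 400
n_h <- round(n_total * strata_sizes / N)   # rounding means the total may be off by a unit or two

set.seed(2018)                      # arbitrary seed, only for reproducibility

# A separate SRS within each stratum (units labelled 1, ..., N_h inside the stratum)
samples <- lapply(names(strata_sizes), function(s) {
  sample.int(strata_sizes[[s]], n_h[[s]])
})
names(samples) <- names(strata_sizes)
```

Each element of `samples` holds the selected unit labels for one stratum, so every stratum is guaranteed to be represented.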

Sampling biases

These usually arise because we did not sample the population we thought we were sampling, or because our means of measuring alters the result.

  • Just bad sampling design: part of the population is not represented as intended, e.g. not proportionately
  • Questionnaire/questioner biases: questions are leading/misleading, interviewer intimidates/influences subjects
  • Non-response/response biases: people lie, forget, get bored, or refuse to participate
  • Self-selection biases: unselected people choose to participate - what population is this?
  • Survivorship biases: only selecting survivors/winners, but inferring to a larger population
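Self-selection bias is easy to demonstrate by simulation. The sketch below invents a population in which 30% hold a particular opinion, and assumes (purely for illustration) that people holding it are four times as likely to volunteer a response.

```r
set.seed(99)                        # arbitrary seed, only for reproducibility
N <- 10000

# Hypothetical population: 1 = holds the opinion, 0 = does not
opinion <- rbinom(N, size = 1, prob = 0.3)
p_true  <- mean(opinion)            # the population proportion we want

# Planned randomness: a simple random sample of 500 people
p_srs <- mean(opinion[sample.int(N, 500)])

# Voluntary response: assume opinion-holders respond with probability 0.20,
# everyone else with probability 0.05
p_respond   <- ifelse(opinion == 1, 0.20, 0.05)
responded   <- rbinom(N, size = 1, prob = p_respond) == 1
p_volunteer <- mean(opinion[responded])
```

`p_srs` lands near `p_true`, while `p_volunteer` is far too high even though more people responded: the volunteers are a sample from a different population.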

Recap and look-forwards

We've covered:

  • Why we need to sample and, broadly, the implications of this
  • Some jargon and notation
  • Three types of basic sampling schemes
  • A range of common biases that arise from the data collection

Next:

  • Experiments versus observational studies