Introduction to Statistical Modeling

Part 1

What is statistics?

  • Statistics is a branch of mathematics that deals with collecting, analyzing, interpreting, presenting, and organizing data.

  • Tools and methods for making sense of data and drawing conclusions or making decisions based on that data.

  • Often referred to as the science of decision-making or the language of science

Two types of statistics:

Descriptive: summarising and visualising data

Inferential: drawing conclusions / making inference from data

Why do we need inferential statistics?

  • Interest in studying populations
  • Estimate some parameter of our population
    • Height of people aged between 21 and 35
    • Abundance of animal species
  • Predictions about population
    • Election outcome
    • Species’ extinction risk
  • Test hypotheses about our population
    • Is medicine A more effective than medicine B?

  • If we could, we would observe the entire population and simply draw conclusions directly
  • Virtually impossible!
  • Could you count all the individuals of a wildlife population?
  • You would need to survey so thoroughly that you are 100% certain of capturing every individual

So, what we can do, is take a sample from our population:

Calculate a statistic from that sample:

And use inferential statistics to draw conclusions about our population parameter!

Is our sample representative?

  • How can we be sure that our sample is a reasonable and reliable representation of our population?

  • Inherent uncertainty introduced by process of sampling

A working example

  • Question: How many bites do whales have on them, on average?

  • Population: whales

  • Parameter: average number of bites

  • Parameter of interest is unknowable! Impossible to count all the individuals in population

  • So, we take a random sample of 100 whales

  • Count the number of bites on each whale and calculate the sample mean

Results

Is this sample representative?

  • The sample mean is 2.43 bites and the sample variance is 6.32
  • What happens if we take another sample and our new sample mean is 2.81?
    • a difference of about 15% from original sample mean
  • How confident can we then be in our sample statistics?
  • If we were to take another sample, our mean would differ again!
  • What to do about this uncertainty in our sample statistics? How can we account for the variability in our statistic from sample to sample?
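The sample-to-sample variation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the real bite counts are not available here, so the `population` and its `weights` are invented, and `sample_stats` is a hypothetical helper.

```python
import random
import statistics

# Hypothetical population (an assumption, not the real data):
# bite counts for 10,000 whales, skewed towards few bites.
rng = random.Random(1)
weights = [30, 20, 14, 10, 8, 6, 4, 3, 2, 2, 1]  # relative frequency of 0..10 bites
population = rng.choices(range(11), weights=weights, k=10_000)


def sample_stats(n=100):
    """Draw n whales without replacement; return (sample mean, sample variance)."""
    sample = rng.sample(population, n)
    return statistics.fmean(sample), statistics.variance(sample)


# Two independent samples of 100 whales give two different sets of statistics.
mean_1, var_1 = sample_stats()
mean_2, var_2 = sample_stats()
print(f"sample 1: mean = {mean_1:.2f}, variance = {var_1:.2f}")
print(f"sample 2: mean = {mean_2:.2f}, variance = {var_2:.2f}")

# Relative difference between the two sample means quoted in the example.
slide_diff = (2.81 - 2.43) / 2.43
print(f"relative difference for 2.43 vs 2.81: {slide_diff:.1%}")
```

Running this with different seeds shows the same pattern each time: the two sample means are close, but never identical.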

Repeated sampling

Take another 250 samples of 100 whales and calculate the sample mean for each sample.

What can we do with this information?

  • The mean of all the sample means is 2.35
  • The standard deviation of all the sample means is 0.21
  • Most importantly, we can look at the full sampling distribution of sample means
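Repeated sampling is easy to mimic on a computer. The sketch below assumes an invented whale population (the `weights` are hypothetical), so its numbers will not match the 2.35 and 0.21 reported above; it only shows the mechanics of building a sampling distribution.

```python
import random
import statistics

# Hypothetical population of bite counts (an assumption for illustration).
rng = random.Random(2)
weights = [30, 20, 14, 10, 8, 6, 4, 3, 2, 2, 1]  # relative frequency of 0..10 bites
population = rng.choices(range(11), weights=weights, k=10_000)


def sample_mean(n=100):
    """Mean number of bites in one random sample of n whales."""
    return statistics.fmean(rng.sample(population, n))


# 250 repeated samples of 100 whales, one sample mean per sample.
sample_means = [sample_mean() for _ in range(250)]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
print(f"mean of the sample means: {mean_of_means:.2f}")
print(f"standard deviation of the sample means: {sd_of_means:.2f}")
```

The list `sample_means` is the empirical sampling distribution; a histogram of it is the usual way to look at its shape.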

The sampling distribution

Confidence intervals

So by taking many samples, calculating our statistic for each one, and then investigating the sampling distribution of that statistic, we can quantify the uncertainty introduced by the sampling process.
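One simple way to turn a sampling distribution into an interval is to take the range that holds the middle 95% of the sample means. The sketch below assumes an invented population (hypothetical `weights`), and the percentile approach is one choice among several, not necessarily the one used for the figures in this deck.

```python
import random
import statistics

# Hypothetical population of bite counts (an assumption for illustration).
rng = random.Random(3)
weights = [30, 20, 14, 10, 8, 6, 4, 3, 2, 2, 1]
population = rng.choices(range(11), weights=weights, k=10_000)

# Empirical sampling distribution: 250 sample means, each from 100 whales.
sample_means = [statistics.fmean(rng.sample(population, 100)) for _ in range(250)]

# quantiles(..., n=40) returns 39 cut points at 2.5%, 5%, ..., 97.5%.
cuts = statistics.quantiles(sample_means, n=40)
lower, upper = cuts[0], cuts[-1]
print(f"middle 95% of sample means: {lower:.2f} to {upper:.2f}")
```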

Summary

  • Sampling introduces uncertainty in our statistics

  • We want to associate uncertainty (“confidence”) intervals with all our sample statistics

  • We need the knowledge of the full sampling distribution and this requires us to take many, many samples

  • This is a problem!

  • What to do?

Part 2

The Amazing Central Limit Theorem

  • We can approximate the full sampling distribution from only a single sample!
  • The theorem states that as the sample size (n) increases, the sampling distribution of the sample mean approaches a normal distribution (the well-known bell curve)
  • As a rule of thumb, the approximation is usually good for sample sizes \(\geq\) 30, although strongly skewed populations may need more
  • AND this holds regardless of the shape of the population distribution (provided its variance is finite).
  • Crazy right?
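A quick simulation makes the theorem less mysterious. The sketch below draws sample means from a deliberately skewed, invented distribution of bite counts (the `weights` are hypothetical) and measures how lopsided the distribution of sample means is for several sample sizes. Skewness is used here as a rough stand-in for "how far from a bell curve"; it is not a formal test of normality.

```python
import random
import statistics

rng = random.Random(4)
counts = range(11)                                # 0..10 bites
weights = [30, 20, 14, 10, 8, 6, 4, 3, 2, 2, 1]  # hypothetical, right-skewed


def skewness(values):
    """Third standardised moment: 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - m) / sd) ** 3 for v in values])


def sample_means(n, reps=2000):
    """Means of `reps` independent samples, each of size n."""
    return [statistics.fmean(rng.choices(counts, weights=weights, k=n))
            for _ in range(reps)]


# Skewness of the sampling distribution for increasing sample sizes.
skew_by_n = {n: skewness(sample_means(n)) for n in (1, 5, 30, 100)}
for n, skew in skew_by_n.items():
    print(f"n = {n:3d}: skewness of sample means = {skew:.2f}")
```

With n = 1 the "sample means" are just individual whales, so the skewness is that of the population; as n grows it shrinks towards zero, which is what a normal distribution would give.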

Using our example

More cool things

Sample means are normally distributed, but with what mean and variance?

  • With some math, the CLT also tells us that, approximately:

  • \[\bar{X} \sim N(\mu, \frac{\sigma^2}{n})\] where \(\mu\) is our population mean and \(\sigma^2\) is our population variance!

  • Experimental proof
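The relationship between the variance of the sample mean and \(\sigma^2 / n\) can be checked by simulation. In this sketch the population is an invented distribution of bite counts whose \(\mu\) and \(\sigma^2\) are computed exactly from the hypothetical `weights`, so the simulated sample means can be compared with the values the formula predicts. It is a numerical check, not a proof.

```python
import random
import statistics

rng = random.Random(5)
counts = range(11)                                # 0..10 bites
weights = [30, 20, 14, 10, 8, 6, 4, 3, 2, 2, 1]  # hypothetical population shape

# Exact population mean and variance implied by the weights.
total = sum(weights)
mu = sum(c * w for c, w in zip(counts, weights)) / total
sigma2 = sum(w * (c - mu) ** 2 for c, w in zip(counts, weights)) / total

# Simulate many samples of size n and record each sample mean.
n = 100
means = [statistics.fmean(rng.choices(counts, weights=weights, k=n))
         for _ in range(2000)]

observed_mean = statistics.fmean(means)
observed_var = statistics.variance(means)
predicted_var = sigma2 / n

print(f"population mean {mu:.3f} vs mean of sample means {observed_mean:.3f}")
print(f"sigma^2 / n {predicted_var:.4f} vs variance of sample means {observed_var:.4f}")
```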

The standard error of the mean

  • The standard deviation of our sample mean, the standard error (\(SE = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}\)), is a super important statistic

  • Its formula tells us that as sample size (n) increases the SE decreases!

  • and therefore, our distribution becomes narrower and “closer” to the population mean

  • We also use the SE to calculate the range within which a particular percentage of sample means will occur!
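The effect of sample size on the standard error is easy to see with numbers. The sketch below plugs the sample variance from the whale example (6.32) into the formula as an estimate of \(\sigma^2\); the helper name `standard_error` and the alternative sample sizes of 25 and 400 are illustrative assumptions.

```python
import math


def standard_error(variance, n):
    """Standard error of the mean: sqrt(variance / n)."""
    return math.sqrt(variance / n)


s2 = 6.32  # sample variance from the whale example, used as an estimate of sigma^2
for n in (25, 100, 400):
    print(f"n = {n:3d}: SE = {standard_error(s2, n):.4f}")
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.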

More cool things

  • So the mean of our sample means is the same as the mean of the population from which we have sampled

  • and for large sample sizes the CLT tells us this distribution is approximately normal, so the vast majority of sample means will fall within two or three standard errors of \(\mu\) (about 95% within two, about 99.7% within three)

  • The sample mean is therefore likely to be “close” to \(\mu\) and simply put, it is a good statistic to use to estimate the population mean.

  • The sample mean is an unbiased estimator of the population mean: \(E[\bar{X}] = \mu\)

  • and the sample variance (with the \(n-1\) divisor) is an unbiased estimator of the population variance: \(E[S^2] = \sigma^2\)

Back to our example

The original sample had a mean of 2.43 and a variance of 6.32, based on 100 whales.

  • Using just this sample and the CLT, the distribution of the average number of bites is:

  • \[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \approx N\left(\bar{x}, \frac{s^2}{n}\right) = N\left(2.43, \frac{6.32}{100}\right)\]

Back to our example

Let’s plot this against the sampling distribution we got from our 250 repeated samples.

  • Construct a range around our population mean within which 95% of the sample means would be expected to occur

  • Use this range to decide whether the mean of a new sample is significantly different (or not) to the population mean
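The range described above can be computed directly from the single original sample. This sketch uses `statistics.NormalDist` from the Python standard library and assumes, as the example does, that the sample mean and sample variance stand in for the unknown \(\mu\) and \(\sigma^2\); the helper `within_range` is a hypothetical name.

```python
import math
from statistics import NormalDist

x_bar, s2, n = 2.43, 6.32, 100  # values from the original sample of whales

# CLT-based sampling distribution of the mean, estimated from one sample.
sampling_dist = NormalDist(mu=x_bar, sigma=math.sqrt(s2 / n))

# Range expected to hold the central 95% of sample means.
lower = sampling_dist.inv_cdf(0.025)
upper = sampling_dist.inv_cdf(0.975)
print(f"95% of sample means expected between {lower:.2f} and {upper:.2f}")


def within_range(new_mean):
    """True if a new sample mean falls inside the central 95% range."""
    return lower <= new_mean <= upper


print("new sample mean 2.81 inside the range?", within_range(2.81))
```

Under these assumptions the second sample mean from earlier (2.81) sits inside the range, so it would not be flagged as unusual.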

Summary

  • The full sampling distribution is all we need to do statistical inference
  • The Central Limit Theorem provides this distribution without needing to take repeated samples
  • We can now move on to inference - hypothesis tests, parameter estimation, confidence intervals

Part 3

Hypothesis tests

  • Start with a null hypothesis: \(H_0: \mu = \mu_0\), with \(\mu_0 = 3\)

  • We know that \(\bar{X} \sim N(\mu,\frac{\sigma^2}{n})\)

  • Assuming \(H_0\) is true then we would expect \(\bar{X} \sim N(3,\frac{\sigma^2}{n})\)

  • \(\bar{X}-3\) measures “sampling error”

  • If \(\bar{X}-3\) is “big” then \(H_0\) is unlikely to hold \(\Rightarrow\) evidence against \(H_0\)

  • How big is big enough?

Using the sample mean

  • We got \(\bar{x} = 2.43\)

  • Assuming \(H_0\) is true, our sampling error (the difference between sample mean and hypothesized mean) is: \[ \bar{x} - \mu_0 = 2.43 - 3 = -0.57\]

  • What is the probability of getting 2.43 or more extreme if our null hypothesis were true? I.e. from the distribution \(N(3,\frac{s^2}{n})\)

  • So an even bigger sampling error - in this case, the probability of getting 2.43 or less under the null hypothesis

  • We can calculate this and we get 0.011 - a small probability!

  • Tells us that getting -0.57 or more extreme is likely not just due to variation introduced by sampling

  • This is our p-value. It is a statement about sampling variability, not about whether \(H_0\) is true!

  • Provides evidence against \(H_0\)
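The p-value calculation above can be reproduced with the normal distribution in Python's standard library. The sketch assumes the null distribution \(N(3, \frac{s^2}{n})\) with the sample variance 6.32 and n = 100. With these inputs it gives a lower-tail probability of roughly 0.012, close to the 0.011 quoted above; the exact figure depends on rounding and on how the original value was computed.

```python
import math
from statistics import NormalDist

x_bar, s2, n = 2.43, 6.32, 100  # observed sample mean, sample variance, sample size
mu_0 = 3                        # hypothesised population mean under H0

se = math.sqrt(s2 / n)          # standard error of the mean
z = (x_bar - mu_0) / se         # sampling error measured in standard errors

# Probability of a sample mean of 2.43 or less if H0 were true.
null_dist = NormalDist(mu=mu_0, sigma=se)
p_lower = null_dist.cdf(x_bar)

# Two-sided version: a sampling error this large in either direction.
p_two_sided = 2 * p_lower

print(f"z = {z:.3f}")
print(f"one-sided p-value = {p_lower:.4f}")
print(f"two-sided p-value = {p_two_sided:.4f}")
```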

Decision rules

  • Let’s generalise things a bit more

  • Constructing rejection regions

Use the properties of normal distribution to construct a rejection region:

Conclusion

  • If sample mean falls within the rejection region, there are two possibilities:
  1. Purely just by chance (constructed region so that 5% of the time this would happen)

  2. Or population mean does not have the hypothesized value

  • We choose the second option, and use this result as the basis for rejecting the null hypothesis
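A rejection region for the whale example can be sketched as follows. The code assumes a two-sided test at the 5% level and the null distribution \(N(3, \frac{s^2}{n})\) with s² = 6.32 and n = 100; `in_rejection_region` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

mu_0, s2, n = 3, 6.32, 100  # hypothesised mean, sample variance, sample size
alpha = 0.05                # 5% of sample means fall in the region when H0 is true

null_dist = NormalDist(mu=mu_0, sigma=math.sqrt(s2 / n))

# Sample means below `lower` or above `upper` fall in the rejection region.
lower = null_dist.inv_cdf(alpha / 2)
upper = null_dist.inv_cdf(1 - alpha / 2)
print(f"reject H0 if the sample mean is below {lower:.2f} or above {upper:.2f}")


def in_rejection_region(sample_mean):
    """True if the sample mean is in either tail of the null distribution."""
    return sample_mean < lower or sample_mean > upper


print("sample mean 2.43 in rejection region?", in_rejection_region(2.43))
```

Under these assumptions the observed mean of 2.43 lies just below the lower cut-off, so the null hypothesis would be rejected at the 5% level.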

Confidence intervals

  • Using similar logic, we can construct confidence intervals for the population mean, centred on our sample mean

  • NB: this interval either contains the true population mean or it does not

  • In other words, the probability that it contains the population mean is either 1 or 0

  • Not very intuitive!

Confidence intervals

  • We construct a 95% confidence interval

  • Tells us that 95% of such confidence intervals will contain the population mean

  • So if we took many samples and calculated the confidence interval for each, 95% of those intervals would contain \(\mu\)

  • So there is a basis for subjective optimism or confidence that this one will be one of those! \[\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}\]
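Both ideas above, the interval itself and the "95% of such intervals" interpretation, can be sketched in Python. The first part applies the formula to the whale sample, using the sample standard deviation in place of \(\sigma\). The second part is a simulation on an invented population (the `weights` are hypothetical) that counts how often intervals built this way contain the true mean.

```python
import math
import random
import statistics


def confidence_interval(sample_mean, sample_var, n, z=1.96):
    """Approximate 95% confidence interval: mean +/- z * sqrt(variance / n)."""
    half_width = z * math.sqrt(sample_var / n)
    return sample_mean - half_width, sample_mean + half_width


# Interval for the original whale sample (mean 2.43, variance 6.32, n = 100).
ci = confidence_interval(2.43, 6.32, 100)
print(f"95% CI for the whale example: {ci[0]:.2f} to {ci[1]:.2f}")

# Coverage check on a hypothetical population with a known mean.
rng = random.Random(10)
counts = range(11)
weights = [30, 20, 14, 10, 8, 6, 4, 3, 2, 2, 1]
mu = sum(c * w for c, w in zip(counts, weights)) / sum(weights)


def covers(n=100):
    """Draw one sample, build its interval, and report whether it contains mu."""
    sample = rng.choices(counts, weights=weights, k=n)
    low, high = confidence_interval(statistics.fmean(sample),
                                    statistics.variance(sample), n)
    return low <= mu <= high


reps = 2000
coverage = sum(covers() for _ in range(reps)) / reps
print(f"share of intervals containing mu: {coverage:.3f}")
```

The share printed by the simulation should land near 0.95. Any single interval either contains \(\mu\) or does not; the 95% describes the procedure, not one interval.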

Summary

  • Seen how to generate hypothesis test results and confidence intervals for the population mean \(\mu\)

  • Based, via the central limit theorem, on the normal distribution

  • Other quantities of interest have other distributions

  • Hence other types of statistical tests \(\Rightarrow\) t-tests, F-tests, etc.