Introduction to Statistical Modeling

Part 1

What is statistics?

Statistics is a branch of mathematics that deals with collecting, analyzing, interpreting, presenting, and organizing data.
It provides tools and methods for making sense of data and for drawing conclusions or making decisions based on that data.
It is often referred to as the science of decision making or the language of science.

Two types of statistics:

Descriptive: summarising and visualising data

Inferential: drawing conclusions / making inference from data

Why do we need inferential statistics?

Interest in studying populations
Estimate some parameter of our population
- Height of people aged between 21 and 35
- Abundance of animal species
Predictions about population
- Election outcome
- Species’ extinction risk
Test hypotheses about our population
- Is medicine A more effective than medicine B?

If we could, we would observe the entire population and simply draw conclusions directly
Virtually impossible!
Could you count all the individuals of a wildlife population?
You would need to survey so thoroughly that you could be 100% certain of capturing all individuals

So, what we can do, is take a sample from our population:

Calculate a statistic from that sample:

And use inferential statistics to draw conclusions about our population parameter!

Is our sample representative?

How can we be sure that our sample is a reasonable and reliable representation of our population?
Inherent uncertainty introduced by process of sampling

A working example

Question: How many bites do whales have on them, on average?
Population: whales
Parameter: average number of bites
Parameter of interest is unknowable! Impossible to count all the individuals in population
So, we take a random sample of 100 whales
Count the number of bites on each whale and calculate the sample mean
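The sampling step can be sketched in a few lines of Python. Everything about the population here is an assumption made purely for illustration: a simulated list of 10,000 bite counts drawn from a skewed distribution (the rate 0.4 and the seed are arbitrary choices, not values from the whale study). In practice the full population list is exactly what we do not have.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 whales with skewed bite counts.
# In reality we could never list the whole population like this.
population = [int(random.expovariate(0.4)) for _ in range(10_000)]

# Take a random sample of 100 whales (without replacement).
sample = random.sample(population, 100)

# Sample statistics used to estimate the unknown population parameter.
sample_mean = statistics.mean(sample)
sample_variance = statistics.variance(sample)  # uses the n - 1 denominator

print(f"sample mean: {sample_mean:.2f}, sample variance: {sample_variance:.2f}")
```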

Results

Is this sample representative?

The sample mean is 2.43 bites and the sample variance is 6.32
What happens if we take another sample and our new sample mean is 2.81?
- a difference of about 16% from the original sample mean
How confident can we then be in our sample statistics?
If we were to take another sample, our mean would differ again!
What to do about this uncertainty in our sample statistics? How can we account for the variability in our statistic from sample to sample?
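To make the sample-to-sample variability concrete, the sketch below first computes the relative difference between the two quoted means (2.43 and 2.81), then draws five samples from a simulated population. The population, its rate parameter of 0.4, and the seed are hypothetical stand-ins chosen for illustration only.

```python
import random
import statistics

# Relative difference between the two sample means quoted above.
first_mean, second_mean = 2.43, 2.81
relative_difference = (second_mean - first_mean) / first_mean
print(f"relative difference: {relative_difference:.1%}")

# Hypothetical skewed population of bite counts (an assumption for illustration).
random.seed(3)
population = [int(random.expovariate(0.4)) for _ in range(10_000)]

# Five independent samples of 100 whales give five different sample means.
five_means = [statistics.fmean(random.sample(population, 100)) for _ in range(5)]
print([round(m, 2) for m in five_means])
```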

Repeated sampling

Take another 250 samples of 100 whales and, for each sample, calculate the sample mean.

What can we do with this information?

The mean of all the sample means is 2.35
The standard deviation of all the sample means is 0.21
Most importantly, we can look at the full sampling distribution of sample means
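A minimal sketch of the repeated-sampling experiment, assuming a simulated population (hypothetical, with arbitrary rate and seed), so its numbers will not match the 2.35 and 0.21 reported above. It draws 250 samples of 100, summarises the sample means, and prints a crude text histogram of the sampling distribution.

```python
import random
import statistics
from collections import Counter

random.seed(2)
# Hypothetical population of bite counts (an assumption for illustration).
population = [int(random.expovariate(0.4)) for _ in range(10_000)]

# 250 repeated samples of 100 whales, keeping only each sample's mean.
sample_means = [statistics.fmean(random.sample(population, 100)) for _ in range(250)]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
print(f"mean of sample means: {mean_of_means:.2f}")
print(f"sd of sample means:   {sd_of_means:.2f}")

# Crude text histogram of the sampling distribution (bins 0.2 wide).
bins = Counter(round(m * 5) / 5 for m in sample_means)
for centre in sorted(bins):
    print(f"{centre:4.1f} | {'#' * bins[centre]}")
```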

The sampling distribution

Confidence intervals

So by taking many samples, calculating our statistic for each one, and then examining its sampling distribution, we can quantify the uncertainty introduced by the sampling process.
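One simple way to turn a sampling distribution into an interval is to read off its 2.5th and 97.5th percentiles. The sketch below does this for 250 simulated sample means. The population is again a hypothetical stand-in, and the percentile approach is only one possible way to summarise the distribution.

```python
import random
import statistics

random.seed(4)
# Hypothetical population of bite counts (an assumption for illustration).
population = [int(random.expovariate(0.4)) for _ in range(10_000)]
sample_means = [statistics.fmean(random.sample(population, 100)) for _ in range(250)]

# quantiles(..., n=40) returns 39 cut points at 2.5% steps,
# so the first and last are the 2.5th and 97.5th percentiles.
cut_points = statistics.quantiles(sample_means, n=40)
lower, upper = cut_points[0], cut_points[-1]
print(f"middle 95% of sample means: {lower:.2f} to {upper:.2f}")
```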

Summary

Sampling introduces uncertainty in our statistics
We want to associate uncertainty (“confidence”) intervals with all our sample statistics
We need the knowledge of the full sampling distribution and this requires us to take many, many samples
This is a problem!
What to do?

Part 2

The Amazing Central Limit Theorem

We can get the full sampling distribution from only a single sample!
The theorem states that as the sample size (n) increases, the sampling distribution of the sample mean approaches a normal distribution (the well-known bell curve)
As a rule of thumb, the approximation is usually good for sample sizes \(\geq\) 30, although strongly skewed populations can need more
AND this holds regardless of the shape of the population distribution, provided its variance is finite.
Crazy right?
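A rough way to see the theorem at work is to track the skewness of the sample means as n grows: a normal distribution has skewness 0, so the value should shrink towards 0 even when the population itself is strongly skewed. The sketch below assumes a simulated, hypothetical population. The `skewness` helper and the sample sizes 2, 30 and 100 are illustrative choices.

```python
import random
import statistics

random.seed(5)
# Hypothetical, strongly right-skewed population (an assumption for illustration).
population = [int(random.expovariate(0.4)) for _ in range(10_000)]

def skewness(values):
    """Sample skewness: 0 for a symmetric distribution such as the normal."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - m) / sd) ** 3 for v in values])

print(f"population skewness: {skewness(population):.2f}")

skew_by_n = {}
for n in (2, 30, 100):
    # 2000 sample means for each sample size, drawn with replacement.
    means = [statistics.fmean(random.choices(population, k=n)) for _ in range(2000)]
    skew_by_n[n] = skewness(means)
    print(f"n = {n:3d}: skewness of sample means = {skew_by_n[n]:.2f}")
```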

Using our example

More cool things

Sample means are normally distributed, but with what mean and variance?

With some math, the CLT also tells us that, approximately:
\[\bar{X} \sim N(\mu, \frac{\sigma^2}{n})\] where \(\mu\) is our population mean and \(\sigma^2\) is our population variance!

Experimental proof
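A simulation along these lines can check the claim numerically: the mean of many sample means should sit close to \(\mu\) and their variance close to \(\frac{\sigma^2}{n}\). The population below is a hypothetical stand-in, and the 2000 replicates are an arbitrary choice. Samples are drawn with replacement so that the draws are independent.

```python
import random
import statistics

random.seed(6)
# Hypothetical population (an assumption for illustration).
population = [int(random.expovariate(0.4)) for _ in range(10_000)]
mu = statistics.fmean(population)          # population mean
sigma2 = statistics.pvariance(population)  # population variance
n = 100

# Sampling with replacement keeps the draws independent, as the formula assumes.
sample_means = [statistics.fmean(random.choices(population, k=n)) for _ in range(2000)]

observed_mean = statistics.fmean(sample_means)
observed_variance = statistics.variance(sample_means)
print(f"mean of sample means:     {observed_mean:.3f}  (mu = {mu:.3f})")
print(f"variance of sample means: {observed_variance:.4f}  (sigma^2 / n = {sigma2 / n:.4f})")
```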

The standard error of the mean

The standard deviation of our sample mean, called the standard error (SE), is \(\frac{\sigma}{\sqrt{n}}\) (the square root of the variance \(\frac{\sigma^2}{n}\)), and it is a super important statistic
Its formula tells us that as sample size (n) increases the SE decreases!
and therefore, our sampling distribution becomes narrower and more concentrated around the population mean
We also use the SE to calculate the range within which a particular percentage of sample means will occur!
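As a small worked sketch, the standard error can be tabulated for a few sample sizes. It uses the sample variance 6.32 from the whale example as a stand-in for \(\sigma^2\), which is an assumption since the true population variance is unknown. The `standard_error` helper and the chosen sample sizes are illustrative.

```python
import math

def standard_error(variance, n):
    """Standard error of the mean: sqrt(variance / n) = sd / sqrt(n)."""
    return math.sqrt(variance / n)

s2 = 6.32  # sample variance from the whale example, standing in for sigma^2
for n in (25, 100, 400, 1600):
    se = standard_error(s2, n)
    # About 95% of sample means fall within 1.96 standard errors of mu.
    print(f"n = {n:4d}: SE = {se:.3f}, 95% of sample means within +/- {1.96 * se:.3f} of mu")
```

Each fourfold increase in n halves the SE, because n enters through its square root.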

More cool things

So the mean of our sample means is the same as the mean of the population from which we have sampled
and, with large sample sizes, the CLT tells us that this distribution is approximately normal, so the vast majority of sample means will fall within two standard errors of \(\mu\) (about 95%) or three (about 99.7%)
The sample mean is therefore likely to be “close” to \(\mu\) and simply put, it is a good statistic to use to estimate the population mean.
The sample mean is an unbiased estimator of the population mean: \(E[\bar{X}] = \mu\)
and the sample variance (with the \(n-1\) denominator) is an unbiased estimator of the population variance: \(E[S^2] = \sigma^2\)
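A short sketch of where the mean and variance of \(\bar{X}\) come from, assuming the observations \(X_1, \dots, X_n\) are independent draws from a population with mean \(\mu\) and variance \(\sigma^2\) (the independence assumption and the \(X_i\) notation are introduced here for illustration):
\[ E[\bar{X}] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n\mu = \mu \]
\[ \mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \]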

Back to our example

The original sample had a mean of 2.43 and a variance of 6.32, based on 100 whales.

Using just this sample and the CLT, and plugging in \(\bar{x}\) for \(\mu\) and \(s^2\) for \(\sigma^2\), the approximate distribution of the average number of bites is:
\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \approx N\left(\bar{x}, \frac{s^2}{n}\right) = N\left(2.43, \frac{6.32}{100}\right)\]

Back to our example

Let’s plot this against the sampling distribution we got from our 250 repeated samples.

Construct a range around our population mean within which 95% of the sample means would be expected to occur
Use this range to decide whether the mean of a new sample is significantly different (or not) from the population mean
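The range can be sketched with Python's `statistics.NormalDist`, using the plug-in values from the original sample (mean 2.43, variance 6.32, n = 100) in place of the unknown \(\mu\) and \(\sigma^2\). As a check, the second sample mean mentioned earlier (2.81) is compared against the range. Variable names are illustrative.

```python
import math
from statistics import NormalDist

x_bar, s2, n = 2.43, 6.32, 100  # plug-in estimates from the original sample
se = math.sqrt(s2 / n)          # standard error of the mean

# Approximate sampling distribution of the sample mean via the CLT.
sampling_dist = NormalDist(mu=x_bar, sigma=se)

# Central range expected to hold 95% of sample means.
lower = sampling_dist.inv_cdf(0.025)
upper = sampling_dist.inv_cdf(0.975)
print(f"SE = {se:.3f}, 95% range: {lower:.2f} to {upper:.2f}")

# Does a new sample mean fall inside the range?
new_mean = 2.81
inside = lower <= new_mean <= upper
print(f"new sample mean {new_mean} inside range: {inside}")
```

A new mean inside the range is consistent with ordinary sampling variation. One outside it would count as significantly different at the 5% level.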

Summary

The full sampling distribution is all we need to do statistical inference
The Central Limit Theorem provides this distribution without needing to take repeated samples
We can now move on to inference - hypothesis tests, parameter estimation, confidence intervals

Part 3

Hypothesis tests

Start with a null hypothesis: \(H_0: \mu = \mu_0\), with hypothesized value \(\mu_0 = 3\)
We know that \(\bar{X} \sim N(\mu,\frac{\sigma^2}{n})\)
Assuming \(H_0\) is true then we would expect \(\bar{X} \sim N(3,\frac{\sigma^2}{n})\)
\(\bar{X}-3\) measures “sampling error”
If \(\bar{X}-3\) is “big” then \(H_0\) is unlikely to hold \(\Rightarrow\) evidence against \(H_0\)
How big is big enough?

Using the sample mean

We got \(\bar{x} = 2.43\)
Assuming \(H_0\) is true, our sampling error (the difference between sample mean and hypothesized mean) is: \[ \bar{x} - \mu_0 = 2.43 - 3 = -0.57\]
What is the probability of getting a sample mean of 2.43, or one more extreme, if our null hypothesis were true? I.e. under the distribution \(N(3,\frac{s^2}{n})\)
“More extreme” means an even bigger sampling error: in this case, a sample mean of 2.43 or less under the null hypothesis
We can calculate this and we get 0.011 - a tiny probability!
Tells us that a sampling error of -0.57 or more extreme is unlikely to be due to sampling variation alone
This is our p-value. It is a statement about sampling variability, not about the truth of \(H_0\)!
Provides evidence against \(H_0\)
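The tail probability can be sketched with `statistics.NormalDist`, taking the null distribution to be \(N(3, \frac{s^2}{n})\) with the rounded sample values quoted above. With those rounded inputs the result should land close to the value reported above, and any small difference would come from rounding. The calculation is one-sided (lower tail only), matching the "2.43 or less" wording.

```python
import math
from statistics import NormalDist

x_bar, s2, n = 2.43, 6.32, 100
mu_0 = 3  # hypothesized population mean under H0

se = math.sqrt(s2 / n)
null_dist = NormalDist(mu=mu_0, sigma=se)

# Standardised sampling error and one-sided p-value P(sample mean <= 2.43 | H0).
z = (x_bar - mu_0) / se
p_value = null_dist.cdf(x_bar)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.3f}")
```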

Decision rules

Let’s generalise things a bit more

Constructing rejection regions

Use the properties of the normal distribution to construct a rejection region:
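One way to sketch this is a two-sided rejection region at the 5% level for \(H_0: \mu = 3\), built from the whale sample's variance and size. The two-sided choice and the name `alpha` are assumptions made for illustration, and a one-sided region would use a single cut-off instead.

```python
import math
from statistics import NormalDist

x_bar, s2, n = 2.43, 6.32, 100
mu_0 = 3       # hypothesized mean under H0
alpha = 0.05   # significance level (assumed two-sided)

se = math.sqrt(s2 / n)
null_dist = NormalDist(mu=mu_0, sigma=se)

# Critical values: 2.5% of the null distribution lies beyond each one.
lower_critical = null_dist.inv_cdf(alpha / 2)
upper_critical = null_dist.inv_cdf(1 - alpha / 2)
print(f"reject H0 if sample mean < {lower_critical:.2f} or > {upper_critical:.2f}")

reject = x_bar < lower_critical or x_bar > upper_critical
print(f"sample mean {x_bar} in rejection region: {reject}")
```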

Conclusion

If sample mean falls within the rejection region, there are two possibilities:

Purely by chance (we constructed the region so that this happens 5% of the time when \(H_0\) is true)
Or population mean does not have the hypothesized value

We choose the second option, and use this result as the basis for rejecting the null hypothesis

Confidence intervals

Using similar logic, we can construct confidence intervals for the population mean, centred on our sample mean
Note that any given interval either contains the true population mean or does not
In other words, the probability that it contains the population mean is either 1 or 0
Not very intuitive!

Confidence Intervals

We construct a 95% confidence interval
Tells us that 95% of such confidence intervals will contain the population mean
So if we took many samples and calculated the confidence interval for each, 95% of those intervals would contain \(\mu\)
So there is a basis for subjective optimism, or confidence, that this one will be one of those! \[\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}\]
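The repeated-sampling interpretation can be checked by simulation: build many intervals and count how often they contain \(\mu\). The sketch below assumes a simulated, hypothetical population whose mean is known, and it replaces the unknown \(\sigma\) with each sample's standard deviation, which is the usual practical substitution. The `confidence_interval` helper and the 1000 trials are illustrative choices.

```python
import math
import random
import statistics

random.seed(7)
# Hypothetical population whose true mean we can compute (for illustration only).
population = [int(random.expovariate(0.4)) for _ in range(10_000)]
mu = statistics.fmean(population)

def confidence_interval(sample, z=1.96):
    """Approximate 95% interval: x_bar +/- z * s / sqrt(n), with s replacing sigma."""
    x_bar = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return x_bar - z * se, x_bar + z * se

trials = 1000
hits = 0
for _ in range(trials):
    low, high = confidence_interval(random.choices(population, k=100))
    hits += low <= mu <= high  # True counts as 1

coverage = hits / trials
print(f"fraction of intervals containing mu: {coverage:.3f}")
```

Each individual interval either contains \(\mu\) or does not. The 95% describes the long-run fraction across intervals.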

Summary

We have seen how to generate hypothesis test results and confidence intervals for the population mean \(\mu\)
These are based, via the central limit theorem, on the normal distribution
Other quantities of interest have other distributions
Hence other types of statistical tests \(\Rightarrow\) t-tests, F-tests, etc.