Statistics is a branch of mathematics that deals with collecting, analyzing, interpreting, presenting, and organizing data.
Tools and methods for making sense of data and drawing conclusions or making decisions based on that data.
Often referred to as decision making science or the language of science
Descriptive: summarising and visualising data
Inferential: drawing conclusions / making inference from data
So what we can do is take a sample from our population:
Calculate a statistic from that sample:
And use inferential statistics to draw conclusions about our population parameter!
How can we be sure that our sample is a reasonable and reliable representation of our population?
Inherent uncertainty introduced by process of sampling
Question: How many bites do whales have on them, on average?
Population: whales
Parameter: average number of bites
Parameter of interest is unknowable! Impossible to count all the individuals in population
So, we take a random sample of 100 whales
Count the number of bites on each whale and calculate the sample mean
Take another 250 samples of 100 whales and calculate the sample mean for each of them.
So by taking many samples, calculating our statistics and then investigating its sampling distribution we can quantify the uncertainty introduced by the sampling process.
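The repeated-sampling idea above can be sketched in a few lines of Python. This is only an illustration: the population below is made up (a skewed set of bite counts for 10,000 hypothetical whales), and the names `population`, `sample_mean` and `sample_means` are assumptions introduced for this example, not part of the original analysis.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: bite counts for 10,000 whales (made-up, skewed data).
population = [int(rng.expovariate(0.4)) for _ in range(10_000)]
population_mean = statistics.fmean(population)


def sample_mean(n=100):
    """Mean number of bites in one random sample of n whales."""
    return statistics.fmean(rng.sample(population, n))


# 250 repeated samples of 100 whales, one sample mean per sample.
sample_means = [sample_mean() for _ in range(250)]

print(f"population mean:        {population_mean:.2f}")
print(f"mean of sample means:   {statistics.fmean(sample_means):.2f}")
print(f"spread of sample means: {statistics.stdev(sample_means):.2f}")
```

The spread of `sample_means` is the uncertainty introduced by sampling. In a real study we could not run this loop, because the whole population is never available.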
Sampling introduces uncertainty in our statistics
We want to associate uncertainty (“confidence”) intervals with all our sample statistics
We need the knowledge of the full sampling distribution and this requires us to take many, many samples
This is a problem!
What to do?
Sample means are normally distributed, but with what mean and variance?
With some math, the central limit theorem (CLT) also tells us that:
\[\bar{X} \sim N(\mu, \frac{\sigma^2}{n})\] where \(\mu\) is our population mean and \(\sigma^2\) is our population variance!
The standard deviation of our sample mean, \(\frac{\sigma}{\sqrt{n}}\), is called the standard error (SE) and is a super important statistic
Its formula tells us that as sample size (n) increases the SE decreases!
and therefore, our distribution becomes narrower and “closer” to the population mean
We also use the SE to calculate the range within which a particular percentage of sample means will occur!
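A quick numerical check of how the standard error shrinks with sample size. The variance of 4.0 is an illustrative value chosen for round numbers, and `standard_error` is a helper name assumed for this sketch.

```python
import math


def standard_error(variance, n):
    """Standard error of the sample mean: sqrt(variance / n)."""
    return math.sqrt(variance / n)


variance = 4.0  # illustrative population variance (assumed value)

for n in (25, 100, 400):
    print(f"n = {n:>3}: SE = {standard_error(variance, n):.2f}")
```

Each fourfold increase in the sample size halves the SE, because the sample size enters through its square root.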
So the CLT tells us that, with large sample sizes, the mean of our sample means is the same as the mean of the population from which we have sampled
and this distribution is approximately normal, so the majority of sample means (about 95%) will fall within two standard errors of \(\mu\), and almost all within three
The sample mean is therefore likely to be “close” to \(\mu\) and simply put, it is a good statistic to use to estimate the population mean.
The sample mean is an unbiased estimator of the population mean, \(E[\bar{X}] = \mu\), so we use \(\bar{x}\) to estimate \(\mu\)
and so is our sample variance, \(E[S^2] = \sigma^2\), so we use \(s^2\) to estimate \(\sigma^2\)
The original sample had a mean of 2.43 and a variance of 6.32, based on 100 whales.
Using just this sample and the CLT, the distribution of the average number of bites is:
\[ \bar{X} \sim N(\mu, \frac{\sigma^2}{n}) \approx N(\bar{x}, \frac{s^2}{n}) = N(2.43, \frac{6.32}{100})\]
Let’s plot this against the sampling distribution we got from our 250 repeated samples.
Construct a range around our population mean within which 95% of the sample means would be expected to occur
Use this range to decide whether the mean of a new sample is significantly different (or not) to the population mean
Start with a null hypothesis: \(H_0: \mu = \mu_0\), with \(\mu_0 = 3\)
We know that \(\bar{X} \sim N(\mu,\frac{\sigma^2}{n})\)
Assuming \(H_0\) is true then we would expect \(\bar{X} \sim N(3,\frac{\sigma^2}{n})\)
\(\bar{X}-3\) measures “sampling error”
If \(\bar{X}-3\) is “big” then \(H_0\) is unlikely to hold \(\Rightarrow\) evidence against \(H_0\)
How big is big enough?
We got \(\bar{x} = 2.43\)
Assuming \(H_0\) is true, our sampling error (the difference between sample mean and hypothesized mean) is: \[ \bar{x} - \mu_0 = 2.43 - 3 = -0.57\]
What is the probability of getting 2.43 or something more extreme if our null hypothesis were true, i.e. from the distribution \(N(3,\frac{s^2}{n})\)?
More extreme means an even bigger sampling error: in this case, what is the probability of getting a sample mean of 2.43 or less under the null hypothesis?
We can calculate this and we get 0.011 - a tiny probability!
Tells us that getting -0.57 or more extreme is likely not just due to variation introduced by sampling
This is our p-value. It is a statement about sampling variability and not the truth about \(H_0\)!
Provides evidence against \(H_0\)
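The p-value calculation can be reproduced approximately with Python's standard library. This sketch uses the rounded summaries quoted above (mean 2.43, variance 6.32, n = 100) and a normal approximation, so the result is a little over 0.01 and may differ slightly in the third decimal place from the 0.011 quoted above. The variable names are assumptions made for this example.

```python
import math
from statistics import NormalDist

x_bar, s2, n = 2.43, 6.32, 100  # rounded summaries of the original whale sample
mu_0 = 3                        # hypothesized population mean under H0

se = math.sqrt(s2 / n)                     # estimated standard error
null_dist = NormalDist(mu=mu_0, sigma=se)  # sampling distribution if H0 were true

# One-sided p-value: probability of a sample mean of 2.43 or less under H0.
p_value = null_dist.cdf(x_bar)

print(f"SE = {se:.3f}, sampling error = {x_bar - mu_0:.2f}, p-value = {p_value:.3f}")
```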
Let’s generalise things a bit more
Constructing rejection regions
Use the properties of the normal distribution to construct a rejection region. If our sample mean falls in this region, then either:
This happened purely by chance (we constructed the region so that this would happen 5% of the time when \(H_0\) is true)
Or the population mean does not have the hypothesized value
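A minimal sketch of a rejection region for the whale example, assuming a two-sided test at the 5% level and the estimated standard error \(\sqrt{s^2/n}\). The two-sided choice and the variable names are assumptions made for illustration.

```python
import math
from statistics import NormalDist

x_bar, s2, n = 2.43, 6.32, 100  # summaries of the original whale sample
mu_0 = 3                        # hypothesized population mean under H0
alpha = 0.05                    # proportion of sample means rejected by chance alone

se = math.sqrt(s2 / n)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96

# Under H0, 95% of sample means are expected to fall between these limits.
lower = mu_0 - z_crit * se
upper = mu_0 + z_crit * se

in_rejection_region = x_bar < lower or x_bar > upper

print(f"expected range under H0: ({lower:.2f}, {upper:.2f})")
print(f"sample mean {x_bar} in rejection region: {in_rejection_region}")
```

Here the observed mean of 2.43 lies below the lower limit, so it falls in the rejection region.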
Using similar logic, we can construct confidence intervals for the population mean, centred on our sample mean
Important to note that any one interval either contains the true population mean or does not
In other words, the probability that it contains the population mean is either 1 or 0
Not very intuitive!
We construct a 95% confidence interval
Tells us that 95% of such confidence intervals will contain the population mean
So if we took many samples and calculated the confidence interval for each, 95% of those intervals would contain \(\mu\)
So there is a basis for subjective optimism or confidence that this one will be one of those! \[\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}\]
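The interval formula and its "95% of such intervals" interpretation can be checked with a small simulation. This is a sketch under stated assumptions: \(\sigma\) is replaced by the sample standard deviation \(s\), the population used for the coverage check is made up (skewed bite counts for 10,000 hypothetical whales), and `confidence_interval` is a helper name introduced here.

```python
import math
import random
import statistics
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)  # about 1.96


def confidence_interval(mean, variance, n):
    """Approximate 95% interval: mean +/- z * sqrt(variance / n)."""
    half_width = z * math.sqrt(variance / n)
    return mean - half_width, mean + half_width


# Interval for the whale sample, using s^2 = 6.32 in place of sigma^2.
ci_low, ci_high = confidence_interval(2.43, 6.32, 100)
print(f"95% CI for the whale example: ({ci_low:.2f}, {ci_high:.2f})")

# Coverage check on a hypothetical population with a known mean.
rng = random.Random(1)
population = [int(rng.expovariate(0.4)) for _ in range(10_000)]
mu = statistics.fmean(population)

n_intervals = 1_000
hits = 0
for _ in range(n_intervals):
    sample = rng.sample(population, 100)
    low, high = confidence_interval(
        statistics.fmean(sample), statistics.variance(sample), 100
    )
    hits += low <= mu <= high  # True counts as 1

coverage = hits / n_intervals
print(f"share of intervals containing mu: {coverage:.3f}")
```

Each individual interval either contains `mu` or does not. It is the long-run share of intervals that is close to 95%.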
Seen how to generate hypothesis test results and confidence intervals for the population mean \(\mu\)
Based, via the central limit theorem, on the normal distribution
Other quantities of interest have other distributions
Hence other types of statistical tests \(\Rightarrow\) t-tests, F-tests, etc.
Introduction to Statistical Modeling