Variability

Eamonn Mallon
26/08/2020

height

Measurements often cluster around intermediate values (central tendency)
Variability of a sample is perhaps the most important quantity in data analysis

greater variability in data = greater uncertainty in the parameters calculated from this data
=> the lower our ability to distinguish between competing hypotheses

Deviations from the mean (residuals) always sum to zero, so ignore the signs, just use the absolute values
- This is exactly what newer techniques do, but the sums are hard
So earlier techniques just squared the residuals before adding them
\( \sum_{i=1}^{n} (\bar{y}-y_i)^2 \)
The single most important quantity in all of statistics: the sum of squares
102.5454545 units squared
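
As a rough illustration of the three options above (raw residuals, absolute values, squares), here is a minimal Python sketch. The five values in `y` are hypothetical and are not the data that gave 102.5454545.

```python
# Hypothetical measurements, chosen only for illustration.
y = [4.0, 7.0, 5.0, 9.0, 5.0]

y_bar = sum(y) / len(y)                          # sample mean
residuals = [y_bar - y_i for y_i in y]           # signed deviations from the mean

sum_of_residuals = sum(residuals)                # positives and negatives cancel
sum_of_abs = sum(abs(r) for r in residuals)      # "ignore the signs"
sum_of_squares = sum(r ** 2 for r in residuals)  # square, then add

print(sum_of_residuals, sum_of_abs, sum_of_squares)
```

The raw residuals cancel out, which is why they cannot measure variability on their own, while the absolute and squared versions both stay positive.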

Add another data point and the sum of squares gets bigger (unless your added data point was exactly the mean)
Hopefully, everyone can see this isn't great: a measure of variability shouldn't grow just because the sample does
Easily solved, just divide by n (mean squared deviation)
But hang on, we need to detour to explain degrees of freedom
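
Before the detour, a small Python sketch of the problem with the raw sum of squares. The data and the helper name `sum_of_squares` are made up for illustration; the second data set simply repeats the first, so its spread is the same but it has twice as many points.

```python
def sum_of_squares(y):
    """Sum of squared deviations of each value from the mean of y."""
    y_bar = sum(y) / len(y)
    return sum((y_bar - y_i) ** 2 for y_i in y)

small = [4.0, 7.0, 5.0, 9.0, 5.0]   # hypothetical data, mean 6
doubled = small + small             # same spread, twice the sample size

ss_small = sum_of_squares(small)
ss_doubled = sum_of_squares(doubled)

# Dividing by n removes the dependence on sample size.
msd_small = ss_small / len(small)
msd_doubled = ss_doubled / len(doubled)

# Adding a point exactly at the mean leaves the sum of squares unchanged.
ss_with_mean_point = sum_of_squares(small + [6.0])

print(ss_small, ss_doubled, msd_small, msd_doubled, ss_with_mean_point)
```

The sum of squares doubles when the data are duplicated, while the mean squared deviation stays put.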

What's the fourth number?
- Could be anything
- Let's say it's 0
What's the fifth number?
- It's got to be 7
Degrees of freedom (d.f.)
- n - 1
- the sample size, n, minus the number of parameters, p, estimated from the data.
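
The guessing game above can be sketched in Python. The known mean of 4 and the first three values are assumptions picked so that a fourth value of 0 forces a fifth value of 7; the actual numbers on the slide are not shown here.

```python
n = 5
known_mean = 4.0                     # assumed: the mean is already known
free_values = [3.0, 5.0, 5.0, 0.0]   # assumed: four values chosen freely

# Once the mean is fixed, the total is fixed too.
required_total = known_mean * n

# The last value has no freedom left: it must make up the total.
last_value = required_total - sum(free_values)

degrees_of_freedom = n - 1           # one parameter (the mean) was estimated

print(last_value, degrees_of_freedom)
```

Only four of the five numbers can vary once the mean is known, which is where n - 1 comes from.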

To compute \( \sum_{i=1}^{n} (\bar{y}-y_i)^2 \) we first had to calculate the mean, that is, we estimated one parameter from the data
so rather than divide by n we divide by n-1
Mean squared deviation becomes \( \frac{\sum_{i=1}^{n} (\bar{y}-y_i)^2}{n-1} \)
also known as the variance \( s^2 \)
an unbiased estimate of the population variance, because we have taken account of the fact that one parameter was estimated from the data prior to computation
Standard deviation is s, the square root of the variance. We use this because s is in the same dimensions as the mean (e.g. mean height = 150 cm, variance = 36 cm², standard deviation = 6 cm)
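
A minimal Python sketch of the variance and standard deviation calculation, using hypothetical data. It computes \( s^2 \) by hand with the n - 1 divisor and cross-checks against the standard library's `statistics.variance` and `statistics.stdev`, which also divide by n - 1.

```python
import math
import statistics

y = [4.0, 7.0, 5.0, 9.0, 5.0]   # hypothetical data
n = len(y)

y_bar = sum(y) / n
ss = sum((y_bar - y_i) ** 2 for y_i in y)   # sum of squares

variance = ss / (n - 1)         # s^2: divide by degrees of freedom, not n
std_dev = math.sqrt(variance)   # s: back in the units of the measurements

# Cross-check against the standard library (sample versions, n - 1 divisor).
lib_variance = statistics.variance(y)
lib_std_dev = statistics.stdev(y)

print(variance, std_dev, lib_variance, lib_std_dev)
```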

Variance can be used in two main ways

As variance increases so would unreliability
- \( unreliability \propto s^2 \)
Unreliability should go down as the number of samples increases
- \( unreliability \propto \frac{s^2}{n} \)
Unreliability should be in the same dimensions as our measurements
- \( S.E._{\bar{y}} = \sqrt{\frac{s^2}{n}} \) Standard error of the mean (s.e.m.)
An example: 50 \( \pm \) 5 (mean \( \pm \) s.e.m.)
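
The standard error formula above, as a short Python sketch on hypothetical data (not the data behind the 50 \( \pm \) 5 example):

```python
import math
import statistics

y = [4.0, 7.0, 5.0, 9.0, 5.0]   # hypothetical data
n = len(y)

mean = statistics.mean(y)
s2 = statistics.variance(y)     # sample variance, n - 1 divisor
sem = math.sqrt(s2 / n)         # standard error of the mean

# Report in the usual "mean ± s.e.m." form.
print(f"{mean:.1f} ± {sem:.2f}")
```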

This made me scratch my head as an undergrad
The population is, for example, all students in the room; the sample is the five people I picked out.
The standard error of the sample mean is an estimate of how far the sample mean is likely to be from the population mean,
whereas the standard deviation of the sample is the degree to which individuals within the sample differ from the sample mean.
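
One way to see the difference is a small simulation, sketched here in Python under made-up assumptions: a hypothetical population of 1000 heights (mean about 150 cm, standard deviation about 6 cm) and repeated samples of five people. The spread of the sample means should come out close to the population standard deviation divided by \( \sqrt{n} \), which is what the standard error estimates.

```python
import math
import random
import statistics

rng = random.Random(42)   # fixed seed so the sketch is repeatable

# Hypothetical population: 1000 heights in cm.
population = [rng.gauss(150, 6) for _ in range(1000)]
pop_sd = statistics.pstdev(population)

n = 5

# One sample of five people: its SD describes spread among individuals,
# its s.e.m. describes uncertainty in the sample mean.
sample = rng.sample(population, n)
sample_sd = statistics.stdev(sample)
sample_sem = sample_sd / math.sqrt(n)

# Many samples of five: how much do the sample means themselves vary?
sample_means = [
    statistics.fmean(rng.sample(population, n)) for _ in range(2000)
]
spread_of_means = statistics.stdev(sample_means)
theoretical_sem = pop_sd / math.sqrt(n)

print(sample_sd, sample_sem, spread_of_means, theoretical_sem)
```

The standard deviation stays roughly the same size however many people are sampled, whereas the standard error shrinks as n grows.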