Variability

Eamonn Mallon
26/08/2020

Descriptive statistics

[Figure: plot of height]

  • Measurements often cluster around intermediate values (central tendency)
  • Variability of a sample is perhaps the most important quantity in data analysis

Variability

Variability is central to data analysis

  • greater variability in data = greater uncertainty in the parameters calculated from those data
  • => lower ability to distinguish between competing hypotheses

Measures of variability

plot of chunk unnamed-chunk-1

  • The x axis is just the order in which the values were measured; it is not important
  • How can we quantify the variation (the scatter)?

The range

plot of chunk unnamed-chunk-2

  • The range is the difference between the lowest and highest values
  • Too dependent on outlying values
  • Ideally all data would contribute to our measure of variability
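To see why the range depends too heavily on outlying values, here is a minimal Python sketch. The numbers are invented for illustration and are not the data in the plot; `value_range` is a hypothetical helper name.

```python
# Hypothetical measurements, invented for illustration only
values = [41, 50, 52, 55, 63]
with_outlier = values + [150]  # one extreme value added


def value_range(data):
    """Difference between the highest and lowest values."""
    return max(data) - min(data)


range_without = value_range(values)
range_with = value_range(with_outlier)

# A single outlying point changes the range a lot,
# while the other five points contribute nothing extra.
print(range_without, range_with)
```

Only two of the data points (the minimum and the maximum) ever enter the calculation.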

A slight detour: Residuals

plot of chunk unnamed-chunk-3

  • Calculate departures from the mean (the residuals) \( \bar{y}-y_i \)
  • Add them all up \( \sum_{i=1}^{n} (\bar{y}-y_i) \)
  • But that always equals zero: the positive and negative departures cancel
  • Need to get rid of the signs
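A quick check that the residuals cancel, sketched in Python with made-up numbers (not the data in the plot):

```python
import statistics

# Hypothetical data, invented for illustration only
data = [2, 4, 4, 5, 10]

mean = statistics.mean(data)

# Departures from the mean, written as (mean - y) to match the slide
residuals = [mean - y for y in data]

total = sum(residuals)
print(residuals)
print(total)  # zero, up to floating point rounding
```

This happens for any data set, because the mean is by definition the value that balances the positive and negative departures.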

Behold! The sum of squares

plot of chunk unnamed-chunk-4

  • Ignore the signs, just use the absolute values
    • This is exactly what newer techniques do, but the sums are hard
  • So earlier techniques just squared the residuals before adding them
  • \( \sum_{i=1}^{n} (\bar{y}-y_i)^2 \)
  • The single most important quantity in all of statistics: the sum of squares
  • For the data plotted here: 102.5454545 units squared
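Both ways of getting rid of the signs can be sketched in a few lines of Python. The data are invented for illustration, so the result will not match the 102.5454545 quoted for the plotted data.

```python
import statistics

# Hypothetical data, invented for illustration only
data = [2, 4, 4, 5, 10]

mean = statistics.mean(data)

# Option 1: ignore the signs (sum of absolute deviations)
sum_abs = sum(abs(mean - y) for y in data)

# Option 2: square each residual before adding (sum of squares)
sum_squares = sum((mean - y) ** 2 for y in data)

print(sum_abs, sum_squares)
```

Note that the sum of squares is in squared units, and that squaring gives large departures more weight than small ones.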

But what happens if you add more data?

  • The sum of squares gets bigger (unless your added data point was exactly the mean)
  • That isn't great: a measure of variability shouldn't grow just because the sample did
  • Easily solved: just divide by n (the mean squared deviation)
  • But hang on, we need to detour to explain degrees of freedom
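Before the detour, the effect of sample size can be sketched in Python. As an assumption for illustration, the "extra data" are simply a second copy of the same made-up values, so the scatter is identical and only n changes; `sum_of_squares` is a hypothetical helper name.

```python
import statistics


def sum_of_squares(data):
    """Sum of squared departures from the mean."""
    mean = statistics.mean(data)
    return sum((mean - y) ** 2 for y in data)


# Hypothetical data, invented for illustration only
small = [2, 4, 4, 5, 10]
big = small * 2  # same values measured twice: same scatter, double the n

ss_small = sum_of_squares(small)
ss_big = sum_of_squares(big)

# Dividing by n removes the dependence on sample size
msd_small = ss_small / len(small)
msd_big = ss_big / len(big)

print(ss_small, ss_big)    # the sum of squares doubles
print(msd_small, msd_big)  # the mean squared deviation does not change
```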

A slight detour: Degrees of freedom

  • I have five numbers (positive or negative) that have to add up to 20
  • What's the first number?
    • Could be anything
    • Let's say it's 2
  • What's the second number?
    • Could be anything
    • Let's say it's 7
  • What's the third number?
    • Could be anything
    • Let's say it's 4

A slight detour: Degrees of freedom

  • What's the fourth number?
    • Could be anything
    • Let's say it's 0
  • What's the fifth number?
    • It's got to be 7 (20 - 2 - 7 - 4 - 0)
  • Degrees of freedom (d.f.)
    • n - 1 = 4 here: four numbers were free to vary, the fifth was fixed by the total
    • in general, the sample size, n, minus the number of parameters, p, estimated from the data
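The five-numbers game can be written as a short Python sketch. The variable names are chosen here for illustration.

```python
# Five numbers must add up to 20
required_total = 20
n = 5

# The first four can be anything; these are the choices from the slides
free_choices = [2, 7, 4, 0]

# The last number is not free: it is fixed by the total
last_number = required_total - sum(free_choices)

# One constraint (the total) was imposed, so one degree of freedom is lost
degrees_of_freedom = n - 1

print(last_number, degrees_of_freedom)
```

Fixing the total plays the same role here as fixing the mean does in a sample: once the mean is known, only n - 1 of the values are free to vary.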

Variance

  • To calculate \( \sum_{i=1}^{n} (\bar{y}-y_i)^2 \) we first had to calculate the mean; that is, we estimated one parameter from the data
  • so rather than divide by n we divide by the degrees of freedom, n-1
  • Mean squared deviation becomes \( \frac{\sum_{i=1}^{n} (\bar{y}-y_i)^2}{n-1} \)
  • also known as the variance \( s^2 \)
  • an unbiased estimate of the population variance, because we have taken account of the fact that one parameter was estimated from the data prior to computation
  • standard deviation is s, the square root of the variance. We use this because s is in the same units as the mean (e.g. mean height = 150 cm, variance = 36 cm², standard deviation = 6 cm)
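The variance and standard deviation can be computed by hand and checked against Python's standard `statistics` module, whose `variance` and `stdev` divide by n - 1 while `pvariance` divides by n. The data are invented for illustration.

```python
import math
import statistics

# Hypothetical data, invented for illustration only
data = [2, 4, 4, 5, 10]
n = len(data)

mean = statistics.mean(data)
sum_squares = sum((mean - y) ** 2 for y in data)

# Divide by the degrees of freedom (n - 1), not by n
manual_variance = sum_squares / (n - 1)
manual_sd = math.sqrt(manual_variance)

# The library versions: variance/stdev use n - 1, pvariance uses n
library_variance = statistics.variance(data)
library_sd = statistics.stdev(data)
divide_by_n = statistics.pvariance(data)

print(manual_variance, library_variance)  # both 9 (units squared)
print(manual_sd, library_sd)              # both 3 (original units)
print(divide_by_n)                        # smaller, because it divides by n
```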

Using Variance

Variance can be used in two main ways

  • for establishing measures of unreliability
  • for testing hypotheses (a future lecture)

A measure of unreliability

  • As variance increases, so does unreliability
    • \( \text{unreliability} \propto s^2 \)
  • Unreliability should go down as the sample size increases
    • \( \text{unreliability} \propto \frac{s^2}{n} \)
  • Unreliability should be in the same units as our measurements
    • \( \text{S.E.}_{\bar{y}} = \sqrt{\frac{s^2}{n}} \) Standard error of the mean (s.e.m.)
  • An example: 50 \( \pm \) 5 (mean \( \pm \) s.e.m.)
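A minimal Python sketch of the standard error of the mean, following the formula above. The data are invented for illustration and `standard_error` is a hypothetical helper name.

```python
import math
import statistics


def standard_error(data):
    """Standard error of the mean: sqrt(s^2 / n)."""
    s_squared = statistics.variance(data)  # divides by n - 1
    return math.sqrt(s_squared / len(data))


# Hypothetical data, invented for illustration only
data = [2, 4, 4, 5, 10]

mean = statistics.mean(data)
sem = standard_error(data)

# Report as mean +/- s.e.m.
print(f"{mean} ± {sem:.2f}")
```

Because n sits under the square root, quadrupling the sample size only halves the standard error, assuming the variance stays the same.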

A slight detour: Differences between populations and samples

  • This made me scratch my head as an undergrad
  • The population is, for example, all the students in the room; the sample is the five people I picked out
  • The standard error of the sample mean is an estimate of how far the sample mean is likely to be from the population mean
  • The standard deviation of the sample is the degree to which individuals within the sample differ from the sample mean
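The distinction can be explored with a small simulation in Python. Everything here is an assumption for illustration: a simulated room of 200 students with heights drawn from a normal distribution (mean 170 cm, s.d. 8 cm), a fixed random seed, and 2000 repeated samples of five.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: heights (cm) of 200 students in the room
population = [rng.gauss(170, 8) for _ in range(200)]
population_sd = statistics.pstdev(population)

# One sample: the five people picked out
sample = rng.sample(population, 5)
sample_sd = statistics.stdev(sample)              # how individuals differ from the sample mean
sample_sem = sample_sd / math.sqrt(len(sample))   # estimated distance of sample mean from population mean

# Repeat the sampling many times to see how much sample means actually vary
sample_means = [statistics.mean(rng.sample(population, 5)) for _ in range(2000)]
spread_of_means = statistics.stdev(sample_means)

print(f"s.d. of one sample:        {sample_sd:.2f}")
print(f"s.e.m. from that sample:   {sample_sem:.2f}")
print(f"actual spread of means:    {spread_of_means:.2f}")
print(f"population s.d. / sqrt(5): {population_sd / math.sqrt(5):.2f}")
```

The last two numbers should be close to each other: sample means scatter far less than individuals do, and the standard error is an estimate of that smaller scatter made from a single sample.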