Lecture 2: Central tendency and variability


Descriptive statistics

  • Measurements often cluster around intermediate values (central tendency)
  • Variability of a sample is perhaps the most important quantity in data analysis

Central tendency

How should we quantify the central tendency?

  • We will look at three measures of the “average” that are commonly used in biology
  • Let’s start with the mode: the value that occurs most frequently
  • The easiest way to find it is simply to look at the data

Let’s just look at the data

The mode
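
A minimal sketch of finding the mode in Python, using the standard-library statistics module; the data values are made up for illustration:

```python
from statistics import multimode

# Hypothetical measurements, invented for illustration
weights = [5, 7, 7, 8, 7, 9, 5, 7]

# multimode returns every value tied for the highest frequency,
# so it also copes with bimodal data
print(multimode(weights))  # [7]
```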

The arithmetic mean

  • \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{i}\)
  • sum of a collection of numbers divided by the count of numbers in the collection
  • not a robust statistic, meaning that it is greatly influenced by outliers
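
A quick Python sketch (made-up data) showing both the calculation and the lack of robustness:

```python
from statistics import mean

data = [3, 5, 6, 8, 10]    # hypothetical sample
print(mean(data))          # 6.4 (= 32 / 5)

# The mean is not robust: a single outlier drags it
# far from the bulk of the data
print(mean(data + [100]))  # 22.0
```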

The median

  • The middle value
  • 3, 5, 6, 8, 10 (median = 6)
  • 3, 5, 6, 8, 10, 10 (median = 7): with an even number of values, average the two middle values
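
The same examples in Python; note that, unlike the mean, the median barely moves when an outlier is added:

```python
from statistics import median

print(median([3, 5, 6, 8, 10]))       # 6   (odd n: the middle value)
print(median([3, 5, 6, 8, 10, 10]))   # 7.0 (even n: average of 6 and 8)

# The median is robust: the outlier that dragged the mean
# up to 22 moves the median only from 6 to 7
print(median([3, 5, 6, 8, 10, 100]))  # 7.0
```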

Variability

Variability is central to data analysis

  • Greater variability in the data means greater uncertainty in the parameters calculated from those data
  • and therefore a lower ability to distinguish between competing hypotheses

Measures of variability

  • In the plot, the x-axis is just the order in which the values were measured; it carries no information
  • How can we quantify the variation (the scatter)?

The range

  • The range is the difference between the lowest and highest values
  • Too dependent on outlying values
  • Ideally all data would contribute to our measure of variability
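
A small sketch (same made-up sample) of how a single outlier dominates the range:

```python
data = [3, 5, 6, 8, 10]
print(max(data) - min(data))            # 7

outliered = data + [100]                # one extreme value...
print(max(outliered) - min(outliered))  # 97: the range is almost all outlier
```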

A slight detour: Residuals

  • Calculate departures from the mean, \(\bar{y}-y_i\)
  • Add them all up: \(\sum_{i=1}^{n} (\bar{y}-y_i)\)
  • But that always equals zero! (since \(\sum_{i=1}^{n} (\bar{y}-y_i) = n\bar{y} - \sum_{i=1}^{n} y_i = 0\))
  • We need to get rid of the signs
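
A sketch confirming the zero-sum property on the made-up sample used above:

```python
data = [3, 5, 6, 8, 10]
ybar = sum(data) / len(data)          # 6.4

residuals = [ybar - y for y in data]  # departures from the mean
print(sum(residuals))                 # 0.0 (up to floating-point error)
```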

The sum of squares

  • Ignore the signs, just use the absolute values
    • This is exactly what some more recent (robust) techniques do, but the resulting sums are hard to work with mathematically
  • So classical techniques square the residuals before adding them
  • \(\sum_{i=1}^{n} (\bar{y}-y_i)^2\)
  • The single most important quantity in all of statistics: the sum of squares
  • For the lecture’s example data this comes to 102.5454545 units squared
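
That number comes from the lecture’s own example data, which isn’t reproduced here; a sketch of the same calculation on our made-up sample:

```python
data = [3, 5, 6, 8, 10]
ybar = sum(data) / len(data)

# Square each residual so the signs cannot cancel, then add them up
ss = sum((ybar - y) ** 2 for y in data)
print(round(ss, 1))  # 29.2 units squared
```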

But what happens if you add more data?

  • The sum of squares gets bigger (unless your added data point was exactly the mean)
  • Hopefully, everyone can see this isn’t great: the sum of squares reflects sample size as well as scatter
  • Easily solved: just divide by n (the mean squared deviation)
  • But hang on, we need a detour to explain degrees of freedom
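
A simulation sketch of the problem and the fix: as points are added from a distribution whose spread never changes, the sum of squares keeps growing, while the mean squared deviation settles down:

```python
import random

random.seed(1)
data = []
for n in (10, 100, 1000):
    while len(data) < n:
        data.append(random.gauss(0, 2))  # draws with a fixed true spread
    ybar = sum(data) / len(data)
    ss = sum((ybar - y) ** 2 for y in data)
    print(n, round(ss, 1), round(ss / n, 2))

# The sum of squares grows roughly in step with n,
# while ss / n settles near the true variance, 4 (= 2 squared)
```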

Degrees of freedom

  • I have five numbers (positive or negative) that have to add up to 20
  • What’s the first number?
    • Could be anything
    • Let’s say it’s 2
  • What’s the second number?
    • Could be anything
    • Let’s say it’s 7
  • What’s the third number?
    • Could be anything
    • Let’s say it’s 4
  • What’s the fourth number?
    • Could be anything
    • Let’s say it’s 0
  • What’s the fifth number?
    • It’s got to be 7
  • Degrees of freedom (d.f.)
    • n - 1: four of our five numbers were free to vary
    • more generally, the sample size, n, minus the number of parameters, p, estimated from the data
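
The same game as a Python sketch: once four numbers are chosen, the fifth is forced by the fixed total:

```python
target = 20
free_choices = [2, 7, 4, 0]         # the first four could be anything
fifth = target - sum(free_choices)  # ...but the fifth has no freedom left
print(fifth)                        # 7
```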

Variance

  • To compute \(\sum_{i=1}^{n} (\bar{y}-y_i)^2\) we had to calculate the mean, that is, we estimated one parameter from the data
  • so rather than divide by n we divide by n - 1
  • The mean squared deviation becomes \(\frac{\sum_{i=1}^{n} (\bar{y}-y_i)^2}{n-1}\)
  • also known as the variance, \(s^2\)
  • an unbiased estimate of the variance, because we have taken account of the fact that one parameter was estimated from the data prior to computation
  • The standard deviation, s, is the square root of the variance. We use it because s has the same dimensions as the mean (e.g. mean height = 150 cm, variance = 36 cm², standard deviation = 6 cm)
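
Python’s statistics module uses the n - 1 divisor for exactly this reason; a sketch on the made-up sample:

```python
from statistics import variance, stdev

data = [3, 5, 6, 8, 10]

# variance and stdev are the sample versions: the sum of squares
# (29.2 for these data) is divided by n - 1 = 4, not by n
print(variance(data))  # 7.3
print(stdev(data))     # 2.701851... (square root of 7.3)
```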

Using Variance

Variance can be used in two main ways

  • for establishing measures of unreliability
  • for testing hypotheses (a future lecture)

A measure of unreliability

  • As variance increases, so does unreliability
    • \(unreliability \propto s^2\)
  • Unreliability should go down as the number of samples increases
    • \(unreliability \propto \frac{s^2}{n}\)
  • Unreliability should be in the same dimensions as our measurements
    • \(S.E._{\bar{y}} = \sqrt{\frac{s^2}{n}}\), the standard error of the mean (s.e.m.)
  • An example: 50 \(\pm\) 5 (mean \(\pm\) s.e.m.)
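
A sketch of the standard error on the made-up sample (the 50 \(\pm\) 5 above is just an illustrative reporting format):

```python
from math import sqrt
from statistics import mean, variance

data = [3, 5, 6, 8, 10]

sem = sqrt(variance(data) / len(data))  # sqrt(s^2 / n)
print(f"{mean(data)} ± {sem:.2f} (mean ± s.e.m.)")  # 6.4 ± 1.21
```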

Differences between populations and samples

  • This made me scratch my head as an undergrad
  • The population is, for example, all the students in the room; the sample is the five people I picked out
  • The standard error of the sample mean estimates how far the sample mean is likely to be from the population mean,
  • whereas the standard deviation of the sample measures how much the individuals within the sample differ from the sample mean
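
A simulation sketch (invented population) that makes the distinction concrete: the s.d. describes the scatter of individuals, while the s.e.m. predicts the scatter of repeated sample means:

```python
import random
from statistics import mean, stdev

random.seed(42)

# Hypothetical population: 10,000 heights with mean 150 cm, s.d. 6 cm
population = [random.gauss(150, 6) for _ in range(10_000)]

sample = random.sample(population, 5)
print(round(mean(sample), 1))   # sample mean: an estimate of the population mean
print(round(stdev(sample), 1))  # s.d.: spread of the individuals in this sample

# Draw many samples: the sample means themselves scatter,
# and the s.e.m. is exactly an estimate of that scatter
means = [mean(random.sample(population, 5)) for _ in range(1000)]
print(round(stdev(means), 2))   # near 6 / sqrt(5) ≈ 2.68
```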

What have you learned today?

  • Central tendency
    • Mode
    • Median
    • Mean
    • Robustness
  • Variability
    • Sum of squares (important for lots of tests)
    • What degrees of freedom are
    • Variance and its uses
    • The difference between the s.d. and the s.e.m.