Lecture 2 Central tendency and variability

Lecture outline

Measurements often cluster around intermediate values (central tendency)
Variability of a sample is perhaps the most important quantity in data analysis

We will look at three measures of the “average” that are commonly used in biology
Let's start with the mode: the data value (or values) that occurs most frequently
The easiest way to find it is just to look at the data
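For samples too large to scan by eye, the mode can be counted instead. A minimal sketch using Python's standard library; the clutch-size data are invented for illustration and are not from the lecture:

```python
from statistics import multimode

# Hypothetical sample: number of eggs counted in ten nests
clutch_sizes = [3, 4, 4, 5, 4, 3, 6, 4, 5, 3]

# multimode returns every value tied for the highest frequency,
# so it also copes with data that have more than one mode
modes = multimode(clutch_sizes)
print(modes)
```

`multimode` returns a list, so a sample with two equally common values gives both of them rather than an arbitrary one.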

\(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{i}\)
The arithmetic mean: the sum of a collection of numbers divided by the count of numbers in the collection
not a robust statistic, meaning that it is greatly influenced by outliers
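A small sketch of what "not robust" means in practice. The values are made up, and the outlier of 100 is an assumption chosen to make the effect obvious:

```python
from statistics import mean

# Hypothetical measurements
values = [3, 5, 6, 8, 10]

# The same measurements plus one extreme value (assumed outlier)
with_outlier = values + [100]

mean_plain = mean(values)          # (3 + 5 + 6 + 8 + 10) / 5
mean_outlier = mean(with_outlier)  # (32 + 100) / 6

print(mean_plain, mean_outlier)
```

A single extreme point drags the mean from 6.4 up to 22, well above five of the six observations.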

The median: the middle value of the sorted data
3, 5, 6, 8, 10 (median = 6)
3, 5, 6, 8, 10, 10 (median = 7): with an even number of values, average the two values either side of the middle
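The two examples above can be checked with a short sketch. The third sample, with an extreme value of 100, is an added assumption to show that the median barely moves when an outlier is present:

```python
from statistics import median

odd_sample = [3, 5, 6, 8, 10]
even_sample = [3, 5, 6, 8, 10, 10]

# Assumed extra case: replace the last value with an extreme one
outlier_sample = [3, 5, 6, 8, 10, 100]

median_odd = median(odd_sample)          # middle value
median_even = median(even_sample)        # average of 6 and 8
median_outlier = median(outlier_sample)  # still the average of 6 and 8

print(median_odd, median_even, median_outlier)
```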

greater variability in the data = greater uncertainty in the parameters estimated from those data
=> the lower our ability to distinguish between competing hypotheses

Ignore the signs, just use the absolute values
- This is exactly what newer techniques do, but the sums are hard
So earlier techniques just squared the residuals before adding them
\(\sum_{i=1}^{n} (\bar{y}-y_i)^2\)
The single most important quantity in all of statistics: the sum of squares
102.5454545 units squared
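A minimal sketch of the steps above, using a small made-up sample rather than the lecture data, so the result will not match the value quoted above. It shows the signed residuals, the sum of absolute values, and the sum of squares:

```python
# Hypothetical sample (not the lecture data)
y = [3, 5, 6, 8, 10]

y_bar = sum(y) / len(y)

# Residuals: departures of each value from the mean
residuals = [y_bar - yi for yi in y]

# Signed residuals cancel out, apart from floating-point rounding
sum_residuals = sum(residuals)

# Option 1: ignore the signs
sum_abs = sum(abs(r) for r in residuals)

# Option 2: square each residual before adding (the sum of squares)
sum_sq = sum(r ** 2 for r in residuals)

print(round(sum_residuals, 10), round(sum_abs, 2), round(sum_sq, 2))
```

Because the residuals are squared, the sum of squares is in squared units of the original measurement.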

The sum of squares gets bigger with every data point we add (unless the added data point is exactly the mean)
Hopefully, everyone can see this isn’t great
Easily solved: just divide by n (mean squared deviation)
But hang on, we need to detour to explain degrees of freedom
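The problem with the raw sum of squares can be seen in a short sketch. Repeating the same made-up sample several times is an artificial device, assumed here only so that the spread of the data stays identical while n grows:

```python
def sum_of_squares(y):
    """Sum of squared departures from the sample mean."""
    y_bar = sum(y) / len(y)
    return sum((y_bar - yi) ** 2 for yi in y)

# Hypothetical sample, repeated to increase n without changing the spread
base = [3, 5, 6, 8, 10]

results = []
for copies in (1, 2, 4):
    sample = base * copies
    ss = sum_of_squares(sample)
    # Store n, the sum of squares, and the sum of squares divided by n
    results.append((len(sample), round(ss, 2), round(ss / len(sample), 2)))

for row in results:
    print(row)
```

The sum of squares doubles each time the sample doubles, while the sum of squares divided by n stays the same.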

Degrees of freedom (d.f.)
- n - 1
- the sample size, n, minus the number of parameters, p, estimated from the data.

To compute \(\sum_{i=1}^{n} (\bar{y}-y_i)^2\) we first had to calculate the mean, that is, we estimated one parameter from the data
so rather than divide by n we divide by n - 1
Mean squared deviation becomes \(\frac{\sum_{i=1}^{n} (\bar{y}-y_i)^2}{n-1}\)
also known as the variance \(s^2\)
an unbiased estimate of the variance, because we have taken account of the fact that one parameter was estimated from the data prior to computation
The standard deviation is s, the square root of the variance. We use it because s is in the same units as the mean (e.g. mean height = 150 cm, variance = 36 cm^2, standard deviation = 6 cm)
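A sketch of the variance and standard deviation calculated by hand and compared with Python's `statistics` module, whose `variance` and `stdev` functions also divide by n - 1. The sample is made up for illustration:

```python
import math
import statistics

# Hypothetical sample
y = [3, 5, 6, 8, 10]
n = len(y)

y_bar = sum(y) / n
ss = sum((y_bar - yi) ** 2 for yi in y)

# Dividing by n gives the mean squared deviation
msd = ss / n

# Dividing by the degrees of freedom, n - 1, gives the sample variance
s2 = ss / (n - 1)

# The standard deviation is back in the units of the measurements
s = math.sqrt(s2)

print(round(msd, 2), round(s2, 2), round(s, 2))
print(statistics.variance(y), statistics.stdev(y))
```

Dividing by n - 1 always gives a slightly larger value than dividing by n, and the difference shrinks as the sample grows.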

Variance can be used in two main ways

As variance increases, so does unreliability
- \(\text{unreliability} \propto s^2\)
Unreliability should go down as the number of samples increases
- \(\text{unreliability} \propto \frac{s^2}{n}\)
Unreliability should be in the same units as our measurements
- \(S.E._{\bar{y}} = \sqrt{\frac{s^2}{n}}\) Standard error of the mean (s.e.m.)
An example: 50 \(\pm\) 5 (mean \(\pm\) s.e.m)
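The standard error formula can be sketched in a few lines. The helper name `standard_error` and the data are assumptions made for illustration; repeating the sample four times is an artificial way to increase n while keeping the spread similar:

```python
import math
import statistics

def standard_error(y):
    """Standard error of the mean: sqrt(s^2 / n), with s^2 computed on n - 1."""
    return math.sqrt(statistics.variance(y) / len(y))

# Hypothetical sample
y = [3, 5, 6, 8, 10]
sem = standard_error(y)

# Report in the usual "mean ± s.e.m." form
print(f"{statistics.mean(y):.1f} ± {sem:.1f}")

# A larger sample with a similar spread gives a smaller standard error
sem_large = standard_error(y * 4)
print(round(sem_large, 2))
```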

This made me scratch my head as an undergrad
The population is, for example, all students in the room; the sample is the five people I picked out.
The standard error of the mean is an estimate of how far the sample mean is likely to be from the population mean,
whereas the standard deviation of the sample is the degree to which individuals within the sample differ from the sample mean.
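The distinction can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is taken to be normally distributed with mean 150 and standard deviation 6 (borrowing the height example), and samples of five are drawn repeatedly with a fixed random seed.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

POP_MEAN, POP_SD = 150.0, 6.0  # assumed population values
N, REPS = 5, 5000              # sample size and number of repeated samples

sample_means = []
sample_sds = []
for _ in range(REPS):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    sample_means.append(statistics.mean(sample))
    sample_sds.append(statistics.stdev(sample))

# Spread of the sample means around the population mean:
# this is what the standard error of the mean estimates
sd_of_means = statistics.stdev(sample_means)

# Typical spread of individuals within a single sample
typical_sd = statistics.mean(sample_sds)

# Value the standard error should be close to: population SD / sqrt(n)
theoretical_sem = POP_SD / N ** 0.5

print(round(sd_of_means, 2), round(theoretical_sem, 2), round(typical_sd, 2))
```

The sample means vary much less than the individuals do, which is why the standard error is smaller than the standard deviation.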

Central tendency
- Mode
- Median
- Mean
- Robustness
Variability
- Sum of squares (important for lots of tests)
- What are degrees of freedom
- Variance and its uses
- Difference between S.D. and S.E.M.