Measurements often cluster around intermediate values (central tendency)
Variability of a sample is perhaps the most important quantity in data analysis
Central tendency
How should we quantify the central tendency?
We will look at three measures of the “average” that are commonly used in biology
Let's start with the mode: the data value (or values) that occurs most frequently
The easiest way to find it is simply to look at the data
Let's just look at the data
The mode
The arithmetic mean
\(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{i}\)
sum of a collection of numbers divided by the count of numbers in the collection
not a robust statistic, meaning that it is greatly influenced by outliers
The median
The middle value of the sorted data
3,5,6,8,10 (median = 6)
3,5,6,8,10,10 (median = 7): with an even number of values, take the average of the two values either side of the middle (here 6 and 8)
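The three measures can be checked with Python's standard `statistics` module. This is only a minimal sketch: it reuses the six values from the median example above, and the extra value of 100 is a made-up outlier added to show why the mean is not robust.

```python
import statistics

# The six values from the median example above.
values = [3, 5, 6, 8, 10, 10]

mode = statistics.mode(values)      # the most frequent value
mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # average of the two middle values (n is even)

print("mode:", mode, "mean:", mean, "median:", median)

# Hypothetical outlier: one extreme value drags the mean a long way,
# while the median barely moves.
with_outlier = values + [100]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)

print("with outlier -> mean:", mean_with_outlier, "median:", median_with_outlier)
```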
Variability
Variability is central to data analysis
greater variability in data = greater uncertainty in the parameters calculated from this data
=> the lower our ability to distinguish between competing hypotheses
Measures of variability
The x axis is just the order in which the values were measured; it is not important
How can we quantify the variation (the scatter)?
The range
The range is the difference between the lowest and highest values
Too dependent on outlying values
Ideally all data would contribute to our measure of variability
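A small sketch of why the range depends too much on outlying values. The data are made up for illustration, and `value_range` is a hypothetical helper name, not something from the lecture.

```python
def value_range(data):
    """Difference between the highest and lowest values."""
    return max(data) - min(data)

# Illustrative data only.
data = [3, 5, 6, 8, 10]
range_without = value_range(data)

# A single outlying value is enough to change the range completely,
# even though the other five values have not moved.
range_with = value_range(data + [50])

print(range_without, range_with)  # 7 47
```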
A slight detour: Residuals
Calculate the departures from the mean (the residuals) \(\bar{y}-y_i\)
Add them all up \(\sum_{i=1}^{n} (\bar{y}-y_i)\)
But that always equals zero: the positive and negative departures cancel out
Need to get rid of the signs
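The cancellation is easy to check numerically. A minimal sketch with made-up data; the variable names are illustrative.

```python
# Illustrative data only.
y = [3, 5, 6, 8, 10]
y_bar = sum(y) / len(y)

# Residuals as defined above: mean minus each observation.
residuals = [y_bar - y_i for y_i in y]

# Positive and negative residuals cancel, so the total is zero
# (up to floating-point rounding).
residual_total = sum(residuals)

print(residuals)
print(residual_total)
```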
The sum of squares
One option: ignore the signs and just use the absolute values
This is exactly what newer techniques do, but the sums are hard
So earlier techniques squared the residuals before adding them up
\(\sum_{i=1}^{n} (\bar{y}-y_i)^2\)
The single most important quantity in all of statistics: the sum of squares
102.5454545 units squared
But what happens if you add more data?
The sum of squares gets bigger (unless your added data point was exactly the mean)
Hopefully, everyone can see this isn’t great
Easily solved: just divide by n to get the mean squared deviation
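A sketch of both points with made-up data: the sum of squares grows as data are added, while dividing by n removes the dependence on sample size. Measuring the same five values twice is an artificial choice that keeps the mean unchanged so the effect is easy to see.

```python
def sum_of_squares(y):
    """Sum of squared departures from the mean."""
    y_bar = sum(y) / len(y)
    return sum((y_bar - y_i) ** 2 for y_i in y)

# Illustrative data only.
y = [3, 5, 6, 8, 10]
ss_small = sum_of_squares(y)

# "Add more data": here, simply the same five values measured again.
y_more = y + y
ss_large = sum_of_squares(y_more)

# Dividing by n gives the mean squared deviation, which does not
# grow just because the sample got bigger.
msd_small = ss_small / len(y)
msd_large = ss_large / len(y_more)

print(ss_small, ss_large)
print(msd_small, msd_large)
```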
But hang on, we need to detour to explain degrees of freedom
Degrees of freedom
I have five numbers (positive or negative) that have to add up to 20
What’s the first number?
Could be anything
Let’s say it’s 2
What’s the second number?
Could be anything
Let’s say it’s 7
What’s the third number?
Could be anything
Let’s say it’s 4
What’s the fourth number?
Could be anything
Let’s say it’s 0
What’s the fifth number?
It’s got to be 7 (20 - 2 - 7 - 4 - 0), so only four of the five numbers were free to vary
Degrees of freedom (d.f.)
In general, d.f. = n - p: the sample size, n, minus the number of parameters, p, estimated from the data.
When only the mean is estimated, p = 1, so d.f. = n - 1
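The five-numbers game can be written out in a few lines. A minimal sketch: `degrees_of_freedom` is a hypothetical helper name, and its default `p=1` assumes that only the mean has been estimated.

```python
TOTAL = 20

# The first four numbers were free choices.
free_choices = [2, 7, 4, 0]

# The fifth is not free: it is fixed by the total.
last_number = TOTAL - sum(free_choices)

def degrees_of_freedom(n, p=1):
    """Sample size n minus the number of parameters p estimated from the data."""
    return n - p

print(last_number)
print(degrees_of_freedom(5))
```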
Variance
To compute \(\sum_{i=1}^{n} (\bar{y}-y_i)^2\) we first had to calculate the mean; that is, we estimated one parameter from the data
so rather than divide by n we divide by n-1
Mean squared deviation becomes \(\frac{\sum_{i=1}^{n} (\bar{y}-y_i)^2}{n-1}\)
also known as the variance, \(s^2\)
an unbiased estimate of the population variance, because we have taken account of the fact that one parameter was estimated from the data prior to computation
The standard deviation, s, is the square root of the variance. We use it because s is in the same units as the mean (e.g. mean height = 150 cm, variance = 36 cm^2, standard deviation = 6 cm)
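A sketch of the calculation by hand, checked against Python's `statistics.variance` and `statistics.stdev`, which also divide by n-1. The data are made up for illustration.

```python
import math
import statistics

# Illustrative data only.
y = [3, 5, 6, 8, 10]
n = len(y)
y_bar = sum(y) / n

# Sum of squares, then divide by the degrees of freedom (n - 1).
ss = sum((y_bar - y_i) ** 2 for y_i in y)
variance = ss / (n - 1)

# The standard deviation is back in the units of the measurements.
sd = math.sqrt(variance)

print(variance, statistics.variance(y))
print(sd, statistics.stdev(y))
```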
Using Variance
Variance can be used in two main ways
for establishing measures of unreliability
for testing hypotheses (a future lecture)
A measure of unreliability
As variance increases, so does unreliability
\(\text{unreliability} \propto s^2\)
Unreliability should go down as the number of samples increases
\(\text{unreliability} \propto \frac{s^2}{n}\)
Unreliability should be in the same units as our measurements
\(\text{S.E.}_{\bar{y}} = \sqrt{\frac{s^2}{n}}\): the standard error of the mean (s.e.m.)
An example: 50 \(\pm\) 5 (mean \(\pm\) s.e.m)
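The standard error follows directly from the variance. A minimal sketch with made-up data; `standard_error` is a hypothetical helper name.

```python
import math
import statistics

def standard_error(y):
    """Standard error of the mean: sqrt(s^2 / n), with s^2 computed using n - 1."""
    s2 = statistics.variance(y)
    return math.sqrt(s2 / len(y))

# Illustrative data only.
y = [3, 5, 6, 8, 10]
sem = standard_error(y)

# Reported in the usual "mean +/- s.e.m." form.
print(f"{statistics.mean(y):.2f} +/- {sem:.2f}")
```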
Differences between populations and samples
This made me scratch my head as an undergrad
The population is, for example, all students in the room, the sample is the five people I picked out.
The standard error of the sample mean is an estimate of how far the sample mean is likely to be from the population mean,
whereas the standard deviation of the sample is the degree to which individuals within the sample differ from the sample mean.
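A simulation can make the distinction concrete. The sketch below assumes a hypothetical normally distributed population (mean 150, standard deviation 6, echoing the height example), a sample size of 25 and a fixed random seed; none of these values come from real data. It draws many samples and compares the spread of the sample means with the spread of individuals inside one sample.

```python
import math
import random
import statistics

rng = random.Random(42)          # fixed seed so the run is repeatable

POP_MEAN, POP_SD = 150.0, 6.0    # hypothetical population
N = 25                           # size of each sample
N_SAMPLES = 5000                 # number of repeated samples

def draw_sample(n):
    """Draw n individuals from the hypothetical population."""
    return [rng.gauss(POP_MEAN, POP_SD) for _ in range(n)]

# One sample: its standard deviation describes individuals,
# and its standard error describes the uncertainty in its mean.
one_sample = draw_sample(N)
sample_sd = statistics.stdev(one_sample)
sample_sem = sample_sd / math.sqrt(N)

# Many samples: the spread of their means is what the s.e.m. estimates.
sample_means = [sum(s) / N for s in (draw_sample(N) for _ in range(N_SAMPLES))]
spread_of_means = statistics.stdev(sample_means)

print("s.d. of one sample:      ", round(sample_sd, 2))
print("s.e.m. from that sample: ", round(sample_sem, 2))
print("spread of sample means:  ", round(spread_of_means, 2))
print("population s.d./sqrt(n): ", round(POP_SD / math.sqrt(N), 2))
```

The spread of the sample means should sit close to the population standard deviation divided by \(\sqrt{n}\), and well below the standard deviation of the individuals in any one sample.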