Only one way of getting 2, lots of ways of getting 7
Let's throw the two dice 10,000 times. What distribution do we get?
Not quite normal
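A minimal simulation of the dice experiment, assuming two fair six-sided dice. The seed and the text histogram are illustrative choices, not part of the original slides.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the sketch is repeatable

N_THROWS = 10_000

# Each throw is the sum of two fair six-sided dice.
totals = [random.randint(1, 6) + random.randint(1, 6) for _ in range(N_THROWS)]
counts = Counter(totals)

# Exact number of ways to make each total, out of 36 combinations.
ways = Counter(a + b for a in range(1, 7) for b in range(1, 7))

for total in range(2, 13):
    bar = "#" * (counts[total] // 50)
    print(f"{total:2d}  ways={ways[total]}  thrown={counts[total]:5d}  {bar}")
```

The histogram is triangular rather than bell-shaped, which is why it is not quite normal.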
Average the score over three games
Last slide was one game repeated 10,000 times
Now, let's average over three games, 10,000 times
A normal distribution: the central limit theorem in action
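The averaging step can be sketched the same way. The helper `play_game` and the seed are hypothetical names and values chosen for illustration, and one "game" is assumed to be a single throw of two fair dice.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

N_REPEATS = 10_000
GAMES_PER_AVERAGE = 3


def play_game():
    """One game: the total of two fair six-sided dice."""
    return random.randint(1, 6) + random.randint(1, 6)


# One game, repeated 10,000 times.
single_games = [play_game() for _ in range(N_REPEATS)]

# The average of three games, repeated 10,000 times.
averages = [
    sum(play_game() for _ in range(GAMES_PER_AVERAGE)) / GAMES_PER_AVERAGE
    for _ in range(N_REPEATS)
]

print("single games: mean", statistics.mean(single_games),
      "sd", statistics.stdev(single_games))
print("3-game means: mean", statistics.mean(averages),
      "sd", statistics.stdev(averages))
```

Both sets of scores centre on 7, but the averages are less spread out and their histogram is closer to a bell shape.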
But why does real world data often fit a normal distribution?
The central limit theorem
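If you take repeated samples from a population and calculate their averages, then these averages will be approximately normally distributed, and the approximation improves as the samples get larger.
Real-world data is often the result of numerous processes interacting, so each value behaves like an average of many separate effects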
Think of all the reasons that you are the height you are.
The normal distribution is well studied
It's symmetrical, so half of the values are below the mean and half above
We know what proportion of values falls in each part of the distribution, e.g. about 16% of samples will be more than 1 standard deviation above the mean.
Because we know this, we can work out which values could plausibly arise by chance alone. The fact that 95% (remember 0.05?) of values lie within 1.96 standard deviations of the mean is often used in statistical tests.
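These proportions can be checked against the standard normal distribution. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Half of the values lie below the mean.
below_mean = standard_normal.cdf(0)

# Proportion more than 1 standard deviation above the mean.
above_one_sd = 1 - standard_normal.cdf(1)

# Proportion within 1.96 standard deviations of the mean.
within_1_96_sd = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)

print(f"below the mean:      {below_mean:.3f}")
print(f"more than +1 sd:     {above_one_sd:.3f}")
print(f"within +/- 1.96 sd:  {within_1_96_sd:.3f}")
```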
Is my data normal?
A slight detour: Power (from lecture 1)
The power of a test is the probability of rejecting the null hypothesis when it is false
\( \beta \) is the probability of accepting the null hypothesis when it is false (Type II error)
\( \beta \) should be as small as possible
but the smaller we make \( \beta \) (reducing Type II error), the larger the probability of a Type I error
We can use this, the variance (\( s^2 \)) and the smallest difference we want to detect (\( \delta \)) to calculate the number of replicates required (n)
\[ n \approx \frac{8 \times s^2}{\delta^2} \]
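The rule of thumb above is simple enough to wrap in a function. In this sketch the denominator is read as the smallest difference we want to detect, which is an assumption because the slide does not define the symbol. The function name `replicates_needed` is hypothetical, and rounding up to a whole replicate is a choice made here.

```python
import math


def replicates_needed(variance, difference):
    """Rule-of-thumb replicates: n is roughly 8 * s^2 / difference^2.

    `difference` is assumed to be the smallest difference worth detecting.
    The result is rounded up because replicates come in whole numbers.
    """
    if difference == 0:
        raise ValueError("difference must be non-zero")
    return math.ceil(8 * variance / difference ** 2)


# Example: variance of 4, and we want to detect a difference of 2.
print(replicates_needed(variance=4, difference=2))
```

Halving the difference to be detected quadruples the number of replicates, because the difference is squared.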
Is my data normal?
Parametric (normal data) tests are more powerful than non-parametric tests
So, you should use parametric tests if you can
The simplest and best way of checking for normality is by looking at a plot
A slight wrinkle: it's not the data that needs to be normal, but rather the residuals.
Q-Q plots
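A sketch of why the residuals matter, using made-up data: two hypothetical groups (`control` and `treated`) with different means. The raw values pooled together are clearly not normal, but the residuals (each value minus its own group mean) are. The helper names and the use of a correlation as a stand-in for judging the plot by eye are assumptions for illustration.

```python
import random
from statistics import NormalDist, mean

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical data: two groups with different means, same spread.
control = [random.gauss(10, 1) for _ in range(200)]
treated = [random.gauss(20, 1) for _ in range(200)]

raw = control + treated

# A residual is each value minus the mean of its own group.
control_mean = mean(control)
treated_mean = mean(treated)
residuals = ([x - control_mean for x in control]
             + [x - treated_mean for x in treated])


def qq_points(values):
    """Pair each standard normal quantile with the matching sorted value."""
    ordered = sorted(values)
    n = len(ordered)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


def straightness(points):
    """Correlation of the Q-Q points; close to 1 means a straight line."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


raw_fit = straightness(qq_points(raw))
residual_fit = straightness(qq_points(residuals))
print(f"raw data:  {raw_fit:.3f}")
print(f"residuals: {residual_fit:.3f}")
```

Plotting the pairs from `qq_points` gives the Q-Q plot itself; points that fall on a straight line suggest normality.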
A slight detour: Residuals
What to do if your data isn't normal?
Transform it
Use non-parametric tests
Transforming data
Applying a mathematical function to make the data/residuals fit a normal distribution
What! Surely that's dodgy?
Is converting feet into metres dodgy?
You are just changing the scale on which the data is measured.
Lots of transformations, but we'll look at log
Log transformation
A log transformation stretches out the left-hand side (smaller values) of the distribution and squashes in the right-hand side (larger values). This is useful where the data set has a long tail to the right (right-skewed).
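A small sketch of the effect, using simulated right-skewed data. The data is drawn from a log-normal distribution purely for illustration, so the log transformation works perfectly here; real data will rarely be this tidy. The `skewness` helper is a hypothetical name for the usual third standardised moment.

```python
import math
import random
from statistics import mean, pstdev

random.seed(4)  # fixed seed so the sketch is repeatable

# Simulated right-skewed data: all values positive, long tail to the right.
raw = [random.lognormvariate(0, 1) for _ in range(10_000)]

# The transformation: take the natural log of every value.
logged = [math.log(x) for x in raw]


def skewness(values):
    """Third standardised moment: about 0 if symmetric, positive if right-skewed."""
    m = mean(values)
    sd = pstdev(values)
    return mean(((x - m) / sd) ** 3 for x in values)


print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(logged):.2f}")
```

Note that the log is only defined for positive values, so data containing zeros or negatives needs a different approach.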
Log transformation
Non-parametric tests
Usually based on ranks
Why is that less powerful?
Think about the values 5, 10, 1000
As ranks they become 1, 2, 3: the information about how far apart the values are is lost
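The loss of information is easy to see in code. This is a minimal sketch with a hypothetical `ranks` helper that ignores ties, which real rank-based tests have to handle.

```python
def ranks(values):
    """Rank each value from smallest (1) to largest. Assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


print(ranks([5, 10, 1000]))  # [1, 2, 3]
print(ranks([5, 10, 11]))    # [1, 2, 3]
```

Two very different data sets give identical ranks, so a test that only sees the ranks cannot tell them apart.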
Next Week
The wait is over, let's do some statistical tests including