Lecture 3 Normality and transformations

Eamonn Mallon
07/08/2019

The story so far

We learned why we should do data analysis and some preliminary things we need to know (lecture 1)
We learned about descriptive statistics and visualizing data (lecture 2)
It's not until the next lecture that we finally get to do a statistical analysis (does a affect b? is c different from d?)
In this lecture, we need to learn about an important concept in statistics called Normality

Structure of today's lecture

What is normality?
Why is it important?
Is my data normal or not?
A quick reminder of power
What to do if your data is not normal

Normality

The normal distribution

[Figure: the normal distribution]

So, why should I care?

A lot of real-world data fits a normal distribution
- To see why, we'll need to look at the central limit theorem
We know a lot about its properties, which keeps our statistical analysis simple but still powerful

Why does real world data so often fit the normal distribution?

The central limit theorem
If you take repeated samples from a population and calculate their averages, then these averages will be approximately normally distributed, and the approximation improves as the samples get larger.
An example involving craps

Craps: a dice game

 [1]  2  3  3  4  4  4  5  5  5  5  6  6  6  6  6  7  7  7  7  7  7  8  8
[24]  8  8  8  9  9  9  9 10 10 10 11 11 12

Two dice, add up the score
Only one way of getting 2, lots of ways of getting 7
Let's throw the two dice 10,000 times. What distribution do we get?
- Not quite normal
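
The counts on this slide can be checked by listing all 36 outcomes of two dice and then simulating the 10,000 throws. This is a minimal Python sketch using only the standard library; the seed and the variable names are illustrative assumptions, not part of the lecture.

```python
import random
from collections import Counter
from itertools import product

# Enumerate all 36 equally likely outcomes of two six-sided dice.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Simulate 10,000 throws of two dice (the seed is arbitrary, for repeatability).
rng = random.Random(1)
throws = [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(10_000)]
observed = Counter(throws)

# For each total: number of ways to get it, and how often it was thrown.
for total in range(2, 13):
    print(total, ways[total], observed[total])
```

The number of ways rises in a straight line from 2 up to 7 and falls again to 12, which gives a triangular shape rather than a bell curve.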

Average the score over three games

The last slide was one game repeated 10,000 times
Now, let's average over three games, 10,000 times
An approximately normal distribution: the central limit theorem in action
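
The averaging step can be sketched in the same way. This Python example (standard library only) assumes a "game" is one throw of two dice; the seed and the function name `game` are made up for illustration.

```python
import random
import statistics

rng = random.Random(1)  # arbitrary seed, for repeatability

def game():
    """Score of one game: the sum of two six-sided dice."""
    return rng.randint(1, 6) + rng.randint(1, 6)

# Average the score over three games, and repeat that 10,000 times.
averages = [statistics.mean(game() for _ in range(3)) for _ in range(10_000)]

print(statistics.mean(averages))   # close to 7, the mean of a single game
print(statistics.stdev(averages))  # smaller than the spread of a single game
```

A histogram of `averages` would show the bell shape; averaging over more than three games should bring it closer still to a normal distribution.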

But why does real world data often fit a normal distribution?

The central limit theorem
- If you take repeated samples from a population and calculate their averages, then these averages will be approximately normally distributed.
Real-world data is often the result of numerous processes acting together, so each value behaves like an average
- Think of all the reasons that you are the height you are.

The normal distribution is well studied

It's symmetrical, so half of the values lie below the mean and half above
We know how much of the distribution falls in each region, e.g. about 16% of values will be more than 1 standard deviation above the mean.
Because we know this, we can work out which values are expected by chance alone. The fact that 95% (remember 0.05?) of values lie within 1.96 standard deviations of the mean is often used in statistical tests.
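
These proportions can be checked against the standard normal distribution. The sketch below uses Python's standard-library `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

below_mean = z.cdf(0)                     # proportion below the mean
above_1sd = 1 - z.cdf(1)                  # proportion more than 1 SD above the mean
within_196 = z.cdf(1.96) - z.cdf(-1.96)   # proportion within 1.96 SD of the mean

print(below_mean)            # 0.5
print(round(above_1sd, 3))   # 0.159
print(round(within_196, 3))  # 0.95
```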

Is my data normal?

A slight detour: Power (from lecture 1)

The power of a test is the probability of rejecting the null hypothesis when it is false
\( \beta \) is the probability of accepting the null hypothesis when it is false (Type II error)
\( \beta \) should be as small as possible
but, for a fixed sample size, the smaller we make \( \beta \) (reducing Type II error), the larger the probability of a Type I error (\( \alpha \))
Compromise: \( \alpha = 0.05 \) and \( \beta = 0.2 \)
Power is then \( 1 - \beta = 0.8 \)
We can use this and the variance (\( s^2 \)) to calculate the number of replicates required (n), where \( \delta \) is the size of the difference we want to detect: \[ n \approx \frac{8 \times s^2}{\delta^2} \]
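
As a worked example of this rule of thumb, here is a small Python sketch. The function name `replicates_needed` and the numbers are hypothetical; `delta` stands for the difference we want to detect, which is the quantity squared in the denominator of the formula.

```python
import math

def replicates_needed(variance, delta):
    """Rule of thumb n ≈ 8 * s^2 / delta^2 (alpha = 0.05, power = 0.8)."""
    # Round up, since we cannot run a fraction of a replicate.
    return math.ceil(8 * variance / delta ** 2)

# Hypothetical numbers: a variance of 10 and a difference of 2 to detect.
print(replicates_needed(variance=10, delta=2))  # 20
```

Halving the difference to be detected quadruples the number of replicates, because the difference is squared in the denominator.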

Is my data normal?

Parametric (normal data) tests are more powerful than non-parametric tests when their assumptions are met
So, you should use parametric tests if you can
The simplest and best way of checking for normality is by looking at a plot of the data
A slight wrinkle: it's not the data itself that needs to be normal, but rather the residuals.
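
To make the idea of residuals concrete, the sketch below computes them for two groups and pairs them with the quantiles a normal distribution would give, which are the points a QQ plot displays. It is a Python illustration with made-up numbers, not the code behind the lecture's figures; the plotting positions `(i + 0.5) / n` are one common choice and are an assumption here.

```python
from statistics import NormalDist, mean

# Hypothetical measurements for two groups (made-up numbers).
groups = {
    "control": [4.1, 5.0, 5.6, 4.7, 5.3],
    "treated": [6.2, 7.1, 6.6, 7.9, 6.9],
}

# Residual = observation minus the mean of its own group.
residuals = sorted(
    x - mean(values) for values in groups.values() for x in values
)

# Theoretical quantiles of a standard normal at evenly spaced probabilities.
n = len(residuals)
theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# Each (theoretical, residual) pair is one point on a QQ plot.
for q, r in zip(theoretical, residuals):
    print(round(q, 2), round(r, 2))
```

If the residuals are roughly normal, these points fall close to a straight line; strong curvature suggests skew or heavy tails.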

qq plots

[Figure: QQ plot]

A slight detour: Residuals

[Figure: residuals]

What to do if your data isn't normal?

Transform it
Use non-parametric tests

Transforming data

Applying a mathematical function to make the data/residuals fit a normal distribution
What! Surely that's dodgy?
- Is converting feet into metres?
- You are just changing the scale on which the data is measured.
There are lots of transformations, but we'll look at the log

Log transformation

A log transformation stretches out the left-hand side (smaller values) of the distribution and squashes in the right-hand side (larger values). This is useful where the data set has a long tail to the right (right skewed).
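
The effect can be seen on simulated data. This Python sketch (standard library only) generates made-up right-skewed values and compares their skewness before and after taking logs; the helper `skewness` and the choice of a lognormal sample are assumptions for illustration.

```python
import math
import random
import statistics

rng = random.Random(1)  # arbitrary seed, for repeatability

# Made-up right-skewed data: lognormal values have a long tail to the right.
raw = [rng.lognormvariate(0, 1) for _ in range(10_000)]
logged = [math.log(x) for x in raw]

def skewness(values):
    """Sample skewness: about 0 for symmetric data, positive for right skew."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return statistics.mean(((x - m) / s) ** 3 for x in values)

print(skewness(raw))     # clearly positive: long right tail
print(skewness(logged))  # near 0: roughly symmetric after the log
```

Note that the log is only defined for positive values, so data containing zeros or negative numbers would need a different approach.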

Non-parametric tests

Usually based on ranks
Why is that less powerful?
- Think about the values 5, 10 and 1000
- As ranks they become 1, 2, 3, so the size of the gap up to 1000 is lost
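
A short Python sketch of the ranking step shows what information is thrown away. The second data set, `other`, is a made-up comparison.

```python
values = [5, 10, 1000]

# Rank 1 goes to the smallest value, rank 2 to the next, and so on.
# (No tied values here, so no tie handling is needed in this sketch.)
order = sorted(values)
ranks = [order.index(v) + 1 for v in values]
print(ranks)  # [1, 2, 3]

# A very different data set gives exactly the same ranks.
other = [5, 10, 11]
other_ranks = [sorted(other).index(v) + 1 for v in other]
print(other_ranks)  # [1, 2, 3]
```

A rank-based test cannot tell these two data sets apart, which is why it has less power than a parametric test when the parametric assumptions hold.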

Next Week

The wait is over: let's do some statistical tests, including

t-test
Wilcoxon's test
Two types of correlations
chi-squared test