Normality and Hypotheses

Eamonn Mallon

2025-10-30

Normality

The story so far

  • We learned about why we should do data analysis and some preliminary things we need to know
  • We learned about descriptive statistics and visualizing data
  • In this lecture, we need to learn about an important concept in statistics called Normality

The normal distribution

So, why should I care?

  • A lot of real world data fits a normal distribution
  • To know why we’ll need to look at the central limit theorem
  • We know a lot about its properties, meaning our statistical analysis is simplified but still powerful

Real world data is often normal?

Why does real world data so often fit the normal distribution?

  • The central limit theorem
  • If you take repeated samples from a population and calculate their averages, these averages will be approximately normally distributed.
  • An example involving craps

Craps: a dice game

The 36 possible totals of two dice (R output):

 [1]  2  3  3  4  4  4  5  5  5  5  6  6  6  6  6  7  7  7  7  7  7  8  8  8  8
[26]  8  9  9  9  9 10 10 10 11 11 12

  • Two dice, add up the score
  • Only one way of getting 2, lots of ways of getting 7
  • Let’s throw the two dice 10,000 times; what distribution do we get? (sketched below)
    • Not quite normal
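
A minimal R sketch of this simulation (the code and variable names are mine, not from the lecture):

    # Throw two dice 10,000 times and record the total of each game
    set.seed(1)                                       # for reproducibility
    scores <- replicate(10000, sum(sample(1:6, 2, replace = TRUE)))
    barplot(table(scores))                            # triangular, not quite normal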

Average the score over three games

  • Last slide was one game repeated 10,000 times
  • Now, let’s average over three games, 10,000 times (see the sketch below)
  • A normal distribution, the central limit theorem in action
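
A sketch of the same simulation, now averaging three games per trial:

    # Average the score of three games, repeated 10,000 times
    set.seed(1)
    one_game <- function() sum(sample(1:6, 2, replace = TRUE))
    means <- replicate(10000, mean(replicate(3, one_game())))
    hist(means)   # bell-shaped: the central limit theorem in action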

But why does real world data often fit a normal distribution?

  • The central limit theorem
    • If you take repeated samples from a population and calculate their averages, these averages will be approximately normally distributed.
  • Real world data is often the result of numerous interacting processes, that is, effectively an average of many small effects
    • Think of all the reasons that you are the height you are.

The normal distribution is well studied

  • It’s symmetrical, so half of the values lie below the mean and half above
  • We know how much of the distribution lies in various regions, e.g. about 16% of samples will be more than 1 standard deviation above the mean
  • Because we know this, we can work out the values that could be predicted by chance alone. The fact that 95% (remember 0.05?) of values lie within 1.96 standard deviations of the mean is often used in statistical tests (see the sketch below).
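
These properties can be checked directly in R; a quick sketch for the standard normal distribution:

    pnorm(1, lower.tail = FALSE)   # ~0.16: about 16% of values are more than 1 SD above the mean
    qnorm(0.975)                   # ~1.96: 95% of values lie within 1.96 SDs of the mean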

Is my data normal?

  • Parametric (normal data) tests are more powerful than non-parametric
  • So, you should use parametric tests if you can
  • The simplest and best way of checking is by looking at plots of the data (see the sketch after this list)
  • A slight wrinkle: it’s not the data itself that must be normal but rather the residuals
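
A minimal sketch of “looking”, using made-up data that is normal by construction:

    x <- rnorm(100, mean = 10, sd = 2)   # simulated data
    hist(x)                              # roughly bell-shaped?
    qqnorm(x)                            # normal quantile-quantile plot
    qqline(x)                            # points should lie close to this line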

A slight detour: Residuals
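
A rough sketch of what residuals are, using a made-up straight-line model:

    x <- 1:20
    y <- 3 + 2 * x + rnorm(20)   # made-up linear data with normal noise
    fit <- lm(y ~ x)             # fit a linear model
    res <- residuals(fit)        # residual = observed y minus fitted y
    hist(res)                    # it is these that should look normal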

What to do if your data isn’t normal?

  • Transform it
  • Use non-parametric tests

Transforming data

  • Applying a mathematical function to make the data/residuals fit a normal distribution
  • What! Surely that’s dodgy?
  • Is converting feet into metres dodgy?
  • You are just changing the scale on which the data is measured
  • There are lots of transformations, but we’ll look at the log

Log transformation

  • A log-transformation stretches out the left-hand side (smaller values) of the distribution and squashes in the right-hand side (larger values). This is obviously useful where the data set has a long tail to the right (right-skewed), as sketched below.
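
A sketch with made-up right-skewed data:

    skewed <- rlnorm(1000)   # log-normal data: a long tail to the right
    hist(skewed)             # right-skewed
    hist(log(skewed))        # roughly normal after the transformation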

Log transformation

Non-parametric tests

  • Usually based on ranks
  • Why is that less powerful?
  • Think about 5, 10, 1000
  • That becomes 1, 2, 3 (see the sketch below)
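
The same example in R:

    rank(c(5, 10, 1000))   # 1 2 3: the size of the gap between 10 and 1000 is lost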

Asking a question in statistics

Everything varies (Separating signal from noise)

  • Think about height
  • We need a way of discriminating between variation that is scientifically interesting and variation that just represents background heterogeneity
  • Key concept: the amount of variation that we would expect to occur by chance alone
  • When we find a difference bigger than this, we say it is statistically significant (a result unlikely to have occurred by chance alone)

Good and bad hypotheses

A good hypothesis must be capable of rejection (Popper)

  1. There are vultures in the park
  2. There are no vultures in the park

Absence of evidence is not evidence of absence: hypothesis 1 can never be rejected (failing to spot a vulture proves nothing), whereas a single sighting rejects hypothesis 2, so only hypothesis 2 is a good hypothesis.

Null hypotheses

The null hypothesis says nothing is happening

  • when comparing two samples’ means, the null hypothesis is that the two means are the same
  • when looking at a graph of y against x, the null hypothesis is that y is independent of x

p Values

  • The p value is an estimate of the probability that a result as extreme as, or more extreme than, the one observed could occur by chance, if the null hypothesis were true
  • Conventionally, we reject the null hypothesis when p < 0.05 (see the sketch below)
  • We can wrongly reject the null hypothesis when it is true (Type I error)
  • We can wrongly accept the null hypothesis when it is false (Type II error)
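
A sketch of a Type I error in action, using made-up samples for which the null hypothesis is true by construction:

    # Two samples drawn from the same population, so the null hypothesis is true
    set.seed(1)
    a <- rnorm(20)
    b <- rnorm(20)
    t.test(a, b)$p.value   # usually > 0.05; < 0.05 about 5% of the time (a Type I error)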

Power

  • The power of a test is the probability of rejecting the null hypothesis when it is false
  • \(\beta\) is the probability of accepting the null hypothesis when it is false (Type II error)
  • \(\beta\) should be as small as possible
  • but the smaller we make \(\beta\) (reducing Type II error), the larger the probability of a Type I error
  • Compromise \(\alpha = 0.05\) and \(\beta = 0.2\)
  • power is \(1 - \beta = 0.8\)
  • can use this and the variance (\(s^2\)) to calculate the number of replicates (n) required to detect a difference \(\delta\) (see the sketch below) \[ n \approx \frac{8 \times s^2}{\delta^2} \]
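
A sketch comparing this rule of thumb with R’s built-in power.t.test (the numbers are made up; the one-sample form of the test is used because the \(n \approx 8s^2/\delta^2\) approximation corresponds to it):

    s <- 2; delta <- 1.5   # made-up standard deviation and difference to detect
    8 * s^2 / delta^2      # rule-of-thumb n, ~14
    power.t.test(delta = delta, sd = s, sig.level = 0.05,
                 power = 0.8, type = "one.sample")   # exact calculation, n ~ 16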

BS1040 · Lecture 3