Normality and Hypotheses

Eamonn Mallon

2025-10-30

Normality

The story so far

  • We learned about why we should do data analysis and some preliminary things we need to know
  • We learned about descriptive statistics and visualizing data
  • In this lecture, we need to learn about an important concept in statistics called Normality

The normal distribution

So, why should I care?

  • A lot of real world data fits a normal distribution
  • To know why we’ll need to look at the central limit theorem
  • We know a lot about its properties, meaning our statistical analysis is simplified but still powerful

Real world data is often normal?

Why does real world data so often fit the normal distribution?

  • The central limit theorem
  • If you take repeated samples from a population and calculate their averages, these averages will be approximately normally distributed.
  • An example involving craps

Craps: a dice game

The 36 possible totals of two dice (R output):

 [1]  2  3  3  4  4  4  5  5  5  5  6  6  6  6  6  7  7  7  7  7  7  8  8  8  8
[26]  8  9  9  9  9 10 10 10 11 11 12

  • Two dice, add up the score
  • Only one way of getting 2, lots of ways of getting 7
  • Let’s throw the two dice 10,000 times; what distribution do we get? (sketched below)
    • Not quite normal
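
A minimal R sketch of this simulation (the code and variable names are mine, not from the lecture):

    # Throw two dice 10,000 times and record the total of each game
    set.seed(1)                                       # for reproducibility
    scores <- replicate(10000, sum(sample(1:6, 2, replace = TRUE)))
    barplot(table(scores))                            # triangular, not quite normal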

Average the score over three games

  • Last slide was one game repeated 10,000 times
  • Now, let’s average over three games, 10,000 times (see the sketch below)
  • A normal distribution, the central limit theorem in action
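
A sketch of the same simulation, now averaging three games per trial:

    # Average the score of three games, repeated 10,000 times
    set.seed(1)
    one_game <- function() sum(sample(1:6, 2, replace = TRUE))
    means <- replicate(10000, mean(replicate(3, one_game())))
    hist(means)   # bell-shaped: the central limit theorem in action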

But why does real world data often fit a normal distribution?

  • The central limit theorem
    • If you take repeated samples from a population and calculate their averages, these averages will be approximately normally distributed.
  • Real world data is often the result of numerous interacting processes, that is, effectively an average of many small effects
    • Think of all the reasons that you are the height you are.

The normal distribution is well studied

  • It’s symmetrical, so half of the values lie below the mean and half above
  • We know how much of the distribution lies in various regions, e.g. about 16% of samples will be more than 1 standard deviation above the mean
  • Because we know this, we can work out the values that could be predicted by chance alone. The fact that 95% (remember 0.05?) of values lie within 1.96 standard deviations of the mean is often used in statistical tests (see the sketch below).
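
These properties can be checked directly in R; a quick sketch for the standard normal distribution:

    pnorm(1, lower.tail = FALSE)   # ~0.16: about 16% of values are more than 1 SD above the mean
    qnorm(0.975)                   # ~1.96: 95% of values lie within 1.96 SDs of the mean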

Is my data normal?

  • Parametric (normal data) tests are more powerful than non-parametric
  • So, you should use parametric tests if you can
  • The simplest and best way of checking is by looking at plots of the data (see the sketch after this list)
  • A slight wrinkle: it’s not the data itself that must be normal but rather the residuals
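
A minimal sketch of “looking”, using made-up data that is normal by construction:

    x <- rnorm(100, mean = 10, sd = 2)   # simulated data
    hist(x)                              # roughly bell-shaped?
    qqnorm(x)                            # normal quantile-quantile plot
    qqline(x)                            # points should lie close to this line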

A slight detour: Residuals
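
A rough sketch of what residuals are, using a made-up straight-line model:

    x <- 1:20
    y <- 3 + 2 * x + rnorm(20)   # made-up linear data with normal noise
    fit <- lm(y ~ x)             # fit a linear model
    res <- residuals(fit)        # residual = observed y minus fitted y
    hist(res)                    # it is these that should look normal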

What to do if your data isn’t normal?

  • Transform it
  • Use non-parametric tests

Transforming data

  • Applying a mathematical function to make the data/residuals fit a normal distribution
  • What! Surely that’s dodgy?
  • Is converting feet into metres dodgy?
  • You are just changing the scale on which the data is measured
  • There are lots of transformations, but we’ll look at the log

Log transformation

  • A log-transformation stretches out the left-hand side (smaller values) of the distribution and squashes in the right-hand side (larger values). This is obviously useful where the data set has a long tail to the right (right-skewed), as sketched below.
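
A sketch with made-up right-skewed data:

    skewed <- rlnorm(1000)   # log-normal data: a long tail to the right
    hist(skewed)             # right-skewed
    hist(log(skewed))        # roughly normal after the transformation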

Log transformation

Non-parametric tests

  • Usually based on ranks
  • Why is that less powerful?
  • Think about 5, 10, 1000
  • That becomes 1, 2, 3 (see the sketch below)
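
The same example in R:

    rank(c(5, 10, 1000))   # 1 2 3: the size of the gap between 10 and 1000 is lost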

Asking a question in statistics

Everything varies (Separating signal from noise)

  • Think about height
  • We need a way of discriminating between variation that is scientifically interesting and variation that just represents background heterogeneity
  • Key concept: the amount of variation that we would expect to occur by chance alone
  • When we find a difference bigger than this, we say it is statistically significant (a result unlikely to have occurred by chance alone)

Good and bad hypotheses

A good hypothesis must be capable of rejection (Popper)

  1. There are vultures in the park
  2. There are no vultures in the park

Absence of evidence is not evidence of absence: hypothesis 1 can never be rejected (failing to spot a vulture proves nothing), whereas a single sighting rejects hypothesis 2, so only hypothesis 2 is a good hypothesis.

Null hypotheses

The null hypothesis says nothing is happening

  • when comparing two samples’ means, the null hypothesis is that the two means are the same
  • when looking at a graph of y against x, the null hypothesis is that y is independent of x

p Values

  • The p value is an estimate of the probability that a result as extreme as, or more extreme than, the one observed could occur by chance, if the null hypothesis were true
  • Conventionally, we reject the null hypothesis when p < 0.05 (see the sketch below)
  • We can wrongly reject the null hypothesis when it is true (Type I error)
  • We can wrongly accept the null hypothesis when it is false (Type II error)
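
A sketch of a Type I error in action, using made-up samples for which the null hypothesis is true by construction:

    # Two samples drawn from the same population, so the null hypothesis is true
    set.seed(1)
    a <- rnorm(20)
    b <- rnorm(20)
    t.test(a, b)$p.value   # usually > 0.05; < 0.05 about 5% of the time (a Type I error)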

Power

  • The power of a test is the probability of rejecting the null hypothesis when it is false
  • \(\beta\) is the probability of accepting the null hypothesis when it is false (Type II error)
  • \(\beta\) should be as small as possible
  • but the smaller we make \(\beta\) (reducing Type II error), the larger the probability of a Type I error
  • Compromise \(\alpha = 0.05\) and \(\beta = 0.2\)
  • power is \(1 - \beta = 0.8\)
  • can use this and the variance (\(s^2\)) to calculate the number of replicates (n) required to detect a difference \(\delta\) (see the sketch below) \[ n \approx \frac{8 \times s^2}{\delta^2} \]
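
A sketch comparing this rule of thumb with R’s built-in power.t.test (the numbers are made up; the one-sample form of the test is used because the \(n \approx 8s^2/\delta^2\) approximation corresponds to it):

    s <- 2; delta <- 1.5   # made-up standard deviation and difference to detect
    8 * s^2 / delta^2      # rule-of-thumb n, ~14
    power.t.test(delta = delta, sd = s, sig.level = 0.05,
                 power = 0.8, type = "one.sample")   # exact calculation, n ~ 16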

BS1040 · Lecture 3