Lecture 4 Normality

Eamonn Mallon
27/08/2020

The story so far

  • We learned about why we should do data analysis and some preliminary things we need to know (lecture 1)
  • We learned about descriptive statistics and visualizing data (lectures 2 and 3)
  • In this lecture, we need to learn about an important concept in statistics called Normality

Structure of normality lectures

This session

  • What is normality?
  • Why is it important?

Next session

  • Power and errors

Third session

  • Is your data normal or not?
  • What to do if your data is not normal

Normality

The normal distribution

[Figure: the normal distribution]

So, why should I care?

  • A lot of real-world data fits a normal distribution
    • To see why, we'll need to look at the central limit theorem
  • We know a lot about its properties, meaning our statistical analysis is simplified but still powerful

Why does real world data so often fit the normal distribution?

  • The central limit theorem
  • If you take repeated samples from a population and calculate their averages, these averages will be approximately normally distributed, and the approximation improves as the sample size grows.
  • An example involving craps

Craps: a dice game

 [1]  2  3  3  4  4  4  5  5  5  5  6  6  6  6  6  7  7  7  7  7  7  8  8  8  8
[26]  8  9  9  9  9 10 10 10 11 11 12

[Figure: distribution of scores from throwing two dice 10,000 times]

  • Two dice, add up the score
  • Only one way of getting 2, but six ways of getting 7
  • Let's throw the two dice 10,000 times. What distribution do we get?
    • Not quite normal
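
The dice game above can be sketched in a few lines of Python. This is only an illustrative simulation, not the code that produced the slide: the function name `play_craps` and the fixed seed are assumptions made for the example. It first lists all 36 equally likely totals, then throws the two dice 10,000 times and prints a rough text histogram.

```python
import random
from collections import Counter
from itertools import product

# All 36 equally likely outcomes of two six-sided dice, as sorted totals.
all_totals = sorted(a + b for a, b in product(range(1, 7), repeat=2))
ways = Counter(all_totals)  # number of ways of making each total

def play_craps(rng):
    """Throw two dice and return the total score (hypothetical helper)."""
    return rng.randint(1, 6) + rng.randint(1, 6)

rng = random.Random(1)  # fixed seed: an assumption, so the run is repeatable
scores = [play_craps(rng) for _ in range(10_000)]
counts = Counter(scores)

# Text histogram: one '#' per 100 throws.
for total in range(2, 13):
    print(f"{total:2d} {'#' * (counts[total] // 100)}")
```

The histogram peaks at 7 and falls away in straight lines on either side, which is why the result is triangular rather than bell-shaped.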

Average the score over three games

[Figure: distribution of the average score over three games, repeated 10,000 times]

  • Last slide was one game repeated 10,000 times
  • Now, let's average over three games, 10,000 times
  • Close to a normal distribution: the central limit theorem in action
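
Averaging over three games can be simulated in the same way. Again this is a sketch under assumptions (the helper names and the seed are made up for the example), not the code behind the slide. Each of the 10,000 values is now the mean score of three games rather than the score of one.

```python
import random
import statistics

def play_craps(rng):
    """Throw two dice and return the total score (hypothetical helper)."""
    return rng.randint(1, 6) + rng.randint(1, 6)

def average_score(rng, games=3):
    """Mean score over a number of games (three, as on the slide)."""
    return sum(play_craps(rng) for _ in range(games)) / games

rng = random.Random(1)  # fixed seed: an assumption, so the run is repeatable
averages = [average_score(rng) for _ in range(10_000)]

print("mean of the averages:", round(statistics.mean(averages), 2))
print("standard deviation:  ", round(statistics.stdev(averages), 2))

# Text histogram of the averages, rounded to the nearest whole score.
bins = {}
for a in averages:
    bins[round(a)] = bins.get(round(a), 0) + 1
for score in sorted(bins):
    print(f"{score:2d} {'#' * (bins[score] // 100)}")
```

The averages still centre on 7, but they are less spread out than single-game scores and the histogram takes on a bell shape.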

But why does real world data often fit a normal distribution?

  • The central limit theorem
    • If you take repeated samples from a population and calculate their averages, these averages will be approximately normally distributed.
  • Real-world data is often the result of numerous processes interacting, so each value behaves like an average (or sum) of many small effects
    • Think of all the reasons that you are the height you are.
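
To see how many small effects add up to something normal-looking, here is a toy simulation. It is not a model of real height: the trait, the 50 factors and their uniform distribution are all assumptions chosen for illustration. Each simulated value is the sum of 50 small independent effects, and we compare the share of values within 1 standard deviation of the mean with what a true normal distribution gives.

```python
import random
import statistics
from statistics import NormalDist

def trait_value(rng, n_factors=50):
    """Hypothetical trait: the sum of many small, independent effects."""
    return sum(rng.random() for _ in range(n_factors))

rng = random.Random(1)  # fixed seed: an assumption, so the run is repeatable
values = [trait_value(rng) for _ in range(10_000)]
mean = statistics.mean(values)
sd = statistics.stdev(values)

# Share of simulated values within 1 standard deviation of the mean,
# compared with the share for a true normal distribution.
observed = sum(abs(v - mean) <= sd for v in values) / len(values)
expected = NormalDist().cdf(1) - NormalDist().cdf(-1)
print(f"within 1 sd: observed {observed:.3f}, normal {expected:.3f}")
```

No single factor is normally distributed here, yet the two proportions should come out close to each other.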

The normal distribution is well studied

[Figure: the normal distribution]

  • It's symmetrical, so half of the values are below the mean and half are above
  • We know what proportion of the distribution lies in each part, e.g. about 16% of values will be more than 1 standard deviation above the mean.
  • Because we know this, we can work out the range of values we would expect by chance alone. The fact that 95% (remember 0.05?) of values lie within 1.96 standard deviations of the mean is often used in statistical tests.
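
The figures quoted above can be checked with `NormalDist` from Python's standard `statistics` module. This is just a quick sketch, and the variable names are made up for the example.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution: mean 0, standard deviation 1

# Proportion of values more than 1 standard deviation above the mean.
above_one_sd = 1 - z.cdf(1)

# Proportion of values within 1.96 standard deviations of the mean.
within_196 = z.cdf(1.96) - z.cdf(-1.96)

# Working backwards: the cut-off that leaves 2.5% in the upper tail.
cutoff = z.inv_cdf(0.975)

print(f"more than 1 sd above the mean: {above_one_sd:.4f}")  # 0.1587
print(f"within 1.96 sd of the mean:    {within_196:.4f}")    # 0.9500
print(f"cut-off for the middle 95%:    {cutoff:.2f}")        # 1.96
```

The remaining 5% is split equally between the two tails, 2.5% in each, which is where the 0.05 comes from.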

Is my data normal?