Lecture 3 Normality and transformations

Eamonn Mallon
07/08/2019

The story so far

  • We learned about why we should do data analysis and some preliminary things we need to know (lecture 1)
  • We learned about descriptive statistics and visualizing data (lecture 2)
  • It's not until the next lecture that we finally get to do a statistical analysis (does a affect b? is c different from d?)
  • In this lecture, we need to learn about an important concept in statistics called Normality

Structure of today's lecture

  • What is normality?
  • Why is it important?
  • Is my data normal or not?
  • A quick reminder of power
  • What to do if your data is not normal

Normality

The normal distribution

Figure: the normal distribution

So, why should I care?

  • A lot of real-world data fits a normal distribution
    • To see why, we need to look at the central limit theorem
  • We know a lot about its properties, so our statistical analysis is simplified but still powerful

Why does real world data so often fit the normal distribution?


  • The central limit theorem
  • If you take repeated samples from a population and calculate their averages, these averages will be approximately normally distributed, and the approximation improves as the samples get larger.
  • An example involving craps

Craps: a dice game

 [1]  2  3  3  4  4  4  5  5  5  5  6  6  6  6  6  7  7  7  7  7  7  8  8
[24]  8  8  8  9  9  9  9 10 10 10 11 11 12

Figure: distribution of scores from 10,000 throws of two dice

  • Two dice, add up the score
  • Only one way of getting 2 (1 + 1), but six ways of getting 7
  • Let's throw the two dice 10,000 times. What distribution do we get?
    • Not quite normal
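
The dice experiment above can be tried out with a short simulation. This is only a sketch using Python's standard library: the function name `throw_two_dice`, the seed and the text histogram are choices made for this illustration, not part of the lecture.

```python
import random
from collections import Counter

def throw_two_dice(rng):
    """Score of one game: the sum of two fair six-sided dice."""
    return rng.randint(1, 6) + rng.randint(1, 6)

rng = random.Random(1)  # fixed seed (arbitrary) so the run is repeatable
scores = [throw_two_dice(rng) for _ in range(10_000)]
counts = Counter(scores)

# Crude text histogram: one '#' for every 50 games with that score
for score in range(2, 13):
    print(f"{score:2d} {'#' * (counts[score] // 50)}")
```

The bars should rise in a straight line up to 7 and fall away again: a triangle rather than a bell, which is why the result is "not quite normal".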

Average the score over three games

Figure: distribution of the average score over three games, repeated 10,000 times

  • The last slide was one game repeated 10,000 times
  • Now, let's average over three games, 10,000 times
  • Much closer to a normal distribution: the central limit theorem in action
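
Averaging over three games can be simulated in the same way. Again this is a sketch under assumptions: the helper names `game` and `mean_of_games` and the seed are invented for the example.

```python
import random
import statistics

def game(rng):
    """One game of craps as described here: the sum of two fair dice."""
    return rng.randint(1, 6) + rng.randint(1, 6)

def mean_of_games(rng, n_games=3):
    """Average score over n_games independent games."""
    return statistics.mean(game(rng) for _ in range(n_games))

rng = random.Random(1)  # fixed seed (arbitrary) so the run is repeatable
averages = [mean_of_games(rng) for _ in range(10_000)]

# The averages should centre on 7 and be less spread out than single games
print(round(statistics.mean(averages), 2))
print(round(statistics.stdev(averages), 2))
```

Raising `n_games` should make a histogram of `averages` look more and more like a bell curve, and narrower too.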

But why does real world data often fit a normal distribution?

  • The central limit theorem
    • If you take repeated samples from a population and calculate their averages, these averages will be approximately normally distributed.
  • Real-world data is often the result of numerous processes interacting, so each value behaves like an average of many small effects
    • Think of all the reasons that you are the height you are.

The normal distribution is well studied


  • It's symmetrical, so half of the values are below the mean and half are above
  • We know how much of the distribution lies in each part, e.g. about 16% of values will be more than 1 standard deviation above the mean.
  • Because we know this, we can work out the range of values expected by chance alone. The fact that 95% (remember 0.05?) of values lie within 1.96 standard deviations of the mean is often used in statistical tests.
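
The 16% and 95% figures can be checked directly. The sketch below uses `statistics.NormalDist` from Python's standard library (Python 3.8 or later); the variable names are chosen for this example.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # the standard normal distribution

# Proportion of values more than 1 standard deviation above the mean
above_one_sd = 1 - z.cdf(1)

# Proportion of values within 1.96 standard deviations of the mean
within_196 = z.cdf(1.96) - z.cdf(-1.96)

# Going the other way: the cut-off that leaves 2.5% in the upper tail
cutoff = z.inv_cdf(0.975)

print(round(above_one_sd, 3), round(within_196, 3), round(cutoff, 2))
```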

Is my data normal?

A slight detour: Power (from lecture 1)

  • The power of a test is the probability of rejecting the null hypothesis when it is false
  • \( \beta \) is the probability of accepting the null hypothesis when it is false (Type II error)
  • \( \beta \) should be as small as possible
  • But, for a fixed sample size, the smaller we make \( \beta \) (reducing Type II error), the larger the probability of a Type I error (\( \alpha \))
  • Compromise: \( \alpha = 0.05 \) and \( \beta = 0.2 \)
  • Power is then \( 1 - \beta = 0.8 \)
  • We can use this and the variance (\( s^2 \)) to calculate the number of replicates required (\( n \)), where \( \delta \) is the size of the difference we want to detect: \[ n \approx \frac{8 \times s^2}{\delta^2} \]
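
As a rough worked example of the rule of thumb above, here is a sketch in Python. It reads the denominator as the squared size of the difference you want to detect, which is an assumption about the notation. The function name `replicates_needed` and the numbers are hypothetical.

```python
import math

def replicates_needed(variance, difference):
    """Rule-of-thumb number of replicates: n is roughly 8 * s^2 / difference^2.

    Rounded up, because you cannot run a fraction of a replicate.
    """
    return math.ceil(8 * variance / difference ** 2)

# Hypothetical numbers: variance of 4, and we want to detect a difference of 2
print(replicates_needed(variance=4.0, difference=2.0))  # 8

# Halving the difference we want to detect quadruples the replicates needed
print(replicates_needed(variance=4.0, difference=1.0))  # 32
```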

Is my data normal?


  • Parametric tests (which assume normality) are more powerful than non-parametric tests
  • So, you should use parametric tests if you can
  • The simplest and best way of checking for normality is by looking at a plot of the data
  • A slight wrinkle: it's not the data that needs to be normal, but rather the residuals.

qq plots

Figure: Q-Q plot
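
A Q-Q plot pairs each sorted observation with the value a normal distribution would be expected to give at the same position. If the data are normal, the points fall close to a straight line. The sketch below computes those points with Python's standard library. The helper names, the simulated samples and the `straightness` summary (a plain correlation) are inventions for this illustration; in practice you would simply look at the plot.

```python
import random
from statistics import NormalDist, mean

def qq_points(data):
    """Pair each sorted observation with the standard normal quantile
    expected at the same position in a sample of this size."""
    n = len(data)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(data)))

def straightness(points):
    """Pearson correlation of the Q-Q points: near 1 means a straight line."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)  # fixed seed (arbitrary) so the run is repeatable
normal_sample = [rng.gauss(10, 2) for _ in range(500)]      # simulated normal data
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]  # simulated right-skewed data

print(round(straightness(qq_points(normal_sample)), 3))
print(round(straightness(qq_points(skewed_sample)), 3))
```

The normal sample should give a value very close to 1, while the skewed sample bends away from the line and gives a lower value.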

A slight detour: Residuals

Figure: residuals
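
A residual is the difference between an observed value and the value the model predicts for it. In the simplest case, comparing two groups, the prediction is just the group mean. The sketch below uses made-up measurements and invented group names purely for illustration.

```python
from statistics import mean

# Hypothetical measurements for two groups (made-up numbers)
groups = {
    "control": [4.1, 5.0, 4.6, 5.3],
    "treated": [6.2, 7.1, 6.8, 7.5],
}

residuals = []
for name, values in groups.items():
    group_mean = mean(values)  # the model's prediction for this group
    # residual = observed value minus predicted value
    residuals.extend(v - group_mean for v in values)

print([round(r, 2) for r in residuals])
```

It is these residuals, pooled across groups, that you would check for normality.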

What to do if your data isn't normal?

  • Transform it
  • Use non-parametric tests

Transforming data

  • Applying a mathematical function to make the data/residuals fit a normal distribution
  • What! Surely that's dodgy?
    • Is converting feet into metres dodgy?
    • You are just changing the scale on which the data is measured.
  • Lots of transformations, but we'll look at log

Log transformation

Figure: log transformation

  • A log transformation stretches out the left-hand side (smaller values) of the distribution and squashes in the right-hand side (larger values). This is useful where the data set has a long tail to the right (right-skewed).
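
The effect of a log transformation on right-skewed data can be seen in a small simulation. This is a sketch: the data are simulated from a log-normal distribution (chosen here because its logs are exactly normal, the best case for this transformation), and the `skewness` helper is written for this example.

```python
import math
import random
from statistics import mean, pstdev

def skewness(values):
    """Sample skewness: about 0 for symmetric data, positive for a long right tail."""
    m, s = mean(values), pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)

rng = random.Random(1)  # fixed seed (arbitrary) so the run is repeatable
raw = [rng.lognormvariate(0, 1) for _ in range(5_000)]  # simulated right-skewed data
logged = [math.log(v) for v in raw]                     # the same data on a log scale

print(round(skewness(raw), 2))     # clearly positive: long right tail
print(round(skewness(logged), 2))  # close to 0: roughly symmetric
```

Note that a log transformation only works for values greater than zero.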

Log transformation

Figures: log transformation (two plots)

Non-parametric tests

  • Usually based on ranks
  • Why is that less powerful?
    • Think about 5, 10, 1000
    • That becomes the ranks 1, 2, 3, so the information about how far apart the values are is lost
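
The loss of information when values are replaced by ranks can be shown in a few lines. The `ranks` function below is a minimal sketch written for this example. It assumes there are no tied values, which real rank-based tests have to handle.

```python
def ranks(values):
    """Rank each value from smallest (1) to largest, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

print(ranks([5, 10, 1000]))  # [1, 2, 3]
print(ranks([5, 10, 11]))    # [1, 2, 3]
```

Two very different data sets give identical ranks, so a test based on ranks cannot tell them apart.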

Next Week

The wait is over, let's do some statistical tests, including

  • t-test
  • Wilcoxon's test
  • Two types of correlations
  • chi-squared test