Lecture 7 Statistics: The Next Generation

Eamonn Mallon
31/01/2020

BS2004: Contemporary Techniques in Biological Data Analysis

  • Optional 15 credit second year course
  • 11 x 3-hour sessions plus 6 help sessions
  • Second semester

Do I not know all the stats?

  • What we taught you is a good basis
  • If your experiments are simple, those tests will be fine
  • More complex designs need multiway ANOVAs (with interactions), nested ANOVAs, ANCOVAs, etc.
  • You could learn these piecemeal as required, or ...

BS2004: Contemporary Techniques in Biological Data Analysis

  • Model formulae
  • General and generalised linear models
  • Geometrical approach

Model formulae

  • 50 male squirrels' weight, 50 female squirrels' weight
  • Does the weight of the squirrel depend on its sex?
  • Model formula: WEIGHT=SEX
  • In R:
WEIGHT~SEX
  • ~ means “depends on” (dependent variable on the left-hand side, independent variables on the right-hand side)
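
A minimal sketch of the squirrel example in R. The data are simulated: the means, the standard deviation and the names `squirrels` and `fit` are assumptions made purely for illustration, not real measurements.

```r
# Simulated data: the means and standard deviation below are made up
set.seed(1)
squirrels <- data.frame(
  SEX = factor(rep(c("female", "male"), each = 50)),
  WEIGHT = c(rnorm(50, mean = 320, sd = 25),   # 50 females
             rnorm(50, mean = 340, sd = 25))   # 50 males
)

# WEIGHT ~ SEX: weight (left-hand side) depends on sex (right-hand side)
fit <- lm(WEIGHT ~ SEX, data = squirrels)

coef(fit)    # intercept = female mean; SEXmale = male mean minus female mean
anova(fit)   # F test of whether weight depends on sex
```

The same formula notation carries over to more complicated models: extra independent variables are simply added to the right-hand side.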

General and generalised linear models

  • t-tests, ANOVA, ANCOVA, and regressions are types of General linear models
  • The difference between general and generalised linear models is simply how error is handled
    • General linear models assume errors are independent and follow a normal distribution
    • Generalised linear models can use a wide range of error distributions
    • i.e. your data doesn't have to be normal (bye bye non-parametric tests)
  • lm is the R command for General linear models
  • glm is the R command for Generalised linear models
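
A short sketch of both commands, using made-up data (the group labels, means and Poisson rates are assumptions for illustration). The first part shows that an equal-variance two-sample t-test and `lm` give the same p-value; the second fits count data with `glm` and Poisson errors, one of several possible error distributions.

```r
set.seed(2)
# Hypothetical data: two groups of 20, values made up
group <- factor(rep(c("A", "B"), each = 20))
y <- c(rnorm(20, mean = 10, sd = 2), rnorm(20, mean = 12, sd = 2))

# A two-sample t-test (equal variances) is a general linear model
t_res  <- t.test(y ~ group, var.equal = TRUE)
lm_res <- summary(lm(y ~ group))
p_t  <- t_res$p.value
p_lm <- lm_res$coefficients["groupB", "Pr(>|t|)"]
c(t_test = p_t, lm = p_lm)   # the two p-values agree

# Count data: a generalised linear model with Poisson errors
counts  <- c(rpois(20, lambda = 3), rpois(20, lambda = 6))
glm_fit <- glm(counts ~ group, family = poisson)
summary(glm_fit)
```

Note that the model formula is the same in both calls; only the error family changes.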

Geometrical approach

  • You can just learn to do tests without knowing how they work, but that is dangerous and unsatisfying
  • The usual way to explain why a test works is to give the mathematical proof
    • This maths isn't important in the everyday use of stats
    • The maths is not accessible to most users
  • So in BS2004 we are going to use a different approach

Geometrical approach

  • Any three points in n-dimensional space lie in a single plane, so they can be drawn in 2 dimensions
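
One way to check this numerically, as a sketch only: take three arbitrary points in 30-dimensional space (random values, chosen here for illustration) and place them in a 2-dimensional picture with the same pairwise distances. Classical multidimensional scaling (`cmdscale`) is just one convenient tool for doing this and is an assumption of this example, not necessarily the method used in the course.

```r
set.seed(3)
n <- 30
# Three arbitrary points in 30-dimensional space (random values, illustration only)
P <- rbind(A = rnorm(n), B = rnorm(n), C = rnorm(n))

d_high <- dist(P)                # pairwise distances in 30 dimensions

# Place the three points in 2 dimensions
xy <- cmdscale(d_high, k = 2)
d_flat <- dist(xy)               # pairwise distances in the 2-D picture

# The triangle is preserved: distances agree up to rounding error
all.equal(as.vector(d_high), as.vector(d_flat))
```

Because all three distances survive, any statement about lengths and angles of the triangle in 30 dimensions can be read off the flat drawing.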

A geometrical representation of an ANOVA

  • Remember back to the ANOVA lecture
  • SSY = SSE + SSA
  • Imagine we have 30 data points of yield (3 levels of fertiliser with 10 replicates each)
  • In 30-dimensional space, each point is represented by 30 coordinates
  • Point Y represents the data
    • so the 30 coordinates describing this point are the 30 measurements of yield
  • Point M represents the grand mean
    • so the 30 coordinates describing this point all have the same value (the grand mean)
  • Point F represents the treatment means
    • so of the 30 coordinates describing this point, the first ten equal the mean of treatment A, the next ten the mean of treatment B, and the last ten the mean of treatment C
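
The three points can be built directly in R. This is a sketch with simulated yields: the treatment means, the standard deviation and the names `Y_pt`, `M_pt` and `F_pt` are assumptions for illustration (`F` itself is avoided as a name because R treats it as shorthand for `FALSE`).

```r
set.seed(4)
# Hypothetical yields: 3 fertilisers x 10 replicates (means and sd are made up)
fertiliser <- factor(rep(c("A", "B", "C"), each = 10))
yield <- rnorm(30, mean = rep(c(8, 10, 12), each = 10), sd = 1.5)

Y_pt <- yield                     # point Y: the 30 measurements
M_pt <- rep(mean(yield), 30)      # point M: the grand mean, repeated 30 times
F_pt <- ave(yield, fertiliser)    # point F: each value replaced by its treatment mean

SSY <- sum((Y_pt - M_pt)^2)       # squared distance from Y to M
SSA <- sum((F_pt - M_pt)^2)       # squared distance from F to M
SSE <- sum((Y_pt - F_pt)^2)       # squared distance from Y to F

c(SSY = SSY, SSA_plus_SSE = SSA + SSE)   # the two values agree

# Right angle at F: (Y - F) is perpendicular to (F - M)
sum((Y_pt - F_pt) * (F_pt - M_pt))       # zero up to rounding error

# The same sums of squares appear in the ANOVA table (treatment, then error)
anova(lm(yield ~ fertiliser))[["Sum Sq"]]
```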

Pythagoras' theorem

\[ d_1^2=d_2^2+d_3^2 \]

or

\[ SSY =SSE + SSA \]
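
A sketch of how the two equations line up, assuming \(d_1\) is the distance from Y to M, \(d_2\) the distance from Y to F, and \(d_3\) the distance from F to M (this labelling is an assumption chosen to match the order of the terms). Here \(y_i\) is the i-th measurement, \(\bar{y}\) the grand mean, and \(\bar{y}_{g(i)}\) the mean of the treatment that measurement i belongs to.

\[ d_1^2 = \sum_{i=1}^{30} (y_i - \bar{y})^2 = SSY \]

\[ d_2^2 = \sum_{i=1}^{30} (y_i - \bar{y}_{g(i)})^2 = SSE \]

\[ d_3^2 = \sum_{i=1}^{30} (\bar{y}_{g(i)} - \bar{y})^2 = SSA \]

Pythagoras' theorem applies because the triangle has a right angle at F. Within each treatment the deviations from the treatment mean sum to zero, so

\[ (Y - F) \cdot (F - M) = \sum_{i=1}^{30} (y_i - \bar{y}_{g(i)})(\bar{y}_{g(i)} - \bar{y}) = 0 \]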

See you next year
