23/2/2021

Outline

  • The problem
  • Categorical data and the Poisson distribution
  • Generalized linear models: the solution

The problem

  • You can now
    • Have as many explanatory variables as you like in a model
    • Use continuous or categorical explanatory variables
    • Include interactions at will
  • A substantial achievement for an undergrad biology course.
  • But it all depends on meeting the assumptions of the general linear model
  • Important types of biological data don’t fit these assumptions
    • I’ll demonstrate with categorical data, but there are lots more.

Categorical data

                Blue eyes   Brown eyes   Row totals
Fair hair              38           11           49
Dark hair              14           51           65
Column totals          52           62          114

  • Categorical data are usually counts of individual items, and so are discrete rather than continuous
  • Is there an association between hair colour and eye colour?
  • Calculate the expected values under the null hypothesis (\(H_0\): there is no association between hair and eye colour)
  • Compare observed and expected counts using \(\chi^2\) or another classical test
  • You could also try transforming the data, but there is a more powerful tool. First, a detour to the Poisson distribution.
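
As a rough sketch of the expected-value calculation described above, the following Python snippet works through the hair and eye colour table by hand. It uses only the counts from the table; the variable names are illustrative, and the statistic is the plain Pearson \(\chi^2\) without a continuity correction, which is an assumption about which version of the classical test is meant.

```python
# Observed counts from the hair/eye colour table (rows: hair, columns: eyes).
observed = [
    [38, 11],  # fair hair: blue eyes, brown eyes
    [14, 51],  # dark hair: blue eyes, brown eyes
]

row_totals = [sum(row) for row in observed]        # [49, 65]
col_totals = [sum(col) for col in zip(*observed)]  # [52, 62]
grand_total = sum(row_totals)                      # 114

# Under H0 (no association), expected count = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson chi-squared statistic: sum over cells of (observed - expected)^2 / expected.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

print("expected counts:", [[round(e, 2) for e in row] for row in expected])
print(f"chi-squared = {chi_squared:.2f} on {df} degree of freedom")
```

The statistic comes out at roughly 35.3 on 1 degree of freedom, far above the 5% critical value of 3.84, so these counts are hard to reconcile with \(H_0\).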

The Poisson distribution

  • This distribution is central to the analysis of categorical data.
  • It is a discrete distribution that describes the number of events occurring at random in a fixed interval of space or time.

Predicting the number of babies born in a hospital on a given day

The Poisson distribution equation

\[ \text{Probability that } \gamma \text{ events occur in 1 time unit}= \frac{e^{-\lambda}\lambda^{\gamma}}{\gamma!}\]

  • \(\lambda\) is the mean number of events per unit time
  • The number e, also known as Euler’s number, is a mathematical constant approximately equal to 2.71828.
  • \(\gamma!\) is \(\gamma\) factorial, e.g. \(4! = 4 \times 3 \times 2 \times 1 = 24\)
  • The thing to notice here is that the Poisson distribution is defined by a single parameter, \(\lambda\), the mean.
  • Compare that to the normal distribution
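
Before turning to the normal distribution, the Poisson formula above can be sketched in a few lines of Python. The function name `poisson_pmf` and the rate of 3 births per day are assumptions made purely for illustration; they are not figures from the lecture.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events in one time unit when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Hypothetical rate: an average of 3 births per day (illustrative only).
births_per_day = 3.0

for k in range(8):
    print(f"P({k} births) = {poisson_pmf(k, births_per_day):.4f}")

# The probabilities over all possible counts add up to 1
# (the infinite sum is truncated at 60 here, which is ample for a mean of 3).
total = sum(poisson_pmf(k, births_per_day) for k in range(60))
print(f"sum of probabilities = {total:.6f}")
```

Changing `births_per_day` is the only way to change the shape of the distribution, which is the single-parameter point made above.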

Normal distribution equation

\[ P(x) = \frac{1}{{\sigma \sqrt {2\pi } }}e^{{{ -{(x - \mu)}^2 } /{2\sigma ^2 }}}\]

  • It depends on two parameters: \(\sigma\), the standard deviation, and \(\mu\), the mean.

Differences between data with Normal error and categorical data with Poisson error

Normal                           Poisson
Symmetric and continuous error   Asymmetric and discrete error
Variance independent of mean     Variance dependent on the mean
Variance unknown                 Variance known (= mean)
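
The Poisson column of this comparison can be checked numerically. The sketch below computes the mean, variance and skewness directly from the Poisson probabilities for a few arbitrary values of \(\lambda\). Skewness is used here as one convenient measure of asymmetry; it is an addition for illustration, as are the helper names and the chosen values of \(\lambda\).

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of exactly k events when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_moments(lam, max_k=100):
    """Mean, variance and skewness computed from the probabilities for k = 0..max_k-1."""
    probs = [poisson_pmf(k, lam) for k in range(max_k)]
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    skewness = sum((k - mean) ** 3 * p for k, p in enumerate(probs)) / variance ** 1.5
    return mean, variance, skewness


# Arbitrary illustrative means; the truncation at 100 is ample for all of them.
for lam in (1.0, 4.0, 10.0):
    mean, variance, skewness = poisson_moments(lam)
    print(f"lambda = {lam:4.1f}  mean = {mean:.3f}  "
          f"variance = {variance:.3f}  skewness = {skewness:.3f}")
```

For each \(\lambda\) the variance matches the mean, and the skewness is positive, shrinking as the mean grows (it equals \(1/\sqrt{\lambda}\)). A normal distribution, by contrast, has zero skewness and a variance that is set separately from its mean.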

Enter the generalized linear model

  • The difference between a general linear model and a generalized linear model is simply the way error is handled
  • General linear models assume errors are independent and follow a normal distribution
  • Generalized linear models use other distributions, e.g. binomial, Poisson, negative binomial, beta and gamma distributions (plus a lot more)
  • In the next lecture, we look at the three elements that make up a generalized linear model
  • In the one after that, we implement a generalized linear model in R.