Lecture 1 Introduction

Eamonn Mallon
11/06/2019

Statistics is fundamental to Bioscience

[Image: embl]

[Image: women_linux]

  • Bioscience is a data science
  • Statistics is just experimental design (class example)

Data Science as a transferable skill

[Figure showing R's increasing popularity]

Statistical thinking as a life skill

[Figure showing R's increasing popularity]

Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write (attributed to H.G. Wells)


Statistical thinking as a life skill

Chair: If “good” requires pupil performance to exceed the national average, and if all schools must be good, how is this mathematically possible?

Michael Gove: By getting better all the time.

Chair: So it is possible, is it?

Michael Gove: It is possible to get better all the time.

Chair: Were you better at literacy than numeracy, Secretary of State?

Michael Gove: I cannot remember.

(https://publications.parliament.uk/pa/cm201012/cmselect/cmeduc/uc1786-i/uc178601.htm)

What program?

R's popularity

[Figure showing R's increasing popularity]

R is one of the fastest growing languages

R is too hard

RStudio: R's IDE

RStudio panes

What I would like you to get out of your first-year data science course

  • Why do we do statistics? Separating signal from noise
  • R is a useful thing to learn
  • A properly designed experiment can always be analysed; no amount of statistics will save a poor experiment
  • A few statistical techniques to get you started
  • A burning desire to come and do the second year course

Quite nice

[R plot]

Everything varies (Separating signal from noise)

  • Think about height
  • We need a way of discriminating between variation that is scientifically interesting and variation that just represents background heterogeneity
  • Key concept: the amount of variation that we would expect to occur by chance alone
  • when we find a difference bigger than this, we say it is statistically significant (a result unlikely to have occurred by chance; see the sketch below)
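
A minimal sketch in R (made-up height numbers, not lecture data) of what “variation by chance alone” looks like: two samples drawn from exactly the same population still differ a little, and a t-test asks whether the difference is bigger than chance would produce.

set.seed(42)                                 # make the simulation repeatable
heights_a <- rnorm(20, mean = 170, sd = 10)  # two samples drawn from the
heights_b <- rnorm(20, mean = 170, sd = 10)  # *same* population
mean(heights_a) - mean(heights_b)            # a non-zero difference, purely by chance
t.test(heights_a, heights_b)                 # typically p > 0.05: not significant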

Good and bad hypotheses

A good hypothesis must be capable of rejection (Popper)

  1. There are vultures in the park
  2. There are no vultures in the park

absence of evidence is not evidence of absence

Null hypotheses

The null hypothesis says nothing is happening

  • when comparing two samples' means, the null hypothesis is that the two samples are the same
  • when looking at a graph of y against x, the null hypothesis is that y is independent of x

p Values

  • the p value is an estimate of the probability that a particular result, or one even more extreme, could occur by chance if the null hypothesis were true
  • p < 0.05
  • we can reject the null hypothesis when it is true (Type I error; see the sketch below)
  • we can accept the null hypothesis when it is false (Type II error)
  • the boy who cried wolf
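
A sketch of the Type I error rate by simulation (illustrative only, not from the lecture): if we repeatedly test two samples drawn from identical populations, so that the null hypothesis really is true, roughly 5% of the p values still fall below 0.05.

set.seed(1)
p_values <- replicate(1000, {
  x <- rnorm(15)                  # two samples from identical populations,
  y <- rnorm(15)                  # so the null hypothesis is true every time
  t.test(x, y)$p.value
})
mean(p_values < 0.05)             # close to 0.05: the Type I error rate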

statistical modelling

The best model is the one that produces the least unexplained variation (minimal residual deviance), subject to the constraint that all the parameters in the model should be statistically significant

[R plot]
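
A sketch of this idea with made-up data (not the lecture example): adding a parameter that is not statistically significant does not buy a meaningful reduction in unexplained variation, so the simpler model is the better one.

set.seed(7)
x <- runif(30, 0, 10)
y <- 2 + 3 * x + rnorm(30, sd = 2)   # y really depends on x in a straight line
m1 <- lm(y ~ x)                      # simple model
m2 <- lm(y ~ x + I(x^2))             # adds an unnecessary quadratic term
summary(m2)                          # the extra parameter is not significant
anova(m1, m2)                        # and it barely reduces the residual variation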

Fitting a model

\( y = a + bx \)

[R plot]
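
In R, fitting this straight line means estimating a (the intercept) and b (the slope) with lm(). A sketch with simulated data (illustrative values only):

set.seed(3)
x <- 1:20
y <- 5 + 0.8 * x + rnorm(20, sd = 1.5)   # true a = 5, b = 0.8, plus noise
fit <- lm(y ~ x)                          # fit y = a + b*x
coef(fit)                                 # estimated intercept (a) and slope (b)
plot(x, y)                                # scatter plot of the data
abline(fit)                               # add the fitted line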

Experimental design

  • Replication
  • Randomization
  • The principle of parsimony
  • The power of a statistical test
  • Controls
  • Experimental versus observational data

The principle of parsimony (Occam's razor)

  • the correct explanation is the simplest explanation
    • models should have as few parameters as possible
    • linear models should be preferred to non-linear models
    • experiments relying on few assumptions should be preferred to those relying on many
    • models should be pared down until they are minimal adequate
    • simple explanations should be preferred to complex explanations: “A model should be as simple as possible. But no simpler” (Einstein)

Controls

No controls, no conclusions

Replication: the n's justify the means

  • How many replicates?
    • As many as you can afford
    • 30
    • pilot studies (power)

Power

  • The power of a test is the probability of rejecting the null hypothesis when it is false
  • \( \beta \) is the probability of accepting the null hypothesis when it is false (Type II error)
  • \( \beta \) should be as small as possible
  • but the smaller we make \( \beta \) (reducing Type II error), the larger the probability of a Type I error
  • Compromise \( \alpha = 0.05 \) and \( \beta = 0.2 \)
  • power is \( 1 - \beta = 0.8 \)
  • can use this, the variance (\( s^2 \)), and the size of the difference we want to detect (\( \delta \)) to calculate the number of replicates required (n), as in the quick check below \[ n \approx \frac{8 \times s^2}{\delta^2} \]
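
A quick check of this rule of thumb, using the same numbers as the power.t.test() call on the next slide (\( s^2 = 10 \), \( \delta = 2 \)):

8 * 10 / 2^2    # about 20 replicates; power.t.test() below gives roughly 22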

R can do it for you

power.t.test(type = "one.sample", power = 0.8,
             sd=sqrt(10),delta = 2)

     One-sample t test power calculation 

              n = 21.62146
          delta = 2
             sd = 3.162278
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

Randomization

  • Tribolium and pesticides
    • 5 treatments, 4 replicates, 10 beetles per replicate
    • If you just do them in order, then you could confound something with treatment
    • Rather, make up 20 batches of beetles and assign treatments to them randomly (names in a bag?); see the fuller sketch below

treatments <- c("aloprin","vitex","formixin","panto","allclear")
sample(treatments)
[1] "panto"    "aloprin"  "formixin" "vitex"    "allclear"

Raining cats

[Image: car]

[Figure: cat graph, doi:10.1038/332586a0]

Structure of the course

  • Taught over BS1040 and BS1070/MB1080
  • First hour a lecture, next part a computer practical, some homework
  • Tested by continuous assessment and in the exams

BS1040

  • Today: Getting started
  • Central tendency and variation
  • Single samples
  • Two samples

BS1070/MB1080

  • Regression
  • Analysis of variance
  • Reproducible science / Looking forward to second year