Lecture 1 Introduction

Eamonn Mallon
11/06/2019

Statistics is fundamental to Bioscience

[Image: embl]

[Image: women_linux]

  • Bioscience is a data science
  • Statistics is just experimental design (class example)

Data Science as a transferable skill

[Figure showing R's increasing popularity]

Statistical thinking as a life skill

[Figure showing R's increasing popularity]

Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write (attributed to H.G. Wells)


Statistical thinking as a life skill

Chair: If “good” requires pupil performance to exceed the national average, and if all schools must be good, how is this mathematically possible?

Michael Gove: By getting better all the time.

Chair: So it is possible, is it?

Michael Gove: It is possible to get better all the time.

Chair: Were you better at literacy than numeracy, Secretary of State?

Michael Gove: I cannot remember.

(https://publications.parliament.uk/pa/cm201012/cmselect/cmeduc/uc1786-i/uc178601.htm)

What program?

R's popularity

[Figure showing R's increasing popularity]

R is one of the fastest growing languages

R is too hard

RStudio: R's IDE

RStudio panes

What I would like you to get out of your first-year data science course

  • Why do we do statistics? Separating signal from noise
  • R is a useful thing to learn
  • A properly designed experiment can always be analysed; no amount of statistics will save a poor experiment
  • A few statistical techniques to get you started
  • A burning desire to come and do the second year course

Quite nice

[R plot]

Everything varies (Separating signal from noise)

  • Think about height
  • We need a way of discriminating between variation that is scientifically interesting and variation that just represents background heterogeneity
  • Key concept: the amount of variation that we would expect to occur by chance alone
  • when we find a difference bigger than this, we say it is statistically significant (a result unlikely to have occurred by chance; see the sketch below)
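
A minimal sketch in R (made-up height numbers, not lecture data) of what “variation by chance alone” looks like: two samples drawn from exactly the same population still differ a little, and a t-test asks whether the difference is bigger than chance would produce.

set.seed(42)                                 # make the simulation repeatable
heights_a <- rnorm(20, mean = 170, sd = 10)  # two samples drawn from the
heights_b <- rnorm(20, mean = 170, sd = 10)  # *same* population
mean(heights_a) - mean(heights_b)            # a non-zero difference, purely by chance
t.test(heights_a, heights_b)                 # typically p > 0.05: not significant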

Good and bad hypotheses

A good hypothesis must be capable of rejection (Popper)

  1. There are vultures in the park
  2. There are no vultures in the park

absence of evidence is not evidence of absence

Null hypotheses

The null hypothesis says nothing is happening

  • when comparing two samples' means, the null hypothesis is that the two samples are the same
  • when looking at a graph of y against x, the null hypothesis is that y is independent of x

p Values

  • the p value is an estimate of the probability that a particular result, or one even more extreme, could occur by chance if the null hypothesis were true
  • p < 0.05
  • we can reject the null hypothesis when it is true (Type I error; see the sketch below)
  • we can accept the null hypothesis when it is false (Type II error)
  • the boy who cried wolf
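
A sketch of the Type I error rate by simulation (illustrative only, not from the lecture): if we repeatedly test two samples drawn from identical populations, so that the null hypothesis really is true, roughly 5% of the p values still fall below 0.05.

set.seed(1)
p_values <- replicate(1000, {
  x <- rnorm(15)                  # two samples from identical populations,
  y <- rnorm(15)                  # so the null hypothesis is true every time
  t.test(x, y)$p.value
})
mean(p_values < 0.05)             # close to 0.05: the Type I error rate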

statistical modelling

The best model is the one that produces the least unexplained variation (minimal residual deviance), subject to the constraint that all the parameters in the model should be statistically significant

[R plot]
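
A sketch of this idea with made-up data (not the lecture example): adding a parameter that is not statistically significant does not buy a meaningful reduction in unexplained variation, so the simpler model is the better one.

set.seed(7)
x <- runif(30, 0, 10)
y <- 2 + 3 * x + rnorm(30, sd = 2)   # y really depends on x in a straight line
m1 <- lm(y ~ x)                      # simple model
m2 <- lm(y ~ x + I(x^2))             # adds an unnecessary quadratic term
summary(m2)                          # the extra parameter is not significant
anova(m1, m2)                        # and it barely reduces the residual variation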

Fitting a model

\( y = a + bx \)

[R plot]
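
In R, fitting this straight line means estimating a (the intercept) and b (the slope) with lm(). A sketch with simulated data (illustrative values only):

set.seed(3)
x <- 1:20
y <- 5 + 0.8 * x + rnorm(20, sd = 1.5)   # true a = 5, b = 0.8, plus noise
fit <- lm(y ~ x)                          # fit y = a + b*x
coef(fit)                                 # estimated intercept (a) and slope (b)
plot(x, y)                                # scatter plot of the data
abline(fit)                               # add the fitted line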

Experimental design

  • Replication
  • Randomization
  • The principle of parsimony
  • The power of a statistical test
  • Controls
  • Experimental versus observational data

The principle of parsimony (Occam's razor)

  • the correct explanation is the simplest explanation
    • models should have as few parameters as possible
    • linear models should be preferred to non-linear models
    • experiments relying on few assumptions should be preferred to those relying on many
    • models should be pared down until they are minimal adequate
    • simple explanations should be preferred to complex explanations: “A model should be as simple as possible. But no simpler” (Einstein)

Controls

No controls, no conclusions

Replication: the n's justify the means

  • How many replicates?
    • As many as you can afford
    • 30
    • pilot studies (power)

Power

  • The power of a test is the probability of rejecting the null hypothesis when it is false
  • \( \beta \) is the probability of accepting the null hypothesis when it is false (Type II error)
  • \( \beta \) should be as small as possible
  • but the smaller we make \( \beta \) (reducing Type II error), the larger the probability of a Type I error
  • Compromise \( \alpha = 0.05 \) and \( \beta = 0.2 \)
  • power is \( 1 - \beta = 0.8 \)
  • can use this, the variance (\( s^2 \)), and the size of the difference we want to detect (\( \delta \)) to calculate the number of replicates required (n), as in the quick check below \[ n \approx \frac{8 \times s^2}{\delta^2} \]
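
A quick check of this rule of thumb, using the same numbers as the power.t.test() call on the next slide (\( s^2 = 10 \), \( \delta = 2 \)):

8 * 10 / 2^2    # about 20 replicates; power.t.test() below gives roughly 22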

R can do it for you

power.t.test(type = "one.sample", power = 0.8,
             sd=sqrt(10),delta = 2)

     One-sample t test power calculation 

              n = 21.62146
          delta = 2
             sd = 3.162278
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

Randomization

  • Tribolium and pesticides
    • 5 treatments, 4 replicates, 10 beetles per replicate
    • If you just do them in order, then you could confound something with treatment
    • Rather, make up 20 batches of beetles and assign treatments to them randomly (names in a bag?); see the fuller sketch below

treatments <- c("aloprin","vitex","formixin","panto","allclear")
sample(treatments)
[1] "panto"    "aloprin"  "formixin" "vitex"    "allclear"

Raining cats

[Image: car]

[Figure: cat graph, doi:10.1038/332586a0]

Structure of the course

  • Taught over BS1040 and BS1070/MB1080
  • First hour a lecture, next part a computer practical, some homework
  • Tested by continuous assessment and in the exams

BS1040

  • Today: Getting started
  • Central tendency and variation
  • Single samples
  • Two samples

BS1070/MB1080

  • Regression
  • Analysis of variance
  • Reproducible science / Looking forward to second year