July 26, 2016

Recap from last week

Week 2
  • Your presence and participation are what matter most!
  • If you want to change your lab stream, see the office (WS floor 5)
  • Lab reports will become your key document for this paper!
  • Be creative: add to the lab reports!
  • Think about your cheat sheet; be creative there too!
  • Additional resources as recommended, refer to last week's slides

Lab reports are to be handed in every week (starting Friday, August 5!) through Turnitin

What we will learn today

  • Some basics: the research process, scientific theory, hypotheses
  • Experiments versus observations
  • Models: What are they? What can they do?
  • Variables: what kinds are there? How to identify them, how to create them in R
  • Variability in data: where does it come from?
  • Population versus sample

Types of data analysis

  • Quantitative Methods
    • Testing theories using numbers
  • Qualitative Methods
    • Reporting findings (new methods, chemical pathways)
    • Describing new species
    • Testing theories using language, interviews, etc.
  • Often, using qualitative methods still involves a (possibly small) quantitative aspect
    • For example, a new insect species is described qualitatively, but statistics are used to show that its abdomen length differs significantly from that of another species.

The process of doing quantitative (biological) science

Generating and Testing Theories

  • Theory

A hypothesised general principle or set of principles that explains known findings about a topic and from which new hypotheses can be generated. E.g.: 'Biodiversity decreases towards the poles.'

  • Hypothesis

A prediction from a theory. E.g.: 'In any one animal phylum, there are fewer species found between 30 and 60 degrees north/south than between 0 and 30 degrees north/south.'

Note that the terms 'null hypothesis' (\(H_0\)) and 'alternative hypothesis' (\(H_A\)) mean something else (see later)

Testable and non-testable hypotheses

  • Testable (scientifically usable) hypotheses
    • There are no rats on Rangitoto Island
    • The Beatles sold more records than any other band
    • More people like this product with a sour taste than with a bitter taste
    • Patients taking this medication live longer


  • Non-testable (non-scientific) hypotheses
    • There are some rats on Rangitoto Island
    • The Beatles were the best band ever
    • Most people like this product with a sour taste
    • This medication has helped so many people, so surely it must be good


  • Falsification: a good scientific hypothesis is falsifiable!

Observational versus experimental studies

Examples of observational studies:

  • Measuring the number of species along an altitudinal gradient
  • Recording the circumference of the heads of newborns (and relating it to the size of the mother)
  • Recording student attendance for this paper

Examples of experimental studies:

  • Measuring the number of species in response to fertiliser application
  • Recording the weight of 10 people who have been on a diet versus 10 who haven't
  • Comparing the scores of BIOL501 students who have received different teaching resources

Observational versus experimental studies

Observational studies

  • do not involve (deliberate) manipulation
  • are often comparative or along a gradient

Experimental studies

  • involve an (intended) manipulation (or 'treatment') of some sort
  • usually involve a 'control group' (a group of individuals not receiving the treatment)
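
A minimal sketch in R of how a treatment and a control group could be set up for a diet study like the one mentioned earlier. Random allocation is an assumption added here for illustration (the slides only ask for a control group), and the participant IDs are hypothetical.

```r
# Hypothetical participant IDs: 20 people, as in the diet example
set.seed(1)                       # makes the random allocation reproducible
participants = paste0('person_', 1:20)

# Randomly pick 10 people for the treatment (diet); the rest are controls
treated = sample(participants, size = 10)
group = ifelse(participants %in% treated, 'diet', 'control')

# One row per person: who they are and which group they belong to
design = data.frame(participant = participants, group = group)
table(design$group)               # 10 'control' and 10 'diet'
```

Letting chance decide who gets the treatment is one common way to avoid the groups differing systematically before the experiment starts.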

Statistical models

  • A statistical model is a way of simplifying an observed process or mechanism

\(outcome_i = (model) + error_i\)

What can we use (statistical) models for?

Models are not only used to test hypotheses! They can also summarise information, or predict values where data are missing (predictive model). E.g. what is y when x = 5?
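
As a rough sketch of a predictive model in R: the data below are made up for illustration, with no observation at x = 5, and a straight-line model (via `lm()`) is only one possible choice of model.

```r
# Made-up data for illustration: y was not measured at x = 5
x = c(1, 2, 3, 4, 6, 7)
y = c(2.1, 3.9, 6.2, 7.8, 12.1, 13.9)

# A straight-line model: y_i = (intercept + slope * x_i) + error_i
fit = lm(y ~ x)

# Use the model to predict y where we have no observation
y_at_5 = predict(fit, newdata = data.frame(x = 5))
y_at_5
```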

Example: the simplest statistical model

The mean!

  • The mean summarises data
  • The mean can be used to predict future outcomes of a variable (e.g. if 200 people died on average every year on the road in NZ over the past 3 years, we can use this value to predict the road toll in the following year)
  • The mean is a hypothetical value (i.e. it doesn’t have to be a value that actually exists in the data set, e.g. the mean of 192, 188, and 220 is 200).
  • The mean can be used to test hypotheses, e.g. whether it differs from a certain value (is the mean road toll in NZ in 2012-2015 different from that in 1962-1965?)

As such, the mean is a simple statistical model.
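
The road toll example can be sketched in R as follows. The three values are the ones used in the text; treating them as consecutive annual tolls is an assumption made for illustration.

```r
# Road toll for three years (the example values from the text)
toll = c(192, 188, 220)

# The model: a single number that summarises the three observations
toll_mean = mean(toll)
toll_mean                      # 200, a value that does not occur in the data

# Prediction for the following year under this model: simply the mean
predicted_next_year = toll_mean

# How far each observed year lies from the model (the 'error')
toll - toll_mean               # -8 -12 20
```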

The mean

  • The mean is the sum of all scores divided by the number of scores.
  • The mean is also the value from which the scores deviate least in terms of squared deviations (it has the least squared error).

\(mean(X) = \bar{X} = \frac{\sum\limits_{i=1}^n x_i}{n}\)

We all know the mean!
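
The 'least error' property can be checked numerically. The sketch below uses five illustrative scores and the helper name `squared_error` is made up for this example; `optimize()` simply searches for the candidate value with the smallest total squared error.

```r
# Five illustrative scores
scores = c(1, 3, 4, 3, 2)

# Total squared error if we summarise the scores by a single candidate value m
squared_error = function(m) sum((scores - m)^2)

# Search numerically for the value with the smallest squared error
best = optimize(squared_error, interval = range(scores))$minimum

best            # numerically very close to ...
mean(scores)    # ... the mean

# Another candidate, e.g. the median, has a larger squared error
squared_error(mean(scores)) < squared_error(median(scores))
```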

The mean: simple example

  • Collect some data

\(1, 3, 4, 3, 2\)

  • Add them up:

\(\sum\limits_{i=1}^n x_i = 1 + 3 + 4 + 3 + 2 = 13\)

  • Divide by the number of scores, \(n\):

\(\bar{X} = \frac{\sum\limits_{i=1}^n x_i}{n} = \frac{13}{5} = 2.6\)

The mean is a statistical model!

\(outcome_i = (model) + error_i\)

\(outcome_{lecturer1} = (model) + error_{lecturer1}\)

\(1 = 2.6 + (-1.6)\)
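
The same decomposition can be written out for all five scores at once. This is only a sketch of the idea \(outcome_i = (model) + error_i\), with the mean as the model:

```r
scores = c(1, 3, 4, 3, 2)     # the five scores from the example
model = mean(scores)          # 2.6

# error_i = outcome_i - model
errors = scores - model
errors                        # -1.6 0.4 1.4 0.4 -0.6

# Adding the errors back to the model recovers every outcome
model + errors

# The errors around the mean cancel out (up to rounding error)
sum(errors)
```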

Quick summary

  • What is a (statistical) model?
    • It is a summary/simplification of what is going on in reality
    • It can be used for a number of purposes, e.g. to test hypotheses
  • How is the mean a statistical model?
    • It summarises information contained in many data points
    • It can be used to make a prediction
    • It can be used to test a hypothesis
  • What does a model need?
    • Variables!

Example: Coca-Cola kills sperm…

Hypothesis: Coca-Cola kills sperm

Kinds of variables

  • Categorical (entities are divided into distinct categories):
    • Binary variable: There are only two categories, e.g. dead or alive
    • Nominal variable: There are more than two categories, e.g. whether someone is an omnivore, vegetarian, vegan, or fruitarian
    • Ordinal variable: The same as a nominal variable, but the categories have a logical order, e.g. whether people got a fail, a pass, a merit or a distinction in their exam
  • Continuous (entities get a distinct score): e.g. human body height
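
One possible way to represent these kinds of variables in R is sketched below. The categories come from the examples in the list; the body heights are hypothetical values.

```r
# Binary: exactly two categories
status = factor(c('dead', 'alive', 'alive'))

# Nominal: more than two categories, no natural order
diet = factor(c('omnivore', 'vegan', 'vegetarian', 'fruitarian', 'vegan'))

# Ordinal: categories with a logical order, stated explicitly via 'levels'
grade = factor(c('pass', 'fail', 'distinction', 'merit'),
               levels = c('fail', 'pass', 'merit', 'distinction'),
               ordered = TRUE)

# Continuous: plain numbers (hypothetical body heights in cm)
height = c(172.5, 168.0, 181.3)

nlevels(status)        # 2
is.ordered(diet)       # FALSE
grade[1] < grade[4]    # TRUE: a pass ranks below a merit
is.numeric(height)     # TRUE
```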

Kinds of variables

  • 'Predictor variable', or 'independent variable'
    • The proposed cause
    • A manipulated variable (in experiments)
    • The concentration of Coca-Cola in the above example
    • usually plotted on the x-axis
  • 'Response variable', or 'dependent variable'
    • The proposed effect ('outcome')
    • Measured, not manipulated (in experiments)
    • Sperm count in the above example
    • usually plotted on the y-axis

Kinds of variables

| Type of variable | Categorical (Binomial) | Categorical (Nominal) | Categorical (Ordinal) | Continuous |
|---|---|---|---|---|
| Predictor | smoker, gender, handedness | state of mind, hair colour | age class, rank | long jump results, body weight |
| Response | survival, handedness | employment type, hair colour | income bracket, clutch size | cholesterol level, body weight |


A variable has a name and values. Examples:

  • Variable name: handedness, values: left, right
  • Variable name: body weight, values: 63.4, 88.2, …
  • A categorical predictor variable is often called a 'factor', its values 'factor levels'
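
In R, names and values could be stored together in a data frame, as in the sketch below. The data frame name `people` and the third body weight are made up for illustration.

```r
# Each column is one variable: it has a name and a set of values.
# The third body weight is a made-up value added for illustration.
people = data.frame(
  handedness  = factor(c('left', 'right', 'right')),
  body_weight = c(63.4, 88.2, 71.0)
)

names(people)                # the variable names
levels(people$handedness)    # the factor levels: "left" "right"
people$body_weight           # the values of a continuous variable
```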

Variability in data

  • Almost always occurs
    • e.g. if you measure your height 10 times
    • e.g. if you measure the height of 10 different people
    • e.g. if you measure the height of 10 men and then 10 women
  • The reasons for the variability, however, differ!
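
A small simulation can make the first two cases concrete. All numbers below (true height, size of the measurement error, spread between people) are hypothetical and chosen only for illustration.

```r
set.seed(42)  # reproducible random numbers

# Hypothetical numbers: one person's true height is 170 cm and the tape
# measure adds a small random error each time (sd 0.3 cm)
same_person = rnorm(10, mean = 170, sd = 0.3)

# Hypothetical numbers: 10 different people vary much more (sd 7 cm)
ten_people = rnorm(10, mean = 170, sd = 7)

sd(same_person)   # small: measurement error only
sd(ten_people)    # larger: real differences between people
```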

Types of Variation and their origin

  • Systematic Variation
    • Variation created by a specific experimental manipulation (e.g. administering a drug)
    • Variation created by unknown factors (confounding variables, more later!)
      • Age, gender, IQ, time of day, measurement error, etc.
    • Variation that occurs while applying the treatment
    • Variation that occurs while measuring a variable
  • Unsystematic Variation
    • Variation created by unknown factors.
      • Age, gender, IQ, time of day, measurement error, etc.
    • Variation that occurs while applying the treatment
    • Variation that occurs while measuring a variable

Systematic and unsystematic variation

Signal versus noise: Systematic variation adds to the signal, unsystematic variation adds to the noise!
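
A simulated sketch of signal and noise in R. The group means, the treatment effect of 5 units, and the noise level are all made-up assumptions, not data from a real experiment.

```r
set.seed(7)
n = 20                                   # animals per group (made up)

# Systematic variation: the treatment shifts the group mean by 5 units
group_mean = rep(c(50, 55), each = n)    # control, then treatment
group = rep(c('control', 'treatment'), each = n)

# Unsystematic variation: random scatter around the group mean
noise = rnorm(2 * n, mean = 0, sd = 1)

outcome = group_mean + noise

# Signal: difference between the group means
means = tapply(outcome, group, mean)
signal = unname(means['treatment'] - means['control'])
signal                                   # close to the true shift of 5

# Noise: spread of the scores within each group
tapply(outcome, group, sd)
```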

Systematic and unsystematic variation: the mouse example

  • …recall what a 'control group' is!

By the way: what was that again with response and predictor variables…?

Systematic and unsystematic variation in the mouse experiment example

| Type of variation ↓ / Reason for variation → | Application of manipulation | Measurement error | Natural |
|---|---|---|---|
| Systematic | Syringe has wrong volume | Experimenter is using a faulty tape measure | A batch of mice has higher growth rates because they come from different origins |
| Unsystematic | Solution to be injected is not homogenous | Experimenter is tired, makes random mistakes | Natural variation in growth rates of mice, even though they originate from the same mother |

Systematic and unsystematic variation

  • Disentangling the two is the key task of statistics, e.g. by
    • reducing unsystematic variation
    • attributing systematic variation
    • taking several measurements of the same quantity (replication!)

We thus maximise the signal to noise ratio, i.e. we maximise our explanatory power

  • Mouse experiment example:
    • We are trying to minimise the ‘noise’ by
      • applying the injections properly
      • measuring the heart rates accurately
      • accounting for additional factors such as gender
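
Why replication helps can be sketched with a simulation. The true value and the size of the random error are hypothetical; the point is only that averaging several measurements of the same quantity reduces the noise.

```r
set.seed(123)
true_value = 10     # hypothetical true quantity
noise_sd = 2        # hypothetical size of the random measurement error

# One measurement per study versus the mean of 25 replicate measurements,
# each scenario repeated 1000 times
single = replicate(1000, rnorm(1, mean = true_value, sd = noise_sd))
averaged = replicate(1000, mean(rnorm(25, mean = true_value, sd = noise_sd)))

sd(single)      # roughly noise_sd
sd(averaged)    # roughly noise_sd / sqrt(25), i.e. about five times smaller
```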

Population versus sample

Consider these three experiments:

  1. You are asked to determine the germination rate of 20 kg of grass seeds. What is your sample? What is your population?

  2. You are asked to determine the germination rate of grass seeds 'supergrass' sold at the warehouse. What could be your sample? What is the population?

  3. You are asked to test the efficiency of a lung cancer treatment. What could be your sample? What is the population?

Population versus sample

  • In example (1), you are only interested in these 20 kg of grass seeds. They are your sample, but they are also the population that you are making your inference on.

  • In (2), your population is all 'supergrass' seeds sold all over New Zealand during this season. Your sample could be one sachet of seeds per warehouse branch.

  • In (3), your population is possibly all present and future lung cancer patients globally, so it is fictitious. Your sample may be 20 lung cancer patients in Auckland.

When selecting your sample, think of the population you would like to make an inference on!
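
The relationship between a sample and its population can be simulated in R. In the sketch below, the population size, the 80% germination rate and the sample size are all hypothetical; in a real study the population value is unknown and only the sample is observed.

```r
set.seed(2016)

# Hypothetical population: 10,000 seeds, of which about 80% germinate
# (1 = germinates, 0 = does not)
population = rbinom(10000, size = 1, prob = 0.8)

# We can only afford to test 200 seeds, drawn at random
seed_sample = sample(population, size = 200)

mean(seed_sample)    # germination rate estimated from the sample
mean(population)     # germination rate of the whole population
```

The sample estimate will usually be close to, but not exactly equal to, the population value.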

How to create variables in R

x = c('red', 'blue', 'yellow', 'yellow', 'red', 'blue', 'blue')
x  # nominal variable (categorical)
## [1] "red"    "blue"   "yellow" "yellow" "red"    "blue"   "blue"
offspring = c(4, 5, 2, 0, 1, 1, 3)
offspring  # ordinal variable (categorical)
## [1] 4 5 2 0 1 1 3
size = c(3.87, 5.1, 9.63, 3.11, 4.28, 5.34, 2.19)
size  # continuous variable
## [1] 3.87 5.10 9.63 3.11 4.28 5.34 2.19

And a few tricks to create variables in R

x = rep(c('red', 'blue'), times = 3)  # repeat the whole vector 3 times
x
## [1] "red"  "blue" "red"  "blue" "red"  "blue"
y = rep(c(2, 4, 6), each = 3)         # repeat each value 3 times in turn
y
## [1] 2 2 2 4 4 4 6 6 6

Use the function rep() with the arguments times or each to specify how you would like to repeat the values.
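
A few further variations, offered as a sketch: `times` and `each` can be combined, `times` can be a vector, and `seq()` (a base R function not covered above) creates regular number sequences. The object names are made up for this example.

```r
# 'each' is applied first, then 'times'
a = rep(c('red', 'blue'), each = 2, times = 2)
a
# "red" "red" "blue" "blue" "red" "red" "blue" "blue"

# A vector for 'times' repeats each value a different number of times
b = rep(c('red', 'blue'), times = c(3, 1))
b
# "red" "red" "red" "blue"

# seq() creates regular sequences of numbers
s = seq(2, 6, by = 2)
s
# 2 4 6
```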

What will we have learnt in week 2?

  • What is the scientific research process
  • Scientific and non-scientific hypotheses
  • What is a hypothesis that can be falsified?
  • What is an experiment? What is an observational study?
  • What are response and predictor variables? What are a 'treatment' and a 'control'?
  • How to characterise a variable by its name and values
  • Identify whether it is categorical (binomial, nominal, ordinal?) or continuous
  • Identify whether it is a predictor or a response variable
  • Identify systematic vs. unsystematic variation in data, what are its sources?
  • What is the signal-noise ratio? How can we increase it?
  • Identify the difference between a sample and a population
  • How to get started with R / R Markdown

Glossary week 2

  • theory
  • hypothesis (plural hypotheses)
  • falsification, falsifiable
  • variable, variable name, values
  • variation (systematic and unsystematic)
  • predictor, response variable
  • dependent, independent variable
  • experiment versus observation
  • treatment variable, ‘treatment’
  • control, control group
  • continuous/categorical variables
  • binomial, nominal, ordinal variables
  • signal/noise ratio
  • factor, factor levels
  • population, sample