July 25, 2021

Recap from last week

Week 2
  • You don’t need to hand in your lab reports, but completing them is essential!
  • Your presence and participation matter most!
  • If you want to change your lab stream, see the office (WS ground floor)
  • Lab reports will become your key document for studying for this paper!
  • Be creative: add to the lab reports!
  • Think about your cheat sheet and other ways to be creative!

What we will learn today

  • Some basics: the research process, scientific theory, hypotheses
  • Experiments versus observations
  • Models: What are they? What can they do?
  • Variables: what kinds are there? How to identify them, how to create them in R
  • Variability in data: where does it come from?
  • Population versus sample

Types of data analysis

  • Quantitative Methods
    • Testing hypotheses using numbers
  • Qualitative Methods
    • Reporting findings (new methods, chemical pathways)
    • Describing new species
    • Testing hypotheses using language, interviews, etc.
  • Often, using qualitative methods still involves a quantitative aspect
    • For example, a new insect species is described qualitatively, but statistics are used to show that its abdomen length differs significantly from that of another species.

The process of doing quantitative science

Generating and Testing Theories

  • Theory

A hypothesised general principle or set of principles that explains known findings about a topic and from which new hypotheses can be generated. E.g.: ‘Biodiversity decreases towards the poles.’

  • Hypothesis

A prediction from a theory. E.g.: ‘In any one animal phylum, there are fewer species found between 30 and 60 degrees north/south than between 0 and 30 degrees north/south.’

Note that the terms ‘null hypothesis’ (H\(_0\)) and ‘alternative hypothesis’ (H\(_A\)) mean something else (see later).

Testable and non-testable hypotheses

  • Testable (scientifically usable) hypotheses
    • There are no rats on Rangitoto Island
    • The Beatles sold more records than any other band
    • More people like this product with a sour taste rather than a bitter taste
    • Patients taking this medication live longer than those who don’t
  • Non-testable (non-scientific) hypotheses
    • There are 8 rats on Rangitoto Island
    • There are rats on Rangitoto Island
    • The Beatles were the best band ever
    • Most people like this product with a sour taste
    • This medication has helped so many people, so surely it must be good


  • Falsification: a good scientific hypothesis is quantifiable and falsifiable!

Observational versus experimental studies

Examples of observational studies:

  • Measuring the number of species along an altitudinal gradient
  • Recording the circumference of the heads of newborns (and relating it to the size of the mother)
  • Recording student attendance for this paper

Examples of experimental studies:

  • Measuring the number of species in response to fertiliser application
  • Recording the weight of 10 people who have undergone a diet versus 10 who haven’t
  • Comparing the scores of BIOL501 students who have received different teaching resources

Observational versus experimental studies

Observational studies

  • do not involve (deliberate) manipulation
  • are often comparative or along a gradient
  • make use of naturally present ‘treatments’

Experimental studies

  • involve an (intended) manipulation (or ‘treatment’) of some sort
  • usually involve a ‘control group’ (a group of individuals not receiving the treatment)

Statistical models

  • A statistical model is a way of simplifying an observed process or mechanism

\(outcome_i = (model) + error_i\)

\(person_1 = 170 cm + 3 cm\)

\(person_{10} = 170 cm -1 cm\)

What can we use (statistical) models for?

Models can be used to test hypotheses, but they can also summarise information, or predict values where data are missing (predictive model).

For example:

  • Are x and y correlated? (hypothesis testing)
  • What is the mean of x, the mean of y? (summary information)
  • What is y when x = 5? (predictive model)
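
The three questions above can be tried out in R. This is a minimal sketch and not part of the lecture material: the values of `x` and `y` are invented for illustration, and `cor.test()`, `mean()`, `lm()` and `predict()` are just one possible way of answering each question.

```r
# Invented example data (not from the lecture): seven paired measurements
x <- c(1, 2, 3, 4, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 7.8, 12.1, 14.2, 15.8)

# 1. Hypothesis testing: are x and y correlated?
test <- cor.test(x, y)
test$estimate   # Pearson correlation coefficient
test$p.value    # p-value for the null hypothesis of no correlation

# 2. Summary information: the means of x and y
mean_x <- mean(x)
mean_y <- mean(y)

# 3. Predictive model: what is y when x = 5?
fit <- lm(y ~ x)                                     # straight-line model
y_at_5 <- predict(fit, newdata = data.frame(x = 5))  # x = 5 was never measured
y_at_5
```

Note that x = 5 is missing from the invented data, so the last step fills a gap, which is what a predictive model is for.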

The simplest statistical model

The mean!

  • The mean summarises data
  • The mean can be used to predict future outcomes of a variable (e.g. if on average 200 people per year died on the road in NZ over the past 3 years, we can use this value to predict the road toll in the following year)
  • The mean is a hypothetical value (i.e. it doesn’t have to be a value that actually exists in the data set, e.g. the mean of 192, 188, and 220 is 200).
  • We can test whether the mean differs from a certain value (e.g. is the road toll in NZ in 2012-2015 different from that in 1962-1965?)

As such, the mean is a simple statistical model.

The mean

  • The mean is the sum of all scores divided by the number of scores.
  • The mean is also the value from which the scores deviate least when the deviations are squared and summed (it has the least squared error).

\(mean(X) = \bar{X} = \frac{\sum\limits_{i=1}^n x_i}{n}\)

We all know the mean!
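
Both statements about the mean can be checked in R. The sketch below uses the three road-toll values from the previous slide (192, 188, 220). The helper `sum_sq_error()` is a name made up for this illustration and is not a built-in R function.

```r
# Road-toll values from the slide on the simplest statistical model
toll <- c(192, 188, 220)

# The mean "by hand": sum of all scores divided by the number of scores
mean_by_hand <- sum(toll) / length(toll)
mean_by_hand                      # same value as mean(toll)

# Sum of squared deviations of the scores from a candidate value
# (sum_sq_error is a helper name chosen for this sketch)
sum_sq_error <- function(candidate) sum((toll - candidate)^2)

sum_sq_error(mean_by_hand)        # error when the mean is the model
sum_sq_error(190)                 # a candidate below the mean
sum_sq_error(210)                 # a candidate above the mean
```

Any candidate other than the mean gives a larger sum of squared deviations. This is what ‘least error’ means here.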

The mean: simple example

  • Collect some data

\(1, 3, 4, 3, 2\)

  • Add them up:

\(\sum\limits_{i=1}^n x_i = 1 + 3 + 4 + 3 + 2 = 13\)

  • Divide by the number of scores, \(n\):

\(\bar{X} = \frac{\sum\limits_{i=1}^n x_i}{n} = \frac{13}{5} = 2.6\)

The mean is a statistical model!

\(outcome_i = (model) + error_i\)

\(outcome_{person1} = (model) + error_{person1}\)

\(1 = 2.6 + (-1.6)\)
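
The same decomposition can be computed for all five scores at once. A minimal sketch in R, using the example data above; the object names `scores`, `model` and `errors` are chosen for this illustration.

```r
scores <- c(1, 3, 4, 3, 2)        # the five scores from the example
model  <- mean(scores)            # the model: the mean of the scores

# error_i = outcome_i - model
errors <- scores - model
errors

# Putting model and error back together recovers every outcome
model + errors

# Errors around the mean cancel out (up to floating-point rounding)
sum(errors)
```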

Quick summary

  • What is a (statistical) model?
    • It is a summary/simplification of what is going on in reality
    • It can be used for a number of purposes: to test hypotheses, to predict, or to summarise
  • How is the mean a statistical model?
    • It summarises information contained in many data points
    • It can be used to make a prediction
    • It can be used to test a hypothesis
  • What does a model need?
    • Variables !

Variables

  • What is a variable?
    • A variable has a name, and values, e.g.:
| Variable name | Possible values     | Units         |
|---------------|---------------------|---------------|
| Smoker        | Yes / No or 0 / 1   | NA            |
| Time          | 4, 65, 9.4          | seconds       |
| Hair colour   | brown, blond, black | NA            |
| Concentration | 4.6, 1.9, 4.0       | mg L\(^{-1}\) |

Variables - easy?

How many variables? What are their names? What are their values?

Variables - easy?

We have two variables in this data set:

| Variable name | Values                    | Units |
|---------------|---------------------------|-------|
| Site          | 1, 2, 3                   | NA    |
| Soil moisture | 25.3, 16.4, 20.5, 17.2, … | Vol % |

Variables - easy?

Once properly organised, the data set should look like this:

| Soil moisture | Site |
|---------------|------|
| 25.3          | 1    |
| 16.4          | 1    |
| 20.5          | 1    |
| 17.2          | 2    |

Variable names always go in the top row, with the values listed underneath. Units are not needed in the table; they can be noted elsewhere!
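
In R, a table organised this way corresponds to a data frame. A minimal sketch using the four rows shown above; the object name `soil` and the column names `soil_moisture` and `site` are chosen for this illustration.

```r
# Values and sites as shown in the organised table above
soil <- data.frame(
  soil_moisture = c(25.3, 16.4, 20.5, 17.2),  # Vol %
  site          = c(1, 1, 1, 2)
)

soil          # one column per variable, one row per measurement
names(soil)   # the variable names
nrow(soil)    # the number of measurements
```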

Kinds of variables

  • Categorical (entities are divided into distinct categories):
    • Binary variable: There are only two categories, e.g. dead or alive, present or absent
    • Nominal variable: There are more than two categories, e.g. whether someone is an omnivore, vegetarian, vegan, or fruitarian
    • Ordinal variable: The same as a nominal variable, but the categories have a logical order, e.g. whether people got a fail, a pass, a merit or a distinction in their exam
  • Continuous (entities get a distinct score): e.g. human body height
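
These kinds of variables map onto R data types roughly as follows. This is a sketch, not the only way to do it: the object names and the height values are invented, and `factor()` with `ordered = TRUE` is one common way to store an ordinal variable.

```r
# Binary variable: two categories
status <- factor(c("alive", "dead", "alive"))

# Nominal variable: more than two categories, no order
diet <- factor(c("omnivore", "vegan", "vegetarian", "omnivore"))

# Ordinal variable: categories with a logical order
grade <- factor(c("pass", "fail", "distinction", "merit"),
                levels = c("fail", "pass", "merit", "distinction"),
                ordered = TRUE)

# Continuous variable: plain numbers (body height in cm, invented values)
height <- c(162.5, 178.1, 171.3)

levels(diet)          # categories are sorted alphabetically by default
grade < "merit"       # comparisons only make sense for ordered factors
is.numeric(height)
```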

Kinds of variables

  • ‘Predictor variable’, or ‘independent variable’
    • The proposed cause
    • A manipulated variable (in experiments)
    • The concentration of Coca-Cola in the example on the next slide
    • usually plotted on the x-axis
  • ‘Response variable’, or ‘dependent variable’
    • The proposed effect (‘outcome’)
    • Measured, not manipulated (in experiments)
    • Bacteria count in the example on the next slide
    • usually plotted on the y-axis

Example: Coca-Cola kills bacteria…

Hypothesis: Coca-Cola kills bacteria

Kinds of variables

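| Type of variable | Categorical (Binomial)  | Categorical (Nominal)        | Categorical (Ordinal)       | Continuous                     |
|------------------|-------------------------|------------------------------|-----------------------------|--------------------------------|
| Predictor        | smoker, sex, handedness | state of mind, gender        | age class, rank             | long jump results, body weight |
| Response         | survival, handedness    | employment type, hair colour | income bracket, clutch size | cholesterol level, body weight |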


A variable has a name and values. Examples:

  • Variable name: handedness, values: left, right
  • Variable name: body weight, values: 63.4, 88.2, …
  • A categorical predictor variable is often called a ‘factor’, its values ‘factor levels’

Variables: THE most important points in a nutshell

ANY data analysis procedure starts with these steps. Not following this protocol causes 80% of the problems encountered!

  1. Identify what the variables are (and what kind, e.g. continuous etc.)

  2. Put variable names in the first row of your data table

  3. Fill in the values for the variables

  4. Identify predictor and response variable(s)

  5. Plot the data

    Example on document camera!
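
The five steps can also be sketched in R, here for the Coca-Cola example. All values are invented for illustration, and the names `cola`, `concentration` and `bacteria` are chosen for this sketch.

```r
# Steps 1-3: variables as columns, names in the first row, values below.
# All values are invented for illustration.
cola <- data.frame(
  concentration = c(0, 0, 25, 25, 50, 50, 100, 100),  # % Coca-Cola, continuous
  bacteria      = c(98, 102, 81, 77, 55, 60, 12, 9)   # bacteria count
)
str(cola)   # check the kind of each variable

# Step 4: concentration is the predictor, bacteria count the response

# Step 5: plot the response (y-axis) against the predictor (x-axis)
plot(bacteria ~ concentration, data = cola,
     xlab = "Coca-Cola concentration (%)", ylab = "Bacteria count")
```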

Variability in data

  • Almost always occurs
    • e.g. if you measure your height 10 times
    • e.g. if you measure the height of 10 different people
    • e.g. if you measure the height of 10 men and then 10 women
  • The reasons why variability occurs differ, however!

Types of variation and their origin

  • Systematic Variation
    • Variation created by a specific experimental manipulation (e.g. administering a drug)
    • Variation introduced by unknown factors (both while applying a treatment and while measuring a variable)
    • The ‘direction’ of the variation tends to be the same, e.g. in your experiment, if you happen to have more older subjects, they will tend to introduce the same bias, and hence the variation is systematic.
  • Unsystematic Variation
    • Variation introduced by unknown factors (both while applying a treatment and while measuring a variable)
    • The ‘direction’ of the variation tends to be random, e.g. in your experiment, if you happen to get a more heterogeneous sample, subjects will inherently differ more, but not in a systematic way

Systematic and unsystematic variation

Signal versus noise: Systematic variation adds to the signal (but only if we know its origin!), unsystematic variation adds to the noise!
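
The difference between the two types of variation can be simulated in R. This is a sketch under stated assumptions: the true height of 170 cm, the random scatter with a standard deviation of 1 cm, and the faulty instrument adding 2 cm to every reading are all invented for illustration.

```r
set.seed(1)                       # make the random numbers reproducible

true_height <- 170                # assumed true value in cm (invented)
n <- 10                           # number of repeated measurements

# Unsystematic variation: random scatter around the true value
noise <- rnorm(n, mean = 0, sd = 1)
careful <- true_height + noise

# Systematic variation: a faulty instrument adds 2 cm to every reading
bias <- 2
faulty <- careful + bias

mean(careful)                     # close to 170: the noise largely averages out
mean(faulty) - mean(careful)      # the bias does not average out
sd(faulty) - sd(careful)          # the scatter is unchanged by the bias
```

Systematic variation shifts all values in the same direction, whereas unsystematic variation only spreads them out.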

Systematic and unsystematic variation: the mouse example

…recall what a ‘control group’ is!

By the way: what was that again with response and predictor variables…?

(Unwanted) systematic and unsystematic variation in the mouse experiment example

| Type of variation ↓ / Reason for variation → | Application of manipulation | Measurement error | Natural |
|---|---|---|---|
| Systematic | Syringe has wrong volume | Experimenter is using a faulty tape measure | A batch of mice has higher growth rates because they come from different origins |
| Unsystematic | Solution to be injected is not homogenous | Experimenter is tired, makes random mistakes | Natural variation in growth rates of mice, even though they originate from the same mother |

Systematic and unsystematic variation

  • Disentangling the two is the key task of statistics, e.g. by
    • reducing unsystematic variation
    • explaining systematic variation
    • taking several measurements (replication!)

We thus maximise the signal to noise ratio, i.e. we maximise our explanatory power

  • Mouse experiment example:
    • We are trying to minimise the ‘noise’ by
      • applying the injections properly
      • measuring the heart rates accurately
      • accounting for additional factors such as sex

Population versus sample

Consider these three experiments:

  1. You are asked to determine the germination rate of 1 kg of grass seeds. What is your sample? What is your population?

  2. You are asked to determine the germination rate of grass seeds ‘supergrass’ sold at the warehouse. What could be your sample? What is the population?

  3. You are asked to test the efficiency of a lung cancer treatment. What could be your sample? What is the population?

Population versus sample

  • In example (1), you are referring only to the 1 kg of grass seeds. This is your sample, but it is also the population you are making your inference about.

  • In (2), your population is all the grass seeds sold all over New Zealand during this season. Your sample could be one sachet of seeds per warehouse branch.

  • In (3), your population is possibly all present and future lung cancer patients globally. Your population is fictitious. Your sample may be 20 lung cancer patients in Auckland.

When selecting your sample, think of the population you would like to make an inference about!
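
Drawing a sample from a population can be imitated in R with `sample()`. This is a sketch with an invented population: the 10,000 seeds and the 80 % germination rate are assumptions made for illustration, not figures from the lecture.

```r
set.seed(42)                      # reproducible random draws

# Hypothetical population: 10,000 seeds, 1 = germinates, 0 = does not.
# The 80 % germination rate is an assumption made for this sketch.
population <- rep(c(1, 0), times = c(8000, 2000))

# Draw a random sample of 100 seeds without replacement
seed_sample <- sample(population, size = 100)

mean(population)    # germination rate in the population (usually unknown)
mean(seed_sample)   # germination rate in the sample: our estimate
```

In practice only the sample is available, and its germination rate is used to make an inference about the population.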

How to create variables in R

x = c('red', 'blue', 'yellow', 'yellow', 'red', 'blue', 'blue')
x  # nominal variable (categorical)
#> [1] "red"    "blue"   "yellow" "yellow" "red"    "blue"   "blue"
offspring = c(4, 5, 2, 0, 1, 1, 3)
offspring  # ordinal variable (categorical)
#> [1] 4 5 2 0 1 1 3
size = c(3.87, 5.1, 9.63, 3.11, 4.28, 5.34, 2.19)
size  # continuous variable
#> [1] 3.87 5.10 9.63 3.11 4.28 5.34 2.19

And a few tricks to create variables in R

x = rep(c('red', 'blue'), times = 3)  # repeat the whole vector 3 times
x
#> [1] "red"  "blue" "red"  "blue" "red"  "blue"
y = rep(c(2, 4, 6), each = 3)  # repeat each value 3 times
y
#> [1] 2 2 2 4 4 4 6 6 6

Use the function rep() with the argument times or each to specify how the values should be repeated.
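
A typical use of `rep()` is to build the predictor columns of a balanced experiment. A minimal sketch: the treatment names, the number of replicates and the object names `treatment`, `replicate_id` and `design` are all invented for illustration.

```r
# Three treatment groups with four replicates each (names are invented)
treatment <- rep(c("control", "low", "high"), each = 4)

# Replicate number within each group: 1, 2, 3, 4, 1, 2, ...
replicate_id <- rep(1:4, times = 3)

# Combine into a data frame; the treatment becomes a factor
design <- data.frame(treatment = factor(treatment), replicate_id)
design
table(design$treatment)   # four rows per factor level
```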

Things you should do this (and next) week

  • If you have a laptop: Install R, then install RStudio
  • Start ‘playing’ with R: In the console (the ‘boxing ring’) and using Rmarkdown
  • Do your lab reports and ADD to them: make them your course manual
  • Read up on the theory. If slides are not enough, get other resources, e.g. free online resources. Use the following slides (concepts, keywords)
  • Ask your questions during labs/tutorials

What will we have learnt in week 2?

  • What is the scientific research process
  • Scientific and non-scientific hypotheses
  • What is a hypothesis that can be falsified?
  • What is an experiment? What is an observational study?
  • What are response and predictor variables, what is a ‘treatment’ and a ‘control’
  • How to characterise a variable: its name and values
  • Identify whether it is categorical (binomial, nominal, ordinal?) or continuous
  • Identify whether it is a predictor or a response variable
  • Identify systematic vs. unsystematic variation in data, what are its sources?
  • What is the signal-noise ratio? How can we increase it?
  • Identify the difference between a sample and a population
  • How to get started with R / R Markdown

Glossary week 2

  • theory
  • hypothesis (plural hypotheses)
  • falsification, falsifiable
  • experiment versus observation
  • model, observation/score, error
  • variable, variable name, values
  • variation (systematic and unsystematic)
  • predictor, response variable

Glossary week 2

  • dependent, independent variable
  • treatment variable, ‘treatment’
  • control, control group
  • continuous/categorical variables
  • binomial, nominal, ordinal variables
  • signal/noise ratio
  • factor, factor levels
  • population, sample