Biological Sampling and Interpretation

August 1, 2022

Timetable first half (tentative)

Week 3

Date	Week	Lab	Assignment	Topic
18.7.	1	-	-	Introduction, R, Rmarkdown
25.7.	2	1	-	Hypotheses, variables, variation
1.8.	3	2	-	Organising data, measuring variation
8.8.	4	3	-	Distributions, quartiles, quantiles, probabilities
15.8.	5	4	1	Type I and type II error (Assignment 1 due Friday)
22.8.	6	5	-	The t-test

What we will learn today

Recap of last week, examples
How to organise and import your data in R (that’s often 50% of your job done!)
Measuring variability graphically
- frequency distributions, histograms, box plots
Measuring variability numerically
- the standard deviation
- the variance
- degrees of freedom
- population vs. sample

Recap from last week

What is the scientific research process?
Scientific and non-scientific hypotheses
What is a hypothesis that can be falsified?
How to get started with R / R Markdown
What are response and predictor variables? What is a ‘treatment’ and a ‘control’?
Binomial, nominal, ordinal, and continuous variables, factors
Systematic vs. unsystematic variation in data, what are the sources?
Experiments vs. observations
What is the signal-noise ratio? How can we increase it?
What is a sample? What is a population?

Recap using these simple examples

Two studies are shown. Are they observational or experimental studies?
What could the scientific questions behind the studies be?
What would be good scientific hypotheses?
What are the variable names and possible values in the two studies?
Which variables are predictor/response variables?
At what point during the studies could unsystematic variation be introduced?
At what point during the studies could systematic variation be introduced?
What do the properly organised data sets look like?
(For later: what statistical models could you use to analyse these data sets?)

The signal - noise ratio

We are always trying to maximise the signal to noise ratio!

Introducing additional variables can help shift variation from ‘noise’ to ‘signal’

Example questions for the test/exam

Describe a study which uses 2 predictor variables (one continuous, one categorical) and one binomial response variable

An experiment with fish tanks, with each set of tanks kept at a different water temperature (e.g. 18, 20, and 22 degrees, 3 tanks each; this is the categorical predictor variable). Each tank holds 5 fish, whose size is measured at the start of the experiment (continuous predictor variable). At the end of the experiment, the fish are tested for a certain bacterium (presence/absence, your binomial response). The study asks whether fish size and/or water temperature influence the presence of the bacterium.

Example questions for the test/exam

Create all variables of the above example.

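tank = c(18, 18, 18, 18, 18, 18, ...) #water temperature for each fish, or shorter:
tank = rep(c(18, 20, 22), each = 15) #3 tanks x 5 fish = 15 fish per temperature
size = c(3.5, 4.2, 1.9, 2.3, ...) #fish size at the start of the experiment
bact = c(0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, ...) #bacteria: 0 = absent, 1 = present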
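
The vectors above are abbreviated with '...'. A runnable sketch of the same design, assuming simulated placeholder values for size and bacteria (they are made up, not data from the study) and using the hypothetical names temp, tank_id and fish:

```r
set.seed(1)  # makes the made-up values reproducible

# categorical predictor: 3 temperatures x 3 tanks x 5 fish = 45 values
temp = rep(c(18, 20, 22), each = 15)
# tank identity (1 to 9), 5 fish per tank
tank_id = rep(1:9, each = 5)
# continuous predictor: fish size, simulated placeholder values
size = round(runif(45, min = 1.5, max = 5), 1)
# binomial response: bacteria absent (0) or present (1), simulated
bact = rbinom(45, size = 1, prob = 0.5)

# one row per fish, one column per variable
fish = data.frame(tank_id, temp, size, bact)
head(fish)
```

Keeping a separate tank identifier makes it possible to tell the three tanks at each temperature apart later on.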

Example questions for the test/exam

Explain why unsystematic variation in data can prevent us from detecting a signal

Unsystematic variation is the ‘noise’ in the signal/noise ratio: the more of it there is, the larger the denominator and the smaller the ratio. This makes it harder to detect the signal (the explained, systematic variation).

Example questions for the test/exam

Explain 2 sources of systematic variation in the above example

The water temperature can be expected to introduce systematic variation. This is variation that we can control for (it is our treatment).

Sources of (unwanted, unexplained) systematic variation could be:

If one tank receives more light than all other tanks
If fish in some tanks are tested for bacteria by one person, and the others are tested by a different person

How to organise your data

Variable names
- short, consistent, unique
- NO special characters (%, $, &…) and NO spaces
- period / underscore are ok
- Examples:
  - ‘% damage in leaves’
  - ‘Percent_damage_in_leaves’
  - ‘damage’
  - ‘pcd’
Variable values
- Example: variable ‘age class’
  - c(‘1-2’, ‘3-4’, ‘5-6’, ‘>6’)
  - c(1, 2, 3, 4)
Wide vs. long format
- Use the long format, even before you import your data into R!
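
One way to check the naming rules above is base R's make.names(), which converts any string into a syntactically valid name. A minimal sketch using the four example names from the list; the object names candidates and valid are illustrative only:

```r
# candidate variable names from the examples above
candidates = c("% damage in leaves", "Percent_damage_in_leaves",
               "damage", "pcd")

# make.names() returns the syntactically valid version of each name;
# a name is already valid if make.names() leaves it unchanged
valid = make.names(candidates) == candidates
data.frame(candidates, valid)
```

A name can be valid and still be too long to type comfortably, so this only checks the 'no special characters, no spaces' rule.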

Wide vs. long format
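
The two layouts can be sketched with a tiny made-up data set. The values and the names wide, long, growth and treatment are assumptions for illustration; stack() is one base R way to go from wide to long:

```r
# wide format: one column per treatment group (made-up values)
wide = data.frame(control = c(4.1, 3.8, 4.5),
                  treated = c(5.2, 5.9, 5.5))
wide

# long format: one row per measurement,
# one column for the value and one for the group
long = stack(wide)
names(long) = c("growth", "treatment")
long
```

In the long format every variable has its own column, which is the layout that functions such as boxplot() and tapply() expect.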

Creating a string or character variable

We use the c() function and put every value in quotation marks so that R knows that they are text
As such, we can create a variable called name as follows:

name = c("Ben", "Martin", "Andy", "Pauline", "Eva", "Carina")
name
[1] "Ben"     "Martin"  "Andy"    "Pauline" "Eva"     "Carina"

It does not matter whether you use single or double quotes
Note that if you don’t type ‘name’ again, then the variable is not displayed
What does the ‘[1]’ mean in front of the variable?

Creating a categorical variable (a factor in case of a predictor)

Imagine we had 3 males and 3 females in a data set and we wanted to create a coding variable called ‘sex’, where 1 means male and 2 means female.
Enter the data:

sex = c(1, 1, 1, 2, 2, 2) #or, easier:
sex = rep(c(1, 2), each = 3) #or, even sleeker:
sex = rep(1:2, each = 3)
sex
[1] 1 1 1 2 2 2

Where can we insert a space in the R code? Where should we?
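
The coding variable is still just numbers to R. A minimal sketch of turning it into a factor with readable labels, assuming the coding from above (1 = male, 2 = female); the name sex_f is illustrative:

```r
sex = rep(1:2, each = 3)

# attach labels to the codes: 1 = male, 2 = female
sex_f = factor(sex, levels = c(1, 2), labels = c("male", "female"))
sex_f
levels(sex_f)  # the categories R now knows about
table(sex_f)   # number of observations per category
```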

Creating a numeric variable

Numeric variables are the easiest ones to create:

alcohol = c(0.75, 1.2, 2.4, 0.23, 0.9, 1.36) #standard drinks/day
income = c(58000, 38000, 28000, 63000, 90500, 17000)
alcohol
[1] 0.75 1.20 2.40 0.23 0.90 1.36
income
[1] 58000 38000 28000 63000 90500 17000

What does the ‘#’ sign do again?
Are those continuous or categorical variables?
Are they predictor or response variables?

A few tricks using seq()

c(1, 2, 3, 4, 5)
[1] 1 2 3 4 5
1:5 #easier
[1] 1 2 3 4 5
seq(from = 1, to = 5, length.out = 5) #if you want a certain length
[1] 1 2 3 4 5
seq(from = 0, to = 1, length.out = 11)
 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(from = 1, to = 3, by = .3) #if you want a certain increment
[1] 1.0 1.3 1.6 1.9 2.2 2.5 2.8
seq(1, 3, .3) #no argument names needed if you respect the order
[1] 1.0 1.3 1.6 1.9 2.2 2.5 2.8

A few tricks using rep()

rep(1:5, each = 2) 
 [1] 1 1 2 2 3 3 4 4 5 5
rep(1:5, times = 2) #note the arguments 'each' and 'times'
 [1] 1 2 3 4 5 1 2 3 4 5
rep(1:5, 2) #the default is 'times' (if you don't specify)
 [1] 1 2 3 4 5 1 2 3 4 5

rep(c('M', 'F'), each = 5)
 [1] "M" "M" "M" "M" "M" "F" "F" "F" "F" "F"

Creating a data frame (a table) in R

We can combine variables into a data frame:

d1 = data.frame(name, alcohol, income, sex)
d1
     name alcohol income sex
1     Ben    0.75  58000   1
2  Martin    1.20  38000   1
3    Andy    2.40  28000   1
4 Pauline    0.23  63000   2
5     Eva    0.90  90500   2
6  Carina    1.36  17000   2

What are the dimensions of the data frame?
Why call it ‘d1’? Can we call it something else…?

Working with data frames in R

As soon as our data sets (called data frames in R) become larger (> 20 rows), we will want to be able to:

Quickly look at the content without showing every single line/column
Only show a certain part of the data frame
Aggregate a data frame according to a grouping variable
Add a variable (a column) to a data frame
…and eventually much much more, e.g. search and replace certain string patterns, apply algorithms to every row, … (this is not part of BIOL501)

Simple exploratory commands using the data set ‘iris’

The commands below are not evaluated here (you don’t see what R does). Try them on your own device!

#the iris object (data set comes with R, it's already there for you!)
summary(iris) #summary() is very generic, try it on anything!
head(iris) #shows the first few lines of your data
tail(iris) #shows the last few lines of your data
plot(iris) #try to interpret this plot!
iris$Sepal.Length #access one variable in the data frame
head(iris$Sepal.Length) #access the first few values of one variable
#in a data frame

The ‘$’ symbol accesses variables contained inside a data frame. Here we extract the variable ‘Sepal.Length’ from ‘iris’.

Subsetting a data frame using the ‘iris’ example

Extracting part of a data frame is called ‘subsetting’ and can be done in many ways. Here is one using [] (row selection before, column selection after the comma):

iris[4, 2] #show the fourth value (row) of the second column
iris[4, ] #show the fourth row of all columns
iris[, 'Species'] #show all rows for column 'Species'
iris[c(3, 16), c('Species', 'Petal.Length')]
#rows 3 and 16 of columns 'Species' and 'Petal.Length'
iris[iris$Species == 'virginica', ] #all rows of species 'virginica'
iris[iris$Sepal.Length > 6, ] #all rows where Sepal.Length > 6
iris[iris$Sepal.Length > 6 & iris$Species == 'virginica', ]
#all rows where Sepal.Length > 6 AND species is 'virginica'

Aggregating a data frame

You can do this in many ways! Here is one way:

#calculate the mean petal length per species:
tapply(iris$Petal.Length, iris$Species, mean)

    setosa versicolor  virginica 
     1.462      4.260      5.552

#extract the maximum value of petal length per species:
tapply(iris$Petal.Length, iris$Species, max)

    setosa versicolor  virginica 
       1.9        5.1        6.9
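
Another of the many ways, sketched here as an alternative rather than a replacement: aggregate() with a formula returns a data frame instead of a named vector. The object name means is illustrative:

```r
# mean petal length per species, returned as a data frame
means = aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
means
```

The formula reads 'petal length grouped by species', the same '~' notation used for box plots.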

Adding variables to a data frame

You can do this in many ways! Here is one way:

#adding a variable
iris$newVariable = 1
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species newVariable
1          5.1         3.5          1.4         0.2  setosa           1
2          4.9         3.0          1.4         0.2  setosa           1
3          4.7         3.2          1.3         0.2  setosa           1
4          4.6         3.1          1.5         0.2  setosa           1
5          5.0         3.6          1.4         0.2  setosa           1
6          5.4         3.9          1.7         0.4  setosa           1
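
A new variable can also be computed from existing columns. A small sketch, assuming a hypothetical column name Petal.Ratio and working on a copy (iris2) so that the built-in data set stays untouched:

```r
iris2 = iris  # copy of the built-in data set

# new column computed row by row from two existing columns
iris2$Petal.Ratio = iris2$Petal.Length / iris2$Petal.Width
head(iris2[, c("Species", "Petal.Length", "Petal.Width", "Petal.Ratio")])
```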

What’s the workspace/working directory in R?

How to set up your working directory in R

Say you have a folder in ‘My Documents’ called ‘BIOL501’ where your data files are and where your output should go:

Either click on ‘Session’, then ‘Set Working Directory’, then ‘Choose Directory…’ and select the above folder
Or even better: include a setwd() (‘set working directory’) command in a chunk that you place at the beginning of your R Markdown document:

setwd("C:/Documents/BIOL501") #adjust the path to your system!

We can now access files in that folder directly. For example: myData = read.csv("data.csv")

How to import data into R
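
A minimal sketch of a CSV import. To keep it self-contained, it first writes a small made-up file to a temporary location; in practice the file would be your own spreadsheet export, and the names path and myData are illustrative:

```r
# write a small example file to a temporary location
# (stands in for a CSV file exported from a spreadsheet)
path = tempfile(fileext = ".csv")
write.csv(data.frame(name = c("Ben", "Eva"), income = c(58000, 90500)),
          file = path, row.names = FALSE)

# import the file; the first line is used for the variable names
myData = read.csv(path)
str(myData)  # check variable names and types after every import
```

Checking the result with str() or head() right after importing catches most problems (wrong separator, numbers read as text) early.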

Graphically measuring the variation of a (continuous) variable

The histogram! Always THE first thing to look at:

hist(x)
hist(y)

What’s a histogram, what is it used for?

A histogram is used to show the distribution of a single (mostly continuous) variable
The x-axis represents the ‘bins’: a continuous variable is ‘categorised’ into a number of bins (about 10-20)
The y-axis is the frequency, i.e. how many values fall in a given bin.
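
The bins and frequencies described above can be inspected directly, because hist() returns them. A sketch using the built-in iris data; the object name h is illustrative:

```r
# histogram of sepal length with roughly 10 bins;
# 'breaks' is only a suggestion, R picks nearby round cut points
h = hist(iris$Sepal.Length, breaks = 10,
         xlab = "Sepal length (cm)", main = "")

h$breaks  # the bin boundaries on the x-axis
h$counts  # the frequencies on the y-axis
sum(h$counts) == nrow(iris)  # every value falls in exactly one bin
```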

Graphically measuring the variation of a (continuous) variable

The box plot! Very useful when you want to look at a continuous variable that is grouped by another (categorical) variable. What values does the box show (look it up)?

boxplot(d1$income ~ d1$sex) #note the use of '$' and '~'!
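
The same kind of plot for a larger data set, sketched with the built-in iris data, together with the five numbers that boxplot.stats() reports for one group (end of lower whisker, lower hinge, median, upper hinge, end of upper whisker). The names setosa and b are illustrative:

```r
# one box per species; 'data =' avoids repeating 'iris$'
boxplot(Petal.Length ~ Species, data = iris)

# the five numbers behind the box and whiskers of one group
setosa = iris$Petal.Length[iris$Species == "setosa"]
b = boxplot.stats(setosa)$stats
b
```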

Measuring variation with numbers

A perfect fit (rare!):

Measuring variation with numbers

More often it looks like this:

Calculating ‘error’

A deviation is the difference between the mean and an actual data point.
Deviations can be calculated by taking each score and subtracting the mean from it:

\[deviation = x_i - \bar{x}\]

NB: ‘Deviation’ is called ‘residual’ in linear models

Calculating ‘error’

What do you think? How could we compute a number that is large when the variation is large, and small when the variation is small?

Use the total error?

We could just sum up the deviations of the data from the mean

score	mean	deviation
1	2.6	-1.6
2	2.6	-0.6
3	2.6	0.4
3	2.6	0.4
4	2.6	1.4
	Total	0

\[\sum(x_i - \bar{x}) = 0\]

The sum of squared errors

The problem with summing up deviations is that they cancel out because some are positive and others negative
Therefore, we square each deviation.
If we add these squared deviations we get the sum of squared errors (SS).

\[SS = \sum(x_i - \bar{x})^2\]

Sum of squared errors

score	mean	deviation	squared deviation
1	2.6	-1.6	2.56
2	2.6	-0.6	0.36
3	2.6	0.4	0.16
3	2.6	0.4	0.16
4	2.6	1.4	1.96
	Total	0	5.2

\[SS = \sum(x_i - \bar{x})^2 = 5.2\]

Variance

The sum of squares is a good measure of overall variability, but it is dependent on the number of scores/values
We calculate the average variability by dividing by the number of scores ($n$) minus 1.
This value is called the variance ($s^2$).

\[variance = s^2 = \frac{SS}{n-1} = \frac{\sum(x_i-\bar{x})^2}{n-1} = \frac{5.2}{4} = 1.3\]
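
The calculation can be retraced step by step in R. A minimal sketch using the five scores from the table; the object names dev, SS and n are illustrative:

```r
friends = c(1, 2, 3, 3, 4)  # the five scores from the table

dev = friends - mean(friends)  # deviations from the mean (2.6)
SS = sum(dev^2)                # sum of squared errors
n = length(friends)            # number of scores

SS / (n - 1)   # variance computed by hand
var(friends)   # R's built-in function, same value
```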

Standard deviation

The variance has one problem: it is measured in units squared
This isn’t a very meaningful metric, so we take the square root
This is the standard deviation ($s$, sometimes $sd$):

\[s = \sqrt{\frac{\sum(x_i-\bar{x})^2}{n-1}} = \sqrt{\frac{5.2}{4}} = 1.14\]

In R:

friends = c(1, 2, 3, 3, 4)
sd(friends)
[1] 1.140175

Same mean, different standard deviation

Sample standard deviation: why divide by n-1?

NB: usually, the sample standard deviation is called $s$, while the population standard deviation is called $\sigma$
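
Why n-1 rather than n can be illustrated with a small simulation. This is a sketch under assumed settings (a normal population with mean 10 and variance 4, samples of size 5, 10000 repetitions), none of which come from the slides:

```r
set.seed(42)  # reproducible random numbers

n = 5  # sample size

# draw one sample and estimate the variance in both ways
one_sample = function() {
  x = rnorm(n, mean = 10, sd = 2)  # population variance is 2^2 = 4
  ss = sum((x - mean(x))^2)
  c(by_n = ss / n, by_n_minus_1 = ss / (n - 1))
}

# repeat the sampling many times and average both estimates
est = replicate(10000, one_sample())
avg = rowMeans(est)
avg
```

On average, dividing by n underestimates the population variance, while dividing by n-1 comes out close to the true value of 4.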

Degrees of freedom

Calculating metrics (e.g. the mean) ‘costs’ you degrees of freedom!
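
A small sketch of what this 'cost' means, using the five scores from the standard deviation example: once the mean is known, only n-1 values are free to vary, because the last one is fixed by the others. The names known and last are illustrative:

```r
friends = c(1, 2, 3, 3, 4)
n = length(friends)
m = mean(friends)

# knowing the mean and any n - 1 values fixes the remaining value
known = friends[1:(n - 1)]
last = n * m - sum(known)
last  # matches friends[n]
```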

What will we have learnt in Week 3?

Understand an experimental/observational context, anticipate sources of variation
Organise and import your data, create a data frame in R
Measuring variability graphically
- draw a histogram and interpret it
Computing and understanding
- the sum of squared errors
- the variance
- the standard deviation
Simple data aggregation

Glossary Week 3

comma separated values (csv files)
wide vs. long data format
value or score
string or character variable
categorical variable
coding variable
numeric variable
histogram
box plot, whisker plot
deviation
sum of squared errors (sum of squares)
variance
standard deviation
degrees of freedom