Biological Sampling and Interpretation

August 4, 2015

Recap from last week

Week 3

What is the scientific research process?
Scientific and non-scientific hypotheses
What is a hypothesis that can be falsified?
How to get started with R / R Markdown
What are response and predictor variables? What is a 'treatment' and what is a 'control'?
Binomial, nominal, ordinal, and continuous variables
Predictor vs. response variables
Systematic vs. unsystematic variation in data, what are its sources?
What is the signal-noise ratio? How can we increase it?
PLEASE: lab reports (filenames!)

Timetable first half (tentative)

Date	Week	Lab	Topic
21.7.	1	-	Introduction, R, Rmarkdown
28.7.	2	1	Hypotheses, variables, variation
4.8.	3	2	Data, measuring variation, the normal distribution
11.8.	4	3	Quartiles, quantiles, probabilities
18.8.	5	4	Type I and type II error
25.8.	6	5	Week of midsemester test

Example questions for the test/exam

Describe a study which uses 2 predictor variables, one continuous, one categorical, and one binomial response variable
In R, create three variables of the same length, one continuous, one categorical and one binomial
Explain why unsystematic variation in data can prevent us from detecting a signal
Explain 2 sources of systematic variation

What we will learn today

How to organise and import your data in R (that’s often 50% of your job done!)
Measuring variability graphically
- frequency distributions, histograms
Measuring variability numerically
- the standard deviation
- the standard error
- degrees of freedom
- population vs. sample
The normal distribution
- the mean, the median and the mode
- quartiles, quantiles
- confidence intervals

How to organise your data

Variable names
- short, consistent, unique
- NO special characters (%, $, &…) and NO spaces
- period / underscore are ok
- Examples:
  - ‘% damage in leaves’
  - ‘Percent_damage_in_leaves’
  - ‘damage’
  - ‘pcd’
Variable values
- Example: variable ‘age class’
  - c("1-2", "3-4", "5-6", ">6")
  - c(1, 2, 3, 4)
Wide vs. long format
- Put your data into the long format before you import it into R!

Wide vs. long format
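
A minimal sketch of the difference, using made-up plant heights: the data frame `wide` and all of its column names are hypothetical and not part of the course data. Base R's `reshape()` is one possible way to convert; rearranging the spreadsheet by hand before import works just as well.

```r
# Wide format: one row per plot, one column per measurement week
# (hypothetical example data)
wide <- data.frame(
  plot   = c("A", "B", "C"),
  week_1 = c(12, 15, 9),
  week_2 = c(14, 18, 11)
)

# Long format: one row per single measurement
long <- reshape(
  wide,
  direction = "long",
  varying   = c("week_1", "week_2"),  # columns to stack
  v.names   = "height",               # name of the stacked value column
  timevar   = "week",                 # new column saying which week
  times     = c(1, 2),
  idvar     = "plot"
)
rownames(long) <- NULL
long
```

In the long format every variable (plot, week, height) has its own column, which is the layout most R functions expect.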

Creating a string or character variable

We use the c() function and put every value in quotation marks so that R knows that it is string data.
For example, we can create a variable called name as follows:

name <- c("Ben", "Martin", "Andy", "Pauline", "Eva", "Carina")
name
[1] "Ben"     "Martin"  "Andy"    "Pauline" "Eva"     "Carina"

It does not matter whether you use single or double quotes
Note that if you don't type 'name' again, then the variable is not displayed
What does the '[1]' mean in front of the variable?
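
A small sketch that may help with the last question: the number in square brackets is the position (index) of the first value printed on that line. The vector `x` is made up for illustration only.

```r
name <- c("Ben", "Martin", "Andy", "Pauline", "Eva", "Carina")

class(name)    # "character": R stores the values as text
length(name)   # number of values in the vector
name[2]        # the value at position 2

# A longer vector wraps over several lines when printed; each line
# starts with the index of its first value
x <- 101:140
x
```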

Creating a categorical variable (a factor in case of a predictor)

Imagine we had 3 males and 3 females in a data set and we wanted to create a coding variable called 'gender', where 1 means male and 2 means female.
Enter the data:

gender <- c(1, 1, 1, 2, 2, 2)

Where can we leave a space in the R code, where can't we?
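
As entered above, 'gender' is still just a vector of numbers. One way to tell R that the codes are categories is `factor()`; this is a sketch, and the name `gender_f` is an assumption chosen here so that the original variable is not overwritten.

```r
gender <- c(1, 1, 1, 2, 2, 2)

# Turn the numeric codes into a factor with readable labels
gender_f <- factor(gender, levels = c(1, 2), labels = c("male", "female"))

gender_f           # values are now shown as labels
levels(gender_f)   # the categories R knows about
table(gender_f)    # how many observations per category
```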

Creating a numeric variable

Numeric variables are the easiest ones to create:

alcohol <- c(0.75, 1.2, 2.4, 0.23, 0.9, 1.36) #standard drinks/day
income <- c(58000, 38000, 28000, 63000, 90500, 17000)
alcohol
[1] 0.75 1.20 2.40 0.23 0.90 1.36
income
[1] 58000 38000 28000 63000 90500 17000

What does the '#' sign do again?
Are those continuous or categorical variables?
Are they predictor or response variables?

Creating a data frame (a table) in R

We can combine variables into a data frame:

d1 <- data.frame(name, alcohol, income, gender)
d1
     name alcohol income gender
1     Ben    0.75  58000      1
2  Martin    1.20  38000      1
3    Andy    2.40  28000      1
4 Pauline    0.23  63000      2
5     Eva    0.90  90500      2
6  Carina    1.36  17000      2

What are the dimensions of the data frame?
Why call it 'd1'? Can we call it something else…?
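
A short sketch of how the first question can be checked in R. It rebuilds the data frame from the variables defined earlier so that the block runs on its own.

```r
name    <- c("Ben", "Martin", "Andy", "Pauline", "Eva", "Carina")
alcohol <- c(0.75, 1.2, 2.4, 0.23, 0.9, 1.36)  # standard drinks/day
income  <- c(58000, 38000, 28000, 63000, 90500, 17000)
gender  <- c(1, 1, 1, 2, 2, 2)

d1 <- data.frame(name, alcohol, income, gender)

dim(d1)     # number of rows, then number of columns
nrow(d1)    # rows only
ncol(d1)    # columns only
str(d1)     # one line per variable, with its type
d1$income   # a single column, accessed with '$'
```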

What's the workspace/working directory in R?

How to set up your working directory in R

Say you have a folder in ‘My Documents’ called ‘BIOL501’ that holds your data files and is where your output should go:

Either click on 'Session', then 'Set Working Directory', then 'Choose Directory…' and select the above folder
Or even better: include a setwd() ('set working directory') command in a chunk that you place at the beginning of your R Markdown document:

setwd("C:/Documents/BIOL501") #adjust path to your system!

We can now access files in that folder directly. For example: myData <- read.csv("data.csv")
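
A self-contained sketch of the import step. So that it runs anywhere, it first writes a tiny made-up csv file into R's temporary folder; the file contents and the name `csv_file` are assumptions for illustration, and in practice the file would already sit in your working directory.

```r
# Create a small example file (hypothetical data)
csv_file <- file.path(tempdir(), "data.csv")
write.csv(
  data.frame(name = c("Ben", "Eva"), income = c(58000, 90500)),
  csv_file,
  row.names = FALSE
)

# Import it and check that it arrived as expected
myData <- read.csv(csv_file)
str(myData)    # variable names and types
head(myData)   # first rows
```

Looking at `str()` and `head()` right after importing is a quick way to catch wrongly named columns or numbers that were read as text.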

How to import data into R

Graphically measuring the variation of a (continuous) variable

The histogram! Always THE first thing to look at:

hist(x)
hist(y)
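
In the calls above, x and y stand for any numeric variable. As a runnable sketch, here is the same idea with the alcohol values from the earlier slide; the title and axis label are choices made for this example.

```r
alcohol <- c(0.75, 1.2, 2.4, 0.23, 0.9, 1.36)  # standard drinks/day

# Draw the histogram and keep the object that hist() returns
h <- hist(alcohol,
          main = "Alcohol consumption",
          xlab = "Standard drinks/day")

h$breaks   # the bin boundaries R chose
h$counts   # how many values fall into each bin
```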

Graphically measuring the variation of a (continuous) variable

The box plot! Very useful when you want to look at a continuous variable that is grouped by another (categorical) variable:

boxplot(d1$income ~ d1$gender) #note the use of '$' and '~'!
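
An equivalent way to write the call above, sketched with the income and gender values from the earlier slides: the `data =` argument saves typing 'd1$' twice. The axis labels and the name `group_medians` are assumptions added for this example.

```r
income <- c(58000, 38000, 28000, 63000, 90500, 17000)
gender <- c(1, 1, 1, 2, 2, 2)
d1 <- data.frame(income, gender)

# Same plot, using the formula together with 'data ='
boxplot(income ~ gender, data = d1,
        xlab = "Gender (1 = male, 2 = female)",
        ylab = "Income")

# The thick line in each box is the group median
group_medians <- tapply(d1$income, d1$gender, median)
group_medians
```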

Measuring variation with numbers

A perfect fit (rare!):

Measuring variation with numbers

More often it looks like this:

Calculating 'error'

A deviation is the difference between the mean and an actual data point.
Deviations can be calculated by taking each score and subtracting the mean from it:

\[\text{deviation} = x_i - \bar{x}\]

NB: 'Deviation' is called 'residual' in linear models

Calculating 'error'

What do you think? How could we compute a number that is large when the variation is large, and small when the variation is small?

Use the total error?

We could just sum up the deviations of the data from the mean:

score	mean	deviation
1	2.6	-1.6
2	2.6	-0.6
3	2.6	0.4
3	2.6	0.4
4	2.6	1.4
	Total	0

\[\sum(x_i - \bar{x}) = 0\]
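
The table above can be reproduced in R as a quick check. This is a sketch; the variable names `score` and `deviation` simply follow the column headings.

```r
score <- c(1, 2, 3, 3, 4)

mean(score)                        # 2.6

# Subtract the mean from every score
deviation <- score - mean(score)
deviation                          # -1.6 -0.6  0.4  0.4  1.4

# Positive and negative deviations cancel out
# (the result is zero up to floating-point rounding)
sum(deviation)
```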

The sum of squared errors

The problem with summing up deviations is that they cancel out because some are positive and others negative
Therefore, we square each deviation.
If we add these squared deviations we get the sum of squared errors (SS).

\[SS = \sum(x_i - \bar{x})^2\]

Sum of squared errors

score	mean	deviation	squared deviation
1	2.6	-1.6	2.56
2	2.6	-0.6	0.36
3	2.6	0.4	0.16
3	2.6	0.4	0.16
4	2.6	1.4	1.96
	Total	0	5.2

\[SS = \sum(x_i - \bar{x})^2 = 5.2\]
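
The same calculation as a sketch in R, with variable names that follow the table headings:

```r
score <- c(1, 2, 3, 3, 4)

# Square each deviation so that negative values cannot cancel positive ones
squared_deviation <- (score - mean(score))^2
squared_deviation          # 2.56 0.36 0.16 0.16 1.96

# Sum of squared errors
SS <- sum(squared_deviation)
SS                         # 5.2
```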

Variance

The sum of squares is a good measure of overall variability, but it is dependent on the number of scores/values
We calculate the average variability by dividing by the number of scores ($n$) minus 1.
This value is called the variance ($s^2$).

\[\text{variance } (s^2) = \frac{SS}{n-1} = \frac{\sum(x_i-\bar{x})^2}{n-1} = \frac{5.2}{4} = 1.3\]
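
A sketch that computes the variance step by step and compares it with R's built-in `var()`, which also divides by n - 1. The name `variance_by_hand` is chosen for this example.

```r
score <- c(1, 2, 3, 3, 4)

n  <- length(score)                 # number of scores
SS <- sum((score - mean(score))^2)  # sum of squared errors

variance_by_hand <- SS / (n - 1)
variance_by_hand                    # 1.3

var(score)                          # built-in function, same result
```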

Standard deviation

The variance has one problem: it is measured in units squared
Squared units are hard to interpret, so we take the square root of the variance
This is the standard deviation ($s$, sometimes $sd$):

\[s = \sqrt{\frac{\sum(x_i-\bar{x})^2}{n-1}} = \sqrt{\frac{5.2}{4}} = 1.14\]

In R:

friends <- c(1, 2, 3, 3, 4)
sd(friends)
[1] 1.140175

Same mean, different standard deviation
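
A small numerical illustration with two made-up samples (`a` and `b` are hypothetical values, not course data): both have the same mean, but the values in `b` are spread much further around it.

```r
a <- c(4, 5, 6)   # values close to the mean
b <- c(1, 5, 9)   # values far from the mean

mean(a)   # 5
mean(b)   # 5

sd(a)     # 1
sd(b)     # 4
```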

Standard deviation: sample vs. population

NB: by convention, the sample standard deviation is called $s$, while the population standard deviation is called $\sigma$
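
A sketch of the numerical difference, assuming the usual convention that the sample formula divides by n - 1 and the population formula divides by n. R's `sd()` always uses n - 1; the names `s` and `sigma` are chosen here to match the symbols.

```r
friends <- c(1, 2, 3, 3, 4)

n  <- length(friends)
SS <- sum((friends - mean(friends))^2)

s     <- sqrt(SS / (n - 1))  # sample standard deviation
sigma <- sqrt(SS / n)        # population formula applied to the same values

s              # 1.140175
sigma          # 1.019804
sd(friends)    # identical to s
```

The two values move closer together as n grows, because dividing by n - 1 or by n then makes less and less difference.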

Degrees of freedom

Calculating a statistic, e.g. the mean, 'costs' you a degree of freedom!
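
One way to see this, sketched in R: once the mean of n values is known, only n - 1 of them are free to vary, because the last one is fixed by the others. The names `first_four`, `last_value` and `df` are chosen for this example.

```r
friends <- c(1, 2, 3, 3, 4)

n <- length(friends)
m <- mean(friends)            # 2.6

# Suppose we know the mean and the first four values ...
first_four <- friends[1:4]

# ... then the fifth value is no longer free: it has to be
last_value <- n * m - sum(first_four)
last_value                    # 4

# Degrees of freedom left after estimating the mean
df <- n - 1
df                            # 4
```

This is why the variance and the standard deviation divide by n - 1 rather than n.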

What will we have learnt in Week 3?

Organise and import your data, create a data frame in R
Measuring variability graphically
- draw a histogram and interpret it
Computing and understanding
- the sum of squared errors
- the variance
- the standard deviation

Glossary Week 3

comma separated values (csv files)
wide vs. long data format
value or score
string or character variable
categorical variable
coding variable
numeric variable
histogram
box plot, whisker plot
deviation
sum of squared errors (sum of squares)
variance
standard deviation
degrees of freedom