August 1, 2022

Timetable first half (tentative)

Week 3
| Date  | Week | Lab | Assignment | Topic |
|-------|------|-----|------------|-------|
| 18.7. | 1    | -   | -          | Introduction, R, Rmarkdown |
| 25.7. | 2    | 1   | -          | Hypotheses, variables, variation |
| 1.8.  | 3    | 2   | -          | Organising data, measuring variation |
| 8.8.  | 4    | 3   | -          | Distributions, quartiles, quantiles, probabilities |
| 15.8. | 5    | 4   | 1          | Type I and type II error (Assignment 1 due Friday) |
| 22.8. | 6    | 5   | -          | The t-test |

What we will learn today

  • Recap of last week, examples
  • How to organise and import your data in R (that’s often 50% of your job done!)
  • Measuring variability graphically
    • frequency distributions, histograms, box plots
  • Measuring variability numerically
    • the standard deviation
    • the variance
    • degrees of freedom
    • population vs. sample

Recap from last week

  • What is the scientific research process?
  • Scientific and non-scientific hypotheses
  • What is a hypothesis that can be falsified?
  • How to get started with R / R Markdown
  • What are response and predictor variables? What are a ‘treatment’ and a ‘control’?
  • Binomial, nominal, ordinal, and continuous variables, factors
  • Systematic vs. unsystematic variation in data, what are the sources?
  • Experiments vs. observations
  • What is the signal-noise ratio? How can we increase it?
  • What is a sample? What is a population?

Recap using these simple examples

  • Two studies are shown. Are they observational or experimental studies?
  • What could the scientific questions behind the studies be?
  • What would be good scientific hypotheses?
  • What are the variable names and possible values in the two studies?
  • Which variables are predictor/response variables?
  • At what point during the studies could unsystematic variation be introduced?
  • At what point during the studies could systematic variation be introduced?
  • What do the properly organised data sets look like?
  • (For later: what statistical models could you use to analyse these data sets?)

The signal-to-noise ratio

We are always trying to maximise the signal-to-noise ratio!

Introducing additional variables can help shift variation from ‘noise’ to ‘signal’

Example questions for the test/exam

Describe a study which uses 2 predictor variables, one continuous, one categorical, and one binomial response variable

An experiment with fish tanks, each set of tanks at a different water temperature (e.g. 18, 20, and 22 degrees, 3 tanks each; this is the categorical predictor variable). In each tank there are 5 fish, whose size is measured at the start of the experiment (continuous predictor variable). At the end of the experiment, the fish are tested for a certain bacterium (presence/absence, your binomial response). The study asks whether fish size and/or water temperature influence the presence of the bacterium.

Example questions for the test/exam

Create all variables of the above example.

tank = c(18, 18, 18, 18, 18, 18,... #or:
tank = rep(c(18, 20, 22), each = 15)
size = c(3.5, 4.2, 1.9, 2.3,...)
bact = c(0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, ...)
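
A runnable sketch of the same idea, for trying it out on your own device. All values are invented (simulated) for illustration only. Unlike the lines above, the temperature is stored in a variable called `temp` here, and a separate, hypothetical `tank` variable holds the tank ID (1 to 9):

```r
# Hypothetical design: 3 temperatures x 3 tanks x 5 fish = 45 fish
set.seed(1)                                      # makes the invented values reproducible
temp = rep(c(18, 20, 22), each = 15)             # categorical predictor (water temperature)
tank = rep(1:9, each = 5)                        # tank ID, 5 fish per tank
size = round(runif(45, min = 1.5, max = 5), 1)   # continuous predictor (invented sizes)
bact = rbinom(45, size = 1, prob = 0.5)          # binomial response (invented 0/1 values)

# combine everything into one data frame in the long format
fish = data.frame(temp = factor(temp), tank = factor(tank), size, bact)
str(fish)
```

Storing the tank ID as well makes it possible to check later whether single tanks behave differently from the others.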

Example questions for the test/exam

Explain why unsystematic variation in data can prevent us from detecting a signal

Unsystematic variation always adds ‘noise’ and lowers the signal/noise ratio (the denominator increases, so the ratio decreases). This makes it harder to detect the signal (the explained, systematic variation).

Example questions for the test/exam

Explain 2 sources of systematic variation in the above example

  • The water temperature can be expected to introduce systematic variation. This is variation that we can control for (it is our treatment).

Sources of (unwanted, unexplained) systematic variation could be:

  • If one tank receives more light than all other tanks
  • If fish in some tanks are tested for bacteria by one person, and the others are tested by a different person

How to organise your data

  • Variable names
    • short, consistent, unique
    • NO special characters (%, $, &…) and NO spaces
    • period / underscore are ok
    • Examples:
      • ‘% damage in leaves’
      • ‘Percent_damage_in_leaves’
      • ‘damage’
      • ‘pcd’
  • Variable values
    • Example: variable ‘age class’
      • c('1-2', '3-4', '5-6', '>6')
      • c(1, 2, 3, 4)
  • Wide vs. long format
    • Use the long format, already before you import your data into R!

Wide vs. long format
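
A minimal sketch of the difference, assuming a hypothetical growth measurement in two groups (the values and the names `wide`, `long`, `growth` and `group` are invented for illustration). `stack()` is one of several ways to convert a simple wide table:

```r
# Wide format: one column per group (hypothetical values)
wide = data.frame(control   = c(4.1, 3.8, 4.5),
                  treatment = c(5.2, 5.9, 5.4))
wide

# Long format: one row per measurement, one column per variable
long = stack(wide)
names(long) = c("growth", "group")   # response first, then the grouping variable
long
```

In the long format every variable has exactly one column, which is the layout that plotting and modelling functions in R usually expect.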

Creating a string or character variable

  • We use the c() function and list all values in quotation marks so that R knows that they are text
  • As such, we can create a variable called name as follows:
name = c("Ben", "Martin", "Andy", "Pauline", "Eva", "Carina")
name
[1] "Ben"     "Martin"  "Andy"    "Pauline" "Eva"     "Carina" 
  • It does not matter whether you use single or double quotes
  • Note that if you don’t type ‘name’ again, then the variable is not displayed
  • What does the ‘[1]’ mean in front of the variable?

Creating a categorical variable (a factor in case of a predictor)

  • Imagine we had 3 males and 3 females in a data set and wanted to create a coding variable called ‘sex’, where 1 means male and 2 means female.
  • Enter the data:
sex = c(1, 1, 1, 2, 2, 2) #or, easier:
sex = rep(c(1, 2), each = 3) #or, even sleeker:
sex = rep(1:2, each = 3)
sex
[1] 1 1 1 2 2 2
  • Where can we insert a space in the R code? Where should we?
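
So far ‘sex’ is still a plain numeric variable. One possible way to turn it into a factor with readable labels is sketched below (the name `sex_f` is just chosen for this illustration):

```r
sex = rep(1:2, each = 3)          # coding variable: 1 = male, 2 = female

# convert the codes into a factor and attach a label to each code
sex_f = factor(sex, levels = c(1, 2), labels = c("male", "female"))
sex_f
table(sex_f)                      # number of observations per level

is.numeric(sex)                   # the coding variable is numeric
is.factor(sex_f)                  # the converted variable is a factor
```

Labelled factors make tables and plots easier to read than the bare codes 1 and 2.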

Creating a numeric variable

  • Numeric variables are the easiest ones to create:
alcohol = c(0.75, 1.2, 2.4, 0.23, 0.9, 1.36) #standard drinks/day
income = c(58000, 38000, 28000, 63000, 90500, 17000)
alcohol
[1] 0.75 1.20 2.40 0.23 0.90 1.36
income
[1] 58000 38000 28000 63000 90500 17000
  • What does the ‘#’ sign do again?
  • Are those continuous or categorical variables?
  • Are they predictor or response variables?

A few tricks using seq()

c(1, 2, 3, 4, 5)
[1] 1 2 3 4 5
1:5 #easier
[1] 1 2 3 4 5
seq(from = 1, to = 5, length.out = 5) #if you want a certain length
[1] 1 2 3 4 5
seq(from = 0, to = 1, length.out = 11)
 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(from = 1, to = 3, by = .3) #if you want a certain increment
[1] 1.0 1.3 1.6 1.9 2.2 2.5 2.8
seq(1, 3, .3) #no argument names needed if you respect the order
[1] 1.0 1.3 1.6 1.9 2.2 2.5 2.8

A few tricks using rep()

rep(1:5, each = 2) 
 [1] 1 1 2 2 3 3 4 4 5 5
rep(1:5, times = 2) #note the arguments 'each' and 'times'
 [1] 1 2 3 4 5 1 2 3 4 5
rep(1:5, 2) #the default is 'times' (if you don't specify)
 [1] 1 2 3 4 5 1 2 3 4 5

rep(c('M', 'F'), each = 5)
 [1] "M" "M" "M" "M" "M" "F" "F" "F" "F" "F"

Creating a data frame (a table) in R

  • We can combine variables into a data frame:
d1 = data.frame(name, alcohol, income, sex)
d1
     name alcohol income sex
1     Ben    0.75  58000   1
2  Martin    1.20  38000   1
3    Andy    2.40  28000   1
4 Pauline    0.23  63000   2
5     Eva    0.90  90500   2
6  Carina    1.36  17000   2
  • What are the dimensions of the data frame?
  • Why call it ‘d1’? Can we call it something else…?
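
To check your answer about the dimensions, the following sketch rebuilds the data frame from the variables above and asks R directly:

```r
# the four variables from the previous slides
name    = c("Ben", "Martin", "Andy", "Pauline", "Eva", "Carina")
alcohol = c(0.75, 1.2, 2.4, 0.23, 0.9, 1.36)
income  = c(58000, 38000, 28000, 63000, 90500, 17000)
sex     = rep(1:2, each = 3)

d1 = data.frame(name, alcohol, income, sex)

dim(d1)    # number of rows, then number of columns
nrow(d1)   # rows only
ncol(d1)   # columns only
str(d1)    # compact overview: type and first values of every variable
```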

Working with data frames in R

As soon as our data sets (called data frames in R) become larger (> 20 rows), we will want to be able to:

  • Quickly look at the content without showing every single line/column
  • Only show a certain part of the data frame
  • Aggregate a data frame according to a grouping variable
  • Add a variable (a column) to a data frame
  • …and eventually much much more, e.g. search and replace certain string patterns, apply algorithms to every row, … (this is not part of BIOL501)

Simple exploratory commands using the data set ‘iris’

The commands below are not evaluated (you don’t see what R does). Try them on your own device!

#the iris object (data set comes with R, it's already there for you!)
summary(iris) #summary() is very generic, try it on anything!
head(iris) #shows the first few lines of your data
tail(iris) #shows the last few lines of your data
plot(iris) #try to interpret this plot!
iris$Sepal.Length #access one variable in the data frame
head(iris$Sepal.Length) #access the first few values of one variable
#in a data frame

The ‘$’ symbol accesses variables contained inside a data frame. Here we extract the variable ‘Sepal.Length’ from ‘iris’.

Subsetting a data frame using the ‘iris’ example

Extracting part of a data frame is called ‘subsetting’ and can be done in many ways. Here is one using [] (row selection before, column selection after the comma):

iris[4, 2] #show the fourth value (row) of the second column
iris[4, ] #show the fourth row of all columns
iris[, 'Species'] #show all rows for column 'Species'
iris[c(3, 16), c('Species', 'Petal.Length')]
#rows 3 and 16 for columns 'Species' and 'Petal.Length'
iris[iris$Species == 'virginica', ] #all rows of species 'virginica'
iris[iris$Sepal.Length > 6, ] #all rows where Sepal.Length > 6
iris[iris$Sepal.Length > 6 & iris$Species == 'virginica', ]
#all rows where Sepal.Length > 6 AND species is 'virginica'

Aggregating a data frame

You can do this in many ways! Here is one way:

#calculate the mean petal length per species:
tapply(iris$Petal.Length, iris$Species, mean) 
    setosa versicolor  virginica 
     1.462      4.260      5.552 
#extract the maximum value of petal length per species:
tapply(iris$Petal.Length, iris$Species, max) 
    setosa versicolor  virginica 
       1.9        5.1        6.9 
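
Another of the many ways is `aggregate()`, which returns a data frame instead of a named vector. A short sketch (the name `agg` is just chosen here):

```r
# mean petal length per species, using the formula interface
agg = aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
agg
```

The result has one row per species, so it can be subsetted or plotted like any other data frame.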

Adding variables to a data frame

You can do this in many ways! Here is one way:

#adding a variable
iris$newVariable = 1
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species newVariable
1          5.1         3.5          1.4         0.2  setosa           1
2          4.9         3.0          1.4         0.2  setosa           1
3          4.7         3.2          1.3         0.2  setosa           1
4          4.6         3.1          1.5         0.2  setosa           1
5          5.0         3.6          1.4         0.2  setosa           1
6          5.4         3.9          1.7         0.4  setosa           1
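
A new variable can also be computed from existing columns. The sketch below works on a copy called `flowers` so that the built-in data set stays untouched; the variable `Sepal.Ratio` is hypothetical and only serves as an illustration:

```r
flowers = iris   # work on a copy of the data set

# new column computed row by row from two existing columns
flowers$Sepal.Ratio = flowers$Sepal.Length / flowers$Sepal.Width

head(flowers[, c("Sepal.Length", "Sepal.Width", "Sepal.Ratio")], 3)
```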

What’s the workspace/working directory in R?

How to set up your working directory in R

Say you have a folder in ‘My Documents’ called ‘BIOL501’ where your data files are and where your output should go:

  • Either click on ‘Session’, then ‘Set Working Directory’, then ‘Choose Directory…’, and select the above folder

  • Or even better: include a setwd() (‘set working directory’) command in a chunk that you place at the beginning of your R Markdown document:

setwd("C:/Documents/BIOL501") #adjust the path to your system!

We can now access files in that folder directly. For example: myData = read.csv("data.csv")
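
If you do not have a csv file at hand yet, the following sketch writes a small, invented data set to a temporary folder and reads it back in. The file location, the data frame `example` and its columns are assumptions made for this illustration only:

```r
# hypothetical data set in the long format
example = data.frame(plot_id = 1:3, damage = c(12.5, 3.0, 7.25))

# write it to a csv file in R's temporary folder
path = file.path(tempdir(), "data.csv")
write.csv(example, path, row.names = FALSE)

# import the csv file and check that everything arrived as expected
myData = read.csv(path)
str(myData)
```

Running `str()` or `summary()` right after the import is a quick way to spot variables that were read in with the wrong type.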

How to import data into R

Graphically measuring the variation of a (continuous) variable

  • The histogram! Always THE first thing to look at:
hist(x)
hist(y)

What’s a histogram, what is it used for?

  • A histogram is used to show the distribution of a single (mostly continuous) variable
  • The x-axis represents the ‘bins’: a continuous variable is ‘categorised’ into a number of bins (about 10-20)
  • The y-axis is the frequency, i.e. how many values fall in a given bin.
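
The bins and frequencies behind a histogram can also be inspected as numbers. A small sketch using the ‘iris’ data set (`plot = FALSE` suppresses the drawing; leave it out to see the histogram):

```r
# compute the histogram without drawing it
h = hist(iris$Sepal.Length, plot = FALSE)

h$breaks        # the bin boundaries on the x-axis
h$counts        # the frequency in each bin (the bar heights)
sum(h$counts)   # every value falls into exactly one bin
```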

Graphically measuring the variation of a (continuous) variable

  • The box plot! Very useful when you want to look at a continuous variable that is grouped by another (categorical) variable. What values does the box show (look it up)?
boxplot(d1$income ~ d1$sex) #note the use of '$' and '~'!
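
One way to check what you have looked up: `boxplot.stats()` returns the numbers that R uses to draw a box plot. The sketch below applies it to all six incomes at once (without grouping by sex) to keep things short:

```r
income = c(58000, 38000, 28000, 63000, 90500, 17000)

b = boxplot.stats(income)
b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # values drawn as individual points (none here)
```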

Measuring variation with numbers

  • A perfect fit (rare!):

Measuring variation with numbers

  • More often it looks like this:

Calculating ‘error’

  • A deviation is the difference between the mean and an actual data point.
  • Deviations can be calculated by taking each score and subtracting the mean from it:

\[\text{deviation} = x_i - \bar{x}\]

  • NB: ‘Deviation’ is called ‘residual’ in linear models

Calculating ‘error’

What do you think? How could we compute a number that is large when the variation is large, and small when the variation is small?

Use the total error?

  • We could just sum up the errors between the mean and the data
| score | mean | deviation |
|-------|------|-----------|
| 1     | 2.6  | -1.6      |
| 2     | 2.6  | -0.6      |
| 3     | 2.6  | 0.4       |
| 3     | 2.6  | 0.4       |
| 4     | 2.6  | 1.4       |
| Total |      | 0         |

\[\sum(x_i - \bar{x}) = 0\]

The sum of squared errors

  • The problem with summing up deviations is that they cancel out because some are positive and others negative
  • Therefore, we square each deviation.
  • If we add these squared deviations we get the sum of squared errors (SS).

\[SS = \sum(x_i - \bar{x})^2\]

Sum of squared errors

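| score | mean | deviation | squared deviation |
|-------|------|-----------|-------------------|
| 1     | 2.6  | -1.6      | 2.56              |
| 2     | 2.6  | -0.6      | 0.36              |
| 3     | 2.6  | 0.4       | 0.16              |
| 3     | 2.6  | 0.4       | 0.16              |
| 4     | 2.6  | 1.4       | 1.96              |
| Total |      | 0         | 5.2               |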

\[SS = \sum(x_i - \bar{x})^2 = 5.2\]

Variance

  • The sum of squares is a good measure of overall variability, but it is dependent on the number of scores/values
  • We calculate the average variability by dividing by the number of scores (\(n\)) minus 1.
  • This value is called the variance (\(s^2\)).

\[\text{variance} = s^2 = \frac{SS}{n-1} = \frac{\sum(x_i-\bar{x})^2}{n-1} = \frac{5.2}{4} = 1.3\]
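
The steps from deviations to variance can be retraced in R with the five scores from the table. This is a step-by-step sketch; the variable names are chosen for this illustration, and the built-in `var()` does the same in one go:

```r
scores = c(1, 2, 3, 3, 4)

deviations = scores - mean(scores)   # each score minus the mean (2.6)
sum(deviations)                      # practically 0 (up to rounding error)

SS = sum(deviations^2)               # sum of squared errors: 5.2
n  = length(scores)                  # number of scores: 5

SS / (n - 1)                         # variance by hand: 1.3
var(scores)                          # the built-in function gives the same
```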

Standard deviation

  • The variance has one problem: it is measured in units squared
  • Squared units are hard to interpret, so we take the square root
  • This is the standard deviation (\(s\), sometimes \(sd\)):

\[s = \sqrt{\frac{\sum(x_i-\bar{x})^2}{n-1}} = \sqrt{\frac{5.2}{4}} = 1.14\]

In R:

friends = c(1, 2, 3, 3, 4)
sd(friends)
[1] 1.140175

Same mean, different standard deviation

Sample standard deviation: why divide by n-1?

NB: mostly, the population standard deviation is called \(\sigma\), while the sample standard deviation is called \(s\)
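
A small sketch of why n-1 is used, under a simplifying assumption: we pretend that the five scores 1, 2, 3, 3, 4 are the entire population and look at every possible sample of size 2 (drawn with replacement). Dividing the sum of squares by n underestimates the population variance on average, while dividing by n-1 hits it:

```r
pop = c(1, 2, 3, 3, 4)                   # pretend this is the whole population
sigma2 = mean((pop - mean(pop))^2)       # population variance (divide by N)

# all 25 possible ordered samples of size 2, drawn with replacement
samples = expand.grid(first = pop, second = pop)
n = 2

# sum of squared errors around each sample's own mean
ss = apply(samples, 1, function(s) sum((s - mean(s))^2))

mean(ss / n)         # dividing by n: too small on average
mean(ss / (n - 1))   # dividing by n-1: matches the population variance
sigma2
```

The sample mean is always closer to the sample's own values than the population mean is, which is why the uncorrected version comes out too small.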

Degrees of freedom

Calculating metrics (e.g. the mean) ‘costs’ you degrees of freedom!
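
A minimal sketch of the idea, using the five scores from the variance example: once the mean is known, only n-1 values are free to vary, because the last one can be worked out from the others.

```r
scores = c(1, 2, 3, 3, 4)
m = mean(scores)                 # calculating the mean 'uses up' one degree of freedom
n = length(scores)

known = scores[1:(n - 1)]        # suppose we only know the first n-1 scores
last  = n * m - sum(known)       # the remaining score is then fixed
last
```

This is the reason why the sum of squares is divided by n-1 and not by n when the mean has been estimated from the same data.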

What will we have learnt in Week 3?

  • Understand an experimental/observational context and anticipate sources of variation
  • Organise and import your data, create a data frame in R
  • Measure variability graphically
    • draw a histogram and interpret it
  • Compute and understand
    • the sum of squared errors
    • the variance
    • the standard deviation
  • Aggregate data in simple ways

Glossary Week 3

  • comma separated values (csv files)
  • wide vs. long data format
  • value or score
  • string or character variable
  • categorical variable
  • coding variable
  • numeric variable
  • histogram
  • box plot, whisker plot
  • deviation
  • sum of squared errors (sum of squares)
  • variance
  • standard deviation
  • degrees of freedom