| Date | Week | Lab | Assignment | Topic |
|---|---|---|---|---|
| 18.7. | 1 | - | - | Introduction, R, Rmarkdown |
| 25.7. | 2 | 1 | - | Hypotheses, variables, variation |
| 1.8. | 3 | 2 | - | Organising data, measuring variation |
| 8.8. | 4 | 3 | - | Distributions, quartiles, quantiles, probabilities |
| 15.8. | 5 | 4 | 1 | Type I and type II error (Assignment 1 due Friday) |
| 22.8. | 6 | 5 | - | The t-test |
What we will learn today
- Recap of last week, examples
- How to organise and import your data in R (that’s often 50% of your job done!)
- Measuring variability graphically
  - frequency distributions, histograms, box plots
- Measuring variability numerically
  - the standard deviation
  - the variance
  - degrees of freedom
  - population vs. sample
Recap from last week
- What is the scientific research process?
- Scientific and non-scientific hypotheses
- What is a hypothesis that can be falsified?
- How to get started with R / R Markdown
- What are response and predictor variables? What is a ‘treatment’ and a ‘control’?
- Binomial, nominal, ordinal, and continuous variables, factors
- Systematic vs. unsystematic variation in data, what are the sources?
- Experiments vs. observations
- What is the signal-noise ratio? How can we increase it?
- What is a sample? What is a population?
Recap using these simple examples
- Two studies are shown: are they observational or experimental studies?
- What could the scientific questions behind the studies be?
- What would be good scientific hypotheses?
- What are the variable names and possible values in the two studies?
- Which variables are predictor/response variables?
- At what point during the studies could unsystematic variation be introduced?
- At what point during the studies could systematic variation be introduced?
- What do the properly organised data sets look like?
- (For later: what statistical models could you use to analyse these data sets?)
The signal - noise ratio
We are always trying to maximise the signal to noise ratio!
Introducing additional variables can help shift variation from ‘noise’ to ‘signal’
Example questions for the test/exam
Describe a study which uses 2 predictor variables, one continuous, one categorical, and one binomial response variable
An experiment with fish tanks, each set of tanks at a different water temperature (e.g. 18, 20, and 22 degrees, 3 tanks each; this is the categorical predictor variable). In each tank there are 5 fish, whose size is measured at the start of the experiment (continuous predictor variable). At the end of the experiment, fish are tested for a certain bacterium (presence/absence, your binomial response). The study asks whether fish size and/or water temperature influence the presence of the bacterium.
Example questions for the test/exam
Create all variables of the above example.
tank = c(18, 18, 18, 18, 18, 18, ...)
#or:
tank = rep(c(18, 20, 22), each = 15)
size = c(3.5, 4.2, 1.9, 2.3, ...)
bact = c(0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, ...)
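As a minimal sketch, the three variables can be combined into one long-format data frame with 45 rows (3 temperatures x 3 tanks x 5 fish). The names `temp`, `tank_id` and `fish` are assumptions made for illustration, and the size and bacteria values are random placeholders, not real measurements:

```r
set.seed(1)  # makes the placeholder values reproducible

# categorical predictor: water temperature (15 fish per temperature)
temp = factor(rep(c(18, 20, 22), each = 15))
# hypothetical tank identifier: 9 tanks with 5 fish each
tank_id = rep(1:9, each = 5)
# continuous predictor: fish size (placeholder values)
size = round(runif(45, min = 1.5, max = 5), 1)
# binomial response: bacteria absent (0) or present (1) (placeholder values)
bact = rbinom(45, size = 1, prob = 0.5)

fish = data.frame(temp, tank_id, size, bact)
str(fish)
table(fish$temp)  # number of fish per temperature
```

Storing the temperature as a factor tells R to treat it as a categorical predictor rather than as a number.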
Example questions for the test/exam
Explain why unsystematic variation in data can prevent us from detecting a signal
Unsystematic variation is ‘noise’: it lowers the signal/noise ratio (the denominator increases, so the ratio decreases). This makes it harder to detect the signal (the explained, systematic variation).
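A small simulation can make this concrete. It is only a sketch: the treatment effect of 2, the noise levels 1 and 5, and the helper `snr()` (difference in group means divided by the average within-group standard deviation) are all assumptions chosen for illustration:

```r
set.seed(42)

group = rep(c("control", "treatment"), each = 50)
signal = ifelse(group == "treatment", 2, 0)  # same signal in both scenarios
noise = rnorm(100)                           # unsystematic variation

y_low = signal + 1 * noise   # little noise
y_high = signal + 5 * noise  # five times more noise

# hypothetical helper: a crude signal/noise ratio
snr = function(y, g) {
  m = tapply(y, g, mean)  # group means
  s = tapply(y, g, sd)    # within-group standard deviations
  unname(abs(m["treatment"] - m["control"]) / mean(s))
}

snr(y_low, group)
snr(y_high, group)  # smaller: the same signal is harder to detect
```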
Example questions for the test/exam
Explain 2 sources of systematic variation in the above example
- The water temperature can be expected to introduce systematic variation. This is variation that we can control for (it is our treatment).
Sources of (unwanted, unexplained) systematic variation could be:
- If one tank receives more light than all other tanks
- If fish in some tanks are tested for bacteria by one person, and the others are tested by a different person
How to organise your data
- Variable names
  - short, consistent, unique
  - NO special characters (%, $, &…) and NO spaces
  - period / underscore are ok
  - Examples:
    - ‘% damage in leaves’
    - ‘Percent_damage_in_leaves’
    - ‘damage’
    - ‘pcd’
- Variable values
  - Example: variable ‘age class’
    - c('1-2', '3-4', '5-6', '>6')
    - c(1, 2, 3, 4)
- Wide vs. long format
  - Use the long format, even before you import your data into R!
Wide vs. long format
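To illustrate the difference, here is a minimal sketch with made-up counts for three sites in two years. The data frame and column names (`wide`, `long`, `site`, `year1`, `year2`, `count`) are assumptions for this example only:

```r
# wide format: one row per site, one column per year
wide = data.frame(site = c("A", "B", "C"),
                  year1 = c(10, 12, 9),
                  year2 = c(11, 15, 8))
wide

# long format: one row per observation, one column per variable
long = data.frame(site = rep(wide$site, times = 2),
                  year = rep(c(1, 2), each = nrow(wide)),
                  count = c(wide$year1, wide$year2))
long
```

In the long format every variable (site, year, count) has its own column, which is the layout most R modelling and plotting functions expect.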
Creating a string or character variable
- We use the c() function and list all values in quotation marks so that R knows that they are text
- As such, we can create a variable called name as follows:
name = c("Ben", "Martin", "Andy", "Pauline", "Eva", "Carina")
name
[1] "Ben" "Martin" "Andy" "Pauline" "Eva" "Carina"
- It does not matter whether you use single or double quotes
- Note that if you don’t type ‘name’ again, then the variable is not displayed
- What does the ‘[1]’ mean in front of the variable?
Creating a categorical variable (a factor in case of a predictor)
- Imagine we had 3 males and 3 females in a data set and we wanted to create a coding variable called ‘sex’, 1 means male, 2 female.
- Enter the data:
sex = c(1, 1, 1, 2, 2, 2)
#or, easier:
sex = rep(c(1, 2), each = 3)
#or, even sleeker:
sex = rep(1:2, each = 3)
sex
[1] 1 1 1 2 2 2
- Where can we insert a space in the R code? Where should we?
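So far `sex` is just a vector of numbers. One possible way to tell R that the numbers are codes for categories is `factor()`; the name `sex_f` and the labels below are assumptions for illustration:

```r
sex = rep(1:2, each = 3)

# turn the coding variable into a factor with readable labels
sex_f = factor(sex, levels = c(1, 2), labels = c("male", "female"))

sex_f          # values are now shown as labels
levels(sex_f)  # the categories R knows about
table(sex_f)   # number of observations per category
```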
Creating a numeric variable
- Numeric variables are the easiest ones to create:
alcohol = c(0.75, 1.2, 2.4, 0.23, 0.9, 1.36) #standard drinks/day
income = c(58000, 38000, 28000, 63000, 90500, 17000)
alcohol
[1] 0.75 1.20 2.40 0.23 0.90 1.36
income
[1] 58000 38000 28000 63000 90500 17000
- What does the ‘#’ sign do again?
- Are those continuous or categorical variables?
- Are they predictor or response variables?
A few tricks using seq()
c(1, 2, 3, 4, 5)
[1] 1 2 3 4 5
1:5 #easier
[1] 1 2 3 4 5
seq(from = 1, to = 5, length.out = 5) #if you want a certain length
[1] 1 2 3 4 5
seq(from = 0, to = 1, length.out = 11)
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(from = 1, to = 3, by = .3) #if you want a certain increment
[1] 1.0 1.3 1.6 1.9 2.2 2.5 2.8
seq(1, 3, .3) #no argument names needed if you respect the order
[1] 1.0 1.3 1.6 1.9 2.2 2.5 2.8
A few tricks using rep()
rep(1:5, each = 2)
[1] 1 1 2 2 3 3 4 4 5 5
rep(1:5, times = 2) #note the arguments 'each' and 'times'
[1] 1 2 3 4 5 1 2 3 4 5
rep(1:5, 2) #the default is 'times' (if you don't specify)
[1] 1 2 3 4 5 1 2 3 4 5
rep(c('M', 'F'), each = 5)
[1] "M" "M" "M" "M" "M" "F" "F" "F" "F" "F"
Creating a data frame (a table) in R
- We can combine variables into a data frame:
d1 = data.frame(name, alcohol, income, sex)
d1
name alcohol income sex
1 Ben 0.75 58000 1
2 Martin 1.20 38000 1
3 Andy 2.40 28000 1
4 Pauline 0.23 63000 2
5 Eva 0.90 90500 2
6 Carina 1.36 17000 2
- What are the dimensions of the data frame?
- Why call it ‘d1’? Can we call it something else…?
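A short sketch of how the size and structure of the data frame could be checked. The variables are recreated here so that the chunk runs on its own; the alternative name `survey` is just an example:

```r
name = c("Ben", "Martin", "Andy", "Pauline", "Eva", "Carina")
alcohol = c(0.75, 1.2, 2.4, 0.23, 0.9, 1.36)
income = c(58000, 38000, 28000, 63000, 90500, 17000)
sex = rep(1:2, each = 3)
d1 = data.frame(name, alcohol, income, sex)

dim(d1)   # number of rows, then number of columns
nrow(d1)  # rows only
ncol(d1)  # columns only
str(d1)   # type of each variable

# the name is up to you, as long as it follows the naming rules
survey = d1
```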
Working with data frames in R
As soon as our data sets (called data frames in R) become larger (> 20 rows), we will want to be able to:
- Quickly look at the content without showing every single line/column
- Only show a certain part of the data frame
- Aggregate a data frame according to a grouping variable
- Add a variable (a column) to a data frame
- …and eventually much much more, e.g. search and replace certain string patterns, apply algorithms to every row, … (this is not part of BIOL501)
Simple exploratory commands using the data set ‘iris’
The below commands are not evaluated (you don’t see what R does). Try them on your own device!
#the iris object (data set comes with R, it's already there for you!)
summary(iris) #summary() is very generic, try it on anything!
head(iris) #shows the first few lines of your data
tail(iris) #shows the last few lines of your data
plot(iris) #try to interpret this plot!
iris$Sepal.Length #access one variable in the data frame
head(iris$Sepal.Length) #access the first few values of one variable in a data frame
The ‘$’ symbol accesses variables contained inside a data frame; here we extract the variable ‘Sepal.Length’ from ‘iris’
Subsetting a data frame using the ‘iris’ example
Extracting part of a data frame is called ‘subsetting’ and can be done in many ways. Here is one using [] (row selection before, column selection after the comma):
iris[4, 2] #show the fourth value (row) of the second column
iris[4, ] #show the fourth row of all columns
iris[, 'Species'] #show all rows for column 'Species'
iris[c(3, 16), c('Species', 'Petal.Length')]
#rows 3 and 16 for columns 'Species' and 'Petal.Length'
iris[iris$Species == 'virginica', ] #all rows of species 'virginica'
iris[iris$Sepal.Length > 6, ] #all rows where Sepal.Length > 6
iris[iris$Sepal.Length > 6 & iris$Species == 'virginica', ]
#all rows where Sepal.Length > 6 AND species is 'virginica'
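Another of the many ways is the base R function `subset()`, which lets you write the column names without repeating `iris$`. The sketch below assumes the object names `a` and `b` and compares both approaches:

```r
# with square brackets
a = iris[iris$Sepal.Length > 6 & iris$Species == 'virginica',
         c('Species', 'Petal.Length')]

# the same selection with subset()
b = subset(iris,
           Sepal.Length > 6 & Species == 'virginica',
           select = c(Species, Petal.Length))

head(a)
head(b)
nrow(a) == nrow(b)  # both approaches select the same rows
```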
Aggregating a data frame
You can do this in many ways! Here is one way:
#calculate the mean petal length per species:
tapply(iris$Petal.Length, iris$Species, mean)
setosa versicolor virginica
1.462 4.260 5.552
#extract the maximum value of petal length per species:
tapply(iris$Petal.Length, iris$Species, max)
setosa versicolor virginica
1.9 5.1 6.9
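A second way, sketched here as an alternative rather than as the recommended method, is `aggregate()` with a formula. It returns a data frame instead of a named vector; the object name `means` is an assumption:

```r
# mean petal length per species, returned as a data frame
means = aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
means

# several variables at once
aggregate(cbind(Petal.Length, Petal.Width) ~ Species, data = iris, FUN = mean)
```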
Adding variables to a data frame
You can do this in many ways! Here is one way:
#adding a variable
iris$newVariable = 1
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species newVariable
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3.0 1.4 0.2 setosa 1
3 4.7 3.2 1.3 0.2 setosa 1
4 4.6 3.1 1.5 0.2 setosa 1
5 5.0 3.6 1.4 0.2 setosa 1
6 5.4 3.9 1.7 0.4 setosa 1
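A new column can also be computed from existing columns. In this sketch the copy `iris2` and the column `Sepal.Ratio` are hypothetical names chosen for illustration:

```r
iris2 = iris  # work on a copy so the built-in data set stays untouched

# new variable: sepal length divided by sepal width, row by row
iris2$Sepal.Ratio = iris2$Sepal.Length / iris2$Sepal.Width

head(iris2)
```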
What’s the work space/work directory in R?
How to set up your work directory in R
Say you have a folder in ‘My Documents’ called ‘BIOL501’ where your data files are and where your output should go:
Either click on ‘Session’, then ‘Set Working Directory’, then select the above folder
Or even better: include a setwd() (‘set working directory’) command in a chunk that you place at the beginning of your R Markdown document:
setwd("C:/Documents/BIOL501") #adjust the path to your system!
We can now access files in that folder directly. For example:
myData = read.csv("data.csv")
How to import data into R
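A minimal sketch of a csv import that runs on any computer: it first writes a tiny example file to a temporary location and then reads it back. The file and the object names (`csv_file`, `example`) are assumptions for illustration; with your own data you would only need the `read.csv()` line:

```r
# create a small example csv file in a temporary location
csv_file = tempfile(fileext = ".csv")
example = data.frame(name = c("Ben", "Eva"), income = c(58000, 90500))
write.csv(example, csv_file, row.names = FALSE)

# import the csv file into a data frame
myData = read.csv(csv_file)

str(myData)   # check that the variable types are what you expect
head(myData)  # look at the first few rows
```

Checking the result with `str()` and `head()` right after the import helps catch problems such as numbers that were read as text.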
Graphically measuring the variation of a (continuous) variable
- The histogram! Always THE first thing to look at:
hist(x)
hist(y)
What’s a histogram, what is it used for?
- A histogram is used to show the distribution of a single (mostly continuous) variable
- The x-axis represents the ‘bins’: a continuous variable is ‘categorised’ into a number of bins (about 10-20)
- The y-axis is the frequency, i.e. how many values fall in a given bin.
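The bins and frequencies can be inspected directly. This sketch uses the built-in `iris` data; the number of breaks, title and axis label are arbitrary choices:

```r
# draw the histogram ('breaks' is a suggestion, R picks "pretty" bin limits)
hist(iris$Sepal.Length, breaks = 15,
     main = "Sepal length", xlab = "Sepal length (cm)")

# the same histogram without drawing it, to look at the numbers behind it
h = hist(iris$Sepal.Length, breaks = 15, plot = FALSE)
h$breaks       # bin limits (x-axis)
h$counts       # how many values fall in each bin (y-axis)
sum(h$counts)  # every observation is counted exactly once
```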
Graphically measuring the variation of a (continuous) variable
- The box plot! Very useful when you want to look at a continuous variable that is grouped by another (categorical) variable. What values does the box show (look it up)?
boxplot(d1$income ~ d1$sex) #note the use of '$' and '~'!
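The same kind of plot can be tried on the built-in `iris` data, here with the `data =` argument instead of `$`. As a hint for the question above, `fivenum()` returns the five numbers that box plots in R are built on (minimum, lower hinge, median, upper hinge, maximum); the object name `five` is an assumption:

```r
# petal length grouped by species
boxplot(Petal.Length ~ Species, data = iris)

# five-number summary for one group
five = fivenum(iris$Petal.Length[iris$Species == "setosa"])
five
```

The box spans the lower to the upper hinge (roughly the first and third quartile), and the line inside the box is the median.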
Measuring variation with numbers
- A perfect fit (rare!):
Measuring variation with numbers
- More often it looks like this:
Calculating ‘error’
- A deviation is the difference between the mean and an actual data point.
- Deviations can be calculated by taking each score and subtracting the mean from it:
\[deviation = x_i - \bar{x}\]
- NB: ‘Deviation’ is called ‘residual’ in linear models
Calculating ‘error’
What do you think? How could we compute a number that is large when the variation is large, and small when the variation is small?
Use the total error?
- We could just sum up the errors between the mean and the data
| score | mean | deviation |
|---|---|---|
| 1 | 2.6 | -1.6 |
| 2 | 2.6 | -0.6 |
| 3 | 2.6 | 0.4 |
| 3 | 2.6 | 0.4 |
| 4 | 2.6 | 1.4 |
| Total | | 0 |
\[\sum(x_i - \bar{x}) = 0\]
The sum of squared errors
- The problem with summing up deviations is that they cancel out because some are positive and others negative
- Therefore, we square each deviation.
- If we add these squared deviations we get the sum of squared errors (SS).
\[SS = \sum(x_i - \bar{x})^2\]
Sum of squared errors
| score | mean | deviation | squared deviation |
|---|---|---|---|
| 1 | 2.6 | -1.6 | 2.56 |
| 2 | 2.6 | -0.6 | 0.36 |
| 3 | 2.6 | 0.4 | 0.16 |
| 3 | 2.6 | 0.4 | 0.16 |
| 4 | 2.6 | 1.4 | 1.96 |
| Total | | 0 | 5.2 |
\[SS = \sum(x_i - \bar{x})^2 = 5.2\]
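The table can be reproduced in R with a few lines. The variable names (`score`, `deviation`, `SS`) are chosen here for illustration:

```r
score = c(1, 2, 3, 3, 4)

deviation = score - mean(score)  # each score minus the mean (2.6)
deviation
sum(deviation)                   # zero, up to tiny rounding error

SS = sum(deviation^2)            # sum of squared errors
SS
```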
Variance
- The sum of squares is a good measure of overall variability, but it is dependent on the number of scores/values
- We calculate the average variability by dividing by the number of scores (\(n\)) minus 1.
- This value is called the variance (\(s^2\)).
\[variance = s^2 = \frac{SS}{n-1} = \frac{\sum(x_i-\bar{x})^2}{n-1} = \frac{5.2}{4} = 1.3\]
Standard deviation
- The variance has one problem: it is measured in units squared
- This isn’t a very meaningful metric so we take the square root value
- This is the standard deviation (\(s\), sometimes \(sd\)):
\[s = \sqrt{\frac{\sum(x_i-\bar{x})^2}{n-1}} = \sqrt{\frac{5.2}{4}} = 1.14\]
In R:
friends = c(1, 2, 3, 3, 4)
sd(friends)
[1] 1.140175
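To see that `sd()` does exactly what the formula says, the calculation can be done step by step and compared with the built-in functions. The names `variance_by_hand` and `sd_by_hand` are assumptions for this sketch:

```r
friends = c(1, 2, 3, 3, 4)
n = length(friends)

SS = sum((friends - mean(friends))^2)  # sum of squared errors
variance_by_hand = SS / (n - 1)        # variance
sd_by_hand = sqrt(variance_by_hand)    # standard deviation

# compare with the built-in functions
all.equal(variance_by_hand, var(friends))
all.equal(sd_by_hand, sd(friends))
```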
Same mean, different standard deviation
Sample standard deviation: why divide by n-1?
NB: mostly, the sample standard deviation is called \(s\), while the population standard deviation is called \(\sigma\)
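One way to see why we divide by n-1 is a small simulation. This is only a sketch: the population (normal, standard deviation 2, so variance 4), the sample size of 5 and the 10000 repetitions are arbitrary assumptions:

```r
set.seed(123)
n = 5

# draw 10000 samples of size n from a population with variance 2^2 = 4
samples = replicate(10000, rnorm(n, mean = 0, sd = 2))  # one sample per column

var_n1 = apply(samples, 2, var)  # var() divides by n - 1
var_n = var_n1 * (n - 1) / n     # the same sums of squares divided by n

mean(var_n1)  # on average close to the true variance of 4
mean(var_n)   # on average too small
```

Dividing by n underestimates the population variance on average because the deviations are measured from the sample mean, which sits closer to the sampled values than the true population mean does.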
Degrees of freedom
Calculating metrics (e.g. the mean) ‘costs’ you degrees of freedom!
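A small sketch of what this means, reusing the scores from the tables above: once the mean of 5 values is known, only 4 of them are free to vary, because the last one is fixed by the others. The variable names are assumptions for illustration:

```r
n = 5
known_mean = 2.6
free_values = c(1, 2, 3, 3)  # four values can be chosen freely

# the fifth value is then determined by the mean
last_value = n * known_mean - sum(free_values)
last_value

mean(c(free_values, last_value))  # gives back the known mean
```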
What will we have learnt in Week 3?
- Understand an experimental/observational context, anticipate sources of variation
- Organise and import your data, create a data frame in R
- Measure variability graphically
  - draw a histogram and interpret it
- Compute and understand
  - the sum of squared errors
  - the variance
  - the standard deviation
- Simple data aggregation
Glossary Week 3
- comma separated values (csv files)
- wide vs. long data format
- value or score
- string or character variable
- categorical variable
- coding variable
- numeric variable
- histogram
- box plot, whisker plot
- deviation
- sum of squared errors (sum of squares)
- variance
- standard deviation
- degrees of freedom