August 4, 2015

Recap from last week

Week 3
  • What is the scientific research process
  • Scientific and non-scientific hypotheses
  • What is a hypothesis that can be falsified?
  • How to get started with R / R Markdown
  • What are response and predictor variables? What is a 'treatment' and a 'control'?
  • Binomial, nominal, ordinal, and continuous variables
  • Predictor vs. response variables
  • Systematic vs. unsystematic variation in data, what are its sources?
  • What is the signal-noise ratio? How can we increase it?

  • PLEASE: lab reports (filenames!)

Timetable first half (tentative)

Date    Week  Lab  Topic
21.7.   1     -    Introduction, R, R Markdown
28.7.   2     1    Hypotheses, variables, variation
4.8.    3     2    Data, measuring variation, the normal distribution
11.8.   4     3    Quartiles, quantiles, probabilities
18.8.   5     4    Type I and type II error
25.8.   6     5    Week of midsemester test

Example questions for the test/exam

  • Describe a study which uses 2 predictor variables, one continuous, one categorical, and one binomial response variable
  • In R, create three variables of the same length, one continuous, one categorical and one binomial
  • Explain why unsystematic variation in data can prevent us from detecting a signal
  • Explain 2 sources of systematic variation

What we will learn today

  • How to organise and import your data in R (that’s often 50% of your job done!)
  • Measuring variability graphically
    • frequency distributions, histograms
  • Measuring variability numerically
    • the standard deviation
    • the standard error
    • degrees of freedom
    • population vs. sample
  • The normal distribution
    • the mean, the median and the mode
    • quartiles, quantiles
    • confidence intervals

How to organise your data

  • Variable names
    • short, consistent, unique
    • NO special characters (%, $, &…) and NO spaces
    • period / underscore are ok
    • Examples:
      • ‘% damage in leaves’
      • ‘Percent_damage_in_leaves’
      • ‘damage’
      • ‘pcd’
  • Variable values
    • Example: variable ‘age class’
      • c('1-2', '3-4', '5-6', '>6')
      • c(1, 2, 3, 4)
  • Wide vs. long format
    • Use the long format, even before you import your data into R!

Wide vs. long format
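
To make the difference concrete, here is a minimal sketch in base R. The data frame `wide`, its values and the column names `damage`, `group` and `plant` are made up for illustration; `stack()` is just one of several ways to reshape data.

```r
# Hypothetical wide table: one row per plant, one column per group
wide <- data.frame(plant     = 1:3,
                   control   = c(12, 15, 11),
                   treatment = c(18, 21, 17))

# stack() puts all measurements into one column ('values') and
# records the column they came from in a second column ('ind')
long <- stack(wide[c("control", "treatment")])
names(long) <- c("damage", "group")

# keep track of which plant each measurement belongs to
long$plant <- rep(wide$plant, times = 2)
long
```

In the long format every row is one observation, and the grouping information sits in its own column, which is the layout that functions like boxplot() expect.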

Creating a string or character variable

  • We use the c() function and list all values in quotation marks so that R knows that it is string (character) data.
  • As such, we can create a variable called name as follows:
name <- c("Ben", "Martin", "Andy", "Pauline", "Eva", "Carina")
name
[1] "Ben"     "Martin"  "Andy"    "Pauline" "Eva"     "Carina" 
  • It does not matter whether you use single or double quotes
  • Note that if you don't type 'name' again, then the variable is not displayed
  • What does the '[1]' mean in front of the variable?
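
A few quick checks on the variable, as a small sketch using only base R functions. The '[1]' at the start of an output line is the index of the first element printed on that line.

```r
name <- c("Ben", "Martin", "Andy", "Pauline", "Eva", "Carina")

class(name)    # "character": R stores the values as text
length(name)   # number of values in the variable
name[1]        # the element at index 1

# single and double quotes create the same strings
identical(c('Ben', 'Eva'), c("Ben", "Eva"))
```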

Creating a categorical variable (a factor in case of a predictor)

  • Imagine we had 3 males and 3 females in a data set and we wanted to create a coding variable called 'gender', where 1 means male and 2 means female.
  • Enter the data:
gender <- c(1, 1, 1, 2, 2, 2)
  • Where can we leave a space in the R code, where can't we?
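
So far 'gender' is just a vector of numbers. One possible way to tell R that the numbers are category codes is factor(); the name `gender_f` and the labels below are chosen for illustration only.

```r
gender <- c(1, 1, 1, 2, 2, 2)

# turn the numeric codes into a factor with readable labels
gender_f <- factor(gender, levels = c(1, 2), labels = c("male", "female"))

gender_f          # prints the labels and the factor levels
table(gender_f)   # counts per category
```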

Creating a numeric variable

  • Numeric variables are the easiest ones to create:
alcohol <- c(0.75, 1.2, 2.4, 0.23, 0.9, 1.36) #standard drinks/day
income <- c(58000, 38000, 28000, 63000, 90500, 17000)
alcohol
[1] 0.75 1.20 2.40 0.23 0.90 1.36
income
[1] 58000 38000 28000 63000 90500 17000
  • What does the '#' sign do again?
  • Are those continuous or categorical variables?
  • Are they predictor or response variables?

Creating a data frame (a table) in R

  • We can combine variables into a data frame:
d1 <- data.frame(name, alcohol, income, gender)
d1
     name alcohol income gender
1     Ben    0.75  58000      1
2  Martin    1.20  38000      1
3    Andy    2.40  28000      1
4 Pauline    0.23  63000      2
5     Eva    0.90  90500      2
6  Carina    1.36  17000      2
  • What are the dimensions of the data frame?
  • Why call it 'd1'? Can we call it something else…?
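
To inspect a data frame, the base R functions dim() and str() are a reasonable starting point. The sketch below rebuilds d1 from the variables defined earlier so that it runs on its own.

```r
name    <- c("Ben", "Martin", "Andy", "Pauline", "Eva", "Carina")
alcohol <- c(0.75, 1.2, 2.4, 0.23, 0.9, 1.36)   # standard drinks/day
income  <- c(58000, 38000, 28000, 63000, 90500, 17000)
gender  <- c(1, 1, 1, 2, 2, 2)

d1 <- data.frame(name, alcohol, income, gender)

dim(d1)   # number of rows, then number of columns
str(d1)   # type and first values of every column
```

The name 'd1' is arbitrary; any valid, unused variable name works just as well.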

What's the workspace/working directory in R?

How to set up your working directory in R

Say you have a folder in ‘My Documents’ called ‘BIOL501’ where your data files are and where your own work should go:

  • Either click on 'Session', then 'Set Working Directory', then 'Choose Directory…' and select the above folder

  • Or even better: include a setwd() ('set working directory') command in a chunk that you place at the beginning of your R Markdown document:

setwd("C:/Documents/BIOL501") #adjust path to your system!

We can now access files in that folder directly. For example: myData <- read.csv("data.csv")
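
To try this without a real data file, the following sketch first writes a tiny made-up data set to R's temporary folder and then reads it back. The object `toy` and the path `csv_file` are assumptions for illustration; with your own data you would only need the read.csv() line.

```r
# write a small hypothetical data set to a temporary csv file
csv_file <- file.path(tempdir(), "data.csv")
toy <- data.frame(name = c("Ben", "Eva"), income = c(58000, 90500))
write.csv(toy, csv_file, row.names = FALSE)

# import the csv file into a data frame
myData <- read.csv(csv_file)
head(myData)   # show the first rows to check the import

getwd()        # shows the current working directory
```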

How to import data into R

Graphically measuring the variation of a (continuous) variable

  • The histogram! Always THE first thing to look at:
hist(x)
hist(y)
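
Here x and y stand for any continuous variable. A self-contained sketch with simulated data (the mean of 50 and standard deviation of 10 are arbitrary choices) could look like this:

```r
set.seed(1)                           # make the random numbers reproducible
x <- rnorm(100, mean = 50, sd = 10)   # 100 simulated, hypothetical measurements

# draw the histogram and keep the object that hist() returns
h <- hist(x, main = "Histogram of x", xlab = "x")

h$breaks   # the boundaries of the bars
h$counts   # how many values fall into each bar
```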

Graphically measuring the variation of a (continuous) variable

  • The box plot! Very useful when you want to look at a continuous variable that is grouped by another (categorical) variable:
boxplot(d1$income ~ d1$gender) #note the use of '$' and '~'!
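
The thick line in each box is the group median. As a small sketch, the same numbers can be computed with tapply(); the data frame is rebuilt here so the example runs on its own, and the `data =` form of the formula is just an alternative to the '$' notation.

```r
d1 <- data.frame(income = c(58000, 38000, 28000, 63000, 90500, 17000),
                 gender = c(1, 1, 1, 2, 2, 2))

# same plot as above, written with the data argument
boxplot(income ~ gender, data = d1)

# median income per gender group
med <- tapply(d1$income, d1$gender, median)
med
```

With only three values per group, the medians are 38000 for group 1 and 63000 for group 2.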

Measuring variation with numbers

  • A perfect fit (rare!):

  • More often it looks like this:

Calculating 'error'

  • A deviation is the difference between the mean and an actual data point.
  • Deviations can be calculated by taking each score and subtracting the mean from it:

\(deviation = x_i - \bar{x}\)

  • NB: 'Deviation' is called 'residual' in linear models

What do you think? How could we compute a number that is large when the variation is large, and small when the variation is small?

Use the total error?

  • We could just sum up the errors between the mean and the data

score   mean   deviation
1       2.6    -1.6
2       2.6    -0.6
3       2.6     0.4
3       2.6     0.4
4       2.6     1.4
Total            0

\[\sum(x_i - \bar{x}) = 0\]
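
The table can be reproduced in R with a few lines. This is a minimal sketch; the variable names `score` and `deviation` are chosen to match the table headers.

```r
score <- c(1, 2, 3, 3, 4)

mean(score)                        # 2.6
deviation <- score - mean(score)   # each score minus the mean
deviation

# the deviations cancel out (zero, up to floating-point rounding)
round(sum(deviation), 10)
```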

The sum of squared errors

  • The problem with summing up deviations is that they cancel out because some are positive and others negative
  • Therefore, we square each deviation.
  • If we add these squared deviations we get the sum of squared errors (SS).

\[SS = \sum(x_i - \bar{x})^2\]

Sum of squared errors

score   mean   deviation   squared deviation
1       2.6    -1.6        2.56
2       2.6    -0.6        0.36
3       2.6     0.4        0.16
3       2.6     0.4        0.16
4       2.6     1.4        1.96
Total            0         5.2

\[SS = \sum(x_i - \bar{x})^2 = 5.2\]
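
The same calculation as a short R sketch (the name `SS` simply follows the formula):

```r
score <- c(1, 2, 3, 3, 4)

squared_deviation <- (score - mean(score))^2   # square each deviation
squared_deviation

SS <- sum(squared_deviation)                   # sum of squared errors
SS
```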

Variance

  • The sum of squares is a good measure of overall variability, but it is dependent on the number of scores/values
  • We calculate the average variability by dividing by the number of scores (\(n\)) minus 1.
  • This value is called the variance (\(s^2\)).

\[variance (s^2) = \frac{SS}{n-1} = \frac{\sum(x_i-\bar{x})^2}{n-1} = \frac{5.2}{4} = 1.3\]
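
As a check, the formula can be computed by hand in R and compared with the built-in var() function, which also divides by n - 1. The name `variance_by_hand` is made up for this sketch.

```r
score <- c(1, 2, 3, 3, 4)
n  <- length(score)
SS <- sum((score - mean(score))^2)

variance_by_hand <- SS / (n - 1)   # 5.2 / 4
variance_by_hand

var(score)                         # built-in function, same result
```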

Standard deviation

  • The variance has one problem: it is measured in units squared
  • This isn’t a very meaningful metric, so we take the square root
  • This is the standard deviation (\(s\), sometimes \(sd\)):

\[s = \sqrt{\frac{\sum(x_i-\bar{x})^2}{n-1}} = \sqrt{\frac{5.2}{4}} = 1.14\]

In R:

friends <- c(1, 2, 3, 3, 4)
sd(friends)
[1] 1.140175

Same mean, different standard deviation
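
A small made-up example: the two hypothetical samples below have the same mean, but the values in `b` are spread much further from it than the values in `a`.

```r
a <- c(2, 3, 3, 3, 4)   # values close to the mean
b <- c(1, 1, 3, 5, 5)   # values far from the mean

mean(a)   # 3
mean(b)   # 3

sd(a)     # small standard deviation
sd(b)     # large standard deviation
```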

Standard deviation: sample vs. population

NB: mostly, the population standard deviation is called \(\sigma\), while the sample standard deviation is called \(s\)
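
The two formulas differ only in the denominator: n - 1 for a sample, n for a whole population. The sketch below applies both to the 'friends' data, treating the five values as if they were an entire population purely for illustration. R's sd() always uses n - 1; the names `sd_sample` and `sd_population` are made up here.

```r
friends <- c(1, 2, 3, 3, 4)
n  <- length(friends)
SS <- sum((friends - mean(friends))^2)

sd_sample     <- sqrt(SS / (n - 1))   # divides by n - 1, same as sd()
sd_population <- sqrt(SS / n)         # divides by n

sd_sample
sd_population
sd(friends)
```

Dividing by n always gives the smaller value; the difference shrinks as n grows.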

Degrees of freedom

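Calculating a metric, e.g. the mean, 'costs' you a degree of freedom! Once the mean is fixed, only \(n - 1\) of the \(n\) values are free to vary, which is why the variance divides by \(n - 1\).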
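
One way to see this, sketched in R with the five scores from the variance example: because the deviations from the mean always sum to zero, the last deviation is already determined by the other four. The name `last_from_others` is made up for this illustration.

```r
score <- c(1, 2, 3, 3, 4)
deviation <- score - mean(score)

# the first n - 1 deviations fix the last one
last_from_others <- -sum(deviation[1:4])

all.equal(last_from_others, deviation[5])   # TRUE

length(score) - 1   # degrees of freedom left after estimating the mean
```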

What will we have learnt in Week 3?

  • Organise and import your data, create a data frame in R
  • Measuring variability graphically
    • draw a histogram and interpret it
  • Computing and understanding
    • the sum of squared errors
    • the variance
    • the standard deviation

Glossary Week 3

  • comma separated values (csv files)
  • wide vs. long data format
  • value or score
  • string or character variable
  • categorical variable
  • coding variable
  • numeric variable
  • histogram
  • box plot, whisker plot
  • deviation
  • sum of squared errors (sum of squares)
  • variance
  • standard deviation
  • degrees of freedom