August 2, 2016

Recap from last week

  • What is the scientific research process?
  • Scientific and non-scientific hypotheses
  • What is a hypothesis that can be falsified?
  • How to get started with R / R Markdown
  • What are response and predictor variables? What is a 'treatment' and a 'control'?
  • Binomial, nominal, ordinal, and continuous variables
  • Predictor vs. response variables
  • Systematic vs. unsystematic variation in data, and what are its sources?
  • What is the signal-to-noise ratio? How can we increase it?
  • What is a sample? What is a population?

Timetable first half (tentative)

Date    Week  Lab  Topic
19.7.    1    -    Introduction, R, R Markdown
26.7.    2    1    Hypotheses, variables, variation
2.8.     3    2    Data, measuring variation, the normal distribution, quartiles, quantiles, probabilities
9.8.     4    3    Sebastian absent
16.8.    5    4    Type I and type II error
23.8.    6    5    Week of midsemester test

Example questions for the test/exam

  • Describe a study that uses two predictor variables (one continuous, one categorical) and one binomial response variable
  • In R, create three variables of the same length, one continuous, one categorical and one binomial
  • Explain why unsystematic variation in data can prevent us from detecting a signal
  • Explain 2 sources of systematic variation

What we will learn today

  • How to organise and import your data in R (that’s often 50% of your job done!)
  • Measuring variability graphically
    • frequency distributions, histograms
  • Measuring variability numerically
    • the standard deviation
    • the standard error
    • degrees of freedom
  • The normal distribution
    • the mean, the median and the mode
    • quartiles, quantiles
    • confidence intervals

How to organise your data

  • Variable names
    • short, consistent, unique
    • NO special characters (%, $, &…) and NO spaces
    • period / underscore are ok
    • Examples:
      • '% damage in leaves' (special character and spaces: avoid)
      • 'Percent_damage_in_leaves' (allowed, but long)
      • 'damage' (short and clear)
      • 'pcd' (short, but make sure you remember what it stands for)
  • Variable values
    • Example: variable 'age class'
      • as text labels: c('1-2', '3-4', '5-6', '>6')
      • or as numeric codes: c(1, 2, 3, 4)
  • Wide vs. long format
    • Use the long format, even before you import your data into R! (a small example follows under the next heading)

Wide vs. long format
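
A minimal sketch (with made-up measurements) of the same data entered in wide format (one column per measurement occasion) and in long format (one row per observation); R's plotting and modelling functions generally expect the long format:

# wide format: one row per person, one column per week (made-up values)
wide = data.frame(name  = c("Ben", "Mary"),
                  week1 = c(2.1, 3.0),
                  week2 = c(2.4, 2.8))

# long format: one row per observation
long = data.frame(name  = rep(c("Ben", "Mary"), times = 2),
                  week  = rep(c(1, 2), each = 2),
                  value = c(2.1, 3.0, 2.4, 2.8))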

Creating a string or character variable

  • We use the c() function and list all values in quotation marks so that R knows it is string (character) data.
  • For example, we can create a variable called name as follows:
name = c("Ben", "Mary", "Andy", "Paul", "Eva", "Carl")
name
[1] "Ben"  "Mary" "Andy" "Paul" "Eva"  "Carl"
  • It does not matter whether you use single or double quotes
  • Note that the variable is only displayed because we typed 'name' again on a line of its own
  • What does the '[1]' mean in front of the variable?

Creating a string or character variable

name1 = rep(name, times = 3)
name1
 [1] "Ben"  "Mary" "Andy" "Paul" "Eva"  "Carl" "Ben"  "Mary" "Andy" "Paul"
[11] "Eva"  "Carl" "Ben"  "Mary" "Andy" "Paul" "Eva"  "Carl"

The number in square brackets indicates the 'index': the position, within the variable, of the first value printed on that line:

rnorm(20)
 [1]  1.6739606 -0.1665988  0.4953175 -2.0888457  1.3930438  0.3808360
 [7]  1.0048967  0.1355061  2.8393412 -0.1163399  0.9209963 -0.3198669
[13] -1.3672879 -1.8611386 -0.4684234  1.2762487  2.0978015 -0.5942577
[19]  0.6131022 -0.3829628

Creating a categorical variable (called a 'factor' when it is used as a predictor)

  • Imagine we had 3 males and 3 females in a data set and we wanted to create a coding variable called 'gender', where 1 means male and 2 means female (a way to label these codes follows below).
  • Enter the data:
gender = c(1, 1, 1, 2, 2, 2)
  • Where can we leave a space in the R code?
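
To make R treat such a coding variable as categorical (a factor) rather than as ordinary numbers, you can convert it with factor(); the variable name 'gender_f' and the labels below are just one possible choice:

gender_f = factor(gender, levels = c(1, 2), labels = c("male", "female"))
gender_f
[1] male   male   male   female female female
Levels: male female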

Creating a numeric variable

  • Numeric variables are the easiest ones to create:
alcohol = c(0.75, 1.2, 2.4, 0.23, 0.9, 1.36) #standard drinks/day
income = c(58000, 38000, 28000, 63000, 90500, 17000)
alcohol
[1] 0.75 1.20 2.40 0.23 0.90 1.36
income
[1] 58000 38000 28000 63000 90500 17000
  • What does the '#' sign do again?
  • Are those continuous or categorical variables?
  • Are they predictor or response variables?

Creating a data frame (a table) in R

  • We can combine variables into a data frame:
d1 = data.frame(name, alcohol, income, gender)
d1
  name alcohol income gender
1  Ben    0.75  58000      1
2 Mary    1.20  38000      1
3 Andy    2.40  28000      1
4 Paul    0.23  63000      2
5  Eva    0.90  90500      2
6 Carl    1.36  17000      2
  • What are the dimensions of the data frame? (a quick way to check in R follows below)
  • Why call it 'd1'? Can we call it something else…?
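
If you are unsure about the dimensions, base R can tell you directly:

dim(d1)    # rows and columns: 6 4
nrow(d1)   # 6 rows (one per person)
ncol(d1)   # 4 columns (one per variable)
str(d1)    # compact overview of all variables and their types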

What is the workspace/working directory in R?

The working directory is the folder where R looks for files you read in and where it saves output by default; the workspace is the collection of objects (variables, data frames, functions) currently held in your R session.

How to set up your working directory in R

Say you have a folder in 'My Documents' called 'BIOL501' where your data files are stored and where your output should go:

  • Either click on 'Session', then 'Set Working Directory', then 'Choose Directory…' and select the folder above

  • Or, even better, include a setwd() ('set working directory') command in a chunk that you place at the beginning of your R Markdown document:

setwd("C:/Documents/BIOL501") #adjust path to your system!

We can now access files in that folder directly. For example: myData = read.csv("data.csv")

How to import data into R

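Building on the read.csv() example above, a minimal import-and-check sketch, assuming your working directory contains a comma-separated file called 'data.csv' (the file name is just an example):

myData = read.csv("data.csv")   # read the csv file into a data frame
head(myData)                    # look at the first six rows
str(myData)                     # check variable names and types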

Graphically measuring the variation of a (continuous) variable

  • The histogram! Always THE first thing to look at:
x = rnorm(100, mean = 0, sd = 1)   # example data: 100 values with a small spread
y = rnorm(100, mean = 0, sd = 3)   # example data: 100 values with a larger spread
hist(x)
hist(y)

Graphically measuring the variation of a (continuous) variable

  • The box plot! Very useful when you want to look at a continuous variable that is grouped by another (categorical) variable:
boxplot(d1$income ~ d1$gender) #note the use of '$' and '~'!

Measuring variation with numbers

  • A perfect fit (rare!): every observation is exactly equal to the mean, so there is no variation around it.

Measuring variation with numbers

  • More often it looks like this: the observations scatter around the mean, so we need a number that tells us how far they deviate from it.

Calculating 'error'

  • A deviation is the difference between an actual data point and the mean.
  • Deviations can be calculated by taking each score and subtracting the mean from it:

\(deviation = x_i - \bar{x}\)

  • NB: 'Deviation' is called 'residual' in linear models

Calculating 'error'

What do you think? How could we compute a number that is large when the variation is large, and small when the variation is small?

Use the total error?

  • We could simply sum up the deviations between the data and the mean:

score   mean   deviation
1       2.6    -1.6
2       2.6    -0.6
3       2.6     0.4
3       2.6     0.4
4       2.6     1.4
Total           0

\[\sum(x_i - \bar{x}) = 0\]
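
The same check in R, using the scores from the table above:

scores = c(1, 2, 3, 3, 4)
scores - mean(scores)        # the deviations: -1.6 -0.6 0.4 0.4 1.4
sum(scores - mean(scores))   # sums to (essentially) zero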

The sum of squared errors

  • The problem with summing up deviations is that they cancel out because some are positive and others negative
  • Therefore, we square each deviation.
  • If we add these squared deviations we get the sum of squared errors (SS).

\[SS = \sum(x_i - \bar{x})^2\]

Sum of squared errors

score   mean   deviation   squared deviation
1       2.6    -1.6        2.56
2       2.6    -0.6        0.36
3       2.6     0.4        0.16
3       2.6     0.4        0.16
4       2.6     1.4        1.96
Total           0           5.2

\[SS = \sum(x_i - \bar{x})^2 = 5.2\]
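
In R, using the 'scores' variable defined above:

sum((scores - mean(scores))^2)   # sum of squared errors: 5.2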

Variance

  • The sum of squares is a good measure of overall variability, but it is dependent on the number of scores/values
  • We calculate the average variability by dividing by the number of scores (\(n\)) minus 1.
  • This value is called the variance (\(s^2\)).

\[variance = s^2 = \frac{SS}{n-1} = \frac{\sum(x_i-\bar{x})^2}{n-1} = \frac{5.2}{4} = 1.3\]
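
R's built-in var() function gives the same result:

var(scores)                                             # 1.3
sum((scores - mean(scores))^2) / (length(scores) - 1)   # the same, computed 'by hand'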

Standard deviation

  • The variance has one problem: it is measured in squared units
  • That is not very intuitive, so we take the square root, which brings us back to the original units
  • This is the standard deviation (\(s\), sometimes \(sd\)):

\[s = \sqrt{\frac{\sum(x_i-\bar{x})^2}{n-1}} = \sqrt{\frac{5.2}{4}} = 1.14\]

In R:

friends = c(1, 2, 3, 3, 4)
sd(friends)
[1] 1.140175

Same mean, different standard deviation

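To see this for yourself, you can simulate two samples with the same mean but different standard deviations and compare their histograms (the values below are only an illustration):

a = rnorm(1000, mean = 170, sd = 5)    # narrow spread around 170
b = rnorm(1000, mean = 170, sd = 15)   # same mean, three times the spread
hist(a)
hist(b)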

Standard deviation: sample vs. population

NB: the population standard deviation is usually written \(\sigma\), while the sample standard deviation is written \(s\)

Degrees of freedom

Calculating a statistic such as the mean 'costs' you a degree of freedom: once the mean is fixed, only \(n - 1\) of the \(n\) scores are free to vary, which is why we divide by \(n - 1\).
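
A quick illustration with the five scores used above: once the mean (2.6) is fixed, only four of the five values are free to vary; the last one is forced.

known = c(1, 2, 3, 3)   # four of the five scores could be anything...
5 * 2.6 - sum(known)    # ...but the fifth is then forced to be 4 to keep the mean at 2.6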

Quick exercise


You tested a drug on a group of 7 people, and you have a control group of 6 people. You also recorded the gender, body height, and body weight of your patients. Your response variable is the heart rate before and after administering the drug.

  • What does your final data frame look like? (sketch it on paper)
  • Describe all your variables (categorical, continuous, response, predictor…?)
  • What plots could you use to visualise this data set?
  • How would you create this data set in R? (try it first; one possible sketch follows below)
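
One possible long-format sketch, with measurements simulated via rnorm() and sample() (so no real data are implied); the column names are just one option:

n = 13   # 7 treated + 6 controls
patients = data.frame(
  id        = 1:n,
  group     = c(rep("drug", 7), rep("control", 6)),    # categorical predictor
  gender    = sample(c("m", "f"), n, replace = TRUE),  # categorical (binomial) predictor
  height    = rnorm(n, mean = 170, sd = 10),           # continuous predictor
  weight    = rnorm(n, mean = 70, sd = 12),            # continuous predictor
  hr_before = rnorm(n, mean = 70, sd = 8),             # continuous response
  hr_after  = rnorm(n, mean = 75, sd = 8)              # continuous response
)
str(patients)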

Some example MC questions


Below, indicate which variable is most likely binomial

  1. Hair colour
  2. Gender
  3. Body height
  4. Age class
  5. None of the above

Some example MC questions


Which two variables are least likely to be response variables?

  1. 'Hair colour' and 'body height'
  2. 'Education of parents' and 'smoking habits of parents'
  3. 'Drug dose' and 'gender'
  4. 'Heart rate' and 'blood sugar level'

Some example MC questions


What is the variance of this sample: [3, 4, 3, 2] ? (Calculate it by hand)

  1. \(5\)
  2. \(0.66\)
  3. \(0\)
  4. \(\sqrt 6\)
  5. \(4\)

Some example MC questions


When calculating the standard deviation, we divide by n-1 rather than by n because:

  1. This makes the formula look more scientific
  2. This makes sure that the sum of squares is not zero
  3. This avoids division by zero
  4. Otherwise we are computing the variance
  5. None of the above

Some example MC questions


Which line correctly defines a categorical variable that is also nominal (using the software R)?

  1. c(4, 2, 2, 5, 9, 10)
  2. c(3.78, 5.98, 1.30, 22.43)
  3. c('black', 'black', 'black', 'blond', 'brown', 'brown', 'blond')
  4. rep(c('A', 'B', 'C'), each = 2)
  5. both (3) and (4) are correct

Additional resources


Bozeman Science

  • Standard deviation: youtube.com/watch?v=09kiX3p5Vek
  • Standard error: youtube.com/watch?v=BwYj69LAQOI

Standard deviation and standard error


Consider this example:

x <- c(10, 20)
y <- c(5, 18, 22, 13, 9, 23)
sd(x)
[1] 7.071068
sd(y)
[1] 7.238784

Note that x and y have almost the same standard deviation, even though y contains three times as many observations.

Standard deviation and standard error


So the standard deviation alone does not tell us how precisely we can estimate the mean; for that we use the standard error: \[s.e. = \frac{s}{\sqrt{n}}\]

sd(x)/sqrt(2)
[1] 5
sd(y)/sqrt(6)
[1] 2.955221

Important to remember

  • The sum of squares, variance, and standard deviation represent the same thing:
    • The ‘fit’ of the mean to the data (how well the mean represents the observed data)
    • The variability in the data
  • The difference between the standard deviation and the standard error:
    • The standard deviation measures the variability in a sample or estimates the variability in the population
    • The standard error measures how well we estimate the mean of the population

Calculating the standard error in R:

friends = c(1, 2, 3, 3, 4)
sd(friends)/sqrt(5)
[1] 0.509902
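
Base R has no built-in standard-error function, but you can easily write a small helper yourself (the name 'se' is just a convention assumed here):

se = function(v) sd(v) / sqrt(length(v))   # standard error of the mean
se(friends)
[1] 0.509902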

The mean, the mode, and the median

  • Mean: see last week's slides
  • Mode
    • The most frequent score or category (the peak of the histogram)
  • Median
    • The middle score when scores are ordered. In R: median()
    • Example: number of friends of 11 Facebook users:

22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252

Which is the median?
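
You can check your answer in R:

fb = c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)
median(fb)   # the middle (6th of 11) value: 98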

The dispersion: range

  • The Range
    • The smallest score subtracted from the largest. In R, range() returns the smallest and largest values; diff(range(x)) gives the range itself.
  • Example
    • Number of friends of 11 Facebook users.
    • 22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252
    • Range = 252 – 22 = 230
  • What could be a problem when you indicate the range of a variable? What could we be missing?

=> this metric is sensitive to extreme values!
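
In R, using the 'fb' vector defined above:

range(fb)         # smallest and largest value: 22 and 252
diff(range(fb))   # the range itself: 230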

The dispersion: the interquartile range

  • Quartiles
    • The three values that split the sorted data into four equal parts.
    • Second quartile = median.
    • Lower quartile = median of lower half of the data.
    • Upper quartile = median of upper half of the data.
    • In R: quantile(x, probs = c(0.25, 0.5, 0.75))

=> quartiles and medians are not sensitive to extreme values!
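
In R, quantile() computes the quartiles of a data vector and IQR() the interquartile range (R interpolates by default, so the values can differ slightly from the simple 'median of each half' rule):

quantile(fb, probs = c(0.25, 0.5, 0.75))   # lower quartile, median, upper quartile
IQR(fb)                                    # upper minus lower quartile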

The normal distribution


The standard normal distribution has mean 0 and standard deviation 1

In R, you can create normally distributed random numbers using the function rnorm()

The normal distribution is of central importance in statistics! (central limit theorem, assumptions of standard parametric tests)
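
A quick sketch of what rnorm() does:

z = rnorm(1000)   # 1000 random draws from the standard normal distribution
mean(z)           # close to 0
sd(z)             # close to 1
hist(z)           # roughly bell-shaped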

From quartiles to quantiles

Quartiles are a special case of quantiles: for any probability p between 0 and 1, the p-quantile is the value below which a proportion p of the data (or of a distribution) falls. The lower quartile is the 0.25-quantile, the median is the 0.5-quantile, and the upper quartile is the 0.75-quantile.

Examples with human body height


Females: mean = 160 cm, sd = 6 cm, males: mean = 170 cm, sd = 7 cm

Examples with human body height


What is the probability of being shorter than 175 cm if you are a woman?

pnorm(q = 175, mean = 160, sd = 6)
[1] 0.9937903

Examples with human body height


What height will 95% of the male population not exceed? (i.e. the 0.95-quantile of male body height)

qnorm(p = .95, mean = 170, sd = 7)
[1] 181.514
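
Since pnorm() and qnorm() are inverses of each other, we can check this result:

pnorm(q = 181.514, mean = 170, sd = 7)   # gives back (approximately) 0.95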

What will we have learnt in Week 3?

  • Organise and import your data, create a data frame in R
  • Measuring variability graphically
    • draw a histogram and interpret it
  • Computing and understanding
    • the sum of squared errors
    • the variance
    • the standard deviation
  • The standard error
    • how it differs from the standard deviation
    • when it is used
  • The normal distribution
    • the mean, the median and the mode
    • computing quartiles
    • computing probabilities and quantiles using pnorm() and qnorm()

Glossary Week 3

  • comma separated values (csv files)
  • wide vs. long data format
  • value or score
  • string or character variable
  • categorical variable
  • coding variable
  • numeric variable
  • histogram
  • box plot, whisker plot

Glossary Week 3

  • deviation
  • sum of squared errors (sum of squares)
  • variance
  • standard deviation
  • degrees of freedom
  • standard error
  • Normal distribution
  • mean, mode, median, range
  • quartiles, interquartile range
  • quantiles
  • confidence intervals