August 8, 2022

Recap week 3

Week 4
  • Organise and import your data (long format!)
  • Create a data frame in R
    • create different kinds of variables
    • assign them to an object name
    • stick them into a ‘data frame’
  • Measuring variability graphically
    • draw a histogram and interpret it
    • draw a boxplot and interpret it
  • Computing and understanding
    • the sum of squared errors
    • the variance
    • the standard deviation
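
As a reminder of how these three quantities build on each other, here is a minimal sketch that computes them step by step. The vector `scores` is made up for illustration and is not course data.

```r
# Made-up sample for illustration (not from the course data)
scores <- c(2, 4, 4, 6, 9)

deviations <- scores - mean(scores)    # distance of each value from the mean
ss <- sum(deviations^2)                # sum of squared errors
variance <- ss / (length(scores) - 1)  # divide by n - 1 for a sample
std_dev <- sqrt(variance)              # back on the original scale

# The built-in functions should agree with the manual steps
all.equal(variance, var(scores))
all.equal(std_dev, sd(scores))
```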

What will we do today

  • Example questions for the upcoming midsemester test
  • The standard error
  • Distributions
  • The central limit theorem
  • The normal distribution
  • computing probabilities and quantiles
  • computing confidence intervals

Assignment 1

  • Assignment 1 will look much like a lab report
  • It will be uploaded on Tuesday evening
  • It is due on Friday 19 August at 4 pm
  • You hand it in on Canvas
  • Do not plagiarise, it has to be your own work!
  • The TAs will not give advice on the assignment
  • Questions?

Quick exercise

You tested a drug on a group of 7 people, and you have a control group of 6 people. You also recorded the sex, the body height and body weight of your patients. Your response variable is the heart rate before and after administering the drug.

  • What does your final data frame look like? (sketch it on paper)
  • Describe all your variables (categorical, continuous, response, predictor…?)
  • What plots could you use to visualise this data set?
  • How would you create this data set in R?
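
One possible way to build such a data set in long format is sketched below. All values are invented placeholders generated with random numbers, and the column names (`id`, `group`, `time`, `heart_rate` and so on) are assumptions made for this example.

```r
set.seed(1)  # placeholder values only; real measurements would be typed in or imported

n_drug <- 7
n_control <- 6
n <- n_drug + n_control

# One row per patient: the variables that do not change over time
patients <- data.frame(
  id     = factor(1:n),
  group  = factor(rep(c("drug", "control"), times = c(n_drug, n_control))),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  height = round(rnorm(n, mean = 170, sd = 8)),  # cm, invented
  weight = round(rnorm(n, mean = 70, sd = 10))   # kg, invented
)

# Long format: stack the rows twice, once per measurement time
heart <- rbind(
  cbind(patients, time = "before"),
  cbind(patients, time = "after")
)
heart$time <- factor(heart$time, levels = c("before", "after"))
heart$heart_rate <- round(rnorm(nrow(heart), mean = 70, sd = 8))  # bpm, invented

str(heart)
```

Each patient appears in two rows, so the response `heart_rate` sits in a single column and `time` and `group` act as categorical predictors.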

Some example MC questions

Below, indicate which variable is most likely binomial

  1. Hair colour
  2. Sex
  3. Body height
  4. Age class
  5. None of the above

Some example MC questions

Which two variables are least likely to be response variables?

  1. ‘Hair colour’ and ‘body height’
  2. ‘Education of parents’ and ‘smoking habits of parents’
  3. ‘Drug dosage’ and ‘fertiliser level’
  4. ‘Heart rate’ and ‘blood sugar level’
  5. None of the above

Some example MC questions

What is the variance of this sample: [3, 4, 3, 2] ? (Calculate it by hand)

  1. \(5\)
  2. \(0.67\)
  3. \(0\)
  4. \(\sqrt 6\)
  5. \(4\)

Some example MC questions

When calculating the standard deviation, we divide by n-1 rather than by n because:

  1. This makes the formula look more scientific
  2. This makes sure that the sum of squares is not zero
  3. This avoids division by zero
  4. Otherwise we are computing the variance
  5. None of the above

Some example MC questions

Which line correctly defines a categorical variable that is also nominal (using the software R)?

  1. c(4, 2, 2, 5, 9, 10)
  2. c(3.78, 5.98, 1.30, 22.43)
  3. c('black', 'black', 'black', 'blond', 'brown', 'brown', 'blond')
  4. rep(c('A', 'B', 'C'), each = 2)
  5. both (3) and (4) are correct

Standard deviation and standard error

Consider this example:

x = c(10, 20)
y = c(5, 18, 22, 13, 9, 23)
sd(x)
[1] 7.071068
sd(y)
[1] 7.238784

Standard deviation and standard error

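x and y have almost the same standard deviation, although y contains three times as many observations. So the standard deviation does not indicate how well we can estimate the mean. For this purpose we use the standard error of the mean (note: sd = s = standard deviation): \[s.e. = \frac{s}{\sqrt{n}}\]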

sd(x)/sqrt(2)
[1] 5
sd(y)/sqrt(6)
[1] 2.955221

Important to remember

  • The variance and standard deviation represent the same thing:
    • The spread in a variable, how much variability there is
    • The higher the value, the higher the variability
    • With increasing sample size, we achieve a more precise estimate for the variability
  • The standard error
    • measures how well we estimate the mean of the population
    • decreases with the number of observations because we gain more confidence in the estimate of the mean

Calculating the standard error in R:

friends <- c(1, 2, 3, 3, 4)
sd(friends)/sqrt(5) #or:
[1] 0.509902
sd(friends)/sqrt(length(friends)) #which of the two is better?
[1] 0.509902
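
Base R has no built-in function for the standard error of the mean, so it can be convenient to wrap the formula in a small helper. The function name `se` and the vector `friends_x4` below are hypothetical names chosen for this sketch.

```r
# Hypothetical helper: wraps the formula s / sqrt(n)
se <- function(x) {
  sd(x) / sqrt(length(x))
}

friends <- c(1, 2, 3, 3, 4)
se(friends)

# Similar spread but four times as many observations: the standard error shrinks
friends_x4 <- rep(friends, times = 4)
se(friends_x4)
```

Using `length(x)` instead of a typed-in number means the code keeps working when the data change.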

Additional resources

Bozeman Science

  • Standard deviation: youtube.com/watch?v=09kiX3p5Vek
  • Standard error: youtube.com/watch?v=BwYj69LAQOI

The mean, the mode, and the median

  • Mean: see last week’s slides
  • Mode
    • The most frequent score or category (The peak of the histogram!)
  • Median
    • The middle score when scores are ordered. In R: median()
    • Example: number of friends of 11 Facebook users:

22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252

What is the median?
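
To check an answer in R, a short sketch along these lines could be used. The object names `fb`, `counts` and `most_frequent` are chosen for this example. Note that base R has no function for the statistical mode, so the sketch counts values with `table()` instead.

```r
fb <- c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)
median(fb)   # the 6th of the 11 sorted values

# mode() in R returns the storage type, not the most frequent value,
# so count each value with table() and pick the most frequent one
friends <- c(1, 2, 3, 3, 4)
counts <- table(friends)
most_frequent <- names(counts)[counts == max(counts)]
most_frequent
```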

The dispersion: range

  • The Range
    • The smallest score subtracted from the largest. In R: diff(range(x)), because range() itself returns the minimum and the maximum
  • Example
    • Number of friends of 11 Facebook users.
    • 22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252
    • Range = 252 – 22 = 230
  • What could be a problem when you indicate the range of a variable? What could we be missing?

=> this metric is sensitive to extreme values! Often not a very informative metric!

The dispersion: the interquartile range

  • Quartiles
    • The three values that split the sorted data into four equal parts.
    • Second quartile = median.
    • Lower quartile (25\(^{th}\) percentile) = median of lower half of the data.
    • Upper quartile (75\(^{th}\) percentile) = median of upper half of the data.
    • And of course you can have any ‘percentile’ (e.g. the 5\(^{th}\))

=> quartiles/medians are robust to extreme values!
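
The sketch below computes the range, the quartiles and the interquartile range for the Facebook example. One caveat: the default method of `quantile()` interpolates between neighbouring values, so its quartiles can differ slightly from the 'median of each half' rule. For this particular sample, `type = 6` happens to reproduce that rule.

```r
fb <- c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)

range(fb)         # returns the minimum and the maximum
diff(range(fb))   # the range as a single number: max - min

# Default quartiles (R interpolates between neighbouring values)
quantile(fb, probs = c(0.25, 0.5, 0.75))
IQR(fb)           # upper quartile minus lower quartile

# type = 6 matches the 'median of each half' rule for this sample
quantile(fb, probs = c(0.25, 0.5, 0.75), type = 6)
```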

Metrics to summarise data: example question

If you were to describe yearly income in the United States, what metrics would you use? Why?

  • A good metric to use would be the median. The mean would be distorted by very few extreme values. The range would not be informative at all. The interquartile range would be somewhat useful.

Distributions

hist(rnorm(1000), xlab = 'Score or Quantile',
     ylab = 'Density or Probability', main = "",
     axes = FALSE, cex = 1.5)
axis(1, cex = 1.5)

The uniform distribution

  • Parameters to specify are the minimum and the maximum
  • Equal probabilities between the min and the max
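
A minimal sketch of the uniform distribution in R, assuming a minimum of 0 and a maximum of 10 (arbitrary choices for illustration):

```r
set.seed(42)
u <- runif(1000, min = 0, max = 10)   # 1000 random draws between 0 and 10
summary(u)

dunif(c(1, 5, 9), min = 0, max = 10)  # the density is the same everywhere: 1 / (max - min)
punif(2.5, min = 0, max = 10)         # probability of a value below 2.5
```

Calling `hist(u)` afterwards should show bars of roughly equal height.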

The Poisson distribution

  • Parameter to specify is lambda (\(\lambda\)), which represents both the mean and the variance
  • Count data: only non-negative integers, so for discrete variables (counts)
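
A minimal sketch of the Poisson distribution in R. The value `lambda = 3` is an arbitrary choice for illustration.

```r
set.seed(42)
counts <- rpois(10000, lambda = 3)  # lambda = 3 is an arbitrary choice

mean(counts)   # close to lambda
var(counts)    # also close to lambda
table(counts)  # only whole numbers from 0 upwards occur

dpois(0:2, lambda = 3)  # probability of observing exactly 0, 1 or 2 events
```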

The normal distribution

  • The normal distribution is symmetrical, and both its tails extend infinitely

  • The two parameters are the mean and the standard deviation

  • The standard normal distribution has mean 0 and standard deviation 1

  • In R, you can create normally distributed random numbers using the function rnorm()

  • The normal distribution is of central importance! (Central Limit Theorem, assumptions of standard parametric tests)
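
The points above can be explored with a short sketch. The mean of 160 and standard deviation of 6 used for `h` are arbitrary values chosen for illustration.

```r
set.seed(42)
z <- rnorm(10000)                      # standard normal: mean 0, sd 1
h <- rnorm(10000, mean = 160, sd = 6)  # arbitrary mean and sd

c(mean(z), sd(z))
c(mean(h), sd(h))

# Symmetry: the density at -1 equals the density at +1
all.equal(dnorm(-1), dnorm(1))
# About 95% of values lie within 1.96 sd of the mean
mean(abs(z) < 1.96)
```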

The Central Limit Theorem
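
Roughly speaking, the central limit theorem says that the means of repeated samples are approximately normally distributed once the sample size is reasonably large, even when the raw data are not normal. The simulation below is a sketch under arbitrary assumptions: 5000 samples of 30 values each, drawn from a uniform distribution between 0 and 10.

```r
set.seed(42)
n_samples <- 5000   # number of repeated samples (arbitrary)
n <- 30             # observations per sample (arbitrary)

# Draw from a uniform distribution, which is flat rather than bell-shaped,
# and keep only the mean of each sample
sample_means <- replicate(n_samples, mean(runif(n, min = 0, max = 10)))

mean(sample_means)          # close to the population mean of 5
sd(sample_means)            # close to the theoretical standard error
(10 / sqrt(12)) / sqrt(n)   # sd of uniform(0, 10) divided by sqrt(n)
```

Calling `hist(sample_means)` afterwards should show a bell-shaped histogram centred near 5.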

From quartiles to quantiles

For a standard normal distribution:
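
A minimal sketch of how quartiles generalise to arbitrary quantiles, using `qnorm()` with its default mean of 0 and standard deviation of 1:

```r
# Quartiles of the standard normal distribution
qnorm(c(0.25, 0.5, 0.75))

# Any other quantile works the same way
qnorm(c(0.025, 0.975))   # the middle 95% lies between about -1.96 and 1.96

# pnorm() goes the other way, from a quantile back to a probability
pnorm(1.96)
```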

Distributions - examples using our data

Class exercise: enter into the Google Doc:

  • How many siblings have you got?
  • How tall are you (in cm)?
  • How tall are your parents?
  • Enter a random number between 0 and 10

Questions to ask:

  • Can we read this data set into R? How?
  • What kind of variables are there?
  • How can we summarise/visualise this data set?
  • How are these variables distributed? (Histograms!)
  • What distributions will those variables follow, i.e. how frequent are we expecting certain values to be?
  • E.g. for the variable body height, how frequently do we expect a value of 150, 170, 190 to occur?
  • For the random number, how frequently do we expect a value of 1, 5, or 10?

Playing with our data set: distributions

Let’s play…

Examples with human body height

Females: mean = 160 cm, sd = 6 cm, males: mean = 170 cm, sd = 7 cm

Examples with human body height

What is the probability of being shorter than 175 cm if you are a woman?

pnorm(q = 175, mean = 160, sd = 6)
[1] 0.9937903

Examples with human body height

What is the maximum height for 95% of the male population?

qnorm(p = .95, mean = 170, sd = 7)
[1] 181.514
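
Two further variations on these examples are sketched below, using the same means and standard deviations as above. The question about 160 cm to 180 cm and the object name `p_between` are additions made for this illustration.

```r
# Probability that a man is between 160 cm and 180 cm tall
p_between <- pnorm(180, mean = 170, sd = 7) - pnorm(160, mean = 170, sd = 7)
p_between

# Probability that a woman is taller than 175 cm
1 - pnorm(175, mean = 160, sd = 6)
pnorm(175, mean = 160, sd = 6, lower.tail = FALSE)  # same result

# qnorm() undoes pnorm()
qnorm(pnorm(175, mean = 160, sd = 6), mean = 160, sd = 6)
```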

Calculating confidence intervals

  • A 95% confidence interval is constructed so that, if we took many samples and computed an interval from each, about 95% of those intervals would contain the true mean.

  • In plain language, it gives us a range in which the true mean is likely to lie.

  • Confidence intervals are computed by subtracting/adding an error term to the mean:

For a 95% confidence interval and large samples (>30), the error is the 97.5% quantile of the standard normal distribution (1.96) times the standard error, e.g. at a mean of 10, a standard deviation of 1.58, and a sample size of 30:

\[CI = 10 \pm 1.96 \frac{1.58}{\sqrt{30}} = 10 \pm 0.57\]

So the true mean is likely to lie between 9.43 and 10.57.
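
The same calculation can be sketched in R. The object names `m`, `s`, `n`, `z` and `error` are chosen for this example.

```r
m <- 10      # sample mean
s <- 1.58    # sample standard deviation
n <- 30      # sample size

z <- qnorm(0.975)          # 97.5% quantile of the standard normal, about 1.96
error <- z * s / sqrt(n)   # quantile times standard error

c(lower = m - error, upper = m + error)
```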

https://seeing-theory.brown.edu/frequentist-inference/index.html

Calculating confidence intervals

In R (note the small sample size):

x = c(8, 12, 10, 9, 11)
m = mean(x)
n = 5
error = qnorm(p = .975)*sd(x)/sqrt(n)
lower = m - error
upper = m + error
lower
[1] 8.614096
upper
[1] 11.3859

The formula for a 95% CI is simple: \[CI = mean \pm quantile_{97.5} \frac{sd}{\sqrt{n}}\] Adapt the quantile value if you would like to calculate a different interval, e.g. use the 95% quantile for a 90% CI.
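
To avoid retyping the steps for each confidence level, the formula could be wrapped in a function. The sketch below is one possible version: `ci_mean` and its arguments `level` and `use_t` are hypothetical names, not part of base R. The `use_t` option swaps the normal quantile for a quantile of the t distribution, a common choice for small samples that goes beyond what is covered above and is included here only as an assumption.

```r
# Hypothetical helper, not part of base R
ci_mean <- function(x, level = 0.95, use_t = FALSE) {
  n <- length(x)
  p <- 1 - (1 - level) / 2   # 0.975 for a 95% CI, 0.95 for a 90% CI
  q <- if (use_t) qt(p, df = n - 1) else qnorm(p)
  error <- q * sd(x) / sqrt(n)
  c(lower = mean(x) - error, upper = mean(x) + error)
}

x <- c(8, 12, 10, 9, 11)
ci_mean(x)                  # reproduces the 95% interval above
ci_mean(x, level = 0.90)    # narrower, uses the 95% quantile
ci_mean(x, use_t = TRUE)    # wider, based on the t distribution
```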

More on confidence intervals

What will we have learnt in Week 4?

  • The standard error
    • how it differs from the standard deviation
    • when it is used
  • The normal distribution
    • the mean, the median and the mode
    • computing quartiles of the normal distribution
    • computing probabilities and quantiles of the normal distribution using pnorm() and qnorm()
  • computing confidence intervals
  • Summarising and visualising data sets

Glossary Week 4

  • standard error
  • Normal distribution
  • Other distributions (uniform, Poisson)
  • The Central Limit Theorem
  • mean, mode, median, range
  • quartiles, interquartile range
  • quantiles
  • confidence intervals