- Organise and import your data (long format!)
- Create a data frame in R
- create different kinds of variables
- assign them to an object name
- stick them into a ‘data frame’
- Measuring variability graphically
- draw a histogram and interpret it
- draw a boxplot and interpret it
- Computing and understanding
- the sum of squared errors
- the variance
- the standard deviation
What will we do today
- Example questions for the upcoming midsemester test
- The standard error
- Distributions
- The central limit theorem
- The normal distribution
- computing probabilities and quantiles
- computing confidence intervals
Assignment 1
- Assignment 1 will look much like a lab report
- It will be uploaded on Tuesday evening
- You have until Friday the 19th of August, 4pm, to complete it
- You hand it in on Canvas
- Do not plagiarise, it has to be your own work!
- The TAs will not give advice on the assignment
- Questions?
Quick exercise
You tested a drug on a group of 7 people, and you have a control group of 6 people. You also recorded the sex, the body height and body weight of your patients. Your response variable is the heart rate before and after administering the drug.
- What does your final data frame look like? (sketch it on paper)
- Describe all your variables (categorical, continuous, response, predictor…?)
- What plots could you use to visualise this data set?
- How would you create this data set in R?
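One possible way to build this data set in R, sketched in long format (one row per patient and measurement). All values are made up for illustration, and the column names (`id`, `group`, `sex`, `height`, `weight`, `time`, `heart_rate`) are assumptions, not a prescribed solution.

```r
set.seed(1)  # made-up numbers, for illustration only

n_drug <- 7
n_control <- 6
n <- n_drug + n_control  # 13 patients in total

patients <- data.frame(
  # every patient appears twice: once before, once after
  id = factor(rep(1:n, times = 2)),
  group = factor(rep(rep(c('drug', 'control'), times = c(n_drug, n_control)), times = 2)),
  sex = factor(rep(sample(c('f', 'm'), n, replace = TRUE), times = 2)),
  height = rep(round(rnorm(n, mean = 170, sd = 8)), times = 2),   # cm
  weight = rep(round(rnorm(n, mean = 70, sd = 10)), times = 2),   # kg
  time = factor(rep(c('before', 'after'), each = n), levels = c('before', 'after')),
  heart_rate = round(rnorm(2 * n, mean = 70, sd = 8))             # response variable
)

str(patients)   # 26 rows, 7 columns
head(patients)
```

In this layout `heart_rate` is the response, `group`, `sex` and `time` are categorical predictors, and `height` and `weight` are continuous predictors.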
Some example MC questions
Below, indicate which variable is most likely binomial
- Hair colour
- Sex
- Body height
- Age class
- None of the above
Some example MC questions
Which two variables are most unlikely response variables?
- ‘Hair colour’ and ‘body height’
- ‘Education of parents’ and ‘smoking habits of parents’
- ‘Drug dosage’ and ‘fertiliser level’
- ‘Heart rate’ and ‘blood sugar level’
- None of the above
Some example MC questions
What is the variance of this sample: [3, 4, 3, 2] ? (Calculate it by hand)
- \(5\)
- \(0.66\)
- \(0\)
- \(\sqrt 6\)
- \(4\)
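After working it out by hand, the calculation can be checked in R. This is a minimal sketch that follows the usual steps (mean, sum of squared errors, division by n - 1); the object names are chosen here for illustration.

```r
x <- c(3, 4, 3, 2)

n <- length(x)
m <- mean(x)                # the mean is 3
ss <- sum((x - m)^2)        # sum of squared errors: 0 + 1 + 0 + 1 = 2
variance <- ss / (n - 1)    # 2 / 3
std_dev <- sqrt(variance)   # standard deviation

all.equal(variance, var(x)) # compare with the built-in function
all.equal(std_dev, sd(x))
```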
Some example MC questions
When calculating the standard deviation, we divide by n-1 rather than by n because:
- This makes the formula look more scientific
- This makes sure that the sum of squares is not zero
- This avoids division by zero
- Otherwise we are computing the variance
- None of the above
Some example MC questions
Which line correctly defines a categorical variable that is also nominal (using the software R)?
- (1) c(4, 2, 2, 5, 9, 10)
- (2) c(3.78, 5.98, 1.30, 22.43)
- (3) c('black', 'black', 'black', 'blond', 'brown', 'brown', 'blond')
- (4) rep(c('A', 'B', 'C'), each = 2)
- (5) both (3) and (4) are correct
Standard deviation and standard error
Consider this example:
x = c(10, 20)
y = c(5, 18, 22, 13, 9, 23)
sd(x)
[1] 7.071068
sd(y)
[1] 7.238784
Standard deviation and standard error
So: the standard deviation does not indicate how well we can estimate the mean. For this purpose, we use the standard error of the mean (note: sd = s = standard deviation): \[s.e. = \frac{s}{\sqrt{n}}\]
sd(x)/sqrt(2)
[1] 5
sd(y)/sqrt(6)
[1] 2.955221
Standard deviation and standard error
Important to remember
- The variance and standard deviation represent the same thing:
- The spread in a variable, how much variability there is
- The higher the value, the higher the variability
- With increasing sample size, we achieve a more precise estimate for the variability
- The standard error
- measures how well we estimate the mean of the population
- decreases with the number of observations because we gain more confidence in the estimate of the mean
Calculating the standard error in R:
friends <- c(1, 2, 3, 3, 4)
sd(friends)/sqrt(5) #or:
[1] 0.509902
sd(friends)/sqrt(length(friends)) #which of the two is better?
[1] 0.509902
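One way to avoid retyping the sample size is to wrap the calculation in a small function. The helper `se()` below is hypothetical (it is not part of base R) and is only a sketch of the formula s / sqrt(n).

```r
# Hypothetical helper, not a base R function: standard error of the mean
se <- function(x) {
  sd(x) / sqrt(length(x))   # length(x) adapts automatically to the data
}

friends <- c(1, 2, 3, 3, 4)
se(friends)                 # same result as sd(friends)/sqrt(5)

# The same function works unchanged when observations are added
more_friends <- c(friends, 2, 5, 3)
se(more_friends)
```

Using `length()` instead of a hard-coded number means the code stays correct if the data change.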
Additional resources
Bozeman Science
- Standard deviation: youtube.com/watch?v=09kiX3p5Vek
- Standard error: youtube.com/watch?v=BwYj69LAQOI
The mean, the mode, and the median
- Mean: see last week’s slides
- Mode
- The most frequent score or category (The peak of the histogram!)
- Median
- The middle score when scores are ordered. In R: median()
- Example: number of friends of 11 Facebook users:
22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252
What is the median?
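The Facebook example can be checked in R. This is a small sketch; the object name `friends` is chosen here for illustration.

```r
friends <- c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)

# With 11 ordered scores, the middle one is the 6th
sort(friends)[(length(friends) + 1) / 2]

median(friends)   # built-in function, gives the same score
mean(friends)     # for comparison: influenced by the extreme value 252
```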
The dispersion: range
- The Range
- The smallest score subtracted from the largest. In R: range() returns the smallest and the largest score; diff(range()) gives their difference
- Example
- Number of friends of 11 Facebook users.
- 22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252
- Range = 252 – 22 = 230
- What could be a problem when you indicate the range of a variable? What could we be missing?
=> this metric is sensitive to extreme values, so it is often not very informative!
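A short sketch of the range calculation in R, including what happens when the single extreme value is left out. The object name `friends` is an assumption for illustration.

```r
friends <- c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)

range(friends)         # smallest and largest score: 22 and 252
diff(range(friends))   # the range as a single number: 252 - 22 = 230

# Without the one extreme user the range shrinks a lot: 121 - 22 = 99
diff(range(friends[friends != 252]))
```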
The dispersion: the interquartile range
- Quartiles
- The three values that split the sorted data into four equal parts.
- Second quartile = median.
- Lower quartile (25\(^{th}\) percentile) = median of lower half of the data.
- Upper quartile (75\(^{th}\) percentile) = median of upper half of the data.
- And of course you can have any ‘percentile’ (e.g. the 5\(^{th}\))
=> quartiles/medians are not sensitive to extreme values!
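Quartiles and the interquartile range (upper quartile minus lower quartile) can be computed with `quantile()` and `IQR()`. Note that R's default method interpolates between scores, so its quartiles can differ slightly from the "median of each half" rule described above. The sketch below uses the Facebook example; the object names are illustrative.

```r
friends <- c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)

quantile(friends, probs = c(0.25, 0.5, 0.75))  # lower quartile, median, upper quartile
IQR(friends)                                   # upper quartile minus lower quartile
quantile(friends, probs = 0.05)                # any other percentile, e.g. the 5th

# Making the extreme value ten times larger leaves the IQR unchanged
extreme <- replace(friends, friends == 252, 2520)
IQR(extreme)
```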
Metrics to summarise data: example question
If you were to describe yearly income in the United States, what metrics would you use? Why?
- A good metric to use would be the median. The mean would be distorted by very few extreme values. The range would not be informative at all. The interquartile range would be somewhat useful.
Distributions
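# Conceptual sketch of a distribution: histogram of 1000 standard normal draws
hist(rnorm(1000), xlab = 'Score or Quantile',
ylab = 'Density or Probability', main = "",
axes = FALSE, cex.lab = 1.5)
axis(1, cex.axis = 1.5)  # draw the x-axis only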
The uniform distribution
- Parameters to specify are the minimum and the maximum
- Equal probabilities between the min and the max
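A minimal sketch of the uniform distribution in R, assuming a minimum of 0 and a maximum of 10 (illustrative values). `runif()` draws random numbers and `punif()` gives probabilities.

```r
set.seed(42)  # for reproducibility

u <- runif(10000, min = 0, max = 10)
range(u)      # all values lie between the minimum and the maximum

# Proportion of values in each of the ten unit-wide bins: all close to 0.1
table(cut(u, breaks = 0:10)) / length(u)

# Exact probability of drawing a value below 2.5
punif(2.5, min = 0, max = 10)
```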
The Poisson distribution
- Parameter to specify is lambda (\(\lambda\)), which represents both the mean and the variance
- Used for count data: only non-negative integers (0, 1, 2, …), i.e. discrete rather than continuous variables
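A small simulation can illustrate that lambda is both the mean and the variance. This sketch assumes lambda = 3, a value picked only for illustration.

```r
set.seed(42)  # for reproducibility

counts <- rpois(10000, lambda = 3)

mean(counts)   # close to lambda
var(counts)    # also close to lambda

all(counts == round(counts))   # counts are whole numbers
min(counts) >= 0               # and never negative

dpois(0, lambda = 3)           # probability of observing a count of zero
```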
The normal distribution
The normal distribution is symmetrical, and both its tails extend infinitely
The two parameters are the mean and the standard deviation
The standard normal distribution has mean 0 and standard deviation 1
In R, you can create normally distributed random numbers using the function rnorm()
The normal distribution is of particular importance! (Central Limit Theorem, assumptions of standard parametric tests)
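A brief sketch of `rnorm()` in use. The mean of 160 and standard deviation of 6 are illustrative values; without these arguments `rnorm()` draws from the standard normal distribution.

```r
set.seed(42)  # for reproducibility

z <- rnorm(10000)                             # standard normal: mean 0, sd 1
heights <- rnorm(10000, mean = 160, sd = 6)   # illustrative parameters

c(mean(z), sd(z))               # close to 0 and 1
c(mean(heights), sd(heights))   # close to 160 and 6

# Symmetry: about the same share of values below -1 as above 1
mean(z < -1)
mean(z > 1)
```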
The Central Limit Theorem
‘The sampling distribution of the sample means of any distribution approaches a normal distribution as the sample size gets larger!’
Let’s visualise this:
https://seeing-theory.brown.edu/probability-distributions/index.html#section3
The ‘seeing theory’ page is absolutely great by the way!
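The theorem can also be explored with a small simulation in R. The sketch below assumes a uniform distribution between 0 and 10 as the starting point, a sample size of 30, and 5000 repeated samples; all of these are illustrative choices, and `sample_means()` is a hypothetical helper.

```r
set.seed(42)  # for reproducibility

# Hypothetical helper: draw `reps` samples of size n from a uniform
# distribution and return the mean of each sample
sample_means <- function(n, reps = 5000) {
  replicate(reps, mean(runif(n, min = 0, max = 10)))
}

means <- sample_means(n = 30)

mean(means)   # close to the population mean of 5
sd(means)     # close to the theoretical standard error

# Theoretical standard error: population sd divided by sqrt(n)
theoretical_se <- (10 / sqrt(12)) / sqrt(30)
theoretical_se

# If the means are roughly normal, about 95% fall within 1.96 standard errors
mean(abs(means - 5) < 1.96 * theoretical_se)
```

Calling `hist(means)` should show a bell shape, even though the individual values come from a flat distribution.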
From quartiles to quantiles
For a standard normal distribution:
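As a sketch, the quartiles and other quantiles of the standard normal distribution can be obtained with `qnorm()`, which uses mean 0 and standard deviation 1 by default.

```r
# Quartiles of the standard normal distribution
quartiles <- qnorm(c(0.25, 0.5, 0.75))
quartiles                      # about -0.674, 0 and 0.674

# Quantiles that cut off 2.5% in each tail
tails <- qnorm(c(0.025, 0.975))
tails                          # about -1.96 and 1.96

# Share of the distribution between those two quantiles
pnorm(tails[2]) - pnorm(tails[1])
```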
Distributions - examples using our data
Class exercise… Enter into the Google Doc:
- How many siblings have you got?
- How tall are you (in cm)?
- How tall are your parents?
- Enter a random number between 0 and 10
Questions to ask:
- Can we read this data set into R? How?
- What kind of variables are there?
- How can we summarise/visualise this data set?
- How are these variables distributed? (Histograms!)
- What distributions will those variables follow, i.e. how frequent are we expecting certain values to be?
- E.g. for the variable body height, how frequently do we expect a value of 150, 170, 190 to occur?
- For the random number, how frequently do we expect a value of 1, 5, or 10?
Playing with our data set: distributions
Let’s play…
Examples with human body height
Females: mean = 160 cm, sd = 6 cm, males: mean = 170 cm, sd = 7 cm
Examples with human body height
What is the probability of being shorter than 175 cm if you are a woman?
pnorm(q = 175, mean = 160, sd = 6)
[1] 0.9937903
Examples with human body height
What is the maximum height for 95% of the male population, i.e. the height below which 95% of men fall?
qnorm(p = .95, mean = 170, sd = 7)
[1] 181.514
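A few more variations on the same idea, sketched with the means and standard deviations given above. The cut-off heights (185, 154 and 166 cm) are chosen here only for illustration.

```r
# Probability that a man is taller than 185 cm (upper tail)
p_tall <- 1 - pnorm(q = 185, mean = 170, sd = 7)
p_tall
pnorm(q = 185, mean = 170, sd = 7, lower.tail = FALSE)  # same result

# Probability that a woman is between 154 and 166 cm (within one sd of the mean)
p_between <- pnorm(166, mean = 160, sd = 6) - pnorm(154, mean = 160, sd = 6)
p_between   # about 0.68

# pnorm() and qnorm() undo each other
pnorm(qnorm(p = .95, mean = 170, sd = 7), mean = 170, sd = 7)
```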
Calculating confidence intervals
A 95% confidence interval is an interval constructed so that, if we took many samples and computed an interval from each, about 95% of those intervals would contain the true mean.
In plain language, it gives us an idea within which range the true mean likely lies.
Confidence intervals are computed by subtracting/adding an error term to the mean:
For a 95% confidence interval and large samples (>30), the error is the 97.5% quantile of the standard normal distribution (1.96) times the standard error. For example, with a mean of 10, a standard deviation of 1.58, and a sample size of 30:
\[CI = 10 \pm 1.96 \frac{1.58}{\sqrt{30}} = 10 \pm 0.57\]
So the true mean is likely to lie between 9.43 and 10.57.
https://seeing-theory.brown.edu/frequentist-inference/index.html
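The worked example can be reproduced from the summary values alone. This is a minimal sketch; the object names are illustrative.

```r
m <- 10     # sample mean
s <- 1.58   # sample standard deviation
n <- 30     # sample size

z <- qnorm(p = 0.975)       # 97.5% quantile of the standard normal, about 1.96
error <- z * s / sqrt(n)    # quantile times standard error

round(error, 2)                                    # 0.57
round(c(lower = m - error, upper = m + error), 2)  # 9.43 and 10.57
```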
Calculating confidence intervals
In R (note the small sample size):
x = c(8, 12, 10, 9, 11)
m = mean(x)
n = 5
error = qnorm(p = .975)*sd(x)/sqrt(n)
lower = m - error
upper = m + error
lower
[1] 8.614096
upper
[1] 11.3859
The formula for a 95% CI is simple: \[CI = mean \pm quantile_{97.5} \frac{sd}{\sqrt{n}}\] Adapt the quantile value if you would like to calculate e.g. a 90% CI
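Adapting the quantile can be sketched with a small function that takes the confidence level as an argument. `ci_normal()` is a hypothetical helper written for this illustration, and it uses the normal quantile exactly as in the formula above.

```r
# Hypothetical helper: confidence interval for the mean based on the
# normal quantile, as in the formula above
ci_normal <- function(x, level = 0.95) {
  p <- 1 - (1 - level) / 2                  # e.g. 0.975 for a 95% CI, 0.95 for a 90% CI
  error <- qnorm(p) * sd(x) / sqrt(length(x))
  c(lower = mean(x) - error, upper = mean(x) + error)
}

x <- c(8, 12, 10, 9, 11)

ci95 <- ci_normal(x)                 # 95% CI, same as the step-by-step version
ci90 <- ci_normal(x, level = 0.90)   # 90% CI uses the 95% quantile

ci95
ci90   # narrower than the 95% interval
```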
More on confidence intervals
Some more background, explained differently (not part of the test/exam):
https://www.youtube.com/watch?v=vWFdiGg7f6k
You will just have to be able to calculate and interpret confidence intervals!
What will we have learnt in Week 4?
- The standard error
- how it differs from the standard deviation
- when it is used
- The normal distribution
- the mean, the median and the mode
- computing quartiles of the normal distribution
- computing probabilities and quantiles of the normal distribution using pnorm() and qnorm()
- computing confidence intervals
- Summarising and visualising data sets
Glossary Week 4
- standard error
- Normal distribution
- Other distributions (uniform, Poisson)
- The Central Limit Theorem
- mean, mode, median, range
- quartiles, interquartile range
- quantiles
- confidence intervals