Workshop 17 [Week 7]: Model assumptions
Lecture review
- Parametric models (t-test, ANOVA, linear regression) make assumptions about their input (the data).
- Assumptions relate to the normal distribution
- data must be continuous
- observations must be independent and identically distributed
- The normal distribution is assumed because of the central limit theorem.
Learning outcomes
This workshop, along with the lecture on model assumptions, should enable you to
- name central properties and aspects of the normal distribution.
- understand the conceptual idea of the central limit theorem.
- use basic simulations in R to understand both.
For revision of the central limit theorem and the normal distribution, see Matloff (2019), chapter 9.7, and Baguley (2012), chapter 2.4.1.
Setup
You need the package psyntur:
library(psyntur)
packageVersion("psyntur") # should be version 0.0.2
[1] '0.0.2'
The normal distribution
Simulate normally distributed data
To draw a normal distribution, we need to know its mean and its standard deviation.
The function rnorm (r = random, norm = normal) allows us to randomly sample normally distributed data. The function takes three parameter values: the number of samples n we want to obtain, the mean of the distribution we sample from, and its standard deviation sd. Go through the example in steps 1 to 3 below.
1. Create random data
Create random distributed data:
n <- 5 # Number of observations to be sampled.
mean <- 500 # True population mean
sd <- 100 # True population standard deviation
x <- rnorm(n = n, mean = mean, sd = sd)
You won’t have the same values. Every time you use rnorm you will get new values (i.e. magic). Check the randomly sampled observations by typing x and pressing Return in the R console.
# Tada, 5 random values
x
[1] 443.9524 476.9823 655.8708 507.0508 512.9288
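Because rnorm draws new values on every call, results differ from run to run. If you want the same "random" values each time (for example, to compare results with a classmate), base R's set.seed can fix the state of the random number generator. The following is a minimal sketch; the seed value 42 and the names x1, x2 and x3 are arbitrary choices for illustration.

```r
n <- 5      # Number of observations to be sampled
mean <- 500 # True population mean
sd <- 100   # True population standard deviation

set.seed(42) # Fix the random number generator (42 is an arbitrary choice)
x1 <- rnorm(n = n, mean = mean, sd = sd)

set.seed(42) # Same seed again, so the same values are drawn
x2 <- rnorm(n = n, mean = mean, sd = sd)

x3 <- rnorm(n = n, mean = mean, sd = sd) # No reset: new values

identical(x1, x2) # TRUE: same seed, same sample
identical(x1, x3) # FALSE: the generator has moved on
```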
2. Why fake data?
Functions like rnorm allow us to control population parameters. In reality we rarely know the population mean, say, the average time it takes to recover from COVID-19, but we can estimate it from a large enough sample and repeated sampling from the population. These unknown population parameters are conventionally indicated with Greek letters, i.e. \(\mu\) (mu; the population mean) and \(\sigma\) (sigma; the population standard deviation).
Because we don’t know the population parameters in real life, functions like rnorm are great. We can simulate data of which we do know the parameter values, because we defined both above (i.e. mean, sd).
3. Plot the data
Plot the “fake” data from above.
histogram(x, data = NULL)
Here is the mean of the sample:
mean(x)
[1] 519.357
and the standard deviation of the sample:
sd(x)
[1] 81.10218
Questions
1. Question
Your numbers will be different from mine because we take random samples. Are the data in your histogram normally distributed? Why do you think they are (or are not) normally distributed?
2. Question
Compare the sample mean and the sample standard deviation to our population mean and population standard deviation. Are they different? Remember, we are in the almost unique situation of knowing the population parameters. Why do you think the sample mean and standard deviation are different from the population mean and standard deviation?
Tasks
1. Task
Repeat the steps above but this time create a histogram with sample size n=1000. What changes did you make in the code? How has the histogram changed? Calculate the sample mean and sd again. How did they change?
2. Task
Repeat the steps above but this time with a sample size n=1000 and a sd of 10. What changes did you make in the code? How has the histogram changed? Calculate the sample mean and sd again. How did they change?
3. Task
If we know the population mean and the population standard deviation, we know everything we need to create the data we would expect under the normal distribution. Let’s do this for IQ, which has a population mean of 100 and a population sd of 15. Use an n of 1000 samples and plot a histogram of IQ.
The area under the curve
Let’s continue with the IQ example from the lecture. We know the population values, which is extremely handy. Here is a density plot of IQ with a mean of 100 and a sd of 15. This should look similar to your plot from Task 3, except that instead of counts (frequencies), the density plot describes the relative likelihood of IQ values.
Remember, the total area underneath the curve must be 1. In other words, the probability that an observation takes on some IQ value under the curve is 1 (or 100%). The plot shows only part of the IQ scale: values beyond the plotted range are possible (their probability is greater than 0) but extremely unlikely. Those are the extremes.
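The statement that the whole area under the curve is 1 can be checked numerically. The sketch below uses base R's integrate together with dnorm (the normal density); the names mean_iq and sd_iq and the choice to integrate over 10 standard deviations either side of the mean are assumptions made for this illustration.

```r
mean_iq <- 100 # Population mean of IQ
sd_iq <- 15    # Population standard deviation of IQ

# Numerically integrate the normal density over a very wide range
# (10 standard deviations either side of the mean).
area <- integrate(dnorm,
                  lower = mean_iq - 10 * sd_iq,
                  upper = mean_iq + 10 * sd_iq,
                  mean = mean_iq, sd = sd_iq)

area$value # Approximately 1
```

Even this wide range leaves out a tiny amount of probability in the two tails, which is why extreme values are unlikely but never strictly impossible.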
We can now use the pnorm (probability normal) function to calculate the area underneath the curve. You might have done calculus and integral calculations in school. No worries, we won’t go there :)
Again, we take the population values (mean and standard deviation) of IQ.
mean <- 100
sd <- 15
Say we want to know the probability of observing a person with an IQ below 100 which is the red shaded area in the density plot:
The probability of observing a person with an IQ below 100 can be determined like this:
pnorm(100, mean = mean, sd = sd)
[1] 0.5
In other words, 50% (0.5) of the population have an IQ below 100.
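One way to convince yourself of what pnorm returns is to compare it with a simulation: sample many IQ values with rnorm and count the proportion that fall below 100. This is only a sketch; the sample size of 100000, the seed, and the names iq, simulated and exact are arbitrary choices for illustration.

```r
set.seed(1) # Arbitrary seed so the result is reproducible

# Simulate a large "population" of IQ values
iq <- rnorm(n = 100000, mean = 100, sd = 15)

# Proportion of simulated values below 100
simulated <- mean(iq < 100)

# Exact probability from the normal distribution
exact <- pnorm(100, mean = 100, sd = 15)

simulated # Close to 0.5, but not exactly 0.5
exact     # 0.5
```

The simulated proportion only approximates the exact probability; the larger the simulated sample, the closer the two tend to be.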
Tasks
1. Task
Calculate the probability of observing a person with an IQ below 75 corresponding to the area shaded in this plot:
2. Task
Calculate the probability of observing a person with an IQ above 75 corresponding to the area shaded in the plot below.
Hint: Keep in mind that pnorm returns the probability of IQ lower than the critical value. The entire area underneath the curve sums to 1 (100%) and the distribution is symmetric. In other words you can use 1 - x where x is the result returned from pnorm. Make sure you understand why :)
3. Task
Calculate the probability of observing a person with an IQ between 95 and 110 corresponding to the area shaded in the plot below. Example below.
As an example, here is the probability of observing a person with an IQ between 75 and 95
pnorm(95, mean = mean, sd = sd) - pnorm(75, mean = mean, sd = sd)
[1] 0.321651
which is the probability of observing a person with an IQ below 95
pnorm(95, mean = mean, sd = sd)
[1] 0.3694413
but removing the probability of observing a person with an IQ below 75
pnorm(75, mean = mean, sd = sd)
[1] 0.04779035
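The subtraction in this example can be wrapped in a small helper function. The function name prob_between and its arguments are hypothetical (they are not part of R or psyntur); this is just one possible way to write it.

```r
# Hypothetical helper: probability of a value between lower and upper
# under a normal distribution with the given mean and sd.
prob_between <- function(lower, upper, mean = 100, sd = 15) {
  # Area below upper minus area below lower
  pnorm(upper, mean = mean, sd = sd) - pnorm(lower, mean = mean, sd = sd)
}

# Same calculation as in the example: IQ between 75 and 95
prob_between(75, 95)
```

Called with 75 and 95, the helper performs the same calculation as the example, so it should return the same value of about 0.32.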
4. Task
Calculate the probability of observing a person with an IQ between 99.9 and 100.1 corresponding to the area shaded in the plot below.
5. Task
Calculate the probability of observing a person with an IQ below 60 or above 140; see the area shaded in the plot.
6. Task (bonus)
Calculate the probability of observing a person with an IQ outside of one standard deviation around the mean (see plot). Reminder: the population mean of IQ is 100 and its standard deviation is 15.
Central limit theorem
The central limit theorem states that the sampling distribution of the mean approaches normality as the sample size increases, regardless of the shape of the distribution we are sampling from. For the central limit theorem to hold, samples must be independent and identically distributed (iid; see lecture).
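As a rough illustration of this idea, the sketch below samples from a strongly skewed distribution (the exponential distribution, via base R's rexp) and looks at the means of many samples. The exponential distribution, the sample size of 30, the 2000 repetitions, the seed and all variable names are assumptions chosen for this illustration only; they do not come from the lecture.

```r
set.seed(2) # Arbitrary seed so the result is reproducible

n_per_sample <- 30 # Observations in each sample
n_samples <- 2000  # Number of samples we draw

# For each sample: draw skewed (exponential) data and keep only the mean
sample_means <- replicate(n_samples, mean(rexp(n_per_sample, rate = 1)))

# The exponential distribution with rate 1 has mean 1 and sd 1, so the
# sample means should centre on 1 with a spread of about 1 / sqrt(30).
mean(sample_means)
sd(sample_means)
1 / sqrt(n_per_sample)

# A histogram of sample_means, e.g. hist(sample_means), should look
# far more bell-shaped than a histogram of the raw exponential data.
```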
Let’s look at the depression example from the lecture. The CES-D scale for self-report depression (Radloff, 1977) contains 22 items. Say we ask participants to indicate on a 5-point Likert scale to what extent they agree with each of the 22 statements (1 = strongly disagree to 5 = strongly agree).
- Item 1: I was bothered by things that usually don’t bother me.
- Item 2: I had a poor appetite.
- Item 3: I did not feel like eating, even though I should have been hungry.
- …
- Item 22: I didn’t enjoy life.
Questions
1. Question
What type of data are the responses to each item?
2. Question
Are responses to these items normally distributed? Why (or why not)?
Simulation
We will now see why we can still use linear-regression models that assume normally distributed data, even though the data we obtain are not normally distributed.
1. Task
First we need to define the response options available to the participant.
# response options
response <- 1:5 # people can only respond with a number from 1 to 5
# Check out the vector
response # numbers from 1 to 5 (easy)
[1] 1 2 3 4 5
We can use sample to take random samples from response. Try it out; there is a \(\frac{1}{5}\) chance the number you obtained is the same as mine. Do you know why?
sample(response, size = 1) # Sample 1 random number between 1 and 5
[1] 4
What’s the code for sampling 2 random numbers?
2. Task
To sample more than 5 random numbers we need to set replace = TRUE to allow the same number to be sampled more than once (called sampling with replacement).
sample(response, size = 6, replace = TRUE) # Sample 6 random numbers between 1 and 5
[1] 1 2 3 5 3 3
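With replace = TRUE we can also draw a very large number of responses and check how often each option comes up. The sketch below uses base R's table and prop.table; the 10000 draws, the seed and the names draws and props are arbitrary choices for illustration.

```r
set.seed(3) # Arbitrary seed so the result is reproducible

response <- 1:5 # Response options

# Draw many responses with replacement
draws <- sample(response, size = 10000, replace = TRUE)

# Proportion of draws for each response option
props <- prop.table(table(draws))
props # Each proportion should be close to 1/5 = 0.2
```

All five options are roughly equally frequent, so a histogram of these responses is flat (uniform) rather than bell-shaped.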
Remember that the depression scale above has 22 items. So if we want to simulate one participant who responds to each of the 22 items (at random), what would you need to change size to?
3. Task
You probably worked out in Task 2 that size has to be 22. Let’s save the result in ppt_1.
ppt_1 <- sample(response, size = 22, replace = TRUE) # Simulate one participant
Let’s plot these data to see how they are distributed. Use histogram() and set data to NULL (because we don’t use a data frame) and set x to ppt_1. Replace the xxx.
# create histogram of ppt_1
#histogram(data = NULL, x = xxx)
Is this distribution normal? Why or why not?
4. Task
We’ve seen how we can sample random data for one participant who answers all 22 depression items. Now, let’s demonstrate the central limit theorem. Remember that the central limit theorem states that the sampling distribution approaches a normal distribution as the sample size becomes large (approaches infinity), regardless of the shape of the distribution we’re sampling from.
We don’t learn much from sampling data for one hypothetical participant. We can use the replicate function to do the same for 2 participants.
replicate(2, sample(response, size = 22, replace = T))
[,1] [,2]
[1,] 5 1
[2,] 5 2
[3,] 3 4
[4,] 1 5
[5,] 2 5
[6,] 5 3
[7,] 5 1
[8,] 4 4
[9,] 5 1
[10,] 2 1
[11,] 1 3
[12,] 1 4
[13,] 3 1
[14,] 1 3
[15,] 5 5
[16,] 1 3
[17,] 2 2
[18,] 4 5
[19,] 4 5
[20,] 3 3
[21,] 1 2
[22,] 2 2
Let’s save the data for two participants in two_ppts and calculate the means for each column (i.e. participant) using colMeans.
two_ppts <- replicate(2, sample(response, size = 22, replace = T))
two_ppts_means <- colMeans(two_ppts)
These are the means:
two_ppts_means
[1] 3.227273 2.909091
Here is a histogram of two participant means:
histogram(data = NULL, x = two_ppts_means)
This was good but still not very impressive in terms of normal distributions. Get a histogram for 10, 100 and then 1000 hypothetical participants. Replace xxx accordingly and create a histogram each time.
# samps <- replicate(xxx, sample(response, size = 22, replace = T))
#samps_means <- colMeans(xxx)
#histogram(data = NULL, x = xxx)
Share the plot that you think shows a normal distribution best in your Teams chat:
To share your plot: Click on Export in the Plots panel, then Copy to Clipboard ... and Copy Plot, then go to your Teams chat, click in the chat box and press CTRL + V to insert the plot (or CMD + V on Mac).
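If you want a numerical check alongside the histograms, you can compare the simulated participant means with what theory predicts. The sketch below assumes, as in the simulation above, that every response from 1 to 5 is equally likely; the 500 participants, the seed and the names n_items, mu, sigma and se are arbitrary choices for illustration.

```r
set.seed(4) # Arbitrary seed so the result is reproducible

response <- 1:5 # Response options
n_items <- 22   # Items per participant

# Theoretical mean and standard deviation of a single random response
mu <- mean(response)                   # (1 + 2 + 3 + 4 + 5) / 5 = 3
sigma <- sqrt(mean((response - mu)^2)) # sqrt(2), population formula

# Theoretical standard deviation of a participant's mean over 22 items
se <- sigma / sqrt(n_items)

# Simulate 500 hypothetical participants and their item means
samps <- replicate(500, sample(response, size = n_items, replace = TRUE))
samps_means <- colMeans(samps)

c(theory = mu, simulated = mean(samps_means)) # Centre of the histogram
c(theory = se, simulated = sd(samps_means))   # Spread of the histogram
```

The simulated values will not match the theoretical ones exactly, but they should get closer as the number of simulated participants grows.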
5. Task (bonus)
Do you feel the previous task wasn’t challenging enough? Try to create the same histogram of the sampling distribution again, but instead of means, use totals; i.e. replace colMeans with colSums. The central limit theorem works not just for means but also for totals and even for standard deviations.
Summary
Congratulations, you just
- simulated normally distributed data.
- learned how to use means and standard deviations to calculate the likelihood of observing a range of possible values.
- used a simulation to create a sampling distribution from a sample of non-normally distributed data and thereby
- demonstrated the central limit theorem.