Workshop 17 [Week 7]: Model assumptions

Lecture review

Parametric models (t-test, ANOVA, linear regression) make assumptions about their input (the data).
Assumptions relate to the normal distribution
- data must be continuous
- observations must be independent and identically distributed
The normal distribution is assumed because of the central limit theorem.

Learning outcomes

This workshop, along with the lecture on model assumptions should give you the opportunity to be able to

name central properties and aspects of the normal distribution.
understand the conceptual idea of the central limit theorem.
use basic simulations in R to understand both.

For revision of central limit theorem and normal distribution see Matloff (2019) chapter 9.7 and Baguley (2012) chapter 2.4.1

Setup

You need the package psyntur:

library(psyntur) 
packageVersion("psyntur") # should be version 0.0.2

[1] '0.0.2'

The normal distribution

Simulate normal distributed data

To draw out a normal distribution, we need to know the mean and the standard deviation.

The function rnorm (r=random, norm=normal) allows us to randomly sample normal distributed data. The function takes three parameter values: the number of samples n we want to obtain from a distribution with a mean, and a standard deviation sd or some sort. Go through the example revealed by Buttons 1. – 3.

1. Create random data

Create random distributed data:

n <- 5 # Number of observations to be sampled.
mean <- 500 # True population mean  
sd <- 100 # True population standard deviation
x <- rnorm(n = n, mean = mean, sd = sd)

You won’t have the same values. Every time you use rnorm you will get new values (i.e. magic). Check the randomly sampled observations by typing x and Return in the R console.

x # Tada, 5 random values

[1] 443.9524 476.9823 655.8708 507.0508 512.9288

2. Why fake data?

Functions like rnorm allow us to control population parameters. In reality we rarely know the population mean, say, the average time it takes to recover from COVID-19, but we can estimate it from a large enough sample and repeated sampling from the population. These unknown population parameters are conventionally indicated with Greek letters, i.e. \(\mu\) (mu; the population mean) and \(\sigma\) (sigma; the population standard deviation).

Because we don’t know the population parameters in real life, functions like rnorm are great. We can simulate data of which we do know the parameter values, because we defined both above (i.e. mean, sd).

3. Plot the data

Plot the “fake” data from above.

histogram(x, data = NULL)

Here is the mean of the sample:

mean(x)

[1] 519.357

and the standard deviation of the sample:

sd(x)

[1] 81.10218

Questions

1. Question

Your numbers will be different from mine because we take random samples. Are the data in your histogram normal distributed? Why do you think they are (or are not) normal distributed?

2. Question

Compare the sample mean and the sample standard deviation to our population mean and population standard deviation. Are they different? Remember, we are in the almost unique situation to know the population parameters. Why do you think are the sample mean and standard deviation different from the population mean and standard deviation?

Tasks

1. Task

Repeat the steps above but this time create a histogram with sample size n=1000. What changes did you make in the code? How has the histogram changed? Calculate the sample mean and sd again. How did they change?

2. Task

Repeat the steps above but this time with a sample size n=1000 and a sd of 10. What changes did you make in the code? How has the histogram changed? Calculate the sample mean and sd again. How did they change?

3. Task

If we know the population mean and population variance, we know everything we need to know to create data we would be expected under the normal distribution. Let’s do this for IQ which has a population mean of 100 and a population sd of 15. Use an n of 1000 samples and plot a histogram of IQ.

The area under the curve

Let’s continue with the IQ example from the lecture. We know the population values, which is extremely handy. Here is a density plot of IQ with a mean of 100 and a sd of 15. This should look similar to your plot from task 3. Except instead of counts / frequency, the density plot describes the relative likelihood of IQ values.

Remember, the area underneath the curve must sum to 1. In other words the likelihood of an observation to take on an IQ of any value under the curve is 1 (or 100%). The likelihood of a value to be outside this area is > 0 (%). Those are the extremes.

We can now use the pnorm (probability normal) function to calculate the area underneath the curve. You might have done calculus and integral calculations in school. No worries, we won’t go there :)

Again, we take the population values (mean and standard deviation) of IQ.

mean <- 100
sd <- 15

Say we want to know the probability of observing a person with an IQ below 100 which is the red shaded area in the density plot:

The probability of observing a person with an IQ below 100 can be determined like this:

pnorm(100, mean = mean, sd = sd)

[1] 0.5

In other words, 50% (0.5) of the population have an IQ below 100.

Tasks

1. Task

Calculate the probability of observing a person with an IQ below 75 corresponding to the area shaded in this plot:

2. Task

Calculate the probability of observing a person with an IQ above 75 corresponding to the area shaded in the plot below.

Hint: Keep in mind that pnorm returns the probability of IQ lower than the critical value. The entire area underneath the curve sums to 1 (100%) and the distribution is symmetric. In other words you can use 1 - x where x is the result returned from pnorm. Make sure you understand why :)

3. Task

Calculate the probability of observing a person with an IQ between 95 and 110 corresponding to the area shaded in the plot below. Example below.

As an example, here is the probability of observing a person with an IQ between 75 and 95

pnorm(95, mean = mean, sd = sd) - pnorm(75, mean = mean, sd = sd)

[1] 0.321651

which is the probability of observing a person with an IQ below 95

pnorm(95, mean = mean, sd = sd)

[1] 0.3694413

but removing the probability of observing a person with an IQ below 75

pnorm(75, mean = mean, sd = sd)

[1] 0.04779035

4. Task

Calculate the probability of observing a person with an IQ between 99.9 and 100.1 corresponding to the area shaded in the plot below.

5. Task

Calculate the probability of observing a person with an IQ of below 60 or above 140; see the area shaded in the plot.

6. Task (bonus)

Calculate the probability of observing a person with an IQ outside of one standard deviation around the mean (see plot). Reminder the population mean of IQ is 100 and its standard deviation is 15.

Central limit theorem

The central limit theorem states that the sampling distribution will approach normality when the sample size increase, regardless of the shape of the distribution we are sampling from. For the central limit theorem to work, samples must be independent and identically distributed (iid; see lecture).

Let’s look at the depression example from the lecture. The CES-D scale for self-report depression (Radloff, 1977) contains 22 items. Say we ask participants to indicate on a 5-point Likert scale to what extent they agree with each of the 22 statements (1 = strongly disagree – 5 = Strongly agree).

Item 1: I was bothered by things that usually don’t bother me.
Item 2: I had a poor appetite.
Item 3: I did no feel like eating, even though I should have been hungry.
…
Item 22: I didn’t enjoy life.

Questions

1. Question

What type of data are the responses to each item?

2. Question

Are responses to these items are normal distributed? Why (or why not)?

Simulation

We will see now, why we can still use linear-regression models that assume normally distributed data, even though the data we obtain are not normal distributed.

1. Task

First we need to define the response options available to the participant.

# response options
response <- 1:5 # people can only respond with a number from 1 to 5
# Check out the vector
response # numbers from 1 to 5 (easy)

[1] 1 2 3 4 5

We can use sample to take random samples from response. Try out; there is a \(\frac{1}{5}\) chance the number you obtained is the same as mine. Do you know why?

sample(response, size = 1) # Sample 1 random number between 1 and 5

[1] 4

What’s the code for sampling 2 random numbers?

2. Task

To sample more than 5 random numbers we need to set replace = TRUE to allow the same number to be sample more than once (called sampling with replacement).

sample(response, size = 6, replace = TRUE) # Sample 6 random numbers between 1 and 5

[1] 1 2 3 5 3 3

Remember that the depression scale above has 22 items. So if we want to simulate one participant who responds to every of the 22 items (at random), what would you need to change size to?

3. Task

You probably worked out in Task 3 that size has to be 22. Lets save the result in ppt_1.

ppt_1 <- sample(response, size = 22, replace = TRUE) # Simulate one participant

Let’s plot these data to see how they are distributed. Use histogram() and set data to NULL (cause we don’t use a data frame) and set x to ppt_1. Replace xxxs.

# create histogram of ppt_1
#histogram(data = NULL, x = xxx)

Is this distribution normal? Why or why not?

4. Task

We’ve seen how we can sample random data for one participant that answers all 22 depression items. Now, lets demonstrated the central limit theorem. Remember that the central limit theorem states that we will arrive at a normal distribution if our sample is approaching infinity (or a large number) regardless of the shape of the distribution we’re sampling from.

We don’t learn much from sampling data for one hypothetical participant. We can use the replicate function to do the same for 2 participants.

replicate(2, sample(response, size = 22, replace = T))

      [,1] [,2]
 [1,]    5    1
 [2,]    5    2
 [3,]    3    4
 [4,]    1    5
 [5,]    2    5
 [6,]    5    3
 [7,]    5    1
 [8,]    4    4
 [9,]    5    1
[10,]    2    1
[11,]    1    3
[12,]    1    4
[13,]    3    1
[14,]    1    3
[15,]    5    5
[16,]    1    3
[17,]    2    2
[18,]    4    5
[19,]    4    5
[20,]    3    3
[21,]    1    2
[22,]    2    2

Let’s save the data for two participants in two_ppts and calculate the means for each column (i.e. participant) using colMeans.

two_ppts <- replicate(2, sample(response, size = 22, replace = T))
two_ppts_means <- colMeans(two_ppts)

These are the means:

two_ppts_means

[1] 3.227273 2.909091

Here is a histogram of two participant means:

histogram(data = NULL, x = two_ppts_means)

This was good but still not very impressive in terms of normal distributions. Get a histogram for 10, 100 and then 1000 hypothetical participants. Replace xxx accordingly and create a histogram each time.

# samps <- replicate(xxx, sample(response, size = 22, replace = T))

#samps_means <- colMeans(xxx)

#histogram(data = NULL, x = xxx)

Share the plot that you think shows a normal distribution best in your Teams chat:

To share your plot: Click on Export in the Plots panel, then Copy to Clipboard ... and Copy Plot, then go to your Teams chat, click in the chat box and click CTRL + V to insert the plot (or CMD + V on Mac).

5. Task (bonus)

You feel the previous task wasn’t challenging enough? Try to do the same histogram of the sampling distribution again but instead of means, use total; i.e. replace colMeans with colSums. The central tendency theorem works not just for means but also for totals and even for standard deviations.

Summary

Congratulations, you just

simulated normal distributed data.
learned how to use means and standard deviations to calculate the likelihood of observing a range of possible values.
used a simulation to create a sampling distribution from a sample of non-normal distributed data and thereby
demonstrated the central limit theorem.

References

Baguley, T. (2012). Serious stats: A guide to advanced statistics for the behavioral sciences. Macmillan International Higher Education.

Matloff, N. (2019). Probability and statistics for data science: Math + R + data. CRC Press.

Radloff, L. S. (1977). The CES-D scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1(3), 385–401.