What is a sampling distribution?
The sampling distribution of a statistic (e.g., the mean) is the distribution of values we would expect to obtain for that statistic if we drew an infinite number of samples of a given size from the population in question and calculated the statistic on each sample.
What is the central limit theorem (CLT)?
The shape of the sampling distribution of the mean approaches normal as the sample size (n) increases, regardless of the shape of the parent population.
However, the rate at which the sampling distribution of the mean approaches normal as the sample size (n) increases depends on the shape of the parent population (e.g., normal versus markedly skewed).
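To make the point about the parent population concrete, here is a minimal sketch that compares sample means drawn from a symmetric parent (uniform) with sample means drawn from a right-skewed parent (exponential) at the same small sample size. The exponential parent, the seed, the 5000 iterations, and the helper name skewness() are all assumptions chosen for illustration; base R has no built-in skewness function, so the helper is defined by hand.

```r
# Sketch: how skewed are the sample means at n = 5 for two different parents?
set.seed(1) # arbitrary seed so the run is reproducible

# Hypothetical helper: sample skewness (close to 0 for a symmetric distribution)
skewness <- function(x) {
  mean((x - mean(x))^3) / sd(x)^3
}

num_samples <- 5000 # many iterations so the skewness estimate is stable
sample_size <- 5

# replicate(n, expr) evaluates expr n times and collects the results in a vector
means_unif <- replicate(num_samples, mean(runif(sample_size))) # symmetric parent
means_exp <- replicate(num_samples, mean(rexp(sample_size))) # right-skewed parent

skew_unif <- skewness(means_unif)
skew_exp <- skewness(means_exp)
print(c(uniform = skew_unif, exponential = skew_exp))
```

If the claim above holds, the skewness for the uniform parent should sit near zero, while the skewness for the exponential parent should still be clearly positive at this sample size.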
Let’s compare two sampling distributions of means (sampdistA and sampdistB), both of which consist of 100 samples from a uniform distribution. Whereas sampdistA consists of samples with a sample size of 5, sampdistB consists of samples with a sample size of 50.
Before writing these scripts, let’s write a script for a sampling distribution of the mean that consists of 10 samples with a sample size of 5.
# To open a new script, use File > New Document (or, if you're using
# RStudio, use File > New File > R Script)
# How do we indicate the number of iterations (i.e., the number of samples)?
# Save it as a variable!
num_samples <- 10
# How do we indicate the sample size? Save it as a variable!
sample_size <- 5
# For each of the 10 iterations, there will be a sample mean. We need to
# create a vector that stores all of these sample means. How do we do this?
# There are lots of different right answers to this. One easy way is to make
# a vector of the right size filled with zeros, and then we'll over-write
# each of the zeros as we get our sample means.
sample_means <- rep(0, num_samples) # rep(a,b) means 'repeat a b times'
# How do we tell R to repeat the same three steps (draw a sample, compute its
# mean, store the mean) for each of the 10 iterations? We use a for loop.
for (i in 1:num_samples) {
  # i is the counter variable. Each time R goes through the for loop, i will
  # move one step forward in the sequence we specified (1:num_samples, which
  # means 'from 1 to num_samples', and num_samples = 10). When it gets to the
  # end (i = 10), it will stop.
  # How do we tell R to randomly select a sample with a sample size of 5 from
  # a uniform distribution?
  sample <- runif(sample_size) # by default, runif() draws from the uniform distribution between 0 and 1
  # How do we tell R to give us the mean of the sample that it just randomly
  # selected?
  sample_mean <- mean(sample) # Note that this variable is different from the storage vector we made above (called 'sample_means').
  # How do we tell R to store this sample mean in our vector of sample means?
  sample_means[i] <- sample_mean # Note the square brackets here to indicate a particular element in the vector. If you used round brackets, R would think you meant that sample_means() was a function.
  # This is where the counter variable (i) gets used. This is super important.
  # Whenever you make a for loop, you should always have some step within the
  # for loop that uses your counter variable.
} # This indicates the end of the for loop.
Run the above code by copy-pasting it into the console, or by opening the .Rmd file in R or RStudio, clicking on the line of code you want to run, and pressing Cmd-Enter (on a Mac) or Ctrl-Enter (in RStudio on a PC; the Windows R GUI uses Ctrl-R instead).
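As the comments note, there are lots of different right answers for building the vector of sample means. One alternative, shown here only as a sketch, is base R's replicate(), which repeats an expression a given number of times and collects the results, so no storage vector or counter variable is needed. The name sample_means_alt is hypothetical and is used only so this does not overwrite the sample_means vector from the loop.

```r
# Sketch: the same simulation as the for loop, written with replicate()
num_samples <- 10 # number of iterations (i.e., number of samples)
sample_size <- 5 # number of observations in each sample

# Draw sample_size uniform values, take their mean, and repeat num_samples times
sample_means_alt <- replicate(num_samples, mean(runif(sample_size)))

length(sample_means_alt) # one mean per sample
```

The for loop makes each step explicit, which is useful while learning; replicate() is simply a more compact way to express the same repetition.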
Make a simulation with small samples (n=5):
num_samples <- 100 # Note that we're running more iterations now.
sample_size <- 5
sample_means <- rep(0, num_samples)
for (i in 1:num_samples) {
  sample <- runif(sample_size)
  sample_mean <- mean(sample)
  sample_means[i] <- sample_mean
}
hist(sample_means, main = "Sampling Distribution with Sample Size 5")
# Alternative: build the title from the value of sample_size
# hist(sample_means, main = paste('Sampling Distribution with Sample Size',
# sample_size))
Make a new simulation with larger samples (n=50):
# Note that when we run this code, we would overwrite all the variables we
# made before if we just used the same names. I added 'B' to the
# sample_means vector since I didn't want to overwrite the original
# sample_means vector. The other variables I don't care about.
num_samples <- 100
sample_size <- 50 # This is the only real change in the code
sample_meansB <- rep(0, num_samples)
for (i in 1:num_samples) {
  sample <- runif(sample_size)
  sample_mean <- mean(sample)
  sample_meansB[i] <- sample_mean
}
hist(sample_meansB, main = "Sampling Distribution with Sample Size 50") # Remember to change the plot title, too, unless it's automatically generated from the value of the variable sample_size (see below)
# hist(sample_meansB, main = paste('Sampling Distribution with Sample Size',
# sample_size))
What if we wanted to display both histograms next to each other, in the same figure?
Use the par() function. You run this before you generate plots to set the parameters for those figures. To tell it to include more than one plot in the same figure, use par(mfrow) to set up a matrix of plots. Set mfrow=c(r,c) where r is the number of rows of plots you want, and c is the number of columns. For example, par(mfrow=c(2,3)) will set up a matrix of 6 plots, in 2 rows of 3 plots each.
See Quick R for details and more examples.
par(mfrow = c(1, 2))
hist(sample_means, main = "N=5") # our first histogram
hist(sample_meansB, main = "N=50") # our second histogram
What if you wanted to overlay a normal curve on these, to see how well the histograms fit a normal distribution?
See UCLA guide for details and more examples.
# Use curve() together with the function dnorm(), which returns the density
# of the normal distribution. Remember the normal dist is defined by a mean
# and SD, so we have to include those. Use the means and SDs from your
# sampling distributions.
meanA <- mean(sample_means)
sdA <- sd(sample_means)
meanB <- mean(sample_meansB)
sdB <- sd(sample_meansB)
par(mfrow = c(1, 2))
hist(sample_means, freq = FALSE, breaks = 20, main = "N=5")
curve(dnorm(x, mean = meanA, sd = sdA), col = "blue", add = TRUE)
hist(sample_meansB, freq = FALSE, breaks = 20, main = "N=50")
curve(dnorm(x, mean = meanB, sd = sdB), col = "blue", add = TRUE)
Question: Do the histograms of the two sampling distributions differ?
Question: If you wanted to systematically compare lots of possible sample sizes, what would you do?
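One possible approach to the last question, offered as a sketch rather than the only right answer, is to wrap the simulation in a function and apply it to a vector of sample sizes. The helper name simulate_sample_means(), the particular sample sizes, and the seed are all assumptions made for illustration.

```r
# Sketch: run the same simulation for several sample sizes
set.seed(1) # arbitrary seed so the run is reproducible

# Hypothetical helper: returns num_samples sample means, each computed from
# sample_size draws from a uniform distribution
simulate_sample_means <- function(sample_size, num_samples = 100) {
  replicate(num_samples, mean(runif(sample_size)))
}

sample_sizes <- c(2, 5, 20, 50) # illustrative choices
results <- lapply(sample_sizes, simulate_sample_means) # one vector per size
names(results) <- paste0("n", sample_sizes)

# Plot all four histograms in a 2 x 2 grid on a shared x-axis
old_par <- par(mfrow = c(2, 2)) # par() returns the previous settings
for (k in seq_along(sample_sizes)) {
  hist(results[[k]], main = paste("N =", sample_sizes[k]),
       xlab = "Sample mean", xlim = c(0, 1))
}
par(old_par) # restore the previous plot layout

# Summarize the spread of each sampling distribution
spread <- sapply(results, sd)
print(spread)
```

Keeping the x-axis limits the same across panels makes it easier to compare the histograms by eye.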