Sampling Distributions for the Mean

The U.S. Department of Education hosts a website that provides data on all undergraduate degree-granting institutions of higher education in the United States. In this exercise, we will work with the population of 559 public universities with recorded information from the year 2013.

In this activity you will be taking repeated samples, recording results, and observing histograms of sample statistics in order to explore their probability distributions.

To complete this activity, copy and paste the code in the boxes below into RStudio on your own machine.

Part A: Start Small

Task 1

Read in the dataset and construct a density histogram of the average yearly cost of attendance. The code below will also calculate and plot a vertical line at the mean of this variable.

collegecost <- read.csv("http://msu.edu/~fairbour/data/collegecost.csv")
hist(collegecost$cost,
     main = "Public Universities in the United States, 2013",
     xlab = "Average yearly cost of attendance in $",
     ylab = "Density",
     prob = TRUE)
abline(v = mean(collegecost$cost), col = "red", lwd = 3)
mean(collegecost$cost)

In a few words, describe the data distribution. What is its shape? Where is it centered? How wide is it?
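The three questions above (shape, center, width) can also be checked with a few numbers. The sketch below is only an illustration: `describe_distribution()` is a hypothetical helper that is not part of the original activity, and `fake_cost` is simulated stand-in data so that the block runs without downloading anything. Once the dataset is loaded, `collegecost$cost` could be passed in instead.

```r
# Hypothetical helper (not part of the original activity): numeric summaries
# that go with a visual description of center and width.
describe_distribution <- function(v) {
  c(mean   = mean(v),    # center, sensitive to skew and outliers
    median = median(v),  # center, resistant to skew and outliers
    sd     = sd(v),      # width: standard deviation
    iqr    = IQR(v),     # width: spread of the middle 50% of values
    min    = min(v),
    max    = max(v))
}

# Simulated stand-in for collegecost$cost (assumption: NOT the real data).
set.seed(1)
fake_cost <- rnorm(559, mean = 19000, sd = 4000)

cost_summary <- describe_distribution(fake_cost)
print(round(cost_summary))
```

A mean that sits well above the median is a hint of right skew, and a mean well below the median is a hint of left skew.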

Task 2

Take samples of size n = 5

Now, take a simple random sample of 5 of these universities and store the results in a vector called x. Calculate the mean for the sample.

x <- sample(collegecost$cost, 5) #take a sample
x #look at your sample results
mean(x) #Record this sample mean on your worksheet

Run the code above 4 more times and make note of the mean of each sample. Calculate the average of your results as instructed and answer the questions on the worksheet.
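Each run of the code above gives a different sample because `sample()` draws at random. By default it draws without replacement, so no university can appear twice in one sample. The sketch below illustrates both points on simulated stand-in data (`fake_cost` is an assumption, not the real dataset). It also shows `set.seed()`, which is optional for this activity and only useful if you want to reproduce a particular sample later.

```r
# Simulated stand-in for collegecost$cost (assumption: NOT the real data).
set.seed(10)
fake_cost <- rnorm(559, mean = 19000, sd = 4000)

# Sampling row positions makes the "without replacement" default visible:
# the 5 positions drawn are always distinct.
picked <- sample(seq_along(fake_cost), 5)
print(picked)
print(anyDuplicated(picked))  # 0 means no position was drawn twice

# Resetting the seed to the same value reproduces the same sample.
set.seed(123)
x_first <- sample(fake_cost, 5)
set.seed(123)
x_again <- sample(fake_cost, 5)
print(identical(x_first, x_again))
```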

Task 3

Increase the sample size to n = 20

Let’s repeat this for a larger sample size. Make note of the mean of each sample.

x <- sample(collegecost$cost, 20) #Take a sample
mean(x) #Record this sample mean on your worksheet.

Run the code above 4 more times and make note of the mean of each sample on your worksheet. Calculate the average of your results as instructed and answer the questions on the worksheet.

Task 4

Increase the sample size to n = 50

Repeat once more for an even larger sample size. Again, make a note of the mean of each sample.

x <- sample(collegecost$cost, 50) #Take a sample.
mean(x) #Record this sample mean on your worksheet.

Run the code above 4 more times and make note of the mean of each sample on your worksheet. Calculate the average of your results as instructed and answer the questions on the worksheet.

Part B: Go big or go home

Task 5

You just took 5 samples for each sample size, n = 5, n = 20, and n = 50. Now you’re going to have R take 500 samples of n = 5 universities and record the mean of each sample in a vector. Copy and paste the code below into an R script file and run it on your own machine. Highlight and run the code several times to see what changes as the samples change.

n <- 5 #Specify the sample size, n
#Create a vector to store the sample means and draw the samples
xbar = rep(0,500)
for(i in 1:500) {xbar[i] = mean(sample(collegecost$cost, n))}

#Calculate the mean of the 500 sample means and compare it to 
#the population mean
mean(xbar) #the mean of the sample means
mean(collegecost$cost) #the population mean

#Create a histogram of the 500 sample means 
#with a line at the population mean
hist(xbar, 
     prob = TRUE, 
     breaks=12, 
     xlim = c(5000, 35000),
     main = "Sample Means",
     xlab = "Mean")
legend("topright", legend = paste("n =", n))
abline(v = mean(collegecost$cost), col = "red", lwd = 2)

Compare the sample means histogram to the data histogram. What do they have in common? How are they different?
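The `for` loop in Task 5 can also be written with `replicate()`, which evaluates an expression repeatedly and collects the results. This is only an alternative sketch, not a replacement for the code above, and `fake_cost` is simulated stand-in data (an assumption, not the real dataset) so that the block runs on its own.

```r
# Simulated stand-in for collegecost$cost (assumption: NOT the real data).
set.seed(2013)
fake_cost <- rnorm(559, mean = 19000, sd = 4000)

n <- 5  # sample size

# replicate() evaluates the expression 500 times and returns the
# 500 sample means as a numeric vector, like xbar in the loop version.
xbar <- replicate(500, mean(sample(fake_cost, n)))

print(length(xbar))     # number of sample means stored
print(mean(xbar))       # the mean of the sample means
print(mean(fake_cost))  # the population mean
```

Both versions produce a vector of 500 sample means, so the histogram code from Task 5 works unchanged with either one.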

Task 6

Change the sample size to n = 20 and repeat.

Copy and paste the code below into an R script file and run it on your own machine.

n <- 20
#Create a vector to store the sample means and draw the samples
xbar = rep(0,500)
for(i in 1:500) {xbar[i] = mean(sample(collegecost$cost, n))}

#Calculate the mean of the 500 sample means and compare it to 
#the population mean
mean(xbar) #the mean of the sample means
mean(collegecost$cost) #the population mean

#Create a histogram of the 500 sample means 
#with a line at the population mean
hist(xbar, 
     prob = TRUE, 
     breaks = 12, 
     xlim = c(5000, 35000),
     main = "Sample Means",
     xlab = "Mean")
legend("topright", legend = paste("n =", n))
abline(v = mean(collegecost$cost), col = "red", lwd = 2)

Compare this sample means histogram to the data histogram and to your previous histogram. What do they have in common? How are they different?

Task 7

Now change the sample size to n = 50 and repeat.

Copy and paste the code below into an R script file and run it on your own machine.

n <- 50
#Create a vector to store the sample means and draw the samples
xbar = rep(0,500)
for(i in 1:500) {xbar[i] = mean(sample(collegecost$cost, n))}

#Calculate the mean of the 500 sample means and compare it to 
#the population mean
mean(xbar) #the mean of the sample means
mean(collegecost$cost) #the population mean

#Create a histogram of the 500 sample means 
#with a line at the population mean
hist(xbar, 
     prob = TRUE, 
     breaks = 12, 
     xlim = c(5000, 35000),
     main = "Sample Means",
     xlab = "Mean")
legend("topright", legend = paste("n =", n))
abline(v = mean(collegecost$cost), col = "red", lwd = 2)

Compare this sample means histogram to the data histogram and to your previous histograms. What do they have in common? How are they different?
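One way to put a number on how the histograms narrow as n grows is to compare the standard deviation of the simulated sample means with the value predicted by theory. For a simple random sample of size n drawn without replacement from a population of size N, the standard deviation of the sample mean is (s / sqrt(n)) * sqrt(1 - n / N), where s is the population standard deviation as computed by R's `sd()`. The sketch below checks this on simulated stand-in data. `fake_cost` and `theoretical_se()` are hypothetical names, and 2000 repetitions are used instead of 500 only to make the comparison steadier.

```r
# Simulated stand-in for collegecost$cost (assumption: NOT the real data).
set.seed(42)
fake_cost <- rnorm(559, mean = 19000, sd = 4000)

# Hypothetical helper: theoretical standard deviation of the sample mean
# when sampling n values without replacement from the finite population pop.
theoretical_se <- function(pop, n) {
  sd(pop) / sqrt(n) * sqrt(1 - n / length(pop))
}

sizes <- c(5, 20, 50)

# Standard deviation of 2000 simulated sample means, for each sample size.
simulated <- sapply(sizes, function(n) {
  sd(replicate(2000, mean(sample(fake_cost, n))))
})

# Value predicted by the formula, for each sample size.
predicted <- sapply(sizes, function(n) theoretical_se(fake_cost, n))

print(data.frame(n = sizes,
                 simulated = round(simulated),
                 predicted = round(predicted)))
```

The two columns should be close to each other, and both should shrink as n increases, which matches the narrowing you see in the histograms.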

Part C: Repeat with a different variable

College Student data

Another variable in the college data from 2013 is the number of undergraduate degree-seeking students. This variable is coded in the collegecost data set as students. Use the code below to construct a histogram for this variable.

Task 8

Copy and paste the code below into an R script file and run it on your own machine.

hist(collegecost$students,
     main = "Public Universities in the United States, 2013",
     xlab = "Number of undergraduate degree-seeking students",
     ylab = "Density",
     prob = TRUE)
abline(v = mean(collegecost$students), col = "red", lwd = 3)
mean(collegecost$students)

In a few words, describe this data distribution. What is its shape? Where is it centered? How wide is it?

Task 9

We’ll skip the “start small” stage for this variable and jump to taking hundreds of samples at once. You’re going to again have R take 500 samples of n = 5 universities and record the mean of each sample in a vector. Copy and paste the code below into an R script file and run it on your own machine. Highlight and run the code several times to see what changes as the samples change.

n <- 5 #Specify the sample size, n
#Create a vector to store the sample means and draw the samples
xbar = rep(0,500)
for(i in 1:500) {xbar[i] = mean(sample(collegecost$students, n))}

#Calculate the mean of the 500 sample means and compare it to 
#the population mean
mean(xbar) #the mean of the sample means
mean(collegecost$students) #the population mean

#Create a histogram of the 500 sample means 
#with a line at the population mean
hist(xbar, 
     prob = TRUE, 
     breaks = 12, 
     xlim = c(0, 50000),
     main = "Sample Means",
     xlab = "Mean")
legend("topright", legend = paste("n =", n))
abline(v = mean(collegecost$students), col = "red", lwd = 2)

Compare the sample means histogram to the data histogram. What do they have in common? How are they different?

Task 10

Let’s change the sample size to n = 20 and repeat.

Use the code below to have R take 500 samples of n = 20 universities and record the mean of each sample.

n <- 20
#Create a vector to store the sample means and draw the samples
xbar = rep(0,500)
for(i in 1:500) {xbar[i] = mean(sample(collegecost$students, n))}

#Calculate the mean of the 500 sample means and compare it to 
#the population mean
mean(xbar) #the mean of the sample means
mean(collegecost$students) #the population mean

#Create a histogram of the 500 sample means 
#with a line at the population mean
hist(xbar, 
     prob = TRUE, 
     breaks = 12, 
     xlim = c(0, 50000),
     main = "Sample Means",
     xlab = "Mean")
legend("topright", legend = paste("n =", n))
abline(v = mean(collegecost$students), col = "red", lwd = 2)

Compare this sample means histogram to the data histogram and to your previous histogram. What do they have in common? How are they different?

Task 11

Now change the sample size to n = 50 and repeat.

Use the code below to have R take 500 samples of n = 50 universities and record the mean of each sample.

n <- 50
#Create a vector to store the sample means and draw the samples
xbar = rep(0,500)
for(i in 1:500) {xbar[i] = mean(sample(collegecost$students, n))}

#Calculate the mean of the 500 sample means and compare it to 
#the population mean
mean(xbar) #the mean of the sample means
mean(collegecost$students) #the population mean

#Create a histogram of the 500 sample means 
#with a line at the population mean
hist(xbar, 
     prob = TRUE, 
     breaks = 12, 
     xlim = c(0, 50000),
     main = "Sample Means",
     xlab = "Mean")
legend("topright", legend = paste("n =", n))
abline(v = mean(collegecost$students), col = "red", lwd = 2)

Compare this sample means histogram to the data histogram and to your previous histograms. What do they have in common? How are they different?
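If the data histogram has a long right tail, a skewness coefficient is one way to quantify how much of that asymmetry remains in the sample means as n grows. The sketch below is an illustration on simulated data only. `fake_students` is a right-skewed stand-in (an assumption, not the real students variable), and `skewness()` is a hypothetical helper defined here because base R does not provide one.

```r
# Hypothetical helper: moment-based skewness coefficient
# (0 for perfectly symmetric data, positive for a long right tail).
skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

# Simulated right-skewed stand-in for collegecost$students
# (assumption: NOT the real data).
set.seed(7)
fake_students <- rlnorm(559, meanlog = 9, sdlog = 0.8)

sizes <- c(5, 20, 50)

# Skewness of 500 simulated sample means, for each sample size.
skew_of_means <- sapply(sizes, function(n) {
  skewness(replicate(500, mean(sample(fake_students, n))))
})

print(skewness(fake_students))  # skewness of the data
print(setNames(round(skew_of_means, 2), paste("n =", sizes)))
```

Values closer to 0 indicate a more symmetric histogram, so comparing the data's skewness with the values for the sample means gives a numeric companion to the visual comparison.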

Part D: A Biased Point Estimate

We can use a similar process to look at the sampling distribution for the sample median. Let’s see if the sample median is a good estimator for the population mean.

Task 12: Cost

Copy and paste the code below to create 500 sample medians for the cost variable, using a sample size of 50. R will draw a red line to indicate the population mean, and a blue line to indicate the average sample median.

n <- 50
#Create a vector to store the sample medians and draw the samples
sampmed = rep(0,500)
for(i in 1:500) {sampmed[i] = median(sample(collegecost$cost, n))}

#Calculate the mean of the 500 sample medians and compare it to 
#the population mean
mean(sampmed) #the mean of the sample medians
mean(collegecost$cost) #the population mean

#Create a histogram of the 500 sample medians
#with lines at the population mean (red) and the average sample median (blue)
hist(sampmed, 
     prob = TRUE, 
     breaks = 12, 
     xlim = c(16000, 22000),
     main = "Sample Medians",
     xlab = "Sample median yearly cost")
legend("topright", legend = paste("n =", n))
abline(v = mean(collegecost$cost), col = "red", lwd = 2)
abline(v = mean(sampmed), col = "blue", lwd = 2)

Run the above code repeatedly. Does the average sample median line up with the population mean?

Task 13: Students

In the code below, we’ll create 500 sample medians for the students variable, using a sample size of 50. R will draw a red line to indicate the population mean, and a blue line to indicate the average sample median.

n <- 50
#Create a vector to store the sample medians and draw the samples
sampmed = rep(0,500)
for(i in 1:500) {sampmed[i] = median(sample(collegecost$students, n))}

#Calculate the mean of the 500 sample medians and compare it to 
#the population mean
mean(sampmed) #the mean of the sample medians
mean(collegecost$students) #the population mean

#Create a histogram of the 500 sample medians
#with lines at the population mean (red) and the average sample median (blue)
hist(sampmed, 
     prob = TRUE, 
     breaks = 12, 
     xlim = c(0, 15000),
     main = "Sample Medians",
     xlab = "Sample median number of students")
legend("topright", legend = paste("n =", n))
abline(v = mean(collegecost$students), col = "red", lwd = 2)
abline(v = mean(sampmed), col = "blue", lwd = 2)

Run the above code repeatedly. Does the average sample median line up with the population mean? Is it better or worse when the original data is skewed?
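The gap between the red and blue lines can be summarized as an estimated bias: the average of the sample medians minus the population mean. The sketch below is a hedged illustration on simulated data. `estimate_bias()`, `fake_symmetric`, and `fake_skewed` are hypothetical names, and neither population is the real dataset.

```r
# Hypothetical helper: estimated bias of the sample median as an
# estimator of the population mean, based on reps simulated samples.
estimate_bias <- function(pop, n, reps = 500) {
  sampmed <- replicate(reps, median(sample(pop, n)))
  mean(sampmed) - mean(pop)
}

# Two simulated stand-in populations (assumption: NOT the real data):
# one roughly symmetric, one with a long right tail.
set.seed(99)
fake_symmetric <- rnorm(559, mean = 19000, sd = 4000)
fake_skewed    <- rlnorm(559, meanlog = 9, sdlog = 0.8)

bias_symmetric <- estimate_bias(fake_symmetric, n = 50)
bias_skewed    <- estimate_bias(fake_skewed, n = 50)

print(c(symmetric = bias_symmetric, skewed = bias_skewed))
```

A bias near 0 means the sample median tends to land on the population mean. A clearly negative bias means the sample median tends to fall below the population mean, which is what a long right tail produces.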