Module 4, Part 2: Sampling Distributions

Sampling Distributions for the Mean

As mentioned previously, each time you open a new RStudio session, you need to run the following three commands.

require(mosaic)
require(openintro)
require(MASS)

The U.S. Department of Education hosts a website that publishes data on all undergraduate degree-granting institutions of higher education in the United States. In this exercise, we will work with a sample of 506 public universities and their recorded information from the year 2013. For the simulations below, we treat these 506 universities as the population.

First, we read in the dataset and construct a density histogram of the average yearly cost of attendance. We will also calculate and plot a vertical line at the mean of this variable.

collegecost <- read.csv("http://www.math.usu.edu/cfairbourn/Stat2300/RStudioFiles/data/collegecost.csv")

hist(collegecost$yearly,
     main = "Public Universities in the United States, 2013",
     xlab = "Average yearly cost of attendance in $",
     prob = TRUE)
abline(v = mean(collegecost$yearly), col = "red", lwd = 3)

mean(collegecost$yearly)

## [1] 19719.76

Let’s take a simple random sample of 5 of these universities and store the results in a vector called x. We will also calculate the mean for the sample.

x <- sample(collegecost$yearly, 5)
#look at the numbers in this sample and calculate the mean
x #data for this sample

## [1] 23905 19380 15058 18042 22332

mean(x) #this sample mean

## [1] 19743.4
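Because sample() draws at random, your five values and their mean will differ from the output shown above. If you want a draw you can reproduce later, one option is to set the random seed first. The sketch below is only an illustration: yearly is a simulated stand-in for collegecost$yearly (hypothetical values, not the real data), and the seed values are arbitrary.

```r
# Stand-in population: simulated values (hypothetical, NOT the real data).
# With the dataset loaded, replace these two lines with:
#   yearly <- collegecost$yearly
set.seed(2300)
yearly <- rnorm(506, mean = 19700, sd = 4500)

set.seed(42)                      # 42 is an arbitrary seed
first_draw <- sample(yearly, 5)

set.seed(42)                      # resetting the seed replays the same draw
second_draw <- sample(yearly, 5)

identical(first_draw, second_draw)  # TRUE: same seed, same sample
```

Without the set.seed() calls, the two draws would almost certainly contain different universities.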

Run the code below 3 more times and make a note of the mean of each sample.

x <- sample(collegecost$yearly, 5)
x
mean(x)
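If you would rather collect the three repeated sample means in one step instead of re-running the code by hand, replicate() can do the repetition. This is a minimal sketch, assuming a simulated stand-in vector yearly (hypothetical values, not the real data) in place of collegecost$yearly.

```r
# Stand-in population: simulated values (hypothetical, NOT the real data).
# With the dataset loaded, use: yearly <- collegecost$yearly
set.seed(2300)
yearly <- rnorm(506, mean = 19700, sd = 4500)

# Draw 3 independent samples of size 5 and keep only each sample's mean.
three_means <- replicate(3, mean(sample(yearly, 5)))

three_means          # the three sample means
range(three_means)   # smallest and largest of the three
```

Looking at range(three_means) gives a quick sense of how much the mean moves from one small sample to the next.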

Increase the sample size to n = 20

Let’s repeat this for a larger sample size. Make note of the mean of each sample.

x <- sample(collegecost$yearly, 20)
x #data for this sample

##  [1] 25799 19048 15242 22483 22936 19511 13713 19611 32715 25931 17749
## [12] 16227 21966 19917 16877 17839 22796 24997 24835  7715

mean(x) #this sample mean

## [1] 20395.35

Run the code below 3 more times and make a note of the mean of each sample.

x <- sample(collegecost$yearly, 20)
x
mean(x)

Increase the sample size to n = 50

Repeat once more for an even larger sample size. Again, make a note of the mean of each sample.

x <- sample(collegecost$yearly, 50)
x #data for this sample

##  [1] 16427 25086 24696 13204 15112 12946 19462 19843 21432 15611 23707
## [12] 19415 21160 22267 20449 18724 21004 21601 15272 16345 28011 18227
## [23] 29787 15066 17767 19851 17350 15021  7715 13308 18406 21222 22420
## [34] 19192 22909 20252 19058 31474 20372 17959 30858 17553 15753 15935
## [45] 15218 13531 25513 16878 15002 15531

mean(x) #this sample mean

## [1] 19218.04

Run the code below 3 more times and make a note of the mean of each sample.

x <- sample(collegecost$yearly, 50)
x
mean(x)

Now we’re going to have R take 500 samples of n = 5 universities and record the mean of each sample.

Specify the sample size, n. Then create a vector to store the sample means and draw the samples.

n <- 5
xbar <- rep(0, 500) #vector to hold the 500 sample means
for (i in 1:500) {
  xbar[i] <- mean(sample(collegecost$yearly, n))
}
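The loop above fills a pre-allocated vector one sample mean at a time. An equivalent way to write the same simulation is with replicate(), which builds the vector for you. The sketch below is an alternative, not the handout's method, and it assumes a simulated stand-in vector yearly (hypothetical values, not the real data) so that it runs without the dataset.

```r
# Stand-in population: simulated values (hypothetical, NOT the real data).
# With the dataset loaded, use: yearly <- collegecost$yearly
set.seed(2300)
yearly <- rnorm(506, mean = 19700, sd = 4500)

n <- 5

# Repeat "draw n values and take their mean" 500 times.
xbar <- replicate(500, mean(sample(yearly, n)))

length(xbar)   # 500 sample means
mean(xbar)     # the mean of the sample means
mean(yearly)   # the mean of the stand-in population
```

Both versions produce a vector of 500 sample means; the choice between them is mostly a matter of style.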

Calculate the mean of the 500 sample means and compare it to the population mean, then create a histogram of the 500 sample means with a line at the population mean.

mean(xbar) #the mean of the sample means

## [1] 19767.86

mean(collegecost$yearly) #the population mean

## [1] 19719.76

hist(xbar, 
     prob = TRUE, 
     breaks = 12, 
     xlim = c(5000, 35000),
     main = "Sample Means",
     xlab = "Mean")
legend("topright", legend = paste("n =", n))
abline(v = mean(collegecost$yearly), col = "red", lwd = 2)

Compare the sample means histogram to the data histogram. What do they have in common? How are they different?
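To back up the visual comparison with numbers, you can compute the center and spread of both the data and the sample means. The sketch below is one way to do that, assuming a simulated stand-in vector yearly (hypothetical values, not the real data). The final line uses sd divided by the square root of n, a standard approximation for the spread of sample means that this handout does not state; treat it as an outside assumption included only for comparison.

```r
# Stand-in population: simulated values (hypothetical, NOT the real data).
# With the dataset loaded, use: yearly <- collegecost$yearly
set.seed(2300)
yearly <- rnorm(506, mean = 19700, sd = 4500)

n <- 5
xbar <- replicate(500, mean(sample(yearly, n)))

# Center: the data and the sample means
c(data = mean(yearly), sample_means = mean(xbar))

# Spread: the data and the sample means
c(data = sd(yearly), sample_means = sd(xbar))

# Approximate spread of sample means predicted by sd / sqrt(n)
sd(yearly) / sqrt(n)
```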

Let’s change the sample size to n = 20 and repeat.

n <- 20
xbar <- rep(0, 500) #vector to hold the 500 sample means
for (i in 1:500) {
  xbar[i] <- mean(sample(collegecost$yearly, n))
}

mean(xbar) #the mean of the sample means

## [1] 19832.4

mean(collegecost$yearly) #the population mean

## [1] 19719.76

hist(xbar, 
     prob = TRUE, 
     breaks = 12, 
     xlim = c(5000, 35000),
     main = "Sample Means",
     xlab = "Mean")
legend("topright", legend = paste("n =", n))
abline(v = mean(collegecost$yearly), col = "red", lwd = 2)

Finally, let’s change the sample size to n = 50 and repeat one last time.

n <- 50
xbar <- rep(0, 500) #vector to hold the 500 sample means
for (i in 1:500) {
  xbar[i] <- mean(sample(collegecost$yearly, n))
}

mean(xbar) #the mean of the sample means

## [1] 19689.06

mean(collegecost$yearly) #the population mean

## [1] 19719.76

hist(xbar, 
     prob = TRUE, 
     breaks = 12, 
     xlim = c(5000, 35000),
     main = "Sample Means",
     xlab = "Mean")
legend("topright", legend = paste("n =", n))
abline(v = mean(collegecost$yearly), col = "red", lwd = 2)
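The three simulations can also be compared side by side by computing the standard deviation of the 500 sample means for each sample size. This is a minimal sketch, assuming a simulated stand-in vector yearly (hypothetical values, not the real data); the names sizes and spread are introduced here for illustration only.

```r
# Stand-in population: simulated values (hypothetical, NOT the real data).
# With the dataset loaded, use: yearly <- collegecost$yearly
set.seed(2300)
yearly <- rnorm(506, mean = 19700, sd = 4500)

sizes <- c(5, 20, 50)

# For each sample size, simulate 500 sample means and measure their spread.
spread <- sapply(sizes, function(n) {
  sd(replicate(500, mean(sample(yearly, n))))
})
names(spread) <- paste("n =", sizes)

spread   # standard deviation of the sample means for each n
```

Reading the values from left to right shows how the spread of the sample means changes as n increases, which is the same pattern visible in the widths of the three histograms.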

USU STAT 2300 Module 4.2

Camille Fairbourn
