As mentioned previously, each time you open a new RStudio session, you need to run the following three commands.
require(mosaic)
require(openintro)
require(MASS)
The U.S. Department of Education hosts a website that publishes data on all undergraduate degree-granting institutions of higher education in the United States. In this exercise, we will work with a sample of 506 public universities and the information recorded for them in 2013. For the sampling experiments below, we treat these 506 universities as the population.
First, we read in the dataset and construct a density histogram of the average yearly cost of attendance. We will also calculate the mean of this variable and mark it on the plot with a vertical line.
collegecost <- read.csv("http://www.math.usu.edu/cfairbourn/Stat2300/RStudioFiles/data/collegecost.csv")
hist(collegecost$yearly,
main = "Public Universities in the United States, 2013",
xlab = "Average yearly cost of attendance in $",
prob = TRUE)
abline(v = mean(collegecost$yearly), col = "red", lwd = 3)
mean(collegecost$yearly)
## [1] 19719.76
Let’s take a simple random sample of 5 of these universities and store the results in a vector called x. We will also calculate the mean for the sample.
x <- sample(collegecost$yearly, 5)
#look at the numbers in this sample and calculate the mean
x #data for this sample
## [1] 23905 19380 15058 18042 22332
mean(x) #this sample mean
## [1] 19743.4
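Because sample() draws at random, your numbers will not match the ones printed above, and they will change each time you rerun the code. If you want a draw you can reproduce, one option is to call set.seed() first. The sketch below is only an illustration: the seed 2300 is an arbitrary choice, and the vector pop is a hypothetical name that falls back to simulated stand-in values when collegecost is not loaded, so that the code runs on its own.

```r
# Use the lab data if it is loaded; otherwise use a simulated stand-in
# (hypothetical values, only so this sketch runs by itself).
pop <- if (exists("collegecost")) collegecost$yearly else round(rnorm(506, mean = 20000, sd = 4500))

set.seed(2300)          # arbitrary seed, chosen for illustration
x1 <- sample(pop, 5)

set.seed(2300)          # resetting the same seed ...
x2 <- sample(pop, 5)    # ... reproduces the same sample

identical(x1, x2)       # TRUE
```

For the rest of this exercise, leave the seed unset so that each run gives a different sample.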
Run the code below 3 more times and make a note of the mean of each sample.
x <- sample(collegecost$yearly, 5)
x
mean(x)
Increase the sample size to n = 20
Let’s repeat this for a larger sample size. Make a note of the mean of each sample.
x <- sample(collegecost$yearly, 20)
x #data for this sample
## [1] 25799 19048 15242 22483 22936 19511 13713 19611 32715 25931 17749
## [12] 16227 21966 19917 16877 17839 22796 24997 24835 7715
mean(x) #this sample mean
## [1] 20395.35
Run the code below 3 more times and make a note of the mean of each sample.
x <- sample(collegecost$yearly, 20)
x
mean(x)
Increase the sample size to n = 50
Repeat once more for an even larger sample size. Again, make a note of the mean of each sample.
x <- sample(collegecost$yearly, 50)
x #data for this sample
## [1] 16427 25086 24696 13204 15112 12946 19462 19843 21432 15611 23707
## [12] 19415 21160 22267 20449 18724 21004 21601 15272 16345 28011 18227
## [23] 29787 15066 17767 19851 17350 15021 7715 13308 18406 21222 22420
## [34] 19192 22909 20252 19058 31474 20372 17959 30858 17553 15753 15935
## [45] 15218 13531 25513 16878 15002 15531
mean(x) #this sample mean
## [1] 19218.04
Run the code below 3 more times and make a note of the mean of each sample.
x <- sample(collegecost$yearly, 50)
x
mean(x)
Now we’re going to have R take 500 samples of n = 5 universities and record the mean of each sample.
Specify the sample size, n. Then create a vector to store the sample means and draw the samples.
n <- 5
xbar <- rep(0, 500) #storage for the 500 sample means
for (i in 1:500) { xbar[i] <- mean(sample(collegecost$yearly, n)) }
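The loop above fills the vector one sample mean at a time. As an alternative sketch (not the method this lab uses), base R's replicate() can express the same idea in one line. The name pop is hypothetical: it stands for collegecost$yearly and falls back to simulated stand-in values when the dataset is not loaded.

```r
# Use the lab data if it is loaded; otherwise use a simulated stand-in
# (hypothetical values, only so this sketch runs by itself).
pop <- if (exists("collegecost")) collegecost$yearly else round(rnorm(506, mean = 20000, sd = 4500))

n <- 5

# Evaluate mean(sample(pop, n)) 500 times and collect the results in a vector
xbar <- replicate(500, mean(sample(pop, n)))

length(xbar)    # 500
```

Both versions draw 500 independent samples of size n, so either can be used for the steps that follow.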
Calculate the mean of the 500 sample means and compare it to the population mean, then create a histogram of the 500 sample means with a line at the population mean.
mean(xbar) #the mean of the sample means
## [1] 19767.86
mean(collegecost$yearly) #the population mean
## [1] 19719.76
hist(xbar,
prob = TRUE,
breaks = 12,
xlim = c(5000, 35000),
main = "Sample Means",
xlab = "Mean")
legend("topright", legend = paste("n =", n))
abline(v = mean(collegecost$yearly), col = "red", lwd = 2)
Compare the sample means histogram to the data histogram. What do they have in common? How are they different?
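To make this comparison easier, it may help to draw both histograms side by side on the same horizontal scale. The following is a minimal sketch, assuming base graphics and a hypothetical vector pop that stands for collegecost$yearly (with simulated stand-in values when the dataset is not loaded). The plot titles are also illustrative choices.

```r
# Use the lab data if it is loaded; otherwise use a simulated stand-in
# (hypothetical values, only so this sketch runs by itself).
pop <- if (exists("collegecost")) collegecost$yearly else round(rnorm(506, mean = 20000, sd = 4500))

# Draw 500 samples of size n and store each sample mean
n <- 5
xbar <- rep(0, 500)
for (i in 1:500) { xbar[i] <- mean(sample(pop, n)) }

old_par <- par(mfrow = c(1, 2))    # two plots in one row

# Left: the individual yearly costs
hist(pop, prob = TRUE, xlim = range(pop),
     main = "Individual universities", xlab = "Yearly cost in $")
abline(v = mean(pop), col = "red", lwd = 2)

# Right: the 500 sample means, drawn on the same x-axis range
hist(xbar, prob = TRUE, xlim = range(pop),
     main = paste("Sample means, n =", n), xlab = "Mean")
abline(v = mean(pop), col = "red", lwd = 2)

par(old_par)                       # restore the previous plot layout
```

Using the same xlim in both panels keeps the two distributions directly comparable in center and spread.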
Let’s change the sample size to n = 20 and repeat.
n <- 20
xbar <- rep(0, 500)
for (i in 1:500) { xbar[i] <- mean(sample(collegecost$yearly, n)) }
mean(xbar) #the mean of the sample means
## [1] 19832.4
mean(collegecost$yearly) #the population mean
## [1] 19719.76
hist(xbar,
prob = TRUE,
breaks = 12,
xlim = c(5000, 35000),
main = "Sample Means",
xlab = "Mean")
legend("topright", legend = paste("n =", n))
abline(v = mean(collegecost$yearly), col = "red", lwd = 2)
Finally, let’s change the sample size to n = 50 and repeat one last time.
n <- 50
xbar <- rep(0, 500)
for (i in 1:500) { xbar[i] <- mean(sample(collegecost$yearly, n)) }
mean(xbar) #the mean of the sample means
## [1] 19689.06
mean(collegecost$yearly) #the population mean
## [1] 19719.76
hist(xbar,
prob = TRUE,
breaks = 12,
xlim = c(5000, 35000),
main = "Sample Means",
xlab = "Mean")
legend("topright", legend = paste("n =", n))
abline(v = mean(collegecost$yearly), col = "red", lwd = 2)
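To compare the three sample sizes numerically as well as visually, one possible sketch is to repeat the experiment for each n and tabulate the mean and standard deviation of the 500 sample means. The names pop, sizes, and results are hypothetical. As before, pop stands for collegecost$yearly and falls back to simulated stand-in values when the dataset is not loaded.

```r
# Use the lab data if it is loaded; otherwise use a simulated stand-in
# (hypothetical values, only so this sketch runs by itself).
pop <- if (exists("collegecost")) collegecost$yearly else round(rnorm(506, mean = 20000, sd = 4500))

sizes <- c(5, 20, 50)
results <- data.frame(n = sizes,
                      mean_of_means = NA_real_,
                      sd_of_means = NA_real_)

for (j in seq_along(sizes)) {
  # Draw 500 samples of the current size and store each sample mean
  xbar <- rep(0, 500)
  for (i in 1:500) { xbar[i] <- mean(sample(pop, sizes[j])) }
  results$mean_of_means[j] <- mean(xbar)
  results$sd_of_means[j] <- sd(xbar)
}

results
mean(pop)    # the population mean, for comparison
```

You should typically see the mean of the sample means stay close to the population mean for every n, while the standard deviation of the sample means gets smaller as n increases. The exact values change from run to run.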