The U.S. Department of Education hosts a website that publishes data on all undergraduate degree-granting institutions of higher education in the United States. In this exercise, we will work with the population of 559 public universities with recorded information from the year 2013.
In this activity, we will take repeated samples and examine histograms of the sample means in order to explore how the sample mean is distributed.
Below is a histogram of the average yearly cost of attendance. The vertical line marks the mean of this variable, \(\mu =\) 19471.3.
In a few words, describe the data distribution. What is its shape? Where is it centered? How wide is it?
Now, suppose we take a simple random sample of 5 of these universities and store the results. A possible sample could be: 17876, 21256, 19303, 15874, 18724. The mean of this particular sample is 18606.6. How does this compare to the population mean?
We could repeat this many times, collecting multiple samples of size n = 5 and recording the sample mean:
Sample 1: 23052, 22974, 20630, 13026, 14906, Sample mean = 18917.6
Sample 2: 18406, 17576, 13756, 17213, 18166, Sample mean = 17023.4
Sample 3: 19763, 20252, 17508, 19851, 23437, Sample mean = 20162.2
Sample 4: 16898, 22290, 19468, 26086, 22800, Sample mean = 21508.4
Sample 5: 13735, 16397, 22485, 21101, 16547, Sample mean = 18053
etc. Notice how the sample means vary, and approximately how far they are from the population mean.
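As a quick check on the arithmetic, the five samples listed above can be entered by hand and their means compared with the population mean. This is only a minimal sketch: the variable names are illustrative, and the values are copied from the text rather than drawn from the original data set.

```python
from statistics import mean

POPULATION_MEAN = 19471.3  # population mean reported in the text

# The five samples of size n = 5 listed above, typed in by hand.
samples = [
    [23052, 22974, 20630, 13026, 14906],
    [18406, 17576, 13756, 17213, 18166],
    [19763, 20252, 17508, 19851, 23437],
    [16898, 22290, 19468, 26086, 22800],
    [13735, 16397, 22485, 21101, 16547],
]

# One mean per sample, and its signed distance from the population mean.
sample_means = [mean(s) for s in samples]
deviations = [m - POPULATION_MEAN for m in sample_means]

for i, (m, d) in enumerate(zip(sample_means, deviations), start=1):
    print(f"Sample {i}: mean = {m:.1f}, mean - mu = {d:+.1f}")
```

A positive difference means the sample mean landed above the population mean, and a negative one means it landed below.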
Let’s repeat this for a larger sample size. This time we will take repeated samples of size n = 20. Since it would be unwieldy to look at the data in each sample, we will report only the sample mean for each one.
Sample 1 mean = 19919.35
Sample 2 mean = 19073.15
Sample 3 mean = 18882.35
Sample 4 mean = 18467.75
Sample 5 mean = 18945.8
etc. Notice again how the sample means are varying, and approximately how far they are from the population mean.
Let’s repeat this once more for an even larger sample size, n = 50. Again, make a note of how the sample means vary, and approximately how far they are from the population mean.
Sample 1 mean = 19849.94
Sample 2 mean = 20314.2
Sample 3 mean = 18686.08
Sample 4 mean = 18847.7
Sample 5 mean = 19591.42
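One way to put a number on "approximately how far they are from the population mean" is to average the absolute distances between the reported sample means and the population mean, separately for each sample size (n = 5, n = 20, and n = 50 for the three sets of samples in this activity). The sketch below does this with the values reported in the text. The choice of average absolute distance is an assumption made for illustration, and with only five samples per size the comparison is noisy.

```python
POPULATION_MEAN = 19471.3  # population mean reported in the text

# Sample means reported in the text, keyed by sample size n.
reported_means = {
    5: [18917.6, 17023.4, 20162.2, 21508.4, 18053.0],
    20: [19919.35, 19073.15, 18882.35, 18467.75, 18945.8],
    50: [19849.94, 20314.2, 18686.08, 18847.7, 19591.42],
}


def mean_abs_distance(means, center):
    """Average absolute distance of the given means from a center value."""
    return sum(abs(m - center) for m in means) / len(means)


typical_distance = {
    n: mean_abs_distance(means, POPULATION_MEAN)
    for n, means in reported_means.items()
}

for n, d in typical_distance.items():
    print(f"n = {n:>2}: average distance from mu = {d:.2f}")
```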
So far we have taken 5 samples at each sample size: n = 5, n = 20, and n = 50. Now we will have a computer take 500 samples of n = 5 universities, record the mean of each sample in a vector, report the average of those sample means, and draw a histogram of them. The red line on the plot is drawn at the population mean.
The mean of the 500 sample means when n = 5 is 19514.08
The population mean is 19471.3
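The procedure behind these numbers (500 simple random samples of size 5, with one mean recorded per sample) can be sketched in a few lines. The actual cost data are not reproduced in this document, so the code below builds a hypothetical stand-in population of 559 values. Its normal shape and its standard deviation of 4000 are assumptions made purely for illustration, so the output will not match the figures reported above. The function name `simulate_sample_means` is also illustrative.

```python
import random
from statistics import mean, stdev


def simulate_sample_means(population, n, reps, rng):
    """Draw `reps` simple random samples of size `n` (without replacement)
    and return the list of their means."""
    return [mean(rng.sample(population, n)) for _ in range(reps)]


rng = random.Random(2013)  # fixed seed so the run is repeatable

# ASSUMPTION: synthetic stand-in for the 559 universities' yearly costs.
# Only the center (19471.3) comes from the text; the shape and spread do not.
population = [rng.gauss(19471.3, 4000) for _ in range(559)]

means_n5 = simulate_sample_means(population, n=5, reps=500, rng=rng)

print(f"Population mean:             {mean(population):.2f}")
print(f"Mean of 500 sample means:    {mean(means_n5):.2f}")
print(f"SD of population values:     {stdev(population):.2f}")
print(f"SD of the 500 sample means:  {stdev(means_n5):.2f}")
```

Changing `n=5` to `n=20` or `n=50` repeats the experiment at the other sample sizes used in this activity.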
Compare the sample means histogram to the data histogram. What do they have in common? How are they different?
Let’s compare the results above to the situation where n = 20.
The mean of the 500 sample means when n = 20 is 19454.93
The population mean is 19471.3
Compare this sample means histogram to the data histogram and to the previous histogram. What do they have in common? How are they different?
Let’s change the sample size to 50 and see what happens.
The mean of the 500 sample means when n = 50 is 19491.2
The population mean is 19471.3
Compare this sample means histogram to the data histogram and to your previous histograms. What do they have in common? How are they different?
Another variable in the college data from 2013 is the number of undergraduate degree-seeking students.
The data histogram for the number of undergraduate students is shown below. The population mean is 10181.91 and is highlighted with the red vertical line.
What is different about this data distribution? What is its shape? Where is it centered? How wide is it?
We’ll skip the “start small” stage for this variable and jump to taking hundreds of samples at once. We will again have the computer take 500 samples of n = 5 universities, record the mean of each sample in a vector, and then draw a histogram of the sample means.
The mean of the 500 sample means when n = 5 is 10538.84
The population mean is 10181.91
Compare the sample means histogram to the data histogram. What do they have in common? How are they different?
Here are the results for n = 20.
The mean of the 500 sample means when n = 20 is 10266.36
The population mean is 10181.91
Compare this sample means histogram to the data histogram and to the previous histogram. What do they have in common? How are they different?
Here are the results for n = 50.
The mean of the 500 sample means when n = 50 is 10175.33
The population mean is 10181.91
Compare this sample means histogram to the data histogram and to your previous histograms. What do they have in common? How are they different?
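To experiment with this comparison outside the provided plots, the sketch below repeats the 500-sample procedure at n = 5, 20, and 50 and summarizes each set of sample means by its center, spread, and skewness. The enrollment data are not reproduced here, so the population is a hypothetical stand-in: a right-skewed (lognormal) set of 559 values whose theoretical mean is set to 10181.91. Both the skewed shape and the spread parameter are assumptions for illustration only; check the data histogram for the actual shape. The helper names are illustrative as well.

```python
import math
import random
from statistics import mean, pstdev, stdev


def skewness(values):
    """Moment-based skewness: near 0 for symmetric data, positive for a
    long right tail."""
    m = mean(values)
    s = pstdev(values)
    return mean([((v - m) / s) ** 3 for v in values])


def simulate_sample_means(population, n, reps, rng):
    """Means of `reps` simple random samples of size `n`."""
    return [mean(rng.sample(population, n)) for _ in range(reps)]


rng = random.Random(2013)  # fixed seed so the run is repeatable

# ASSUMPTION: synthetic right-skewed stand-in for undergraduate enrollment.
sigma = 0.8                                    # hypothetical spread parameter
mu_log = math.log(10181.91) - sigma ** 2 / 2   # theoretical mean = 10181.91
population = [rng.lognormvariate(mu_log, sigma) for _ in range(559)]
pop_mean = mean(population)

results = {}
for n in (5, 20, 50):
    means = simulate_sample_means(population, n, reps=500, rng=rng)
    results[n] = {
        "mean": mean(means),
        "sd": stdev(means),
        "skew": skewness(means),
    }

print(f"Population: mean = {pop_mean:.2f}, skewness = {skewness(population):.2f}")
for n, r in results.items():
    print(f"n = {n:>2}: mean of means = {r['mean']:.2f}, "
          f"sd = {r['sd']:.2f}, skewness = {r['skew']:.2f}")
```

Comparing the skewness values across the three sample sizes is one numerical way to describe how the shape of the sample means histogram changes as n grows.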