STA 111 Lab 4

Complete all Questions, and submit your final document in HTML or PDF form on Canvas.

The Goal

We’ve been talking about ways that we take samples from a population. We are now going to transition in our course into talking about ways we can use samples to make conclusions about populations. This is important because we are often in a situation where we cannot get information on the entire population but we can get information on a sample. The process of working to make conclusions about a population using the data we have in a sample is called inference.

The first step to learning how to conduct inference is to talk about sampling distributions. That’s what we’re going to work on in our lab today.

The Set Up

Reese’s Pieces are a type of peanut butter candy with a crunchy candy shell. They come in 3 colors - orange, yellow, and brown.

According to the manufacturer, about 50% of Reese’s Pieces are orange. This 50% is a population proportion. This means that if we (1) opened every Reese’s Pieces package in the world, (2) counted the number of orange Reese’s Pieces, and (3) divided that count by the total number of Reese’s Pieces, we should get .50, or 50%.

We often want to know about population parameters. What proportion of people in the world suffer from hunger? What proportion of people in your school believe we need a week-long Fall Break rather than just two days off? What proportion of people in the United States believe we need to devote more resources to combating global warming?

These values are important, but they are also very difficult to obtain. We can’t ask every person in the world, a country, or even a school these questions. This also means that if we are given a population parameter (for example, that 50% of Reese’s Pieces are orange), it is very difficult to verify whether the value we are told is correct.

Typically, this means that instead of a population parameter, we are stuck using a sample statistic that we obtain from a sample from our population of interest. This leaves us with some questions:

Can we use a sample statistic to estimate a population parameter?
If we choose to do so, can we figure out how accurate our estimate is likely to be?

We can also use sample statistics to check to see whether certain values of population parameters are plausible - we’ll do that when we get to hypothesis testing.

Simulation Study: Creating the Population

To answer these questions, we are going to use a simulation study (which just means an experiment using the computer). In a simulation study, we use the computer to create (i.e., simulate) our population of interest. Because we are creating the population, we know what the value of our population parameter is. We then can draw samples from this population and compute a sample statistic. We can then assess how well that sample statistic matches the population parameter.

To do this, go to the link: https://www.rossmanchance.com/applets/2021/oneprop/OneProp.htm?candy=1

You should see something that looks like:

To set up our simulation, we need to make sure the computer knows all the information about our data. We know that:

50% of Reese’s Pieces are orange
36 Reese’s Pieces are in each standard package.

To make sure the computer knows this, look at the upper left side of the screen:

We can see from this that in the population of Reese’s pieces being created by the computer, there is a 50% chance of getting orange. This is what we need, so we don’t need to change that!!

The next step is for us to draw a sample from this population. Specifically, we want to select one standard bag of Reese’s Pieces and count how many are orange. The number of Reese’s Pieces in a standard package is 36, and this does not match what is in the “Number of Candies” box on the Applet, so let’s change it to 36.

And now, we are ready to start!

Drawing an SRS

Great! We have our population. Now what? Well, now we need to see what a sample from this population would look like. As we have discussed, we don’t usually have the population. Instead, what we have is information on a sample. There are many different ways to create samples, and today we will use simple random sampling.

Question 1

What must be true for a sample to be a simple random sample (SRS)?

Let’s start off with a simple random sample of 36 Reese’s Pieces. This means we want to randomly select 1 bag (36 Reese’s Pieces) from our population. To draw a simple random sample of size 36 from the population, make sure your upper right hand panel looks like this, and then click Draw Samples:

You will see an animation start where 36 candies are pulled from the population. They are sorted into orange vs other colors, and the proportion of orange candies is computed.

In my sample (YOURS IS LIKELY DIFFERENT!!! It’s random!!!!), 38.9% of the candies are orange.
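If you are curious what drawing one bag looks like in code, here is a minimal Python sketch. It is not the applet's actual code. It assumes each candy is orange with probability .50, independently of the other candies, and the function name `draw_sample_proportion` and the seed value are made up for illustration.

```python
import random

def draw_sample_proportion(p_orange=0.5, n_candies=36, rng=random):
    """Draw one simulated bag and return the proportion of orange candies."""
    # Assumption: each candy is orange with probability p_orange,
    # independently of every other candy in the bag.
    n_orange = sum(rng.random() < p_orange for _ in range(n_candies))
    return n_orange / n_candies

# The seed is arbitrary; it only makes this particular run repeatable.
rng = random.Random(111)
p_hat = draw_sample_proportion(rng=rng)
print(f"Sample proportion of orange candies: {p_hat:.3f}")
```

Changing the seed (or removing it) gives a different bag, and usually a different sample proportion, just like clicking Draw Samples again.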

Question 2

What proportion of Reese’s Pieces in your sample are orange? This is our sample statistic for this sample.

Question 3

Suppose we use the sample statistic in Question 2 as an estimate of our population parameter. Do we overestimate or underestimate, and by how much?

Question 4

Do we expect our sample statistic and our population parameter to be the same? Explain why or why not.

Now, in our simulation study, we know the population parameter. In reality, this is not normally the case. We have just seen that our sample statistic and population parameter may not be the same value, so this is an issue with trying to use a sample statistic to estimate a population parameter - the sample statistic and population parameter may not agree!

Let’s see what happens if we try a different simple random sample.

A second sample

Okay, so this is what happens when we take one sample of size 36 from the population. What if we had a different sample? Let’s find out.

Grab a second random sample of 36 candies. We use the same process as before, by clicking Draw Samples.

Question 5

What proportion of Reese’s Pieces in this sample are orange? Is this sample proportion the same as what you got in Question 2?

Hmm. With two different simple random samples, we likely got two different values for the proportion of orange Reese’s Pieces. This is another problem with using a sample statistic to estimate our population parameter. We have just seen that the sample proportions we obtain can be different depending on the sample we draw!

All of this means that it is not a good idea to use just one sample statistic to estimate our population parameter. However, we have also discussed that most of the time, a sample is all we have - we can’t get a population or even a second sample. So, what do we do??

Sampling Distributions

It turns out that we can actually estimate how far away a sample statistic is likely to be from the population parameter. This means that instead of stating one value of the parameter of interest, we can build a range of plausible values. For instance, we can say that we expect our population proportion to be between .1 and .15. We can do this using something called a sampling distribution.

We will learn two different ways to build a sampling distribution in this course: using simulation and using probability distributions. For today, we will use our simulation.

It turns out that the sampling distribution is already something your applet has been building as we drew our 2 samples.

Remember, your dots may be in different spots since our samples are different, but there should be 2 dots! Each one represents one of the two sample proportions you have obtained so far: one from Sample 1 and one from Sample 2.

Right now we only have two samples, and that’s not really enough to tell us much. What we want to do is take many, many samples from the population and keep building this graph. This will give us a better idea of the different sample statistics we might expect to get with different samples from the population.

Taking many different samples of the same sample size from the same population, getting a sample proportion from each one, and then looking at all of the different sample proportions we get is called building a sampling distribution.
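The definition above can be sketched in Python as follows. This is only an illustration under stated assumptions, not how the applet is implemented: each candy is treated as orange with probability .50 independently of the others, and the helper name `sample_proportion`, the seed, and the choice of 1000 samples are hypothetical.

```python
import random
from collections import Counter

def sample_proportion(p, n, rng):
    """Proportion of orange candies in one simulated sample of size n."""
    return sum(rng.random() < p for _ in range(n)) / n

rng = random.Random(111)  # arbitrary seed so the run is repeatable

# Same population (p = 0.5), same sample size (36), many samples.
props = [sample_proportion(0.5, 36, rng) for _ in range(1000)]

# Count how often each sample proportion occurred and print a rough
# text version of the dot plot (one '*' per 5 samples).
counts = Counter(props)
for value in sorted(counts):
    print(f"{value:.3f} {'*' * (counts[value] // 5)}")
```

Each printed row plays the role of one column of dots in the applet's graph.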

Building a Sampling Distribution

Right now, we only have two samples. Let’s generate 10 simple random samples from our population. For each of these samples, we will compute the sample proportion of the candies in the sample that are orange.

To do this, adapt your applet to use 10 samples. I also turned off my candy machine at this point, though you are welcome to leave it on!

When you hit Draw Samples, 10 more dots appear on your graph! These are 10 more sample statistics, each from a different sample of candies!

We want many, many samples, not 10, so let’s go ahead and add 100 samples…and then 1000.

Question 6

Describe the sampling distribution you are now seeing. In other words, is the distribution of the sample proportion of orange candies unimodal or multimodal, and is it symmetric, skewed right, or skewed left?

Question 7

Click the button for Summary Statistics right above your sampling distribution. Based on what appears, what is the mean of all the sample statistics? In other words, what is the mean of your sampling distribution?

Question 8

Is the value you get in Question 7 bigger than, smaller than, or roughly equal to the true population parameter of .5?

So, we have a whole bunch of sample proportions - one from each of 1112 samples (2 + 10 + 100 + 1000). Some of these sample proportions are much bigger than .50, and some are much smaller than .50. However, the mean of all of these sample proportions is very similar to .50!! Here we have discovered a very powerful fact - the mean of the sampling distribution (the distribution of sample proportions) is very similar to the population proportion!!

This is a key fact in statistics: If we take enough simple random samples of the same size, the average of the sample proportions will approximately equal the population proportion.
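To see this fact in action outside the applet, here is a short Python sketch that averages the sample proportions from more and more samples. It rests on assumptions: candies are orange with probability .50 independently of one another, and the helper name, the seed, and the numbers of samples tried are chosen only for illustration.

```python
import random
import statistics

def sample_proportion(p, n, rng):
    """Proportion of orange candies in one simulated sample of size n."""
    return sum(rng.random() < p for _ in range(n)) / n

rng = random.Random(111)  # arbitrary seed so the run is repeatable
p, n = 0.5, 36

means = {}
for n_samples in (10, 100, 1000, 10000):
    props = [sample_proportion(p, n, rng) for _ in range(n_samples)]
    means[n_samples] = statistics.mean(props)
    print(f"{n_samples:>6} samples: mean of sample proportions = {means[n_samples]:.3f}")
```

With only a handful of samples the mean can sit noticeably away from .50; with many samples it should settle very close to it.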

Spread

Okay, so the mean of all of the sample proportions is basically the population proportion. However, we started this process with the goal of understanding how far sample statistics tend to be from the true value of .50. To do this, we need to describe the spread of the sampling distribution.

Recall that we already know a term that describes how different individual data points tend to be from the mean of a sample. This is the standard deviation. So, we need to know the standard deviation of the sampling distribution - how far do individual sample statistics tend to differ from the population proportion (the mean of all the sample statistics)? We call the standard deviation of the sampling distribution the standard error (SE).

Question 9

What is the standard error of our sample statistics? In other words, what is the standard deviation of all the sample statistics? Hint: The value should be printed on your sampling distribution by this point in the lab.

Great!! This means that on average, our sample statistics tend to be about this far away from the population proportion of .50. We know that in general, most values tend to be within 2 standard deviations of the mean! This means that if we have a sample proportion \(\hat{p}\), it is plausible to think that the population proportion \(p\) that we want is in the range

\[\hat{p} \pm 2 SE.\]
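As a small illustration of the arithmetic in this range, here is a Python sketch. The function name `plausible_range` is hypothetical, and the input values are made up; they are not results from this lab.

```python
def plausible_range(p_hat, se):
    """Return (lower, upper) for the range p_hat - 2*SE to p_hat + 2*SE."""
    margin = 2 * se
    return p_hat - margin, p_hat + margin

# Hypothetical inputs for illustration only, not values from the lab.
lower, upper = plausible_range(p_hat=0.40, se=0.08)
print(f"Plausible values for p: {lower:.2f} to {upper:.2f}")
```

With these made-up inputs, the range runs from .24 to .56. For your own range, you would plug in your sample proportion and your standard error instead.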

Question 10

Look back at your first sample statistic, the sample proportion from Question 2. Based on this, use your sample statistic to create a range of plausible values for the population proportion of Reese’s Pieces that are orange. State your range.

Question 11

Is \(p = .50\) in your range of plausible values?

This is the key to our ability to use samples to make conclusions about our population!!! We know that it is not a good idea to just report our sample statistic - it is likely larger or smaller than the actual population proportion. However, sampling distributions allow us to build a range of plausible values for our population proportion when all we have is the information from samples.

The Formula

Using simulation is great, and very powerful. However, what if we are not in a situation where we can use simulation? We will learn soon in our books that if we take enough samples, and if certain conditions are met, the standard error (SE) of the sample proportion is:

\[SE = \sqrt{\frac{p(1-p)}{n}},\]

where \(p\) is the true population parameter and \(n\) is the size of each SRS.
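The formula translates directly into code. Below is a minimal Python sketch; the function name `standard_error` is hypothetical, and the example uses made-up values of \(p\) and \(n\) rather than the ones from this lab.

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1 - p) / n)

# Hypothetical values for illustration only: p = 0.30 and samples of size 50.
se = standard_error(p=0.30, n=50)
print(f"SE = {se:.4f}")
```

For these made-up values the standard error comes out to roughly .065.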

Question 12

Based on the formula above, what should the standard error of the sample proportion be? Is this value similar to what you got in Question 9?

Changing the Sample Size

We have answered the two questions we started off wanting to answer! However, let’s try exploring one more.

We have run our simulation using a sample size of 36. What if we increase that to 100? Would that change our sampling distribution at all? Let’s see.

Question 13

Refresh your Applet screen. Change your sample size to 100, and generate 1000 sample statistics. What is the mean of this sampling distribution?

Question 14

What is the standard error of this sampling distribution? Is this larger or smaller than the standard error you got in Question 9?

What you should see is that as our samples get bigger, our standard error gets smaller. In other words, larger samples tend to give us sample proportions that are closer to the population proportion than smaller samples do.
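The same comparison can be sketched in Python. As before, this is an illustration and not the applet's code: it assumes candies are orange with probability .50 independently of one another, and the function name `simulated_se` and the seed are hypothetical. It uses `statistics.stdev`, which divides by one less than the number of samples; the applet may compute its standard deviation slightly differently, but with 1000 samples the difference is tiny.

```python
import random
import statistics

def simulated_se(p, n, n_samples, rng):
    """Standard deviation of n_samples simulated sample proportions of size n."""
    props = [
        sum(rng.random() < p for _ in range(n)) / n
        for _ in range(n_samples)
    ]
    return statistics.stdev(props)

rng = random.Random(111)  # arbitrary seed so the run is repeatable

se_36 = simulated_se(p=0.5, n=36, n_samples=1000, rng=rng)
se_100 = simulated_se(p=0.5, n=100, n_samples=1000, rng=rng)

print(f"Simulated SE with n = 36:  {se_36:.3f}")
print(f"Simulated SE with n = 100: {se_100:.3f}")
```

The second value should come out smaller than the first, matching what you see in the applet.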

Question 15

Why do you think that is?

Wrapping it up

In this lab, we have seen that sampling distributions can help us to build a range of plausible values for a population proportion when all we have is a sample proportion. In the next class, we will formalize this with the construction of confidence intervals.

This lab uses the Rossman Chance Applet found at https://www.rossmanchance.com/applets/2021/oneprop/OneProp.htm?candy=1. The activity was written by Nicole Dalzell. Last updated 2025 July 23.