STA 111 Lab 4

Complete all Questions, and submit final documents in html or PDF form on Canvas.

The Goal

We’ve been talking about ways that we take samples from a population. We are now going to transition in our course into talking about ways we can use samples to make conclusions about populations. This is important because we are often in a situation where we cannot get information on the entire population but we can get information on a sample. The process of working to make conclusions about a population using the data we have in a sample is called inference.

The first step to learning how to conduct inference is to talk about sampling distributions. That’s what we’re going to work on in our lab today.

The Set Up

As always, we need to start with some data. We are going to work with data on M&M’s candy. According to the M&M website, 24% of M&M’s are blue. This 24% is a population proportion. This means that if we (1) opened every M&M’s package in the world, (2) counted the number of blue M&M’s, and (3) divided that count by the total number of M&M’s, we should get .24, or 24%.

We often want to know about population parameters. What proportion of people in the world suffer from hunger? What proportion of people in your school believe we need a week long Fall Break rather than just two days off? What proportion of people in the United States believe we need to devote more resources to combating global warming?

These values are important, but they are also very difficult to obtain. We can’t ask every person in the world, a country, or even a school these questions. This also means that if we are given a population parameter (e.g., 24% of M&M’s are blue), it is very difficult to verify whether the value we are told is correct.

Typically, this means that instead of a population parameter, we are stuck using a sample statistic that we obtain from a sample from our population of interest. This leaves us with some questions:

  1. Can we use a sample statistic to estimate a population parameter?
  2. If we choose to do so, can we figure out how accurate our estimate is likely to be?

Simulation Study: Creating the Population

To answer these questions, we are going to use a simulation study (which just means an experiment using the computer). In a simulation study, we create (i.e., simulate) our population of interest. Because we are creating the population, we know what the value of our population parameter is. We then can draw samples from this population and compute a sample statistic. We can then assess how well that sample statistic matches the population parameter.

Let’s do this for our M&M data. We have been told that 24% of all M&Ms are blue. This means the population in our simulation study should have 24% of M&M’s being blue and the remaining 76% of M&M’s being a color other than blue. We can create a data set using the computer that has this property.

Question 1

Is our population parameter a population mean or a population proportion?

Question 2

There are roughly 2.3 million M&M’s in the world. Based on this, how many M&M’s in the world should be blue? Round your answer to the nearest whole number.

2.3 million is a really large number. To make our work a little easier today, we are going to create a population of 23,000 for our simulation study.

We can use R to create a population of 23,000 M&M’s with 24% of them being blue. The code we are going to use to do this is going to look a little different from what we have seen before, and it is not important that we understand how this code works. If you are interested, though, please feel free to ask!

Copy the code, put it in a chunk, and press play.

# Create our population: Repeat the phrase "not blue" 23000 times
population <- rep("not blue", 23000)
# Set a seed so we all choose the same random positions
set.seed(100)
# Randomly choose 24% of the positions to be blue M&M's
makeBlue <- sample(1:23000, 23000*.24)
# Record these blue M&Ms in our population
population[makeBlue] <- "blue"
# Remove the helper object we no longer need
rm(makeBlue)

Running this code produces a data set called population in your Environment Tab in RStudio. We are going to treat this data set as our population for our simulation study.

To verify that we have created our population correctly, we want to make a quick table of population to make sure that our counts are correct. There are two different kinds of tables we can make in R.

  • Count tables show us the raw counts in the data, in other words how many blue and not blue M&M’s are there in our population.
  • Proportion tables show us the proportion of different values in the population, in other words what proportion of M&M’s in our population are blue or not blue.

This is a count table.

# Count Table
table(population)

This is a proportion table.

# Proportion Table
prop.table(table(population))

Question 3

Based on the tables, have we created our population correctly?

Drawing a Simple Random Sample (SRS)

Great! We have our population. Now what? Well, now we need to see what a sample from this population would look like. As we have discussed, we don’t usually have the population. Instead, what we have is information on a sample. There are many different ways to create samples, and today we will use simple random sampling.

Question 4

What must be true for a sample to be a simple random sample (SRS)?

Let’s start off with a simple random sample of 50 M&M’s. This means we want to randomly select 50 M&M’s from our population of 23,000 M&M’s. To draw a simple random sample of size 50 from population, we can use the sample function:

sample(population, size = 50)

Let’s break this down.

  • The sample command tells R to collect a simple random sample.
  • We then need to give R an object, population, to take a sample from.
  • The size argument tells R the size of the sample that we want.

Running the whole code therefore collects a simple random sample of size 50 from the population!

There is one small problem. We don’t really want the 50 values in the sample to print out on the screen. Instead, we want to store these 50 values in a data set so that we can work with them. To do that, use the following code:

SRS1 <- sample(population, size = 50)

Take a look at your Environment Tab. You should see a data set called SRS1 that contains 50 values.

This is the same code as before, but we have added SRS1 <- at the front. The <- symbol tells R to store the results as a data set rather than printing them out on the screen, and SRS1 is the name we are giving that data set.

Question 5

If I wanted to store the sample under the name SimpleRandomSample1, how would I need to change the code in the chunk above?

Okay, we have our sample. Let’s use it.

Question 6

What proportion of M&M’s in this sample are blue? This is our sample statistic for this sample.
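Hint: one way (not the only way) to find this proportion in R is to use the same proportion table command we used on the population, just applied to our sample:

# Proportion of blue and not blue M&M's in our first sample
prop.table(table(SRS1))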

Question 7

Suppose we use the sample statistic in Question 6 as an estimate of our population parameter. Do we overestimate or underestimate, and by how much?

Question 8

Do we expect our sample statistic and our population parameter to be the same? Explain why or why not.

Now, in our simulation study, we know the population parameter. In reality, this is not normally the case. We have just seen that our sample statistic and population parameter may not be the same value, so this is an issue with trying to use a sample statistic to estimate a population parameter.

Let’s see what happens, though, if we try a different simple random sample.

A second sample

Okay, so this is what happens when we take one sample of size 50 from the population. What if we had a different sample? SRS1 represents only one possible sample we could have drawn from the population. What if we choose a different simple random sample? Would our sample proportion be different? In other words, do sample statistics vary from sample to sample? Let’s find out.

Grab a second random sample of 50 M&Ms. We use the same process as before.

SRS2 <- sample(population, size = 50)

Question 9

What proportion of M&M’s in this sample are blue? Is this sample proportion the same as what you got in Question 6?

Hmm. With two different simple random samples, we likely got two different values for the proportion of blue M&M’s. This is another problem with using a sample statistic to estimate our population parameter. We have just seen that the sample proportions we obtain can be different depending on the sample we draw! This is called sampling variability, meaning that we can get different estimates of the proportion of blue M&M’s due to differences in the data in each sample.

All of this means that it is not a good idea to use just one sample statistic to estimate our population parameter. However, we have also discussed that most of the time, a sample is all we have - we can’t get the whole population or even a second sample. So, what do we do??

Quantifying Sampling Variability

It turns out that we can actually estimate how far away a sample statistic is likely to be from the population parameter. This means that instead of stating one value of the parameter of interest, we can build a range of plausible values. For instance, we can say that we expect our population proportion to be between .1 and .15. We can do this using something called a sampling distribution.

We will learn two different ways to build a sampling distribution in this course: using simulation and using probability distributions. For today, we will use our simulation.

Okay, what do we need to do? Well, we have seen that with two different random samples, we get two different sample statistics. We can also determine how far each of these two sample statistics is away from the population parameter.

Question 10

Look at the sample proportions you got in Question 6 (from SRS1) and Question 9 (from SRS2). For each of these, compute \(p - \hat{p}\), where \(p\) is the population proportion and \(\hat{p}\) is the sample proportion. This tells us how far off each of our sample statistics was from the population parameter.
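If you would like to use R for this arithmetic, here is a minimal sketch. The names phat1 and phat2 are just for illustration; you can also simply plug in the values you found by hand.

# Store the sample proportions of blue M&M's from each sample
phat1 <- prop.table(table(SRS1))["blue"]
phat2 <- prop.table(table(SRS2))["blue"]
# How far off is each sample statistic from the population proportion?
.24 - phat1
.24 - phat2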

It turns out that sample statistics can be greater than, less than, or equal to the population parameter. What we want to do is take many, many samples from the population. This will give us a better idea of the different sample statistics we might expect to get with different samples from the population. Taking many different samples of the same sample size from the same population, getting a sample proportion from each one, and then looking at all of the different sample proportions we get is called building a sampling distribution.

Building a Sampling Distribution

Right now, we only have two samples. Let’s generate 5000 simple random samples from our population. For each of these samples, we will compute the sample proportion of the M&Ms in the sample that are blue.

To do this, we will use the following code. Again, this code is more complicated than what you will need to do in R for this course. Basically, this is (1) drawing a SRS of 50 M&M’s from our population, (2) finding the proportion of M&M’s in the sample that are blue, and (3) recording that sample proportion.

# Create an empty data set to hold 5000 sample proportions
sample_prop50 <- rep(NA, 5000)

# Draw 5000 SRSs of size 50 and record the proportion of blue M&M's in each
for(i in 1:5000){
   samp <- sample(population, size = 50)
   sample_prop50[i] <- sum(samp=="blue")/50
}

# Remove the helper objects we no longer need
rm(i,samp)

The result of the code is a data set called sample_prop50. The first number in this data set represents the proportion of M&M’s in the first simple random sample that were blue.
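If you would rather use code than the Environment Tab, you can also look at the entries of sample_prop50 directly; for example, the first few sample proportions:

# The first six sample proportions
head(sample_prop50)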

Question 11

Look in your Environment Tab. What proportion of M&M’s in the third simple random sample are blue?

Now, we have 5000 sample proportions. That is a lot of numbers to look at! Luckily, we know that we can use data visualizations to explore numeric data! Let’s make a histogram to visualize our sampling distribution.

hist(sample_prop50, col="gold", xlab = "Sample Proportion", main = "Figure 1")

Question 12

Describe the sampling distribution. In other words, is the distribution unimodal or multimodal, and is it symmetric, skewed right, or skewed left?

Question 13

What is the average of all 5000 sample proportions? In other words, what is the mean of sample_prop50?
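As a hint, R has a built-in mean function that we can apply to our data set of sample proportions:

# Mean of the 5000 sample proportions
mean(sample_prop50)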

Question 14

Is the value you get in Question 13 bigger than, smaller than, or roughly equal to the true population parameter of .24?

So, we have a whole bunch of sample proportions - one from each of 5000 samples. Some of these sample proportions are much bigger than .24, and some are much smaller than .24. However, the mean of all of these sample proportions is very similar to .24. Here we have discovered a very powerful fact - the mean of the sampling distribution (the distribution of sample proportions) is very similar to the population proportion!!

This is a key fact in statistics: if we take enough simple random samples, the average of the sample proportions will be approximately equal to the population proportion.

Spread

Okay, so the mean of all of the sample proportions is basically the population proportion. However, we started this process with the goal of understanding how far sample statistics tend to be from the true value of .24. To do this, we need to describe the spread of the sampling distribution.

Recall that we already know a term that describes how different individual data points tend to be from the mean of a sample: the standard deviation. So, we need to know the standard deviation of the sampling distribution - how far do individual sample statistics tend to differ from the population proportion (the mean of all the sample statistics)? We call the standard deviation of the sampling distribution the standard error (SE).

Question 15

What is the standard error of our sample statistics? In other words, what is the standard deviation (sd) of sample_prop50?
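As a hint, just as R has a mean function, it also has an sd function for the standard deviation:

# Standard deviation (standard error) of the 5000 sample proportions
sd(sample_prop50)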

Great!! This means that on average, our sample statistics tend to be about this far away from the population proportion of .24. We know that in general, most values tend to be within 2 standard deviations of the mean! This means that if we have a sample proportion \(\hat{p}\), it is plausible to think that the population proportion \(p\) that we want is in the range

\[\hat{p} \pm 2 SE.\]
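As a rough sketch of this arithmetic in R (assuming you store your sample proportion from SRS1 as phat and the standard error from Question 15 as SE; those names are just for illustration):

# Build a range of plausible values for the population proportion
phat <- prop.table(table(SRS1))["blue"]
SE <- sd(sample_prop50)
c(phat - 2*SE, phat + 2*SE)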

Question 16

Look back at your first sample statistic, the sample proportion from SRS1. Based on this, use your sample statistic to create a range of plausible values for the population proportion of M&M’s that are blue. State your range. Is \(p = .24\) in your range of plausible values?

This is the idea of sampling variability, and it is key to our ability to use samples to make conclusions about our population. We know that it is not a good idea to just report our sample statistic. However, using sampling variability allows us to build a range of plausible values for our population proportion when all we have is the information from samples.

The Formula

Using simulation is great, and very powerful. However, what if we are not in a situation where we can use simulation? We will soon learn in our books that if we take enough samples, and if certain conditions are met, the standard error (SE) of the sample proportion is:

\[SE = \sqrt{\frac{p(1-p)}{n}},\]

where \(p\) is the true population parameter and \(n\) is the size of each SRS.
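If you would like to check your answer with R rather than a calculator, the formula translates to one line of code (using p = .24 and n = 50):

# Standard error of the sample proportion from the formula
sqrt(.24 * (1 - .24) / 50)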

Question 17

Based on the formula above, what should the standard error of the sample proportion be? Is this value similar to what you got in Question 15?

Changing the Sample Size

We have answered the two questions we started off wanting to answer! However, let’s try exploring one more.

We have run our simulation using a sample size of 50. What if we increase that to 100? Would that change our sampling distribution at all? Let’s see.

Question 18

Here is the code we used to run our simulation study. We now want to change our sample size to 100. To do this, just change all the 50’s that you see to 100’s (this includes the name sample_prop50, which becomes sample_prop100). Make the change, run the code, and plot your sampling distribution. Change the title of the histogram to Figure 2.

sample_prop50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(population, size = 50)
   sample_prop50[i] <- sum(samp=="blue")/50
}

rm(i,samp)

Question 19

What is the standard error (sd) of this second sampling distribution (sample_prop100)? Is this larger or smaller than the standard error you got in Question 15?
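As a hint, this works the same way as Question 15, assuming you named your new data set sample_prop100 as described in Question 18:

# Standard error of the sampling distribution for samples of size 100
sd(sample_prop100)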

What you should see is that as our samples get bigger, our standard error gets smaller. In other words, larger samples tend to give us sample proportions that are more similar to the population proportion than smaller samples.

Question 20

Why do you think that is?

Wrapping it up

In this lab, we have seen that sampling variability can help us to build a range of plausible values for a population proportion when all we have is a sample proportion. In the next lab, we will formalize this with the construction of confidence intervals.

Creative Commons License
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2022 May 31.