COVID-19: Estimating Prevalence

Herd immunity refers to the state when enough people in the population become immune to COVID-19 that the spread of the coronavirus slows or stops. The big question is what proportion of the population needs to be immune to achieve this protection.

If herd immunity is achievable, policy makers and the public would benefit from knowing how many people are immune. Since it is infeasible to test everyone in the population, we can test a sample instead.

Assume that you are the epidemiologist for the state of California, which has

• Population = 40,000,000

• Number of People with Immunity = 9,000,000

Both of these are population parameters. Remember that the second parameter (9 million people with immunity) is an assumed number; in real life, we would not know it. How many people should we test? If you can randomly sample individuals in California and test them for immunity, how large should your sample be?
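As a point of reference, the two parameters above fix the true population proportion, and the usual large-sample approximation for the standard error of a sample proportion indicates how much a sample estimate should vary. The symbols p, p-hat, and SE are notation introduced here for illustration and do not appear in the assignment; the approximation also ignores the finite-population correction, which is negligible when n is tiny relative to 40,000,000.

```latex
\[
p = \frac{9{,}000{,}000}{40{,}000{,}000} = 0.225,
\qquad
\mathrm{SE}(\hat{p}) \approx \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.174375}{n}}
\]
```

Under this approximation the standard error is roughly 0.042 for n = 100, 0.013 for n = 1,000, 0.0042 for n = 10,000, and 0.0013 for n = 100,000, so each tenfold increase in sample size shrinks it by a factor of about 3.2.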

1. Evaluate the following sample sizes:
• n = 100
• n = 1,000
• n = 10,000
• n = 100,000
Which sample size gives you the best estimate of the population proportion?

Note: set your seed as 1234 to generate the same random samples as the answer key.

Hint: You should first generate a dataset with 40,000,000 observations using the population parameters above. The dataset will simulate the population.
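One possible way to set this up is sketched below in Python, using only the standard library. It is an illustration under stated assumptions, not the answer key's method: instead of storing 40,000,000 rows, it labels people 0 to 39,999,999 and treats the first 9,000,000 labels as immune, which yields the same simple random sample without the memory cost. The names `sample_proportion` and `compare_sizes` are hypothetical, and seeding Python's generator with 1234 will not reproduce samples drawn with a different tool's random number generator.

```python
import random

POPULATION = 40_000_000        # population size from the problem statement
IMMUNE = 9_000_000             # assumed number of people with immunity
TRUE_P = IMMUNE / POPULATION   # true population proportion (0.225)


def sample_proportion(n, rng):
    """Draw n distinct people at random and return the share who are immune.

    People are labelled 0..POPULATION-1 and labels below IMMUNE count as
    immune, so the full dataset never has to be held in memory.
    """
    drawn = rng.sample(range(POPULATION), n)  # sampling without replacement
    return sum(1 for person in drawn if person < IMMUNE) / n


def compare_sizes(sizes, seed=1234):
    """Return {sample size: sample proportion} for one sample of each size."""
    rng = random.Random(seed)  # seeded so repeated runs give the same samples
    return {n: sample_proportion(n, rng) for n in sizes}


if __name__ == "__main__":
    for n, p_hat in compare_sizes([100, 1_000, 10_000, 100_000]).items():
        print(f"n = {n:>7,}: p_hat = {p_hat:.4f}, error = {p_hat - TRUE_P:+.4f}")
```

Comparing the printed error column across sample sizes is one way to judge which sample size estimates the population proportion most closely.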

2. One way to evaluate the sampling distribution is to report what percentage of samples fall within 3% of the true population proportion. The idea is that the larger the sample size, the greater the percentage of samples that should fall within 3%. (For example, if we take 100 samples of n = 1,000,000, almost all of them will have sample proportions within 3% of the true population proportion.)
Take 100 samples of each sample size (n = 100, n = 1,000, etc.). For each sample size, compute the percentage of samples whose sample proportions fall within 3% of the true population proportion, and compare the results. As the sample size increases, what happens to this percentage? Does this provide a rationale for using larger sample sizes?

Hint: This will involve a loop.
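The loop could look something like the following Python sketch, which uses only the standard library. Several choices here are assumptions rather than part of the assignment: "within 3%" is read as an absolute difference of at most 0.03 from the true proportion (adjust `tolerance` if a relative 3% is intended), the population is represented by labels rather than a stored dataset, and the names `sample_proportion` and `percent_within` are hypothetical. A seed of 1234 in Python will not reproduce samples generated with a different tool.

```python
import random

POPULATION = 40_000_000        # population size from the problem statement
IMMUNE = 9_000_000             # assumed number of people with immunity
TRUE_P = IMMUNE / POPULATION   # true population proportion (0.225)


def sample_proportion(n, rng):
    """Share of immune people in one simple random sample of size n."""
    # Labels below IMMUNE stand for immune people (an illustrative shortcut).
    drawn = rng.sample(range(POPULATION), n)
    return sum(1 for person in drawn if person < IMMUNE) / n


def percent_within(n, reps=100, tolerance=0.03, seed=1234):
    """Percent of `reps` samples of size n whose proportion is within
    `tolerance` (absolute difference, an assumption) of TRUE_P."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):  # the loop the hint refers to
        if abs(sample_proportion(n, rng) - TRUE_P) <= tolerance:
            hits += 1
    return 100 * hits / reps


if __name__ == "__main__":
    # Append 100_000 for the full exercise; it runs noticeably longer
    # in pure Python because it draws 10 million people in total.
    for n in (100, 1_000, 10_000):
        print(f"n = {n:>6,}: {percent_within(n):.0f}% of samples within 0.03")
```

Looping over the sample sizes and printing one percentage per size makes the trend across sizes easy to read off directly.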

3. Clearly, using the largest sample size is almost always ideal. What are the issues with testing a large number of people? How would these issues factor into deciding how many people should be tested?