Lab 8: Inference for Categorical Data

Gallup is an organization that conducts extensive polls exploring many facets of societal opinion, political views, and more. In December of 2017, Gallup released an article called The 2017 Update on Americans and Religion. The article explored the relationship between religion and political party, the degree to which Americans considered themselves religious, and the proportion of Americans identifying with particular religions. In this lab, we are going to use the survey results to explore changes in how Americans self-identify when it comes to their religious beliefs.

While we usually start our process with EDA, in this case, Gallup has done this work for us. Accordingly, we are going to begin by taking a look at the article and using it to answer the following questions.

Early in this course, we talked about the need to confirm the reliability of your data before using it to draw any conclusions. To assess the validity of supplied data, look for information such as margins of error, sampling methods, and sample sizes. If this information is provided, it lends more support to the claim that the data presented can be safely used. It also allows us to assess any potential biases that may result from the data collection methods.

Inference with a single proportion

The Gallup poll provides sample statistics, that is, calculations made from the sample. We are more interested in what information this sample can provide us about population parameters. Based on the survey results, we are able to determine what proportion of people in the sample reported being highly religious. Our goal is to estimate the proportion of adults in the United States who would report being highly religious. As we have done for population means, we are going to use confidence intervals and hypothesis tests to take information from the sample and make conclusions about the population.
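As a rough illustration of where this section is headed, the sketch below checks the success-failure condition and builds a 95% confidence interval for a single proportion. The values of n_demo and p_hat_demo are hypothetical placeholders, not figures from the Gallup article; substitute the numbers you find there.

```r
# Hypothetical placeholders, NOT taken from the Gallup article
n_demo     <- 1000   # sample size
p_hat_demo <- 0.40   # sample proportion

# Success-failure condition: at least 10 expected successes and failures
c(successes = n_demo * p_hat_demo, failures = n_demo * (1 - p_hat_demo)) >= 10

# Standard error, margin of error, and 95% confidence interval
se_demo <- sqrt(p_hat_demo * (1 - p_hat_demo) / n_demo)
me_demo <- qnorm(0.975) * se_demo   # qnorm(0.975) is approximately 1.96
ci_demo <- c(lower = p_hat_demo - me_demo, upper = p_hat_demo + me_demo)
ci_demo
```

The independence and random-sampling conditions cannot be checked in code; they depend on how the survey was conducted.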

Write out the conditions for inference that we need to check before constructing a 95% confidence interval for the proportion of adults in the United States in 2017 who identify as highly religious. Are you confident that all of the necessary conditions are met? Explain.

What is the standard error associated with the point estimate \(\hat{p}\) we would use for building a confidence interval?

Construct and interpret a 95% confidence interval for the proportion of American adults who identified as highly religious in 2017.

Why do you think the margin of error is so small?

How does the proportion affect the margin of error?

The section "How does the proportion affect the margin of error?" is adapted from a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.

Confidence intervals require both a point estimate and a margin of error.

Imagine that a friend who has read this article asks you what is meant by a margin of error. Without using any calculations, explain what a margin of error is.

We know that we can build a variety of different confidence intervals by changing the critical value, and hence changing the margin of error. It turns out that more than just the critical value affects the margin of error. We have seen in class that sample size also affects the margin of error, but that's not all, either. Imagine you've set out to survey 1000 people on two questions: "Is your favorite color blue?" and "Are you left-handed?" Since both of these sample proportions were calculated from the same sample size, they should have the same margin of error, right? Wrong! While the margin of error does change with sample size, it is also affected by the proportion.

Think back to the formula for the standard error: \(SE = \sqrt{p(1-p)/n}\). This is then used in the formula for the margin of error for a 95% confidence interval: \(ME = 1.96\times SE = 1.96\times\sqrt{p(1-p)/n}\). Since the population proportion \(p\) is in this \(ME\) formula, it should make sense that the margin of error is in some way dependent on the population proportion. We can visualize this relationship by creating a plot of \(ME\) vs. \(p\).
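Before plotting, a quick numerical check can make the dependence concrete. The sketch below plugs two proportions into the formula above with the same sample size of 1000. The values 0.10 and 0.50 are hypothetical, chosen only for illustration, and are not survey results.

```r
n_check <- 1000
p_check <- c(0.10, 0.50)   # hypothetical proportions, for illustration only

# Standard error and 95% margin of error for each proportion
se_check <- sqrt(p_check * (1 - p_check) / n_check)
me_check <- 1.96 * se_check

# Same sample size, different margins of error
round(me_check, 4)
```

Because p(1 - p) differs between the two proportions, the two margins of error differ even though n is identical.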

The first step is to make a vector p that is a sequence from \(0\) to \(1\) with each number separated by \(0.01\). This is our vector of proportions. We can then create a vector of the margin of error (me) associated with each of these values of p using the familiar approximate formula (\(ME = 2 \times SE\)). Lastly, we plot the two vectors against each other to reveal their relationship.

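n <- 1000                        # sample size
p <- seq(0, 1, 0.01)             # proportions from 0 to 1 in steps of 0.01
me <- 2 * sqrt(p * (1 - p) / n)  # approximate 95% margin of error for each p
plot(me ~ p)                     # margin of error plotted against proportion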
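Since the margin of error depends on both the proportion and the sample size, it can help to see the two effects together. The sketch below redraws the same curve for three sample sizes. The helper me_for_n and the sample sizes 250 and 4000 are illustrative assumptions, not part of the original lab.

```r
p_grid <- seq(0, 1, 0.01)

# Hypothetical helper: approximate 95% margin of error (2 * SE) for sample size n
me_for_n <- function(n) 2 * sqrt(p_grid * (1 - p_grid) / n)

# Draw the curve for the smallest sample first so the y-axis fits all three
plot(p_grid, me_for_n(250), type = "l", lty = 3,
     xlab = "p", ylab = "margin of error")
lines(p_grid, me_for_n(1000), lty = 1)
lines(p_grid, me_for_n(4000), lty = 2)
legend("topright", legend = c("n = 250", "n = 1000", "n = 4000"),
       lty = c(3, 1, 2))
```

Quadrupling the sample size halves the margin of error at every value of p, while the overall shape of the curve stays the same.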

Based on the plot, describe the relationship between the proportion and the margin of error. Does it look like the margin of error is higher for more extreme values of p (i.e., values closer to 0 or 1) or for more moderate values of p (i.e., values closer to 0.5)?

Which of the following is false about the relationship between \(p\) and \(ME\)?

A) The \(ME\) reaches a minimum at \(p = 0\).
B) The \(ME\) reaches a minimum at \(p = 1\).
C) The \(ME\) is maximized when \(p = 0.5\).
D) The most conservative estimate when calculating a confidence interval occurs when \(p\) is set to 1.

To explore this a little, let's compute two proportions using our same article.

Based on the survey, what percentage of Americans identified as Catholic in 2017? What about Jewish?

Compute the margin of error for a 95% confidence interval for both proportions in the previous question. These results came from the same sample. Explain why the margins of error are not the same.

Comparing Two Proportions

The Gallup poll has been conducted for multiple years, meaning that it is a longitudinal survey, or a survey with results that can be traced over time. To see trends over time, take a look at this article. We are going to use this to compare two different proportions.

Based on the article, what proportion of American adults self-identified as Protestant in 2000? In 2017?

We would like to build a confidence interval for the change in the proportion of adults who self-identify as Protestant in 2000 versus 2017. We could of course do this by hand, but R makes our life easier. To perform a one-sample or two-sample proportion test, or to build the corresponding confidence intervals, we use R's prop.test function.

Before we can use the prop.test function, we have to do one quick step. The function requires us to give counts rather than sample proportions. Luckily, it is easy to convert from one to the other. If you have a sample size and a sample proportion, you simply multiply the two to get the sample count.

count2017 <- 126965 * 0.38  # sample size times the 2017 sample proportion
count2000 <- 126965 * 0.52  # sample size times the 2000 sample proportion

Approximately how many American adults in the survey self-identified as Protestant in 2000? In 2017?

Now we are ready to use the function to build our confidence interval.

prop.test(x = c(count2000, count2017), n = c(126965, 126965), alternative = "two.sided", conf.level = 0.95)

Let's go through the arguments of this function.

The first argument is x = c(count2000, count2017), which gives the two counts that we are interested in comparing. Because of the order we have chosen, this function will test \(p_{2000}-p_{2017}\).
The second argument, n = c(126965, 126965), tells R the sample sizes for the 2000 and 2017 surveys. In this case, the sample sizes are the same.
The third argument, alternative, tells R what our alternative hypothesis is: less than ("less"), greater than ("greater"), or not equal to ("two.sided") the value specified in the null hypothesis. If we are only building a confidence interval, as we are here, we should set alternative = "two.sided".
The fourth argument, conf.level, is the confidence level used to define our confidence interval.
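The call prints a full summary, but the individual pieces can also be pulled out of the returned object. The sketch below stores the result and extracts the sample proportions and the confidence interval. The names n_each and result are illustrative, and rounding the counts to whole numbers is an assumption added here, not a step from the lab.

```r
n_each <- 126965                       # sample size used for both years in this lab
count2000 <- round(n_each * 0.52)      # rounding to whole counts is an added assumption
count2017 <- round(n_each * 0.38)

result <- prop.test(x = c(count2000, count2017), n = c(n_each, n_each),
                    alternative = "two.sided", conf.level = 0.95)

result$estimate   # the two sample proportions, in the order given in x
result$conf.int   # confidence interval for p_2000 - p_2017
```

Reversing the order of the counts in x and n would flip the sign of the interval, so it is worth checking the order before interpreting the result.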

Now that we understand the structure, let's see the results.

Write down and interpret the 95% confidence interval.

Is there significant evidence that the proportion of Americans who self-identify as Protestant has decreased at least 10% from 2000 to 2017? Note: You should be able to state the result of this hypothesis test without actually running the test. State the level of significance you would need to use to be able to do this.

This lab was written by Dr. Nicole Dalzell at Wake Forest University. It is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.
