Inference for categorical data

In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?

These are sample statistics. Surveying the entire population is expensive and might not be feasible.

The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?

We must assume that observations are independent and randomly collected, and that the sample size is less than 10% of the population.

load("more/atheism.RData")

What does each row of Table 6 correspond to? What does each row of atheism correspond to?

Each row of Table 6 represents a country, with the percentage of respondents who are religious, not religious, atheist, or who did not respond. Each row of ‘atheism’ represents an individual response, with the respondent's country and the year the poll was taken.

Using the command below, create a new dataframe called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?

Both percentages are the same.

us12 <- subset(atheism, nationality == "United States" & year == "2012")
us12_atheist <- subset(us12, response == "atheist")

nrow(us12_atheist)/nrow(us12)

## [1] 0.0499002
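The same proportion can be computed in one step as the mean of a logical vector. The sketch below is only an illustration: it rebuilds the responses from the counts reported in the inference output for the US in 2012 (50 atheist responses out of 1002), and the label "non-atheist" is an assumption about how the other level is coded.

```r
# Rebuild the 2012 US responses from the reported counts.
# The level name "non-atheist" is assumed for illustration.
response <- rep(c("atheist", "non-atheist"), times = c(50, 952))

# The mean of a logical vector is the proportion of TRUE values
prop_atheist <- mean(response == "atheist")
prop_atheist
```

With the real data, the equivalent call would be mean(us12$response == "atheist").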

Inference on proportions

Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?

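Observations are independent: the poll is based on a simple random sample and is less than 10% of the U.S. population. The success-failure condition is also met, since the sample contains 50 successes and 952 failures, both at least 10.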

Based on the R output, what is the margin of error for the estimate of the proportion of atheists in the US in 2012?

ME = critical value × standard error of the point estimate = 1.96 × 0.0069 ≈ 0.0135
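As a rough check, the standard error and margin of error can be recomputed from the sample counts. This sketch assumes the 50 successes out of n = 1002 reported in the inference output for the US in 2012, and uses qnorm to obtain the critical value instead of the rounded 1.96.

```r
n <- 1002
p_hat <- 50 / n

# Critical value for a 95% confidence level
z_star <- qnorm(0.975)

# Standard error of a sample proportion, then the margin of error
se <- sqrt(p_hat * (1 - p_hat) / n)
me <- z_star * se

round(c(se = se, me = me), 4)
```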

Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference function to construct the confidence intervals.

china12_atheist <- subset(atheism, nationality == "China" & year == "2012" )
india12_atheist <- subset(atheism, nationality == "India" & year == "2012" )



inference(china12_atheist$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.47 ;  n = 500 
## Check conditions: number of successes = 235 ; number of failures = 265 
## Standard error = 0.0223 
## 95 % Confidence interval = ( 0.4263 , 0.5137 )

inference(india12_atheist$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0302 ;  n = 1092 
## Check conditions: number of successes = 33 ; number of failures = 1059 
## Standard error = 0.0052 
## 95 % Confidence interval = ( 0.0201 , 0.0404 )

China ME = 1.96 × 0.0223 ≈ 0.0437

India ME = 1.96 × 0.0052 ≈ 0.0102

Conditions for inference are met. Each sample is less than 10% of the population and is assumed to be a simple random sample. Each sample also contains more than 10 successes and more than 10 failures.
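The intervals and margins of error above can also be reproduced by hand. The helper below, prop_ci, is a hypothetical function written for this illustration (it is not part of the lab's inference function); it takes the success counts and sample sizes from the output above.

```r
# Hypothetical helper: normal-approximation confidence interval for a proportion
prop_ci <- function(successes, n, conf = 0.95) {
  p_hat <- successes / n
  z_star <- qnorm(1 - (1 - conf) / 2)   # critical value
  se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error
  me <- z_star * se                     # margin of error
  c(p_hat = p_hat, se = se, me = me, lower = p_hat - me, upper = p_hat + me)
}

china <- prop_ci(235, 500)
india <- prop_ci(33, 1092)
round(rbind(china, india), 4)
```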

How does the proportion affect the margin of error?
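The relationship can be explored by computing the margin of error over a grid of proportions. This is a minimal sketch that assumes a fixed sample size of n = 1000, an illustrative value rather than one taken from the survey.

```r
n <- 1000                  # assumed sample size, for illustration only
p <- seq(0, 1, 0.01)       # grid of population proportions

# Margin of error at the 95% confidence level for each p
me <- 1.96 * sqrt(p * (1 - p) / n)

plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")

# Proportion at which the margin of error is largest
p[which.max(me)]
```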

Describe the relationship between p and me.

Plotted against p, me traces an arch that peaks at p = 0.5 and falls to 0 at p = 0 and p = 1. For a fixed n, ME = 1.96 × \(\sqrt{p(1 - p)/n}\), so me grows as p moves toward 0.5 from either side and shrinks as p moves away from it.

Success-failure condition

p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

Describe the sampling distribution of sample proportions at \(n = 1040\) and \(p = 0.1\). Be sure to note the center, spread, and shape.
Hint: Remember that R has functions such as mean to calculate summary statistics.

summary(p_hats)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07019 0.09327 0.09904 0.09969 0.10577 0.12981

sd(p_hats)

## [1] 0.009287382

The distribution is approximately normal, centered at a mean of about 0.1, with a standard deviation of about 0.0093.
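The simulated spread can be compared with the theoretical standard error of a sample proportion, \(\sqrt{p(1 - p)/n}\). A short sketch using the same p and n as the simulation:

```r
p <- 0.1
n <- 1040

# Theoretical standard error of the sample proportion
se_theory <- sqrt(p * (1 - p) / n)
se_theory
```

The result is about 0.0093, in line with the sd(p_hats) value above.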

Repeat the above simulation three more times but with modified sample sizes and proportions: for \(n = 400\) and \(p = 0.1\), \(n = 1040\) and \(p = 0.02\), and \(n = 400\) and \(p = 0.02\). Plot all four histograms together by running the par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. Describe the three new sampling distributions. Based on these limited plots, how does \(n\) appear to affect the distribution of \(\hat{p}\)? How does \(p\) affect the sampling distribution?

p <- 0.1
n <- 400
p_hats_400 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats_400[i] <- sum(samp == "atheist")/n
}

p <- 0.02
n <- 1040
p_hats_1040 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats_1040[i] <- sum(samp == "atheist")/n
}

p <- 0.02
n <- 400
p_hats_400_2 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats_400_2[i] <- sum(samp == "atheist")/n
}

par(mfrow = c(2, 2))
hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
hist(p_hats_1040, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))
hist(p_hats_400, main = "p = 0.1, n = 400", xlim = c(0, 0.18))
hist(p_hats_400_2, main = "p = 0.02, n = 400", xlim = c(0, 0.18))

par(mfrow = c(1, 1))

The two distributions with p = 0.1 are approximately normal. With p = 0.02 the distributions sit close to 0, and at n = 400 the expected number of successes is only 0.02 × 400 = 8, so that distribution is right-skewed rather than normal. n does not change the center of the distribution, but a larger n reduces the spread. Each distribution is centered at p, so p shifts the center and also changes the spread.
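The simulations above repeat the same loop for each setting. A more compact sketch is shown below, assuming rbinom is an acceptable stand-in for the sample-and-count loop (both draw the number of "atheist" responses from a binomial distribution). The function simulate_p_hats and the settings table are hypothetical names introduced for this illustration, using the values of n and p named in the exercise.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical helper: simulate `reps` sample proportions for a given n and p
simulate_p_hats <- function(n, p, reps = 5000) {
  rbinom(reps, size = n, prob = p) / n
}

# The four (n, p) combinations from the exercise
settings <- data.frame(n = c(1040, 400, 1040, 400),
                       p = c(0.1, 0.1, 0.02, 0.02))

par(mfrow = c(2, 2))
for (i in seq_len(nrow(settings))) {
  sims <- simulate_p_hats(settings$n[i], settings$p[i])
  hist(sims,
       main = paste0("p = ", settings$p[i], ", n = ", settings$n[i]),
       xlim = c(0, 0.18))
}
par(mfrow = c(1, 1))
```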

If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?

We should be able to use the normal approximation for Australia, where the expected number of successes is 0.1 × 1040 = 104. For Ecuador, the expected number of successes is only 0.02 × 400 = 8, so it does not meet the success-failure condition.
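The success-failure check can be written as a small function. The name meets_success_failure and the default threshold of 10 are choices made for this sketch, following the rule of thumb used in this lab.

```r
# Hypothetical helper: TRUE when both expected counts reach the threshold
meets_success_failure <- function(n, p, threshold = 10) {
  n * p >= threshold && n * (1 - p) >= threshold
}

australia_ok <- meets_success_failure(1040, 0.1)   # 104 expected successes
ecuador_ok   <- meets_success_failure(400, 0.02)   # 8 expected successes

c(Australia = australia_ok, Ecuador = ecuador_ok)
```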

On your own

The question of atheism was asked by WIN-Gallup International in a similar survey that was conducted in 2005. (We assume here that sample sizes have remained the same.) Table 4 on page 13 of the report summarizes survey results from 2005 and 2012 for 39 countries.

Answer the following two questions using the inference function. As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference.

a. Is there convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012?
Hint: Create a new data set for respondents from Spain. Form confidence intervals for the true proportion of atheists in both years, and determine whether they overlap.

spain12_atheist <- subset(atheism, nationality == "Spain" & year == "2012" )
spain05_atheist <- subset(atheism, nationality == "Spain" & year == "2005" )


par(mfrow = c(2, 2))
inference(spain12_atheist$response, est = "proportion", type = "ci", method = "theoretical", 
      success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.09 ;  n = 1145 
## Check conditions: number of successes = 103 ; number of failures = 1042 
## Standard error = 0.0085 
## 95 % Confidence interval = ( 0.0734 , 0.1065 )

inference(spain05_atheist$response, est = "proportion", type = "ci", method = "theoretical", 
      success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.1003 ;  n = 1146 
## Check conditions: number of successes = 115 ; number of failures = 1031 
## Standard error = 0.0089 
## 95 % Confidence interval = ( 0.083 , 0.1177 )

par(mfrow = c(1, 1))

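The null hypothesis is that the proportion of atheists in Spain was the same in 2005 and 2012; the alternative is that it changed. Both samples have well over 10 successes and 10 failures, so the conditions for inference are met.

The 2005 confidence interval is (0.083, 0.118) and the 2012 confidence interval is (0.073, 0.107). Because the intervals overlap substantially, the data do not provide convincing evidence of a change.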
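The overlap comparison can be made explicit in code. The function intervals_overlap is a hypothetical helper for this sketch; the endpoints are copied from the inference output above. Checking for overlap is a rough, conservative shortcut, not a formal test.

```r
# Hypothetical helper: TRUE when two intervals (lower, upper) share any values
intervals_overlap <- function(a, b) {
  a[1] <= b[2] && b[1] <= a[2]
}

spain05 <- c(0.0830, 0.1177)
spain12 <- c(0.0734, 0.1065)

spain_overlap <- intervals_overlap(spain05, spain12)
spain_overlap
```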

b. Is there convincing evidence that the United States has seen a change in its atheism index between 2005 and 2012?

usa12_atheist <- subset(atheism, nationality == "United States" & year == "2012" )
usa05_atheist <- subset(atheism, nationality == "United States" & year == "2005" )


par(mfrow = c(2, 2))
inference(usa12_atheist$response, est = "proportion", type = "ci", method = "theoretical", 
      success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

inference(usa05_atheist$response, est = "proportion", type = "ci", method = "theoretical", 
      success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.01 ;  n = 1002 
## Check conditions: number of successes = 10 ; number of failures = 992 
## Standard error = 0.0031 
## 95 % Confidence interval = ( 0.0038 , 0.0161 )

par(mfrow = c(1, 1))

The 2005 confidence interval is (0.0038, 0.0161) and the 2012 confidence interval is (0.0364, 0.0634).

There is no overlap, which suggests that there has been a change in the atheism index in the US.
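A two-proportion test is a more direct way to assess a change than comparing two separate intervals. The sketch below uses base R's prop.test instead of the lab's inference function, with the success counts and sample sizes taken from the output above. The hypotheses are H0: the 2005 and 2012 proportions are equal, against HA: they differ. Note that the 2005 sample has exactly 10 successes, which is right at the boundary of the success-failure condition.

```r
# Successes and sample sizes for the US: 2012 first, then 2005
us_test <- prop.test(x = c(50, 10), n = c(1002, 1002))

us_test$p.value    # p-value for H0: equal proportions
us_test$conf.int   # 95% interval for the difference in proportions
```

A p-value below 0.05 would agree with the conclusion drawn from the non-overlapping intervals.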

If in fact there has been no change in the atheism index in the countries listed in Table 4, in how many of those countries would you expect to detect a change (at a significance level of 0.05) simply by chance?
Hint: Look in the textbook index under Type 1 error.

There are 39 countries listed in Table 4. Even if there is no real change, a test at the 0.05 significance level would falsely detect one in any given country with probability 0.05 (a Type 1 error). Since we are counting the total number of such “successes” in a series of independent trials, we have a binomial random variable. Its expected value is 39 × 0.05 = 1.95, so we would expect to detect a change in about 2 countries simply by chance.
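The expected count, along with the chance of at least one false detection, can be computed directly. This sketch assumes the 39 tests are independent, as in the binomial argument above.

```r
n_countries <- 39
alpha <- 0.05

# Expected number of false detections under no change
expected_false_positives <- n_countries * alpha

# Probability of at least one false detection among the 39 tests
p_at_least_one <- 1 - pbinom(0, size = n_countries, prob = alpha)

c(expected = expected_false_positives, at_least_one = p_at_least_one)
```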

Suppose you’re hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for \(p\). How many people would you have to sample to ensure that you are within the guidelines?

ME = z* × SE

0.01 = 1.96 × \(\sqrt{0.5 \times 0.5/n}\), where p = 0.5 is the worst case because it maximizes the standard error, and solving for n we get

 (.5*.5)/(.01/1.96)^2

## [1] 9604

We would need to sample at least 9604 people.
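The same calculation can be wrapped in a function that rounds up to a whole number of respondents. The name required_n and its defaults are assumptions made for this sketch; it uses qnorm for the critical value rather than the rounded 1.96, and p = 0.5 as the worst case.

```r
# Hypothetical helper: smallest n giving a margin of error no greater than `me`
required_n <- function(me, conf = 0.95, p = 0.5) {
  z_star <- qnorm(1 - (1 - conf) / 2)
  ceiling(p * (1 - p) * (z_star / me)^2)
}

n_needed <- required_n(0.01)
n_needed
```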

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.