Inference for Categorical Data

The survey

Exercise 1

In the first paragraph, several key findings are reported. Do these precentages appear to be sample statisitcs (dervived from the data sample) or population parameters?

They appear to be sample statisitcs (from the data sample), since it is looking at percent atheist, non-reglious, and relgious-which is a question in the survey. It looks at differences in several populations, but that is not a population parameter.

Exercise 2

The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?

We would assume that the sample is random, meaning each countries population is equally represented. So if china represents x percent of the world population, we would expect approximately x percent of the surveyed individuals to be from china. This is probably not realistic.

The data

download.file("http://www.openintro.org/stat/data/atheism.RData", destfile = "atheism.RData")
load("atheism.RData")

Exercise 3

What does each row of Table 6 correspond to? What does each row of atheism correspond to?

From Table 6, each row represents a different country and their responses to the survey. In the atheism dataset, each row represents a single response (including their nationality and the year).

Exercise 4

Using the command below, create a new dataframe called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?

Since 50 people responded “atheist” out of 1002, the proportion would be approximately 5% which does agree with Table 6.

us12 <- subset(atheism, nationality == "United States" & year == "2012")
plot(us12$response)

table(us12$response)
## 
##     atheist non-atheist 
##          50         952
50/(952+50)
## [1] 0.0499002
(50+2)/ (952+50+4) # Wilson-Adjusted Proportion
## [1] 0.05168986

Inference on Proportions

Exercise 5

Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?

The following conditions must be met: each cell has to have a count of greater than 5 and the observations must be randomly collected. The first condition is met, however, as stated before, we are not confident that the data is random.

inference(us12$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Warning: package 'BHH2' was built under R version 4.0.4
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

This means we are 95% confident that the proportion of the US population who identify as “atheist” is between 3.64% and 6.34%

0.0499 - 0.0364
## [1] 0.0135

Exercise 6

Based on the R output, what is the margin of error for the estimate of the proportion of the proportion of atheists in US in 2012?

Would the margin of error just be the variation from the reported proportion? So 1.35%? I think so (because margin of error should be z value at 95% confidence multiplied by the standard error):

1.96*(0.0069)
## [1] 0.013524

Exercise 7

Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference function to construct the confidence intervals.

afg12 <- subset(atheism, nationality == "Afghanistan" & year == "2012")

For afghanistan, the conditions for inference were not met. We learned there need to be 5 responses per cell, but for the inference function, there appears to require 10 per cell.

sweden12 <- subset(atheism, nationality == "Sweden" & year == "2012")
inference(sweden12$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0808 ;  n = 495 
## Check conditions: number of successes = 40 ; number of failures = 455 
## Standard error = 0.0122 
## 95 % Confidence interval = ( 0.0568 , 0.1048 )

Margin of error for Sweden:2.39%

1.96*(0.0122)
## [1] 0.023912
neth12 <- subset(atheism, nationality == "Netherlands" & year == "2012")
inference(neth12$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.1395 ;  n = 509 
## Check conditions: number of successes = 71 ; number of failures = 438 
## Standard error = 0.0154 
## 95 % Confidence interval = ( 0.1094 , 0.1696 )

Margin of error for the Netherlands: 3.02%

1.96*0.0154
## [1] 0.030184

How does the proportion affect the margin of error?

n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt (p* (1-p) / n)
plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")

Exercise 8

Describe the relationship between p and me

I am not sure the correct term, but the relationship is semi-circular. As the proportion increases, the margin of error increases up until the proportion hits 0.5, then the margin of error decreases again.

Success-failure Condition

p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

Exercise 9

Describe the sampling distribution of sample proportions at n=1040 and p=0.1. Be sure to note the center, spread, and shape.

Hint: Remember that R has functions such as mean to calculate summary statistics.

mean(p_hats)
## [1] 0.09969
min(p_hats)
## [1] 0.07019231
max(p_hats)
## [1] 0.1298077
max(p_hats) - min(p_hats)
## [1] 0.05961538

The shape is fairly normal with a center (mean) of approximately 0.1. The spread (range) is approximately 0.07.

Exercise 10

Repeat the above simulation three more times but with modified sample sizes and proportions: for n=400 and p=0.1, n=1040 and p=0.02, and n=400 and p=0.02. Plot all four histograms together by running the par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. Describe the three new sampling distributions. Based on these limited plots, how does n appear to affect the distribution of p^? How does p affect the sampling distribution?

p <- 0.1
n1 <- 400
p_hats1 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n1, replace = TRUE, prob = c(p, 1-p))
  p_hats1[i] <- sum(samp == "atheist")/n1
}

hist(p_hats1, main = "p = 0.1, n = 400", xlim = c(0, 0.18))

mean(p_hats1)
## [1] 0.099759
min(p_hats1)
## [1] 0.0525
max(p_hats1)
## [1] 0.155
max(p_hats1) - min(p_hats1)
## [1] 0.1025
p2 <- 0.02
n <- 1040
p_hats2 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p2, 1-p2))
  p_hats2[i] <- sum(samp == "atheist")/n
}

hist(p_hats2, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))

mean(p_hats2)
## [1] 0.01995423
min(p_hats2)
## [1] 0.005769231
max(p_hats2)
## [1] 0.03942308
max(p_hats2) - min(p_hats2)
## [1] 0.03365385
p2 <- 0.02
n1 <- 400
p_hats3 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n1, replace = TRUE, prob = c(p2, 1-p2))
  p_hats3[i] <- sum(samp == "atheist")/n1
}

hist(p_hats3, main = "p = 0.1, n = 400", xlim = c(0, 0.18))

mean(p_hats3)
## [1] 0.0198785
min(p_hats3) - max(p_hats3)
## [1] -0.0475
par(mfrow = c(2,2))

p <- 0.1
n <- 1040
p2 <- 0.02
n1 <- 400
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

p_hats1 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n1, replace = TRUE, prob = c(p, 1-p))
  p_hats1[i] <- sum(samp == "atheist")/n1
}

hist(p_hats1, main = "p = 0.1, n = 400", xlim = c(0, 0.18))

p_hats2 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p2, 1-p2))
  p_hats2[i] <- sum(samp == "atheist")/n
}

hist(p_hats2, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))

p_hats3 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n1, replace = TRUE, prob = c(p2, 1-p2))
  p_hats3[i] <- sum(samp == "atheist")/n1
}

hist(p_hats3, main = "p = 0.1, n = 400", xlim = c(0, 0.18))

The smaller n is, the wider the spread (taking into account the mean proportion). So the range of p_hats2 is smaller than p_hats becuase the mean proportion is smaller. However, the spread of p_hat1 is wider than p_hats because the n is smaller. In the addition, the smaller the mean proportion, the less normal the distribution. However, the distribution becomes more normal for a smaller p as the n increases.

Exercise 11

If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margin of errors, as the reports does?

I think it is fair to use inference and report margin of error on Australia since the distribution seems fairly normal. However, it may not be sensible to use inference and report margin of error since condition that the distrubtion must be normal is violated. The plot above shows a skewing distribution at n=400 and p=0.02.

On Your Own

Answer the following two questions using the inference function. As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference.

a. Is there convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012?

###Hint: Create a new data set for respondents from Spain. Form confidence intervals for the true proportion of athiests in both years, and determine whether they overlap.

The null hypothesis: There is no change in the atheist index in Spain from 2005 to 2012

The alternative hypothesis: There is a change in the atheist index in Spain from 2005 to 2012

par(mfrow = c(1,1))
p4 <- 0.09
n4 <- 1146
p_hats4 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n4, replace = TRUE, prob = c(p4, 1-p4))
  p_hats4[i] <- sum(samp == "atheist")/n4
}

hist(p_hats4, main = "Spain, 2012, p = 0.09, n = 1146", xlim = c(0, 0.18))

spain12 <- subset(atheism, nationality == "Spain" & year == "2012")
spain05 <- subset(atheism, nationality == "Spain" & year == "2005")
inference(spain12$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.09 ;  n = 1145 
## Check conditions: number of successes = 103 ; number of failures = 1042 
## Standard error = 0.0085 
## 95 % Confidence interval = ( 0.0734 , 0.1065 )
inference(spain05$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.1003 ;  n = 1146 
## Check conditions: number of successes = 115 ; number of failures = 1031 
## Standard error = 0.0089 
## 95 % Confidence interval = ( 0.083 , 0.1177 )

The 95% confidence interval for spain in 2005 is ( 0.083 , 0.1177 ) while the 95% confidence interval for spain in 2012 is ( 0.0734 , 0.1065 ). Even though p_hat appears to decrease, the confidence intervals overlap meaning we fail to reject the null hypothesis and that there is no evidence that there is a change in atheist index in spain from 2005 to 2012.

###b. Is there convincing evidence that the United States has seen a change in its atheism index between 2005 and 2012?

The null hypothesis: There is no change in the atheist index in the US from 2005 to 2012

The alternative hypothesis: There is a change in the atheist index in the US from 2005 to 2012

us05 <- subset(atheism, nationality == "United States" & year == "2005")
us12 <- subset(atheism, nationality == "United States" & year == "2012")

inference(us05$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.01 ;  n = 1002 
## Check conditions: number of successes = 10 ; number of failures = 992 
## Standard error = 0.0031 
## 95 % Confidence interval = ( 0.0038 , 0.0161 )
inference(us12$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

The 95% confidence interval in the US in 2005 is ( 0.0038 , 0.0161 ) while the 95% confidence interval in the US in 2012 is ( 0.0364 , 0.0634 ). Since the confidence intervals do NOT overlap, we can reject the null hypothesis. In other words, there is evidence that the atheist index changed from 2005 to 2012.

If in fact there has been no change in the atheism index in the countries listed in Table 4, in how many of those countries would you expect to detect a change (at a significance level of 0.05) simply by chance?

Hint: Look in the textbook index under Type 1 error.

For a 95% confidence interval or a significance level of 0.05, we would expect that 5% of the time we would get a type I error. The significance level is set to the frequency at which you would “accept” a type I error.

Suppose you’re hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for p. How many people would you have to sample to ensure that you are within the guidelines?

Hint: Refer to your plot of the relationship between p and margin of error. Do not use the data set to answer this question.

The maximum margin of error would occur at a p of 0.5. So we would use this p to find n

ME = 1.96 sqrt ( 0.5*(1-0.5) / n), our ME would be 0.01

( 0.01 / 1.96 )^2 = 0.5 * 0.5 / n

(0.01/1.96)^2 / .25 = 1 / n

0.25 / (0.01/1.96)^2
## [1] 9604

Seems high, but it would seem we would need a minimum of n = 9604 to achieve no greater than a 1% margin of error if were unsure of the p.