Inference for categorical data

Exercise 1) In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?

These percentages are sample statistics: they come from a poll of a sample of respondents, not from a census of the entire population.

Exercise 2) The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?

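We must assume that the respondents in each country were chosen by random sampling, so that the observations are independent and representative of the population. Each sample must also be small relative to its population (less than 10%), which is easily satisfied here. If the poll was conducted that way, the assumption seems reasonable.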

download.file("http://www.openintro.org/stat/data/atheism.RData", destfile = "atheism.RData")
load("atheism.RData")

Exercise 3) What does each row of Table 6 correspond to? What does each row of atheism correspond to?

Each row of Table 6 corresponds to a country and gives the percentage breakdown of its poll responses (religious, not religious, or atheist). Each row of atheism corresponds to an individual respondent.

Exercise 4) Using the command below, create a new dataframe called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?

us12 <- subset(atheism, nationality == "United States" & year == "2012")
us12A <- subset(atheism, nationality == "United States" & year == "2012" & response == "atheist")
nrow(us12A)/nrow(us12)

## [1] 0.0499002

We get 4.99%, which agrees with Table 6 once rounded: 5% of US respondents identified as atheist.
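
The same proportion can also be computed in one step by averaging a logical vector. The sketch below is only an illustration: instead of loading atheism.RData, it rebuilds the responses from the counts that the inference output reports for this sample (50 atheist responses out of 1002). The vector name `responses` and the label "non-atheist" are placeholders, not names from the dataset.

```r
# Hypothetical stand-in for us12$response, rebuilt from the counts
# 50 atheist and 952 other responses (n = 1002).
responses <- rep(c("atheist", "non-atheist"), times = c(50, 952))

# The mean of a logical vector is the proportion of TRUE values.
p_hat <- mean(responses == "atheist")
p_hat
```

With the real data loaded, the equivalent call would be mean(us12$response == "atheist").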

Exercise 5) Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?

inference(us12$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")

## Warning: package 'BHH2' was built under R version 3.6.3

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

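1) The observations are independent: the respondents come from a random sample (the poll), and 1002 is far less than 10% of the US population. 2) The success-failure condition holds: there are at least 10 successes (50) and at least 10 failures (952). Both conditions are met.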
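
The success-failure check and the interval itself can be reproduced by hand from the counts in the output above. This is a minimal sketch of the normal-approximation interval, not the internals of the inference function; the variable names are chosen here for illustration.

```r
# Counts taken from the inference() output for the US in 2012.
successes <- 50
n <- 1002
failures <- n - successes

# Success-failure condition: at least 10 of each outcome.
conditions_met <- successes >= 10 && failures >= 10

# Point estimate, standard error, and 95% interval (normal approximation).
p_hat <- successes / n
se <- sqrt(p_hat * (1 - p_hat) / n)
z_star <- qnorm(0.975)  # about 1.96
ci <- p_hat + c(-1, 1) * z_star * se

conditions_met
round(se, 4)
round(ci, 4)
```

Rounded to four decimals, the standard error and interval should agree with the values printed by inference.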

Exercise 6) Based on the R output, what is the margin of error for the estimate of the proportion of atheists in the US in 2012?

Margin of error = z* × SE, with z* = 1.96 for 95% confidence and SE = 0.0069 from the output above.

1.96*0.0069

## [1] 0.013524
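
As a cross-check, the margin of error can also be read off the reported interval, since a symmetric interval extends one margin of error on each side of the estimate. The sketch below uses the rounded endpoints from the output above, so it should match the product only to about three decimals.

```r
# Endpoints of the 95% interval reported by inference() for the US.
ci_lower <- 0.0364
ci_upper <- 0.0634

# The interval is symmetric about p_hat, so the margin of error
# is half its width.
me_from_ci <- (ci_upper - ci_lower) / 2
me_from_ci
```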

Exercise 7) Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference function to construct the confidence intervals.

Fran12 <- subset(atheism, nationality == "France" & year == "2012")
Fran12A <- subset(atheism, nationality == "France" & year == "2012" & response == "atheist")
nrow(Fran12A)/nrow(Fran12)

## [1] 0.2873223

inference(Fran12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.2873 ;  n = 1688 
## Check conditions: number of successes = 485 ; number of failures = 1203 
## Standard error = 0.011 
## 95 % Confidence interval = ( 0.2657 , 0.3089 )

Germ12 <- subset(atheism, nationality == "Germany" & year == "2012")
Germ12A <- subset(atheism, nationality == "Germany" & year == "2012" & response == "atheist")
nrow(Germ12A)/nrow(Germ12)

## [1] 0.1494024

inference(Germ12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.1494 ;  n = 502 
## Check conditions: number of successes = 75 ; number of failures = 427 
## Standard error = 0.0159 
## 95 % Confidence interval = ( 0.1182 , 0.1806 )

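Both countries meet the conditions for inference: each sample comes from a random poll, and each has well over 10 successes and 10 failures (485 and 1203 for France, 75 and 427 for Germany). The margin of error is 1.96 times the standard error reported above, not 1.96 times the sample proportion.

France Margin of Error

1.96*0.011

## [1] 0.02156

Germany Margin of Error

1.96*0.0159

## [1] 0.031164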
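
As a cross-check, the margins of error can be recomputed directly from the success counts and sample sizes in the inference output. The helper `margin_of_error` below is a hypothetical convenience function written for this sketch; it applies the usual normal-approximation formula with z* = 1.96. The results should be close to half the width of each reported interval (about 0.0216 for France and 0.0312 for Germany).

```r
# Hypothetical helper: 95% margin of error for a sample proportion.
margin_of_error <- function(successes, n, z_star = 1.96) {
  p_hat <- successes / n
  z_star * sqrt(p_hat * (1 - p_hat) / n)
}

me_france  <- margin_of_error(485, 1688)  # counts from the France output
me_germany <- margin_of_error(75, 502)    # counts from the Germany output

round(c(France = me_france, Germany = me_germany), 4)
```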

n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")

Exercise 8) Describe the relationship between p and me

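ME increases as p rises from 0 to 0.5, reaches its maximum at p = 0.5, and then decreases symmetrically as p approaches 1. The curve is an inverted U (ME is proportional to the square root of p(1 - p)), so the margin of error is largest when the population is evenly split.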
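
The location of the peak can be confirmed numerically from the same grid used for the plot. This sketch repeats the definitions of n, p, and me so that it runs on its own; `p_at_max` and `max_me` are names introduced here.

```r
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p) / n)

# Locate the proportion at which the margin of error is largest.
p_at_max <- p[which.max(me)]
max_me <- max(me)

p_at_max
max_me
```

For n = 1000 the maximum is 2 * sqrt(0.25 / 1000), a little over 3 percentage points.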

p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

Exercise 9) Describe the sampling distribution of sample proportions at n=1040 and p=0.1. Be sure to note the center, spread, and shape. Hint: Remember that R has functions such as mean to calculate summary statistics.

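The histogram is unimodal, symmetric, and approximately normal. It is centered near the population proportion of 0.1, and its spread should be close to the theoretical standard error, sqrt(0.1 × 0.9 / 1040) ≈ 0.0093.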
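
The center and spread can be quantified with mean and sd, as the hint suggests. The sketch below is self-contained and makes two substitutions that are assumptions of this example: it fixes a seed so the run is reproducible, and it uses rbinom to draw the number of "atheist" responses per sample, which is equivalent in distribution to counting successes in the sample() loop above.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible

p <- 0.1
n <- 1040

# Each draw is the number of "atheist" responses in one sample of size n.
p_hats <- rbinom(5000, size = n, prob = p) / n

center <- mean(p_hats)                    # should be close to p
spread <- sd(p_hats)                      # should be close to the theoretical SE
theoretical_se <- sqrt(p * (1 - p) / n)

c(center = center, spread = spread, theoretical_se = theoretical_se)
```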

Exercise 10) Repeat the above simulation three more times but with modified sample sizes and proportions: for n=400 and p=0.1, n=1040 and p=0.02, and n=400 and p=0.02. Plot all four histograms together by running the par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. Describe the three new sampling distributions. Based on these limited plots, how does n appear to affect the distribution of p̂? How does p affect the sampling distribution?

par(mfrow = c(2, 2))


hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

p <- 0.1
n <- 400
p_hats2 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats2[i] <- sum(samp == "atheist")/n
}


hist(p_hats2, main = "p = 0.1, n = 400", xlim = c(0, 0.18))

p <- 0.02
n <- 1040
p_hats3 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats3[i] <- sum(samp == "atheist")/n
}


hist(p_hats3, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))

p <- 0.02
n <- 400
p_hats4 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats4[i] <- sum(samp == "atheist")/n
}


hist(p_hats4, main = "p = 0.02, n = 400", xlim = c(0, 0.18))
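
To help compare the four histograms, the theoretical standard error of the sample proportion can be tabulated for each combination of n and p. This is a sketch based on the formula sqrt(p(1 - p)/n), not on the simulated values; the data frame name `scenarios` is introduced here.

```r
# The four (n, p) combinations simulated above.
scenarios <- data.frame(
  n = c(1040, 400, 1040, 400),
  p = c(0.1, 0.1, 0.02, 0.02)
)

# Theoretical standard error of the sample proportion for each combination.
scenarios$se <- with(scenarios, sqrt(p * (1 - p) / n))

scenarios
```

Reading the table, a smaller n gives a larger standard error at the same p, while a smaller p gives a smaller standard error and a center closer to 0.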

Exercise 11) If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?
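
One way to approach this question is to check the success-failure condition for each country, treating the Table 6 values quoted in the exercise as the true proportions. The sketch below is an illustration under that assumption; the data frame `countries` and its column names are introduced here.

```r
# Table 6 values quoted in the exercise, treated as the true proportions.
countries <- data.frame(
  country = c("Australia", "Ecuador"),
  n = c(1040, 400),
  p = c(0.1, 0.02)
)

# Expected numbers of successes and failures under those proportions.
countries$expected_successes <- countries$n * countries$p
countries$expected_failures  <- countries$n * (1 - countries$p)

# Success-failure condition: at least 10 expected of each.
countries$condition_met <- countries$expected_successes >= 10 &
  countries$expected_failures >= 10

countries
```

Under these assumptions Australia has about 104 expected atheist responses, while Ecuador has only about 8, which falls short of the usual threshold of 10. That suggests the normal approximation, and therefore the reported margin of error, is more questionable for Ecuador than for Australia.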