Inferences for Categorical Data

The survey

Exercise 1

In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?
Answer: These percentages are sample statistics. They were calculated from the responses of the people polled, not from the entire population, so they only estimate the population parameters, whose exact values remain unknown.

Exercise 2

The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?
Answer: To generalize the report’s findings to the global human population, we must assume that the respondents were selected by random sampling and that the sample observations are independent. In practice this means the interviewers took measures to select participants with minimal bias, and that the interview methodology itself did not introduce bias. Only 57 countries on five continents were included in the poll, so many layers of bias control would be needed to preserve independence, and participants would have to be randomly sampled at each location.
Second, because sampling is done without replacement, each sample should be less than 10% of its population for the observations to be treated as independent. For inference on a proportion, each sample should also contain at least 10 expected successes and 10 expected failures (np ≥ 10 and n(1-p) ≥ 10).
The sample sizes are large, so the size requirements seem reasonable. However, with a study spanning such a breadth of languages, cultures, geographies and demographics, there are many potential sources of bias, and these would need to be addressed in the design, methodology and execution of the survey.
download.file("http://www.openintro.org/stat/data/atheism.RData", destfile = "atheism.RData")
load("atheism.RData")
names(atheism)
## [1] "nationality" "response"    "year"
head(atheism)
##   nationality    response year
## 1 Afghanistan non-atheist 2012
## 2 Afghanistan non-atheist 2012
## 3 Afghanistan non-atheist 2012
## 4 Afghanistan non-atheist 2012
## 5 Afghanistan non-atheist 2012
## 6 Afghanistan non-atheist 2012

Exercise 3

What does each row of Table 6 correspond to? What does each row of atheism correspond to?
Answer: Each row of Table 6 corresponds to a country included in the survey. The columns record, for each country, the sample size and the percentage of respondents who described themselves as “a religious person”, “not a religious person”, “a convinced atheist”, or “Don’t know/no response”.
Each row of atheism corresponds to one individual respondent and records that person’s nationality, response (atheist or non-atheist) and the survey year.

Exercise 4

Using the command below, create a new dataframe called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?
us12 <- subset(atheism, nationality == "United States" & year == "2012")
us12ath <- subset(atheism, nationality == "United States" & year == "2012" & response == "atheist")
nrow(us12ath)/nrow(us12)
## [1] 0.0499002
Answer: The proportion of atheist responses calculated from the dataset, 5%, does agree with the percentage in Table 6.
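The same proportion can be computed in one step by averaging a logical vector. The sketch below is illustrative only: instead of loading the atheism data, it rebuilds a stand-in response vector from the counts reported for this sample (50 atheist and 952 non-atheist responses), and the names response and prop_atheist are hypothetical.

```r
# Stand-in for us12$response, rebuilt from the reported counts (assumption:
# 50 "atheist" and 952 "non-atheist" responses, n = 1002).
response <- factor(c(rep("atheist", 50), rep("non-atheist", 952)))

# The mean of a logical vector is the proportion of TRUE values.
prop_atheist <- mean(response == "atheist")
prop_atheist

# Equivalent tabulation of both categories as proportions.
prop.table(table(response))
```

With the real data, mean(us12$response == "atheist") should give the same value as nrow(us12ath)/nrow(us12) without creating a second subset.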

Inference on proportions

Exercise 5

Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?
Answer: The conditions for inference to construct a 95% confidence interval for the proportion of atheists in the US in 2012 are: 1. The observations are independent; 2. The poll is based on a simple random sample of less than 10% of the population; and 3. The sampling distribution of the sample proportion is nearly normal, which can be checked with the success-failure condition (np ≥ 10 and n(1-p) ≥ 10).
The US sample has 1002 participants, far less than 10% of the US population. The report states that random samples were used and that the poll was conducted independently, although this cannot be verified from the data alone. With about 50 atheist and 952 non-atheist responses, the success-failure condition is met.
inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

Exercise 6

Based on the R output, what is the margin of error for the estimate of the proportion of atheists in US in 2012?
Answer: The margin of error is the critical value z* multiplied by the standard error. The R output gives a standard error of 0.0069, and the critical value for a 95% confidence level is 1.96.
0.0069*1.96
## [1] 0.013524
Thus, the margin of error based on the R output is 0.0135 or 1.35%.
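As a cross-check, the margin of error can be rebuilt from the point estimate and sample size rather than from the rounded standard error. This is a minimal sketch using the values reported in the R output (p_hat = 0.0499, n = 1002); the variable names are illustrative, and qnorm(0.975) is used in place of the rounded 1.96.

```r
p_hat <- 0.0499   # sample proportion from the inference output
n <- 1002         # sample size from the inference output

# Standard error of a sample proportion.
se <- sqrt(p_hat * (1 - p_hat) / n)

# Critical value for a 95% confidence level (about 1.96).
z_star <- qnorm(0.975)

# Margin of error and the resulting interval.
me <- z_star * se
round(c(se = se, z_star = z_star, me = me), 4)
c(lower = p_hat - me, upper = p_hat + me)
```

The result should agree with the reported interval of (0.0364, 0.0634) up to rounding, since the margin of error is half the width of the interval.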

Exercise 7

Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference function to construct the confidence intervals.
Brazil12 <- subset(atheism, nationality == "Brazil" & year == "2012")

inference(Brazil12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.01 ;  n = 2002 
## Check conditions: number of successes = 20 ; number of failures = 1982 
## Standard error = 0.0022 
## 95 % Confidence interval = ( 0.0056 , 0.0143 )
SE <- 0.0022
Z_score <- 1.96
ME <- SE * Z_score
ME
## [1] 0.004312
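Answer: The 95% confidence interval for the proportion of atheists in Brazil in 2012 is between 0.0056 and 0.0143. The standard error for this interval is 0.0022, resulting in a margin of error of 0.0043 or 0.43%. The success-failure condition is met, with 20 successes and 1982 failures, and the sample of 2002 is far less than 10% of the population.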
czr12 <- subset(atheism, nationality == "Czech Republic" & year == "2012")
inference(czr12$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.3 ;  n = 1000 
## Check conditions: number of successes = 300 ; number of failures = 700 
## Standard error = 0.0145 
## 95 % Confidence interval = ( 0.2716 , 0.3284 )
SE <- 0.0145
Z_score <- 1.96
ME <- SE * Z_score
ME
## [1] 0.02842
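Answer: The 95% confidence interval for the proportion of atheists in the Czech Republic in 2012 is between 0.2716 and 0.3284. The standard error for this interval is 0.0145, resulting in a margin of error of 0.0284 or 2.84%. The success-failure condition is met, with 300 successes and 700 failures, and the sample of 1000 is far less than 10% of the population.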

How does proportion affect the margin of error?

n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")

Exercise 8

Describe the relationship between p (proportion) and me (the margin of error).
Answer: The relationship between p and the margin of error is not monotonic. The plot has an inverted-U shape that is symmetric about p = 0.5, where the margin of error reaches its maximum. Between 0 and 0.5, the margin of error increases as the population proportion increases. Between 0.5 and 1, the relationship reverses: as the proportion increases, the margin of error decreases. The margin of error is 0 at p = 0 and p = 1.
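The shape of the curve can also be confirmed numerically. The sketch below reuses the same n, p grid and formula as the plot above; the names me_curve and p_grid are illustrative.

```r
n <- 1000
p_grid <- seq(0, 1, 0.01)
me_curve <- 2 * sqrt(p_grid * (1 - p_grid) / n)

# Proportion at which the margin of error is largest, and its value there.
p_grid[which.max(me_curve)]
max(me_curve)

# Symmetry about 0.5: the curve read backwards matches the curve read forwards.
all.equal(me_curve, rev(me_curve))

# The margin of error vanishes at the endpoints.
me_curve[c(1, length(me_curve))]
```

The maximum falls at p = 0.5 with a margin of error of roughly 0.032, which is why p = 0.5 is the conservative choice when p is unknown.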

Success-failure condition

p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
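The loop above can also be written without an explicit loop, because the number of atheists in each simulated sample follows a binomial distribution. This is a sketch of that alternative, not the lab's own method; the name p_hats_vec and the seed are assumptions chosen for illustration.

```r
set.seed(1)  # assumption: arbitrary seed, fixed only for reproducibility

p <- 0.1
n <- 1040

# Draw 5000 binomial counts and convert each count to a sample proportion.
p_hats_vec <- rbinom(5000, size = n, prob = p) / n

# Center and spread of the simulated sampling distribution.
mean(p_hats_vec)
sd(p_hats_vec)

# Theoretical standard error for comparison.
sqrt(p * (1 - p) / n)
```

The simulated standard deviation should land close to the theoretical standard error of about 0.0093.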

Exercise 9

Describe the sampling distribution of sample proportions at n=1040 and p=0.1. Be sure to note the center, spread, and shape. Hint: Remember that R has functions such as mean to calculate summary statistics.
summary(p_hats)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07019 0.09327 0.09904 0.09969 0.10577 0.12981
sd(p_hats)
## [1] 0.009287382
boxplot(p_hats)

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18), probability = TRUE)
abline(v = mean(p_hats),col = "royalblue",lwd = 2)
abline(v = median(p_hats),
       col = "red",
       lwd = 2)
legend(x = "topright", # location of legend within plot area
       legend = c(sprintf("Mean (%.4f)", mean(p_hats)),     # labels computed from the data
                  sprintf("Median (%.4f)", median(p_hats))),
       col = c("royalblue", "red"),
       lwd = c(2, 2))
m <- mean(p_hats)
std <- sd(p_hats)

curve(dnorm(x, mean=m, sd=std), 
      col="darkblue", lwd=2, add=TRUE, yaxt="n")

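Answer: The sampling distribution of sample proportions at n=1040 and p=0.1 is nearly normal: unimodal and roughly symmetric. It is centered at the population proportion, with a mean (0.0997) and median (0.0990) both very close to 0.1. The spread is small: the standard deviation of the sample proportions (sd above) is about 0.0093, which matches the theoretical standard error sqrt(0.1 × 0.9 / 1040) ≈ 0.0093.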

Exercise 10

Repeat the above simulation three more times but with modified sample sizes and proportions: for n=400 and p=0.1, n=1040 and p=0.02, and n=400 and p=0.02. Plot all four histograms together by running the par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. Describe the three new sampling distributions. Based on these limited plots, how does n appear to affect the distribution of p̂? How does p affect the sampling distribution?
#to make 2 columns and 2 rows of histogram
par(mfrow = c(2, 2))

#first histogram
hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

#second simulation
p <- 0.1
n <- 400
p_hats2 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats2[i] <- sum(samp == "atheist")/n
}

#second histogram
hist(p_hats2, main = "p = 0.1, n = 400", xlim = c(0, 0.18))

#third simulation
p <- 0.02
n <- 1040
p_hats3 <- rep(0, 5000)


for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats3[i] <- sum(samp == "atheist")/n
}

#third histogram
hist(p_hats3, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))

#fourth simulation
p <- 0.02
n <- 400
p_hats4 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats4[i] <- sum(samp == "atheist")/n
}

#fourth histogram
hist(p_hats4, main = "p = 0.02, n = 400", xlim = c(0, 0.18))

summary(p_hats)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07019 0.09327 0.09904 0.09969 0.10577 0.12981
summary(p_hats2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05250 0.09000 0.10000 0.09976 0.11000 0.15500
summary(p_hats3)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.005769 0.017308 0.020192 0.019954 0.023077 0.039423
summary(p_hats4)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.01500 0.02000 0.01988 0.02500 0.04750
Answer: Descriptions:
p_hats2: p = 0.1, n = 400. In this simulation the mean and median (both about 0.100) are nearly identical to the first simulation. The range is wider, with a minimum of 0.0525 (instead of the p_hats minimum of 0.0702) and a maximum of 0.155 (instead of the p_hats maximum of 0.1298). The distribution is nearly normal.
p_hats3: p = 0.02, n = 1040. In this simulation the mean and median are much smaller, at 0.0200 and 0.0202 respectively. The distribution appears nearly normal, with a much narrower range (0.0058 to 0.0394 compared to 0.0702 to 0.1298), so every value is lower than the minimum of p_hats.
p_hats4: p = 0.02, n = 400. In this simulation the mean and median are closely aligned with those of p_hats3. The range is slightly larger in p_hats4 than in p_hats3 (0 to 0.0475), but the maximum is still smaller than the minimum of p_hats, 0.0702. Because np = 8 is below 10, the distribution is right-skewed and visibly discrete rather than normal.
Overall, this tells us that as the sample size n increases, the spread of the sampling distribution narrows, and with it the margin of error. The proportion p determines the center of the sampling distribution: the histograms with the smaller p sit further left (toward x = 0), and a larger p moves the whole distribution to the right and widens it, as shown in the fifth and sixth simulations and histograms below with p = 0.2 (the fifth has n = 400 and the sixth has n = 1040). p also affects the shape: when p is close to 0 and n is small, the distribution becomes right-skewed.
#to make 2 columns and 2 rows of histograms
par(mfrow = c(2, 2))

#fifth simulation
p <- 0.2
n <- 400
p_hats5 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats5[i] <- sum(samp == "atheist")/n
}

#fifth histogram
hist(p_hats5, main = "p = 0.2, n = 400", xlim = c(0, 0.18))
hist(p_hats5, main = "p = 0.2, n = 400", xlim = c(0, 0.30))

#sixth simulation
p <- 0.2
n <- 1040
p_hats6 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats6[i] <- sum(samp == "atheist")/n
}

#sixth histogram
hist(p_hats6, main = "p = 0.2, n = 1040", xlim = c(0, 0.18))
hist(p_hats6, main = "p = 0.2, n = 1040", xlim = c(0, 0.30))

par(mfrow = c(1,1))
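The patterns seen in the four required simulations can be compared against theory without simulating at all. The sketch below tabulates the theoretical standard error and the success-failure check for each combination of p and n from Exercise 10; the data frame scenarios and its column names are illustrative.

```r
# The four (p, n) combinations from Exercise 10.
scenarios <- data.frame(p = c(0.1, 0.1, 0.02, 0.02),
                        n = c(1040, 400, 1040, 400))

# Theoretical standard error of the sample proportion.
scenarios$se <- sqrt(scenarios$p * (1 - scenarios$p) / scenarios$n)

# Expected number of successes, and whether np >= 10 and n(1-p) >= 10.
scenarios$expected_successes <- scenarios$n * scenarios$p
scenarios$condition_met <- with(scenarios, n * p >= 10 & n * (1 - p) >= 10)

scenarios
```

Only the last combination (p = 0.02, n = 400) fails the success-failure condition, with 8 expected successes. For a fixed p, the smaller sample has the larger standard error.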

Exercise 11

If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?
0.02 * 400
## [1] 8
400 * (1 - 0.02)
## [1] 392
Answer: Given the shapes of the sampling distributions for Ecuador and Australia, I think it is only sensible to proceed with inference and report a margin of error for Australia. Australia meets all of the requirements: its sampling distribution is nearly normal, the sample size is sufficient, and np = 104 ≥ 10 and n(1-p) = 936 ≥ 10. Ecuador, however, has a skewed sampling distribution and np = 8 < 10. It is therefore not sensible to proceed with inference for Ecuador, as the report does.

On Your Own

1. Answer the following two questions using the inference function. As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference.
a. Is there convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012? Hint: Create a new data set for respondents from Spain. Form confidence intervals for the true proportion of atheists in both years, and determine whether they overlap.
par(mfrow = c(1, 2))

spain05 <- subset(atheism,nationality == "Spain" & year == "2005")
spain05atheists <- subset(spain05,response == "atheist")

spain12 <- subset(atheism,nationality == "Spain" & year == "2012")
spain12atheists <- subset(spain12, response == "atheist")

inference(spain05$response, est = "proportion",type = "ci",method = "theoretical", success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.1003 ;  n = 1146 
## Check conditions: number of successes = 115 ; number of failures = 1031 
## Standard error = 0.0089 
## 95 % Confidence interval = ( 0.083 , 0.1177 )
inference(spain12$response, est = "proportion",type = "ci",method = "theoretical", success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.09 ;  n = 1145 
## Check conditions: number of successes = 103 ; number of failures = 1042 
## Standard error = 0.0085 
## 95 % Confidence interval = ( 0.0734 , 0.1065 )
# Margin of error for Spain 2005 (SE = 0.0089)
ME <- 0.0089 * 1.96
ME
## [1] 0.017444
# Margin of error for Spain 2012 (SE = 0.0085)
ME <- 0.0085 * 1.96
ME
## [1] 0.01666
p_spn05 = 0.1003
n_spn05 = 1146 
p_spn12 = 0.09
n_spn12 = 1145 

PE_spn = p_spn12 - p_spn05

SE_spn = sqrt((p_spn05*(1-p_spn05)/n_spn05)+(p_spn12*(1-p_spn12)/n_spn12))
SE_spn
## [1] 0.01225854
#Confidence interval for the difference between the proportions in 2005 and 2012
PE_spn + (1.96*SE_spn)
## [1] 0.01372674
PE_spn - (1.96*SE_spn)
## [1] -0.03432674
Answer: The hypotheses are H0: the proportion of atheists in Spain was the same in 2005 and 2012, and HA: the proportion changed.
Spain 05: 95% confidence interval: 0.083 , 0.1177 standard error: 0.0089 margin of error: 0.017
Spain 12: 95% confidence interval: 0.0734 , 0.1065 standard error: 0.0085 margin of error: 0.017
There is not convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012. The 95% confidence intervals for the two years overlap substantially, and the standard errors and margins of error are similar. The inference bar graphs appear to show a minor reduction in the proportion of atheists in 2012 compared to 2005, but the data suggest this could simply be the result of sampling variability. The success-failure condition is met in both years (115 and 103 successes).
In addition, the confidence interval for the difference in proportions (-0.03 to 0.01) includes zero. This means we fail to reject the null hypothesis that the true atheism index was the same in the two years; it does not prove that the index was unchanged.
b. Is there convincing evidence that the United States has seen a change in its atheism index between 2005 and 2012?
usa05 <- subset(atheism, nationality == "United States" & year == "2005")
usa12 <- subset(atheism, nationality == "United States" & year == "2012")

inference(usa05$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.01 ;  n = 1002 
## Check conditions: number of successes = 10 ; number of failures = 992 
## Standard error = 0.0031 
## 95 % Confidence interval = ( 0.0038 , 0.0161 )
inference(usa12$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )
# Margin of error for USA 2005 (SE = 0.0031)
ME <- 0.0031 * 1.96
ME
## [1] 0.006076
# Margin of error for USA 2012 (SE = 0.0069)
ME <- 0.0069 * 1.96
ME
## [1] 0.013524
p_usa05 = 0.01
n_usa05 = 1002 
p_usa12 = 0.05
n_usa12 = 1002 

PE_usa = p_usa12 - p_usa05

SE_usa = sqrt(((p_usa05*(1-p_usa05))/n_usa05)+((p_usa12*(1-p_usa12))/n_usa12))
SE_usa
## [1] 0.007568714
#Confidence interval for the difference between the proportions in 2005 and 2012
PE_usa + (1.96*SE_usa)
## [1] 0.05483468
PE_usa - (1.96*SE_usa)
## [1] 0.02516532
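Answer: The hypotheses are H0: the proportion of atheists in the US was the same in 2005 and 2012, and HA: the proportion changed. There is convincing evidence of a change in the atheism index in the USA from 2005 to 2012.
USA 05: 95% confidence interval: 0.0038 , 0.0161 standard error: 0.0031 margin of error: 0.006
USA 12: 95% confidence interval: 0.0364 , 0.0634 standard error: 0.0069 margin of error: 0.014
The two intervals do not overlap, and the confidence interval for the difference in proportions (0.025 to 0.055) does not include 0, so we reject the null hypothesis that the proportion of atheists did not change between 2005 and 2012. Note that the 2005 sample has exactly 10 successes, so the success-failure condition is only just met.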
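The interval for a difference in proportions is computed the same way for Spain and for the United States, so the steps can be collected in one small helper. This is a sketch under the same assumptions as the calculations above (independent samples, critical value 1.96); the function name diff_prop_ci is hypothetical and is not part of the inference function.

```r
# Hypothetical helper: 95% confidence interval for p2 - p1 using the
# unpooled standard error, as in the manual calculations above.
diff_prop_ci <- function(p1, n1, p2, n2, z_star = 1.96) {
  point_est <- p2 - p1
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  c(lower = point_est - z_star * se, upper = point_est + z_star * se)
}

# Spain, 2005 vs 2012: the interval includes 0.
ci_spain <- diff_prop_ci(0.1003, 1146, 0.09, 1145)
ci_spain

# United States, 2005 vs 2012: the interval lies entirely above 0.
ci_usa <- diff_prop_ci(0.01, 1002, 0.05, 1002)
ci_usa
```

An interval that contains 0 is consistent with no change; an interval that excludes 0 indicates a change at the 5% significance level.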
2. If in fact there has been no change in the atheism index in the countries listed in Table 4, in how many of those countries would you expect to detect a change (at a significance level of 0.05) simply by chance? Hint: Look in the textbook index under Type 1 error.
Answer: At a significance level of 0.05, which corresponds to using 95% confidence intervals, we would expect to detect a change in about 5% of the countries simply by chance, even though no change occurred. Each of these would be a Type 1 error: rejecting the null hypothesis that the proportions in 2005 and 2012 are the same when it is in fact true.
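The 5% figure can be illustrated with a small simulation. The setup below is hypothetical and does not use the survey data: it imagines many countries whose true proportion is identical in both years (the values n_sim = 1000 and p_true = 0.1 are assumptions) and counts how often a 95% interval for the difference excludes zero.

```r
set.seed(42)  # assumption: arbitrary seed, fixed only for reproducibility

reps <- 5000     # number of simulated "countries"
n_sim <- 1000    # assumed sample size per country per year
p_true <- 0.1    # assumed true proportion, identical in both years

# Simulated sample proportions for the two survey years.
p_hat_05 <- rbinom(reps, size = n_sim, prob = p_true) / n_sim
p_hat_12 <- rbinom(reps, size = n_sim, prob = p_true) / n_sim

# The 95% interval for the difference excludes 0 exactly when the absolute
# difference exceeds z* times its standard error.
se_diff <- sqrt(p_hat_05 * (1 - p_hat_05) / n_sim +
                p_hat_12 * (1 - p_hat_12) / n_sim)
false_positive <- abs(p_hat_12 - p_hat_05) > qnorm(0.975) * se_diff

# Share of simulated countries showing a spurious change.
type1_rate <- mean(false_positive)
type1_rate
```

The share of false positives should come out close to 0.05, matching the chosen significance level.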
3. Suppose you’re hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for p. How many people would you have to sample to ensure that you are within the guidelines? Hint: Refer to your plot of the relationship between p and margin of error. Do not use the data set to answer this question.

# ME = z* * SE, where SE = sqrt(p(1 - p)/n) and z* = 1.96
# ME = z* * sqrt(p(1 - p)/n)
# ME / z* = sqrt(p(1 - p)/n)
# (ME / z*)^2 = p(1 - p)/n
# n = p(1 - p) / (ME / z*)^2
#
# With no prior estimate of p, use the worst case p = 0.5, where the
# margin of error is largest, together with ME = 0.01 and z* = 1.96:
# n = 0.5(1 - 0.5) / (0.01 / 1.96)^2

 0.5*(1-0.5)
## [1] 0.25
 (0.01 / 1.96)^2
## [1] 2.603082e-05
 0.25/2.603082e-05
## [1] 9604
Answer: The sample size should be greater than or equal to 9604.
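The derivation can be wrapped in a short function that rounds up to a whole number of people. This is a sketch with a hypothetical name, required_n; it uses qnorm for the critical value instead of the rounded 1.96 and defaults to the conservative p = 0.5.

```r
# Hypothetical helper: smallest n whose margin of error is at most `me`
# at the given confidence level, assuming proportion p.
required_n <- function(me, conf = 0.95, p = 0.5) {
  z_star <- qnorm(1 - (1 - conf) / 2)
  ceiling(p * (1 - p) * (z_star / me)^2)
}

# Margin of error of 1% at 95% confidence with p unknown (p = 0.5).
required_n(0.01)

# A prior guess away from 0.5 would need fewer respondents.
required_n(0.01, p = 0.1)
```

Using p = 0.5 guarantees the guideline is met whatever the true proportion turns out to be, because any other p gives a smaller margin of error for the same n.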