Inference for Categorical Data

Exercise 1: In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?

Answer: These percentages appear to be sample statistics, since they are calculated from survey samples taken in countries around the world rather than from the entire population.

Exercise 2: The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?

Answer: We must assume that respondents were randomly sampled from the population, that their responses are independent of one another, and that the samples are large enough. Independence seems reasonable because each sample is a very small fraction of its country's population. Whether the samples are truly random is less certain, since surveys are subject to voluntary response bias.

Exercise 3: What does each row of Table 6 correspond to? What does each row of atheism correspond to?

download.file("http://www.openintro.org/stat/data/atheism.RData", destfile = "atheism.RData")
load("atheism.RData")

Answer: Each row of Table 6 corresponds to an individual country. Each row of atheism corresponds to one observation, that is, a single respondent's answer.

Exercise 4: Using the command below, create a new dataframe called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?

us12 <- subset(atheism, nationality == "United States" & year == "2012")
summary(us12)
##         nationality          response        year     
##  United States:1002   atheist    : 50   Min.   :2012  
##  Afghanistan  :   0   non-atheist:952   1st Qu.:2012  
##  Argentina    :   0                     Median :2012  
##  Armenia      :   0                     Mean   :2012  
##  Australia    :   0                     3rd Qu.:2012  
##  Austria      :   0                     Max.   :2012  
##  (Other)      :   0

Answer: The proportion of atheist responses in the United States is 50/1002 = 0.0499, or 4.99%. This agrees with the 5% reported in Table 6 once rounded.
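
The same proportion can be computed directly instead of being read off the summary() output. The following is a minimal sketch that assumes only the counts shown above (50 atheist, 952 non-atheist); the vector `response` is a stand-in for us12$response so that the snippet runs without the downloaded data.

```r
# Rebuild the response column from the counts in the summary output
# (stand-in for us12$response; assumes 50 atheist, 952 non-atheist).
response <- factor(rep(c("atheist", "non-atheist"), times = c(50, 952)))

# Proportion of "atheist" responses: the mean of a logical vector
p_hat <- mean(response == "atheist")
round(p_hat, 4)

# prop.table() gives both proportions at once
prop.table(table(response))
```

With the real data, the same two calls would be applied to us12$response.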

Exercise 5: Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?

Answer: The conditions are that the sample is random, the observations are independent, and there are at least 10 successes and 10 failures. The data probably come close to meeting the first two, although responses are given voluntarily and collected at specific polling sites, so the sample is not completely random. This may be as close as we can realistically get given the size of the United States population. The success-failure condition is met, with 50 successes and 952 failures.

Exercise 6: Based on the R output, what is the margin of error for the estimate of the proportion of atheists in the US in 2012?

inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )
# Critical value from the normal model, which is what a proportion interval uses
qnorm(0.975)*0.0069
## [1] 0.01352375

Answer: The margin of error for the estimate of the proportion of atheists in the US in 2012 is 0.0135.
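
The margin of error can also be worked out by hand from the counts, which makes the role of each quantity explicit. This is a sketch of the usual normal-approximation formula, not a reproduction of the internals of inference(); the variable names are illustrative.

```r
# Counts taken from the inference() output above
successes <- 50
n <- 1002

p_hat <- successes / n
se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of p_hat
z_star <- qnorm(0.975)                # critical value for 95% confidence
me <- z_star * se                     # margin of error
ci <- c(p_hat - me, p_hat + me)       # confidence interval

round(c(se = se, me = me, lower = ci[1], upper = ci[2]), 4)
```

Rounded to four decimal places, these values should match the standard error and confidence interval reported by inference(). The margin of error is also half the width of the interval.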

Exercise 7: Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference function to construct the confidence intervals.

ar12 <- subset(atheism, nationality == "Argentina" & year == "2012")
summary(ar12)
##       nationality         response        year     
##  Argentina  :991   atheist    : 70   Min.   :2012  
##  Afghanistan:  0   non-atheist:921   1st Qu.:2012  
##  Armenia    :  0                     Median :2012  
##  Australia  :  0                     Mean   :2012  
##  Austria    :  0                     3rd Qu.:2012  
##  Azerbaijan :  0                     Max.   :2012  
##  (Other)    :  0
inference(ar12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0706 ;  n = 991 
## Check conditions: number of successes = 70 ; number of failures = 921 
## Standard error = 0.0081 
## 95 % Confidence interval = ( 0.0547 , 0.0866 )
qnorm(0.975)*0.0081
## [1] 0.01587571
ca12 <- subset(atheism, nationality == "Canada" & year == "2012")
summary(ca12)
##       nationality          response        year     
##  Canada     :1002   atheist    : 90   Min.   :2012  
##  Afghanistan:   0   non-atheist:912   1st Qu.:2012  
##  Argentina  :   0                     Median :2012  
##  Armenia    :   0                     Mean   :2012  
##  Australia  :   0                     3rd Qu.:2012  
##  Austria    :   0                     Max.   :2012  
##  (Other)    :   0
inference(ca12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0898 ;  n = 1002 
## Check conditions: number of successes = 90 ; number of failures = 912 
## Standard error = 0.009 
## 95 % Confidence interval = ( 0.0721 , 0.1075 )
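qnorm(0.975)*0.009
## [1] 0.01763968

Answer: The conditions for inference are met for both countries, since each sample has well over 10 successes and 10 failures (70 and 921 for Argentina, 90 and 912 for Canada), assuming the respondents were randomly sampled. The margin of error is about 0.0159 for Argentina and 0.0176 for Canada.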
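
To avoid repeating the calculation for each country, the condition check and the margin of error can be wrapped in a small helper. The function `margin_of_error` below is hypothetical (it is not part of the lab's inference() code), and the counts are copied from the summary() output above.

```r
# Hypothetical helper: normal-approximation margin of error for a proportion
margin_of_error <- function(successes, n, conf_level = 0.95) {
  p_hat <- successes / n
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  z_star * sqrt(p_hat * (1 - p_hat) / n)
}

# Counts copied from the summary() output above
countries <- data.frame(
  country   = c("Argentina", "Canada"),
  successes = c(70, 90),
  n         = c(991, 1002)
)

# Success-failure condition: at least 10 successes and 10 failures
countries$condition_met <- with(countries, successes >= 10 & (n - successes) >= 10)
countries$me <- with(countries, margin_of_error(successes, n))
countries
```

Because the helper uses the unrounded standard error, its results can differ in the fourth decimal place from values obtained by multiplying the rounded standard errors shown in the output.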

Exercise 8: Describe the relationship between p and me.

n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")

Answer: The margin of error is greatest when the population proportion is 0.5 and shrinks symmetrically toward 0 as the proportion approaches 0 or 1.
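
The location of the peak can be confirmed numerically rather than read off the plot. The sketch below reuses the n, p, and me definitions from the exercise; the names `p_at_max`, `max_me`, and `me_at_ends` are introduced here for illustration.

```r
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p) / n)

# Proportion at which the margin of error peaks, and the peak value
p_at_max <- p[which.max(me)]
max_me <- max(me)

# Margin of error at the two ends of the grid (p = 0 and p = 1)
me_at_ends <- me[c(1, length(me))]

c(p_at_max = p_at_max, max_me = max_me)
```

The peak follows from the p * (1 - p) term, which is largest at p = 0.5, where the margin of error equals 2 * sqrt(0.25 / n).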

Exercise 9: Describe the sampling distribution of sample proportions at n=1040 and p=0.1. Be sure to note the center, spread, and shape. Hint: Remember that R has functions such as mean to calculate summary statistics.

p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

summary(p_hats)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07019 0.09327 0.09904 0.09969 0.10577 0.12981

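Answer: The sampling distribution of sample proportions at n = 1040 and p = 0.1 is approximately normal, although slightly right skewed. It is centered at a mean of 0.09969, very close to the true proportion of 0.1. The sample proportions range from 0.07019 to 0.12981, with the middle 50% falling between 0.09327 and 0.10577.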
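
The simulated center and spread can be compared with what theory predicts: a mean of p and a standard error of sqrt(p * (1 - p) / n), which is about 0.0093 here. The sketch below is an assumption-laden variant of the loop above: it uses rbinom() instead of sample() to draw the count of "atheist" responses, and it sets an arbitrary seed so the run is reproducible (the original simulation did not set one, so its numbers will differ).

```r
set.seed(1)  # arbitrary seed so the run is reproducible
p <- 0.1
n <- 1040

# rbinom() draws the number of "atheist" responses in each of 5000 samples,
# which is equivalent in distribution to the sample()-based loop
p_hats_sim <- rbinom(5000, size = n, prob = p) / n

simulated   <- c(center = mean(p_hats_sim), spread = sd(p_hats_sim))
theoretical <- c(center = p, spread = sqrt(p * (1 - p) / n))
round(rbind(simulated, theoretical), 4)
```

Reporting the standard deviation alongside the range gives a measure of spread that does not depend on the most extreme simulated samples.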

Exercise 10: Repeat the above simulation three more times but with modified sample sizes and proportions: for n=400 and p=0.1, n=1040 and p=0.02, and n=400 and p=0.02. Plot all four histograms together by running the par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. Describe the three new sampling distributions. Based on these limited plots, how does n appear to affect the distribution of p̂? How does p affect the sampling distribution?

par(mfrow = c(2,2))

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

p <- 0.1
n <- 400
p_hats1 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats1[i] <- sum(samp == "atheist")/n
}

hist(p_hats1, main = "p = 0.1, n = 400", xlim = c(0, 0.18))
summary(p_hats1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05250 0.09000 0.10000 0.09976 0.11000 0.15500
p <- 0.02
n <- 1040
p_hats2 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats2[i] <- sum(samp == "atheist")/n
}

hist(p_hats2, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))
summary(p_hats2)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.005769 0.017308 0.020192 0.019954 0.023077 0.039423
p <- 0.02
n <- 400
p_hats3 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats3[i] <- sum(samp == "atheist")/n
}

hist(p_hats3, main = "p = 0.02, n = 400", xlim = c(0, 0.18))

summary(p_hats3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.01500 0.02000 0.01988 0.02500 0.04750

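Answer: The first two histograms (p_hats and p_hats1, with p = 0.1) are centered near 0.1, and the last two (p_hats2 and p_hats3, with p = 0.02) are centered near 0.02, so p determines the center of the p̂ distribution: a lower p shifts it to the left. A lower p also narrows the distribution and, at p = 0.02, makes it more right skewed because it is bounded below by 0. A higher n reduces the spread: at p = 0.1 the sample proportions range from 0.0525 to 0.155 with n = 400, but only from about 0.070 to 0.130 with n = 1040.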
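
The four simulations repeat the same loop with different values of n and p, so they can be summarized side by side with one helper. This is a sketch under stated assumptions: `simulate_p_hats` is a hypothetical function, rbinom() replaces the sample()-based loop, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical helper: simulate sample proportions for a given n and p
simulate_p_hats <- function(n, p, reps = 5000) {
  rbinom(reps, size = n, prob = p) / n
}

# All four combinations of sample size and proportion
settings <- expand.grid(n = c(1040, 400), p = c(0.1, 0.02))

settings$sim_mean <- NA_real_
settings$sim_sd   <- NA_real_
for (i in seq_len(nrow(settings))) {
  sims <- simulate_p_hats(settings$n[i], settings$p[i])
  settings$sim_mean[i] <- mean(sims)
  settings$sim_sd[i]   <- sd(sims)
}

# Theoretical standard error and expected number of successes
settings$theory_se <- with(settings, sqrt(p * (1 - p) / n))
settings$expected_successes <- with(settings, n * p)
settings
```

Comparing sim_sd across rows shows the effect of n on spread, while sim_mean tracks p. The expected_successes column flags the one setting (n = 400, p = 0.02) where fewer than 10 successes are expected.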

Exercise 11: If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?

Answer: Australia matches the sampling distribution of p_hats, while Ecuador matches the sampling distribution of p_hats3. For Australia, the expected numbers of successes and failures (104 and 936) are well above 10 and the sampling distribution is approximately normal, so I believe it is sensible to proceed with inference and report the margin of error. For Ecuador, only 0.02 × 400 = 8 successes are expected, which falls short of the 10 required, and the sampling distribution is right skewed and bounded below by 0. I therefore do not think it is sensible to proceed with inference and report a margin of error based on the normal model.
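
One way to back up this judgment is to check the success-failure condition on the expected counts, treating the Table 6 values as the truth as the exercise asks. The data frame `checks` and its column names are illustrative.

```r
# Values from Table 6, treated as the true proportions for this exercise
checks <- data.frame(
  country = c("Australia", "Ecuador"),
  n = c(1040, 400),
  p = c(0.1, 0.02)
)

# Expected counts under the assumed true proportions
checks$expected_successes <- checks$n * checks$p
checks$expected_failures  <- checks$n * (1 - checks$p)

# Success-failure condition: both expected counts at least 10
checks$condition_met <- checks$expected_successes >= 10 &
  checks$expected_failures >= 10
checks
```

Australia passes with 104 expected successes, while Ecuador has only 8, so the normal approximation behind the reported margin of error is doubtful for Ecuador.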