Inference for Categorical Data

The Survey

Exercise 1: In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?

The findings that 13% of the global population is atheist, 23% is non-religious, and 59% is religious are based on polling. These percentages are therefore sample statistics, derived from a sample of 50,000 men and women across 57 countries, rather than population parameters.

Exercise 2: The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?

To generalize the report’s findings to the global human population, we must assume that the respondents are a random sample of that population and that the observations are independent. Because only 57 of the world’s 195 countries were polled, a stratified random sampling method probably should have been used, with countries drawn at random from the relatively homogeneous strata of the global population (e.g. Asia, the Middle East, Europe, South America). The report claims that the 57 randomly sampled countries span 5 continents. Given that one of the 7 continents has no permanent population and that another (Oceania) is relatively small, this seems like a reasonable assumption.

The Data

download.file("http://www.openintro.org/stat/data/atheism.RData", destfile = "atheism.RData")
load("atheism.RData")

Exercise 3: What does each row of Table 6 correspond to? What does each row of atheism correspond to?

#Each row of Table 6 corresponds to a country, giving its sample size and response percentages.
head(atheism)

##   nationality    response year
## 1 Afghanistan non-atheist 2012
## 2 Afghanistan non-atheist 2012
## 3 Afghanistan non-atheist 2012
## 4 Afghanistan non-atheist 2012
## 5 Afghanistan non-atheist 2012
## 6 Afghanistan non-atheist 2012

str(atheism)

## 'data.frame':    88032 obs. of  3 variables:
##  $ nationality: Factor w/ 57 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ response   : Factor w/ 2 levels "atheist","non-atheist": 2 2 2 2 2 2 2 2 2 2 ...
##  $ year       : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...

#Each row of the atheism dataframe corresponds to the response of a single individual from a particular country.

Exercise 4: Using the command below, create a new dataframe called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?

library(plyr)
us12 <- subset(atheism, nationality == "United States" & year == "2012")
usatheism <- subset(us12,response == "atheist")
us_ath_count <- count(usatheism)
us_total <- count(us12)
us_ath_count

##     nationality response year freq
## 1 United States  atheist 2012   50

us_total

##     nationality    response year freq
## 1 United States     atheist 2012   50
## 2 United States non-atheist 2012  952

us_ath_perc <- 50/1002
us_ath_perc

## [1] 0.0499002

#Yes, it agrees with the percentage in Table 6. The computed proportion is 4.99%, which rounds to the 5% reported in the table.
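
The proportion can also be computed without typing the counts in by hand. The sketch below is only an illustration: it rebuilds a stand-in for the response column from the counts printed above (50 atheist, 952 non-atheist), and the names `response` and `us_ath_prop` are hypothetical. With the real data, `us12$response` would take the place of `response`.

```r
# Stand-in for us12$response, rebuilt from the counts shown by count() above
# (assumption: 50 "atheist" rows and 952 "non-atheist" rows).
response <- factor(c(rep("atheist", 50), rep("non-atheist", 952)))

# The mean of a logical vector is the proportion of TRUE values.
us_ath_prop <- mean(response == "atheist")
round(us_ath_prop, 4)

# The same information for every response level at once.
prop.table(table(response))
```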

Inference on Proportions

Exercise 5: Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?

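The US sample must have been randomly selected, the observations within the sample must be independent of one another (the sample of 1,002 is well under 10% of the US population), and the sampling distribution of the sample proportion must be nearly normal, which we check with the success-failure condition (i.e. np ≥ 10 and n(1-p) ≥ 10). With 50 successes and 952 failures, the success-failure condition is met. Random selection and independence cannot be verified from the data alone, but they seem reasonable for a survey of this kind.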

inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

Exercise 6: Based on the R output, what is the margin of error for the estimate of the proportion of atheists in the US in 2012?

While the Report finds the margin of error for surveys of this kind to be 3-5%, the R output finds the margin of error of the proportion of US atheists to be about 1.35% (i.e. 0.0069 * 1.96).
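
The standard error, margin of error, and interval in the output above can be reproduced by hand with base R. This is a minimal sketch that assumes the usual normal-approximation interval for a single proportion and takes the counts (50 successes out of 1002) from the output. The variable names are illustrative, and `qnorm(0.975)` is used in place of the rounded 1.96.

```r
n <- 1002          # US respondents in 2012 (from the output above)
successes <- 50    # atheist responses (from the output above)

p_hat <- successes / n
se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion
z <- qnorm(0.975)                     # critical value for 95% confidence
me <- z * se                          # margin of error
ci <- c(lower = p_hat - me, upper = p_hat + me)

round(c(p_hat = p_hat, se = se, me = me), 4)
round(ci, 4)
```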

Exercise 7: Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference function to construct the confidence intervals.

china12 <- subset(atheism,nationality == "China" & year == "2012")
inference(china12$response,est = "proportion",type = "ci",method = "theoretical",success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.47 ;  n = 500 
## Check conditions: number of successes = 235 ; number of failures = 265 
## Standard error = 0.0223 
## 95 % Confidence interval = ( 0.4263 , 0.5137 )

germany12 <- subset(atheism,nationality == "Germany" & year == "2012")
inference(germany12$response,est = "proportion",type = "ci",method = "theoretical",success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.1494 ;  n = 502 
## Check conditions: number of successes = 75 ; number of failures = 427 
## Standard error = 0.0159 
## 95 % Confidence interval = ( 0.1182 , 0.1806 )

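The conditions for inference appear to be met for both China and Germany: each sample has well over 10 successes and 10 failures, and we assume the respondents were randomly and independently sampled. The margin of error for the proportion of Chinese atheists is about 4.4% (1.96 * 0.0223), while the margin of error for the proportion of German atheists is about 3.1% (1.96 * 0.0159). Both fall within the 3-5% margin of error the report quotes for surveys of this kind.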
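
Since the interval is symmetric about the point estimate, the margin of error is half the width of the confidence interval. The sketch below recovers it from the interval endpoints printed above; the matrix `ci` and the vector `me_country` are illustrative names.

```r
# 95% confidence interval endpoints copied from the inference() output above.
ci <- rbind(China   = c(lower = 0.4263, upper = 0.5137),
            Germany = c(lower = 0.1182, upper = 0.1806))

# Margin of error = half the interval width.
me_country <- (ci[, "upper"] - ci[, "lower"]) / 2
round(me_country, 4)
```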

How does the Proportion Affect the Margin of Error?

Exercise 8: Describe the relationship between p and me.

n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")

#For a sample size of 1000, the margin of error is largest at p = 0.5, where it reaches roughly 0.032, or 3.2%. As the population proportion approaches either 0 or 1, the margin of error shrinks toward 0, because nearly every respondent would then give the same answer and the sample proportion would barely vary from sample to sample. The relationship is symmetric about p = 0.5.
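
The location and size of the peak can be read off numerically rather than from the plot. This short sketch reuses the same `n`, `p`, and `me` definitions as the code above; `p_at_max` and `me_max` are illustrative names.

```r
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p) / n)

p_at_max <- p[which.max(me)]   # proportion at which the margin of error peaks
me_max <- max(me)              # largest margin of error for n = 1000
c(p_at_max = p_at_max, me_max = round(me_max, 4))
```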

Success-Failure Condition

p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

summary(p_hats)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07019 0.09327 0.09904 0.09969 0.10577 0.12981

sd(p_hats)

## [1] 0.009287382

Exercise 9: Describe the sampling distribution of sample proportions at n=1040 and p=0.1. Be sure to note the center, spread, and shape.

The sampling distribution of sample proportions at n = 1040 and p = 0.1 is nearly normal: it is fairly symmetric about its center at p = 0.1 and has little spread. This can be seen from the mean (0.0997) and median (0.0990) being nearly the same and from the small standard error of about 0.009.
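
As a check on the simulation, the spread can be compared with the theoretical standard error sqrt(p(1-p)/n). The sketch below uses the simulated standard deviation printed above; the names `se_theory` and `se_simulated` are illustrative.

```r
p <- 0.1
n <- 1040

se_theory <- sqrt(p * (1 - p) / n)   # theoretical standard error of p-hat
se_simulated <- 0.009287382          # sd(p_hats) copied from the output above

round(c(theory = se_theory, simulated = se_simulated), 5)

# Success-failure condition: expected successes and failures.
c(successes = n * p, failures = n * (1 - p))
```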

Exercise 10: Repeat the above simulation three more times but with modified sample sizes and proportions: for n=400 and p=0.1, n=1040 and p=0.02, and n=400 and p=0.02. Plot all four histograms together by running the par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. Describe the three new sampling distributions. Based on these limited plots, how does n appear to affect the distribution of p̂ ? How does p affect the sampling distribution?

n1 = 400
p1 = 0.1

n2 = 1040
p2 = 0.02

n3 = 400
p3 = 0.02

p_hats1 <- rep(0,5000)

for(i in 1:5000){
  samp1 <- sample(c("atheist","non_atheist"),n1,replace = TRUE,prob = c(p1, (1-p1)))
  p_hats1[i] <- sum(samp1 == "atheist")/n1
}

p_hats2 <- rep(0,5000)

for(i in 1:5000){
  samp2 <- sample(c("atheist","non_atheist"),n2,replace = TRUE,prob = c(p2, (1-p2)))
  p_hats2[i] <- sum(samp2 == "atheist")/n2
}

p_hats3 <- rep(0,5000)

for(i in 1:5000){
  samp3 <- sample(c("atheist","non_atheist"),n3,replace = TRUE,prob = c(p3, (1-p3)))
  p_hats3[i] <- sum(samp3 == "atheist")/n3
}

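par(mfrow=c(2,2))
#Include the original simulation (p_hats from above) so that all four panels are filled
hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
hist(p_hats1, main = "p = 0.1, n = 400", xlim = c(0, 0.18))
hist(p_hats2, main = "p = 0.02, n = 1040", xlim = c(0, 0.05))
hist(p_hats3, main = "p = 0.02, n = 400", xlim = c(0, 0.05))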

summary(p_hats1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05250 0.09000 0.10000 0.09976 0.11000 0.15500

SE1 <- sd(p_hats1)
ME1 <- 2*SE1
SE1

## [1] 0.01504642

ME1

## [1] 0.03009284

summary(p_hats2)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.005769 0.017308 0.020192 0.019954 0.023077 0.039423

sd(p_hats2)

## [1] 0.004422505

SE2 <- sd(p_hats2)
ME2 <- 2*SE2
SE2

## [1] 0.004422505

ME2

## [1] 0.008845009

summary(p_hats3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.01500 0.02000 0.01988 0.02500 0.04750

sd(p_hats3)

## [1] 0.006976835

SE3 <- sd(p_hats3)
ME3 <- 2*SE3
SE3

## [1] 0.006976835

ME3

## [1] 0.01395367

n affects the spread (standard error) of the sampling distribution. The larger the value of n, the smaller the standard error and, therefore, the margin of error. For example, the histogram with p = 0.1 and n = 400 is much more spread out (SE ≈ 0.015) than the histogram with the same proportion and n = 1040 (SE ≈ 0.009). The same holds for the histograms with p = 0.02: the standard error is about 0.007 at n = 400 but only about 0.004 at n = 1040. Moreover, as n increases, the shape of the histogram becomes closer to normal.

p affects both the margin of error and the skew of the sampling distribution. The closer p is to 0, the more right-skewed the histogram will be and, although it is not shown in these histograms, I would expect that the closer p is to 1 (e.g. p = 0.9), the more left-skewed it will be. In this case, changing p from 0.1 to 0.02 introduced right skew and decreased the margin of error. For p below 0.5, the larger the proportion p, the larger the margin of error.
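
The four simulations can be written more compactly. The sketch below is not the lab's method: it swaps the `sample()` loop for `rbinom()`, which draws the number of atheists in each simulated sample directly, and compares each simulated standard error with sqrt(p(1-p)/n). The helper `sim_p_hats`, the data frame `scenarios`, and the seed are all illustrative assumptions.

```r
set.seed(1)  # arbitrary seed, chosen here only for reproducibility

# Draw `reps` sample proportions for samples of size n with true proportion p.
sim_p_hats <- function(n, p, reps = 5000) {
  rbinom(reps, size = n, prob = p) / n
}

# The four (n, p) combinations considered in this exercise.
scenarios <- data.frame(n = c(1040, 400, 1040, 400),
                        p = c(0.1, 0.1, 0.02, 0.02))

# Simulated and theoretical standard errors, side by side.
scenarios$se_sim <- mapply(function(n, p) sd(sim_p_hats(n, p)),
                           scenarios$n, scenarios$p)
scenarios$se_theory <- sqrt(scenarios$p * (1 - scenarios$p) / scenarios$n)
scenarios
```

Both columns should show the same pattern described above: the standard error falls as n grows and, for p below 0.5, as p shrinks.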

Exercise 11: If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?

Given the shapes of the respective sampling distributions, I think it is only sensible to proceed with inference and report a margin of error for Australia: its sampling distribution is fairly normal, and np = 104 and n(1-p) = 936 are both comfortably above 10. The sampling distribution for Ecuador, however, is skewed enough that it is not fit for inference based on the normal model. Moreover, np = 8 is below 10 in the case of Ecuador, so the success-failure condition fails.
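
The success-failure check for the two countries can be tabulated directly. This sketch treats the point estimates from the exercise as the true proportions, as the exercise instructs; the data frame `check` and its column names are illustrative.

```r
# Sample sizes and proportions quoted in the exercise.
check <- data.frame(country = c("Australia", "Ecuador"),
                    n = c(1040, 400),
                    p = c(0.1, 0.02))

check$successes <- check$n * check$p        # expected number of atheists
check$failures <- check$n * (1 - check$p)   # expected number of non-atheists

# Success-failure condition: both counts must be at least 10.
check$condition_met <- check$successes >= 10 & check$failures >= 10
check
```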

On Your Own

1. Answer the following two questions using the inference function. As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference.

a. Is there convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012?

Hint: Create a new data set for respondents from Spain. Form confidence intervals for the true proportion of atheists in both years, and determine whether they overlap.

spain05 <- subset(atheism,nationality == "Spain" & year == "2005")
spain05atheists <- subset(spain05,response == "atheist")

spain12 <- subset(atheism,nationality == "Spain" & year == "2012")
spain12atheists <- subset(spain12, response == "atheist")

inference(spain05$response, est = "proportion",type = "ci",method = "theoretical", success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.1003 ;  n = 1146 
## Check conditions: number of successes = 115 ; number of failures = 1031 
## Standard error = 0.0089 
## 95 % Confidence interval = ( 0.083 , 0.1177 )

inference(spain12$response, est = "proportion",type = "ci",method = "theoretical", success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.09 ;  n = 1145 
## Check conditions: number of successes = 103 ; number of failures = 1042 
## Standard error = 0.0085 
## 95 % Confidence interval = ( 0.0734 , 0.1065 )

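#Hypotheses: H0 is that the proportion of atheists in Spain was the same in 2005 and 2012; HA is that the proportion changed. Both samples have well over 10 successes and 10 failures, so the success-failure condition is met. There is substantial overlap between the confidence intervals: the 2005 interval is (0.083, 0.1177) and the 2012 interval is (0.0734, 0.1065), so they share the range from 0.083 to 0.1065. This makes sense given that p_hat is very similar in the two years (0.1003 versus 0.09). We therefore do not have convincing evidence of a change in the population proportion of atheists in Spain from 2005 to 2012.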

b. Is there convincing evidence that the United States has seen a change in its atheism index between 2005 and 2012?

usatheism05 <- subset(atheism,nationality == "United States" & year == "2005")
inference(usatheism05$response, est = "proportion", type = "ci",method = "theoretical",success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.01 ;  n = 1002 
## Check conditions: number of successes = 10 ; number of failures = 992 
## Standard error = 0.0031 
## 95 % Confidence interval = ( 0.0038 , 0.0161 )

inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

#Hypotheses: H0 is that the proportion of atheists in the US was the same in 2005 and 2012; HA is that the proportion changed. The 2005 sample has exactly 10 successes, so the success-failure condition is only just met. The confidence intervals for the proportion of US atheists in 2005 and 2012 do not overlap at all: the 2005 interval runs from 0.38% to 1.61%, while the 2012 interval runs from 3.64% to 6.34%. Therefore, we have convincing evidence that the population proportion of US atheists in 2012 differs from that of 2005, and the point estimates indicate that it increased.
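
Comparing two intervals for overlap is a conservative shortcut. As a cross-check, a two-sample test of equal proportions can be run with base R's `prop.test()`. This is a different technique from the interval-overlap approach the lab hint suggests, and the sketch below uses only the success counts and sample sizes printed in the output above; `spain_test` and `us_test` are illustrative names.

```r
# Successes (atheist responses) and sample sizes for 2005 and 2012,
# copied from the inference() output above.
spain_test <- prop.test(x = c(115, 103), n = c(1146, 1145))
us_test <- prop.test(x = c(10, 50), n = c(1002, 1002))

# p-values for H0: the proportion is the same in both years.
signif(c(Spain = spain_test$p.value, US = us_test$p.value), 3)
```

A p-value above 0.05 for Spain and below 0.05 for the United States would agree with the conclusions drawn from the intervals.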

2. If in fact there has been no change in the atheism index in the countries listed in Table 4, in how many of those countries would you expect to detect a change (at a significance level of 0.05) simply by chance?

Hint: Look in the textbook index under Type 1 error.

I would expect to detect a change in about 5% of the countries in the table by chance alone. A significance level of 0.05 is the probability of a Type 1 error, that is, of rejecting a true null hypothesis of no change. If the null hypothesis were true in every country, each country’s test would have a 5% chance of yielding a sample difference extreme enough to be declared significant.
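
The 5% figure can be illustrated with a small simulation. The sketch below is built on assumptions that are not part of the lab: it imagines many hypothetical countries in which the true proportion is the same (p = 0.1) in both years, with 1000 respondents per year, and counts how often a two-sided pooled z-test at the 0.05 level flags a change anyway. All names, the seed, and the parameter values are illustrative.

```r
set.seed(42)   # arbitrary seed for reproducibility
reps <- 5000   # hypothetical countries with no real change
n <- 1000      # assumed respondents per survey year
p <- 0.1       # assumed true proportion in both years

# Number of atheist responses in each simulated survey.
x1 <- rbinom(reps, size = n, prob = p)
x2 <- rbinom(reps, size = n, prob = p)

# Two-sided pooled z-test for a difference in proportions.
p_pool <- (x1 + x2) / (2 * n)
z <- (x1 / n - x2 / n) / sqrt(p_pool * (1 - p_pool) * (2 / n))

# Share of simulated countries where a change is (wrongly) detected.
false_positive_rate <- mean(abs(z) > qnorm(0.975))
false_positive_rate
```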

3. Suppose you’re hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for p. How many people would you have to sample to ensure that you are within the guidelines?

Hint: Refer to your plot of the relationship between p and margin of error. Do not use the data set to answer this question.

Because the margin of error for a 95% confidence interval is ME = 1.96*SE, we need SE ≤ 0.01/1.96 ≈ 0.0051. The standard error is SE = sqrt(p(1-p)/n), which is largest at p = 0.5, so using p = 0.5 guarantees that the requirement is met for any p. Solving 1.96*sqrt(0.5*0.5/n) ≤ 0.01 gives n ≥ 0.25*(1.96/0.01)^2 = 9,604, so we would have to sample at least 9,604 people.
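
The same calculation can be sketched in R. The code below assumes the worst case p = 0.5 and uses `qnorm(0.975)` rather than the rounded value 1.96; the variable names are illustrative.

```r
me_target <- 0.01    # required margin of error
p_worst <- 0.5       # proportion that maximizes p * (1 - p)
z <- qnorm(0.975)    # critical value for 95% confidence

# Solve z * sqrt(p * (1 - p) / n) <= me_target for n, then round up.
n_required <- ceiling(p_worst * (1 - p_worst) * (z / me_target)^2)
n_required

# Margin of error actually achieved at that sample size.
z * sqrt(p_worst * (1 - p_worst) / n_required)
```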