Inference for categorical data

In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?

Sample Statistics

The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?

Sample is less than 10% population-Yes
Samples are independent of each other-Yes

load("more/atheism.RData")
head(atheism)

##   nationality    response year
## 1 Afghanistan non-atheist 2012
## 2 Afghanistan non-atheist 2012
## 3 Afghanistan non-atheist 2012
## 4 Afghanistan non-atheist 2012
## 5 Afghanistan non-atheist 2012
## 6 Afghanistan non-atheist 2012

What does each row of Table 6 correspond to? What does each row of atheism correspond to?

Table6- Each row represents a country
Atheism df- Each row of atheism represents a single person in a country

Does it agree with the percentage in Table 6? If not, why?

library(plyr)
us12 <- subset(atheism, nationality == "United States" & year == "2012")
count(us12)

##     nationality    response year freq
## 1 United States     atheist 2012   50
## 2 United States non-atheist 2012  952

our_pct <- count(us12)[1,4]/1002
paste(round(our_pct*100,2),"%")

## [1] "4.99 %"

Yes the answer in atheist agrees with the answer listed in the pdf table

Inference on proportions

Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?

The sample observations are independent - YES
The success failure conditions are met-YES

If the conditions for inference are reasonable, we can either calculate the standard error and construct the interval by hand, or allow the inference function to do it for us.

inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Warning: package 'BHH2' was built under R version 3.4.4

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

Based on the R output, what is the margin of error for the estimate of the proportion of the proportion of atheists in US in 2012?

.013

Z <- 1.96
Stand_err <- 0.0069
Mar_Err <- Z * Stand_err
Mar_Err

## [1] 0.013524

Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference function to construct the confidence intervals.

pak12 <- subset(atheism, nationality == "Pakistan" & year == "2012")
count(pak12)

##   nationality    response year freq
## 1    Pakistan     atheist 2012   54
## 2    Pakistan non-atheist 2012 2650

inference(pak12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.02 ;  n = 2704 
## Check conditions: number of successes = 54 ; number of failures = 2650 
## Standard error = 0.0027 
## 95 % Confidence interval = ( 0.0147 , 0.0252 )

Spain12 <- subset(atheism, nationality == "Spain" & year == "2012")
count(Spain12)

##   nationality    response year freq
## 1       Spain     atheist 2012  103
## 2       Spain non-atheist 2012 1042

inference(Spain12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.09 ;  n = 1145 
## Check conditions: number of successes = 103 ; number of failures = 1042 
## Standard error = 0.0085 
## 95 % Confidence interval = ( 0.0734 , 0.1065 )

pak_se <- .0027
z <-1.96
spain_se <- .0085

pak_me <-spain_se*z
spain_me <- pak_se*z

paste("Margin of error in pakistan is ",pak_me," margin of error in Spain is ",spain_me)

## [1] "Margin of error in pakistan is  0.01666  margin of error in Spain is  0.005292"

Describe the relationship between p and me.

The margin of error reaches the maximum when the population proportion is .5 and is parabolic before and after.

p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

Describe the sampling distribution of sample proportions at \(n = 1040\) and \(p = 0.1\). Be sure to note the center, spread, and shape.
Hint: Remember that R has functions such as mean to calculate summary statistics.

The center seems to be at around .1. The distribution seems normal. The mean and median are both .1 as well.

library(psych)

summary(p_hats)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07019 0.09327 0.09904 0.09969 0.10577 0.12981

describe(p_hats)

##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis
## X1    1 5000  0.1 0.01    0.1     0.1 0.01 0.07 0.13  0.06 0.06    -0.09
##    se
## X1  0

Repeat the above simulation three more times but with modified sample sizes and proportions: for \(n = 400\) and \(p = 0.1\), \(n = 1040\) and \(p = 0.02\), and \(n = 400\) and \(p = 0.02\). Plot all four histograms together by running the par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. Describe the three new sampling distributions. Based on these limited plots, how does \(n\) appear to affect the distribution of \(\hat{p}\)? How does \(p\) affect the sampling distribution?

The center of the distribution alongside mean and median follows the P value. Higher n values lead to a more normal distribution

p <- 0.1
n <- 400
phats1 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  phats1[i] <- sum(samp == "atheist")/n
}

p <- 0.02
n <- 1040
phats2 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  phats2[i] <- sum(samp == "atheist")/n
}

p <- 0.02
n <- 400
phats3 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  phats3[i] <- sum(samp == "atheist")/n
}

par(mfrow = c(2, 2))

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
hist(phats1, main = "p = 0.1, n = 400", xlim = c(0, 0.18))
hist(phats2, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))
hist(phats3, main = "p = 0.02, n = 400", xlim = c(0, 0.18))

Once you’re done, you can reset the layout of the plotting window by using the command par(mfrow = c(1, 1)) command or clicking on “Clear All” above the plotting window (if using RStudio). Note that the latter will get rid of all your previous plots.

par(mfrow = c(1, 1))

If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margin of errors, as the reports does?

n_aus <- 1040
p_aus <- .1
aus_atheist <- 1040*.1
aus_not_atheist <- 1040*.9
ecuador_n <- 400
ecuador_atheist <- 400*.02
ecuador_not_atheist <- 400 * .98

paste("any samples under 10 arent large enough for inference. Austraillia's samples are Atheist=",aus_atheist,"not atheist =",aus_not_atheist,"ecuador's numbers are atheist=",ecuador_atheist," not atheist",ecuador_not_atheist)

## [1] "any samples under 10 arent large enough for inference. Austraillia's samples are Atheist= 104 not atheist = 936 ecuador's numbers are atheist= 8  not atheist 392"

Although the condition for inference for Ecuador isn’t met as one grouping only has 8 responses, it’s presumably close enough to proceed * * *

On your own

The question of atheism was asked by WIN-Gallup International in a similar survey that was conducted in 2005. (We assume here that sample sizes have remained the same.) Table 4 on page 13 of the report summarizes survey results from 2005 and 2012 for 39 countries.

Answer the following two questions using the inference function. As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference.

a. Is there convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012?
\(H_O\) There is no change in atheism index from 2005-2012

\(H_A\) There is a change in the atheism index from 2005-2012
Conditions for inference are met
The confidence intervals in Spain 05 and 12 overlap, which indicates that we fail to reject the null and can’t state that there is a change in the atheism index between those years

library(kableExtra)
library(knitr)
Spain05 <- subset(atheism, nationality == "Spain" & year == "2005")
kable(count(Spain05$response == 'atheist'))

x	freq
FALSE	1031
TRUE	115

Spain_inf_05 <- inference(Spain05$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.1003 ;  n = 1146 
## Check conditions: number of successes = 115 ; number of failures = 1031 
## Standard error = 0.0089 
## 95 % Confidence interval = ( 0.083 , 0.1177 )

Spain12 <- subset(atheism, nationality == "Spain" & year == "2012")
count(Spain12$response == 'atheist')

##       x freq
## 1 FALSE 1042
## 2  TRUE  103

Spain_inf_12 <- inference(Spain12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.09 ;  n = 1145 
## Check conditions: number of successes = 103 ; number of failures = 1042 
## Standard error = 0.0085 
## 95 % Confidence interval = ( 0.0734 , 0.1065 )

b. Is there convincing evidence that the United States has seen a change in its atheism index between 2005 and 2012?

$H_O$ There is no change in atheism index from 2005-2012
$H_A$ There is a change in the atheism index from 2005-2012

conditions for inference are met
The confidence intervals in US 05 and 12 don’t overlap, which indicates that we can reject the null and state there appears to be a change in the atheism index between those years in the US

United_States05 <- subset(atheism, nationality == "United States" & year == "2005")
kable(count(United_States05$response == 'atheist'))

x	freq
FALSE	992
TRUE	10

United_States_inf_05 <- inference(United_States05$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.01 ;  n = 1002 
## Check conditions: number of successes = 10 ; number of failures = 992 
## Standard error = 0.0031 
## 95 % Confidence interval = ( 0.0038 , 0.0161 )

United_States12 <- subset(atheism, nationality == "United States" & year == "2012")
kable(count(United_States12$response == 'atheist'))

x	freq
FALSE	952
TRUE	50

United_States_inf_12 <- inference(United_States12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

If in fact there has been no change in the atheism index in the countries listed in Table 4, in how many of those countries would you expect to detect a change (at a significance level of 0.05) simply by chance?
Hint: Look in the textbook index under Type 1 error.

Our confidence interval means that we are 95% confident that results can’t be attributed to random chance. This implies we will make a type 1 error 5% of the time. Therefore with a sample of 39 countries, we would expect to see 2 countries falsely identified

39*.05

## [1] 1.95

Suppose you’re hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for \(p\). How many people would you have to sample to ensure that you are within the guidelines?

SE = \(\sqrt {\frac {P_0(1-P_0)}{N}}\)= \(\sqrt {\frac {.5(1-.5)}{N}}\)

ME= Z* SE

\(ME^2= Z^2 {\frac {.5(1-.5)}{N}}\)

\(N= (Z^2 {.5(1-.5)})/ ME^2\)

n= 9604

z=1.96
ME=.01
n= z^2*(.5*.5)/ME^2

*Hint:* Refer to your plot of the relationship between $p$ and margin of 
error. Do not use the data set to answer this question.

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.