Inference of Categorical Data: The Survey

Exercise 1

These are sample statistics.

Exercise 2

To generalize the report's findings to the global population, we must assume that respondents were sampled at random from truly worldwide sources and that the responses are independent of one another. It is unlikely that this happened, so this is not a reasonable assumption.

The Data

download.file("http://www.openintro.org/stat/data/atheism.RData", destfile = "atheism.RData")
load("atheism.RData")

Exercise 3

Each row of Table 6 corresponds to a country where a sample of the population was surveyed about their self-perceived level of religiosity. Each row of the atheism data frame corresponds to a single respondent, whose response is recorded only as atheist or non-atheist.

Exercise 4

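us12 <- subset(atheism, nationality == "United States" & year == "2012")
# Compute the proportion from the data instead of hard-coding the counts
proportion <- mean(us12$response == "atheist")
proportion
## [1] 0.0499002
Percent <- proportion * 100
Percent
## [1] 4.99002

The calculated proportion of atheists is about 0.0499 (50 of 1002 respondents), or roughly 5.0%. This agrees with the 5% reported in Table 6.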

Inference on Proportions

Exercise 5

inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

Exercise 6

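Based on the R output, the standard error is 0.0069, so the margin of error at 95% confidence is about 1.96 × 0.0069 ≈ 0.0135. This is half the width of the confidence interval (0.0364, 0.0634).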
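The inference() output prints the standard error and the 95% interval but not the margin of error itself, which can be recovered from either one. The following is a minimal sketch that assumes the rounded values printed above and a normal critical value from qnorm(); the variable names are illustrative only.

```r
# Values copied (rounded) from the inference() output for the US 2012 sample
se <- 0.0069
ci <- c(0.0364, 0.0634)

# Critical value for a 95% confidence level under the normal model
z_star <- qnorm(0.975)

# Margin of error two ways: z* times the SE, and half the interval width
me_from_se <- z_star * se
me_from_ci <- diff(ci) / 2

round(c(me_from_se = me_from_se, me_from_ci = me_from_ci), 4)
```

Both routes give about 0.0135, roughly twice the standard error.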

Exercise 7

Arg12 <- subset(atheism, nationality == "Argentina" & year == "2012")
Aus12 <- subset(atheism, nationality == "Australia" & year == "2012")
inference(Arg12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0706 ;  n = 991 
## Check conditions: number of successes = 70 ; number of failures = 921 
## Standard error = 0.0081 
## 95 % Confidence interval = ( 0.0547 , 0.0866 )
inference(Aus12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.1001 ;  n = 1039 
## Check conditions: number of successes = 104 ; number of failures = 935 
## Standard error = 0.0093 
## 95 % Confidence interval = ( 0.0818 , 0.1183 )

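Argentina: margin of error ≈ 1.96 × 0.0081 ≈ 0.016. Australia: margin of error ≈ 1.96 × 0.0093 ≈ 0.018. In both samples the numbers of successes and failures are well above 10, so the success-failure condition is met.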
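The intervals printed by inference() can be reproduced by hand from the success counts. The sketch below assumes the standard normal-approximation interval for a single proportion; prop_ci is a hypothetical helper written for this illustration, not a function from the lab.

```r
# Hypothetical helper (not part of the lab): normal-approximation
# confidence interval for a single proportion
prop_ci <- function(successes, n, conf_level = 0.95) {
  p_hat <- successes / n
  se <- sqrt(p_hat * (1 - p_hat) / n)
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  me <- z_star * se
  c(p_hat = p_hat, se = se, me = me, lower = p_hat - me, upper = p_hat + me)
}

# Counts taken from the "Check conditions" lines of the output above
arg <- prop_ci(successes = 70, n = 991)
aus <- prop_ci(successes = 104, n = 1039)

round(rbind(Argentina = arg, Australia = aus), 4)
```

The me column is the margin of error, and the lower and upper columns should agree with the printed intervals up to rounding.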

How Does the Proportion Affect the Margin of Error?

n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")

Exercise 8

The margin of error is largest when the population proportion is 0.5 and falls off symmetrically toward 0 as p approaches 0 or 1, because p(1 - p) is maximized at p = 0.5.
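The location of the peak can also be read off numerically rather than from the plot. This sketch reuses the same n, p grid, and margin-of-error formula as the plotting code above; p_at_max and max_me are names introduced here for illustration.

```r
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p) / n)

# Grid value of p where the margin of error is largest, and that largest value
p_at_max <- p[which.max(me)]
max_me <- max(me)

c(p_at_max = p_at_max, max_me = round(max_me, 4))
```

With n = 1000 the largest margin of error is a little over 0.03, reached at p = 0.5.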

Success-Failure Condition

p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

mean(p_hats)
## [1] 0.09969

Exercise 9

For p = 0.1 and n = 1040, the sampling distribution is approximately normal, with values ranging from about 0.065 to 0.14. The center is just shy of 0.10, and the mean is 0.0997.
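The center and spread of the simulated distribution can be compared with what theory predicts: a mean of p and a standard error of sqrt(p(1 - p)/n). The sketch below swaps the sample() loop for a vectorized rbinom() call, which draws the number of atheists in each sample directly; the seed is arbitrary and is set only so the sketch is reproducible.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of this sketch
p <- 0.1
n <- 1040
reps <- 5000

# rbinom() draws the number of "atheist" responses in each simulated sample
p_hats <- rbinom(reps, size = n, prob = p) / n

# Theoretical standard error of the sample proportion
se_theory <- sqrt(p * (1 - p) / n)

c(mean = mean(p_hats), sd = sd(p_hats), se_theory = se_theory)
```

The simulated standard deviation should land close to the theoretical value of about 0.0093, which is consistent with most of the histogram falling within roughly three standard errors of 0.1.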

Exercise 10

par(mfrow = c(2, 2))
p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

p <- 0.1
n <- 400
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 400", xlim = c(0, 0.18))


p <- 0.02
n <- 1040
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))


p <- 0.02
n <- 400
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.02, n = 400", xlim = c(0, 0.18))

With p = 0.1 and n = 400, the distribution is still approximately normal, with a similar center to the first plot, but the spread is wider and now covers roughly 0.03 to 0.16. With p = 0.02 and n = 1040, the distribution is shifted far to the left, lying roughly between 0.00 and 0.03 and centered near 0.02. With p = 0.02 and n = 400, the distribution is also shifted left, with a similar center. On these axes, n appears to have less impact on the sampling distribution than p does: lowering n from 1040 to 400 widens the distribution, whereas lowering p from 0.1 to 0.02 shifts it completely to the left.
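The comparison across the four settings can also be summarized numerically. The sketch below is one possible compact version of the four simulations: simulate_p_hats is a hypothetical helper, rbinom() replaces the sample() loop, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical helper: simulate sample proportions for one (p, n) pair
simulate_p_hats <- function(p, n, reps = 5000) {
  rbinom(reps, size = n, prob = p) / n
}

# The four (n, p) combinations, in the same order as the histograms
settings <- expand.grid(n = c(1040, 400), p = c(0.1, 0.02))

# Simulated center and spread for each setting
settings$mean_p_hat <- NA_real_
settings$sd_p_hat <- NA_real_
for (i in seq_len(nrow(settings))) {
  sims <- simulate_p_hats(settings$p[i], settings$n[i])
  settings$mean_p_hat[i] <- mean(sims)
  settings$sd_p_hat[i] <- sd(sims)
}

# Theoretical standard error for comparison
settings$se_theory <- sqrt(settings$p * (1 - settings$p) / settings$n)
settings
```

In this table the mean column tracks p, while the standard deviation grows as n shrinks, which is the numerical counterpart of the shift and widening seen in the histograms.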

Exercise 11

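Since both distributions are roughly bell-shaped, it could be sensible to proceed with inference and report margins of error, with one caveat: for p = 0.02 and n = 400 the expected number of successes is only 400 × 0.02 = 8, below the usual threshold of 10, so the normal approximation is less trustworthy there. In either case, the p-hat and n should be stated alongside each margin of error so that all calculations are on the appropriate scale and there is no false apples-to-apples comparison.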
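Whether the normal approximation is reasonable is usually judged with the success-failure condition, which asks for at least 10 expected successes and 10 expected failures. The sketch below applies that check to the four simulated settings; success_failure_ok is a hypothetical helper, and the threshold of 10 is the conventional rule of thumb rather than a hard limit.

```r
# Hypothetical helper: TRUE when both expected counts reach the threshold
success_failure_ok <- function(p, n, threshold = 10) {
  (n * p >= threshold) & (n * (1 - p) >= threshold)
}

# The four (p, n) settings simulated above
checks <- data.frame(p = c(0.1, 0.1, 0.02, 0.02),
                     n = c(1040, 400, 1040, 400))
checks$expected_successes <- checks$n * checks$p
checks$condition_met <- success_failure_ok(checks$p, checks$n)
checks
```

Only the p = 0.02, n = 400 setting fails the check, with about 8 expected successes.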