Inference for categorical data

The Data

download.file("http://www.openintro.org/stat/data/atheism.RData", destfile = "atheism.RData")
load("atheism.RData")

Exercise 3. What does each row of Table 6 correspond to? What does each row of atheism correspond to?

head(atheism)

##   nationality    response year
## 1 Afghanistan non-atheist 2012
## 2 Afghanistan non-atheist 2012
## 3 Afghanistan non-atheist 2012
## 4 Afghanistan non-atheist 2012
## 5 Afghanistan non-atheist 2012
## 6 Afghanistan non-atheist 2012

Each row of Table 6 corresponds to one country and gives the percentage of that country's respondents in each of four categories: a religious person, not a religious person, a convinced atheist, and no response/don’t know.

Each row of atheism corresponds to one individual respondent, recording their nationality, whether they responded as atheist or non-atheist, and the survey year.

Exercise 4. Using the command below, create a new dataframe called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?

us12 <- subset(atheism, nationality == "United States" & year == "2012")
summary(us12$response)

##     atheist non-atheist 
##          50         952

Total responses: 50 + 952 = 1002

(50/1002)*100

## [1] 4.99002

The proportion of atheist responses in the United States is about 0.05 (5%), which agrees with the percentage reported for the United States in Table 6.
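The same figure can be recovered directly from the counts that summary() reports. The sketch below is a minimal illustration that rebuilds the response vector from those counts instead of loading atheism.RData; the name `responses` is an assumption introduced here as a stand-in for us12$response.

```r
# Rebuild the US 2012 responses from the counts reported above
# (50 atheist, 952 non-atheist); `responses` is a hypothetical stand-in.
responses <- factor(rep(c("atheist", "non-atheist"), times = c(50, 952)))

# table() counts each level; prop.table() converts counts to proportions
props <- prop.table(table(responses))
round(100 * props, 2)

# Equivalent shortcut: the mean of a logical vector is the proportion of TRUEs
p_hat <- mean(responses == "atheist")
p_hat
```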

## Inference on proportions

Exercise 5. Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?

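The conditions for constructing a 95% confidence interval are that the observations are independent, which is reasonable when the respondents are randomly selected, and that the sample contains at least 10 successes and 10 failures. With 50 atheist and 952 non-atheist responses, the success-failure condition is clearly met; independence holds provided the respondents were randomly sampled.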
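The success-failure part of these conditions can be checked mechanically from the counts. This is a minimal sketch using the counts from the summary output above; the helper name `meets_success_failure` and its `threshold` argument are introduced here for illustration. Independence cannot be verified by code and has to be argued from how the survey was run.

```r
# Hypothetical helper: TRUE when both successes and failures reach the threshold
meets_success_failure <- function(successes, failures, threshold = 10) {
  successes >= threshold && failures >= threshold
}

# US 2012 sample: 50 atheist responses, 952 non-atheist responses
us_ok <- meets_success_failure(successes = 50, failures = 952)
us_ok
```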

inference(us12$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

Exercise 6. Based on the R output, what is the margin of error for the estimate of the proportion of atheists in the US in 2012?

# Margin of error = critical value (1.96 for 95% confidence) times the standard error
margin_of_error <- 1.96 * 0.0069
margin_of_error

## [1] 0.013524

The margin of error is about 0.0135, or roughly 1.35 percentage points.
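As a cross-check, the margin of error is also half the width of the reported confidence interval. The sketch below uses only the rounded values printed by inference(), so the two results agree only up to rounding; the variable names are illustrative.

```r
# Values copied from the inference() output above (rounded to 4 decimals)
se    <- 0.0069
lower <- 0.0364
upper <- 0.0634

me_from_se <- 1.96 * se            # critical value times standard error
me_from_ci <- (upper - lower) / 2  # half the width of the interval

c(from_se = me_from_se, from_ci = me_from_ci)
```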

Exercise 7. Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference function to construct the confidence intervals.

The two countries chosen are Belgium and Hong Kong.

Bg12 <- subset(atheism, nationality == "Belgium" & year == "2012")
summary(Bg12$response)

##     atheist non-atheist 
##          42         485

inference(Bg12$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0797 ;  n = 527 
## Check conditions: number of successes = 42 ; number of failures = 485 
## Standard error = 0.0118 
## 95 % Confidence interval = ( 0.0566 , 0.1028 )

# Margin of error for Belgium: 1.96 times the standard error
margin_of_error <- 1.96 * 0.0118
margin_of_error

## [1] 0.023128

HK12 <- subset(atheism, nationality == "Hong Kong" & year == "2012")
summary(HK12$response)

##     atheist non-atheist 
##          45         455

inference(HK12$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.09 ;  n = 500 
## Check conditions: number of successes = 45 ; number of failures = 455 
## Standard error = 0.0128 
## 95 % Confidence interval = ( 0.0649 , 0.1151 )

# Margin of error for Hong Kong: 1.96 times the standard error
margin_of_error <- 1.96 * 0.0128
margin_of_error

## [1] 0.025088

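Both countries meet the success-failure condition for constructing a confidence interval: each sample has at least 10 successes and 10 failures (42 and 485 for Belgium, 45 and 455 for Hong Kong). Belgium’s 95% confidence interval is (0.0566, 0.1028), with a margin of error of 0.0231. Hong Kong’s 95% confidence interval is (0.0649, 0.1151), with a margin of error of 0.0251.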
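The intervals can also be reproduced from the counts alone. The function below is a minimal sketch of the normal-approximation interval, p_hat plus or minus z* times sqrt(p_hat (1 - p_hat) / n). `prop_ci` is a name introduced here, not part of the lab's inference() function, and it uses qnorm(0.975) rather than the rounded 1.96, so results may differ from the output above in the last digit.

```r
# Hypothetical helper: normal-approximation CI for a single proportion
prop_ci <- function(successes, n, conf_level = 0.95) {
  p_hat <- successes / n
  se    <- sqrt(p_hat * (1 - p_hat) / n)     # standard error of p_hat
  z     <- qnorm(1 - (1 - conf_level) / 2)    # about 1.96 for 95% confidence
  me    <- z * se                             # margin of error
  c(p_hat = p_hat, se = se, me = me, lower = p_hat - me, upper = p_hat + me)
}

# Counts taken from the summary() output for each country
belgium   <- prop_ci(successes = 42, n = 527)
hong_kong <- prop_ci(successes = 45, n = 500)
round(rbind(Belgium = belgium, `Hong Kong` = hong_kong), 4)
```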

## How does the proportion affect the margin of error?

n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")

Exercise 8. Describe the relationship between p and me.

The plot is an arch that resembles an upside-down parabola. The margin of error is highest when the population proportion is 0.5 and decreases symmetrically toward 0 as the proportion moves to either extreme, 0 or 1.
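The location of the peak can be confirmed numerically. This sketch reuses the definitions of p and me from the plotting code above (n = 1000) and looks up where me is largest; the names `p_at_max`, `max_me`, and `me_at_ends` are introduced here.

```r
n  <- 1000
p  <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p) / n)

# Proportion at which the margin of error peaks, and the peak value
p_at_max <- p[which.max(me)]
max_me   <- max(me)
c(p_at_max = p_at_max, max_me = max_me)

# Margin of error at the two extremes, p = 0 and p = 1
me_at_ends <- me[c(1, length(me))]
me_at_ends
```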

## Success-failure condition

p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

Exercise 9. Describe the sampling distribution of sample proportions at n=1040 and p=0.1. Be sure to note the center, spread, and shape.

summary(p_hats)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07019 0.09327 0.09904 0.09969 0.10577 0.12981

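The sampling distribution is approximately normal: unimodal, symmetric, and bell-shaped. It is centered near 0.1, with a mean of 0.09969 and a median of 0.09904, both close to the population proportion of 0.1. The simulated proportions range from 0.07019 to 0.12981, and the middle half of them lie between 0.09327 and 0.10577.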
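The center and spread can be compared with what theory predicts: the sampling distribution of a sample proportion is centered at p with standard error sqrt(p (1 - p) / n). The sketch below is a self-contained variant of the simulation that draws the counts with rbinom() instead of the sample() loop, and it fixes a seed (an arbitrary choice made here) so the run is repeatable. The names `sim_p_hats` and `theoretical_se` are introduced for illustration.

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable

p <- 0.1
n <- 1040

# Each draw is the number of "atheist" responses in one sample of size n;
# dividing by n turns the count into a sample proportion
sim_p_hats <- rbinom(5000, size = n, prob = p) / n

# Standard error predicted by theory
theoretical_se <- sqrt(p * (1 - p) / n)

c(sim_mean = mean(sim_p_hats), sim_sd = sd(sim_p_hats), theoretical_se = theoretical_se)
```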

Exercise 10. Repeat the above simulation three more times but with modified sample sizes and proportions: for n=400 and p=0.1, n=1040 and p=0.02, and n=400 and p=0.02. Plot all four histograms together by running the par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. Describe the three new sampling distributions. Based on these limited plots, how does n appear to affect the distribution of p̂? How does p affect the sampling distribution?

For n = 400 and p = 0.1:

p <- 0.1
n <- 400
p_400.1 <- rep(0, 5000)

for(i in 1:5000){
  samp400.1 <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_400.1[i] <- sum(samp400.1 == "atheist")/n
}

For n = 1040 and p = 0.02:

p <- 0.02
n <- 1040
p_1040.02 <- rep(0, 5000)

for(i in 1:5000){
  samp1040.02 <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_1040.02[i] <- sum(samp1040.02 == "atheist")/n
}

For n = 400 and p = 0.02:

p <- 0.02
n <- 400
p_400.02 <- rep(0, 5000)

for(i in 1:5000){
  samp400.02 <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_400.02[i] <- sum(samp400.02 == "atheist")/n
}

# Shared x-axis limits so the four panels can be compared directly
par(mfrow = c(2, 2))
hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
hist(p_400.1, main = "p = 0.1, n = 400", xlim = c(0, 0.18))
hist(p_1040.02, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))
hist(p_400.02, main = "p = 0.02, n = 400", xlim = c(0, 0.18))

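Sample size affects the spread and shape of the distribution: as n increases, the spread shrinks and the shape moves closer to a bell curve. The proportion p matters as well. It sets the center of the distribution, and for a given n the distribution is narrower but more right-skewed when p is close to 0, because few successes are expected.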
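The effect of n and p can also be read off from theory. For the four (n, p) combinations named in Exercise 10, the sketch below tabulates the standard error sqrt(p (1 - p) / n), the expected number of successes np, and the skewness of the underlying binomial count, (1 - 2p) / sqrt(np (1 - p)). The data frame `grid` and its column names are chosen here for illustration.

```r
# The four (n, p) combinations named in Exercise 10
grid <- expand.grid(n = c(1040, 400), p = c(0.1, 0.02))

# Theoretical standard error of the sample proportion
grid$se <- sqrt(grid$p * (1 - grid$p) / grid$n)

# Expected number of successes, used in the success-failure condition
grid$expected_successes <- grid$n * grid$p

# Skewness of a binomial count; 0 would indicate a symmetric distribution
grid$skewness <- (1 - 2 * grid$p) / sqrt(grid$n * grid$p * (1 - grid$p))

grid
```

Reading down the table, a smaller n gives a larger standard error for the same p, and a p closer to 0 gives fewer expected successes and more right skew for the same n.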

Exercise 11. If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?

Australia meets the success-failure condition because both expected counts are at least 10: np = 104 and n(1 - p) = 936. It is therefore sensible to proceed with inference and report a margin of error for Australia. Ecuador does not meet the condition because np = 8, which is less than 10, so its sampling distribution is not well approximated by a normal curve. Inference and a margin of error should not be reported for Ecuador as the report did.

0.1*1040

## [1] 104

(1-0.1)*1040

## [1] 936

0.02*400

## [1] 8

(1-0.02)*400

## [1] 392
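One way to see why the success-failure condition matters is to compute how often the normal-approximation 95% interval would actually contain the true proportion. The sketch below does this exactly, by summing binomial probabilities over every possible success count, and treats the Table 6 point estimates as the truth, as the exercise suggests. The function name `wald_coverage` is introduced here for illustration, and this check goes slightly beyond what the exercise asks.

```r
# Hypothetical helper: exact coverage of the normal-approximation CI
wald_coverage <- function(n, p, conf_level = 0.95) {
  x      <- 0:n                                   # every possible success count
  p_hat  <- x / n
  z      <- qnorm(1 - (1 - conf_level) / 2)
  me     <- z * sqrt(p_hat * (1 - p_hat) / n)     # margin of error for each count
  covers <- (p_hat - me <= p) & (p <= p_hat + me) # does the interval contain p?
  sum(dbinom(x, size = n, prob = p)[covers])      # total probability of covering counts
}

coverage <- c(
  Australia = wald_coverage(n = 1040, p = 0.1),
  Ecuador   = wald_coverage(n = 400,  p = 0.02)
)
round(coverage, 3)
```

If the value for a country comes out well below the nominal 0.95, the reported margin of error understates the real uncertainty for that country.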