Math217_MarilynMacy_Lab8

Inference for categorical data

head(atheism)

##   nationality    response year
## 1 Afghanistan non-atheist 2012
## 2 Afghanistan non-atheist 2012
## 3 Afghanistan non-atheist 2012
## 4 Afghanistan non-atheist 2012
## 5 Afghanistan non-atheist 2012
## 6 Afghanistan non-atheist 2012

Exercise 1

In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?
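The percentages are sample statistics: they are calculated from the poll’s respondents. The paragraph uses the term “global” several times, which presents them as if they were population parameters, but they are only estimates of those parameters.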

Exercise 2

The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?
The sampling method must be unbiased random sampling from the global population. That is not a reasonable assumption: it is not possible to reach all of the world’s population to take a random sample for a poll, so the poll would likely contain convenience bias, selection bias, and volunteer bias.

Exercise 3

What does each row of Table 6 correspond to? What does each row of atheism correspond to?

The rows in Table 6 contain the total responses for a country in 2012. The rows in atheism contain the response from an individual in a particular year.

Exercise 4

Using the command below, create a new dataframe called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States.

us12 <- subset(atheism, nationality == "United States" & year == "2012")

Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?

# Proportion of 2012 US respondents who answered "atheist"
usa <- mean(us12$response == "atheist")
round(usa, 3)

## [1] 0.05

Yes, the calculated percentage agrees with the percentage in Table 6.

Exercise 5

Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?

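Independence: the respondents must be a random sample, and n = 1002 is less than 10% of the US population.
Success-failure:
n * p = 1002 * 0.05 = 50.1 > 10
n * (1 - p) = 1002 * 0.95 = 951.9 > 10

The sample size is large enough. Independence depends on the poll’s sampling method, which cannot be verified from the data alone.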

inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Warning: package 'BHH2' was built under R version 4.0.5

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

Exercise 6

Based on the R output, what is the margin of error for the estimate of the proportion of atheists in the US in 2012?

ME = 0.0499 - 0.0364 = 0.0135
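The same margin of error can be recovered from the standard error rather than from the interval endpoints. The following is a minimal sketch that uses only the counts reported in the output above (50 atheists out of 1002); the variable names are illustrative, and it assumes the interval is a standard normal-approximation interval.

```r
# Margin of error for the US 2012 proportion, from the counts in the output
n <- 1002
p_hat <- 50 / n

se <- sqrt(p_hat * (1 - p_hat) / n)  # standard error of the sample proportion
z_star <- qnorm(0.975)               # critical value for 95% confidence
me <- z_star * se                    # margin of error

round(c(se = se, me = me), 4)
```

Both routes should agree to about four decimal places: half the width of the reported interval and z* times the standard error.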

Exercise 7

Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met.

lt12 <- subset(atheism, nationality == "Lithuania" & year == "2012")
nt12 <- subset(atheism, nationality == "Netherlands" & year == "2012")

Lithuania

n = 1015
p = 0.01
n * p = 10.15
n * (1 - p) = 1004.85

The sample size just meets the conditions for inference.

inference(lt12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0099 ;  n = 1015 
## Check conditions: number of successes = 10 ; number of failures = 1005 
## Standard error = 0.0031 
## 95 % Confidence interval = ( 0.0038 , 0.0159 )

ME = 0.0099 - 0.0038 = 0.0061

Netherlands

n = 509
p = 0.14
n * p = 71.26
n * (1 - p) = 437.74
The sample size meets the conditions for inference.

inference(nt12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.1395 ;  n = 509 
## Check conditions: number of successes = 71 ; number of failures = 438 
## Standard error = 0.0154 
## 95 % Confidence interval = ( 0.1094 , 0.1696 )

ME = 0.1395 - 0.1094 = 0.0301

Exercise 8

n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")

How does the proportion affect the margin of error? Describe the relationship between p and me.

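The relationship looks like an inverted parabola. The margin of error is 0 at p = 0 and p = 1, rises symmetrically toward the middle, and peaks at p = 0.5, where it is about 0.032 for n = 1000.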
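The location of the peak can be checked numerically instead of being read off the plot. This sketch reuses the `n`, `p`, and `me` definitions from the exercise; `p_at_max` and `max_me` are names introduced here for illustration.

```r
# Where is the margin of error largest?
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p) / n)

p_at_max <- p[which.max(me)]  # proportion with the largest margin of error
max_me <- max(me)             # the largest margin of error itself

c(p_at_max = p_at_max, max_me = round(max_me, 4))
```

Because p * (1 - p) is largest at p = 0.5, that value is the worst case for planning a sample size when p is unknown.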

Exercise 9

Describe the sampling distribution of sample proportions at n=1040 and p=0.1. Be sure to note the center, spread, and shape.

p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

The distribution is approximately normal. The mean and median are both about 0.10 and the std. deviation is about 0.009. The range is (0.070, 0.130).

summary(p_hats)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07019 0.09327 0.09904 0.09969 0.10577 0.12981

paste("std. deviation = ", sd(p_hats))

## [1] "std. deviation =  0.00928738233415376"
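The simulated spread can be compared with the theoretical standard error of a sample proportion, sqrt(p(1 - p)/n). The sketch below is only a cross-check under the same p and n; `se_theory` is a name introduced here.

```r
# Theoretical standard error for p = 0.1, n = 1040
p <- 0.1
n <- 1040

se_theory <- sqrt(p * (1 - p) / n)
se_theory
```

The result should be close to the simulated standard deviation printed above, with the small difference due to simulation noise.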

Exercise 10

Repeat the above simulation three more times but with modified sample sizes and proportions: for n=400 and p=0.1, n=1040 and p=0.02, and n=400 and p=0.02. Describe the three new sampling distributions. Based on these limited plots, how does n appear to affect the distribution of p-hat? How does p affect the sampling distribution?

# Plot all four histograms together by running the par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. 

par(mfrow = c(2, 2))

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

p <- 0.1
n <- 400
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 400", xlim = c(0, 0.18))

p <- 0.02
n <- 1040
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))

p <- 0.02
n <- 400
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.02, n = 400", xlim = c(0, 0.18))

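A smaller n widens the spread of the distribution. A smaller proportion narrows the spread. With p = 0.1 the distributions are approximately normal and centered at 0.1 for both sample sizes. With p = 0.02 they are centered at 0.02, and for n = 400 the distribution is somewhat right-skewed because it sits close to the lower bound of 0.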
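The four simulations can also be summarized numerically. The sketch below is an alternative to the loops above, not the lab's method: it draws the count of atheists directly with `rbinom` instead of calling `sample` and counting, and the helper `simulate_p_hats`, the `settings` data frame, and the seed are assumptions made for illustration.

```r
set.seed(217)  # arbitrary seed so the sketch is reproducible

# Simulate `reps` sample proportions for a given n and p
simulate_p_hats <- function(n, p, reps = 5000) {
  rbinom(reps, size = n, prob = p) / n
}

# The four (n, p) combinations from Exercises 9 and 10
settings <- data.frame(n = c(1040, 400, 1040, 400),
                       p = c(0.1, 0.1, 0.02, 0.02))

# Simulated spread next to the theoretical standard error
settings$sd_sim <- mapply(function(n, p) sd(simulate_p_hats(n, p)),
                          settings$n, settings$p)
settings$se_theory <- sqrt(settings$p * (1 - settings$p) / settings$n)

settings
```

The `se_theory` column makes the pattern explicit: the spread shrinks as n grows and as p moves away from 0.5.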

Exercise 11

If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?

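The sampling distribution for Australia is approximately normal (n * p = 104 and n * (1 - p) = 936), so it is acceptable to proceed with inference. For Ecuador, n * p = 400 * 0.02 = 8 < 10, so the success-failure condition is not met and a margin of error based on the normal approximation should be treated with caution.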
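The success-failure condition for the two countries can be checked with a small helper. This is a sketch: `check_success_failure` and its `threshold` argument are hypothetical names, and the threshold of 10 is the rule of thumb used earlier in this lab.

```r
# Expected successes and failures, and whether both reach the threshold
check_success_failure <- function(n, p, threshold = 10) {
  successes <- n * p
  failures <- n * (1 - p)
  list(successes = successes,
       failures = failures,
       ok = successes >= threshold && failures >= threshold)
}

australia <- check_success_failure(n = 1040, p = 0.1)
ecuador <- check_success_failure(n = 400, p = 0.02)

australia$ok
ecuador$ok
```

A failed check does not forbid inference, but it signals that the normal approximation behind the reported margin of error may be poor.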

1. Answer the following two questions using the inference function. As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference.

a. Is there convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012?

\(H_0\): The proportion of atheists in Spain in 2005 equals the proportion of atheists in Spain in 2012
\(H_A\): The proportion of atheists in Spain in 2005 is different from the proportion in 2012.

sp05 <- subset(atheism, nationality == "Spain" & year == "2005")
sp12 <- subset(atheism, nationality == "Spain" & year == "2012")

Spain 2005

inference(sp05$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist", eda_plot = FALSE)

## Single proportion -- success: atheist 
## Summary statistics: p_hat = 0.1003 ;  n = 1146 
## Check conditions: number of successes = 115 ; number of failures = 1031 
## Standard error = 0.0089 
## 95 % Confidence interval = ( 0.083 , 0.1177 )

Spain 2012

inference(sp12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist", eda_plot = FALSE)

## Single proportion -- success: atheist 
## Summary statistics: p_hat = 0.09 ;  n = 1145 
## Check conditions: number of successes = 103 ; number of failures = 1042 
## Standard error = 0.0085 
## 95 % Confidence interval = ( 0.0734 , 0.1065 )

Both samples have well over 10 successes and 10 failures, so the success-failure condition is met. The confidence intervals for 2005 and 2012 overlap on (0.083, 0.1065), so \(H_0\) is not rejected: there isn’t sufficient evidence that the proportion of atheists in Spain has changed.
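Comparing two separate intervals is an informal check, so a direct two-proportion test is a useful cross-check. The sketch below uses base R's `prop.test` rather than the lab's `inference` function, with the success counts and sample sizes taken from the output above; turning off the continuity correction is a choice made here.

```r
# Two-proportion test for Spain, 2005 vs 2012 (counts from the output above)
spain <- prop.test(x = c(115, 103),    # atheists in 2005 and 2012
                   n = c(1146, 1145),  # sample sizes in 2005 and 2012
                   correct = FALSE)    # no continuity correction

spain$p.value
spain$conf.int  # 95% interval for the difference in proportions
```

A p-value above 0.05, or an interval for the difference that contains 0, points to the same conclusion as the overlapping intervals.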

b. Is there convincing evidence that the United States has seen a change in its atheism index between 2005 and 2012?
\(H_0\): The proportion of atheists in the US in 2005 equals the proportion of atheists in 2012
\(H_A\): The proportion of atheists in the US in 2005 is different from the proportion in 2012

US 2005

us05 <- subset(atheism, nationality == "United States" & year == "2005")
inference(us05$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist", eda_plot = FALSE)

## Single proportion -- success: atheist 
## Summary statistics: p_hat = 0.01 ;  n = 1002 
## Check conditions: number of successes = 10 ; number of failures = 992 
## Standard error = 0.0031 
## 95 % Confidence interval = ( 0.0038 , 0.0161 )

US 2012

inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist", eda_plot = FALSE)

## Single proportion -- success: atheist 
## Summary statistics: p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

The confidence intervals for the proportion of atheists in the United States for 2005 and 2012 do not overlap, so we reject \(H_0\): there is sufficient evidence that the proportion of atheists in the United States increased between 2005 and 2012. Note that the 2005 sample has exactly 10 successes, so the success-failure condition is only just met.
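The size of the change can be estimated with a confidence interval for the difference in proportions. This is a hand-computed sketch using the counts from the output above and the usual unpooled standard error; the variable names are illustrative.

```r
# 95% confidence interval for p_2012 - p_2005 in the United States
x05 <- 10; n05 <- 1002   # atheists and sample size, 2005
x12 <- 50; n12 <- 1002   # atheists and sample size, 2012

p05 <- x05 / n05
p12 <- x12 / n12

diff_hat <- p12 - p05
se_diff <- sqrt(p05 * (1 - p05) / n05 + p12 * (1 - p12) / n12)
ci <- diff_hat + c(-1, 1) * qnorm(0.975) * se_diff

round(ci, 4)
```

An interval that lies entirely above 0 is consistent with an increase between the two surveys.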

2.

If in fact there has been no change in the atheism index in the countries listed in Table 4, in how many of those countries would you expect to detect a change (at a significance level of 0.05) simply by chance?

If there were no real change, we would expect about two of the countries (0.05 * 39 = 1.95) to show a change simply by chance. Each of these would be a Type I error.
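The expected count is just the significance level times the number of tests. The sketch below also computes the chance of at least one false positive, which additionally assumes the 39 tests are independent; that assumption is introduced here and is not stated in the lab.

```r
alpha <- 0.05
n_countries <- 39  # number of countries used in the answer above

# Expected number of Type I errors if no country truly changed
expected_false_positives <- alpha * n_countries

# Probability of at least one Type I error, assuming independent tests
p_at_least_one <- 1 - (1 - alpha)^n_countries

expected_false_positives
p_at_least_one
```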

3.

Suppose you’re hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for p. How many people would you have to sample to ensure that you are within the guidelines?

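\(n \ge (z^* / ME)^2 \, p(1-p) = (2 / 0.01)^2 \, p(1-p) = 40{,}000 \, p(1-p)\)

Since p is unknown, use the worst case p = 0.5, where p(1-p) = 0.25 is largest: \(n \ge 40{,}000 \times 0.25 = 10{,}000\) people.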
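The sample-size formula can be wrapped in a small function. This is a sketch: `required_n` is a hypothetical helper, and its default of p = 0.5 is the conservative choice when nothing is known about p.

```r
# Smallest n whose margin of error is at most `me` at the given confidence
required_n <- function(me, p = 0.5, z_star = qnorm(0.975)) {
  ceiling((z_star / me)^2 * p * (1 - p))
}

n_rounded <- required_n(0.01, z_star = 2)  # critical value rounded to 2, as above
n_exact <- required_n(0.01)                # critical value from qnorm(0.975)

c(n_rounded = n_rounded, n_exact = n_exact)
```

Rounding the critical value up to 2 gives a slightly larger, and therefore safer, sample size than the value from `qnorm`.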

Math217_MarilynMacy_Lab8_CategoricalData

Marilyn Macy

10/29/2021