Data 606 Lab6

To access the press release for the poll, conducted by WIN-Gallup International, click on the following link:

https://github.com/jbryer/DATA606/blob/master/inst/labs/Lab6/more/Global_INDEX_of_Religiosity_and_Atheism_PR__6.pdf

Exercise 1

In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?

Because the findings come from a poll of a sample of respondents rather than a census of the whole population, the percentages reported are sample statistics.

Exercise 2

The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?

To generalize the report’s findings to the global human population, we must assume that respondents were randomly sampled and that the sample is sufficiently large.

Studies that are random, sufficiently large, and cover a very large geographic area are expensive to conduct. So I do not think it is reasonable to assume that the findings of this report generalize to the global human population.

load("Lab6/more/atheism.RData")

library(knitr)

## Warning: package 'knitr' was built under R version 3.4.4

Exercise 3

What does each row of Table 6 correspond to? What does each row of atheism correspond to?

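Each row of Table 6 corresponds to one of the 57 countries and summarizes its responses across 3 categories: religious person, not a religious person, and convinced atheist. Each row gives the percentage of respondents who chose each category.

Each row of atheism corresponds to an individual respondent and records that respondent’s nationality, response (atheist or non-atheist), and survey year.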

Exercise 4

Using the command below, create a new dataframe called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?

The table us12 has 1002 observations, which matches the sample size reported in Table 6 for the United States.

The calculated percentage of atheists in us12 is 5% after rounding, which matches the published figure in Table 6 for the United States.

us12 <- subset(atheism, nationality == "United States" & year == "2012")
kable(us12[1:10,])

	nationality	response	year
49926	United States	non-atheist	2012
49927	United States	non-atheist	2012
49928	United States	non-atheist	2012
49929	United States	non-atheist	2012
49930	United States	non-atheist	2012
49931	United States	non-atheist	2012
49932	United States	non-atheist	2012
49933	United States	non-atheist	2012
49934	United States	non-atheist	2012
49935	United States	non-atheist	2012

nrow(us12) #number of observations

## [1] 1002

# percentage of atheists, rounded to the nearest whole percent
round((nrow(us12[us12$response=="atheist",])/nrow(us12))*100,0)

## [1] 5
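
As a cross-check, the same proportion can be computed as the mean of a logical vector. The sketch below does not load atheism.RData. Instead it rebuilds a stand-in response vector from the counts that inference() reports later in this lab (50 atheists, 952 non-atheists), so `responses` is an assumed reconstruction rather than the real data.

```r
# Stand-in for us12$response, rebuilt from the reported counts (assumption:
# 50 "atheist" and 952 "non-atheist" responses, 1002 in total).
responses <- rep(c("atheist", "non-atheist"), times = c(50, 952))

# The mean of a logical vector is the proportion of TRUE values.
prop_atheist <- mean(responses == "atheist")
round(prop_atheist * 100, 2)
```

With the real data, the same idiom would be mean(us12$response == "atheist").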

Exercise 5

Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?

The sampling distribution for p_sample, taken from a sample of size n from a population with a true proportion p, is nearly normal when

the sample observations are independent and

we expect to see at least 10 successes and 10 failures in our sample, i.e. np ≥ 10 and n(1 - p) ≥ 10. This is called the success-failure condition.

The text above is taken from the book. There are two conditions that need to be met. First, the observations are independent, and secondly the success-failure condition is met.

In this case, the poll is assumed to be random and the sample makes up less than 10% of the population, so the observations can be treated as independent.

To test the success-failure condition, we don’t have the true proportion of atheists in the United States, but we can use the estimated value of 5% with n = 1002.

p * n = 50.1 > 10 and (1 - p) * n = 951.9 > 10

So, the success-failure condition is met.

Yes, I am fairly confident that the conditions are met.

n <- 1002
p <- 0.05
p * n

## [1] 50.1

(1 - p) * n

## [1] 951.9

Exercise 6

Based on the R output, what is the margin of error for the estimate of the proportion of atheists in the US in 2012?

The margin of error is 0.0135, which is half the width of the 95% confidence interval.

inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Warning: package 'BHH2' was built under R version 3.4.4

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

start <- 0.0364
end <- 0.0634
estimate <- 0.0499
print(margin1 <- estimate - start)

## [1] 0.0135

print(margin2 <- end - estimate)

## [1] 0.0135
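
The margin of error read off the interval above can also be reproduced from the formula z* x sqrt(p_hat(1 - p_hat)/n). The sketch below is a minimal check that assumes the counts printed by inference() (50 successes out of 1002) and the rounded critical value 1.96. The variable names are illustrative.

```r
# Counts taken from the inference() output above.
successes <- 50
n <- 1002

p_hat <- successes / n
se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of p_hat
z_star <- 1.96                        # rounded 97.5th percentile of N(0, 1)
me <- z_star * se                     # margin of error

round(c(p_hat = p_hat, se = se, me = me), 4)
```

Rounded to four decimals, the standard error and margin of error should agree with the 0.0069 and 0.0135 reported above.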

Exercise 7

Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference function to construct the confidence intervals.

canada12 inference result:

Single proportion – success: atheist
Summary statistics: p_hat = 0.0898 ; n = 1002
Check conditions: number of successes = 90 ; number of failures = 912
Standard error = 0.009
95 % Confidence interval = ( 0.0721 , 0.1075 )

france12 inference result:

Single proportion – success: atheist
Summary statistics: p_hat = 0.2873 ; n = 1688
Check conditions: number of successes = 485 ; number of failures = 1203
Standard error = 0.011
95 % Confidence interval = ( 0.2657 , 0.3089 )

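The margin of error for Canada is 0.0177.

The margin of error for France is 0.0216.

Both samples meet the success-failure condition, with 90 successes and 912 failures for Canada and 485 successes and 1203 failures for France. As with the United States, independence is assumed on the basis that the poll is random and each sample is less than 10% of the country’s population.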

canada12 <- subset(atheism, nationality == "Canada" & year == "2012")
france12 <- subset(atheism, nationality == "France" & year == "2012")

inference(canada12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0898 ;  n = 1002 
## Check conditions: number of successes = 90 ; number of failures = 912 
## Standard error = 0.009 
## 95 % Confidence interval = ( 0.0721 , 0.1075 )

inference(france12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.2873 ;  n = 1688 
## Check conditions: number of successes = 485 ; number of failures = 1203 
## Standard error = 0.011 
## 95 % Confidence interval = ( 0.2657 , 0.3089 )

canada12_p_hat <- 0.0898
france12_p_hat <- 0.2873
  
canada12_ci <- c( 0.0721 , 0.1075 )
france12_ci <- c( 0.2657 , 0.3089 )

print(canada12_margin <- canada12_p_hat - canada12_ci[1])

## [1] 0.0177

print(france12_margin <- france12_p_hat - france12_ci[1])

## [1] 0.0216
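
Rather than copying numbers out of the printed output, the interval, the margin of error, and the success-failure check can be computed together from the counts. The helper below, `prop_ci`, is hypothetical and is not part of the lab’s inference() function. It is a sketch of the same normal-approximation interval, using the success counts and sample sizes reported above as inputs.

```r
# Hypothetical helper (not part of the lab's inference() function):
# normal-approximation confidence interval for a single proportion.
prop_ci <- function(successes, n, conf_level = 0.95) {
  p_hat <- successes / n
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  me <- z_star * sqrt(p_hat * (1 - p_hat) / n)
  list(
    p_hat = p_hat,
    me = me,
    ci = c(p_hat - me, p_hat + me),
    # success-failure condition: at least 10 successes and 10 failures
    conditions_met = successes >= 10 && (n - successes) >= 10
  )
}

canada <- prop_ci(90, 1002)    # counts from the canada12 output
france <- prop_ci(485, 1688)   # counts from the france12 output
round(c(canada = canada$me, france = france$me), 4)
```

Returning the condition check alongside the interval keeps the "are the conditions met?" question attached to each estimate.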

Exercise 8

Describe the relationship between p and me.

n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")

The margin of error is greatest at p = 0.50 and decreases symmetrically as p approaches 0 or 1.
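
The shape of the curve follows from the p(1 - p) term, which peaks at p = 0.5. The short sketch below reuses the same n, p, and me as the plotting code and locates the maximum numerically instead of reading it off the plot.

```r
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p) / n)

p[which.max(me)]   # proportion at which the margin of error peaks
max(me)            # largest margin of error on this grid
range(me)          # smallest and largest values
```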

Exercise 9

Describe the sampling distribution of sample proportions at n=1040 and p=0.1. Be sure to note the center, spread, and shape. Hint: Remember that R has functions such as mean to calculate summary statistics.

p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

The distribution of p_hats looks approximately normal and symmetric. It is centered at a mean of 0.0997, close to the true p of 0.1, with a standard deviation of 0.0093.

round(mean(p_hats),4)

## [1] 0.0997

round(sd(p_hats),4)

## [1] 0.0093
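
The simulated center and spread can be compared with the theoretical values, p and sqrt(p(1 - p)/n). The sketch below is an assumption-laden variant of the simulation: it uses rbinom() in place of the sample() loop, and the seed and the name `p_hats_sim` are arbitrary choices made here for reproducibility.

```r
set.seed(606)  # arbitrary seed, chosen here only for reproducibility

p <- 0.1
n <- 1040

# rbinom() draws the number of "atheist" responses in each of 5000 samples,
# which has the same distribution as counting successes from sample().
p_hats_sim <- rbinom(5000, size = n, prob = p) / n

theoretical_se <- sqrt(p * (1 - p) / n)

round(c(sim_mean = mean(p_hats_sim),
        sim_sd = sd(p_hats_sim),
        theoretical_se = theoretical_se), 4)
```

The simulated standard deviation should land close to the theoretical standard error, which rounds to 0.0093.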

Exercise 10

Repeat the above simulation three more times but with modified sample sizes and proportions: for n=400 and p=0.1, n=1040 and p=0.02, and n=400 and p=0.02. Plot all four histograms together by running the par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. Describe the three new sampling distributions. Based on these limited plots, how does n appear to affect the distribution of p_hat? How does p affect the sampling distribution?

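The spread of the distribution is wider when n = 400 than when n = 1040, so a larger n gives a narrower sampling distribution. The proportion p sets the center of the distribution, and the spread is narrower at p = 0.02 than at p = 0.1 because p(1 - p) is smaller. With n = 400 and p = 0.02 the expected number of successes is only 8, so this distribution is the one most likely to depart from the normal shape.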

get.p_hats <- function(p,n){
  p_hats <- rep(0, 5000)
  
  for(i in 1:5000){
    samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
    p_hats[i] <- sum(samp == "atheist")/n
  }
  return(p_hats)
}
p_hats_1040_10perc <- get.p_hats(0.1, 1040)
p_hats_400_10perc <- get.p_hats(0.1, 400)
p_hats_1040_2perc <- get.p_hats(0.02, 1040)
p_hats_400_2perc <- get.p_hats(0.02, 400)

par(mfrow = c(2, 2))
hist(p_hats_1040_10perc, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
hist(p_hats_400_10perc, main = "p = 0.1, n = 400", xlim = c(0, 0.18))
hist(p_hats_1040_2perc, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))
hist(p_hats_400_2perc, main = "p = 0.02, n = 400", xlim = c(0, 0.18))

par(mfrow = c(1, 1))
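
The spreads seen in the four histograms can be compared against the theoretical standard error sqrt(p(1 - p)/n). The sketch below tabulates it for the four combinations. The data frame name `settings` is illustrative.

```r
# All four combinations of n and p used in the histograms above.
settings <- expand.grid(n = c(1040, 400), p = c(0.1, 0.02))

# Theoretical standard error of p_hat for each combination.
settings$se <- round(sqrt(settings$p * (1 - settings$p) / settings$n), 4)
settings
```

Reading down the table, the standard error grows when n drops from 1040 to 400 and shrinks when p moves from 0.1 to 0.02.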

Exercise 11

If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?

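For Australia, yes: the distribution of p_hats looks approximately normal, and the expected counts of 104 successes and 936 failures are well above 10. For Ecuador, the histogram also looks roughly normal, but the expected number of successes is only 0.02 x 400 = 8, which falls short of the success-failure condition, so its margin of error should be interpreted with caution.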

australia.p_hats <- get.p_hats(0.1, 1040)
ecuador.p_hats <- get.p_hats(0.02, 400)
par(mfrow = c(1, 2))
hist(australia.p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
hist(ecuador.p_hats, main = "p = 0.02, n = 400", xlim = c(0, 0.18))
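
The visual check of the histograms can be complemented by the success-failure condition from Exercise 5. The helper below, `check_success_failure`, is hypothetical. It is a sketch that treats the Table 6 point estimates as the true proportions, as the exercise asks.

```r
# Hypothetical helper: expected successes and failures for a given n and p.
check_success_failure <- function(n, p, threshold = 10) {
  expected <- c(successes = n * p, failures = n * (1 - p))
  list(expected = expected, met = all(expected >= threshold))
}

australia <- check_success_failure(1040, 0.1)
ecuador <- check_success_failure(400, 0.02)

australia$expected
australia$met
ecuador$expected
ecuador$met
```

A failed check does not by itself rule out inference, but it is a signal that the normal approximation behind the margin of error may be rough.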

On your own

The question of atheism was asked by WIN-Gallup International in a similar survey that was conducted in 2005. (We assume here that sample sizes have remained the same.) Table 4 on page 13 of the report summarizes survey results from 2005 and 2012 for 39 countries.

Answer the following two questions using the inference function. As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference.

Is there convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012?

Hint: Create a new data set for respondents from Spain. Form confidence intervals for the true proportion of atheists in both years, and determine whether they overlap.

Spain 2005 - Inference results

Single proportion – success: atheist
Summary statistics: p_hat = 0.1003 ; n = 1146
Check conditions: number of successes = 115 ; number of failures = 1031
Standard error = 0.0089
95 % Confidence interval = ( 0.083 , 0.1177 )

Spain 2012 - Inference results

Single proportion – success: atheist
Summary statistics: p_hat = 0.09 ; n = 1145
Check conditions: number of successes = 103 ; number of failures = 1042
Standard error = 0.0085
95 % Confidence interval = ( 0.0734 , 0.1065 )

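The hypotheses are H0: the proportion of atheists in Spain was the same in 2005 and 2012, and HA: the proportion changed. Both samples meet the success-failure condition, with at least 10 successes and 10 failures in each year.

The confidence interval for 2005 is ( 0.083 , 0.1177 ).

The confidence interval for 2012 is ( 0.0734 , 0.1065 ).

These confidence intervals overlap, so there isn’t convincing evidence that the atheism index in Spain changed between 2005 and 2012.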

spain05 <- subset(atheism, nationality == "Spain" & year == "2005")
spain12 <- subset(atheism, nationality == "Spain" & year == "2012")

inference(spain05$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.1003 ;  n = 1146 
## Check conditions: number of successes = 115 ; number of failures = 1031 
## Standard error = 0.0089 
## 95 % Confidence interval = ( 0.083 , 0.1177 )

inference(spain12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.09 ;  n = 1145 
## Check conditions: number of successes = 103 ; number of failures = 1042 
## Standard error = 0.0085 
## 95 % Confidence interval = ( 0.0734 , 0.1065 )

Is there convincing evidence that the United States has seen a change in its atheism index between 2005 and 2012?

United States 2005 - Inference results:

Single proportion – success: atheist
Summary statistics: p_hat = 0.01 ; n = 1002
Check conditions: number of successes = 10 ; number of failures = 992
Standard error = 0.0031
95 % Confidence interval = ( 0.0038 , 0.0161 )

United States 2012 - Inference results:

Single proportion – success: atheist
Summary statistics: p_hat = 0.0499 ; n = 1002
Check conditions: number of successes = 50 ; number of failures = 952
Standard error = 0.0069
95 % Confidence interval = ( 0.0364 , 0.0634 )

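Yes, there is convincing evidence that the atheism index in the United States changed between 2005 and 2012, because the confidence intervals do not overlap. Note that the 2005 sample has exactly 10 successes, so it only just meets the success-failure condition.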

us05 <- subset(atheism, nationality == "United States" & year == "2005")

inference(us05$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.01 ;  n = 1002 
## Check conditions: number of successes = 10 ; number of failures = 992 
## Standard error = 0.0031 
## 95 % Confidence interval = ( 0.0038 , 0.0161 )

inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )
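
Comparing two intervals for overlap is a conservative check, so a direct test of the difference in proportions can serve as a complement. The sketch below swaps in base R’s prop.test() rather than the lab’s inference() function, and feeds it the success counts and sample sizes printed above for Spain and the United States. The continuity correction is turned off here as an assumption, to stay close to the plain normal approximation.

```r
# Counts taken from the inference() output above.
# Order within each vector: 2005, 2012.
spain <- prop.test(x = c(115, 103), n = c(1146, 1145), correct = FALSE)
us <- prop.test(x = c(10, 50), n = c(1002, 1002), correct = FALSE)

# H0: the proportion of atheists is the same in both years.
# HA: the proportion differs between the two years.
c(spain = spain$p.value, us = us$p.value)
```

A p-value below 0.05 would point to a change at the 5% significance level, which can then be compared with the conclusions drawn from the interval overlap.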

If in fact there has been no change in the atheism index in the countries listed in Table 4, in how many of those countries would you expect to detect a change (at a significance level of 0.05) simply by chance?

Hint: Look in the textbook index under Type 1 error.

A Type 1 error occurs when we reject the null hypothesis (in this case, that there is no change in the atheism index) when it is in fact true. If there has been no change, then at the 5% significance level we would expect to detect a change in 5% of the countries simply by chance. With the 39 countries in Table 4, that is 0.05 x 39 ≈ 2 countries.
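
The expected number of chance detections is the significance level multiplied by the number of countries tested. The sketch below is a small arithmetic check, using the 39 countries that Table 4 is said to cover. The variable names are illustrative.

```r
alpha <- 0.05        # significance level
n_countries <- 39    # countries listed in Table 4

expected_false_positives <- alpha * n_countries
expected_false_positives
```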

Suppose you’re hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for p. How many people would you have to sample to ensure that you are within the guidelines?

Hint: Refer to your plot of the relationship between p and margin of error. Do not use the data set to answer this question.

At the 95% confidence level, the margin of error is 1.96 x sqrt(p(1-p)/n). So if the margin of error must be no greater than 1%, we need 1.96 x sqrt(p(1-p)/n) ≤ 0.01.

Since we do not know p, we can use p = 0.5, because the margin of error is largest at this value. Solving for n gives n ≥ 1.96^2 x 0.25 / 0.01^2 = 9604, so we would need to sample at least 9604 people.
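
The sample-size calculation can be sketched in R by solving the margin-of-error inequality for n. The code below assumes the rounded critical value 1.96 and the worst-case p = 0.5. The variable names are illustrative.

```r
z_star <- 1.96   # critical value for 95% confidence
p <- 0.5         # worst case: maximizes p * (1 - p)
me_max <- 0.01   # required margin of error

# Solve z_star * sqrt(p * (1 - p) / n) <= me_max for n.
n_exact <- z_star^2 * p * (1 - p) / me_max^2

# round() guards against floating-point noise before rounding up.
n_required <- ceiling(round(n_exact, 6))
n_required
```

Using p = 0.5 is deliberately conservative: any other true proportion would give a smaller margin of error at the same sample size.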