library(openintro)
download.file("http://www.openintro.org/stat/data/atheism.RData", destfile = "atheism.RData")
load("atheism.RData")
Exercise 1
In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?
The percentages in the first paragraph are sample statistics, since they are calculated from the survey respondents rather than from the whole population.
Exercise 2
The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?
We would have to assume that respondents were sampled randomly and independently, so that the sample is representative of the global population. Because access to surveys varies so much between countries and regions, I don’t think this is necessarily a reasonable assumption unless great care was taken to obtain a truly random, independent sample.
Exercise 3
What does each row of Table 6 correspond to? What does each row of atheism correspond to?
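Each row of Table 6 corresponds to a country, giving the percentage of respondents in each category (religious, not religious, convinced atheist, and unsure) along with the sample size, so the figures describe a sample rather than the country’s entire population. Each row of atheism corresponds to a single survey respondent, with that person’s nationality, response, and survey year.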
Exercise 4
Using the command below, create a new dataframe called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?
us2012 <- subset(atheism, nationality == "United States" & year == "2012")
nrow(us2012)
## [1] 1002
us2012.atheist <- subset(us2012, response == "atheist" )
nrow(us2012.atheist)
## [1] 50
us2012.nonatheist <- subset(us2012, response == "non-atheist")
nrow(us2012.nonatheist)
## [1] 952
p.atheist <- nrow(us2012.atheist)/nrow(us2012)
p.atheist
## [1] 0.0499002
The sample proportion is 0.0499, which rounds to 0.05 and agrees with the 5% reported in Table 6.
Exercise 5
Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?
inference (us2012$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist
## Summary statistics:

## p_hat = 0.0499 ; n = 1002
## Check conditions: number of successes = 50 ; number of failures = 952
## Standard error = 0.0069
## 95 % Confidence interval = ( 0.0364 , 0.0634 )
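The conditions are that the observations are independent and that the sampling distribution of the sample proportion is approximately normal (the success-failure condition). Independence is reasonable provided the respondents were randomly sampled, since 1,002 people is far less than 10% of the US population. The success-failure condition holds because there are 50 successes and 952 failures, both at least 10.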
Exercise 6
Based on the R output, what is the margin of error for the estimate of the proportion of atheists in the US in 2012?
Margin of error = 1.96 × standard error = 1.96 × 0.0069 ≈ 0.0135
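As a cross-check, the margin of error can be rebuilt from the counts in the output instead of the rounded standard error. This is a minimal sketch that assumes only the 50 successes and n = 1002 reported above; the variable names are illustrative.

```r
# Counts taken from the inference() output above
successes <- 50
n <- 1002

p_hat <- successes / n
se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion
z_star <- qnorm(0.975)                # critical value for 95% confidence, about 1.96
me <- z_star * se                     # margin of error
ci <- c(lower = p_hat - me, upper = p_hat + me)

round(c(p_hat = p_hat, se = se, me = me), 4)
round(ci, 4)
```

Using the unrounded standard error gives a margin of error slightly below 0.013524, which is why the interval endpoints match the reported (0.0364, 0.0634) after rounding.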
Exercise 7
Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference function to construct the confidence intervals.
fiji2012 <- subset(atheism, nationality == "Fiji" & year == "2012")
inference(fiji2012$response, est = "proportion", type = "ci", method = "theoretical",
success = "atheist")
## Single proportion -- success: atheist
## Summary statistics:

## p_hat = 0.0098 ; n = 1018
## Check conditions: number of successes = 10 ; number of failures = 1008
## Standard error = 0.0031
## 95 % Confidence interval = ( 0.0038 , 0.0159 )
cameroon2012 <- subset(atheism, nationality == "Cameroon" & year == "2012")
inference(cameroon2012$response, est = "proportion", type = "ci", method = "theoretical",
success = "atheist")
## Single proportion -- success: atheist
## Summary statistics:

## p_hat = 0.0298 ; n = 504
## Check conditions: number of successes = 15 ; number of failures = 489
## Standard error = 0.0076
## 95 % Confidence interval = ( 0.0149 , 0.0446 )
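The margins of error the exercise asks for can be read off as half the width of each reported interval, and the success-failure condition can be checked from the reported counts. The sketch below assumes only the numbers printed in the two outputs above; the object names are illustrative.

```r
# 95% intervals copied from the inference() output above
ci <- rbind(
  Fiji     = c(lower = 0.0038, upper = 0.0159),
  Cameroon = c(lower = 0.0149, upper = 0.0446)
)

# The margin of error is half the width of the interval
me <- (ci[, "upper"] - ci[, "lower"]) / 2
round(me, 4)

# Success-failure check: both counts should be at least 10
counts <- rbind(
  Fiji     = c(successes = 10, failures = 1008),
  Cameroon = c(successes = 15, failures = 489)
)
meets_condition <- apply(counts >= 10, 1, all)
meets_condition
```

The margins of error come out to roughly 0.006 for Fiji and 0.015 for Cameroon. Fiji has exactly 10 successes, so it only just meets the success-failure condition and its interval should be read with some caution.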
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")

Exercise 8
Describe the relationship between p and me.
The relationship between p and me is an inverted U that is symmetric about p = 0.5: the margin of error is 0 at p = 0, increases to its maximum at p = 0.5, and then decreases back to 0 at p = 1.
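The shape of the curve can be confirmed numerically with the same n, p, and me as in the plotting code. This is a small sketch; p_at_max and max_me are names introduced here for illustration.

```r
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p) / n)

p_at_max <- p[which.max(me)]   # proportion with the largest margin of error
max_me <- max(me)
c(p_at_max = p_at_max, max_me = max_me)

# Symmetry: the margin of error at p equals the margin of error at 1 - p
is_symmetric <- isTRUE(all.equal(me, rev(me)))
is_symmetric
```

The maximum falls at p = 0.5 because p(1 - p) is largest there, which is why 0.5 is the conservative choice when p is unknown.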
p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)
for(i in 1:5000){
samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
p_hats[i] <- sum(samp == "atheist")/n
}
hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

Exercise 9
Describe the sampling distribution of sample proportions at n=1040 and p=0.1. Be sure to note the center, spread, and shape. Hint: Remember that R has functions such as mean to calculate summary statistics.
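This sampling distribution is approximately symmetric and unimodal (nearly normal), centered at about 0.10. Its spread is small: the theoretical standard error is sqrt(0.1 × 0.9 / 1040) ≈ 0.0093, so most sample proportions fall between roughly 0.08 and 0.12.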
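The center and spread can be summarized with mean and sd, as the hint suggests. The sketch below is self-contained and, as a substitution for the sample() loop above, draws the counts with rbinom(); the seed is arbitrary and only there for reproducibility.

```r
set.seed(1)   # arbitrary seed so the run is reproducible
p <- 0.1
n <- 1040

# 5000 simulated sample proportions; rbinom() replaces the sample() loop
p_hats <- rbinom(5000, size = n, prob = p) / n

c(center = mean(p_hats), spread = sd(p_hats))
theoretical_se <- sqrt(p * (1 - p) / n)   # standard error predicted by theory
theoretical_se
```

The simulated standard deviation should land close to the theoretical standard error, with small differences from run to run.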
Exercise 10
Repeat the above simulation three more times but with modified sample sizes and proportions: for n=400 and p=0.1, n=1040 and p=0.02, and n=400 and p=0.02. Plot all four histograms together by running the par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. Describe the three new sampling distributions. Based on these limited plots, how does n appear to affect the distribution of p̂ ? How does p affect the sampling distribution?
# p = 0.1, n = 400
p <- 0.1
n <- 400
phats <- rep(0, 5000)
for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  phats[i] <- sum(samp == "atheist")/n
}
# p = 0.02, n = 400
p <- 0.02
n <- 400
phats1 <- rep(0, 5000)
for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  phats1[i] <- sum(samp == "atheist")/n
}
# p = 0.02, n = 1040
p <- 0.02
n <- 1040
phats2 <- rep(0, 5000)
for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  phats2[i] <- sum(samp == "atheist")/n
}
# Plot all four histograms on a shared x-axis scale
par(mfrow = c(2, 2))
hist(phats2, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))
# p_hats comes from the earlier simulation with p = 0.1 and n = 1040
hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
hist(phats, main = "p = 0.1, n = 400", xlim = c(0, 0.18))
hist(phats1, main = "p = 0.02, n = 400", xlim = c(0, 0.18))

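The two distributions with n = 1040 have a smaller spread than those with n = 400. For a fixed n, the larger p (0.1) gives a larger spread than p = 0.02, because p(1 - p) grows as p moves toward 0.5. The p = 0.02 distributions also sit close to 0 and are right-skewed, most noticeably at n = 400.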
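The effect of n and p on the spread can also be checked against the theoretical standard error sqrt(p(1 - p)/n) for the four settings. This is a brief sketch; the settings data frame is introduced here for illustration.

```r
# All four combinations of n and p used in the simulations
settings <- expand.grid(n = c(1040, 400), p = c(0.1, 0.02))

# Theoretical standard error of the sample proportion for each combination
settings$se <- with(settings, sqrt(p * (1 - p) / n))
settings
```

Within each value of p the standard error is smaller for n = 1040, and within each value of n it is smaller for p = 0.02.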
Exercise 11
If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?
We can proceed with inference and report a margin of error for Australia, because the expected counts (1040 × 0.1 = 104 atheists and 936 non-atheists) are both well above 10 and the sampling distribution is nearly normal. We cannot do this for Ecuador, where the expected number of atheists is only 400 × 0.02 = 8, so the success-failure condition fails and the sampling distribution is skewed.
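The success-failure check for the two countries can be written as a small function. meets_success_failure is a hypothetical helper defined here, and the threshold of 10 is the usual rule of thumb rather than anything specific to the report.

```r
# Hypothetical helper: TRUE when both expected counts are at least `threshold`
meets_success_failure <- function(n, p, threshold = 10) {
  n * p >= threshold && n * (1 - p) >= threshold
}

australia_ok <- meets_success_failure(1040, 0.10)  # 104 expected atheists
ecuador_ok <- meets_success_failure(400, 0.02)     # 8 expected atheists
c(Australia = australia_ok, Ecuador = ecuador_ok)
```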
On your own 1
Answer the following two questions using the inference function. As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference.
- Is there convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012? Hint: Create a new data set for respondents from Spain. Form confidence intervals for the true proportion of atheists in both years, and determine whether they overlap.
spain2012 <- subset(atheism, nationality == "Spain" & year == "2012")
inference(spain2012$response, est = "proportion", type = "ci", method = "theoretical",
success = "atheist")
## Single proportion -- success: atheist
## Summary statistics:

## p_hat = 0.09 ; n = 1145
## Check conditions: number of successes = 103 ; number of failures = 1042
## Standard error = 0.0085
## 95 % Confidence interval = ( 0.0734 , 0.1065 )
spain2005 <- subset(atheism, nationality == "Spain" & year == 2005)
nrow(spain2005)
## [1] 1146
nrow(spain2012)
## [1] 1145
spain2005.atheist <- subset(spain2005, response == "atheist")
nrow(spain2005.atheist)
## [1] 115
spain2012.atheist <- subset(spain2012, response == "atheist")
nrow(spain2012.atheist)
## [1] 103
inference(spain2005$response, est = "proportion", type = "ci", method = "theoretical",
success = "atheist")
## Single proportion -- success: atheist
## Summary statistics:

## p_hat = 0.1003 ; n = 1146
## Check conditions: number of successes = 115 ; number of failures = 1031
## Standard error = 0.0089
## 95 % Confidence interval = ( 0.083 , 0.1177 )
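The hypotheses are H0: the proportion of atheists in Spain was the same in 2005 and 2012, and HA: the proportion changed. Both samples meet the success-failure condition (115 successes and 1,031 failures in 2005; 103 and 1,042 in 2012). The 95% confidence intervals, (0.083, 0.1177) for 2005 and (0.0734, 0.1065) for 2012, overlap, so there is no convincing evidence of a change.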
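Checking for overlapping intervals is a conservative approach. As an alternative that the lab itself does not use, base R's prop.test() can compare the two proportions directly. The sketch below assumes only the counts reported in the output above.

```r
# Atheist counts and sample sizes from the output above (2005, then 2012)
atheists <- c(115, 103)
sample_sizes <- c(1146, 1145)

# Two-sample test of equal proportions, without continuity correction
result <- prop.test(atheists, sample_sizes, correct = FALSE)
result$p.value    # compare with the 0.05 significance level
result$conf.int   # 95% interval for the difference in proportions
```

A p-value above 0.05, together with an interval for the difference that contains 0, points to the same conclusion as the overlapping intervals.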
- Is there convincing evidence that the United States has seen a change in its atheism index between 2005 and 2012?
The code for this step prevented the lab from publishing, so it had to be removed. This is the output it produced, first for 2012 and then for 2005:
## Single proportion -- success: atheist
## Summary statistics:
## p_hat = 0.0499 ; n = 1002
## Check conditions: number of successes = 50 ; number of failures = 952
## Standard error = 0.0069
## 95 % Confidence interval = ( 0.0364 , 0.0634 )
## Single proportion -- success: atheist
## Summary statistics:
## p_hat = 0.01 ; n = 1002
## Check conditions: number of successes = 10 ; number of failures = 992
## Standard error = 0.0031
## 95 % Confidence interval = ( 0.0038 , 0.0161 )
This time the 95% confidence intervals, (0.0364, 0.0634) for 2012 and (0.0038, 0.0161) for 2005, do not overlap, so there is convincing evidence that the proportion of atheists in the US changed between 2005 and 2012.
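The overlap check can be reproduced from the reported counts alone. This is a sketch rather than the removed code: wald_ci is a hypothetical helper that applies the usual normal-approximation interval, and the counts (50 and 10 atheists out of 1002) are taken from the output above.

```r
# Hypothetical helper: normal-approximation (Wald) interval for one proportion
wald_ci <- function(successes, n, conf_level = 0.95) {
  p_hat <- successes / n
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  me <- z_star * sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - me, upper = p_hat + me)
}

us2012_ci <- wald_ci(50, 1002)
us2005_ci <- wald_ci(10, 1002)

# Two intervals overlap only if each one starts before the other ends
overlap <- us2012_ci[["lower"]] <= us2005_ci[["upper"]] &&
  us2005_ci[["lower"]] <= us2012_ci[["upper"]]
overlap
```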
On your own 2
If in fact there has been no change in the atheism index in the countries listed in Table 4, in how many of those countries would you expect to detect a change (at a significance level of 0.05) simply by chance? Hint: Look in the textbook index under Type 1 error.
There are 39 countries in Table 4, so by chance alone we would expect 39 × 0.05 = 1.95, or about 2 countries, to show a change. Each of these would be a Type 1 error.
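The same arithmetic in R, together with the chance of seeing at least one false detection. The second quantity is an addition to the exercise and assumes the 39 tests are independent, which the report does not establish.

```r
countries <- 39
alpha <- 0.05

# Expected number of Type 1 errors when no country has truly changed
expected_false_positives <- countries * alpha
expected_false_positives

# Probability of at least one Type 1 error, assuming independent tests
p_at_least_one <- 1 - (1 - alpha)^countries
p_at_least_one
```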
On your own 3
Suppose you’re hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for p. How many people would you have to sample to ensure that you are within the guidelines?
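zscore <- 1.96
ME <- 0.01  # a 1% margin of error, written as a proportion
P <- 0.5    # most conservative choice when p is unknown
N <- (zscore^2) * P * (1 - P) / ME^2
N
## [1] 9604
Our sample size would have to be at least 9,604 people to keep the margin of error within 1% at 95% confidence.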
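The calculation generalizes to any margin of error and confidence level. required_n below is a hypothetical helper, not part of the lab; it uses qnorm() for the critical value instead of the rounded 1.96 and rounds up to a whole person. Note that a 1% margin of error must be entered as 0.01.

```r
# Hypothetical helper: smallest n that keeps the margin of error at or below `me`
required_n <- function(me, p = 0.5, conf_level = 0.95) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  ceiling(z_star^2 * p * (1 - p) / me^2)
}

n_one_percent <- required_n(0.01)   # margin of error of 1%
n_ten_percent <- required_n(0.1)    # margin of error of 10%, for comparison
c(n_one_percent, n_ten_percent)
```

Cutting the margin of error by a factor of 10 raises the required sample size by a factor of about 100, since n scales with 1/me^2.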
