setwd("C:/Users/Robert/Documents/R/win-library/3.2/IS606/labs/Lab6")
load("more/atheism.RData")
===============================================
Exercise 1
In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?
Sample statistics. These percentages are computed from survey samples and serve as estimates of the population parameters.
===============================================
Exercise 2
The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?
We have to assume that the samples were truly random and representative of each country's population. Given the different modes of communication and survey methods inherent in a global survey, it is difficult to assume consistency. The results are more likely a patchwork of different surveys, so the assumption is questionable.
===============================================
Exercise 3
What does each row of Table 6 correspond to? What does each row of atheism correspond to?
Each row of Table 6 corresponds to a country, and the percentages in that row (including the share of respondents who answered “yes” to being an atheist) sum to 100%. Each row of atheism corresponds to a single respondent.
===============================================
Exercise 4
Using the command below, create a new dataframe called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?
us12 <- subset(atheism, nationality == "United States" & year == "2012")
us12atheist<-subset(us12,response=="atheist")
nrow(us12atheist)/nrow(us12)
## [1] 0.0499002
The proportion rounds to 5%, which agrees with Table 6.
===============================================
Exercise 5
Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?
The observations must be independent, and the success-failure condition requires at least 10 successes (atheist) and at least 10 failures (non-atheist). The sample of about 1000 respondents is well under 10% of the US population, so independence is reasonable if the sample was random, and both counts are well over 10.
suppressWarnings(library(BHH2))
inference(us12$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist
## Summary statistics:
## p_hat = 0.0499 ; n = 1002
## Check conditions: number of successes = 50 ; number of failures = 952
## Standard error = 0.0069
## 95 % Confidence interval = ( 0.0364 , 0.0634 )
===============================================
Exercise 6
Based on the R output, what is the margin of error for the estimate of the proportion of atheists in the US in 2012?
#margin of error
(0.0634-0.0364)/2
## [1] 0.0135
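The same margin of error can be recovered from the summary statistics. A minimal sketch, assuming the counts printed in the output above (50 atheists out of 1002 respondents) and a critical value of 1.96. The variable names are illustrative only:

```r
# Counts taken from the inference() output above
successes <- 50
n <- 1002

p_hat <- successes / n                     # sample proportion
se <- sqrt(p_hat * (1 - p_hat) / n)        # standard error of p_hat
me <- 1.96 * se                            # margin of error at 95% confidence

round(c(p_hat = p_hat, se = se, me = me), 4)
c(lower = p_hat - me, upper = p_hat + me)  # compare with the reported interval
```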
===============================================
Exercise 7
Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference function to construct the confidence intervals.
For the following:
Ireland has a success/failure count above 10 and a sample size of 1010.
Brazil has a success/failure count above 10 and a sample size of 2002.
#Ireland
ireland<-subset(atheism,nationality=="Ireland"& year =="2012")
inference(ireland$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist
## Summary statistics:
## p_hat = 0.099 ; n = 1010
## Check conditions: number of successes = 100 ; number of failures = 910
## Standard error = 0.0094
## 95 % Confidence interval = ( 0.0806 , 0.1174 )
#Brazil
brazil<-subset(atheism,nationality=="Brazil"& year =="2012")
inference(brazil$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist
## Summary statistics:
## p_hat = 0.01 ; n = 2002
## Check conditions: number of successes = 20 ; number of failures = 1982
## Standard error = 0.0022
## 95 % Confidence interval = ( 0.0056 , 0.0143 )
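The exercise also asks for the margins of error. As a sketch, they can be read off as half the width of each interval printed above (the helper name half_width is hypothetical):

```r
# Half the width of a confidence interval is its margin of error
half_width <- function(lower, upper) (upper - lower) / 2

me_ireland <- half_width(0.0806, 0.1174)  # interval from the Ireland output
me_brazil  <- half_width(0.0056, 0.0143)  # interval from the Brazil output

me_ireland
me_brazil
```

This gives roughly 1.8 percentage points for Ireland and 0.4 percentage points for Brazil.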
===============================================
Exercise 8
Describe the relationship between p and me.
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")
The margin of error is largest when p = 0.5 and shrinks symmetrically toward zero as p moves away from an even split toward 0 or 1.
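The shape of the curve can also be checked numerically. A small sketch that reuses the same n, p, and me definitions as the plot code and looks up where the margin of error peaks:

```r
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p) / n)

p[which.max(me)]  # proportion at which the margin of error peaks
max(me)           # largest margin of error for n = 1000
range(me)         # smallest and largest values over the grid
```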
===============================================
Exercise 9
Describe the sampling distribution of sample proportions at n = 1040 and p = 0.1. Be sure to note the center, spread, and shape. Hint: Remember that R has functions such as mean to calculate summary statistics.
p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)
for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}
hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
summary(p_hats)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07019 0.09327 0.09904 0.09969 0.10580 0.12980
max(p_hats)-min(p_hats)
## [1] 0.05961538
The distribution is centered around 0.10, appears to be very nearly normal, and has a range of approximately 6.0% (0.0702 to 0.1298).
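For reference, the simulated center and spread can be compared with the theoretical values: the mean of p_hat is p and its standard error is sqrt(p(1-p)/n). A minimal sketch, assuming rbinom as a shortcut for the sampling loop above and an arbitrary seed:

```r
set.seed(606)  # arbitrary seed, only for reproducibility
p <- 0.1
n <- 1040

# Each draw is the number of "atheist" responses in one sample of size n
p_hats <- rbinom(5000, size = n, prob = p) / n

se_theory <- sqrt(p * (1 - p) / n)  # theoretical standard error

c(mean_sim = mean(p_hats), sd_sim = sd(p_hats), se_theory = se_theory)
```

The standard deviation is a more stable description of spread than the range, which depends on the two most extreme simulated samples.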
===============================================
Exercise 10
Repeat the above simulation three more times but with modified sample sizes and proportions: for n = 400 and p = 0.1, n = 1040 and p = 0.02, and n = 400 and p = 0.02. Plot all four histograms together by running the par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. Describe the three new sampling distributions. Based on these limited plots, how does n appear to affect the distribution of p_hat? How does p affect the sampling distribution?
par(mfrow = c(2,2))
#SET ONE
p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)
for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}
hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
#SET TWO
p <- 0.1
n <- 400
p_hats <- rep(0, 5000)
for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}
hist(p_hats, main = "p = 0.1, n = 400", xlim = c(0, 0.18))
#SET THREE
p <- 0.02
n <- 1040
p_hats <- rep(0, 5000)
for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}
hist(p_hats, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))
#SET FOUR
p <- 0.02
n <- 400
p_hats <- rep(0, 5000)
for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}
hist(p_hats, main = "p = 0.02, n = 400", xlim = c(0, 0.18))
#clear
par(mfrow = c(1, 1))
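The distributions remain approximately normal, although the p = 0.02, n = 400 case is slightly right-skewed because p is so close to 0. The proportion p sets the center of the distribution and also influences the spread, while a larger sample size n narrows the spread.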
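The effect of n and p on the spread follows from the standard error formula sqrt(p(1-p)/n). The table below is a sketch under that formula only, with no simulation. The column names are my own:

```r
# The four (p, n) combinations simulated above
settings <- expand.grid(n = c(1040, 400), p = c(0.1, 0.02))

# Expected number of successes and theoretical standard error
settings$expected_successes <- settings$n * settings$p
settings$se <- sqrt(settings$p * (1 - settings$p) / settings$n)

settings
```

For a fixed p the standard error is smaller at n = 1040 than at n = 400, and for a fixed n it is smaller at p = 0.02 than at p = 0.1.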
===============================================
Exercise 11
If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?
#calculate MOE to determine suitability
#Australia
australia <- subset(atheism, nationality == "Australia" & year == "2012")
australiaAtheist<-subset(australia,response=="atheist")
propAus<-nrow(australiaAtheist)/nrow(australia)
moeAus<-1.96*sqrt(propAus*(1-propAus)/nrow(australia)) #divide by the full sample size, not the atheist count
propAus
## [1] 0.1000962
moeAus
## [1] 0.01824968
#ecuador
ecuador <- subset(atheism, nationality == "Ecuador" & year == "2012")
ecuadorAtheist<-subset(ecuador,response=="atheist")
propEcua<-nrow(ecuadorAtheist)/nrow(ecuador)
moeEcua<-1.96*sqrt(propEcua*(1-propEcua)/nrow(ecuador)) #divide by the full sample size, not the atheist count
propEcua
## [1] 0.01980198
moeEcua
## [1] 0.01358553
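The margins of error differ between the two countries. With roughly 104 atheist responses, Australia comfortably meets the success-failure condition, so its sampling distribution is nearly normal and reporting a margin of error is sensible. Ecuador has only about 8 atheist responses, fewer than the 10 required, so its sampling distribution is right-skewed and a normal-based margin of error should be treated with caution.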
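Whether inference is sensible can be checked with the success-failure condition. A rough sketch, taking the Table 6 values quoted in the question as the truth (p = 0.1 with n = 1040 for Australia, p = 0.02 with n = 400 for Ecuador). The function name is hypothetical:

```r
# TRUE when both expected successes and expected failures reach `threshold`
meets_success_failure <- function(p, n, threshold = 10) {
  (n * p >= threshold) && (n * (1 - p) >= threshold)
}

meets_success_failure(p = 0.1, n = 1040)   # Australia: 104 expected successes
meets_success_failure(p = 0.02, n = 400)   # Ecuador: 8 expected successes
```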
===============================================
===============================================
1
Answer the following two questions using the inference function. As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference.
a. Is there convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012?
H0: The proportion of atheists in Spain is the same in 2005 and 2012.
HA: The proportion of atheists in Spain differs between 2005 and 2012 (in either direction).
Successes and failures are both over 10 for each year. Each sample is under 10% of the national population, so the observations can be treated as independent.
#using the custom inference function
spain2012 <- subset(atheism, nationality == "Spain" & year == "2012")
nrow(spain2012)
## [1] 1145
spain2005 <- subset(atheism, nationality == "Spain" & year == "2005")
nrow(spain2005)
## [1] 1146
inference(spain2012$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist
## Summary statistics:
## p_hat = 0.09 ; n = 1145
## Check conditions: number of successes = 103 ; number of failures = 1042
## Standard error = 0.0085
## 95 % Confidence interval = ( 0.0734 , 0.1065 )
inference(spain2005$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist
## Summary statistics:
## p_hat = 0.1003 ; n = 1146
## Check conditions: number of successes = 115 ; number of failures = 1031
## Standard error = 0.0089
## 95 % Confidence interval = ( 0.083 , 0.1177 )
There is an apparently minor decrease in atheist responses between 2005 and 2012. Because the sample sizes are nearly identical and the two 95% confidence intervals overlap substantially, the data do not provide convincing evidence of a real change. The difference is consistent with ordinary sampling variability.
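A direct test of the difference is another way to check this. The sketch below uses base R's prop.test with the counts printed above (103 of 1145 in 2012, 115 of 1146 in 2005). It is an alternative to the lab's inference function, not the method used in this report:

```r
# Atheist counts and sample sizes for Spain, 2012 then 2005
atheists <- c(103, 115)
totals <- c(1145, 1146)

# Two-sided test of H0: the two proportions are equal
spain_test <- prop.test(x = atheists, n = totals, correct = FALSE)

spain_test$p.value   # compare against the 0.05 significance level
spain_test$conf.int  # 95% interval for the difference in proportions
```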
b. Is there convincing evidence that the United States has seen a change in its atheism index between 2005 and 2012?
#using the custom inference function
usa2012 <- subset(atheism, nationality == "United States" & year == "2012")
nrow(usa2012)
## [1] 1002
usa2005 <- subset(atheism, nationality == "United States" & year == "2005")
nrow(usa2005)
## [1] 1002
inference(usa2012$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist
## Summary statistics:
## p_hat = 0.0499 ; n = 1002
## Check conditions: number of successes = 50 ; number of failures = 952
## Standard error = 0.0069
## 95 % Confidence interval = ( 0.0364 , 0.0634 )
inference(usa2005$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist
## Summary statistics:
## p_hat = 0.01 ; n = 1002
## Check conditions: number of successes = 10 ; number of failures = 992
## Standard error = 0.0031
## 95 % Confidence interval = ( 0.0038 , 0.0161 )
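H0: The proportion of atheists in the United States is the same in 2005 and 2012. HA: The proportion differs between 2005 and 2012.
Successes and failures are at least 10 for each year, although 2005 has exactly 10 successes and only just meets the condition. Each sample is under 10% of the national population, so the observations can be treated as independent.
In contrast to Spain, the data for the United States suggest a real increase in atheist responses, as the 95% confidence intervals for 2005 (0.0038, 0.0161) and 2012 (0.0364, 0.0634) do not overlap.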
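The size of the change can be summarized with a confidence interval for the difference in proportions. A minimal sketch computed by hand from the counts printed above (50 of 1002 in 2012, 10 of 1002 in 2005), assuming the unpooled standard error and a critical value of 1.96. The variable names are illustrative:

```r
n <- 1002
p_2012 <- 50 / n
p_2005 <- 10 / n

diff_hat <- p_2012 - p_2005  # observed change in the proportion
se_diff <- sqrt(p_2012 * (1 - p_2012) / n + p_2005 * (1 - p_2005) / n)

ci_diff <- diff_hat + c(-1, 1) * 1.96 * se_diff
ci_diff  # an interval that excludes 0 points to a real change
```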
===============================================
2
If in fact there has been no change in the atheism index in the countries listed in Table 4, in how many of those countries would you expect to detect a change (at a significance level of 0.05) simply by chance?
At a significance level of 0.05, each test has a 5% chance of a Type I error, that is, of detecting a change when there is none. So we would expect about 5% of the countries listed in Table 4 to show a significant change simply by chance.
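A quick simulation can illustrate the 5% false-positive rate. This is only a sketch: it assumes a hypothetical common proportion of 0.1, samples of 1000 per year, and a pooled two-proportion z-test, none of which come from the report:

```r
set.seed(606)             # arbitrary seed
reps <- 5000              # number of simulated "countries"
n <- 1000                 # assumed sample size per year
p <- 0.1                  # assumed true proportion, identical in both years

x1 <- rbinom(reps, n, p)  # atheist counts in the first year
x2 <- rbinom(reps, n, p)  # atheist counts in the second year

# Pooled two-proportion z-test for each simulated country
p_pool <- (x1 + x2) / (2 * n)
z <- (x1 / n - x2 / n) / sqrt(p_pool * (1 - p_pool) * (2 / n))
p_values <- 2 * pnorm(-abs(z))

mean(p_values < 0.05)     # share of false positives, expected near 0.05
```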
===============================================
3
Suppose you’re hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for p. How many people would you have to sample to ensure that you are within the guidelines?
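If MOE is defined as MOE = 1.96 x sqrt(p*(1-p)/n), we can solve for the sample size: n = 1.96^2 * p*(1-p) / MOE^2. Since p is unknown, we use p = 0.5, which gives the largest possible margin of error, together with MOE = 0.01.
samplesize<- ((1.96^2)*0.5*0.5)/0.0001
samplesize
## [1] 9604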
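The calculation can be wrapped in a small helper. This is a sketch, not part of the lab: the function name required_n is hypothetical, it assumes the conservative choice p = 0.5 (which maximizes p(1-p)) when p is unknown, and it uses the exact critical value from qnorm rather than the rounded 1.96:

```r
# Smallest n such that z * sqrt(p * (1 - p) / n) <= me
required_n <- function(me, p = 0.5, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)  # two-sided critical value
  ceiling(z^2 * p * (1 - p) / me^2)
}

required_n(me = 0.01)            # conservative case, p = 0.5
required_n(me = 0.01, p = 0.01)  # far smaller if p were known to be 0.01
```

Plugging in a small p such as 0.01 gives a much smaller sample size, but that is only valid when p is known in advance to be that small.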