Grando 6 Lab

if (Sys.info()["sysname"] == "Windows") {
    setwd("~/Masters/DATA606/Week6/Lab/Lab6")
} else {
    setwd("~/Documents/Masters/DATA606/Week6/Lab/Lab6")
}
load("more/atheism.RData")
require(ggplot2)
## Loading required package: ggplot2

Exercise 1 - In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?

Answer:

The percentages appear to be sample statistics.

Exercise 2 - The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?

Answer:

The sampling method must be random. The sample represents less than 10% of the population, and far more than 30 responses were collected in each country. Since the report claims to describe the entire world, it is unclear whether some bias arises from not sampling the harder-to-reach countries. However, the number of surveys is extensive, so the randomness assumption appears reasonable.

Exercise 3 - What does each row of Table 6 correspond to? What does each row of atheism correspond to?

Answer:

summary(atheism)
##              nationality           response          year     
##  Pakistan          : 5409   atheist    : 5498   Min.   :2005  
##  France            : 3359   non-atheist:82534   1st Qu.:2005  
##  Korea, Rep (South): 3047                       Median :2012  
##  Ghana             : 2995                       Mean   :2009  
##  Macedonia         : 2418                       3rd Qu.:2012  
##  Peru              : 2414                       Max.   :2012  
##  (Other)           :68390

Each row of Table 6 is a different country that was sampled for the study. For each country, the sample size is reported along with the proportions of respondents who answered religious, non-religious, atheist, and uncertain.

Each row of the atheism data table is an individual survey response. For each person, the table reports the nationality, whether the person considers themselves an atheist or non-atheist, and the year the response was recorded.

Exercise 4 - Using the command below, create a new dataframe called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?

Answer:

us12 <- subset(atheism, nationality == "United States" & year == 
    "2012")
us12_atheist <- subset(atheism, nationality == "United States" & 
    year == "2012" & response == "atheist")
p_atheist_us <- nrow(us12_atheist)/nrow(us12)
p_atheist_us
## [1] 0.0499002
# using the inference function
inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
    success = "atheist")
## Warning: package 'BHH2' was built under R version 3.3.3
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

Approximately 5% (0.0499) of United States respondents to the 2012 survey answered atheist. This is consistent with Table 6, which reports 5% for “convinced atheist”.

Exercise 5 - Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?

Answer:

  1. The sample observations are independent.

Yes. The responses appear to come from a random sample, and the number of surveys is less than 10% of the population.

  2. We expect to see at least 10 successes and 10 failures in our sample, i.e. np ≥ 10 and n(1 − p) ≥ 10. This is called the success-failure condition.

Yes. There are 50 successes (atheist) and 952 failures (non-atheist).
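The success-failure check can be sketched as a small helper. The function name `check_success_failure` is hypothetical, and the counts are copied from the `inference` output rather than recomputed from `atheism`.

```r
# Hypothetical helper: TRUE when both counts reach the threshold (10 by default).
check_success_failure <- function(successes, failures, threshold = 10) {
    successes >= threshold && failures >= threshold
}

# Counts copied from the inference() output for the US in 2012.
us12_ok <- check_success_failure(successes = 50, failures = 952)
us12_ok
```

The helper only covers the second condition. Independence still has to be judged from how the survey was conducted.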

Exercise 6 - Based on the R output, what is the margin of error for the estimate of the proportion of atheists in US in 2012?

Answer:

The margin of error is:

n_us <- nrow(us12)
ME <- 1.96 * sqrt(p_atheist_us * (1 - p_atheist_us)/n_us)
ME
## [1] 0.01348211
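As a cross-check, the margin of error can also be read off the R output directly, either as half the width of the reported interval or as 1.96 times the reported standard error. This is a minimal sketch that hard-codes the rounded values printed by `inference` for the US in 2012; the variable names are illustrative only.

```r
# Values copied (already rounded to 4 decimals) from the inference() output.
ci_lower <- 0.0364
ci_upper <- 0.0634
se_reported <- 0.0069

# Half the width of the 95% confidence interval.
me_from_ci <- (ci_upper - ci_lower)/2

# Critical value times the standard error.
me_from_se <- 1.96 * se_reported

c(me_from_ci = me_from_ci, me_from_se = me_from_se)
```

Both values agree with the directly computed margin of error up to the rounding in the printed output.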

Exercise 7 - Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference function to construct the confidence intervals.

Answer:

I will select Germany and Peru for this problem. First, I will perform the analysis for Germany:

g12 <- subset(atheism, nationality == "Germany" & year == "2012")
g12_atheist <- subset(atheism, nationality == "Germany" & year == 
    "2012" & response == "atheist")

As with the US sample, it appears we can assume the samples were independent. Additionally, there are 75 successes (atheist) and 427 failures (non-atheist), so the success-failure condition is met.

The margin of error for German respondents is:

p_atheist_g <- nrow(g12_atheist)/nrow(g12)
n_g <- nrow(g12)
ME <- 1.96 * sqrt(p_atheist_g * (1 - p_atheist_g)/n_g)
ME
## [1] 0.03118495
# using the inference function
inference(g12$response, est = "proportion", type = "ci", method = "theoretical", 
    success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.1494 ;  n = 502 
## Check conditions: number of successes = 75 ; number of failures = 427 
## Standard error = 0.0159 
## 95 % Confidence interval = ( 0.1182 , 0.1806 )

We are 95% confident that the proportion of German residents who consider themselves atheists is between 0.1182 and 0.1806.

Peru:

p12 <- subset(atheism, nationality == "Peru" & year == "2012")
p12_atheist <- subset(atheism, nationality == "Peru" & year == 
    "2012" & response == "atheist")

As with the US sample, it appears we can assume the samples were independent for the respondents from Peru. Additionally, there are 36 successes (atheist) and 1171 failures (non-atheist), so the success-failure condition is met.

The margin of error for respondents from Peru is:

p_atheist_p <- nrow(p12_atheist)/nrow(p12)
n_p <- nrow(p12)
ME <- 1.96 * sqrt(p_atheist_p * (1 - p_atheist_p)/n_p)
ME
## [1] 0.009596765
# using the inference function
inference(p12$response, est = "proportion", type = "ci", method = "theoretical", 
    success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0298 ;  n = 1207 
## Check conditions: number of successes = 36 ; number of failures = 1171 
## Standard error = 0.0049 
## 95 % Confidence interval = ( 0.0202 , 0.0394 )

We are 95% confident that the proportion of Peruvian residents who consider themselves atheists is between 0.0202 and 0.0394.

Exercise 8 - Describe the relationship between p and me.

Answer:

n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")

As p increases from 0 to 0.5, the margin of error increases. As p increases from 0.5 to 1, it decreases. The increase and decrease are mirror images across the vertical line p = 0.5. The maximum margin of error occurs at p = 0.5.
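The location of the maximum can also be checked numerically instead of by eye. This is a sketch using `optimize` from base R's stats package; `me_fun` and `peak` are names I introduce for illustration, and the multiplier 2 matches the plot code above rather than the 1.96 used elsewhere.

```r
n <- 1000

# Margin of error as a function of p, same formula as in the plot.
me_fun <- function(p) 2 * sqrt(p * (1 - p)/n)

# Numerically search [0, 1] for the p that maximizes the margin of error.
peak <- optimize(me_fun, interval = c(0, 1), maximum = TRUE)
peak$maximum    # location of the maximum
peak$objective  # margin of error at that location

# Symmetry check: p and 1 - p give the same margin of error.
all.equal(me_fun(0.3), me_fun(0.7))
```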

Exercise 9 - Describe the sampling distribution of sample proportions at n=1040 and p=0.1. Be sure to note the center, spread, and shape. Hint: Remember that R has functions such as mean to calculate summary statistics.

Answer:

p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)

for (i in 1:5000) {
    samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, 
        prob = c(p, 1 - p))
    p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

Summary statistics:

summary(p_hats)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07019 0.09327 0.09904 0.09969 0.10580 0.12980
sd(p_hats)
## [1] 0.009287382

The distribution is centered at approximately 0.0997 (the mean), close to the true p = 0.1, with a standard deviation of 0.0093 and an IQR of 0.01253. The histogram is unimodal and roughly symmetric, so the sampling distribution appears approximately normal.
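The simulated center and spread can be compared with the theoretical values for a sample proportion, where the mean is p and the standard error is sqrt(p(1 − p)/n). The sketch below also shows a vectorized alternative to the loop using `rbinom`; this is a swapped-in technique, not the lab's method, and the seed and the names `se_theory` and `p_hats_binom` are my own assumptions.

```r
p <- 0.1
n <- 1040

# Theoretical standard error of the sample proportion.
se_theory <- sqrt(p * (1 - p)/n)
se_theory

# Vectorized alternative to the loop: each rbinom() draw is the number of
# "atheist" responses in one sample of size n.
set.seed(606)  # arbitrary seed, only for reproducibility
p_hats_binom <- rbinom(5000, size = n, prob = p)/n

c(mean = mean(p_hats_binom), sd = sd(p_hats_binom))
```

The simulated mean and standard deviation should land close to p and `se_theory`, though the exact values depend on the seed.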

Exercise 10 - Repeat the above simulation three more times but with modified sample sizes and proportions: for n=400 and p=0.1, n=1040 and p=0.02, and n=400 and p=0.02. Plot all four histograms together by running the par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. Describe the three new sampling distributions. Based on these limited plots, how does n appear to affect the distribution of p̂? How does p affect the sampling distribution?

Answer:

p <- 0.1
n <- 400
p_hats_2 <- rep(0, 5000)

for (i in 1:5000) {
    samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, 
        prob = c(p, 1 - p))
    p_hats_2[i] <- sum(samp == "atheist")/n
}

p <- 0.02
n <- 1040
p_hats_3 <- rep(0, 5000)

for (i in 1:5000) {
    samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, 
        prob = c(p, 1 - p))
    p_hats_3[i] <- sum(samp == "atheist")/n
}

p <- 0.02
n <- 400
p_hats_4 <- rep(0, 5000)

for (i in 1:5000) {
    samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, 
        prob = c(p, 1 - p))
    p_hats_4[i] <- sum(samp == "atheist")/n
}

par(mfrow = c(2, 2))
hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
hist(p_hats_2, main = "p = 0.1, n = 400", xlim = c(0, 0.18))
hist(p_hats_3, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))
hist(p_hats_4, main = "p = 0.02, n = 400", xlim = c(0, 0.18))

par(mfrow = c(1, 1))

As n increases, the spread of the sampling distribution decreases. As p decreases from 0.1 to 0.02, the center shifts to the new value of p and the spread (standard deviation) also decreases.
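The pattern in the histograms matches the theoretical standard error sqrt(p(1 − p)/n). A minimal sketch that tabulates it for the four scenarios follows; the data frame `scenarios` is introduced here only for illustration.

```r
# The four (p, n) combinations simulated above.
scenarios <- data.frame(p = c(0.1, 0.1, 0.02, 0.02),
                        n = c(1040, 400, 1040, 400))

# Theoretical standard error of the sample proportion for each scenario.
scenarios$se <- with(scenarios, sqrt(p * (1 - p)/n))
scenarios
```

For a fixed p the standard error is smaller at n = 1040 than at n = 400, and for a fixed n it is smaller at p = 0.02 than at p = 0.1.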

Exercise 11 - If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margin of errors, as the report does?

Answer:

au12 <- subset(atheism, nationality == "Australia" & year == 
    "2012")
au12_atheist <- subset(atheism, nationality == "Australia" & 
    year == "2012" & response == "atheist")

There are 104 successes (atheist) and 935 failures (non-atheist) among the Australian respondents. This meets the minimum requirements to proceed.

ec12 <- subset(atheism, nationality == "Ecuador" & year == "2012")
ec12_atheist <- subset(atheism, nationality == "Ecuador" & year == 
    "2012" & response == "atheist")

There are 8 successes (atheist) and 396 failures (non-atheist) among the Ecuadorian respondents. There are not enough successes to meet the minimum requirement (10 successes), so a margin of error estimate would not be appropriate for these results.
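Taking the Table 6 point estimates as the truth, as the exercise suggests, the expected number of successes and the skewness of the sampling distribution can be computed directly. This sketch uses the standard binomial skewness formula (1 − 2p)/sqrt(np(1 − p)); the helper `binom_skewness` is a hypothetical name, not part of the lab.

```r
# Skewness of a binomial count (and therefore of the sample proportion).
binom_skewness <- function(n, p) (1 - 2 * p)/sqrt(n * p * (1 - p))

australia <- c(expected_successes = 1040 * 0.1,
               skewness = binom_skewness(1040, 0.1))
ecuador <- c(expected_successes = 400 * 0.02,
             skewness = binom_skewness(400, 0.02))

rbind(australia, ecuador)
```

Ecuador's expected number of successes falls below 10 and its sampling distribution is noticeably more right-skewed, which is why the normal approximation is questionable there.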

Question 1 - Answer the following two questions using the inference function. As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference.

a. Is there convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012? Hint: Create a new data set for respondents from Spain. Form confidence intervals for the true proportion of atheists in both years, and determine whether they overlap.

Answer:

Since we have already established independence for the overall survey setup, we only need to check that each sample has at least 10 successes and 10 failures.

The hypotheses for this question are:

\[ H_0: p_{Spain\ 2012} - p_{Spain\ 2005} = 0 \\ H_A: p_{Spain\ 2012} - p_{Spain\ 2005} \neq 0 \]

sp05 <- subset(atheism, nationality == "Spain" & year == "2005")
sp05_atheist <- subset(atheism, nationality == "Spain" & year == 
    "2005" & response == "atheist")
sp12 <- subset(atheism, nationality == "Spain" & year == "2012")
sp12_atheist <- subset(atheism, nationality == "Spain" & year == 
    "2012" & response == "atheist")

There are 115 successes (atheist) and 1031 failures (non-atheist) in the 2005 data, and 103 successes (atheist) and 1042 failures (non-atheist) in the 2012 data. The success-failure condition is met for both years.

inference(sp05$response, est = "proportion", type = "ci", method = "theoretical", 
    success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.1003 ;  n = 1146 
## Check conditions: number of successes = 115 ; number of failures = 1031 
## Standard error = 0.0089 
## 95 % Confidence interval = ( 0.083 , 0.1177 )
inference(sp12$response, est = "proportion", type = "ci", method = "theoretical", 
    success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.09 ;  n = 1145 
## Check conditions: number of successes = 103 ; number of failures = 1042 
## Standard error = 0.0085 
## 95 % Confidence interval = ( 0.0734 , 0.1065 )

We are 95% confident that the proportion of Spanish residents who considered themselves atheists in 2005 is between 0.083 and 0.1177. We are also 95% confident that the proportion of Spanish residents who considered themselves atheists in 2012 is between 0.0734 and 0.1065. Since these confidence intervals overlap, there is not sufficient evidence to reject the null hypothesis. We fail to reject the null hypothesis that the proportion of atheists among Spanish respondents did not change between 2005 and 2012.
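Comparing two separate intervals is a conservative check, so as a complement I sketch a direct two-proportion test with base R's `prop.test`. This is a swapped-in technique, not the lab's `inference` function, and the counts are copied from the output above (2005 first, then 2012).

```r
# Successes and sample sizes copied from the inference() output.
# prop.test() applies a continuity correction by default.
spain_test <- prop.test(x = c(115, 103), n = c(1146, 1145))

spain_test$p.value   # two-sided p-value for H0: equal proportions
spain_test$conf.int  # 95% CI for p_2005 - p_2012
```

A p-value above 0.05 and an interval containing 0 would agree with the overlap-based conclusion.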

b. Is there convincing evidence that the United States has seen a change in its atheism index between 2005 and 2012?

Answer:

Since we have already established independence for the overall survey setup, we only need to check that each sample has at least 10 successes and 10 failures.

The hypotheses for this question are:

\[ H_0: p_{United\ States\ 2012} - p_{United\ States\ 2005} = 0 \\ H_A: p_{United\ States\ 2012} - p_{United\ States\ 2005} \neq 0 \]

us05 <- subset(atheism, nationality == "United States" & year == 
    "2005")
us05_atheist <- subset(atheism, nationality == "United States" & 
    year == "2005" & response == "atheist")
us12 <- subset(atheism, nationality == "United States" & year == 
    "2012")
us12_atheist <- subset(atheism, nationality == "United States" & 
    year == "2012" & response == "atheist")

There are 10 successes (atheist) and 992 failures (non-atheist) in the 2005 data, and 50 successes (atheist) and 952 failures (non-atheist) in the 2012 data. The success-failure condition is met for both years, although the 2005 sample only just meets it with exactly 10 successes.

inference(us05$response, est = "proportion", type = "ci", method = "theoretical", 
    success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.01 ;  n = 1002 
## Check conditions: number of successes = 10 ; number of failures = 992 
## Standard error = 0.0031 
## 95 % Confidence interval = ( 0.0038 , 0.0161 )
inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
    success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

We are 95% confident that the proportion of American residents who considered themselves atheists in 2005 is between 0.0038 and 0.0161. We are also 95% confident that the proportion of American residents who considered themselves atheists in 2012 is between 0.0364 and 0.0634. Since these confidence intervals do not overlap, there is sufficient evidence to reject the null hypothesis. We reject the null hypothesis that the proportion of atheists among American respondents did not change between 2005 and 2012.
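The hypotheses are stated in terms of the difference p(2012) − p(2005), so a confidence interval for that difference gives a more direct check than comparing two separate intervals. The sketch below computes it by hand with the unpooled standard error. It is my own addition rather than output from `inference`, and it uses the counts reported above.

```r
# Counts copied from the inference() output.
x_05 <- 10; n_05 <- 1002
x_12 <- 50; n_12 <- 1002

p_05 <- x_05/n_05
p_12 <- x_12/n_12

# Point estimate and unpooled standard error of the difference.
diff_hat <- p_12 - p_05
se_diff <- sqrt(p_05 * (1 - p_05)/n_05 + p_12 * (1 - p_12)/n_12)

# 95% confidence interval for p_2012 - p_2005.
ci_diff <- diff_hat + c(-1, 1) * 1.96 * se_diff
ci_diff
```

An interval that lies entirely above 0 supports rejecting the null hypothesis of no change.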

Question 2 - If in fact there has been no change in the atheism index in the countries listed in Table 4, in how many of those countries would you expect to detect a change (at a significance level of 0.05) simply by chance? Hint: Look in the textbook index under Type 1 error.

Answer:

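Since we are using a significance level of 0.05, we would expect to detect a change in about 5% of the countries that, in fact, had no change. Each such false detection is a Type 1 error, and 0.05 is the probability of making one in any single test.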
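To turn the 5% into a count, multiply it by the number of countries tested. I do not reproduce Table 4 here, so `n_countries` below is a hypothetical placeholder. The simulation additionally assumes the tests for different countries are independent.

```r
alpha <- 0.05
n_countries <- 40  # hypothetical placeholder; substitute the number of countries in Table 4

# Expected number of false detections when no country has truly changed.
expected_false_positives <- alpha * n_countries
expected_false_positives

# Simulation: each country's test rejects with probability alpha under H0.
set.seed(606)  # arbitrary seed, only for reproducibility
false_positives <- rbinom(10000, size = n_countries, prob = alpha)
mean(false_positives)
```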

Question 3 - Suppose you’re hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for p. How many people would you have to sample to ensure that you are within the guidelines? Hint: Refer to your plot of the relationship between p and margin of error. Do not use the data set to answer this question.

Answer:

The following derivation starts from the margin of error formula for a sample proportion and rearranges it to solve for n. The largest margin of error occurs when p = 0.5; therefore, to make a conservative estimate, we should use this proportion to ensure we sample enough people. The number of people we need to sample is:

\[ ME = z \cdot \sqrt{\frac{p(1-p)}{n}} \le 0.01 \\ 1.96 \cdot \sqrt{\frac{0.5(1-0.5)}{n}} \le 0.01 \\ 1.96^{2} \cdot \frac{0.5(1-0.5)}{n} \le 0.01^{2} \\ 1.96^{2} \cdot \frac{0.5(1-0.5)}{0.01^{2}} \le n \]

ceiling(1.96^2 * 0.5 * (1 - 0.5)/0.01^2)
## [1] 9604
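The same calculation can be wrapped in a small function to try other margins of error or confidence levels. `required_n` is a hypothetical helper of my own. It takes the critical value from `qnorm` rather than the rounded 1.96, which still gives 9604 for the case above.

```r
# Hypothetical helper: smallest n with margin of error at most `me`.
# p = 0.5 is the conservative default (worst case for the margin of error).
required_n <- function(me, conf_level = 0.95, p = 0.5) {
    z <- qnorm(1 - (1 - conf_level)/2)  # two-sided critical value
    ceiling(z^2 * p * (1 - p)/me^2)
}

required_n(0.01)           # conservative case from this question
required_n(0.01, p = 0.1)  # fewer people needed if p were known to be near 0.1
```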