##Exercise 1: In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?
The percentages are sample statistics: they were calculated from the survey sample. The first paragraph nonetheless states them as claims about the population, because the authors consider the sample large enough and representative enough to support inferences about the population.
##Exercise 2: To title something the “Global Index of Religiosity and Atheism” and make generalizations about the global human population, we would have to assume that sampling was done across all countries. That is a reasonable reading of the title, unless “global” was defined operationally to mean something other than representing the world population. A sample covering 57 of 195 countries does not seem to be globally representative.
download.file("http://www.openintro.org/stat/data/atheism.RData", destfile = "atheism.RData")
load("atheism.RData")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##Exercise 3: What does each row of Table 6 correspond to? What does each row of atheism correspond to?
Each row of Table 6 represents the sample from one country. In the dataset ‘atheism’, each row represents an observation or case (a single respondent).
##Exercise 4: Using the command below, create a new dataframe called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?
The proportion of atheist responses in the US sample is 4.990%, which agrees with the 5% reported in Table 6 once it is rounded to one significant figure.
us12 <- subset(atheism, nationality == "United States" & year == "2012")
us12 %>%
  filter(response != 'non-atheist') %>%
  summary("atheist")
## nationality response
## Afghanistan : 0 atheist :50
## Argentina : 0 non-atheist: 0
## Armenia : 0
## Australia : 0
## Austria : 0
## Azerbaijan : 0
## Belgium : 0
## Bosnia and Herzegovina : 0
## Brazil : 0
## Bulgaria : 0
## Cameroon : 0
## Canada : 0
## China : 0
## Colombia : 0
## Czech Republic : 0
## Ecuador : 0
## Fiji : 0
## Finland : 0
## France : 0
## Georgia : 0
## Germany : 0
## Ghana : 0
## Hong Kong : 0
## Iceland : 0
## India : 0
## Iraq : 0
## Ireland : 0
## Italy : 0
## Japan : 0
## Kenya : 0
## Korea, Rep (South) : 0
## Lebanon : 0
## Lithuania : 0
## Macedonia : 0
## Malaysia : 0
## Moldova : 0
## Netherlands : 0
## Nigeria : 0
## Pakistan : 0
## Palestinian territories (West Bank and Gaza): 0
## Peru : 0
## Poland : 0
## Romania : 0
## Russian Federation : 0
## Saudi Arabia : 0
## Serbia : 0
## South Africa : 0
## South Sudan : 0
## Spain : 0
## Sweden : 0
## Switzerland : 0
## Tunisia : 0
## Turkey : 0
## Ukraine : 0
## United States :50
## Uzbekistan : 0
## Vietnam : 0
## year
## Min. :2012
## 1st Qu.:2012
## Median :2012
## Mean :2012
## 3rd Qu.:2012
## Max. :2012
us12 %>%
  summary("atheist")
## nationality response
## Afghanistan : 0 atheist : 50
## Argentina : 0 non-atheist:952
## Armenia : 0
## Australia : 0
## Austria : 0
## Azerbaijan : 0
## Belgium : 0
## Bosnia and Herzegovina : 0
## Brazil : 0
## Bulgaria : 0
## Cameroon : 0
## Canada : 0
## China : 0
## Colombia : 0
## Czech Republic : 0
## Ecuador : 0
## Fiji : 0
## Finland : 0
## France : 0
## Georgia : 0
## Germany : 0
## Ghana : 0
## Hong Kong : 0
## Iceland : 0
## India : 0
## Iraq : 0
## Ireland : 0
## Italy : 0
## Japan : 0
## Kenya : 0
## Korea, Rep (South) : 0
## Lebanon : 0
## Lithuania : 0
## Macedonia : 0
## Malaysia : 0
## Moldova : 0
## Netherlands : 0
## Nigeria : 0
## Pakistan : 0
## Palestinian territories (West Bank and Gaza): 0
## Peru : 0
## Poland : 0
## Romania : 0
## Russian Federation : 0
## Saudi Arabia : 0
## Serbia : 0
## South Africa : 0
## South Sudan : 0
## Spain : 0
## Sweden : 0
## Switzerland : 0
## Tunisia : 0
## Turkey : 0
## Ukraine : 0
## United States :1002
## Uzbekistan : 0
## Vietnam : 0
## year
## Min. :2012
## 1st Qu.:2012
## Median :2012
## Mean :2012
## 3rd Qu.:2012
## Max. :2012
# proportion (not percentage) of atheist responses: 50 of 1002 respondents
us_atheist_proportion <- 50 / 1002
us_atheist_proportion
## [1] 0.0499002
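The same proportion can also be computed directly from the response column instead of reading counts off the summary output. The sketch below is only an illustration: it rebuilds a stand-in `response` vector from the counts printed above (50 atheist, 952 non-atheist) rather than loading the real data, and `p_hat` is a name introduced here.

```r
# Stand-in for us12$response, rebuilt from the counts in the summary output
# (assumption: 50 "atheist" and 952 "non-atheist" rows, as printed above).
response <- factor(c(rep("atheist", 50), rep("non-atheist", 952)))

# The mean of a logical vector is the proportion of TRUE values.
p_hat <- mean(response == "atheist")
p_hat

# Equivalent view: relative frequencies for both categories.
prop.table(table(response))
```

With the real data loaded, `mean(us12$response == "atheist")` should give the same value.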
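##Exercise 5: Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?
The conditions are (1) independence: the observations come from a random sample, and 1002 respondents is well under 10% of the U.S. population; and (2) success-failure: the sample contains at least 10 successes and at least 10 failures (here 50 atheist and 952 non-atheist responses). The success-failure condition is clearly met, and independence holds provided the poll was a random sample. We are 95% confident that 3.64% to 6.34% of the U.S. population in 2012 identified as atheist.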
inference(us12$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist
## Summary statistics:
## p_hat = 0.0499 ; n = 1002
## Check conditions: number of successes = 50 ; number of failures = 952
## Standard error = 0.0069
## 95 % Confidence interval = ( 0.0364 , 0.0634 )
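As a cross-check, the interval can be reproduced by hand from the counts in the output above. This is a minimal sketch that assumes `method = "theoretical"` corresponds to the usual normal-approximation interval, p_hat ± z* × SE; that assumption is not stated in the lab, although the result agrees with the printed interval to four decimals.

```r
# Counts taken from the inference() output above.
n <- 1002
successes <- 50
failures <- n - successes

# Success-failure condition: both counts should be at least 10.
conditions_met <- successes >= 10 && failures >= 10

# Point estimate, standard error, and 95% critical value (about 1.96).
p_hat <- successes / n
se <- sqrt(p_hat * (1 - p_hat) / n)
z_star <- qnorm(0.975)

# Confidence interval: estimate plus or minus the margin of error.
ci <- p_hat + c(-1, 1) * z_star * se
round(ci, 4)
```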
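##Exercise 6: Using the approximation ME = 2*SE, the margin of error for this 95% confidence interval is 1.38%. With the critical value 1.96 in place of 2 it is about 1.35%, which matches the half-width of the interval reported above.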
SE <- 0.0069
ME <- 2 * SE
ME
## [1] 0.0138
##Exercise 7: Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference function to construct the confidence intervals.
CI (France 2012): We are 95% confident that 26.57% to 30.89% of the French identify as atheists, if the sample is representative of the population. The margin of error (half the interval width) is about 2.16%.
CI (South Korea 2012): We are 95% confident that 13.24% to 16.83% of South Koreans identify as atheists, if the sample is representative of the population. The margin of error is about 1.80%.
Both samples meet the success-failure condition: 485 successes and 1203 failures for France, and 229 successes and 1294 failures for South Korea, all well above 10.
france12 <- subset(atheism, nationality == "France" & year == "2012")
southkorea12 <- subset(atheism, nationality == "Korea, Rep (South)" & year == "2012")
inference(france12$response, est = "proportion", type = "ci", method = "theoretical",
          success = "atheist")
## Single proportion -- success: atheist
## Summary statistics:
## p_hat = 0.2873 ; n = 1688
## Check conditions: number of successes = 485 ; number of failures = 1203
## Standard error = 0.011
## 95 % Confidence interval = ( 0.2657 , 0.3089 )
inference(southkorea12$response, est = "proportion", type = "ci", method = "theoretical",
          success = "atheist")
## Single proportion -- success: atheist
## Summary statistics:
## p_hat = 0.1504 ; n = 1523
## Check conditions: number of successes = 229 ; number of failures = 1294
## Standard error = 0.0092
## 95 % Confidence interval = ( 0.1324 , 0.1683 )
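The margins of error and condition checks for both countries can be tabulated in one step from the counts printed above. The sketch below assumes the usual normal-approximation formula (z* × SE); the data frame `samples` and its column names are introduced here for illustration only.

```r
# Counts taken from the two inference() outputs above.
samples <- data.frame(
  country   = c("France", "Korea, Rep (South)"),
  successes = c(485, 229),
  n         = c(1688, 1523)
)

z_star <- qnorm(0.975)  # 95% critical value, about 1.96

# Vectorised arithmetic handles both countries at once.
samples$p_hat <- samples$successes / samples$n
samples$se    <- sqrt(samples$p_hat * (1 - samples$p_hat) / samples$n)
samples$me    <- z_star * samples$se
samples$lower <- samples$p_hat - samples$me
samples$upper <- samples$p_hat + samples$me

# Success-failure condition for each row.
samples$conditions_met <- samples$successes >= 10 &
  (samples$n - samples$successes) >= 10

samples
```

Keeping the counts in a data frame makes it straightforward to add further countries as extra rows.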
##Exercise 8: Describe the relationship between p and me. The plot of me against p is a symmetric arch: on [0, 0.5) ME is increasing, on (0.5, 1] ME is decreasing, and the maximum ME occurs at p = 0.5. This makes sense from the formula: p(1 - p) is a parabola in p that peaks at p = 0.5, and ME is proportional to its square root, so the curve looks roughly parabolic without being an exact parabola. Intuitively, the closer the two outcomes are to being equally likely, the more the responses vary from sample to sample and the larger the margin of error; as one outcome becomes much more likely than the other, the margin of error shrinks.
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt((p * (1 - p))/n)
plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")
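The description of the curve can be checked numerically with the same vectors used for the plot. This is a small sketch; `p_at_max`, `me_max` and `is_symmetric` are names introduced here.

```r
# Same setup as the plot above.
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p) / n)

# Location and size of the largest margin of error.
p_at_max <- p[which.max(me)]
me_max <- max(me)

# Symmetry about p = 0.5: the curve read backwards is the same curve.
is_symmetric <- isTRUE(all.equal(me, rev(me)))

c(p_at_max = p_at_max, me_max = me_max)
```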
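##Exercise 9: Describe the sampling distribution of sample proportions at n=1040 and p=0.1. Be sure to note the center, spread, and shape. Hint: Remember that R has functions such as mean to calculate summary statistics.
The sampling distribution looks approximately normal (unimodal and symmetric), with its center at about 0.0997, very close to p = 0.1, and a standard deviation of about 0.0093. These values change slightly from run to run because the simulation is random.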
p <- 0.10
n <- 1040
p_hats <- rep(0, 5000)  # storage for 5000 simulated sample proportions
for (i in 1:5000) {
  # draw one sample of size n and record its proportion of atheists
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1 - p))
  p_hats[i] <- sum(samp == "atheist") / n
}
hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
summary(p_hats)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07019 0.09327 0.09904 0.09969 0.10577 0.12981
s <- sd(p_hats)
s
## [1] 0.009287382
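# compute the center and spread from the simulation instead of hard-coding them;
# the names avoid masking the built-in sd() function
p_hat_mean <- mean(p_hats)
p_hat_sd <- sd(p_hats)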
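The simulated center and spread can be compared with the values theory predicts for a sample proportion: mean p and standard error sqrt(p(1 - p)/n). The sketch below is an alternative to the loop above rather than a copy of it: it uses `rbinom()` to draw the counts directly and fixes a seed, both of which are choices made here for brevity and reproducibility.

```r
set.seed(1)  # assumption: any fixed seed, used only for reproducibility

p <- 0.10
n <- 1040

# rbinom() draws the number of "atheist" responses in each of 5000 samples
# of size n; dividing by n turns the counts into sample proportions.
p_hats_sim <- rbinom(5000, size = n, prob = p) / n

# Simulated center and spread.
sim_mean <- mean(p_hats_sim)
sim_sd <- sd(p_hats_sim)

# Theoretical center and standard error of the sample proportion.
theory_mean <- p
theory_se <- sqrt(p * (1 - p) / n)

round(c(sim_mean = sim_mean, theory_mean = theory_mean,
        sim_sd = sim_sd, theory_se = theory_se), 5)
```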
##Exercise 10: Repeat the above simulation three more times but with modified sample sizes and proportions: for n = 400 and p = 0.1, n = 1040 and p = 0.02, and n = 400 and p = 0.02. Plot all four histograms together by running the par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. Describe the three new sampling distributions. Based on these limited plots, how does n appear to affect the distribution of p̂? How does p affect the sampling distribution? ***(Answer below histograms)
# p = 0.1, n = 400
p <- 0.10
n <- 400
p_hats2 <- rep(0, 5000)
for (i in 1:5000) {
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1 - p))
  p_hats2[i] <- sum(samp == "atheist") / n
}
# p = 0.02, n = 1040
p <- 0.02
n <- 1040
p_hats3 <- rep(0, 5000)
for (i in 1:5000) {
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1 - p))
  p_hats3[i] <- sum(samp == "atheist") / n
}
# p = 0.02, n = 400
p <- 0.02
n <- 400
p_hats4 <- rep(0, 5000)
for (i in 1:5000) {
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1 - p))
  p_hats4[i] <- sum(samp == "atheist") / n
}
par(mfrow = c(2, 2))
hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
hist(p_hats2, main = "p = 0.1, n = 400", xlim = c(0, 0.18))
hist(p_hats3, main = "p = 0.02, n = 1040", xlim = c(0, 0.07))
hist(p_hats4, main = "p = 0.02, n = 400", xlim = c(0, 0.07))
##ex10cont: As expected, the center of each distribution is at the corresponding value of p. The smaller the n, the wider the spread; the larger the n, the narrower the spread. All distributions are roughly normal, though the two with the smaller p do not appear to be as normal as the other two. The histogram of p_hats4 does not meet the np ≥ 10 rule (0.02 * 400 = 8 < 10), but appears slightly more normal than the histogram of p_hats3.
*** Is the distribution for a large n with such a small p less normal because of the loss of variability discussed above? ***
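One way to approach this question is to tabulate, for each of the four settings, the success-failure check, the theoretical standard error, and the theoretical skewness. The sketch below uses the standard binomial skewness formula, (1 - 2p) / sqrt(np(1 - p)), which does not come from the lab itself; the data frame `settings` and its columns are names introduced here.

```r
# The four (n, p) combinations simulated above.
settings <- data.frame(
  label = c("p_hats", "p_hats2", "p_hats3", "p_hats4"),
  n     = c(1040, 400, 1040, 400),
  p     = c(0.10, 0.10, 0.02, 0.02)
)

# Expected successes and failures for the success-failure check.
settings$exp_success <- settings$n * settings$p
settings$exp_failure <- settings$n * (1 - settings$p)
settings$meets_rule  <- settings$exp_success >= 10 & settings$exp_failure >= 10

# Theoretical standard error of p-hat.
settings$se <- sqrt(settings$p * (1 - settings$p) / settings$n)

# Skewness of a binomial count (the same for the proportion):
# (1 - 2p) / sqrt(n * p * (1 - p)); 0 means perfectly symmetric.
settings$skewness <- (1 - 2 * settings$p) /
  sqrt(settings$n * settings$p * (1 - settings$p))

settings
```

Under this formula the setting with p = 0.02 and n = 400 has the largest theoretical skewness, so any impression that p_hats4 looks more normal than p_hats3 may owe something to how the histogram bins line up with the discrete values of p̂.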
##Exercise 11: If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?
If these point estimates are actually the truth, then the margin of error for a 95% CI appears to be reasonable, with 100% of Australia's simulated point estimates falling within 5-6% and 100% of Ecuador's point estimates falling within 2-3%.
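A more direct check is to ask how often a sample proportion would land within one margin of error of the true p, treating the point estimates as the truth. The sketch below computes that probability exactly from the binomial distribution; `coverage` is a hypothetical helper written for this illustration, and it uses the same ME = 2 × SE formula as Exercise 8.

```r
# Exact probability that p-hat lands within the margin of error of p,
# treating the point estimate as the true proportion (as the exercise does).
coverage <- function(n, p) {
  me <- 2 * sqrt(p * (1 - p) / n)      # same ME formula as in Exercise 8
  lower <- ceiling(n * (p - me))       # smallest count inside the band
  upper <- floor(n * (p + me))         # largest count inside the band
  pbinom(upper, n, p) - pbinom(lower - 1, n, p)
}

australia <- coverage(n = 1040, p = 0.10)
ecuador   <- coverage(n = 400,  p = 0.02)

round(c(australia = australia, ecuador = ecuador), 3)
```

Values near 0.95 would suggest the reported margins of error are sensible; note that Ecuador's expected number of successes (0.02 × 400 = 8) is below 10, so the normal approximation is less trustworthy there.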
Running a little behind on “On Your Own” section of lab. Will complete the optional section Wednesday after I have HW and Quiz submitted.