## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following objects are masked from 'package:openintro':
##
## ethanol, lsegments
##
## Attaching package: 'BSDA'
## The following object is masked from 'package:datasets':
##
## Orange
Exercise 1
Q. In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?
A. The percentages reported in the first paragraph are sample statistics: they are computed from the responses of 50,000 men and women across 57 countries, not from the entire global population.
Exercise 2
The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?
- Random sampling took place.
- The samples were independent of one another.
- The sample size is large enough for the sampling distribution to be approximately normal.
The data
download.file("http://www.openintro.org/stat/data/atheism.RData", destfile = "atheism.RData")
load("atheism.RData")
Exercise 3
What does each row of Table 6 correspond to? What does each row of atheism correspond to?
view(atheism)
us12 <- subset(atheism, nationality == "United States" & year == "2012")
Table 6 on page 15 of the study lists countries, one per row, along with the percentage of respondents in each of the categories “religious”, “nonreligious”, “convinced atheists”, and “I don’t know”.
While each row of Table 6 is a country-level summary statistic, each row of atheism corresponds to an individual respondent in a given country, with a binary response of atheist or non-atheist.
Exercise 4
Does it agree with the percentage in Table 6? If not, why?
#Using the command below, create a new dataframe called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States.
us12 <- subset(atheism, nationality == "United States" & year == "2012")
us.atheist <- subset(us12, response == "atheist")
us.nonatheist <- subset(us12, response == "non-atheist")
num.atheist <- count(us.atheist)
num.nonatheist <- count(us.nonatheist)
us.asked <- count(us12)
#Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?
prop.atheist <- (num.atheist/us.asked)
The proportion is 0.0499002, which rounds to 5% and matches the value reported in Table 6.
Inference on proportions
What we’d like, though, is insight into the population parameters. You answer the question, “What proportion of people in your sample reported being atheists?” with a statistic; while the question “What proportion of people on earth would report being atheists” is answered with an estimate of the parameter.
Exercise 5
Q. Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?
# The inferential tools for estimating population proportion are analogous to those used for means in the last chapter: the confidence interval and the hypothesis test.
inference(us12$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")
## Single proportion -- success: atheist
## Summary statistics:

## p_hat = 0.0499 ; n = 1002
## Check conditions: number of successes = 50 ; number of failures = 952
## Standard error = 0.0069
## 95 % Confidence interval = ( 0.0364 , 0.0634 )
Note that since the goal is to construct an interval estimate for a proportion, it’s necessary to specify what constitutes a “success”, which here is a response of “atheist”.
Although formal confidence intervals and hypothesis tests don’t show up in the report, suggestions of inference appear at the bottom of page 7:
“In general, the error margin for surveys of this kind is ± 3-5% at 95% confidence”
Exercise 6
Q. Based on the R output, what is the margin of error for the estimate of the proportion of atheists in the US in 2012?
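The margin of error can be read off the output in two equivalent ways: as the critical value times the standard error, or as half the width of the confidence interval. The following is a minimal sketch that uses only the numbers printed by inference() above; the variable names are illustrative and not part of the lab.

```r
# Values copied from the inference() output for the US in 2012
p_hat <- 0.0499
n     <- 1002
ci    <- c(0.0364, 0.0634)

# Route 1: critical value times the standard error
se         <- sqrt(p_hat * (1 - p_hat) / n)
me_from_se <- qnorm(0.975) * se

# Route 2: half the width of the reported interval
me_from_ci <- diff(ci) / 2

round(c(me_from_se, me_from_ci), 4)
```

Both routes give a margin of error of about 0.0135, or roughly 1.35 percentage points.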
Exercise 7
Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference function to construct the confidence intervals.
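As a cross-check on inference(), the same interval can be computed by hand. The sketch below assumes a hypothetical helper, wald_ci(), that is not part of the lab; it is applied to the Spain 2012 counts (103 atheist responses out of 1145) that appear in the contingency table later in this report.

```r
# Hypothetical helper: Wald confidence interval for one proportion.
# x = number of successes, n = sample size, conf = confidence level.
wald_ci <- function(x, n, conf = 0.95) {
  p_hat <- x / n
  se    <- sqrt(p_hat * (1 - p_hat) / n)
  me    <- qnorm(1 - (1 - conf) / 2) * se
  list(
    p_hat = p_hat,
    me = me,
    ci = c(p_hat - me, p_hat + me),
    # success-failure condition: at least 10 successes and 10 failures
    conditions_met = x >= 10 && (n - x) >= 10
  )
}

# Spain, 2012: 103 "atheist" responses out of 1145
spain12 <- wald_ci(103, 1145)
spain12
```

Returning the margin of error and the condition check alongside the interval keeps everything the exercise asks for in one place.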
How does the proportion affect the margin of error?
Imagine you’ve set out to survey 1000 people on two questions: are you female? and are you left-handed? Since both of these sample proportions were calculated from the same sample size, they should have the same margin of error, right? Wrong! While the margin of error does change with sample size, it is also affected by the proportion.
Think back to the formula for the standard error: $SE = \sqrt{p(1-p)/n}$. This is then used in the formula for the margin of error for a 95% confidence interval: $ME = 1.96 \times SE = 1.96 \times \sqrt{p(1-p)/n}$. Since the population proportion $p$ is in this ME formula, it should make sense that the margin of error is in some way dependent on the population proportion. We can visualize this relationship by creating a plot of ME vs. p.
# survey 1000 people
n <- 1000
# The first step is to make a vector p that is a sequence from 0 to 1 with each number separated by 0.01.
p <- seq(0, 1, 0.01)
#we can then create a vector of the margin of error (me) associated with each of these values of p using the familiar approximate formula (ME=2×SE).
me <- 2 * sqrt(p * (1 - p)/n)
#Lastly, we plot the two vectors against each other to reveal their relationship.
plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")

Exercise 8
Q. Describe the relationship between p and me.
A. The margin of error increases as the population proportion rises from 0 to 0.5, peaks at p = 0.5 (about 0.032 for n = 1000), and then decreases symmetrically as p approaches 1. It is zero at p = 0 and p = 1.
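The shape of the curve can also be checked numerically rather than by eye. This short sketch rebuilds the p and me vectors from the plot above and locates the peak; nothing here goes beyond base R.

```r
n  <- 1000
p  <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p) / n)

p[which.max(me)]            # proportion at which the margin of error peaks
max(me)                     # largest margin of error for n = 1000
me[c(1, length(me))]        # margin of error at p = 0 and p = 1
```

The peak sits at p = 0.5 because p(1 - p) is largest there, which is why p = 0.5 is the conservative choice when planning a sample size.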
Success-failure condition.
We can investigate the interplay between n and p and the shape of the sampling distribution by using simulations.
#To start off, we simulate the process of drawing 5000 samples of size 1040 from a population with a true atheist proportion of 0.1.
p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)
#For each of the 5000 samples we compute p̂ and then plot a histogram to visualize their distribution.
for(i in 1:5000){
samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
p_hats[i] <- sum(samp == "atheist")/n
}
hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

Exercise 9
Q. Describe the sampling distribution of sample proportions at n=1040 and p=0.1. Be sure to note the center, spread, and shape.
#Hint: Remember that R has functions such as mean to calculate summary statistics.
summary(p_hats)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07019 0.09327 0.09904 0.09969 0.10577 0.12981
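One way to describe the center and spread is to compare the simulated values with what theory predicts: a mean of p and a standard deviation of sqrt(p(1 - p)/n). The sketch below is an assumption-laden variant of the simulation above: it swaps the sample() loop for rbinom(), which draws the count of atheist responses directly, and it fixes an arbitrary seed so the numbers are reproducible.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
p <- 0.1
n <- 1040

# rbinom() draws the number of "atheist" responses in each of 5000 samples
p_hats <- rbinom(5000, size = n, prob = p) / n

# Simulated center and spread
c(center = mean(p_hats), spread = sd(p_hats))

# Values predicted by theory, for comparison
c(center = p, spread = sqrt(p * (1 - p) / n))
```

If the two pairs of numbers are close and the histogram is roughly symmetric and bell-shaped, the normal approximation is a reasonable description of this sampling distribution.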
Exercise 10
Repeat the above simulation three more times but with modified sample sizes and proportions: for n=400 and p=0.1, n=1040 and p=0.02, and n=400 and p=0.02. Plot all four histograms together by running the par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. Describe the three new sampling distributions.
Q. Based on these limited plots, how does n appear to affect the distribution of p̂? How does p affect the sampling distribution?
#1. n=400 and p=0.1,
n<-400
p<-0.1
p_hat.a<-rep(0, 5000)
for(i in 1:5000){
samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
p_hat.a[i] <- sum(samp == "atheist")/n
}
hist(p_hat.a, main = "p = 0.1, n = 400", xlim = c(0, 0.18))

#2. n=1040 and p=0.02
n<-1040
p<-0.02
p_hat.b<-rep(0, 5000)
for(i in 1:5000){
samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
p_hat.b[i] <- sum(samp == "atheist")/n
}
hist(p_hat.b, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))

#3. n=400 and p=0.02
n<-400
p<-0.02
p_hat.c<-rep(0, 5000)
for(i in 1:5000){
samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
p_hat.c[i] <- sum(samp == "atheist")/n
}
hist(p_hat.c, main = "p = 0.02, n = 400", xlim = c(0, 0.18))

#Once you’re done, you can reset the layout of the plotting window by using the par(mfrow = c(1, 1)) command or clicking on “Clear All” above the plotting window (if using RStudio). Note that the latter will get rid of all your previous plots.
par(mfrow = c(1, 1))
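The four simulations repeat the same loop with different values of n and p, so they can be folded into one function. This is a sketch only: sim_p_hats() and the scenarios data frame are hypothetical names, rbinom() is used in place of the sample() loop, and the seed is arbitrary.

```r
# Hypothetical helper: simulate `reps` sample proportions for a given n and p.
# rbinom() replaces the explicit sample() loop; the sampling model is the same.
sim_p_hats <- function(n, p, reps = 5000) {
  rbinom(reps, size = n, prob = p) / n
}

set.seed(1)  # arbitrary seed for reproducibility
scenarios <- data.frame(n = c(1040, 400, 1040, 400),
                        p = c(0.1, 0.1, 0.02, 0.02))

# Draw all four histograms in a two-by-two grid, then reset the layout
par(mfrow = c(2, 2))
sims <- lapply(seq_len(nrow(scenarios)), function(i) {
  p_hats <- sim_p_hats(scenarios$n[i], scenarios$p[i])
  hist(p_hats,
       main = paste0("p = ", scenarios$p[i], ", n = ", scenarios$n[i]),
       xlim = c(0, 0.18))
  p_hats
})
par(mfrow = c(1, 1))
```

Keeping the same xlim in every panel makes the differences in spread and skew directly comparable across the four plots.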
Exercise 11
Q. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?
A. It is sensible to proceed with inference and report a margin of error when the sampling distribution is approximately normal. In the case of Ecuador, it might not be.
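Whether the normal approximation is reasonable can be checked with the success-failure condition. The sketch below applies the common rule of thumb of at least 10 expected successes and 10 expected failures (the threshold of 10 is an assumption; some texts use a different cutoff) to the four (n, p) combinations simulated above.

```r
scenarios <- data.frame(n = c(1040, 400, 1040, 400),
                        p = c(0.1, 0.1, 0.02, 0.02))

scenarios$expected_successes <- scenarios$n * scenarios$p
scenarios$expected_failures  <- scenarios$n * (1 - scenarios$p)

# Condition holds only if both expected counts are at least 10
scenarios$condition_met <- scenarios$expected_successes >= 10 &
                           scenarios$expected_failures >= 10
scenarios
```

Only the n = 400, p = 0.02 scenario falls short, with 8 expected successes, which is consistent with the skewed histogram for that case.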
On your own
The question of atheism was asked by WIN-Gallup International in a similar survey that was conducted in 2005. (We assume here that sample sizes have remained the same.) Table 4 on page 13 of the report summarizes survey results from 2005 and 2012 for 39 countries.
- Answer the following two questions using the inference function. As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference.
- Is there convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012?
H0: There is no change in atheism in Spain from 2005 to 2012
Ha: There is a change in atheism in Spain from 2005 to 2012
#Hint: Create a new data set for respondents from Spain. Form confidence intervals for the true proportion of atheists in both years, and determine whether they overlap.
#2005
sp05 <- subset(atheism, nationality == "Spain" & year == "2005")
sp05.atheist <- subset(sp05, response == "atheist" )
num.atheist05<- count(sp05.atheist)
num.atheist05
## n
## 1 115
sp05.asked<-count(sp05)
sp05.asked
## n
## 1 1146
#Next, calculate the proportion of atheist responses in Spain in 2005.
prop.atheist05<-(num.atheist05 / sp05.asked)
#2012
sp12 <- subset(atheism, nationality == "Spain" & year == "2012")
sp12.atheist <- subset(sp12, response == "atheist" )
num.atheist12<- count(sp12.atheist)
sp12.asked<-count(sp12)
sp12.asked
## n
## 1 1145
#Next, calculate the proportion of atheist responses in Spain in 2012 and express it as a percentage.
prop.atheis12<-(num.atheist12 / sp12.asked)
final <-prop.atheis12*100
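In 2005, atheists made up 10.03% of the Spanish sample (115 of 1146, a proportion of 0.100); in 2012 they made up 9.0% (103 of 1145, a proportion of 0.090). With a p-value of 0.3966, which is well above 0.05, we fail to reject the null hypothesis: the data do not provide convincing evidence that Spain's atheism index changed between 2005 and 2012.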
spain <- subset(atheism, nationality == "Spain" & year == "2005" | nationality == "Spain" & year == "2012")
inference(y = spain$response, x = spain$year, est = "proportion",type = "ht", null = 0, alternative = "twosided", method = "theoretical", success = "atheist")
## Warning: Explanatory variable was numerical, it has been converted to
## categorical. In order to avoid this warning, first convert your explanatory
## variable to a categorical variable using the as.factor() function.
## Response variable: categorical, Explanatory variable: categorical
## Two categorical variables
## Difference between two proportions -- success: atheist
## Summary statistics:
## x
## y 2005 2012 Sum
## atheist 115 103 218
## non-atheist 1031 1042 2073
## Sum 1146 1145 2291
## Observed difference between proportions (2005-2012) = 0.0104
##
## H0: p_2005 - p_2012 = 0
## HA: p_2005 - p_2012 != 0
## Pooled proportion = 0.0952
## Check conditions:
## 2005 : number of expected successes = 109 ; number of expected failures = 1037
## 2012 : number of expected successes = 109 ; number of expected failures = 1036
## Standard error = 0.012
## Test statistic: Z = 0.848
## p-value = 0.3966
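The numbers in this output can be reproduced by hand from the contingency table, which makes the steps of the two-proportion z-test explicit. This is a sketch in base R, not the internals of inference(); the variable names are illustrative.

```r
# Counts copied from the inference() table for Spain
x <- c(`2005` = 115, `2012` = 103)    # "atheist" responses
n <- c(`2005` = 1146, `2012` = 1145)  # respondents

p_hat    <- x / n
p_pooled <- sum(x) / sum(n)                       # pooled proportion under H0
se       <- sqrt(p_pooled * (1 - p_pooled) * sum(1 / n))
z        <- (p_hat[["2005"]] - p_hat[["2012"]]) / se
p_value  <- 2 * pnorm(-abs(z))                    # two-sided p-value

round(c(pooled = p_pooled, se = se, z = z, p_value = p_value), 4)
```

The pooled proportion is used in the standard error because the null hypothesis assumes a single common proportion for both years.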

- Is there convincing evidence that the United States has seen a change in its atheism index between 2005 and 2012?
us12 <- subset(atheism, nationality == "United States" & year == "2012")
us.atheist <- subset(us12, response == "atheist" )
us.nonatheist <- subset(us12, response == "non-atheist" )
num.atheist<- count(us.atheist)
#50
us.asked<-count(us12)
#1002
prop.atheistus12<-(50/1002)
#.0499002
us05 <- subset(atheism, nationality == "United States" & year == "2005")
us.atheist05 <- subset(us05, response == "atheist" )
num.atheist05<- count(us.atheist05)
#10
us.asked05<-count(us05)
#1002
prop05us<-( 10/1002)
#[1] 0.00998004
#There seems to be a change in proportions between the years in the United States.
usa<- subset(atheism, nationality == "United States" & year == "2005" | nationality == "United States" & year == "2012")
inference(y = usa$response, x = usa$year, est = "proportion",type = "ht", null = 0, alternative = "twosided", method = "theoretical", success = "atheist")
## Warning: Explanatory variable was numerical, it has been converted to
## categorical. In order to avoid this warning, first convert your explanatory
## variable to a categorical variable using the as.factor() function.
## Response variable: categorical, Explanatory variable: categorical
## Two categorical variables
## Difference between two proportions -- success: atheist
## Summary statistics:
## x
## y 2005 2012 Sum
## atheist 10 50 60
## non-atheist 992 952 1944
## Sum 1002 1002 2004
## Observed difference between proportions (2005-2012) = -0.0399
##
## H0: p_2005 - p_2012 = 0
## HA: p_2005 - p_2012 != 0
## Pooled proportion = 0.0299
## Check conditions:
## 2005 : number of expected successes = 30 ; number of expected failures = 972
## 2012 : number of expected successes = 30 ; number of expected failures = 972
## Standard error = 0.008
## Test statistic: Z = -5.243
## p-value = 0
The reported p-value of 0 (that is, a value too small to show at the displayed precision) provides strong evidence to reject the null hypothesis that there was no change in atheism in the United States between 2005 and 2012.
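A p-value is never exactly zero; the output above is rounded. As a hedged cross-check, base R's prop.test() can be run on the same counts to see the unrounded value. This swaps in a chi-squared test for the lab's z-test, which is equivalent for two proportions when the continuity correction is turned off.

```r
# Counts copied from the inference() table for the United States
atheists    <- c(10, 50)      # 2005, 2012
respondents <- c(1002, 1002)

# correct = FALSE turns off the continuity correction so the result
# matches the z-test above (X-squared equals Z squared)
us_test <- prop.test(atheists, respondents, correct = FALSE)

us_test$p.value                   # unrounded two-sided p-value
sqrt(unname(us_test$statistic))   # absolute value of the z statistic
```

The square root of the chi-squared statistic recovers the magnitude of Z reported by inference(), so the two approaches agree.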
- If in fact there has been no change in the atheism index in the countries listed in Table 4, in how many of those countries would you expect to detect a change (at a significance level of 0.05) simply by chance?
A. A Type 1 error is rejecting the null hypothesis when it is actually true. At a significance level of 0.05, the probability of a Type 1 error in any one test is 0.05, so across 39 countries we would expect 39 × 0.05 = 1.95, or about 2, countries to show a change in atheism index simply by chance.
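The arithmetic behind this answer is short enough to write out. The sketch below also computes the chance of at least one false positive, which additionally assumes the 39 tests are independent; that assumption is made here for illustration and is not stated in the report.

```r
countries <- 39
alpha     <- 0.05

# Expected number of false positives if every null hypothesis is true
expected_false_positives <- countries * alpha

# Probability of at least one false positive, assuming independent tests
p_at_least_one <- 1 - (1 - alpha)^countries

c(expected = expected_false_positives, at_least_one = p_at_least_one)
```

The expected count is 1.95, and under independence the probability of seeing at least one spurious change is about 0.86.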
- Suppose you’re hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for p. How many people would you have to sample to ensure that you are within the guidelines?
A sample size of at least 9604 is needed to ensure that the estimate is within the guidelines.
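#target margin of error
M.e<-0.01
#worst-case proportion, since p is unknown
p<-0.5
#critical value for 95% confidence = 1.96
se<-M.e/1.96
#se is about 0.00510
B<-se^2
#p(1-p)
A<-p*(1-p)
n<-(A/B)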
n
## [1] 9604
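The same calculation can be wrapped in a small function so that the margin of error, the assumed proportion, and the confidence level are explicit inputs. sample_size_for_me() is a hypothetical helper written for this sketch; it uses qnorm() for the critical value instead of the rounded 1.96 and rounds up, since a sample size must be a whole number.

```r
# Hypothetical helper: smallest n whose margin of error is at most `me`
sample_size_for_me <- function(me, p = 0.5, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  ceiling(z^2 * p * (1 - p) / me^2)
}

sample_size_for_me(0.01)           # worst case, p = 0.5
sample_size_for_me(0.01, p = 0.1)  # smaller n if p is believed to be near 0.1
```

With p = 0.5 this returns 9604, matching the result above; using p = 0.5 is the conservative choice because it maximizes p(1 - p).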
