Lab 6

##Ex1: In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?
##These percentages appear to be sample statistics, because they are calculated from the survey respondents rather than from the entire population.

##Ex2: The title of the report is "Global Index of Religiosity and Atheism". To generalize the report's findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?
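##We must assume that the respondents were sampled randomly from the global population, so that the sample is representative of it.
##The survey is based on interviews with 50,000 men and women from 57 countries across 5 continents.
##Given this broad coverage, the assumption seems reasonable, although it cannot be verified from the report alone.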

load("more/atheism.RData")

##Ex3: What does each row of Table 6 correspond to? What does each row of atheism correspond to?
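##Each row of Table 6 corresponds to a country and summarizes the responses collected in that country.
##Each row of atheism corresponds to one respondent, recording that person's nationality, response (atheist or non-atheist), and survey year (2005 or 2012).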
summary(atheism)

##              nationality           response          year     
##  Pakistan          : 5409   atheist    : 5498   Min.   :2005  
##  France            : 3359   non-atheist:82534   1st Qu.:2005  
##  Korea, Rep (South): 3047                       Median :2012  
##  Ghana             : 2995                       Mean   :2009  
##  Macedonia         : 2418                       3rd Qu.:2012  
##  Peru              : 2414                       Max.   :2012  
##  (Other)           :68390

##Ex4: Using the command below, create a new dataframe called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?
us12 <- subset(atheism, nationality == "United States" & year == "2012")
usatheist <- subset(atheism, response == "atheist" & nationality == "United States" & year == "2012")
nrow(usatheist)/nrow(us12)

## [1] 0.0499002

##Yes, the calculated proportion of 0.0499 agrees with the report's value of 5% once rounded.
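
##A minimal sketch of an equivalent calculation, assuming only the counts that inference() reports for this group (50 atheists, 952 non-atheists).
##The factor us12_demo is a hypothetical stand-in for us12$response, built so the example runs without the atheism data set.

```r
# Hypothetical stand-in for us12$response, rebuilt from the reported counts
us12_demo <- factor(rep(c("atheist", "non-atheist"), times = c(50, 952)))

# The mean of a logical vector is the proportion of TRUE values
prop_atheist <- mean(us12_demo == "atheist")
prop_atheist

# The same information as a table of proportions
prop.table(table(us12_demo))
```

##With the real data, mean(us12$response == "atheist") would avoid the second call to subset().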

##Ex5: Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?
##Independence: Assuming that simple random sampling was used, this condition is satisfied. The 1002 respondents are also far fewer than 10% of the population of the United States.
##Success-failure (near-normal sampling distribution): Of the 1002 observations, about 50 are atheists and 952 are non-atheists. Both counts are at least 10, so this condition is satisfied.
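
##The success-failure check can be sketched in code, assuming n = 1002 and the sample proportion 0.0499002 computed above.
##The variable names here are illustrative and are not part of the lab.

```r
# Sample size and sample proportion for the US in 2012
n_us <- 1002
p_hat_us <- 0.0499002

# Observed numbers of successes (atheists) and failures (non-atheists)
successes_us <- round(n_us * p_hat_us)
failures_us <- n_us - successes_us

# Both counts should be at least 10 for the normal approximation
sf_ok <- successes_us >= 10 & failures_us >= 10
c(successes = successes_us, failures = failures_us)
sf_ok
```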

inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Warning: package 'openintro' was built under R version 3.5.2

## Warning: package 'BHH2' was built under R version 3.5.3

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

##Ex6: Based on the R output, what is the margin of error for the estimate of the proportion of atheists in the US in 2012?
##Margin of Error
(.0634-.0364)/2

## [1] 0.0135
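
##The same margin of error can be derived from the standard error rather than from the interval endpoints.
##This is a sketch assuming the usual formula ME = z* x SE with SE = sqrt(p_hat * (1 - p_hat) / n); the variable names are illustrative.

```r
# Counts reported by inference() for the US in 2012
n_us <- 1002
p_hat_us <- 50 / n_us

# Standard error of a sample proportion
se_us <- sqrt(p_hat_us * (1 - p_hat_us) / n_us)

# Critical value for a 95% confidence level (about 1.96)
z_star <- qnorm(0.975)

# Margin of error is the half-width of the interval
me_us <- z_star * se_us
round(c(se = se_us, me = me_us), 4)
```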

##Ex7: Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference function to construct the confidence intervals.

arg12 <- subset(atheism, nationality == "Argentina" & year == "2012")
inference(arg12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0706 ;  n = 991 
## Check conditions: number of successes = 70 ; number of failures = 921 
## Standard error = 0.0081 
## 95 % Confidence interval = ( 0.0547 , 0.0866 )

##Margin of error
(.0866-.0547)/2

## [1] 0.01595

fra12 <- subset(atheism, nationality == "France" & year == "2012")
inference(fra12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.2873 ;  n = 1688 
## Check conditions: number of successes = 485 ; number of failures = 1203 
## Standard error = 0.011 
## 95 % Confidence interval = ( 0.2657 , 0.3089 )

##Margin of error
(.3089-.2657)/2

## [1] 0.0216

##How proportions affect margin of error
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")

##Ex8: Describe the relationship between p and ME.
##For a fixed n, the margin of error is zero at p = 0 and p = 1, increases as p approaches 0.5, peaks at p = 0.5, and is symmetric around that point.
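
##The location of the peak can be confirmed numerically. This sketch recomputes the curve under new, illustrative names so that p, n and me are left untouched.

```r
# Same grid and formula as the plot above (2 approximates z* = 1.96)
n_demo <- 1000
p_grid <- seq(0, 1, 0.01)
me_grid <- 2 * sqrt(p_grid * (1 - p_grid) / n_demo)

# Proportion at which the margin of error is largest
p_at_max <- p_grid[which.max(me_grid)]
p_at_max

# Largest margin of error for n = 1000
max(me_grid)
```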

##Success-failure condition
p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

##Ex9: Describe the sampling distribution of sample proportions at n=1040 and p=0.1. Be sure to note the center, spread, and shape.
summary(p_hats)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07019 0.09327 0.09904 0.09969 0.10577 0.12981

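##Center: the mean (0.0997) and median (0.0990) are very close to each other and to the true proportion p = 0.1.
##Spread: the middle 50% of the sample proportions lie between 0.0933 and 0.1058, and all 5000 values fall between 0.0702 and 0.1298.
##Shape: the distribution is unimodal, symmetric, and approximately normal.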
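
##The simulated spread can be compared with the theoretical standard error sqrt(p * (1 - p) / n).
##This is a sketch, not the lab's method: it swaps the sample() loop for rbinom(), which draws the number of atheists in each sample directly, and it uses an arbitrary seed and illustrative variable names.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

p_true <- 0.1
n_samp <- 1040

# 5000 simulated sample proportions
sim_p_hats <- rbinom(5000, size = n_samp, prob = p_true) / n_samp

# Theoretical standard error of the sample proportion
se_theory <- sqrt(p_true * (1 - p_true) / n_samp)

c(mean = mean(sim_p_hats), sd = sd(sim_p_hats), se_theory = se_theory)
```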

##Ex10: Repeat the above simulation three more times but with modified sample sizes and proportions: for n=400 and p=0.1, n=1040 and p=0.02, and n=400 and p=0.02. Plot all four histograms together by running the  par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. Describe the three new sampling distributions. Based on these limited plots, how does n appear to affect the distribution of p^? How does p affect the sampling distribution?
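par(mfrow = c(2, 2))

##Re-run the original simulation (p = 0.1, n = 1040) so that all four histograms appear in the grid
p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)
for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}
hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))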

p <- 0.1
n <- 400
p_hats <- rep(0, 5000)
for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}
hist(p_hats, main = "p = 0.1, n = 400", xlim = c(0, 0.18))

p <- 0.02
n <- 1040
p_hats <- rep(0, 5000)
for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}
hist(p_hats, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))

p <- 0.02
n <- 400
p_hats <- rep(0, 5000)
for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}
hist(p_hats, main = "p = 0.02, n = 400", xlim = c(0, 0.18))

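##As n increases, the spread of the sampling distribution decreases. p sets the center of the distribution, and a p closer to 0 also gives a smaller spread.
##The distributions for p = 0.1 look approximately normal. For p = 0.02 and n = 400 the expected number of atheists is only 400 * 0.02 = 8, so that distribution sits close to 0 and is less well approximated by the normal curve.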

par(mfrow = c(1, 1))

##Ex11: If you refer to Table 6, you'll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let's suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?
##For Australia, the expected counts are 104 atheists and 936 non-atheists, so the success-failure condition holds and it is sensible to proceed with inference and report a margin of error.
##For Ecuador, the expected number of atheists is only 400 * 0.02 = 8, which is below 10. The success-failure condition is not met, so a margin of error based on the normal approximation should be treated with caution.
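
##The success-failure condition can be tabulated for the four simulated scenarios.
##This is a small sketch; the data frame scenarios and its column names are illustrative.

```r
# The four (p, n) combinations used in the simulations
scenarios <- data.frame(
  p = c(0.1, 0.1, 0.02, 0.02),
  n = c(1040, 400, 1040, 400)
)

# Expected numbers of successes and failures in each scenario
scenarios$expected_successes <- scenarios$n * scenarios$p
scenarios$expected_failures <- scenarios$n * (1 - scenarios$p)

# The normal approximation is usually considered adequate when both are at least 10
scenarios$condition_met <- scenarios$expected_successes >= 10 &
  scenarios$expected_failures >= 10

scenarios
```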

#ON YOUR OWN

##1a)  Is there convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012?

spa12 <- subset(atheism, nationality == "Spain" & year == "2012")
spa05 <- subset(atheism, nationality == "Spain" & year == "2005")
inference(spa12$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.09 ;  n = 1145 
## Check conditions: number of successes = 103 ; number of failures = 1042 
## Standard error = 0.0085 
## 95 % Confidence interval = ( 0.0734 , 0.1065 )

inference(spa05$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.1003 ;  n = 1146 
## Check conditions: number of successes = 115 ; number of failures = 1031 
## Standard error = 0.0089 
## 95 % Confidence interval = ( 0.083 , 0.1177 )

##Ho: There is no difference in the atheism index in Spain between 2005 and 2012
##Ha: There is a difference in the atheism index in Spain between 2005 and 2012

##The 95% confidence interval for 2012 is (.0734, .1065) and for 2005 is (.083, .1177).
##The intervals overlap substantially, and each one contains the other year's point estimate, so there is no convincing evidence that the index changed between these 2 years.
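
##Comparing two separate intervals is an informal check. A two-proportion test addresses Ho and Ha directly.
##The sketch below uses base R's prop.test() instead of the lab's inference() function, and it takes the counts from the output above (103 of 1145 in 2012, 115 of 1146 in 2005); the variable names are illustrative.

```r
# Atheist counts and sample sizes for Spain, ordered as 2012 then 2005
spain_atheists <- c(103, 115)
spain_n <- c(1145, 1146)

# Two-sided test of equal proportions (with continuity correction by default)
spain_test <- prop.test(spain_atheists, spain_n)

# p-value for Ho: no difference between the two years
spain_test$p.value

# 95% confidence interval for the difference p_2012 - p_2005
spain_test$conf.int
```

##A p-value above 0.05, together with an interval for the difference that contains 0, is consistent with the conclusion above.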

##1b) Is there convincing evidence that the United States has seen a change in its atheism index between 2005 and 2012?
us05 <- subset(atheism, nationality == "United States" & year == "2005")
inference(us05$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.01 ;  n = 1002 
## Check conditions: number of successes = 10 ; number of failures = 992 
## Standard error = 0.0031 
## 95 % Confidence interval = ( 0.0038 , 0.0161 )

inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

##The 95% confidence intervals for 2005 (.0038, .0161) and 2012 (.0364, .0634) do not overlap, so there is convincing evidence that the index changed between these 2 years.

##2)If in fact there has been no change in the atheism index in the countries listed in Table 4, in how many of those countries would you expect to detect a change (at a significance level of 0.05) simply by chance?
##A significance level of 0.05 is the probability of a Type 1 error, so we would expect to detect a change in about 5% of the countries simply by chance.

##3) Suppose you're hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for p. How many people would you have to sample to ensure that you are within the guidelines?
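##Since ME is largest at p = 0.5, we use p = 0.5 as the worst case and solve ME = z * sqrt(p * (1 - p) / n) for n, with z = 1.96 and ME = 0.01: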
1.96^2*.5*.5/.01^2

## [1] 9604

##At least 9604 people would have to be sampled.
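
##The calculation generalizes to other margins of error and confidence levels.
##This is a sketch: required_n() is a hypothetical helper, not a function from the lab or from any package.

```r
# Smallest n such that z* x sqrt(p * (1 - p) / n) <= me
# me:   desired margin of error
# conf: confidence level
# p:    assumed proportion (0.5 is the worst case)
required_n <- function(me, conf = 0.95, p = 0.5) {
  z_star <- qnorm(1 - (1 - conf) / 2)
  ceiling(z_star^2 * p * (1 - p) / me^2)
}

required_n(0.01)
```

##Rounding up with ceiling() keeps the margin of error within the requirement, since a fractional number of people cannot be sampled.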

Farhana Zahir