Lab 5

===Questions 1 through 4
Read the PDF document.
===The Data

load(url("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/atheism.RData"))

===Question 5
The cases, or rows, in table 6 are countries.
===Question 6
The Atheism data frame's rows are individual respondents.
Note: From here on I follow the convention of capitalizing the names of data frames. The DataCamp website does not do this, but most of the R coders I am familiar with consider it good practice.
===Atheists in the US
Here we start working with subsets. Code to create the 'us12' subset:

us12 = subset(atheism, nationality == "United States" & year == "2012")

Calculate the proportion of atheist responses:

# First attempt, which failed because of a spelling error ('resonse'):
# proportion = nrow(subset(us12, resonse == 'atheist'))/nrow(us12)
proportion = nrow(subset(us12, response == "atheist"))/nrow(us12)
# Print the proportion:
proportion

## [1] 0.0499
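
As a cross-check, the same proportion can be computed without subset(). This is a minimal sketch that does not load the atheism data: it rebuilds a stand-in response vector from the counts reported for this subset later in the lab (50 atheist, 952 non-atheist). The name `response` is only for illustration.

```r
# Stand-in for us12$response, rebuilt from the reported counts
# (50 "atheist", 952 "non-atheist"); this is not the real data frame.
response <- factor(rep(c("atheist", "non-atheist"), times = c(50, 952)))

# The mean of a logical vector is the proportion of TRUE values
proportion <- mean(response == "atheist")
round(proportion, 4)

# The same information for both categories at once
prop.table(table(response))
```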

===Question 7
Yes
===Inference Conditions
We want to estimate the parameters or characteristics of the population from our sample statistics.
===Question 8

inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
    success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

The margin of error is 0.0135 on either side (above or below) the sample estimate (p_hat) of 0.0499, giving us a 95% confidence interval of 0.0364 to 0.0634.
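
The numbers in that output can be reproduced by hand from the formulas for a single proportion. A minimal sketch, assuming only the counts printed above (50 successes out of n = 1002); the variable names are my own.

```r
successes <- 50
n <- 1002

p_hat <- successes / n                 # sample proportion
se <- sqrt(p_hat * (1 - p_hat) / n)    # standard error of a proportion
z <- qnorm(0.975)                      # critical value for 95%, about 1.96
me <- z * se                           # margin of error

ci <- c(lower = p_hat - me, upper = p_hat + me)
round(c(p_hat = p_hat, se = se, me = me, ci), 4)
```

Rounded to four places, this should agree with the inference output: 0.0499, 0.0069, 0.0135, and an interval of 0.0364 to 0.0634.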
===What about India?

The subset for India for 2012:

india = subset(atheism, nationality == "India" & year == "2012")
# The analysis using the 'inference' function:
inference(india$response, est = "proportion", type = "ci", method = "theoretical", 
    success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0302 ;  n = 1092 
## Check conditions: number of successes = 33 ; number of failures = 1059 
## Standard error = 0.0052 
## 95 % Confidence interval = ( 0.0201 , 0.0404 )

===And China?
First, subset the data for China for 2012:

china = subset(atheism, nationality == "China" & year == "2012")
# The analysis using the 'inference' function:
inference(china$response, est = "proportion", type = "ci", method = "theoretical", 
    success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.47 ;  n = 500 
## Check conditions: number of successes = 235 ; number of failures = 265 
## Standard error = 0.0223 
## 95 % Confidence interval = ( 0.4263 , 0.5137 )
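
Since the same calculation is repeated for each country, it can be wrapped in a small helper. This is only a sketch: `prop_ci` is a hypothetical name, not part of the lab's `inference` function. The cutoff of at least 10 successes and 10 failures is the usual rule of thumb for the normal approximation, which I am assuming is what 'Check conditions' refers to.

```r
prop_ci <- function(successes, n, conf = 0.95, min_count = 10) {
  failures <- n - successes
  # Rule-of-thumb check for the normal approximation (assumed cutoff)
  if (successes < min_count || failures < min_count) {
    warning("fewer than ", min_count, " successes or failures")
  }
  p_hat <- successes / n
  se <- sqrt(p_hat * (1 - p_hat) / n)
  me <- qnorm(1 - (1 - conf) / 2) * se
  c(p_hat = p_hat, se = se, lower = p_hat - me, upper = p_hat + me)
}

# Counts taken from the inference output for India and China
india_ci <- prop_ci(33, 1092)
china_ci <- prop_ci(235, 500)
round(rbind(india_ci, china_ci), 4)
```

China's interval is much wider than India's, partly because its sample is smaller (500 versus 1092) and partly because its proportion is close to 0.5.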

===Margin of Error
The margin of error changes with the sample size: the larger the sample, the smaller the margin of error, which is to say that our sample statistic pins down the true population parameter more precisely. But the margin of error also changes with the proportion. Judged relative to the size of the proportion itself, we can be more confident that we have captured the population parameter when we are looking for something that is relatively common and, consequently, less confident when we are looking for something rare, i.e., something that happens in a smaller proportion of the group.

This is intuitive ('intuitive' is the word that statisticians use to describe ideas that make sense to normal people). If you are looking for something that happens very rarely, like some exotic disease, you have to examine more people than if you are looking for something that happens all the time. If you wanted to test a drug to prevent, say, rare brain tumors that happen in only 1 out of 100,000 people, then you have to look at at least 100,000 people to see if your drug works. Even worse, if you find that the 100,000 people who took your drug were free of the rare cancer, you could hardly be sure that the sample wasn't cancer-free simply by random chance. On the other hand, if you want to test a drug for something really common, like a drug that prevents headaches, finding 100,000 people who report no headaches after taking your drug would make you pretty confident that your treatment has an effect.
In order to calculate the statistical significance of any statistic (i.e., how likely it is that the effect implied by your sample statistic is merely due to random sampling variation) we need some measure of how much the statistic varies from sample to sample, i.e., the standard error, or SE. For proportions the standard error is a straightforward function of the proportion itself and the sample size n.

SE = sqrt(p(1-p)/n)

and the ME (Margin of Error) is

ME = 1.96*SE
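
To see the sample-size effect described above, the two formulas can be evaluated for a few sample sizes. A small sketch with p fixed at 0.5; the sample sizes are chosen arbitrarily for illustration and the variable names are my own.

```r
p_fixed <- 0.5
n_sizes <- c(100, 1000, 10000)               # illustrative sample sizes

se_by_n <- sqrt(p_fixed * (1 - p_fixed) / n_sizes)  # standard error for each n
me_by_n <- 1.96 * se_by_n                            # margin of error for each n

data.frame(n = n_sizes, se = round(se_by_n, 4), me = round(me_by_n, 4))
```

Each tenfold increase in n shrinks the margin of error by a factor of sqrt(10), roughly 3.16.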

The tutorial supplies code to plot the relationship between the population proportion and the margin of error.
The first step is to make a vector p that is a sequence from 0 to 1 with each number separated by 0.01:

n = 1000
p = seq(0, 1, 0.01)

We then create a vector of the margin of error (me) associated with each of these values of p using the familiar approximate formula (ME = 2 x SE, with 2 standing in for 1.96):

me = 2 * sqrt(p * (1 - p)/n)

Finally, plot the two vectors against each other to reveal their relationship:

plot(me ~ p)

[Plot: margin of error (me) against proportion (p)]

Pretty cool, I think. When you are looking for something that happens almost all the time or almost none of the time your margin of error is pretty small.

Think of it this way. If you are looking for something that is never supposed to happen, i.e., your null hypothesis is that the proportion is 0, then finding even one example of the thing happening is enough for you to reject the null hypothesis. Since a single case is enough to reject the null, your margin of error is literally 0.

The same thing is true in reverse for something that is always supposed to happen, i.e., p = 1. You are least certain or, which is to say the same thing, have the widest margin of error, when you are looking for something that happens half the time. For example we saw earlier that random data is pretty 'streaky', that you can get a long string of heads or tails even in a fair coin that has a fifty-fifty chance of giving heads or tails.
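
The shape of the plot can also be checked numerically. A short sketch that rebuilds the same p and me vectors and looks at the two ends and the peak:

```r
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p) / n)

# Margin of error at p = 0 and p = 1
me[c(1, length(me))]

# Proportion at which the margin of error is largest, and its value
p[which.max(me)]
max(me)
```

The margin of error is zero at both ends and peaks at p = 0.5, where it is 2 * sqrt(0.25/1000), a little over 0.03.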

===Question 9
The plot shows the relationship between p, the population proportion, and me, the margin of error.

===Atheism in Spain
This time we want to see if the proportion of atheists in Spain has changed between 2005 and 2012.

We need the subset of respondents that are from Spain. We then calculate the proportion of atheists in the entire sample of Spanish respondents, i.e., both those from 2005 and from 2012. Then we use the inference function grouping by year.

Spain = subset(atheism, nationality == "Spain")
head(Spain)

##       nationality    response year
## 45230       Spain non-atheist 2012
## 45231       Spain non-atheist 2012
## 45232       Spain non-atheist 2012
## 45233       Spain non-atheist 2012
## 45234       Spain non-atheist 2012
## 45235       Spain non-atheist 2012

summary(Spain$response)

##     atheist non-atheist 
##         218        2073

Proportion = nrow(subset(Spain, response == "atheist"))/nrow(Spain)
Proportion

## [1] 0.09515

inference(Spain$response, Spain$year, est = "proportion", type = "ci", method = "theoretical", 
    success = "atheist")

## Warning: Explanatory variable was numerical, it has been converted to
## categorical. In order to avoid this warning, first convert your
## explanatory variable to a categorical variable using the as.factor()
## function.

## Response variable: categorical, Explanatory variable: categorical
## Two categorical variables
## Difference between two proportions -- success: atheist
## Summary statistics:
##              x
## y             2005 2012  Sum
##   atheist      115  103  218
##   non-atheist 1031 1042 2073
##   Sum         1146 1145 2291

## Observed difference between proportions (2005-2012) = 0.0104
## 
## Check conditions:
##    2005 : number of successes = 115 ; number of failures = 1031 
##    2012 : number of successes = 103 ; number of failures = 1042 
## Standard error = 0.0123 
## 95 % Confidence interval = ( -0.0136 , 0.0344 )

Their code worked for me and gave me the same results, but I got a warning message about my explanatory variable being numerical rather than categorical. The program 'coerced' it to a factor, but the message said to convert it yourself in future to avoid the warning, so that is what I will do:

inference(Spain$response, as.factor(Spain$year), est = "proportion", type = "ci", 
    method = "theoretical", success = "atheist")

## Response variable: categorical, Explanatory variable: categorical
## Two categorical variables
## Difference between two proportions -- success: atheist
## Summary statistics:
##              x
## y             2005 2012  Sum
##   atheist      115  103  218
##   non-atheist 1031 1042 2073
##   Sum         1146 1145 2291

## Observed difference between proportions (2005-2012) = 0.0104
## 
## Check conditions:
##    2005 : number of successes = 115 ; number of failures = 1031 
##    2012 : number of successes = 103 ; number of failures = 1042 
## Standard error = 0.0123 
## 95 % Confidence interval = ( -0.0136 , 0.0344 )

Now it works without a warning.
===Question 10
Since the confidence interval includes zero, we cannot be confident that the true difference between the 2005 and 2012 population proportions is not zero.
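
The interval in question can be rebuilt from the Spain summary table, which makes it easier to see why zero falls inside it. A sketch using the counts from that table; the variable names are my own. The standard error formula is the usual unpooled one for a difference of two proportions, which I am assuming is what the inference function uses, since it reproduces its output.

```r
# Counts from the Spain summary table
atheists <- c(`2005` = 115, `2012` = 103)
n <- c(`2005` = 1146, `2012` = 1145)

p_hat <- atheists / n
d <- unname(p_hat["2005"] - p_hat["2012"])   # observed difference

# Unpooled standard error for a difference of two proportions
se <- sqrt(sum(p_hat * (1 - p_hat) / n))
me <- qnorm(0.975) * se

ci <- c(lower = d - me, upper = d + me)
round(c(diff = d, se = se, ci), 4)

# TRUE when the interval contains zero
unname(ci["lower"] < 0 & ci["upper"] > 0)
```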
===Rising in the US?
We perform the same analysis on the US data.

levels(atheism$nationality)

##  [1] "Afghanistan"                                 
##  [2] "Argentina"                                   
##  [3] "Armenia"                                     
##  [4] "Australia"                                   
##  [5] "Austria"                                     
##  [6] "Azerbaijan"                                  
##  [7] "Belgium"                                     
##  [8] "Bosnia and Herzegovina"                      
##  [9] "Brazil"                                      
## [10] "Bulgaria"                                    
## [11] "Cameroon"                                    
## [12] "Canada"                                      
## [13] "China"                                       
## [14] "Colombia"                                    
## [15] "Czech Republic"                              
## [16] "Ecuador"                                     
## [17] "Fiji"                                        
## [18] "Finland"                                     
## [19] "France"                                      
## [20] "Georgia"                                     
## [21] "Germany"                                     
## [22] "Ghana"                                       
## [23] "Hong Kong"                                   
## [24] "Iceland"                                     
## [25] "India"                                       
## [26] "Iraq"                                        
## [27] "Ireland"                                     
## [28] "Italy"                                       
## [29] "Japan"                                       
## [30] "Kenya"                                       
## [31] "Korea, Rep (South)"                          
## [32] "Lebanon"                                     
## [33] "Lithuania"                                   
## [34] "Macedonia"                                   
## [35] "Malaysia"                                    
## [36] "Moldova"                                     
## [37] "Netherlands"                                 
## [38] "Nigeria"                                     
## [39] "Pakistan"                                    
## [40] "Palestinian territories (West Bank and Gaza)"
## [41] "Peru"                                        
## [42] "Poland"                                      
## [43] "Romania"                                     
## [44] "Russian Federation"                          
## [45] "Saudi Arabia"                                
## [46] "Serbia"                                      
## [47] "South Africa"                                
## [48] "South Sudan"                                 
## [49] "Spain"                                       
## [50] "Sweden"                                      
## [51] "Switzerland"                                 
## [52] "Tunisia"                                     
## [53] "Turkey"                                      
## [54] "Ukraine"                                     
## [55] "United States"                               
## [56] "Uzbekistan"                                  
## [57] "Vietnam"

US = subset(atheism, nationality == "United States")
head(US)

##         nationality    response year
## 49926 United States non-atheist 2012
## 49927 United States non-atheist 2012
## 49928 United States non-atheist 2012
## 49929 United States non-atheist 2012
## 49930 United States non-atheist 2012
## 49931 United States non-atheist 2012

summary(US$response)

##     atheist non-atheist 
##          60        1944

Proportion = nrow(subset(US, response == "atheist"))/nrow(US)
Proportion

## [1] 0.02994

inference(US$response, as.factor(US$year), est = "proportion", type = "ci", 
    method = "theoretical", success = "atheist")

## Response variable: categorical, Explanatory variable: categorical
## Two categorical variables
## Difference between two proportions -- success: atheist
## Summary statistics:
##              x
## y             2005 2012  Sum
##   atheist       10   50   60
##   non-atheist  992  952 1944
##   Sum         1002 1002 2004

## Observed difference between proportions (2005-2012) = -0.0399
## 
## Check conditions:
##    2005 : number of successes = 10 ; number of failures = 992 
##    2012 : number of successes = 50 ; number of failures = 952 
## Standard error = 0.0076 
## 95 % Confidence interval = ( -0.0547 , -0.0251 )

So we can see that the proportion of atheists in the US is higher in 2012: subtracting the 2012 proportion from the 2005 proportion gives us a negative number. Since both the upper and lower bounds of the confidence interval are negative, we can conclude with 95% confidence that there has, in fact, been an increase in the proportion of atheists in the US population.
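
Base R can cross-check this result without the lab's inference function. A sketch using `prop.test` from the stats package with the counts from the US summary table; setting `correct = FALSE` turns off the continuity correction, which I expect matches the uncorrected interval reported here.

```r
# Atheist counts and sample sizes for 2005 and 2012 (from the US table)
atheists <- c(10, 50)
n <- c(1002, 1002)

# Two-sample test of equal proportions, without continuity correction
result <- prop.test(atheists, n, correct = FALSE)

round(result$estimate, 4)   # the two sample proportions
round(result$conf.int, 4)   # 95% CI for (2005 proportion - 2012 proportion)
result$p.value              # test of no difference between the two years
```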
===Question 12
We are asked in how many of the 39 countries in table 4 we would expect to see an increase in atheism significant at the 95% level even if the true population increase in atheism was 0. We would expect to see it in 5% of the cases, and 5% of 39 is:

0.05 * 39

## [1] 1.95

Since you can't have a fraction of a country experiencing a statistically significant increase in atheism I think it is fair to round up to 2, but the answer choice they offer is actually 1.95.
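
One way to make sense of the fractional answer: if each of the 39 countries independently has a 5% chance of a false alarm (independence is my assumption, not something the lab states), the number of false alarms follows a binomial distribution, and 1.95 is its mean rather than a count anyone would actually observe.

```r
n_countries <- 39
alpha <- 0.05

# Expected number of false alarms
n_countries * alpha

# Probability of exactly 0, 1, 2, 3 false alarms among the 39 countries
round(dbinom(0:3, size = n_countries, prob = alpha), 3)

# Probability of at least one false alarm
p_any <- 1 - (1 - alpha)^n_countries
p_any
```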
===Question 13
How big of a sample size do you need? First, you assume p = 0.5 to be conservative, since that is the value that gives the largest margin of error. Then you do some math with R using the formula you have for computing the ME of a proportion

ME = 1.96 * SE

ME = 1.96 * sqrt(p(1-p)/n)

Now we just use algebra to get the equation that solves for n and use R to do the calculation. Simple, yes?

((0.01/1.96)^2)/0.25

## [1] 0.0001041

(0.01/1.96)/0.5

## [1] 0.0102

Neither of those attempts gives a plausible sample size, so let me try a simpler way: plug in their numbers and see which one works.

1.96 * sqrt(0.25/9604)

## [1] 0.01

Not very elegant, I admit, but it saves a lot of algebra that I have apparently forgotten.
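
For the record, the algebra comes out as follows. Squaring both sides of ME = 1.96 * sqrt(p(1-p)/n) and solving for n gives n = (1.96/ME)^2 * p(1-p). A short sketch with the target margin of error of 0.01 and p(1-p) = 0.25, i.e., p = 0.5, as in the check above; the variable names are my own.

```r
z <- 1.96
me_target <- 0.01
p <- 0.5

# n = (z / ME)^2 * p * (1 - p)
n_required <- (z / me_target)^2 * p * (1 - p)
n_required

# Check: plugging n back in should return the target margin of error
z * sqrt(p * (1 - p) / n_required)
```

This gives 9604, the same value found by plugging in the answer choices.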