Lab 8 Inference for categorical data

Exercise 1

In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?

These percentages appear to be sample statistics, because the report states that they come from a global poll of respondents rather than from the entire population.

Exercise 2

The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?

download.file("http://www.openintro.org/stat/data/atheism.RData", destfile = "atheism.RData")
load("atheism.RData")

To generalize the findings to the global population, we must assume that respondents were selected by random sampling from countries throughout the world and that the sample is large enough to give accurate estimates. In addition, the observations should be independent of each other.

Exercise 3

What does each row of Table 6 correspond to? What does each row of atheism correspond to?

Each row of Table 6 gives the poll results for one country: the sample size and the percentage of respondents who answered that they are a religious person, not a religious person, or a convinced atheist, or who gave no response. Each row of atheism corresponds to a single survey respondent and records that person's nationality, response (atheist or non-atheist), and survey year.

us12 <- subset(atheism, nationality == "United States" & year == "2012")

Exercise 4

Using the command below, create a new dataframe called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?

table(us12$response)

## 
##     atheist non-atheist 
##          50         952

50/1002

## [1] 0.0499002

The proportion of atheist responses in the 2012 US survey was 50/1002 = 0.0499. After rounding, this agrees with the 5% reported in Table 6.
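The same proportion can be computed without typing the division by hand. This is a minimal sketch that re-enters the counts from the output above so it runs on its own; the `counts` vector is an assumption standing in for `table(us12$response)`.

```r
# Counts copied from the table(us12$response) output above
counts <- c("atheist" = 50, "non-atheist" = 952)

# prop.table divides each count by the total
props <- prop.table(counts)
p_hat <- props[["atheist"]]

round(p_hat, 4)
```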

Exercise 5

Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?

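The conditions are that the observations must be independent of each other and that the sample must be large enough to contain at least 10 successes and 10 failures. Independence is reasonable if respondents were sampled randomly. The sample size condition is met, since the 1,002 observations in us12 include 50 atheists and 952 non-atheists.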
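The sample size condition can be checked directly from the counts. A small sketch, assuming the usual success-failure rule of at least 10 observations in each category and using the counts reported for us12:

```r
# Counts reported for the 2012 US sample
successes <- 50   # atheist
failures  <- 952  # non-atheist

# Success-failure condition: at least 10 of each outcome
condition_met <- successes >= 10 && failures >= 10
condition_met
```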

inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )
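As a cross-check, the interval reported above can be rebuilt by hand from the counts. This sketch assumes the standard normal-approximation interval for a proportion (point estimate plus or minus z* standard errors); it is not taken from the source of the inference function.

```r
n     <- 1002
p_hat <- 50 / n

# Standard error of a sample proportion
se <- sqrt(p_hat * (1 - p_hat) / n)

# 95% interval: point estimate plus or minus z* standard errors
z_star <- qnorm(0.975)
ci <- p_hat + c(-1, 1) * z_star * se

round(se, 4)
round(ci, 4)
```

Rounded to four decimals, these values can be compared with the standard error and interval printed in the output.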

Exercise 6

Based on the R output, what is the margin of error for the estimate of the proportion of atheists in US in 2012?

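# Standard error reported in the inference() output
SE <- 0.0069
# Margin of error: critical value from the normal distribution times the SE
ME <- qnorm(0.975) * SE

The margin of error for the proportion of atheists in the US in 2012 is 0.0135.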
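The margin of error can also be read off the reported confidence interval, since it is half the interval's width. A short sketch using the endpoints printed in the output for us12:

```r
# Endpoints of the 95% interval reported for the US in 2012
ci_lower <- 0.0364
ci_upper <- 0.0634

# The margin of error is half the width of the interval
me_from_ci <- (ci_upper - ci_lower) / 2
me_from_ci
```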

aus12 <- subset(atheism, nationality == "Australia" & year == "2012")
can12 <- subset(atheism, nationality == "Canada" & year == "2012")

Exercise 7

Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference function to construct the confidence intervals.

inference(aus12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.1001 ;  n = 1039 
## Check conditions: number of successes = 104 ; number of failures = 935 
## Standard error = 0.0093 
## 95 % Confidence interval = ( 0.0818 , 0.1183 )

inference(can12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0898 ;  n = 1002 
## Check conditions: number of successes = 90 ; number of failures = 912 
## Standard error = 0.009 
## 95 % Confidence interval = ( 0.0721 , 0.1075 )

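# Standard errors reported in the inference() output
SE_aus <- 0.0093
SE_can <- 0.009

# Margin of error: critical value from the normal distribution times the SE
ME_aus <- qnorm(0.975) * SE_aus
ME_can <- qnorm(0.975) * SE_can

The 95% confidence interval for Australia is (0.0818, 0.1183) and for Canada it is (0.0721, 0.1075). Using the rounded standard errors from the output, the margins of error are about 0.0182 and 0.0176 respectively. The success-failure condition is met in both samples, since each has well over 10 successes and 10 failures (104 and 935 for Australia, 90 and 912 for Canada).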
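Working from the counts rather than the rounded standard errors avoids small rounding differences in the last digit. The sketch below uses a hypothetical helper, `margin_of_error`, which is not part of the lab code, and assumes the normal-approximation formula for a proportion.

```r
# Hypothetical helper: margin of error from success and failure counts
margin_of_error <- function(successes, failures, conf = 0.95) {
  n <- successes + failures
  p_hat <- successes / n
  z_star <- qnorm(1 - (1 - conf) / 2)
  z_star * sqrt(p_hat * (1 - p_hat) / n)
}

me_aus <- margin_of_error(104, 935)  # Australia, 2012
me_can <- margin_of_error(90, 912)   # Canada, 2012
round(c(Australia = me_aus, Canada = me_can), 4)
```

Each result should match half the width of the corresponding interval reported by inference.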

n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")

Exercise 8

Describe the relationship between p and me.

As p increases, the margin of error increases until p reaches 0.5, after which the margin of error decreases as p increases. The margin of error is therefore largest at p = 0.5 and smallest near p = 0 and p = 1.
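The location of the peak can be confirmed numerically instead of by eye. A brief sketch that recomputes the same `me` vector used for the plot and finds where it is largest:

```r
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p) / n)

# Value of p at which the margin of error is largest, and that maximum
p_at_max <- p[which.max(me)]
max_me <- max(me)
c(p_at_max = p_at_max, max_me = max_me)
```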

p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

Exercise 9

Describe the sampling distribution of sample proportions at n=1040 and p=0.1. Be sure to note the center, spread, and shape. Hint: Remember that R has functions such as mean to calculate summary statistics.

summary(p_hats)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07019 0.09327 0.09904 0.09969 0.10577 0.12981

sd(p_hats)

## [1] 0.009287382

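The sampling distribution is bell-shaped and roughly symmetric. It is centered near the true proportion of 0.1, with a mean of 0.09969 and a slightly lower median of 0.09904. The standard deviation is about 0.0093, and the simulated proportions range from 0.07019 to 0.12981.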
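The simulated center and spread can be compared with what theory predicts for a sample proportion. A minimal sketch, assuming the usual standard error formula sqrt(p(1 - p)/n):

```r
p <- 0.1
n <- 1040

# Theoretical center and standard error of the sample proportion
theory_mean <- p
theory_se <- sqrt(p * (1 - p) / n)

round(theory_se, 4)
```

The simulated mean of 0.09969 and standard deviation of 0.009287 can be set against `theory_mean` and `theory_se`.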

p <- 0.1
n <- 400
p_hats2 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats2[i] <- sum(samp == "atheist")/n
}

Exercise 10

Repeat the above simulation three more times but with modified sample sizes and proportions: for n=400 and p=0.1, n=1040 and p=0.02, and n=400 and p=0.02. Plot all four histograms together by running the par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. Describe the three new sampling distributions. Based on these limited plots, how does n appear to affect the distribution of p̂? How does p affect the sampling distribution?

# The exercise asks for p = 0.02 in the last two simulations
p <- 0.02
n <- 1040
p_hats3 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats3[i] <- sum(samp == "atheist")/n
}

p <- 0.02
n <- 400
p_hats4 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats4[i] <- sum(samp == "atheist")/n
}

par(mfrow = c(2, 2))

# A common x-axis range makes the four histograms directly comparable
hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
hist(p_hats2, main = "p = 0.1, n = 400", xlim = c(0, 0.18))
hist(p_hats3, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))
hist(p_hats4, main = "p = 0.02, n = 400", xlim = c(0, 0.18))

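Each histogram is centered at the value of p used in its simulation. The shapes are roughly bell-shaped and symmetric, although the normal approximation is weakest where the expected number of successes, np, is smallest. The distributions with the larger sample size (n = 1040) have a narrower spread than those with n = 400, because the standard error shrinks as n grows. Changing p shifts the center and also changes the spread: the closer p is to 0.5, the wider the distribution.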
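The effect of n and p on the spread can be summarized in a small table. The sketch below runs the four settings listed in the exercise with rbinom in place of the loops; `simulate_p_hats` is a hypothetical helper and the seed is arbitrary, so the simulated values will differ slightly from the loop-based results.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical helper: simulate sample proportions with rbinom instead of a loop
simulate_p_hats <- function(n, p, reps = 5000) {
  rbinom(reps, size = n, prob = p) / n
}

settings <- data.frame(n = c(1040, 400, 1040, 400),
                       p = c(0.1, 0.1, 0.02, 0.02))

# Simulated standard deviation next to the theoretical standard error
settings$sim_sd <- mapply(function(n, p) sd(simulate_p_hats(n, p)),
                          settings$n, settings$p)
settings$theory_se <- sqrt(settings$p * (1 - settings$p) / settings$n)
settings
```

Drawing the number of successes from a binomial distribution is equivalent to counting "atheist" labels in each sampled vector, so the two approaches describe the same sampling distribution.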

Exercise 11

If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?

spa05 <- subset(atheism, nationality == "Spain" & year == "2005")
spa12 <- subset(atheism, nationality == "Spain" & year == "2012")

inference(spa05$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.1003 ;  n = 1146 
## Check conditions: number of successes = 115 ; number of failures = 1031 
## Standard error = 0.0089 
## 95 % Confidence interval = ( 0.083 , 0.1177 )

summary(spa05)

##       nationality          response         year     
##  Spain      :1146   atheist    : 115   Min.   :2005  
##  Afghanistan:   0   non-atheist:1031   1st Qu.:2005  
##  Argentina  :   0                      Median :2005  
##  Armenia    :   0                      Mean   :2005  
##  Australia  :   0                      3rd Qu.:2005  
##  Austria    :   0                      Max.   :2005  
##  (Other)    :   0

summary(spa12)

##       nationality          response         year     
##  Spain      :1145   atheist    : 103   Min.   :2012  
##  Afghanistan:   0   non-atheist:1042   1st Qu.:2012  
##  Argentina  :   0                      Median :2012  
##  Armenia    :   0                      Mean   :2012  
##  Australia  :   0                      3rd Qu.:2012  
##  Austria    :   0                      Max.   :2012  
##  (Other)    :   0

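It is sensible to proceed with inference and report a margin of error for Australia: with p = 0.1 and n = 1040 we expect 104 successes and 936 failures, so the sampling distribution is approximately normal. It is not sensible for Ecuador: with p = 0.02 and n = 400 we expect only 8 successes, which is fewer than 10 and violates the success-failure condition.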
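The expected counts behind this conclusion can be tabulated for both countries. A short sketch, assuming the point estimates from Table 6 quoted in the exercise are the true proportions and using the rule of at least 10 expected successes and failures:

```r
# Expected counts if the point estimates were the true proportions
expected_counts <- data.frame(
  country = c("Australia", "Ecuador"),
  n = c(1040, 400),
  p = c(0.1, 0.02)
)
expected_counts$successes <- expected_counts$n * expected_counts$p
expected_counts$failures  <- expected_counts$n * (1 - expected_counts$p)

# Success-failure condition: at least 10 expected successes and failures
expected_counts$condition_met <-
  expected_counts$successes >= 10 & expected_counts$failures >= 10
expected_counts
```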