Lab - Inference for categorical data

Lab report

Load data:

download.file("http://www.openintro.org/stat/data/atheism.RData", destfile = "atheism.RData")
load("atheism.RData")

Exercises:

Exercise 1:

The percentages generated from the results of the survey are sample statistics: they describe the people who were surveyed, not the whole population. I think this is clear since the paragraph mentions that we would be using our sample data to make inferences about the true population proportions.

Exercise 2:

We would assume that the sample consisted of 51,927 people who were randomly selected from all over the world, which we take to be true in this case. We must also assume that these 51,927 observations truthfully represent the religious identities of all countries, including religious minorities and areas with multiple religions.

Exercise 3:

Each row of Table 6 corresponds to one surveyed country and its breakdown of religious identification. Each row of “atheism” corresponds to one individual surveyed, or one observation, recording that person's nationality, the survey year, and whether they responded as an atheist or a non-atheist.

Exercise 4:

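set.seed(031699)
# Keep only the 2012 respondents from the United States
us12 <- subset(atheism, nationality == "United States" & year == "2012")
# Drop the unused country levels so the table has a single row
us12$nationality<-as.factor(as.character(us12$nationality))
# Proportion of each response among the US respondents
us12prop<-prop.table(table(us12$nationality, us12$response))
# Note: table() counts each distinct proportion, so the proportions appear as labels in the output
table(us12prop)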

## us12prop
## 0.0499001996007984  0.950099800399202 
##                  1                  1

Our calculated proportion of atheists in the United States is 0.0499, which rounds to the 5% (0.05) reported in Table 6. The two values agree; the small difference is due only to rounding.

Exercise 5:

The observations must be independent, which in this survey we assume is true: participants were selected randomly and the sample is less than 10% of the population, so this condition is met. The sampling distribution of the sample proportion must also be nearly normal, which requires at least 10 successes and 10 failures. Since the proportion of atheists is 0.05 and the number of observations is 1,002, the observed number of atheists is 50 and the number of non-atheists is 952, both greater than 10. An assumption of a nearly normal distribution is well-founded.
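The check on the number of atheists and non-atheists can be sketched as follows. This is a minimal illustration, not part of the lab code: the variable names are made up here, and the counts (50 atheists out of 1,002 respondents) are the ones quoted in this exercise.

```r
# Check the counts of successes and failures for the 2012 US sample
n <- 1002          # sample size quoted in the text
successes <- 50    # respondents who answered "atheist"
failures <- n - successes

# The normal approximation is usually considered reasonable when both counts are at least 10
condition_met <- successes >= 10 && failures >= 10
print(c(successes = successes, failures = failures))
print(condition_met)
```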

Exercise 6:

set.seed(121314)
inference(us12$response, est = "proportion", type = "ci", method = "theoretical", success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

( 0.0634 - 0.0364 )/2

## [1] 0.0135

Since the 95% confidence interval is (0.0364, 0.0634), the MOE is half the width of the interval, which is 0.0135.
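As a cross-check, the same margin of error can be reproduced from the normal-approximation formula rather than from the interval endpoints. This is a sketch under the assumption that the interval was built as p_hat plus or minus a critical value times the standard error; the variable names are illustrative.

```r
# Margin of error for the 2012 US sample from the normal approximation
n <- 1002
p_hat <- 50 / n
se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion
z_star <- qnorm(0.975)                # critical value for 95% confidence, about 1.96
me <- z_star * se
round(c(se = se, me = me), 4)
```

Rounded to four decimals, these should agree with the standard error and margin of error reported above.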

Exercise 7:

set.seed(050373)
chi12<-subset(atheism, nationality == "China" & year == "2012")
chi12$nationality <- as.factor(as.character(chi12$nationality))
( chi12prop <- prop.table(table(chi12$nationality, chi12$response)) )

##        
##         atheist non-atheist
##   China    0.47        0.53

inference(chi12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.47 ;  n = 500 
## Check conditions: number of successes = 235 ; number of failures = 265 
## Standard error = 0.0223 
## 95 % Confidence interval = ( 0.4263 , 0.5137 )

(0.5137-0.4263)/2

## [1] 0.0437

spa12 <- subset(atheism, nationality == "Spain" & year == "2012")
spa12$nationality <- as.factor(as.character(spa12$nationality))
( spa12prop <- prop.table(table(spa12$nationality, spa12$response)) )

##        
##            atheist non-atheist
##   Spain 0.08995633  0.91004367

inference(spa12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.09 ;  n = 1145 
## Check conditions: number of successes = 103 ; number of failures = 1042 
## Standard error = 0.0085 
## 95 % Confidence interval = ( 0.0734 , 0.1065 )

(0.1065-0.0734)/2

## [1] 0.01655

I picked China and Spain. The MOE for China was calculated to be around 0.0437, and the MOE for Spain was around 0.01655. We can apply the same conditions as for the US sample: we assume the observations within each sample are independent, and each sample has more than 10 successes and 10 failures (235 and 265 for China, 103 and 1042 for Spain), so the sampling distribution of each proportion is nearly normal. We can conclude that both countries meet the conditions for inference.

Exercise 8:

set.seed(032322)
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion", main = "For a sample size of n = 1000")

The MOE rises and then falls symmetrically with p. As the population proportion increases from 0 to 0.5, the MOE also increases, reaching its maximum at p = 0.5. As the proportion continues past 0.5, the reverse occurs and the MOE decreases to 0 at p = 1. The MOE depends on p only through p*(1-p), which is quadratic in p and symmetric about 0.5, so the two halves of the curve (p from 0 to 0.5 and p from 0.5 to 1) are mirror images of each other.
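The shape of this relationship can also be checked numerically. The sketch below reuses the same formula and grid of p values as the plot, and only adds three illustrative checks for the location of the peak, its height, and the symmetry of the curve.

```r
# Margin of error as a function of p for n = 1000
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p) / n)

p[which.max(me)]                 # value of p where the margin of error peaks
max(me)                          # largest margin of error on the grid
isTRUE(all.equal(me, rev(me)))   # TRUE if the curve is symmetric about p = 0.5
```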

Exercise 9:

set.seed(020976)
p <- 0.02
n <- 400
p_hats <- rep(0, 5000)
for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}
hist(p_hats, main = "p = 0.02, n = 400", xlim = c(0, 0.18))

mean(p_hats)

## [1] 0.01999

sd(p_hats)

## [1] 0.006971522

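The sampling distribution is unimodal and fairly symmetrical. It is centered at 0.01999, very close to the true p = 0.02, with a standard deviation of 0.006972. This distribution looks approximately normal, although with only n*p = 8 expected atheists per sample the approximation is rough and some right skew is to be expected.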
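The simulated center and spread can be compared with what theory predicts for a sample proportion. This is a small sketch with illustrative variable names, assuming the usual standard error formula sqrt(p * (1 - p) / n).

```r
# Theoretical spread of p_hat for p = 0.02, n = 400
p <- 0.02
n <- 400
se_theory <- sqrt(p * (1 - p) / n)   # standard error of the sample proportion
expected_successes <- n * p          # expected number of atheists per sample

se_theory
expected_successes
```

The theoretical standard error works out to 0.007, which is close to the simulated standard deviation, and the expected number of successes is 8.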

Exercise 10:

set.seed(011593)
p_too <- c(0.1, 0.1, 0.02, 0.02)
n_too <- c(1040, 400, 1040, 400)
p_hats_too <- data.frame(matrix(0, nrow = 5000, ncol = 4))

# For each of the 5000 simulations, draw one sample per (p, n) combination
for(i in 1:5000){
  for(j in 1:4){
    samp <- sample(c("atheist", "non_atheist"), n_too[j], replace = TRUE, prob = c(p_too[j], 1-p_too[j]))
    p_hats_too[i, j] <- sum(samp == "atheist")/n_too[j]
  }
}

par(mfrow = c(2, 2))
hist(p_hats_too[, 1], xlab = "", main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
hist(p_hats_too[, 2], xlab = "", main = "p = 0.1, n = 400", xlim = c(0, 0.18))
hist(p_hats_too[, 3], xlab = "", main = "p = 0.02, n = 1040", xlim = c(0, 0.18))
hist(p_hats_too[, 4], xlab = "", main = "p = 0.02, n = 400", xlim = c(0, 0.18))

par(mfrow=c(1,1))

All four distributions are unimodal. The distributions for p=0.1 with n=1040 and p=0.1 with n=400 are the most symmetrical, while those for p=0.02 with n=1040 and p=0.02 with n=400 are more right skewed. Based on the histograms above, p determines where the center of each distribution falls, and n affects the spread: higher n values create a narrower distribution around the same p.
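To put numbers on the spread of the four panels, the theoretical standard errors can be tabulated side by side. This is a sketch with illustrative names, using the same four (p, n) combinations as the simulation.

```r
# Theoretical standard error of p_hat for each (p, n) combination
p_vals <- c(0.1, 0.1, 0.02, 0.02)
n_vals <- c(1040, 400, 1040, 400)
se_vals <- sqrt(p_vals * (1 - p_vals) / n_vals)

data.frame(p = p_vals, n = n_vals, se = round(se_vals, 4))
```

For a fixed p, the row with the larger n has the smaller standard error, which is the narrowing seen in the histograms.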

Exercise 11:

par(mfrow=c(1,1))

For Ecuador, where the sample proportion of atheists was around 0.02 with 400 observations, the number of atheists in the sample is 8. This is fewer than 10, so we cannot assume a nearly normal distribution, and the reported margin of error may not lead to a sensible inference. For Australia, with a proportion of 0.1 and a sample size of 1,040, the number of atheists in the sample is 104, so it is reasonable to assume a nearly normal distribution and proceed with inference.
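The two countries can be compared in one small table. The sketch below uses the proportions and sample sizes quoted in this exercise; the data frame and its column names are made up for illustration.

```r
# Counts of atheists and non-atheists implied by the quoted proportions
samples <- data.frame(
  country = c("Ecuador", "Australia"),
  p_hat = c(0.02, 0.1),
  n = c(400, 1040)
)
samples$successes <- samples$p_hat * samples$n
samples$failures <- samples$n - samples$successes

# Both counts should be at least 10 for the normal approximation
samples$condition_met <- samples$successes >= 10 & samples$failures >= 10
samples
```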

On your own:

1a:

set.seed(121112)
spain05 <- subset(atheism, nationality == "Spain" & year == "2005")
spain05$nationality <- as.factor(as.character(spain05$nationality))
table(spain05$nationality, spain05$response)

##        
##         atheist non-atheist
##   Spain     115        1031

spain12 <- subset(atheism, nationality == "Spain" & year == "2012")
spain12$nationality <- as.factor(as.character(spain12$nationality))
table(spain12$nationality, spain12$response)

##        
##         atheist non-atheist
##   Spain     103        1042

inference(spain05$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.1003 ;  n = 1146 
## Check conditions: number of successes = 115 ; number of failures = 1031 
## Standard error = 0.0089 
## 95 % Confidence interval = ( 0.083 , 0.1177 )

inference(spain12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.09 ;  n = 1145 
## Check conditions: number of successes = 103 ; number of failures = 1042 
## Standard error = 0.0085 
## 95 % Confidence interval = ( 0.0734 , 0.1065 )

As previously mentioned, we can assume observations to be independent from one another. The number of atheists in the Spain sample is 115 in 2005 and 103 in 2012. Both are greater than 10, as are the numbers of non-atheists, so we can assume a nearly normal distribution. There is no convincing evidence that Spain has seen a change in its atheism index from 2005 to 2012. The confidence intervals for 2005, (0.083, 0.1177), and for 2012, (0.0734, 0.1065), overlap substantially, and the 2012 estimate of p12 = 0.09 falls inside the 2005 interval. We therefore fail to reject the null hypothesis of no change, and the observed difference can be attributed to chance.

1b:

set.seed(111112)
us05 <- subset(atheism, nationality == "United States" & year == "2005")
us05$nationality <- as.factor(as.character(us05$nationality))
table(us05$nationality, us05$response)

##                
##                 atheist non-atheist
##   United States      10         992

table(us12$nationality, us12$response)

##                
##                 atheist non-atheist
##   United States      50         952

inference(us05$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.01 ;  n = 1002 
## Check conditions: number of successes = 10 ; number of failures = 992 
## Standard error = 0.0031 
## 95 % Confidence interval = ( 0.0038 , 0.0161 )

inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

The number of atheists in the US sample is 10 in 2005 and 50 in 2012. The 2005 count of 10 only just reaches the minimum of 10 successes, so the normal approximation is borderline for that year. There is no overlap between the confidence interval for the 2005 sample, (0.0038, 0.0161), and the 2012 sample, (0.0364, 0.0634). Additionally, p12 = 0.05 lies outside the 2005 interval, so we would reject the null hypothesis of no change. This change in atheism is not likely due to chance.
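The interval comparisons for Spain and the United States can be written as a small check. The function `intervals_overlap` is a hypothetical helper introduced here for illustration, and the interval endpoints are copied from the `inference()` output in this section.

```r
# Hypothetical helper: TRUE if two confidence intervals share any values
intervals_overlap <- function(ci_a, ci_b) {
  ci_a[1] <= ci_b[2] && ci_b[1] <= ci_a[2]
}

# 95% intervals for 2005 and 2012, copied from the output above
spain_overlap <- intervals_overlap(c(0.083, 0.1177), c(0.0734, 0.1065))
us_overlap <- intervals_overlap(c(0.0038, 0.0161), c(0.0364, 0.0634))
c(Spain = spain_overlap, US = us_overlap)
```

Comparing intervals this way is a rough check rather than a formal test of the difference between the two proportions.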

2:

If there has been no change in the atheism index, but we detect a change due to chance and reject the null hypothesis even though it is true, we have made a type 1 error. At a significance level of 0.05 and with all 39 countries, we would expect to make this type 1 error for 39*0.05 = 1.95 countries, or about 2 countries after rounding.
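The expected count is just the number of tests multiplied by the significance level. A one-line sketch, with illustrative variable names:

```r
# Expected number of type 1 errors when every null hypothesis is true
n_countries <- 39
alpha <- 0.05
expected_false_positives <- n_countries * alpha
expected_false_positives
```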

3:

set.seed(111212)
0.01/1.96

## [1] 0.005102041

((0.5*0.5)/0.0051^2)

## [1] 9611.688

If the MOE is 0.01, then at 95% confidence the SE would be 0.01/1.96 = 0.0051. Since we don’t know the value of p, we assume that p = 0.5, which gives the largest possible SE. Solving for n with p = 0.5 and the SE from the previous step gives 0.5^2/0.0051^2 = 9612. The sample must include about 9612 people in this worst-case scenario. If a good prior estimate of the proportion of residents who attend services is available, it can be used in place of 0.5, which would lower the required n.
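The calculation is sensitive to how the SE is rounded. The sketch below, with illustrative variable names, repeats the calculation once with the unrounded SE and once with the SE rounded to 0.0051 as in the code above.

```r
# Required sample size for a margin of error of 0.01 at 95% confidence
me_target <- 0.01
z_star <- 1.96
p_guess <- 0.5    # worst case: p * (1 - p) is largest at 0.5

se_needed <- me_target / z_star
n_unrounded <- p_guess * (1 - p_guess) / se_needed^2    # SE kept at full precision
n_rounded_se <- p_guess * (1 - p_guess) / 0.0051^2      # SE rounded to 0.0051

c(unrounded = n_unrounded, rounded_se = n_rounded_se)
```

With the unrounded SE the requirement comes out at about 9604 people, slightly below the figure obtained from the rounded SE, so either value is a reasonable answer depending on the rounding used.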

Teamwork report

Team member	Attendance	Author	Contribution %
Name of member 1	Yes / No	Yes / No	25%
Name of member 2	Yes / No	Yes / No	25%
Name of member 3	Yes / No	Yes / No	25%
Name of member 4	Yes / No	Yes / No	25%
Total			100%