Inference for Categorical Data

Exercise 1 -

In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics or population parameters?

The percentages appear to be sample statistics.

Exercise 2 -

The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?

We must assume that respondents were selected by random sampling, so that the observations are independent of each other. With about 50,000 responses recorded, the sample is well under 10% of the global population, so this seems like a reasonable assumption.

download.file("http://www.openintro.org/stat/data/atheism.RData", destfile = "atheism.RData")
load("atheism.RData")

head(atheism)

##   nationality    response year
## 1 Afghanistan non-atheist 2012
## 2 Afghanistan non-atheist 2012
## 3 Afghanistan non-atheist 2012
## 4 Afghanistan non-atheist 2012
## 5 Afghanistan non-atheist 2012
## 6 Afghanistan non-atheist 2012

Exercise 3 -

What does each row of Table 6 correspond to? What does each row of atheism correspond to?

Each row in Table 6 corresponds to a country and its survey results. Each row in atheism corresponds to one respondent, with that person’s nationality, response, and survey year.
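The difference between the two shapes can be sketched with a toy example. The data frame below is hypothetical (countries "A" and "B" and their responses are made up); it only mimics the columns of atheism and shows how respondent-level rows collapse into one summary row per country.

```r
# Hypothetical toy data in the same shape as atheism: one row per respondent
toy <- data.frame(
  nationality = c("A", "A", "A", "A", "B", "B"),
  response    = c("atheist", "non-atheist", "non-atheist", "non-atheist",
                  "atheist", "atheist"),
  year        = 2012
)

# Flag atheist responses, then average the flag within each country
toy$is_atheist <- toy$response == "atheist"
by_country <- aggregate(is_atheist ~ nationality, data = toy, FUN = mean)
names(by_country)[2] <- "prop_atheist"

# One row per country, which is the shape of a summary table
print(by_country)
```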

Exercise 4 -

Using the command below, create a new dataframe called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?

us12 <- subset(atheism, nationality == "United States" & year == "2012")
us12$nationality <- as.factor(as.character(us12$nationality))
prop.table(table(us12$nationality,us12$response))

##                
##                   atheist non-atheist
##   United States 0.0499002   0.9500998

The proportion of atheist responses is about 5%, which agrees with the value listed in Table 6.

Exercise 5 -

Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?

Observations must be independent of each other. Since the sample comes from less than 10% of the population, we can assume that the observations are independent.

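The success-failure condition must also hold. With a US sample of about 1,000 and about 5% atheists, the observed number of atheists is about 50 and the number of non-atheists is about 950; both are at least 10. This allows us to treat the sampling distribution of the sample proportion as approximately normal.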
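The count check can be written as a small helper. This is a minimal sketch: the function name check_success_failure and its threshold argument are assumptions made for illustration, and the counts 50 and 952 are the US 2012 values reported by the inference output in this lab.

```r
# Hypothetical helper: TRUE when both counts reach the threshold (default 10)
check_success_failure <- function(successes, failures, threshold = 10) {
  successes >= threshold && failures >= threshold
}

# US 2012: 50 atheists and 952 non-atheists
us_ok <- check_success_failure(successes = 50, failures = 952)
print(us_ok)
```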

inference(us12$response, est="proportion", type="ci", method="theoretical", success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

Exercise 6 -

Based on the R output, what is the margin of error for the estimate of the proportion of atheists in the US in 2012?

margin of error = 1.96 * 0.0069 = 0.0135
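As a cross-check, the standard error, margin of error, and interval can be recomputed by hand from p_hat and n. This is a sketch using the usual normal-approximation formula; whether inference() computes it in exactly this way is an assumption, although the results should agree with its output up to rounding.

```r
# Values copied from the inference() output for the US in 2012
p_hat <- 0.0499
n <- 1002

se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion
z_star <- qnorm(0.975)                # critical value for 95%, about 1.96
me <- z_star * se                     # margin of error
ci <- c(lower = p_hat - me, upper = p_hat + me)

print(round(c(se = se, me = me, ci), 4))
```

The margin of error is also half the width of the reported interval: (0.0634 - 0.0364) / 2 = 0.0135.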

Exercise 7 -

Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new datasets for each of the two countries first, and then use these datasets in the inference function to construct the confidence intervals.

China

The conditions for inference have been met: the observations can be treated as independent (the sample is less than 10% of the population), and both the number of atheists (235) and the number of non-atheists (265) are at least 10.

ch12 <- subset(atheism, nationality == "China" & year == "2012")
ch12$nationality <- as.factor(as.character(ch12$nationality))
prop.table(table(ch12$nationality,ch12$response))

##        
##         atheist non-atheist
##   China    0.47        0.53

inference(ch12$response, est="proportion", type="ci", method="theoretical", success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.47 ;  n = 500 
## Check conditions: number of successes = 235 ; number of failures = 265 
## Standard error = 0.0223 
## 95 % Confidence interval = ( 0.4263 , 0.5137 )

margin of error = 1.96 * 0.0223 = 0.0437

Nigeria

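The conditions for inference are only just met: the observations can be treated as independent, and the observed number of atheists is exactly 10, the minimum the success-failure check allows (there are 1,039 non-atheists). The normal approximation is therefore borderline for Nigeria.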

nig12 <- subset(atheism, nationality == "Nigeria" & year == "2012")
nig12$nationality <- as.factor(as.character(nig12$nationality))
prop.table(table(nig12$nationality,nig12$response))

##          
##               atheist non-atheist
##   Nigeria 0.009532888 0.990467112

inference(nig12$response, est="proportion", type="ci", method="theoretical", success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0095 ;  n = 1049 
## Check conditions: number of successes = 10 ; number of failures = 1039 
## Standard error = 0.003 
## 95 % Confidence interval = ( 0.0037 , 0.0154 )

margin of error = 1.96 * 0.003 = 0.00588
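The three margins of error can be compared side by side by recomputing them from the counts. This is a sketch: the counts are copied from the inference() outputs for each country, and the data frame name counts is an assumption for illustration.

```r
# Successes and sample sizes copied from the inference() output
counts <- data.frame(
  country  = c("United States", "China", "Nigeria"),
  atheists = c(50, 235, 10),
  n        = c(1002, 500, 1049)
)

counts$p_hat <- counts$atheists / counts$n
counts$se    <- sqrt(counts$p_hat * (1 - counts$p_hat) / counts$n)
counts$me    <- qnorm(0.975) * counts$se   # 95% margin of error

print(cbind(counts["country"], round(counts[c("p_hat", "se", "me")], 4)))
```

China has the largest margin of error of the three because its proportion is close to 0.5 and its sample is the smallest.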

Exercise 8 -

Describe the relationship between p and me.

n <- 1000
p <- seq(0,1,.01)
me <- 2 * sqrt(p * (1-p)/n)
plot(me~p,ylab="Margin of Error",xlab="Population Proportion")

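The relationship between p and me is an inverted U that is symmetric about p = 0.5. It is not strictly quadratic: me is proportional to sqrt(p * (1 - p)), so it is the square of me that is quadratic in p. Between p = 0 and 0.5, the margin of error increases, peaking at p = 0.5, but after p = 0.5, the margin of error decreases.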
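The shape of the curve can be confirmed numerically rather than read off the plot. The sketch below reuses the same n, p grid, and formula as the plotting code; the names p_at_max, max_me, and is_symmetric are introduced here for illustration.

```r
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p) / n)

p_at_max <- p[which.max(me)]                    # proportion with the largest margin of error
max_me <- max(me)                               # size of that margin of error
is_symmetric <- isTRUE(all.equal(me, rev(me)))  # same curve read from either end

print(c(p_at_max = p_at_max, max_me = max_me))
print(is_symmetric)
```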

p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

summary(p_hats)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07019 0.09327 0.09904 0.09969 0.10577 0.12981

Exercise 9 -

Describe the sampling distribution of sample proportions at n=1040 and p=0.1. Be sure to note the center, spread, and shape.

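The distribution looks approximately normal. In the summary above, the simulated sample proportions range from 0.07019 to 0.12981, with a center at about 0.10 (mean 0.09969) and a middle 50% between 0.09327 and 0.10577. Because the simulation is random, these values change slightly from run to run.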
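The center and spread can be compared with what theory predicts, namely a mean of p and a standard error of sqrt(p * (1 - p) / n). The sketch below swaps the sample() loop for rbinom(), which draws the number of atheists in each simulated sample directly; the seed value is an arbitrary choice added for reproducibility.

```r
set.seed(1)  # arbitrary seed, only so the run can be repeated

p <- 0.1
n <- 1040

# 5000 simulated sample proportions: binomial counts divided by n
p_hats_sim <- rbinom(5000, size = n, prob = p) / n

theoretical_se <- sqrt(p * (1 - p) / n)

print(c(sim_mean = mean(p_hats_sim),
        sim_sd = sd(p_hats_sim),
        theoretical_se = theoretical_se))
```

The simulated mean should sit close to 0.1 and the simulated standard deviation close to the theoretical standard error of about 0.0093.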

Exercise 10 -

Repeat the above simulation three more times but with modified sample sizes and proportions: for n = 400 and p = 0.1, n = 1040 and p = 0.02, and n = 400 and p = 0.02. Plot all four histograms together by running the par(mfrow=c(2,2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two by two plot. Describe the three new sampling distributions. Based on these limited plots, how does n appear to affect the distribution of p(hat)? How does p affect the sampling distribution?

# Simulate reps sample proportions for a given sample size n and true proportion p
sim_p_hats <- function(n, p, reps = 5000) {
  p_hats <- rep(0, reps)
  for(i in 1:reps){
    samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
    p_hats[i] <- sum(samp == "atheist")/n
  }
  p_hats
}

# The original setting plus the three new combinations of n and p
p_hats  <- sim_p_hats(n = 1040, p = 0.1)
p_hats1 <- sim_p_hats(n = 400,  p = 0.1)
p_hats2 <- sim_p_hats(n = 1040, p = 0.02)
p_hats3 <- sim_p_hats(n = 400,  p = 0.02)

par(mfrow=c(2,2))
hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
hist(p_hats1, main = "p = 0.1, n = 400", xlim = c(0, 0.18))
hist(p_hats2, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))
hist(p_hats3, main = "p = 0.02, n = 400", xlim = c(0, 0.18))

The new distributions all appear approximately normal as well, although the p_hats3 distribution (p = 0.02, n = 400) shows some right skew. From these plots, a larger n narrows the spread of the distribution, whereas p sets the center of the distribution. A p closer to 0 also reduces the spread and, at small n, makes the skew more visible.
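The pattern in the four histograms can be checked against the theoretical standard error sqrt(p * (1 - p) / n) for each setting. This is a sketch; the data frame name settings and its column names are assumptions for illustration, and the rows follow the same order as p_hats, p_hats1, p_hats2, and p_hats3.

```r
# All four combinations of n and p used in the simulations
settings <- expand.grid(n = c(1040, 400), p = c(0.1, 0.02))

# Theoretical spread and expected number of atheists per sample
settings$se <- sqrt(settings$p * (1 - settings$p) / settings$n)
settings$expected_atheists <- settings$n * settings$p

print(settings)
```

For a fixed p, the smaller sample has the larger standard error. The last row (n = 400, p = 0.02) is the only one with fewer than 10 expected atheists, which matches the skew seen in its histogram.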

Exercise 11 -

If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?

For Australia, it would be reasonable to proceed with inference because the expected number of atheists would be 104, which is greater than 10. For Ecuador, however, it would not be reasonable to proceed with inference because the expected number of atheists would be 8, which is less than 10, meaning that the sampling distribution is probably not normal.
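One way to see why the normal approximation is doubtful for Ecuador is to compare exact binomial tail probabilities beyond the usual cutoffs of 1.96 standard deviations from the mean. Under a normal model each tail would hold about 2.5%. This is a sketch that treats the point estimates as the truth, as the exercise suggests; the helper name tail_probs is an assumption for illustration.

```r
# Exact binomial probability of landing beyond mean +/- 1.96 sd, on each side
tail_probs <- function(n, p) {
  mu <- n * p
  sigma <- sqrt(n * p * (1 - p))
  lower_cut <- mu - 1.96 * sigma
  upper_cut <- mu + 1.96 * sigma
  c(lower = pbinom(floor(lower_cut), n, p),                  # P(X <= lower_cut)
    upper = pbinom(ceiling(upper_cut) - 1, n, p,
                   lower.tail = FALSE))                      # P(X >= upper_cut)
}

australia <- tail_probs(n = 1040, p = 0.1)
ecuador   <- tail_probs(n = 400,  p = 0.02)

print(round(rbind(australia, ecuador), 4))
```

If the two tails for a country are noticeably unequal, the sampling distribution is skewed and a symmetric margin of error misstates the uncertainty.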

Inference for Categorical Data

Adeola Animasahun

11/3/2019
