EXERCISE 1

In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?

Sample statistics. The data came from polls, which cover only a sample of each country's population rather than everyone in it.

EXERCISE 2

The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?

Data must be random samples. The observations within each sample must be independent.

Whether it is a reasonable assumption depends on whether we have confidence in the pollster to do the sampling correctly.

I suppose it is reasonable to assume that the WIN-Gallup network of leading pollsters took truly random samples.

The data

Turn your attention to Table 6 (pages 15 and 16), which reports the sample size and response percentages for all 57 countries. While this is a useful format to summarize the data, we will base our analysis on the original data set of individual responses to the survey. Load this data set into R with the following command.

download.file("http://www.openintro.org/stat/data/atheism.RData", destfile = "atheism.RData")
load("atheism.RData")

EXERCISE 3

What does each row of Table 6 correspond to? What does each row of atheism correspond to?

Each row in Table 6 corresponds to a country and the percentage of respondents for each category.

Each row in atheism corresponds to one respondent’s answer in a particular country and year.

EXERCISE 4

Using the command below, create a new dataframe called us12 that contains only the rows in atheism associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?

The proportion of atheist responses in atheism is 4.99% (50 of 1,002 responses). This agrees with the value in Table 6, which is 5% after rounding.

us12 <- subset(atheism, nationality == "United States" & year == "2012")

sum(us12$response == "non-atheist")

## [1] 952

sum(us12$response == "atheist")

## [1] 50

sum(us12$response == "atheist")/nrow(us12)

## [1] 0.0499002
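
As a cross-check that needs no download, the two counts printed above (50 atheist, 952 non-atheist) can be rebuilt into a stand-in vector. The name response is a hypothetical reconstruction, not the real us12$response column. A proportion divides by the total number of responses (1,002), and the mean of a logical vector gives exactly that share.

```r
# Hypothetical stand-in for us12$response, rebuilt from the counts above
response <- c(rep("atheist", 50), rep("non-atheist", 952))

# mean() of a logical vector is the proportion of TRUE values
p_hat <- mean(response == "atheist")
round(p_hat, 4)

# Shares of both categories at once
prop.table(table(response))
```

On the real data, the same idea would be mean(us12$response == "atheist").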

Inference on proportions

As was hinted at in Exercise 1, Table 6 provides statistics, that is, calculations made from the sample of 51,927 people. What we’d like, though, is insight into the population parameters. You answer the question, “What proportion of people in your sample reported being atheists?” with a statistic; while the question “What proportion of people on earth would report being atheists” is answered with an estimate of the parameter.

The inferential tools for estimating a population proportion are analogous to those used for means in the last chapter: the confidence interval and the hypothesis test.

EXERCISE 5

Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?

The data is regarded as a random sample from the population.
The observations are independent. All observations in the population have the same chance of being sampled.
Sample size is sufficiently large: np >= 10 and n(1-p) >= 10.

No, I am not confident about the second condition. Polls are voluntary, so it is hard to get a truly representative sample. Also, the polling methods (in person, by phone, or by email) yield different response rates and might bias the selection.

If the conditions for inference are reasonable, we can either calculate the standard error and construct the interval by hand, or allow the inference function to do it for us.

inference(us12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

Note that since the goal is to construct an interval estimate for a proportion, it’s necessary to specify what constitutes a “success”, which here is a response of “atheist”.

Although formal confidence intervals and hypothesis tests don’t show up in the report, suggestions of inference appear at the bottom of page 7: “In general, the error margin for surveys of this kind is ± 3-5% at 95% confidence”.

EXERCISE 6

Based on the R output, what is the margin of error for the estimate of the proportion of atheists in the US in 2012?

Margin of error = critical value (z) × SE = 1.96 * 0.0069 = 0.0135, or 1.35%
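
The standard error, margin of error, and interval for the United States can also be reproduced by hand from the two counts in the inference output. This is a minimal sketch using only base R; the variable names are illustrative, and qnorm(0.975) supplies the critical value that is usually rounded to 1.96.

```r
# Counts taken from the inference() output for the US in 2012
successes <- 50
n <- 1002

p_hat <- successes / n
se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion
z_star <- qnorm(0.975)                # critical value for 95% confidence
me <- z_star * se                     # margin of error
ci <- c(lower = p_hat - me, upper = p_hat + me)

round(c(p_hat = p_hat, se = se, me = me), 4)
round(ci, 4)
```

Rounded to four decimals, these values should line up with the standard error and interval that inference prints.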

EXERCISE 7

Using the inference function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It may be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference function to construct the confidence intervals.

saudi12 <- subset(atheism, nationality == "Saudi Arabia" & year == "2012")

inference(saudi12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.05 ;  n = 500 
## Check conditions: number of successes = 25 ; number of failures = 475 
## Standard error = 0.0097 
## 95 % Confidence interval = ( 0.0309 , 0.0691 )

Saudi Arabia

p-hat = 0.05, n = 500
np = 25 (met), n(1-p) = 475 (met)
Margin of Error = 1.96 * 0.0097 = 0.0190
95% CI: 0.05 +/- 0.019, or (0.031, 0.069)

turkey12 <- subset(atheism, nationality == "Turkey" & year == "2012")

inference(turkey12$response, est = "proportion", type = "ci", method = "theoretical", 
          success = "atheist")

## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0203 ;  n = 1032 
## Check conditions: number of successes = 21 ; number of failures = 1011 
## Standard error = 0.0044 
## 95 % Confidence interval = ( 0.0117 , 0.029 )

Turkey

p-hat = 0.0203, n = 1032
np = 21 (met), n(1-p) = 1011 (met)
Margin of Error = 1.96 * 0.0044 = 0.0086
95% CI: 0.0203 +/- 0.0086, or (0.0117, 0.0289)
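
The same arithmetic can be wrapped in a small function so that both countries are handled in one call. The function prop_ci below is a hypothetical helper written for this sketch, not part of the lab's inference function. It takes the success counts and sample sizes shown in the two outputs above.

```r
# Hypothetical helper: normal-approximation interval for one proportion
prop_ci <- function(successes, n, conf = 0.95) {
  p_hat <- successes / n
  se <- sqrt(p_hat * (1 - p_hat) / n)
  me <- qnorm(1 - (1 - conf) / 2) * se
  c(p_hat = p_hat, se = se, me = me,
    lower = p_hat - me, upper = p_hat + me)
}

# Counts from the two inference() outputs above
results <- rbind(
  "Saudi Arabia" = prop_ci(25, 500),
  "Turkey"       = prop_ci(21, 1032)
)
round(results, 4)
```

Because the function keeps the unrounded standard error, its margins of error can differ in the last digit from values computed with a rounded SE.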

How does the proportion affect the margin of error?

Imagine you’ve set out to survey 1000 people on two questions: are you female? and are you left-handed? Since both of these sample proportions were calculated from the same sample size, they should have the same margin of error, right? Wrong! While the margin of error does change with sample size, it is also affected by the proportion.

Think back to the formula for the standard error: $SE = \sqrt{p(1-p)/n}$. This is then used in the formula for the margin of error for a 95% confidence interval: $ME = 1.96 \times SE = 1.96 \times \sqrt{p(1-p)/n}$. Since the population proportion p is in this ME formula, it should make sense that the margin of error is in some way dependent on the population proportion. We can visualize this relationship by creating a plot of ME vs. p.

The first step is to make a vector p that is a sequence from 0 to 1 with each number separated by 0.01. We can then create a vector of the margin of error (me) associated with each of these values of p using the familiar approximate formula (ME=2×SE). Lastly, we plot the two vectors against each other to reveal their relationship.

n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
plot(me ~ p, ylab = "Margin of Error", xlab = "Population Proportion")

EXERCISE 8

Describe the relationship between p and me.

The curve is symmetric about p = 0.5. The margin of error is 0 at p = 0 and p = 1, rises as p moves toward 0.5 from either side, and is largest at p = 0.5.
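
The shape of the curve can be confirmed numerically from the same vectors used for the plot. This sketch rebuilds p and me exactly as in the plotting code, then looks up where the margin of error peaks and checks that the curve mirrors itself around p = 0.5.

```r
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p) / n)

# Proportion at which the margin of error is largest
p[which.max(me)]

# Largest margin of error for this sample size
max(me)

# Symmetry check: the curve read left to right matches right to left
isTRUE(all.equal(me, rev(me)))
```

With n = 1000 the peak is a little over 3 percentage points, which is why a 50/50 question is the worst case for a poll of a given size.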

Success-failure condition

The textbook emphasizes that you must always check conditions before making inference. For inference on proportions, the sample proportion can be assumed to be nearly normal if it is based upon a random sample of independent observations and if both np≥10 and n(1−p)≥10. This rule of thumb is easy enough to follow, but it makes one wonder: what’s so special about the number 10?

The short answer is: nothing. You could argue that we would be fine with 9 or that we really should be using 11. The “best” value for such a rule of thumb is, at least to some degree, arbitrary. However, when np and n(1−p) both reach 10, the sampling distribution is sufficiently normal to use confidence intervals and hypothesis tests that are based on that approximation.
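
The rule of thumb is simple enough to express as a one-line check. The function success_failure_ok is a hypothetical helper for illustration, with the threshold left as an argument to reflect that 10 is a convention. The two calls use n and p combinations that appear later in this lab.

```r
# Hypothetical helper: TRUE when both expected counts reach the threshold
success_failure_ok <- function(n, p, threshold = 10) {
  n * p >= threshold & n * (1 - p) >= threshold
}

success_failure_ok(n = 1040, p = 0.1)   # expected counts: 104 and 936
success_failure_ok(n = 400,  p = 0.02)  # expected counts: 8 and 392
```

Lowering threshold to 8 would make the second call pass as well, which shows how much the verdict depends on the chosen cutoff.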

We can investigate the interplay between n and p and the shape of the sampling distribution by using simulations. To start off, we simulate the process of drawing 5000 samples of size 1040 from a population with a true atheist proportion of 0.1. For each of the 5000 samples we compute p̂ and then plot a histogram to visualize their distribution.

p <- 0.1
n <- 1040
p_hats <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats[i] <- sum(samp == "atheist")/n
}

hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))

EXERCISE 9

Describe the sampling distribution of sample proportions at n=1040 and p=0.1. Be sure to note the center, spread, and shape. Hint: Remember that R has functions such as mean to calculate summary statistics.

summary(p_hats)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07019 0.09327 0.09904 0.09969 0.10577 0.12981

sd(p_hats)

## [1] 0.009287382

The sampling distribution is approximately normal (unimodal and symmetric). It is centered at about 0.10 (mean 0.0997, median 0.0990), ranges from about 0.07 to 0.13, and has a standard deviation of 0.0093.
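
The simulated spread can be compared with the theoretical standard error. The sketch below swaps the lab's sample loop for rbinom, which draws the number of "atheist" responses in each sample directly. This is an alternative technique, not the lab's method, and the seed and the name p_hats_binom are arbitrary choices made here.

```r
set.seed(1)   # arbitrary seed, only so the sketch is reproducible
p <- 0.1
n <- 1040

# Each rbinom() draw is the count of "atheist" responses in one sample of size n
p_hats_binom <- rbinom(5000, size = n, prob = p) / n

mean(p_hats_binom)       # should be close to p
sd(p_hats_binom)         # simulated standard error
sqrt(p * (1 - p) / n)    # theoretical standard error
```

The theoretical value is about 0.0093, in line with the standard deviation of p_hats reported above.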

EXERCISE 10

Repeat the above simulation three more times but with modified sample sizes and proportions: for n=400 and p=0.1, n=1040 and p=0.02, and n=400 and p=0.02. Plot all four histograms together by running the par(mfrow = c(2, 2)) command before creating the histograms. You may need to expand the plot window to accommodate the larger two-by-two plot. Describe the three new sampling distributions. Based on these limited plots, how does n appear to affect the distribution of p̂ ? How does p affect the sampling distribution?

The three new sampling distributions also look approximately normal, but with different centers, spreads, and bar heights.

A larger n reduces the spread of the distribution.

A larger p (closer to 0.5) moves the center to the right and increases the spread, so the histogram is wider and its bars are lower.

p <- 0.1
n <- 400
p_hats2 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats2[i] <- sum(samp == "atheist")/n
}

p <- 0.02
n <- 1040
p_hats3 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats3[i] <- sum(samp == "atheist")/n
}

p <- 0.02
n <- 400
p_hats4 <- rep(0, 5000)

for(i in 1:5000){
  samp <- sample(c("atheist", "non_atheist"), n, replace = TRUE, prob = c(p, 1-p))
  p_hats4[i] <- sum(samp == "atheist")/n
}

par(mfrow = c(2, 2))
hist(p_hats, main = "p = 0.1, n = 1040", xlim = c(0, 0.18))
hist(p_hats2, main = "p = 0.1, n = 400", xlim = c(0, 0.18))
hist(p_hats3, main = "p = 0.02, n = 1040", xlim = c(0, 0.18))
hist(p_hats4, main = "p = 0.02, n = 400", xlim = c(0, 0.18))

mean(p_hats)

## [1] 0.09969

mean(p_hats2)

## [1] 0.099759

mean(p_hats3)

## [1] 0.01995423

mean(p_hats4)

## [1] 0.0198785

Once you’re done, you can reset the layout of the plotting window by using the par(mfrow = c(1, 1)) command or clicking on “Clear All” above the plotting window (if using RStudio). Note that the latter will get rid of all your previous plots.
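
The four simulations differ only in n and p, so the repeated loops can be folded into one function. The sketch below is one possible refactoring, not the lab's code: simulate_p_hats is a hypothetical helper that uses rbinom in place of sample, and the seed is arbitrary. It tabulates the simulated center and spread next to the theoretical standard error for each setting.

```r
set.seed(1)   # arbitrary seed for reproducibility

# Hypothetical helper: simulate `reps` sample proportions for given n and p
simulate_p_hats <- function(n, p, reps = 5000) {
  rbinom(reps, size = n, prob = p) / n
}

# Four settings, in the same order as the histograms above
settings <- expand.grid(n = c(1040, 400), p = c(0.1, 0.02))

# One column of 5000 simulated proportions per setting
sims <- mapply(simulate_p_hats, n = settings$n, p = settings$p)

settings$mean_p_hat <- colMeans(sims)
settings$sd_p_hat   <- apply(sims, 2, sd)
settings$theory_se  <- sqrt(settings$p * (1 - settings$p) / settings$n)
settings
```

Each column of sims could be passed to hist to redraw the two-by-two panel.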

EXERCISE 11

If you refer to Table 6, you’ll find that Australia has a sample proportion of 0.1 on a sample size of 1040, and that Ecuador has a sample proportion of 0.02 on 400 subjects. Let’s suppose for this exercise that these point estimates are actually the truth. Then given the shape of their respective sampling distributions, do you think it is sensible to proceed with inference and report margins of error, as the report does?

Australia: p-hat = 0.1, n = 1040; np = 104 (met), n(1-p) = 936 (met)

Ecuador: p-hat = 0.02, n = 400; np = 8 (not met), n(1-p) = 392 (met)

For Australia, the conditions for inference are met. For Ecuador, the sample size is not large enough: the np >= 10 condition is not met. However, both sampling distributions look approximately normal and are centered near their respective proportions. Since the distribution looks normal and the choice of “10” is somewhat arbitrary, it is reasonable to proceed with inference.
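
One way to put a number on "looks normal" is the theoretical skewness of the sampling distribution. The formula below is the standard skewness of a binomial distribution, which the lab itself does not use, so this is a supplementary check under that assumption. The function name p_hat_skewness is hypothetical.

```r
# Skewness of a sample proportion equals that of the binomial count:
# (1 - 2p) / sqrt(n * p * (1 - p)); a value of 0 means perfectly symmetric
p_hat_skewness <- function(n, p) {
  (1 - 2 * p) / sqrt(n * p * (1 - p))
}

skew <- c(
  Australia = p_hat_skewness(n = 1040, p = 0.10),
  Ecuador   = p_hat_skewness(n = 400,  p = 0.02)
)
round(skew, 2)
```

Ecuador's value is roughly four times Australia's, so its histogram has a mild right skew even when it looks bell-shaped by eye. That suggests reading Ecuador's margin of error with somewhat more caution.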

RLab8 - Inference for Categorical Data

R. LEE

Nov 2, 2019