Part 6: Making Statistical Decisions about a Population Proportion

Probability distributions are useful for telling us how likely we are to see a certain value in a population.

For any population, we can take a random sample and calculate a statistic to help make decisions about a population parameter. In a population, we're more likely to observe certain values than others. The same is true for sample statistics.

Sampling Distributions

Suppose that we want to take an exit poll of voters in Lincoln.

Each vote can be considered a random variable, because the outcome varies from voter to voter. A statistic, like the percent of voters in your sample who support a certain candidate, can also be thought of as a random variable. That sample percent of voters will change from sample to sample!

Sampling Distribution: the distribution of the values a statistic takes across all possible random samples of the same size from a population.

Every statistic has a sampling distribution. Some sampling distributions can be modeled using known probability distributions, such as the normal distribution. Others can be found by analyzing the data. Whatever the shape of the sampling distribution, it emerges as we repeat the random sampling many times.
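To watch a sampling distribution emerge, we can simulate the exit poll many times. This is only a minimal sketch in R: the true level of support (0.54), the sample size (100), and the number of repeated samples (10,000) are made-up values for illustration, not figures from a real poll.

```r
# Simulate the sampling distribution of a sample proportion.
# All three values below are assumptions chosen for illustration.
set.seed(1)
true_prop <- 0.54   # assumed true proportion of supporters
n <- 100            # assumed size of each exit poll
n_samples <- 10000  # number of repeated random samples

# Each draw from rbinom() is the number of supporters in one sample of size n;
# dividing by n turns the count into a sample proportion.
p_hats <- rbinom(n_samples, size = n, prob = true_prop) / n

# The sample proportions vary from sample to sample, but center on true_prop
mean(p_hats)
quantile(p_hats, c(0.025, 0.5, 0.975))
```

Calling hist(p_hats) on the result draws the simulated sampling distribution.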


Example: During the last class, we looked at a simulated sampling distribution for whether or not we could correctly identify dog food from a plate of 5 foods. The histogram is displayed below.

[Figure: histogram of the simulated sampling distribution from the dog food example]

Even though we didn't treat it this way, we could look at this data in terms of a proportion - what proportion of subjects can correctly identify the dog food?

[Figure: the simulated sampling distribution shown as the proportion of subjects who correctly identify the dog food]

This is an example of a sampling distribution for a proportion. In this situation, we're making the assumption that each subject in our sample has the same probability of correctly identifying the dog food. This is an important assumption!


Sampling Distribution for a Proportion

The sampling distribution for a proportion has the following properties:

\[ \text{mean} = \rho \]

\[ \text{s.e.} = \sqrt{\frac{\rho(1-\rho)}{n}} \]

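Under certain conditions, this sampling distribution is approximately normal. If the sample size, n, is large enough that the expected numbers of successes and failures,

\[ n\rho \quad \text{and} \quad n(1-\rho), \]

are both large (a common rule of thumb is at least 10 each), then we can approximate the probabilities for this sampling distribution with the normal distribution.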

As in the dog food example, we assume that every subject in the population has the same probability of being in the category we're interested in. We call being in that category a success.
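These properties can be collected in a small helper. The function name prop_sampling_dist is hypothetical, and the cutoff of 10 expected successes and failures is an assumed rule of thumb; your course may use a different threshold. The inputs in the usage line are also made up.

```r
# Hypothetical helper: summarize the sampling distribution of a sample proportion.
# The default threshold of 10 is an assumed rule of thumb, not a fixed standard.
prop_sampling_dist <- function(rho, n, threshold = 10) {
  list(
    mean = rho,                               # center of the sampling distribution
    se = sqrt(rho * (1 - rho) / n),           # standard error
    expected_successes = n * rho,
    expected_failures = n * (1 - rho),
    # TRUE when both expected counts reach the threshold
    normal_ok = n * rho >= threshold && n * (1 - rho) >= threshold
  )
}

# Made-up inputs: population proportion 0.3, sample size 200
prop_sampling_dist(rho = 0.3, n = 200)
```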


Example: Suppose that 54% of the registered students at UNL would support you running for ASUN president next year. A Daily Nebraskan survey before the ASUN election asks a random sample of 1,336 students who they are planning to vote for in the election. What's the probability that the Daily Nebraskan survey reports that you will win (the sample proportion \( \hat{p} \ge 0.50 \))?

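# xpnorm() comes from the mosaic package
library(mosaic)
# Check the sample size condition: expected successes and failures
0.54 * 1336
## [1] 721.4
(1 - 0.54) * 1336
## [1] 614.6
# Find the standard error (about 0.0136)
se <- sqrt(0.54 * 0.46/1336)
# Estimate the probability that the sample proportion is greater than 0.50,
# using the standard error rounded to 0.0136
xpnorm(0.5, mean = 0.54, sd = 0.0136)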
## 
## If X ~ N(0.54,0.0136), then 
## 
##  P(X <= 0.5) = P(Z <= -2.941) = 0.0016
##  P(X >  0.5) = P(Z >  -2.941) = 0.9984

[Figure: normal density plot produced by xpnorm]

## [1] 0.001635

Note: The exact sampling distribution for the number of successes in the sample, and therefore for the sample proportion, is the binomial distribution. When our sample size is large enough, the normal distribution is a good approximation.

[Figure: binomial sampling distribution of the sample proportion]

This is very close to the normal distribution!
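One way to check how close the two are is to compute the same probability both ways for the ASUN example. This sketch uses only base R (pnorm and pbinom) rather than xpnorm, and relies on the fact that a sample proportion of at least 0.50 means at least 668 of the 1,336 students.

```r
n <- 1336
rho <- 0.54
se <- sqrt(rho * (1 - rho) / n)

# Normal approximation: P(p-hat >= 0.50)
normal_prob <- pnorm(0.50, mean = rho, sd = se, lower.tail = FALSE)

# Exact binomial: P(X >= 668) is the same as P(X > 667)
exact_prob <- pbinom(667, size = n, prob = rho, lower.tail = FALSE)

# Print the two probabilities side by side
c(normal = normal_prob, exact = exact_prob)
```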


Example: A sample of 150 students is asked if they prefer Coke or Pepsi. Suppose that the population proportion of people who prefer Coke is 0.70. Use the normal approximation to find the probability that more than 2/3 of students in the sample prefer Coke.
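One possible setup for this example in base R is sketched below. It uses pnorm in place of xpnorm, and the variable names are our own choices.

```r
rho <- 0.70   # population proportion who prefer Coke
n <- 150      # sample size

# Expected successes and failures, to judge whether the normal approximation is reasonable
c(successes = n * rho, failures = n * (1 - rho))

# Standard error of the sample proportion
se <- sqrt(rho * (1 - rho) / n)

# P(p-hat > 2/3): area to the right of 2/3 under the approximating normal curve
prob_more <- pnorm(2/3, mean = rho, sd = se, lower.tail = FALSE)
prob_more
```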


Example: Suppose that 45% of residents in a city support construction of a new Wal-Mart Super Center. A sample of 400 adults was taken. Use the normal approximation to find the probability that less than 50% of the sample supports the Wal-Mart.
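Here the question asks for a lower-tail probability, so a sketch in base R can use the default behavior of pnorm, which returns the area to the left. The variable names are our own.

```r
rho <- 0.45   # population proportion who support the Wal-Mart
n <- 400      # sample size

# Standard error of the sample proportion
se <- sqrt(rho * (1 - rho) / n)

# P(p-hat < 0.50): area to the left of 0.50 under the approximating normal curve
prob_less <- pnorm(0.50, mean = rho, sd = se)
prob_less
```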


Question: Is it realistic to assume we know the population proportion, \( \rho \)?

Normally, we don't know the population proportion! In fact, it's the opposite: usually we have sample data and want to use it to draw conclusions about the population proportion. We can do that with a hypothesis test. Hypothesis tests have an elaborate vocabulary, but the basic idea is very simple.

The Binomial Test

The binomial test is what we'll use to make statements about a population proportion. Every hypothesis test will have the same steps.

Step | Process
--- | ---
1 | Define the hypotheses we want to test.
2 | Identify the sample size, \( \rho_0 \), and \( \hat{p} \).
3 | Use R to find the p-value.
4 | Interpret the test and write a short conclusion.

Step 1: Define the hypotheses we want to test

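The null hypothesis, \( H_0 \), and the alternative hypothesis, \( H_A \), are always statements about the unknown population parameter, \( \rho \). The null hypothesis states that \( \rho \) equals a claimed value, \( \rho_0 \). The alternative hypothesis states what we want to show: that \( \rho \) is less than, greater than, or not equal to \( \rho_0 \).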


Example: Write down the null and alternative hypotheses for each scenario. Define \( \rho \) for each one.


Step 2: Identify the sample size, \( \rho_0 \), and \( \hat{p} \)

These are the three pieces of information, along with the null and alternative hypotheses, that we need to do the binomial test in R.

Example: Should teenagers aged 14-16 have access to birth control methods, even if their parents disapprove? One of your friends claims that less than half of American adults would agree to it, but you think the majority would agree.

To settle the question, you look this up in the General Social Survey, and find that 471 out of 880 participants in the 2004 survey said that they agreed or strongly agreed that teenagers should have these methods available to them. Is this evidence convincing enough to show that more than half of American adults agree with that statement?
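For this example, the three pieces can be written down in R as follows. The variable names are our own, and taking \( \rho_0 = 0.5 \) reflects the "half of American adults" cutoff in the question.

```r
x <- 471        # participants who agreed or strongly agreed
n <- 880        # sample size
rho_0 <- 0.5    # claimed population proportion under the null hypothesis

# Sample proportion
p_hat <- x / n
p_hat
```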


Step 3: Use R to find the p-value

In the last section, we defined the p-value.

p-value: the probability of observing a sample statistic at least as extreme as the one we observed, if the null hypothesis is true

In R, we'll use the binom.test function to carry out this hypothesis test.

Basic syntax:

binom.test(x = , n = , p = , alternative = )

We have four inputs for this function:

Input | What We Need
--- | ---
x= | Number of successes, \( \hat{p} \times n \). If we have the number of successes at the start, we can skip finding the sample proportion!
n= | Sample size
p= | \( \rho_0 \), the claimed population proportion
alternative= | One of three: 'two.sided', 'less', 'greater'
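When a study reports only the sample proportion, the number of successes can be recovered before calling binom.test. The numbers below (a sample proportion of 0.62 from 250 people, tested against 0.5) are made up purely to illustrate the inputs.

```r
# Made-up values for illustration only
p_hat <- 0.62
n <- 250

# Recover the number of successes; round() guards against floating-point error
x <- round(p_hat * n)

# alternative = "two.sided" tests whether the population proportion differs from 0.5
result <- binom.test(x = x, n = n, p = 0.5, alternative = "two.sided")
result$p.value
```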

Let's try it!

Example: Return to the birth control question from Step 2. In the 2004 General Social Survey, 471 out of 880 participants agreed or strongly agreed that teenagers aged 14-16 should have access to birth control methods. Is this evidence convincing enough to show that more than half of American adults agree with that statement?

binom.test(x = 471, n = 880, p = 0.5, alternative = "greater")
## 
##  Exact binomial test
## 
## data:  x and n
## number of successes = 471, number of trials = 880, p-value =
## 0.01985
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.5069 1.0000
## sample estimates:
## probability of success 
##                 0.5352

In the R output, we're looking for p-value = 0.01985. This tells us that if the null hypothesis is true, and there really is a 50% chance that a randomly selected adult thinks teens should have access to birth control, then the probability of observing a sample in which at least 471 of the 880 adults agree is only 0.01985.
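To connect this p-value back to the binomial distribution, we can compute the same upper-tail probability directly with pbinom. This is a sketch for the "greater" alternative only; the other alternatives use different tails.

```r
# P(X >= 471) when X is binomial with n = 880 and probability of success 0.5.
# pbinom(470, ..., lower.tail = FALSE) gives P(X > 470), which is P(X >= 471).
p_direct <- pbinom(470, size = 880, prob = 0.5, lower.tail = FALSE)

# The p-value reported by binom.test for the same hypotheses
p_test <- binom.test(x = 471, n = 880, p = 0.5, alternative = "greater")$p.value

# Print both values for comparison
c(direct = p_direct, binom_test = p_test)
```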


Step 4: Interpret the test and write a short conclusion

The p-value measures the strength of the evidence against the null hypothesis. The smaller the p-value, the more evidence against the null and in support of the alternative hypothesis. How small should the p-value be to “prove” \( H_A \)? The smaller the better!

We decide if the p-value is “small enough” by comparing it to a pre-specified significance level, called the \( \alpha \)-level. The standard \( \alpha \)-level is 0.05. We will see later where \( \alpha \) comes from.

Based on the p-value, we either reject or fail to reject the null hypothesis.

If we reject the null hypothesis, then we conclude that the data support the alternative hypothesis.

However, if we fail to reject the null hypothesis, we are not automatically assuming that it's true. We can only say that we don't have enough evidence to conclude that it's false.

Statistical significance: if we reject the null hypothesis, we say that our results are statistically significant.
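The decision rule can be sketched as a short function. The name decide is hypothetical, and the default \( \alpha \)-level of 0.05 follows the standard mentioned above.

```r
# Hypothetical helper: compare a p-value to the alpha-level
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) {
    "reject H0"            # results are statistically significant
  } else {
    "fail to reject H0"    # not enough evidence against the null
  }
}

# p-value from the birth control example
decide(0.01985)
```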

There are three things that you should always include when writing your conclusions.

  1. Whether you reject or fail to reject \( H_0 \).
  2. The p-value from R, and whether it's less than or greater than 0.05.
  3. An interpretation of the results based on the problem.

Example: So how would we interpret the birth control study?

Based on our sample data, we reject the null hypothesis (p-value=0.0199 < 0.05). We have enough evidence to conclude that more than 50% of American adults believe that teenagers aged 14-16 should have access to birth control methods, even if their parents disapprove.


Example: The Nebraska Forest Service is concerned about a small population of emerald ash borers in eastern Nebraska. A random sample of 1,500 ash trees was tested for traces of the emerald ash borer, and 69 of the trees showed such traces. If evidence is found that more than 4% of trees have been infected, the Forest Service will take measures to prevent further spread of the borers. Test the hypothesis that more than 4% of the trees have been infected.
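A possible setup for Step 3 of this example is sketched below. It takes \( \rho_0 = 0.04 \) and uses the "greater" alternative, because the Forest Service is looking for evidence that more than 4% of trees are infected.

```r
# 69 infected trees out of 1,500 sampled; H0: rho = 0.04, HA: rho > 0.04
ash_test <- binom.test(x = 69, n = 1500, p = 0.04, alternative = "greater")

ash_test$estimate   # sample proportion
ash_test$p.value    # compare this to the alpha-level of 0.05
```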


Example: The CEO of Lincoln Electric Systems recently claimed that 80% of customers are very happy with the service that they receive. To test this claim, the Journal Star surveyed 100 customers using simple random sampling. Among the sampled customers, 73 said they were very satisfied. Based on this survey, do you agree or disagree with the CEO's claim?
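One way to set up this test is sketched below. Choosing the "two.sided" alternative is our assumption: the CEO's claim is that the proportion equals 80%, so we look for evidence that it differs in either direction. A case could also be made for "less".

```r
# 73 very satisfied customers out of 100 surveyed; H0: rho = 0.80, HA: rho != 0.80
ceo_test <- binom.test(x = 73, n = 100, p = 0.80, alternative = "two.sided")

ceo_test$estimate   # sample proportion
ceo_test$p.value    # compare this to the alpha-level of 0.05
```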