Part 6: Making Statistical Decisions about a Population Proportion

Probability distributions are useful for telling us how likely we are to see a certain value in a population.

For any population, we can take a random sample and calculate a statistic to help make decisions about a population parameter. In a population, we're more likely to observe certain values than others. The same is true for sample statistics.

Sampling Distributions

Suppose that we want to take an exit poll of voters in Lincoln.

Each vote can be considered a random variable, because the outcome varies from voter to voter. A statistic, like the percent of voters in your sample who support a certain candidate, can also be thought of as a random variable. That sample percent of voters will change from sample to sample!

Sampling Distribution:

The idea behind a sampling distribution is to see the pattern that emerges when we take __________________________ and compute a statistic from each of them.
A sampling distribution always refers to the probability distribution of a statistic.

Every statistic has a sampling distribution. Some sampling distributions can be modeled using known probability distributions, such as the normal distribution. Others can be found by analyzing the data. No matter the underlying pattern, it emerges with repeated random sampling.
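To see that pattern emerge, here is a minimal simulation sketch. The population proportion (0.30), the sample size (50), and the number of repeated samples (5000) are made-up values chosen only for illustration; they don't come from any example in these notes.

```r
# Sketch: approximate a sampling distribution by repeated random sampling.
# rho, n, and reps are hypothetical values chosen only for illustration.
set.seed(42)
rho  <- 0.30   # assumed population proportion
n    <- 50     # assumed sample size
reps <- 5000   # number of repeated samples

# For each sample: draw n yes/no outcomes (1 = yes) and record the sample proportion
p_hats <- replicate(reps, mean(sample(c(1, 0), size = n, replace = TRUE,
                                      prob = c(rho, 1 - rho))))

# The histogram of the sample proportions approximates the sampling distribution
hist(p_hats, main = "Simulated sampling distribution", xlab = "Sample proportion")
```

Each bar in the histogram counts how many of the simulated samples produced a sample proportion in that range.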

Example: During the last class, we looked at a simulated sampling distribution for whether or not we could correctly identify dog food from a plate of 5 foods. The histogram is displayed below.

[Figure: histogram of the simulated sampling distribution from the dog food example]

How would you describe the shape of this histogram?

Even though we didn't treat it this way, we could look at this data in terms of a proportion: what proportion of subjects can correctly identify the dog food?

[Figure: histogram of the simulated sampling distribution of the sample proportion]

This is an example of a sampling distribution for a proportion. In this situation, we're making the assumption that each subject in our sample has the same probability of correctly identifying the dog food. This is an important assumption!

Sampling Distribution for a Proportion

The sampling distribution for a proportion has the following properties:

\[ mean=\rho \]

On average, the sample proportion \( \hat{p} \) equals the population proportion \( \rho \). Individual sample proportions vary from sample to sample, but they are centered at \( \rho \).

\[ s.e.=\sqrt{\frac{\rho*(1-\rho)}{n}} \]

In sampling distributions, we call the standard deviation the standard error (s.e.). This reflects the fact that we're describing how a statistic varies from sample to sample!
The standard error is roughly how far we'd expect a typical sample proportion to be from the true proportion.
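One way to convince ourselves of these two properties is to compare them with a simulation. This is only a sketch: the values rho = 0.4 and n = 200 are hypothetical, and we use rbinom() to draw the number of successes in each simulated sample.

```r
# Sketch: compare simulated sample proportions with the mean and s.e. formulas.
# rho and n are hypothetical values chosen only for illustration.
set.seed(1)
rho <- 0.4    # assumed population proportion
n   <- 200    # assumed sample size

# 10,000 simulated samples: number of successes divided by n
p_hats <- rbinom(10000, size = n, prob = rho) / n

# Standard error from the formula sqrt(rho * (1 - rho) / n)
se_formula <- sqrt(rho * (1 - rho) / n)

# The simulated mean should be near rho, and the simulated sd near se_formula
c(mean_sim = mean(p_hats), sd_sim = sd(p_hats), se_formula = se_formula)
```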

Under certain conditions, this sampling distribution follows an approximate normal distribution. If the sample size, n, is large enough such that:

\( n*\rho\ge15 \)
\( n*(1-\rho)\ge15 \)

then we can approximate the probabilities for this sampling distribution with the normal distribution.
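The two conditions are easy to check in R. The helper below is a sketch; normal_ok is a made-up name for illustration, not a built-in function.

```r
# Sketch: check whether the normal approximation is reasonable.
# normal_ok is a hypothetical helper name, not part of base R.
normal_ok <- function(n, rho) {
  # Both the expected number of successes and of failures must be at least 15
  (n * rho >= 15) && (n * (1 - rho) >= 15)
}

normal_ok(100, 0.50)  # TRUE: 50 expected successes and 50 expected failures
normal_ok(100, 0.05)  # FALSE: only 5 expected successes
```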

As in the dog food example, we assume that every subject in the population has the same probability of being in the category we're interested in. We call being in that category a success.

Example: Suppose that 54% of the registered students at UNL would support you running for ASUN president next year. A Daily Nebraskan survey before the ASUN election asks a random sample of 1,336 students who they are planning to vote for in the election. What's the probability that the Daily Nebraskan survey reports that you will win (the sample proportion \( \hat{p}\ge \) 0.50)?

Identify the sample size.
What's a “success”? What's the probability of a success (population proportion, \( \rho \))?
Is our sample large enough to use a normal distribution to approximate? How do we know?

0.54 * 1336

## [1] 721.4

(1 - 0.54) * 1336

## [1] 614.6

What are the mean and standard error of this sampling distribution?
What is the approximate probability that the Daily Nebraskan sample will have \( \hat{p}\ge \) 0.50?

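# xpnorm() comes from the mosaic package
library(mosaic)
# Find the standard error (about 0.0136)
se <- sqrt(0.54 * 0.46/1336)
# Estimate the probability that the sample proportion is at least 0.50,
# using the standard error rounded to 0.0136
xpnorm(0.5, mean = 0.54, sd = 0.0136)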

## 
## If X ~ N(0.54,0.0136), then 
## 
##  P(X <= 0.5) = P(Z <= -2.941) = 0.0016
##  P(X >  0.5) = P(Z >  -2.941) = 0.9984

[Figure: xpnorm plot of the N(0.54, 0.0136) curve, split at 0.5]

## [1] 0.001635
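If the extra plot and printed output aren't needed, the same probability can be found with base R's pnorm(). The sketch below uses the unrounded standard error, so the answer may differ slightly in the later decimal places from the output above; it should come out to about 0.998.

```r
# Sketch: the same calculation with base R, using the unrounded standard error
se <- sqrt(0.54 * (1 - 0.54) / 1336)

# lower.tail = FALSE gives the area above 0.50, i.e. P(p-hat >= 0.50)
p_at_least_half <- pnorm(0.50, mean = 0.54, sd = se, lower.tail = FALSE)
p_at_least_half
```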

Note: The exact sampling distribution for the number of successes in a sample (and so for the sample proportion) is the binomial distribution. When our sample size is large enough, the normal distribution is a good approximation.

[Figure: binomial distribution for this example]

This is very close to the normal distribution!
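We can also compare the two answers numerically. This sketch uses the ASUN example (n = 1336, \( \rho \) = 0.54): a sample proportion of at least 0.50 means at least 668 of the 1,336 students are successes, so the exact probability comes from pbinom(). The variable names are chosen just for this illustration.

```r
# Sketch: exact binomial probability vs. the normal approximation
n   <- 1336
rho <- 0.54

# Smallest number of successes that gives a sample proportion of at least 0.50
x_min <- ceiling(0.50 * n)   # 668

# Exact: P(X >= 668) = P(X > 667) for X ~ Binomial(1336, 0.54)
exact <- pbinom(x_min - 1, size = n, prob = rho, lower.tail = FALSE)

# Normal approximation: P(p-hat >= 0.50) for p-hat ~ N(rho, s.e.)
normal_approx <- pnorm(0.50, mean = rho, sd = sqrt(rho * (1 - rho) / n),
                       lower.tail = FALSE)

c(exact = exact, normal_approx = normal_approx)
```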

Example: A sample of 150 students is asked if they prefer Coke or Pepsi. Suppose that the population proportion of people who prefer Coke is 0.70. Use the normal approximation to find the probability that more than 2/3 of students in the sample prefer Coke.

Example: Suppose that 45% of residents in a city support construction of a new Wal-Mart Super Center. A sample of 400 adults was taken. Use the normal approximation to find the probability that less than 50% of the sample support the Wal-Mart.

Question: Is it realistic to assume we know the population proportion, \( \rho \)?

Normally, we don't know the population proportion! In fact it's the opposite. Usually we have sample data, and want to use that to make decisions about the population proportion. We can do that with a hypothesis test. Hypothesis tests have an elaborate vocabulary, but the basic idea is very simple.

The Binomial Test

The binomial test is what we'll use to make statements about a population proportion. Every hypothesis test will have the same steps.

Step	Process
1	Define the hypotheses we want to test.
2	Identify the sample size, \( \rho_0 \), and \( \hat{p} \).
3	Use R to find the p-value.
4	Interpret the test and write a short conclusion.

Step 1: Define the hypotheses we want to test

The null hypothesis and the alternative hypotheses are always statements about the unknown population parameter, \( \rho \). They are related to what we want to prove.

The null hypothesis states that \( \rho \) takes on a particular value. We write: \[ H_0:\rho=\rho_0 \]
- \( \rho_0 \) is our claimed population proportion
The alternative hypothesis is what we want to show. There are two types of alternative hypotheses.
- One-sided – we want to see if \( \rho \) is greater than or less than the claimed value \[ H_A:\rho>\rho_0 \] \[ H_A:\rho<\rho_0 \]
- Two-sided – we just want to know if \( \rho \) is different from the claimed value \[ H_A:\rho\ne\rho_0 \]

Example: Write down the null and alternative hypotheses for each scenario. Define \( \rho \) for each one.

In a particular town, the proportion of accidents per year that involve people talking on a cell phone is 0.40. A new ad campaign was conducted by the Highways Department to encourage people to pull over when they talk on a cell phone. Does the proportion of accidents with cell phones decrease?
A student senator needs more than 2/3 of the votes in order to pass a ruling that increases student activity fee spending on non-fraternity or sorority events. He takes a small sample to see if there is support for his bill. Is there evidence to show that more than 2/3 of student senators may vote for the bill?
The proportion of male college students that binge drink is close to 0.5. Is it different for women?

Step 2: Identify the sample size, \( \rho_0 \), and \( \hat{p} \)

These are the three pieces of information, along with the null and alternative hypotheses, that we need to do the binomial test in R.

Example: Should teenagers aged 14-16 have access to birth control methods, even if their parents disapprove? One of your friends claims that less than half of American adults would agree to it, but you think the majority would agree.

To settle the question, you look this up in the General Social Survey, and find that 471 out of 880 participants in the 2004 survey said that they agreed or strongly agreed that teenagers should have these methods available to them. Is this evidence convincing enough to show that more than half of American adults agree with that statement?

Write the null and alternative hypotheses.
What's the sample size, n?
What's the claimed proportion, \( \rho_0 \)?

Step 3: Use R to find the p-value

In the last section, we defined the p-value.

p-value: the probability of observing a sample statistic as or more extreme than what we have observed, if the null hypothesis is true

Here, “extreme” means far from \( \rho_0 \) in the direction specified in the alternative hypothesis.
- If our alternative is “greater than”, we look for large sample proportions.
- If our alternative is “less than”, we look for small sample proportions.
- What should we look for if our alternative is “not equal to”?

For the birth control example: what's the sample proportion, \( \hat{p} \)?

In R, we'll use the binom.test function to carry out this hypothesis test.

Basic syntax:

binom.test(x = , n = , p = , alternative = )

We have four inputs for this function:

Input	What We Need
x=	Number of successes, \( \hat{p}*n \)
	If we have the number of successes at the start, we can skip finding the sample proportion!
n=	Sample size
p=	\( \rho_0 \), the claimed population proportion
alternative=	One of three: 'two.sided', 'less', 'greater'

Let's try it!

binom.test(x = 471, n = 880, p = 0.5, alternative = "greater")

## 
##  Exact binomial test
## 
## data:  x and n
## number of successes = 471, number of trials = 880, p-value =
## 0.01985
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.5069 1.0000
## sample estimates:
## probability of success 
##                 0.5352

In the R output, we're looking for p-value = 0.01985. This tells us that if the null hypothesis is true, and there really is a 50% chance that a randomly selected adult thinks teens should have access to birth control, then the probability of observing a sample in which at least 471 of 880 adults agree is only 0.01985.
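To connect this p-value back to its definition, the sketch below stores the test result, pulls out the p-value, and recomputes it directly from the binomial distribution. The names result and by_hand are chosen just for this illustration.

```r
# Sketch: where the p-value comes from
result <- binom.test(x = 471, n = 880, p = 0.5, alternative = "greater")

# The p-value is stored in the result object
result$p.value

# The same number by hand: P(X >= 471) = P(X > 470) when X ~ Binomial(880, 0.5)
by_hand <- pbinom(470, size = 880, prob = 0.5, lower.tail = FALSE)
by_hand
```

Both lines should print the same value, matching the p-value in the binom.test output.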

Step 4: Interpret the test and write a short conclusion

The p-value measures the strength of the evidence against the null hypothesis. The smaller the p-value, the stronger the evidence against the null and in support of the alternative hypothesis. How small should the p-value be to “prove” \( H_A \)? The smaller the better!

We decide if the p-value is “small enough” by comparing it to a pre-specified significance level, called the \( \alpha \)-level. The standard \( \alpha \)-level is 0.05. We will see later where \( \alpha \) comes from.

Based on the p-value, we either reject or fail to reject the null hypothesis.

If we reject the null hypothesis, then we conclude that the data support the alternative hypothesis.

However if we fail to reject the null hypothesis, we are not automatically assuming that it's true. We can only say that we don't have enough evidence to conclude that it's false.
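The decision rule can be written as a single comparison. This is a small sketch using the standard \( \alpha \)-level of 0.05 and the p-value from the birth control example; the variable names are chosen just for illustration.

```r
# Sketch: compare the p-value to the significance level
alpha   <- 0.05      # standard significance level
p_value <- 0.01985   # p-value from the birth control example

# Reject H0 only when the p-value is smaller than alpha
decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"
decision
```

Here the p-value is below 0.05, so the decision is to reject the null hypothesis.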

Statistical significance: if we reject the null hypothesis, we say that our results are statistically significant.

In other words, a result this extreme would be unlikely to occur by chance alone if the null hypothesis were true!

There are three things that you should always include when writing your conclusions.

Whether you reject or fail to reject \( H_0 \).
The p-value from R, and whether it's less than or greater than 0.05.
An interpretation of the results based on the problem.

Example: So how would we interpret the birth control study?

Based on our sample data, we reject the null hypothesis (p-value=0.0199 < 0.05). We have enough evidence to conclude that more than 50% of American adults believe that teenagers aged 14-16 should have access to birth control methods, even if their parents disapprove.

Example: The Nebraska Forest Service is concerned about a small population of emerald ash borers in eastern Nebraska. 1500 randomly selected ash trees were tested for traces of the emerald ash borer. 69 of the trees showed such traces. If evidence is found that more than 4% of trees have been infected, the Forest Service will take measures to prevent further spread of the borers. Test the hypothesis that more than 4% of the trees have been infected.

What are the null and alternative hypotheses?
What are the pieces of information we already know?
Use R to find the p-value. Based on that p-value, do you reject or fail to reject the null hypothesis?
Write a short conclusion. Do you recommend that the Forest Service take action? Why or why not?

Example: The CEO of Lincoln Electric Systems recently claimed that 80% of customers are very happy with the service that they receive. To test this claim, the Journal Star surveyed 100 customers using simple random sampling. Among the sampled customers, 73 said they were very satisfied. Based on this survey, do you agree or disagree with the CEO's claim?