Analyzing proportions

M. Drew LaMar
October 6, 2021


https://xkcd.com/539/

From means to proportions

So far, the population parameter of interest was a mean/average.

For means and averages, the random variable of interest was numerical.

In this chapter, the population parameter of interest is a proportion.

For proportions, the random variable of interest is categorical.

More specifically, we are interested in the proportion of times in the population that a particular level of a categorical variable occurs.

Example

Practice Problem #5 (estimation)

Population: All $1 US bills

Variable: Measurable cocaine? (Levels: Yes/No)

Parameter: Proportion of bills with measurable cocaine.

Sample: 50 $1 bills [BTW, in actual study, 46 had measurable cocaine]

Example

Practice Problem #4 (hypothesis testing)

Population: All humans who could have been downwind of the site of 11 previous aboveground nuclear bomb tests in 1955.

Variable: Developed cancer by 1980s? (Levels: Yes/No)

Parameter: Probability of developing cancer by 1980s.

Sample: 220 actors in film The Conqueror (including John Wayne) who were downwind of … [91 developed cancer by 1980s]

Note: 14% of people in this age group would be expected to develop cancer within this time frame.

Preliminaries - What variable?

Suppose I have a population of size \( N \) and a categorical variable \( Y \) (e.g. \( Y \) could be genotype with levels {aa, Aa, AA}).

For proportions, we really need a categorical variable with only two levels, which would be

  • “Success = has level of interest”
  • “Failure = does not have level of interest”.

For example, with the genotype case, we could be interested in the proportion of heterozygotes in a population, so we would have

  • “Success = Aa”
  • “Failure = not Aa = {aa, AA}”
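
In R, collapsing a multi-level variable into Success/Failure comes down to a logical comparison. Here is a minimal sketch, assuming a small made-up vector of genotypes (the data and the names `genotype`, `isHet`, `outcome` and `propHet` are hypothetical, for illustration only):

```r
# Hypothetical genotypes for 10 individuals (made-up data)
genotype <- c("aa", "Aa", "AA", "Aa", "aa", "Aa", "AA", "Aa", "aa", "AA")

# Success = heterozygote (Aa); Failure = anything else (aa or AA)
isHet <- genotype == "Aa"

# Recode as a two-level factor
outcome <- factor(ifelse(isHet, "Success", "Failure"),
                  levels = c("Success", "Failure"))
table(outcome)

# Proportion of successes in this made-up sample
propHet <- mean(isHet)
propHet
```

Taking the mean of a logical vector works because R treats TRUE as 1 and FALSE as 0.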

Preliminaries - What parameter?

Given the categorical variable with levels “Success” and “Failure”, the proportion of successes in the population would be denoted by

\[ p = \frac{\mathrm{Number \ of \ successes \ in \ population}}{\mathrm{Total \ population \ size}} = \frac{X}{N} \]

If we have a sample of size \( n \), then we would have a sample estimate of this proportion \( p \) given by

\[ \hat{p} = \frac{\mathrm{Number \ of \ successes \ in \ sample}}{\mathrm{Total \ sample \ size}} = \frac{\hat{X}}{n} \]

Estimation

What is the standard error for a proportion (i.e. what is the measure of precision for the sample proportion)?

Definition: The standard error of a proportion is the standard deviation of the sampling distribution for a proportion and is given by \[ \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \]

Definition: An estimate of the standard error for the proportion is given by \[ \mathrm{SE}_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
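
As a quick illustration of the estimated standard error, here is a sketch using the counts quoted earlier for the $1 bill example (46 of 50 bills with measurable cocaine). The variable names are made up for this example.

```r
# Counts from the $1 bill example: 46 of 50 bills had measurable cocaine
x_bills <- 46
n_bills <- 50

# Sample estimate of the proportion
phat_bills <- x_bills / n_bills

# Estimated standard error: sqrt(phat * (1 - phat) / n)
se_bills <- sqrt(phat_bills * (1 - phat_bills) / n_bills)

phat_bills
se_bills
```

The estimate is 0.92, with an estimated standard error of roughly 0.038.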

Estimation - Practice Problem #3

In a study in Scotland (as reported by Devlin 2009), researchers left a total of 240 wallets around Edinburgh, as though the wallets were lost. Each contained contact information including an address. Of the wallets, 101 were returned by the people who found them.

Discuss: What might the population of interest be in this study?

Answer: Possibly the Edinburgh population only (otherwise, possible bias). Could also be all wallets dropped in Edinburgh, past or future.

Estimation - Practice Problem #3

Discuss: What might be a possible weakness with this study, if they were interested in inferring about “honesty” of the population?

Answer: Possible that one person could have found multiple wallets, which is an independence issue.

Estimation - Practice Problem #3

Discuss: What is the categorical variable of interest (include levels)?

Answer: Wallet fate (Levels: returned/not returned)

Estimation - Practice Problem #3

Calculate: Estimate the proportion of returned wallets.

Answer: \( \ \hat{p} = 101/240 \approx 0.42 \)

Calculate: Compute \( \mathrm{SE}_{\hat{p}} \).

Answer: \( \ \mathrm{SE}_{\hat{p}} = \sqrt{\frac{0.42\times(1-0.42)}{240}} \approx 0.032 \)

Is that good?

95% confidence interval for proportion

Two main methods.

Method #1: In the Wald method, the 95% confidence interval is given by

\[ \hat{p} - 1.96\ \mathrm{SE}_{\hat{p}} < p < \hat{p} + 1.96\ \mathrm{SE}_{\hat{p}} \]

Caution: The Wald method is only accurate when (1) \( n \) is large and (2) population parameter \( p \) is not close to 0 or 1. If these conditions are not met, then the Wald confidence interval will bracket the true population parameter less than 95% of the time.
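
To see how far below 95% the Wald interval can fall, one option is to compute its coverage exactly: add up the binomial probabilities (via `dbinom`, covered later in these slides) of every outcome whose interval contains the true \( p \). The helper `wald_coverage` below is a hypothetical function written for this sketch, not part of any package, and the values of \( n \) and \( p \) are arbitrary choices.

```r
# Exact coverage of the 95% Wald interval for a given n and p:
# sum the binomial probabilities of every outcome X whose
# interval contains the true p.
wald_coverage <- function(n, p) {
  x <- 0:n
  phat <- x / n
  se <- sqrt(phat * (1 - phat) / n)
  covers <- (phat - 1.96 * se < p) & (p < phat + 1.96 * se)
  sum(dbinom(x, size = n, prob = p)[covers])
}

cov_small <- wald_coverage(n = 20, p = 0.05)    # small n, p near 0
cov_large <- wald_coverage(n = 1000, p = 0.5)   # large n, p far from 0 and 1
c(small = cov_small, large = cov_large)
```

With \( n = 20 \) and \( p = 0.05 \), the coverage is well below 95%, largely because a sample with zero successes gives a zero-width interval. With \( n = 1000 \) and \( p = 0.5 \), it is close to the nominal 95%.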

95% confidence interval for proportion

Due to this, you should use the Agresti-Coull method.

Method #2: In the Agresti-Coull method, the 95% confidence interval is given by

\[ p^{\prime} - 1.96\sqrt{\frac{p^{\prime}(1-p^{\prime})}{n+4}} < p < p^{\prime} + 1.96\sqrt{\frac{p^{\prime}(1-p^{\prime})}{n+4}}, \]

where \( \hat{X} \) is the number of successes in the sample and

\[ p^{\prime} = \frac{\hat{X}+2}{n+4}. \]
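
The Agresti-Coull formula is simple enough to apply by hand. Here is a sketch using the lost wallet counts (101 returned out of 240); the variable names are made up for this example.

```r
# Wallet study: 101 of 240 wallets returned
x_w <- 101
n_w <- 240

# Add 2 successes and 2 failures before computing the proportion
p_prime <- (x_w + 2) / (n_w + 4)

# Half-width of the interval, using n + 4 in the denominator
half_width <- 1.96 * sqrt(p_prime * (1 - p_prime) / (n_w + 4))

(ac_CI <- c(lower = p_prime - half_width, upper = p_prime + half_width))
```

This gives roughly 0.360 to 0.484. The `binom` package output shown later differs slightly in the fourth decimal place; my understanding is that its "ac" method adds \( 1.96^2 \approx 3.84 \) pseudo-observations rather than exactly 4, but treat that as an assumption and check the package documentation.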

Estimation - Practice Problem #3

Let's use R. Load in the lost wallet data.

walletData <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter07/chap07q03LostWallets.csv")
str(walletData)
'data.frame':   240 obs. of  1 variable:
 $ return: chr  "returned" "returned" "returned" "returned" ...

Estimation - Practice Problem #3

Only one variable, so let's remove the data frame from the picture:

walletData <- as.factor(walletData$return)
(walletTable <- summary(walletData))
not returned     returned 
         139          101 

Okay, well, I guess we already knew this. ¯\_(ツ)_/¯

Estimation - Practice Problem #3

Let's compute the standard error for the proportion. First, we need the sample size and the estimate.

(n <- sum(walletTable)) # Number of trials
[1] 240
(phat <- walletTable["returned"]/n) # Estimate
 returned 
0.4208333 
(phat <- unname(phat)) # Remove confusing name
[1] 0.4208333

Estimation - Practice Problem #3

Let's compute the standard error for the proportion.

(SE_phat <- sqrt(phat*(1-phat)/n))
[1] 0.03186774

Now 95% confidence interval using Wald method.

lower <- phat - 1.96 * SE_phat
upper <- phat + 1.96 * SE_phat
(wald_CI <- c(lower = lower, upper = upper))
    lower     upper 
0.3583726 0.4832941 

Estimation - Practice Problem #3

Finally, 95% confidence interval using Agresti-Coull (and Wald to compare).

library(binom) # Need binom package
binom.confint(walletTable["returned"], n, method = "ac")
         method   x   n      mean     lower     upper
1 agresti-coull 101 240 0.4208333 0.3600899 0.4840711
wald_CI
    lower     upper 
0.3583726 0.4832941 

What is the sampling distribution for the estimate?

The sampling distribution for the sample estimate of the proportion is a “scaled” binomial distribution.

\[ \hat{p} = \frac{\mathrm{Number \ of \ successes \ in \ sample}}{\mathrm{Total \ sample \ size}} = \frac{\hat{X}}{n} \]

What is the sampling distribution for the estimate?

Definition: The binomial distribution provides the probability distribution for the number of “successes” in a fixed number of independent trials, when the probability of success is the same in each trial.

Properties:

  • Number of trials is fixed (\( n \)) [sample size]
  • Probability of success in each trial (\( p \)) is the same in every trial [proportion of successes in population]
  • Separate trials are independent [random sample]

Binomial distributions

If we have \( n \) trials, and the probability of success in each trial is \( p \), we have

\[ \mathrm{Pr}[X \ \mathrm{successes}] = \binom{n}{X} p^{X}(1-p)^{n-X}, \]

where

\[ \binom{n}{X} = \frac{n!}{X!(n-X)!}, \]

and

\[ n! = n\times(n-1)\times(n-2)\cdots 2\times 1. \]

Why?

Binomial distributions - Math moment

To figure out Pr[\( X \) successes], first ask

Question: “What are all different outcomes of \( X \) successes in \( n \) trials?”

Example: Suppose \( n=3 \) and \( X=2 \).

\[ 2 \ \mathrm{successes} = \{SSF, SFS, FSS\} \]

\[ \mathrm{Pr}[SSF] = \mathrm{Pr}[S]\times \mathrm{Pr}[S]\times \mathrm{Pr}[F] = p^2(1-p) \]

\[ \mathrm{Pr}[SFS] = \mathrm{Pr}[S]\times \mathrm{Pr}[F]\times \mathrm{Pr}[S] = p^2(1-p) \]

\[ \mathrm{Pr}[FSS] = \mathrm{Pr}[F]\times \mathrm{Pr}[S]\times \mathrm{Pr}[S] = p^2(1-p) \]

\[ \mathrm{Pr}[SSF] = \mathrm{Pr}[SFS] = \mathrm{Pr}[FSS] = p^2(1-p) = p^X(1-p)^{n-X} \]

How many ways are there to have 2 successes in 3 trials? 3 choose 2!!

\[ \mathrm{Pr}[2 \ \mathrm{successes}] = \binom{3}{2}p^2(1-p) \]
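
The counting argument can be checked by brute force in R. This sketch lists every sequence of S/F over 3 trials, keeps those with exactly 2 successes, and compares the result with `choose` and `dbinom`. The value `p_demo = 0.3` is an arbitrary choice for illustration.

```r
# All 2^3 = 8 possible sequences of S/F over 3 trials
outcomes <- expand.grid(trial1 = c("S", "F"),
                        trial2 = c("S", "F"),
                        trial3 = c("S", "F"),
                        stringsAsFactors = FALSE)

# Count successes in each sequence and keep those with exactly 2
n_success <- rowSums(outcomes == "S")
two_success <- outcomes[n_success == 2, ]
nrow(two_success)   # number of sequences with 2 successes
choose(3, 2)        # "3 choose 2"

# Each such sequence has probability p^2 * (1 - p)
p_demo <- 0.3       # arbitrary probability of success
by_counting <- nrow(two_success) * p_demo^2 * (1 - p_demo)
by_counting
dbinom(2, size = 3, prob = p_demo)
```

Both the enumeration and `choose(3, 2)` give 3 sequences, and the two probabilities agree.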

Binomial distributions - in R

To get the values of the probability distribution, use the dbinom function. Supposing \( n=10 \) and \( p=0.5 \), we have:

(pdist <- dbinom(x=0:10, size=10, prob=0.5))
 [1] 0.0009765625 0.0097656250 0.0439453125 0.1171875000 0.2050781250
 [6] 0.2460937500 0.2050781250 0.1171875000 0.0439453125 0.0097656250
[11] 0.0009765625
sum(pdist)
[1] 1

The d in dbinom stands for density. For a discrete distribution like the binomial, this is the probability of each value.

Binomial distributions - in R

Question: Given \( p=0.3 \) and \( n=20 \), what is Pr[6 successes]? (Write out using notation)

Answer: \[ \mathrm{Pr}[6 \ \mathrm{successes}] = \binom{20}{6}\,0.3^{6}\times 0.7^{14}. \]

Using R and dbinom,

(ans <- dbinom(x=6, size=20, prob=0.3))
[1] 0.191639

Thus, Pr[6 successes] = 0.191639.
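
As a sanity check, the written-out formula can be evaluated directly with `choose` and compared to `dbinom`. The variable names here are made up for this sketch.

```r
n_trials <- 20
x_succ <- 6
p_succ <- 0.3

# Binomial formula written out: (n choose X) * p^X * (1 - p)^(n - X)
(by_hand <- choose(n_trials, x_succ) * p_succ^x_succ * (1 - p_succ)^(n_trials - x_succ))

# Should agree with dbinom up to floating-point error
all.equal(by_hand, dbinom(x = 6, size = 20, prob = 0.3))
```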

Binomial distributions - in R

Let's plot the distribution:

barplot(pdist, 
        names.arg=0:10, 
        col="firebrick", 
        xlab="X (Number of successes)", 
        ylab="Probability")

[Figure: bar plot of the binomial distribution with n = 10 and p = 0.5; x-axis: X (Number of successes), y-axis: Probability]

Galton Box

Binomial distributions - in R

Let's just look at a lower probability of success, say \( p=0.1 \):

pdist <- dbinom(0:10, 10, 0.1)

barplot(pdist, 
        names.arg=0:10, 
        col="firebrick", 
        xlab="X (Number of successes)", 
        ylab="Probability")

[Figure: bar plot of the binomial distribution with n = 10 and p = 0.1; x-axis: X (Number of successes), y-axis: Probability]

Back to sampling distribution

What is the mean, variance and standard deviation of a binomial random variable \( X \)?

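For a binomial random variable \( X \) with \( n \) trials and success probability \( p \): mean \( \mu_{X} = np \), variance \( \sigma^{2}_{X} = np(1-p) \), standard deviation \( \sigma_{X} = \sqrt{np(1-p)} \).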

Definition: The sampling distribution is the probability distribution of a sample estimate.

\[ \hat{p} = \frac{\mathrm{Number \ of \ successes \ in \ sample}}{\mathrm{Total \ sample \ size}} = \frac{\hat{X}}{n} \]

Back to sampling distribution

\[ \hat{p} = \frac{\mathrm{Number \ of \ successes \ in \ sample}}{\mathrm{Total \ sample \ size}} = \frac{1}{n}\hat{X} \]

  • \( \hat{X} \) is a binomial random variable
  • \( \hat{p} \) is thus a scaled binomial random variable
  • Standard deviation of \( \hat{X} \) is \( \sigma_{\hat{X}} = \sqrt{np(1-p)} \)
  • From rules for scaled random variables, the standard deviation of \( \hat{p} \) is thus

    \[ \sigma_{\hat{p}} = \frac{1}{n}\sqrt{np(1-p)} = \sqrt{\frac{p(1-p)}{n}} \]

This is the standard error for \( \hat{p} \)!!!
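
The scaling rule can be verified numerically by computing the standard deviation straight from the binomial probabilities. This is a sketch with arbitrary values (`n_demo = 10`, `p_demo = 0.5`); all names are made up for the example.

```r
n_demo <- 10    # arbitrary number of trials
p_demo <- 0.5   # arbitrary probability of success

x <- 0:n_demo
pmf <- dbinom(x, size = n_demo, prob = p_demo)

# Mean and SD of X, computed directly from the distribution
mean_X <- sum(x * pmf)
sd_X <- sqrt(sum((x - mean_X)^2 * pmf))

# Mean and SD of p-hat = X / n (same probabilities, rescaled values)
phat_vals <- x / n_demo
mean_phat <- sum(phat_vals * pmf)
sd_phat <- sqrt(sum((phat_vals - mean_phat)^2 * pmf))

c(sd_X = sd_X, formula = sqrt(n_demo * p_demo * (1 - p_demo)))
c(sd_phat = sd_phat, formula = sqrt(p_demo * (1 - p_demo) / n_demo))
```

In both cases the directly computed standard deviation matches the formula, and `sd_phat` is `sd_X` divided by `n_demo`.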

Sampling distribution for a proportion

[Figure: sampling distribution for a proportion]

Binomial test - Definitions

Definition: The binomial test uses data to test whether a population proportion (\( p \)) matches a null expectation (\( p_{0} \)) for the proportion.

Definition: The null hypothesis \( H_{0} \) and alternative hypothesis \( H_{A} \) for a binomial test are given by:

    \( H_{0} \): Relative frequency of successes in population is \( p_{0} \).
    \( H_{A} \): Relative frequency of successes in population is not \( p_{0} \).

Example - Practice Problem #2

Do people typically use a particular ear preferentially when listening to strangers? Marzoli and Tommasi (2009) had a researcher approach and speak to strangers in a noisy nightclub. An observer scored whether the person approached turned either the left or right ear toward the questioner. Of 25 participants, 19 turned the right ear toward the questioner and 6 offered the left ear. Is this evidence that the population proportion differs from 50% for each ear?

Discuss: What are the null and alternative hypotheses?

Answer:
\[ \begin{array}{ll} H_{0}\,: & p = 0.5 \\ H_{A}\,: & p \neq 0.5 \end{array} \]

Example - Practice Problem #2

Discuss: What is the observed value of the test statistic?

Answer: Number of right ears is 19 (\( \hat{X}=19 \)).

Example - Practice Problem #2

Discuss: Under the null hypothesis, calculate the probability of getting exactly 19 right ears and six left ears.

(prob <- dbinom(x = 19, size = 25, prob = 0.5))
[1] 0.005277991

\[ \mathrm{Pr}[19] = \binom{25}{19}\,0.5^{19}\,0.5^{6} \approx 0.005278 \]

Example - Practice Problem #2

Discuss: List all possible outcomes in which the number of right ears is greater than the 19 observed.

Answer: 20, 21, 22, 23, 24, 25

Discuss: Calculate the probability under the null hypothesis of each of the extreme outcomes listed above.

(probs <- dbinom(x = 20:25, size = 25, prob = 0.5))
[1] 1.583397e-03 3.769994e-04 6.854534e-05 8.940697e-06 7.450581e-07
[6] 2.980232e-08

Example - Practice Problem #2

Discuss: Use the addition rule to calculate the probability of 19 or more right-eared turns under the null hypothesis.

(extreme_probs <- dbinom(x = 19:25, size = 25, prob = 0.5))
[1] 5.277991e-03 1.583397e-03 3.769994e-04 6.854534e-05 8.940697e-06
[6] 7.450581e-07 2.980232e-08
sum(extreme_probs)
[1] 0.007316649

Example - Practice Problem #2

Discuss: Give the two-tailed \( P \)-value based on your previous answer.

(pval <- 2*sum(extreme_probs))
[1] 0.0146333

Discuss: State your conclusion.

Answer: Using a significance level of \( \alpha = 0.05 \), we reject \( H_{0} \) since \( P < 0.05 \). There is evidence that more people use the right ear than the left ear when listening to a stranger in the noisy nightclub.
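
The tail probability can also be obtained without summing individual terms, using the cumulative function `pbinom`. This is an alternative sketch, not the approach used in the slides above; the names `upper_tail` and `pval_pbinom` are made up.

```r
# Pr[X >= 19] is the same as Pr[X > 18] under H0: p = 0.5.
# lower.tail = FALSE asks for Pr[X > q] rather than Pr[X <= q].
upper_tail <- pbinom(18, size = 25, prob = 0.5, lower.tail = FALSE)
upper_tail

# Two-tailed P-value by doubling, as before
(pval_pbinom <- 2 * upper_tail)
```

Note the 18 rather than 19: with `lower.tail = FALSE`, `pbinom` returns the probability of strictly more than its first argument.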

One last thing on binomial tests...

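Use binom.test to do a binomial test! It computes the exact two-sided \( P \)-value directly, so there is no need to add up and double tail probabilities by hand (when \( p_{0} \neq 0.5 \) the null distribution is asymmetric, and the doubled one-tail value need not match). If our observed test statistic is \( X = 19 \) successes out of \( n = 25 \) trials, and our null hypothesized proportion is \( p_{0} = 0.5 \), then we have: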

binom.test(19,
           n = 25,
           p = 0.5)

    Exact binomial test

data:  19 and 25
number of successes = 19, number of trials = 25, p-value = 0.01463
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.5487120 0.9064356
sample estimates:
probability of success 
                  0.76
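
The printed output can also be picked apart programmatically, since `binom.test` returns a list. Here is a sketch that stores the result (in a made-up variable `earTest`) and compares its P-value with the doubled tail probability from earlier:

```r
earTest <- binom.test(19, n = 25, p = 0.5)

earTest$p.value    # two-sided P-value
earTest$estimate   # sample proportion, 19/25
earTest$conf.int   # 95% confidence interval

# Compare with doubling the upper tail by hand
doubled <- 2 * sum(dbinom(19:25, size = 25, prob = 0.5))
all.equal(earTest$p.value, doubled)
```

With \( p_{0} = 0.5 \) the null distribution is symmetric, so the two P-values agree here. As far as I know, the confidence interval reported by `binom.test` is the Clopper-Pearson ("exact") interval, which is why it will not match a Wald or Agresti-Coull interval computed from the same data.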