Proportions

Drs. Kilgore & Carriere

Proportions

Categorical Variables - Frequency data

Proportion is the fraction of individuals having a particular attribute
# of successes / # of trials
We can estimate the population proportion by: \[ \hat{p}=\frac{X}{n} \]
where X denotes number of “successes”

Lightning Strike

531 of 648 people struck by lightning in the US from 1995 to 2008 were men.
Men make up 49.2% of the U.S. population

\[ \hat{p} = \frac{X}{n} = \frac{531}{648} = 0.8194 \]

But how confident are we in this?

\[ \hat{p} = \frac{X}{n} = \frac{531}{648} = 0.8194 \]

We can use this to solve for the standard error of our proportion.

\[ SE_\hat{p} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{(0.8194)(1-0.8194)}{648}} = 0.0151 \]

Confidence Intervals

There are many ways to calculate confidence intervals for proportions. Your book recommends using the Agresti-Coull confidence intervals, which is calculated in two steps.

\[ p= \pm(1.96)\sqrt{\frac{p'(1-p')}{n+4}} \] \[ where p' = \frac{X+2}{n+4}=\frac{533}{652}=0.8174 \]

Thus leading us to…

\[ 0.7932 \pm(1.96)\sqrt{\frac{0.8174(1-0.8174)}{648+4}}= [0.7878, 0.8471] \]

CIs - R as a calculator

Using R code to plug all of this in, we could get:

x <- 531
n <- 648
phat <- (x/n)
pi <- (x+2)/(n+4)
pi -  ( (1.96) * sqrt( (pi*(1-pi) ) / (n+4)  ))

[1] 0.7878349

pi +  ( (1.96) * sqrt( (pi*(1-pi) ) / (n+4)  ))

[1] 0.8471345

CIs using binom.confint()

We can also calculate this in R. To do this, we load the binom package, run binom.confint(), and set the method to ac, for Agretsi-Coull.

Remember by hand, we got .7878; .8471

library(binom)
binom.confint(x=531, n=648, method="ac")

         method   x   n      mean     lower     upper
1 agresti-coull 531 648 0.8194444 0.7879139 0.8472099

binom.confint() extra

The default setting for binom.confint is to give all of the possible ways to calculate the confidence intervals. This is the same as setting method=“all”.

binom.confint(x=531, n=648)

          method   x   n      mean     lower     upper
1  agresti-coull 531 648 0.8194444 0.7879139 0.8472099
2     asymptotic 531 648 0.8194444 0.7898285 0.8490604
3          bayes 531 648 0.8189522 0.7891231 0.8482295
4        cloglog 531 648 0.8194444 0.7876056 0.8469808
5          exact 531 648 0.8194444 0.7876231 0.8483216
6          logit 531 648 0.8194444 0.7879178 0.8471935
7         probit 531 648 0.8194444 0.7883190 0.8475301
8        profile 531 648 0.8194444 0.7885988 0.8477696
9            lrt 531 648 0.8194444 0.7885989 0.8477662
10     prop.test 531 648 0.8194444 0.7871576 0.8478685
11        wilson 531 648 0.8194444 0.7879734 0.8471504

Solve for n

Remember that equations will generally be set to solve for one thing.

The ability to taste phenylthiocarbamide (PTC) is genetically controlled in humans. In Europe and Asia, about 70% of people are “tasters”. Suppose a study is being planned to estimate the relative frequency of tasters in a certain Asian population and the desired standard error of the estimated relative frequency is 0.01. How many people should be included in the study?

Which one of our equations holds n?

\[ SE_\hat{p} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

Yes, the standard error. It is also helpful that the question has identified for us a \( \hat{p} \) (.70) already and a desired standard error (.01).

\[ .01 = \sqrt{\frac{.70(1-.70)}{n}} \rightarrow .0001 = \frac{.70(1-.70)}{n} \]

\[ n = \frac{.70*.30}{.0001} \rightarrow n = \frac{.21}{.0001} \rightarrow n= 2100 \]

If we don't know what proportion of our population are a certain category, it is best to assume a \( \hat{p} \) of 0.5 to calculate the maximum number of samples (making this result to 2,500).

Binomial distribution

A binomial random variable is a variable for which:

B - There are two possible outcomes for each trial (e.g. binary outcomes of success and failure);
I - The outcomes of the trial are independent of each other;
N - The number of trials, n, is fixed in advance; and
S - The probability of a success on a single trial is the same for all trials (i.e. the same p value) DIFFERENT P!

If X is the count of successes in a binomial setting, then X has the binomial distribution with parameters n and p. The possible values of X are integers from 0 to n.

Calculating binomial probabilities

\[ Pr[X] = \binom{n}{X}p^X(1-p)^{(n-X)} \]

where (Binomial Coefficient) \[ \binom{n}{X} = \frac{n!}{k!(n-k)!} \]

and where \[ n! = n \cdot (n-1) \cdot (n-2) \cdot ... 3 \cdot 2 \cdot 1 \]

Example: The probability of rolling “4” exactly three times in five rolls of a die can be written as: \[ \binom{5}{3} = \frac{n!}{k!(n-k)!} \]

Die Example

The probability of rolling “4” exactly three times in five rolls of a die can be written as:

p= Probability of a single 4 = \( \frac{1}{6} \).
n = 5 dice rolls
k = 3 successes
X = also k, so 3

\[ \binom{5}{3} = \frac{n!}{k!(n-k)!}p^X(1-p)^{(n-X)} \]

\[ \binom{5}{3} = \frac{5!}{3!(5-3)!}\frac{1}{6}^{3}(1-\frac{1}{6})^{(5-3)} \]

Die example

\[ \binom{5}{3} = \frac{5!}{3!(5-3)!}\cdot\frac{1}{6}^{3}\cdot(1-\frac{1}{6})^{(5-3)} \rightarrow \]

\[ \binom{5}{3} = \frac{120}{6(2)}\cdot\frac{1}{6}^{3}\cdot(\frac{5}{6})^{2} \rightarrow \]

\[ \binom{5}{3} = 10\cdot.00469\cdot(.69\bar{4}) = .032569 \]

Die example: R as a calculator

So, the probability of rolling a 4 three times is .0325. Can we use R to make this easier?

First, let's use it as a calculator.

((factorial(5)/ (factorial(3)*factorial(2))) * ((1/6)^3) * ((5/6)^2))

[1] 0.03215021

choose(5,3) * ((1/6^3) * ((5/6)^2))

[1] 0.03215021

All is good there.

Die example: binom.test()

Now let's use binom.test

library(binom)
binom.test(x=3, n=5, p=(1/6))


    Exact binomial test

data:  3 and 5
number of successes = 3, number of trials = 5, p-value = 0.03549
alternative hypothesis: true probability of success is not equal to 0.1666667
95 percent confidence interval:
 0.1466328 0.9472550
sample estimates:
probability of success 
                   0.6

What is going on!? Did we do our math wrong? It says the probability is .03549!

Does anyone remember what the definition of a p value is?

Die example: reminder of p values

Ah, right. p values are the probability we would find something as extreme or MORE. So, what's more extreme than rolling 3 4's out of 5 dice? How about 4 4's? 5 4's?

 ((factorial(5)/ (factorial(3)*factorial(2))) * ((1/6)^3) * ((5/6)^2)) +
 ((factorial(5)/ (factorial(4)*factorial(1))) * ((1/6)^4) * ((5/6)^1)) +
 ((factorial(5)/ (factorial(5)*factorial(0))) * ((1/6)^5) * ((5/6)^0))

[1] 0.03549383

Indeed. binom.test() will tell us the probability of seeing a value as extreme as a given draw.

Dice example:

Instead, if we wanted to find the probability of seeing a single X, we would use dbinom().

dbinom(x=3,  size=5, prob=(1/6))

[1] 0.03215021

Now we have the result we were expecting.

Lightning Strike Conference

Let's get back to our lightning strikes. As a reminder, 531 out of 648 people struck by lightning were men, and we calculated that means that 81.94% of people who get struck by lightning are men.

If you attended a conference for people who had been struck by lightning and observed that three of ten talks were given by men, is there evidence that men are more likely to talk about their experience?

H₀: The relative frequency of successes (men giving talks who have been struck by lightning) in the population is 81.94%

H_A: The relative frequency of successes in the population is not 81.94%.

n= 10; X =3 ; p=.8194

Lightning Strike Conference: Binom.test()

Starting off with our binom.test()

binom.test(x=3, n=10, p=.8194)


    Exact binomial test

data:  3 and 10
number of successes = 3, number of trials = 10, p-value = 0.0004496
alternative hypothesis: true probability of success is not equal to 0.8194
95 percent confidence interval:
 0.06673951 0.65245285
sample estimates:
probability of success 
                   0.3

Mean and SD of the binomial distribution

\[ \mu = n\cdot p \]

\[ \sigma = \sqrt{n \cdot p \cdot (1-p)} \]

Example: flip a fair coin 20 times, count the number of heads.

hist(rbinom(1000, 20, .5))

plot of chunk unnamed-chunk-10

Lightning Strike

531 of 648 people were men.

We know the proportion is .8194

We observed 10 speakers.

Mean number of successes:

\[ \mu = n\cdot p = (10)(0.8194) = 8.194 \]

SD of the number of successes: \[ \sigma = \sqrt{n \cdot p \cdot (1-p)} = \sqrt{(10)\cdot(0.8194)\cdot(1-0.8194)}=1.2165 \]

Lightning Strike

H₀: The proportion of people who are struck by lightning who are men is 0.5.

H_A: The proportion of people who are struck by lightning who are men is not 0.5.

Lightning Strike

H₀: The proportion of people who are struck by lightning who are men is 0.5.

H_A: The proportion of people who are struck by lightning who are men is not 0.5.

Mean number of successes: \[ \mu = n\cdot p = (648)\cdot(0.5) = 324 \]

Standard deviation of number of successes \[ \sigma = \sqrt{n\cdot p \cdot (1-p)} = \sqrt{(648)\cdot(0.5)\cdot(0.5)} = 12.72792 \]

Lightning Strike

binom.test(x=531,n=648, p=0.5)


    Exact binomial test

data:  531 and 648
number of successes = 531, number of trials = 648, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.7876231 0.8483216
sample estimates:
probability of success 
             0.8194444

2*binom.test(x=531,n=648, p=0.5)$p.value

[1] 1.41102e-63

We multiply it by two since it is a two-sided test.

Normal approximation to the binomial distribution

Why? When calculating exact binomial probabilities with large n.

If n * p and n * (1-p) are both greater than five:

\[ P = 2(Pr[X \ge Observed]) = 2 (Pr [Z>\frac{Observed - \frac{1}{2}-n\cdot p}{\sqrt{n\cdot p \cdot (1-p)}}]) \]

The ½ is what we call the continuity correction.

Questions

Email: kcarriere at washjeff dot edu
Email: jkilgore at washjeff dot edu
Email: lewisme at washjeff dot edu

Book Example: p.187 Part 1

The probability of getting a result as extreme as, or more extreme than, 10 spermatogensis genes on the X chromosome when the null expectation is 1.525.

dbinom(10, 25, .061)

[1] 9.071211e-07

That is 10.

Book Example: p.187 Part 2

dbinom(10, 25, .061)+dbinom(11, 25, .061)+dbinom(12, 25, .061)+
  dbinom(13, 25, .061)+dbinom(14, 25, .061)+dbinom(15, 25, .061)+
  dbinom(16, 25, .061)+dbinom(17, 25, .061)+dbinom(18, 25, .061)+
  dbinom(19, 25, .061)+dbinom(20, 25, .061)+dbinom(22, 25, .061)+
  dbinom(23, 25, .061)+dbinom(24, 25, .061)+dbinom(25, 25, .061)

[1] 9.93988e-07

This is the sum of 10 to the possible maximum of 25.

This is the same thing as running

 binom.test(10, 25, .061)$p.value

[1] 9.93988e-07

Book Example: p.187 Part 3

 2*binom.test(10, 25, .061)$p.value

[1] 1.987976e-06