Drs. Kilgore & Carriere
Categorical Variables - Frequency data
We can estimate the population proportion by: \[ \hat{p}=\frac{X}{n} \]
where X denotes number of “successes”
\[ \hat{p} = \frac{X}{n} = \frac{531}{648} = 0.8194 \]
\[ \hat{p} = \frac{X}{n} = \frac{531}{648} = 0.8194 \]
We can use this to solve for the standard error of our proportion.
\[ SE_\hat{p} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{(0.8194)(1-0.8194)}{648}} = 0.0151 \]
There are many ways to calculate confidence intervals for proportions. Your book recommends using the Agresti-Coull confidence intervals, which is calculated in two steps.
\[ p= \pm(1.96)\sqrt{\frac{p'(1-p')}{n+4}} \] \[ where p' = \frac{X+2}{n+4}=\frac{533}{652}=0.8174 \]
Thus leading us to…
\[ 0.7932 \pm(1.96)\sqrt{\frac{0.8174(1-0.8174)}{648+4}}= [0.7878, 0.8471] \]
Using R code to plug all of this in, we could get:
x <- 531
n <- 648
phat <- (x/n)
pi <- (x+2)/(n+4)
pi - ( (1.96) * sqrt( (pi*(1-pi) ) / (n+4) ))
[1] 0.7878349
pi + ( (1.96) * sqrt( (pi*(1-pi) ) / (n+4) ))
[1] 0.8471345
We can also calculate this in R. To do this, we load the binom package, run binom.confint(), and set the method to ac, for Agretsi-Coull.
Remember by hand, we got .7878; .8471
library(binom)
binom.confint(x=531, n=648, method="ac")
method x n mean lower upper
1 agresti-coull 531 648 0.8194444 0.7879139 0.8472099
The default setting for binom.confint is to give all of the possible ways to calculate the confidence intervals. This is the same as setting method=“all”.
binom.confint(x=531, n=648)
method x n mean lower upper
1 agresti-coull 531 648 0.8194444 0.7879139 0.8472099
2 asymptotic 531 648 0.8194444 0.7898285 0.8490604
3 bayes 531 648 0.8189522 0.7891231 0.8482295
4 cloglog 531 648 0.8194444 0.7876056 0.8469808
5 exact 531 648 0.8194444 0.7876231 0.8483216
6 logit 531 648 0.8194444 0.7879178 0.8471935
7 probit 531 648 0.8194444 0.7883190 0.8475301
8 profile 531 648 0.8194444 0.7885988 0.8477696
9 lrt 531 648 0.8194444 0.7885989 0.8477662
10 prop.test 531 648 0.8194444 0.7871576 0.8478685
11 wilson 531 648 0.8194444 0.7879734 0.8471504
Remember that equations will generally be set to solve for one thing.
The ability to taste phenylthiocarbamide (PTC) is genetically controlled in humans. In Europe and Asia, about 70% of people are “tasters”. Suppose a study is being planned to estimate the relative frequency of tasters in a certain Asian population and the desired standard error of the estimated relative frequency is 0.01. How many people should be included in the study?
Which one of our equations holds n?
\[ SE_\hat{p} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
Yes, the standard error. It is also helpful that the question has identified for us a \( \hat{p} \) (.70) already and a desired standard error (.01).
\[ .01 = \sqrt{\frac{.70(1-.70)}{n}} \rightarrow .0001 = \frac{.70(1-.70)}{n} \]
\[ n = \frac{.70*.30}{.0001} \rightarrow n = \frac{.21}{.0001} \rightarrow n= 2100 \]
If we don't know what proportion of our population are a certain category, it is best to assume a \( \hat{p} \) of 0.5 to calculate the maximum number of samples (making this result to 2,500).
A binomial random variable is a variable for which:
B - There are two possible outcomes for each trial (e.g. binary outcomes of success and failure);
I - The outcomes of the trial are independent of each other;
N - The number of trials, n, is fixed in advance; and
S - The probability of a success on a single trial is the same for all trials (i.e. the same p value) DIFFERENT P!
If X is the count of successes in a binomial setting, then X has the binomial distribution with parameters n and p. The possible values of X are integers from 0 to n.
\[ Pr[X] = \binom{n}{X}p^X(1-p)^{(n-X)} \]
where (Binomial Coefficient) \[ \binom{n}{X} = \frac{n!}{k!(n-k)!} \]
and where \[ n! = n \cdot (n-1) \cdot (n-2) \cdot ... 3 \cdot 2 \cdot 1 \]
Example: The probability of rolling “4” exactly three times in five rolls of a die can be written as: \[ \binom{5}{3} = \frac{n!}{k!(n-k)!} \]
The probability of rolling “4” exactly three times in five rolls of a die can be written as:
\[ \binom{5}{3} = \frac{n!}{k!(n-k)!}p^X(1-p)^{(n-X)} \]
\[ \binom{5}{3} = \frac{5!}{3!(5-3)!}\frac{1}{6}^{3}(1-\frac{1}{6})^{(5-3)} \]
\[ \binom{5}{3} = \frac{5!}{3!(5-3)!}\cdot\frac{1}{6}^{3}\cdot(1-\frac{1}{6})^{(5-3)} \rightarrow \]
\[ \binom{5}{3} = \frac{120}{6(2)}\cdot\frac{1}{6}^{3}\cdot(\frac{5}{6})^{2} \rightarrow \]
\[ \binom{5}{3} = 10\cdot.00469\cdot(.69\bar{4}) = .032569 \]
So, the probability of rolling a 4 three times is .0325. Can we use R to make this easier?
First, let's use it as a calculator.
((factorial(5)/ (factorial(3)*factorial(2))) * ((1/6)^3) * ((5/6)^2))
[1] 0.03215021
Or
choose(5,3) * ((1/6^3) * ((5/6)^2))
[1] 0.03215021
All is good there.
Now let's use binom.test
library(binom)
binom.test(x=3, n=5, p=(1/6))
Exact binomial test
data: 3 and 5
number of successes = 3, number of trials = 5, p-value = 0.03549
alternative hypothesis: true probability of success is not equal to 0.1666667
95 percent confidence interval:
0.1466328 0.9472550
sample estimates:
probability of success
0.6
What is going on!? Did we do our math wrong? It says the probability is .03549!
Does anyone remember what the definition of a p value is?
Ah, right. p values are the probability we would find something as extreme or MORE. So, what's more extreme than rolling 3 4's out of 5 dice? How about 4 4's? 5 4's?
((factorial(5)/ (factorial(3)*factorial(2))) * ((1/6)^3) * ((5/6)^2)) +
((factorial(5)/ (factorial(4)*factorial(1))) * ((1/6)^4) * ((5/6)^1)) +
((factorial(5)/ (factorial(5)*factorial(0))) * ((1/6)^5) * ((5/6)^0))
[1] 0.03549383
Indeed. binom.test() will tell us the probability of seeing a value as extreme as a given draw.
Instead, if we wanted to find the probability of seeing a single X, we would use dbinom().
dbinom(x=3, size=5, prob=(1/6))
[1] 0.03215021
Now we have the result we were expecting.
Let's get back to our lightning strikes. As a reminder, 531 out of 648 people struck by lightning were men, and we calculated that means that 81.94% of people who get struck by lightning are men.
If you attended a conference for people who had been struck by lightning and observed that three of ten talks were given by men, is there evidence that men are more likely to talk about their experience?
H0: The relative frequency of successes (men giving talks who have been struck by lightning) in the population is 81.94%
HA: The relative frequency of successes in the population is not 81.94%.
n= 10; X =3 ; p=.8194
Starting off with our binom.test()
binom.test(x=3, n=10, p=.8194)
Exact binomial test
data: 3 and 10
number of successes = 3, number of trials = 10, p-value = 0.0004496
alternative hypothesis: true probability of success is not equal to 0.8194
95 percent confidence interval:
0.06673951 0.65245285
sample estimates:
probability of success
0.3
\[ \mu = n\cdot p \]
\[ \sigma = \sqrt{n \cdot p \cdot (1-p)} \]
Example: flip a fair coin 20 times, count the number of heads.
hist(rbinom(1000, 20, .5))
531 of 648 people were men.
We know the proportion is .8194
We observed 10 speakers.
Mean number of successes:
\[ \mu = n\cdot p = (10)(0.8194) = 8.194 \]
SD of the number of successes: \[ \sigma = \sqrt{n \cdot p \cdot (1-p)} = \sqrt{(10)\cdot(0.8194)\cdot(1-0.8194)}=1.2165 \]
H0: The proportion of people who are struck by lightning who are men is 0.5.
HA: The proportion of people who are struck by lightning who are men is not 0.5.
H0: The proportion of people who are struck by lightning who are men is 0.5.
HA: The proportion of people who are struck by lightning who are men is not 0.5.
Mean number of successes: \[ \mu = n\cdot p = (648)\cdot(0.5) = 324 \]
Standard deviation of number of successes \[ \sigma = \sqrt{n\cdot p \cdot (1-p)} = \sqrt{(648)\cdot(0.5)\cdot(0.5)} = 12.72792 \]
binom.test(x=531,n=648, p=0.5)
Exact binomial test
data: 531 and 648
number of successes = 531, number of trials = 648, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.7876231 0.8483216
sample estimates:
probability of success
0.8194444
2*binom.test(x=531,n=648, p=0.5)$p.value
[1] 1.41102e-63
We multiply it by two since it is a two-sided test.
Why? When calculating exact binomial probabilities with large n.
If n * p and n * (1-p) are both greater than five:
\[ P = 2(Pr[X \ge Observed]) = 2 (Pr [Z>\frac{Observed - \frac{1}{2}-n\cdot p}{\sqrt{n\cdot p \cdot (1-p)}}]) \]
The ½ is what we call the continuity correction.
The probability of getting a result as extreme as, or more extreme than, 10 spermatogensis genes on the X chromosome when the null expectation is 1.525.
dbinom(10, 25, .061)
[1] 9.071211e-07
That is 10.
dbinom(10, 25, .061)+dbinom(11, 25, .061)+dbinom(12, 25, .061)+
dbinom(13, 25, .061)+dbinom(14, 25, .061)+dbinom(15, 25, .061)+
dbinom(16, 25, .061)+dbinom(17, 25, .061)+dbinom(18, 25, .061)+
dbinom(19, 25, .061)+dbinom(20, 25, .061)+dbinom(22, 25, .061)+
dbinom(23, 25, .061)+dbinom(24, 25, .061)+dbinom(25, 25, .061)
[1] 9.93988e-07
This is the sum of 10 to the possible maximum of 25.
This is the same thing as running
binom.test(10, 25, .061)$p.value
[1] 9.93988e-07
2*binom.test(10, 25, .061)$p.value
[1] 1.987976e-06