Joel Correa da Rosa
February 15th and 22nd 2017
A proportion \( p \) is a population parameter that describes a binary outcome \( Y \in \{0,1\} \).
This parameter has the same interpretation as the mean of a quantitative (continuous or discrete) variable.
The estimator of \( p \) will be denoted here by \( \hat{p} \), the sample proportion.
A Bernoulli random experiment is one that observes a variable taking one of two possible values: \( 0 \) (failure) or \( 1 \) (success).
\( P(Y=y)=p^y(1-p)^{1-y} \),
\( y \in \{0,1\} \)
The population parameter \( p \) is the only parameter of this probability distribution, and it also represents the probability of a success, \( P(Y=1) \).
To flip a coin and observe the upper face (Heads or Tails).
To observe whether a patient responds to a treatment.
To observe whether a monoclonal antibody has neutralizing activity.
The natural estimator of \( p \) is the proportion of successes in the sample, usually denoted by \( \hat{p} \):
\( \hat{p}=\frac{\sum_{i=1}^n Y_i}{n}=\frac{X}{n} \)
For computational purposes, a binary variable should be coded as 0 or 1.
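The sample proportion is then just the mean of the coded values. A minimal sketch with a small hypothetical sample:
# hypothetical binary sample: 1 = success, 0 = failure
y<-c(1,0,1,1,0,1,0,1)
mean(y) # sample proportion: 5 successes out of 8, i.e. 0.625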
When the probability function is written as a function of the unknown parameters given the observed data, it is called the likelihood function. For a sample of \( n \) Bernoulli variables, the likelihood of \( p \) is written as:
\( L(p|Y_1 = y_1,Y_2 = y_2,...,Y_n = y_n) = \)
\( p^{y_1}(1-p)^{1-y_1}\times p^{y_2}(1-p)^{1-y_2}\times ...\times p^{y_n}(1-p)^{1-y_n} = \)
\( p^{\sum y_i}(1-p)^{n-\sum y_i} \), \( i=1,2,\ldots,n \)
Given a sample of size \( n \) from the Bernoulli distribution, it can be shown that the maximum likelihood estimator (the value of \( p \) that maximizes the likelihood function based on the observed sample) is:
\( \hat{p}_{ML}=\frac{\sum_{i=1}^n y_i}{n} \)
A sample of 3 Bernoulli variables was taken and the following sequence was observed: \( Y_1=1, Y_2=1 \) and \( Y_3=0 \). Based on this (small) sample, the value of \( p \) that maximizes the likelihood function is \( \hat{p}_{ML}=\frac{\sum_{i=1}^n y_i}{n} = \frac{2}{3} \).
The shape of the likelihood function can be seen in a plot of a grid of values for \( p \) against the evaluated likelihood \( L(p|Y_1,Y_2,Y_3) \).
After sampling \( n=3 \) Bernoulli random variables and obtaining \( Y_1=1, Y_2=1 \) and \( Y_3=0 \), we can write the likelihood function in R with the following commands.
# writing the likelihood function
Bernoulli.Likelihood<-function(p){
  l<-p^2*(1-p) # p^(sum y)*(1-p)^(n-sum y) with sum y = 2, n = 3
  return(l)
}
# plotting the likelihood function for p in the interval 0-1
seq_p<-seq(0,1,0.01)
plot(seq_p,Bernoulli.Likelihood(seq_p),xlab='p',ylab='Bernoulli Likelihood')
abline(v=2/3,col='red') # vertical line at the MLE, p = 2/3
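As a numerical cross-check of the closed-form answer, the likelihood can also be maximized with base R's optimize, which should locate the maximum near \( 2/3 \):
# numerical maximization of the likelihood over (0,1)
optimize(Bernoulli.Likelihood,interval=c(0,1),maximum=TRUE)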
A Binomial random experiment is one that observes \( n \) independent Bernoulli variables \( Y_1,Y_2,\ldots,Y_n \), all with the same success probability \( p \).
\( X = \sum_{i=1}^n Y_i \) : number of successes in \( n \) independent trials
\( P(X=x)=C_x^n p^x (1-p)^{n-x} \)
\( x \in \{0,1,2,\ldots,n\} \)
\( C_x^n = \frac{n!}{(n-x)!x!} \)
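As a quick illustration with arbitrary values, the probability of \( x=3 \) successes in \( n=5 \) trials with \( p=0.4 \) can be computed from the formula or with R's dbinom; both return 0.2304.
# P(X=3) for n=5 and p=0.4: formula versus built-in
choose(5,3)*0.4^3*0.6^2
dbinom(3,5,0.4)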
Although the Bernoulli random variable is the simplest one, inference for its parameter may be more complicated than inference for the mean of a normal distribution. Here we enumerate three approaches.
Normal Approximation
Exact Binomial Confidence Interval (Clopper-Pearson)
Beta-Binomial Bayesian Approach (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3307549/)
The Central Limit Theorem (CLT) states that, under certain conditions, the distribution of a suitably standardized sum of random variables converges to the normal distribution.
The Binomial random variable is a sum of Bernoulli random variables.
The larger the sample, the better the approximation.
As a rule of thumb, the approximation works well for \( n>30 \) and \( 0.3 < p < 0.7 \).
# samples from Binomial(n=5, p): skewed for extreme p
par(mfrow=c(3,1))
hist(rbinom(500,5,0.1))
hist(rbinom(500,5,0.3))
hist(rbinom(500,5,0.5))
# samples from Binomial(n=50, p): closer to a bell shape
par(mfrow=c(3,1))
hist(rbinom(500,50,0.1))
hist(rbinom(500,50,0.3))
hist(rbinom(500,50,0.5))
\( X\approx Normal \)
\( \hat{p}=\frac{X}{n}\approx Normal \)
\( \hat{p}-z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}}< p < \hat{p} +z_{1-\alpha/2} \sqrt{\frac{p(1-p)}{n}} \)
x<-seq(0,1,0.01)
varp<-function(x){x*(1-x)}
plot(x,varp(x),xlab= 'p' , ylab ='p(1-p)')
abline(v=0.5,col='red')
The variance \( p(1-p) \) attains its maximum at \( p=0.5 \). Since \( p \) is unknown, \( p(1-p) \) can either be estimated by \( \hat{p}(1-\hat{p}) \) (the plug-in interval) or replaced by its maximum value \( 0.25 \), which yields a conservative interval:
\( \hat{p}-z_{1-\alpha/2}\sqrt{\frac{0.25}{n}}< p < \hat{p} +z_{1-\alpha/2} \sqrt{\frac{0.25}{n}} \)
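A sketch of both intervals in R, assuming a hypothetical sample with \( x=40 \) successes in \( n=100 \) trials:
# hypothetical data: 40 successes in 100 trials
n<-100
x<-40
phat<-x/n
z<-qnorm(0.975) # 95% confidence level
# plug-in interval: p(1-p) estimated by phat*(1-phat)
phat+c(-1,1)*z*sqrt(phat*(1-phat)/n)
# conservative interval: maximum variance 0.25
phat+c(-1,1)*z*sqrt(0.25/n)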
\( p_{lower} = \frac{1}{1+\frac{n-x+1}{x}F_{a_1,a_2}(1-\alpha/2)} \)
\( p_{upper} = \frac{\frac{x+1}{n-x}F_{b_1,b_2}(1-\alpha/2)}{1+\frac{x+1}{n-x}F_{b_1,b_2}(1-\alpha/2)} \)
where \( F_{d_1,d_2}(q) \) denotes the \( q \) quantile of the \( F \) distribution with degrees of freedom
\( a_1 = 2(n-x+1) \)
\( a_2 = 2x \)
\( b_1 = 2(x+1) \)
\( b_2 = 2(n-x) \)
Unlike the normal approximation, this exact interval can also be computed in the boundary cases \( \hat{p}=0 \) and \( \hat{p}=1 \).
n=5
x=2
alpha=0.05
a1 = 2*(n-x+1)
a2= 2*x
b1 = 2*(x+1)
b2 = 2*(n-x)
plower= 1/(1+((n-x+1)/x)*(qf(1-alpha/2,a1,a2)))
pupper = (((x+1)/(n-x))*(qf(1-alpha/2,b1,b2)))/((1+((x+1)/(n-x))*(qf(1-alpha/2,b1,b2))))
plower
[1] 0.05274495
pupper
[1] 0.8533672
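The exact interval can be checked against binom.test, which implements the Clopper-Pearson method and reproduces the bounds above:
binom.test(x=2,n=5)$conf.int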
\( H_0:p = p_0 \)
\( H_1:p \neq p_0 \)
\( p_0 \) is a hypothesized value for the population proportion.
The test statistic is
\( z = \frac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \)
or, equivalently:
\( z = \frac{X-np_0}{\sqrt{np_0(1-p_0)}} \)
\( \hat{p}=\frac{1}{n}\sum_{i=1}^n Y_i \)
\( Var(\hat{p}) = \frac{p(1-p)}{n} \)
Since the binomial variable is discrete, \( P(X \geq x) \neq P(X > x) \).
To account for this, the test statistic is corrected for continuity:
\( z = \frac{X-np_0-1/2}{\sqrt{np_0(1-p_0)}} \)
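A sketch with hypothetical data, say \( X=60 \) successes in \( n=100 \) trials and \( H_0: p_0=0.5 \):
# continuity-corrected statistic and two-sided p-value
n<-100
x<-60
p0<-0.5
z<-(x-n*p0-1/2)/sqrt(n*p0*(1-p0)) # (60-50-0.5)/5 = 1.9
2*pnorm(abs(z),lower.tail=F)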
Assuming that the binomial variable is well approximated by the normal distribution, the statistic for comparing two independent means can be used:
\( z = \frac{\bar{x}_1-\bar{x}_2-(\mu_1-\mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}} \)
Replace \( \bar{x}_1 \) and \( \bar{x}_2 \) by \( \hat{p}_1 \) and \( \hat{p}_2 \) respectively.
Replace \( \sigma_1^2 \) and \( \sigma_2^2 \) by \( p_1(1-p_1) \) and \( p_2(1-p_2) \) respectively.
\( z = \frac{\hat{p}_1-\hat{p}_2-(p_1-p_2)}{\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}} \)
R's prop.test implements a non-parametric approach that uses the \( \chi^2 \) distribution.
require(MASS)
data(Pima.tr) # diabetes in Pima Indian women
Pima.tr$obese<-factor(Pima.tr$bmi>30) # obesity indicator (BMI > 30)
test.prop<-prop.test(table(Pima.tr$obese,Pima.tr$type))
test.prop
2-sample test for equality of proportions with continuity
correction
data: table(Pima.tr$obese, Pima.tr$type)
X-squared = 15.814, df = 1, p-value = 6.988e-05
alternative hypothesis: two.sided
95 percent confidence interval:
0.1618019 0.4228683
sample estimates:
prop 1 prop 2
0.8529412 0.5606061
p1hat<-test.prop$estimate[1] # proportion of type 'No' among the non-obese
p2hat<-test.prop$estimate[2] # proportion of type 'No' among the obese
# group sizes: the two groups are the rows of the table (non-obese, obese)
n1<-length(which(Pima.tr$obese=='FALSE'))
n2<-length(which(Pima.tr$obese=='TRUE'))
# standard approach
z<-(p1hat-p2hat)/sqrt((p1hat*(1-p1hat)/n1)+(p2hat*(1-p2hat)/n2))
z
prop 1 
4.79902 
# conservative approach
zcons<-(p1hat-p2hat)/sqrt((0.5*(1-0.5)/n1)+(0.5*(1-0.5)/n2))
zcons
prop 1 
3.916854 
# standard approach
pval1<-2*pnorm(abs(z),lower.tail = F)
pval1
prop 1 
1.594452e-06 
# conservative approach
pval2<-2*pnorm(abs(zcons),lower.tail = F)
pval2
prop 1 
8.97118e-05 
# Fisher's exact test on the same 2x2 table
fisher.test(table(Pima.tr$obese,Pima.tr$type))
Fisher's Exact Test for Count Data
data: table(Pima.tr$obese, Pima.tr$type)
p-value = 3.314e-05
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
2.059697 10.788318
sample estimates:
odds ratio
4.513701