Joel Correa da Rosa
February 15th and 22nd 2017
A proportion \( p \) is a population parameter that describes a binary outcome \( Y \in \{0,1\} \).
This parameter has the same interpretation as the mean of a quantitative (continuous or discrete) variable.
The estimator of \( p \) will be denoted here by \( \hat{p} \), the sample proportion.
A Bernoulli random experiment is one that observes a variable taking one of two possible values: \( 0 \) (failure) or \( 1 \) (success).
\( P(Y=y)=p^y(1-p)^{1-y} \),
\( y \in \{0,1\} \)
The population parameter \( p \) is the only parameter of this probability distribution, and it also represents the probability of a success, \( P(Y=1) \).
To flip a coin and observe the upper face (Heads or Tails).
To observe whether a patient responds to a treatment.
To observe whether a monoclonal antibody has neutralizing activity.
The natural estimator of \( p \) is the proportion of successes in the sample, usually denoted by \( \hat{p} \):
\( \hat{p}=\frac{\sum_{i=1}^n Y_i}{n}=\frac{X}{n} \)
For computational purposes, a binary variable should be coded as 0 or 1.
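The sample proportion is then just the mean of the coded values. A minimal sketch with a small hypothetical sample:
# hypothetical binary sample: 1 = success, 0 = failure
y<-c(1,0,1,1,0,1,0,1)
mean(y) # sample proportion: 5 successes out of 8, i.e. 0.625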
When the probability function is written as a function of the unknown parameters given the observed data, it is called the likelihood function. For a sample of \( n \) Bernoulli variables, the likelihood of \( p \) is written as:
\( L(p|Y_1 = y_1,Y_2 = y_2,...,Y_n = y_n) = \)
\( p^{y_1}(1-p)^{1-y_1}\times p^{y_2}(1-p)^{1-y_2}\times ...\times p^{y_n}(1-p)^{1-y_n} = \)
\( p^{\sum y_i}(1-p)^{n-\sum y_i} \), \( i=1,2,\ldots,n \)
Given a sample of size \( n \) from the Bernoulli distribution, it can be shown that the maximum likelihood estimator (the value of \( p \) that maximizes the likelihood function based on the observed sample) is:
\( \hat{p}_{ML}=\frac{\sum_{i=1}^n y_i}{n} \)
A sample of 3 Bernoulli variables was taken and the following sequence was observed: \( Y_1=1, Y_2=1 \) and \( Y_3=0 \). Based on this (small) sample, the value of \( p \) that maximizes the likelihood function is \( \hat{p}_{ML}=\frac{\sum_{i=1}^n y_i}{n} = \frac{2}{3} \).
The shape of the likelihood function can be seen in a plot of a grid of values for \( p \) against the evaluated likelihood \( L(p|Y_1,Y_2,Y_3) \).
After sampling \( n=3 \) Bernoulli random variables and obtaining \( Y_1=1, Y_2=1 \) and \( Y_3=0 \), we can write the likelihood function in R with the following commands.
# writing the likelihood function
Bernoulli.Likelihood<-function(p){
  l<-p^2*(1-p) # p^(sum y)*(1-p)^(n-sum y) with sum y = 2, n = 3
  return(l)
}
# plotting the likelihood function for p in the interval 0-1
seq_p<-seq(0,1,0.01)
plot(seq_p,Bernoulli.Likelihood(seq_p),xlab='p',ylab='Bernoulli Likelihood')
abline(v=2/3,col='red') # vertical line at the MLE, p = 2/3
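As a numerical cross-check of the closed-form answer, the likelihood can also be maximized with base R's optimize, which should locate the maximum near \( 2/3 \):
# numerical maximization of the likelihood over (0,1)
optimize(Bernoulli.Likelihood,interval=c(0,1),maximum=TRUE)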
A Binomial random experiment is one that observes \( n \) independent Bernoulli variables \( Y_1,Y_2,\ldots,Y_n \), all with the same success probability \( p \).
\( X = \sum_{i=1}^n Y_i \) : number of successes in \( n \) independent trials
\( P(X=x)=C_x^n p^x (1-p)^{n-x} \)
\( x \in \{0,1,2,\ldots,n\} \)
\( C_x^n = \frac{n!}{(n-x)!x!} \)
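As a quick illustration with arbitrary values, the probability of \( x=3 \) successes in \( n=5 \) trials with \( p=0.4 \) can be computed from the formula or with R's dbinom; both return 0.2304.
# P(X=3) for n=5 and p=0.4: formula versus built-in
choose(5,3)*0.4^3*0.6^2
dbinom(3,5,0.4)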
Although the Bernoulli random variable is the simplest one, inference for its parameter may be more complicated than inference for the mean of a normal distribution. Here we enumerate three approaches.
Normal Approximation
Exact Binomial Confidence Interval (Clopper-Pearson)
Beta-Binomial Bayesian Approach (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3307549/)
The Central Limit Theorem (CLT) states that, under certain conditions, the distribution of a suitably standardized sum of random variables converges to the normal distribution.
The Binomial random variable is a sum of Bernoulli random variables.
The larger the sample, the better the approximation.
As a rule of thumb, the approximation works well for \( n>30 \) and \( 0.3 < p < 0.7 \).
# samples from Binomial(n=5, p): skewed for extreme p
par(mfrow=c(3,1))
hist(rbinom(500,5,0.1))
hist(rbinom(500,5,0.3))
hist(rbinom(500,5,0.5))
# samples from Binomial(n=50, p): closer to a bell shape
par(mfrow=c(3,1))
hist(rbinom(500,50,0.1))
hist(rbinom(500,50,0.3))
hist(rbinom(500,50,0.5))
\( X\approx Normal \)
\( \hat{p}=\frac{X}{n}\approx Normal \)
\( \hat{p}-z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}}< p < \hat{p} +z_{1-\alpha/2} \sqrt{\frac{p(1-p)}{n}} \)
x<-seq(0,1,0.01)
varp<-function(x){x*(1-x)}
plot(x,varp(x),xlab= 'p' , ylab ='p(1-p)')
abline(v=0.5,col='red')
The variance \( p(1-p) \) attains its maximum at \( p=0.5 \). Since \( p \) is unknown, \( p(1-p) \) can either be estimated by \( \hat{p}(1-\hat{p}) \) (the plug-in interval) or replaced by its maximum value \( 0.25 \), which yields a conservative interval:
\( \hat{p}-z_{1-\alpha/2}\sqrt{\frac{0.25}{n}}< p < \hat{p} +z_{1-\alpha/2} \sqrt{\frac{0.25}{n}} \)
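A sketch of both intervals in R, assuming a hypothetical sample with \( x=40 \) successes in \( n=100 \) trials:
# hypothetical data: 40 successes in 100 trials
n<-100
x<-40
phat<-x/n
z<-qnorm(0.975) # 95% confidence level
# plug-in interval: p(1-p) estimated by phat*(1-phat)
phat+c(-1,1)*z*sqrt(phat*(1-phat)/n)
# conservative interval: maximum variance 0.25
phat+c(-1,1)*z*sqrt(0.25/n)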
\( p_{lower} = \frac{1}{1+\frac{n-x+1}{x}F_{a_1,a_2}(1-\alpha/2)} \)
\( p_{upper} = \frac{\frac{x+1}{n-x}F_{b_1,b_2}(1-\alpha/2)}{1+\frac{x+1}{n-x}F_{b_1,b_2}(1-\alpha/2)} \)
where \( F_{d_1,d_2}(q) \) denotes the \( q \) quantile of the \( F \) distribution with degrees of freedom
\( a_1 = 2(n-x+1) \)
\( a_2 = 2x \)
\( b_1 = 2(x+1) \)
\( b_2 = 2(n-x) \)
Unlike the normal approximation, this exact interval can also be computed in the boundary cases \( \hat{p}=0 \) and \( \hat{p}=1 \).
n=5
x=2
alpha=0.05
a1 = 2*(n-x+1)
a2= 2*x
b1 = 2*(x+1)
b2 = 2*(n-x)
plower= 1/(1+((n-x+1)/x)*(qf(1-alpha/2,a1,a2)))
pupper = (((x+1)/(n-x))*(qf(1-alpha/2,b1,b2)))/((1+((x+1)/(n-x))*(qf(1-alpha/2,b1,b2))))
plower
[1] 0.05274495
pupper
[1] 0.8533672
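The exact interval can be checked against binom.test, which implements the Clopper-Pearson method and reproduces the bounds above:
binom.test(x=2,n=5)$conf.int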
\( H_0:p = p_0 \)
\( H_1:p \neq p_0 \)
\( p_0 \) is a hypothesized value for the population proportion.
The test statistic is
\( z = \frac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \)
or, equivalently:
\( z = \frac{X-np_0}{\sqrt{np_0(1-p_0)}} \)
\( \hat{p}=\frac{1}{n}\sum_{i=1}^n Y_i \)
\( Var(\hat{p}) = \frac{p(1-p)}{n} \)
Since the binomial variable is discrete, \( P(X \geq x) \neq P(X > x) \).
To account for this, the test statistic is corrected for continuity:
\( z = \frac{X-np_0-1/2}{\sqrt{np_0(1-p_0)}} \)
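A sketch with hypothetical data, say \( X=60 \) successes in \( n=100 \) trials and \( H_0: p_0=0.5 \):
# continuity-corrected statistic and two-sided p-value
n<-100
x<-60
p0<-0.5
z<-(x-n*p0-1/2)/sqrt(n*p0*(1-p0)) # (60-50-0.5)/5 = 1.9
2*pnorm(abs(z),lower.tail=F)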
Assuming that the binomial variable is well approximated by the normal distribution, the statistic for comparing two independent means can be used:
\( z = \frac{\bar{x}_1-\bar{x}_2-(\mu_1-\mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}} \)
Replace \( \bar{x}_1 \) and \( \bar{x}_2 \) by \( \hat{p}_1 \) and \( \hat{p}_2 \) respectively.
Replace \( \sigma_1^2 \) and \( \sigma_2^2 \) by \( p_1(1-p_1) \) and \( p_2(1-p_2) \) respectively.
\( z = \frac{\hat{p}_1-\hat{p}_2-(p_1-p_2)}{\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}} \)
R's prop.test implements a non-parametric approach that uses the \( \chi^2 \) distribution.
require(MASS)
data(Pima.tr) # diabetes in Pima Indian women
Pima.tr$obese<-factor(Pima.tr$bmi>30) # obesity indicator (BMI > 30)
test.prop<-prop.test(table(Pima.tr$obese,Pima.tr$type))
test.prop
2-sample test for equality of proportions with continuity
correction
data: table(Pima.tr$obese, Pima.tr$type)
X-squared = 15.814, df = 1, p-value = 6.988e-05
alternative hypothesis: two.sided
95 percent confidence interval:
0.1618019 0.4228683
sample estimates:
prop 1 prop 2
0.8529412 0.5606061
p1hat<-test.prop$estimate[1] # proportion of type 'No' among the non-obese
p2hat<-test.prop$estimate[2] # proportion of type 'No' among the obese
# group sizes: the two groups are the rows of the table (non-obese, obese)
n1<-length(which(Pima.tr$obese=='FALSE'))
n2<-length(which(Pima.tr$obese=='TRUE'))
# standard approach
z<-(p1hat-p2hat)/sqrt((p1hat*(1-p1hat)/n1)+(p2hat*(1-p2hat)/n2))
z
prop 1 
4.79902 
# conservative approach
zcons<-(p1hat-p2hat)/sqrt((0.5*(1-0.5)/n1)+(0.5*(1-0.5)/n2))
zcons
prop 1 
3.916854 
# standard approach
pval1<-2*pnorm(abs(z),lower.tail = F)
pval1
prop 1 
1.594452e-06 
# conservative approach
pval2<-2*pnorm(abs(zcons),lower.tail = F)
pval2
prop 1 
8.97118e-05 
# Fisher's exact test on the same 2x2 table
fisher.test(table(Pima.tr$obese,Pima.tr$type))
Fisher's Exact Test for Count Data
data: table(Pima.tr$obese, Pima.tr$type)
p-value = 3.314e-05
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
2.059697 10.788318
sample estimates:
odds ratio
4.513701