26-02-2025 >eR-BioStat

Inference for categorical data

Ziv Shkedy and Thi Huyen Nguyen based on Chapter 8 in the book of Julie Vu and Dave Harrington Introductory Statistics for the Life and Biomedical Sciences (https://www.openintro.org/book/biostat/)

## load/install libraries
.libPaths(c("./Rpackages",.libPaths()))
library(knitr)
library(tidyverse)
library(deSolve)
library(minpack.lm)
library(ggpubr)
library(readxl)
library(gamlss)
library(data.table)
library(grid)
library(png)
library(nlme)
library(gridExtra)
library(mvtnorm)
library(e1071)
library(lattice)
library(ggplot2)
library(dslabs)
library(NHANES)
library(plyr)
library(dplyr)
library(nasaweather)
library(ggplot2)
library(gganimate)
library(av)
library(gifski)
library(foreach)
library("DAAG")
library(DT)
library(TeachingDemos)
library(gridExtra)

1. Inference for a single proportion

1.1 The Wolchok’s study

Advanced melanoma is an aggressive form of skin cancer that until recently was almost uniformly fatal. In rare instances, a patient’s melanoma stopped progressing or disappeared altogether when the patient’s immune system successfully mounted a response to the cancer. Those observations led to research into therapies that might trigger an immune response in cancer. Some of the most notable successes have been in melanoma, particularly with two new therapies, nivolumab and ipilimumab. A 2013 report in the New England Journal of Medicine by Wolchok et al. reported the results of a study in which patients were treated with both nivolumab and ipilimumab Fifty-three patients were given the new regimens concurrently, and the response to therapy could be evaluated in 52 of the 53. Of the 52 evaluate patients, 21 $(40.3\%)$ experienced a response according to commonly accepted criteria. Figure 1 shows the distribution of the patients in the sample. In previous studies, the proportion of patients responding to one of these agents was $30\%$ or less. How might one compare the new data to past results?

freq<-c(30,21)
barplot(freq, width = 1, space = NULL,col=c(2,3),names=c("Non responder","Responder"))

Figure 1: Response to therapy in Wolchok’s study.

1.2 Binomial population

The data from this study are binomial data, with success defined as a response to therapy. Suppose the number of patients who respond in a study like this is represented by the random variable X, where X is binomial with parameters $n$ (the number of trials, where each trial is represented by a patient) and $p$ (the unknown population proportion of response). In the population, $X$ is a binary variable that takes two values:

\[ X_{i}=\left \{ \begin{array}{c,l,l} 1 & p & subject \;respond \\ 0 & 1-p & elsewhere. \end{array} \right . \] Let $X$ ve the total number of subjects in the study that responsed to therapy, that is \[ X=\sum_{i=1}^{n}X_{i}. \]

From formulas discussed in Chapter 3, the mean of $X$ is $n \times p$ and the standard deviation of $X$ is $n \times p \times (1-p)$. In Wolchok’s study, $X=21$ and the proportion of patients who experianced response to the theraphy is \[ \hat{p}=\frac{X}{n}=\frac{\sum_{i=1}^{n}X_{i}}{n}=\frac{21}{52}=0.404. \]

1.3 Sampling distribution of $\hat{p}$

The sampling distribution for $\hat{p}$, calculated from a sample of size $n$ from a population with a success proportion $p$, is nearly normal when

the sample observations are independent.
at least 10 successes and 10 failures are expected in the sample, i.e. $n p> 10 $ 10 and $n \times (1-p)>10$. This is called the success-failure condition. If these conditions are met, then the sampling distribution of $\hat{p}$ is approximately normal with mean $p$ and standard error \[ SE_{\hat{p}}=\sqrt{\frac{(p \times (1-p))}{n}}, \]

\[ \hat{p} \sim N \left ( p, \sqrt{\frac{(p \times (1-p))}{n}} \right ) \]

When conducting inference, the population proportion $p$ is unknown. Thus, to construct a confidence interval, the sample proportion $\hat{p}$ can be substituted for $p$ to check the success-failure condition and compute the standard error. In a hypothesis test, $p_{0}$ is substituted for $p$.

1.4 Confidence intervals for a proportion

When using the normal approximation to the sampling distribution of $\hat{p}$, a confidence interval for a proportion has the same structure as a confidence interval for a mean; it is centered at the point estimate, with a margin of error calculated from the standard error and appropriate critical value from $N(0,1). The formula for a $95\%$ confidence interval is

\[ \hat{p} \pm z_{\frac{\alpha}{2}} \times SE_{\hat{p}}=\hat{p} \pm 1.96 \times \sqrt{\frac{(p \times (1-p))}{n}}. \]

Example: Confifence interval for $p$ in Wolchok’s study

The independence and success-failure assumptions should be checked first. Since the outcome of one patient is unlikely to influence that of other patients, the observations are independent. The success-failure condition is satisfied since $n \hat{p} = 52 \times 0.404 = 21 > 10$ and $n \times \hat{p} \times (1-\hat{p}) = 52 \times 0.596 = 31 > 10$. The standard error for $\hat{p}$ is \[ \sqrt{\frac{(0.404 \times 0.596))}{52}}=0.068. \]

For a $95\%$ confidence interval $\frac{\alpha}{2}=0.025$ and the ctritical value from $N(0,1)$ is equal to 1.96. Therefore, a $95\%$ confidence intervals is

\[ 0.404 \pm 1.96 \times 0.068 \rightarrow (0.27;0.54) \]

Example: Confifence interval for $p$ in Wolchok’s study using R

In R, a confidence interval for the proportion can be constructed using the R function prop.test(). The function choses the critical values from a standard normal distrubtion.

library(TeachingDemos)
X<-21
n<-52
prop.test(X,n)

## 
##  1-sample proportions test with continuity correction
## 
## data:  X out of n, null probability 0.5
## X-squared = 1.5577, df = 1, p-value = 0.212
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.2731269 0.5487141
## sample estimates:
##         p 
## 0.4038462

As mentioned in the output above, the analysis was done with continuity correction (the default). To conduct the same analysis without continuity correction we use the argument correct=F.

library(TeachingDemos)
prop.test(X,n, correct=F)

## 
##  1-sample proportions test without continuity correction
## 
## data:  X out of n, null probability 0.5
## X-squared = 1.9231, df = 1, p-value = 0.1655
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.2815973 0.5393242
## sample estimates:
##         p 
## 0.4038462

1.5 Hypothesis testing for a proportion in one population

Just as with inference for population means, confidence intervals for population proportions can be used when deciding whether to reject a null hypothesis. It is useful in most settings, however, to calculate the p-value for a test as a measure of the strength of the evidence contradicting the null hypothesis. When using the normal approximation for the distribution of pˆ to conduct a hypothesis test, one should always verify that $\hat{p}$ is nearly normal under $H_{0}$ by checking the independence and success-failure conditions. Since a hypothesis test is based on the distribution of the test statistic under the null hypothesis, the success-failure condition is checked using the null proportion $p_{0}$, not the estimate $\hat{p}$. According to the normal approximation to the binomial distribution, the number of successes in $n$ trials is normally distributed with mean $n \times p_{0}$ and standard deviation $n \times p_{0} (1-p{0})$. This approximation is valid when $n \times p_{0}$ and $n \times (1-p_{0})$ are both at least 10.

Under the null hypothesis $H_{0}:p=p_{0}$, the sample proportion $\hat{p}$ follows a normal distribution \[ \hat{p} \sim N \left ( p_{0}, \sqrt{\frac{(p_{0} \times (1-p_{0}))}{n}} \right ). \]

As before, the test statistics is defined as

\[ z=\frac{\mbox{point estimate - parameter under the null}}{SE}. \]

For a sample proportion the test statistics given by

\[ z=\frac{\hat{p}-p_{0}}{\sqrt{\frac{(p_{0} \times (1-p_{0}))}{n}}}. \]

Example: Hypothesis testing for $p$ in Wolchok’s study

Suppose that we wish to test the null hypothesis $H_{0}:p=0.3$ aginast a one sided alternative, $H_{1}:p > 0.3$. In the sample, $\hat{p}=0.404$. The test statistics in this case is equal to

X<-21
n<-52
pbar = 21/52
prop = 0.3
n = 52
z=(pbar-prop)/sqrt((prop*(1-prop))/n)
z

## [1] 1.634114

For significant level of $5\%$ ($\alpha=0.05$), the critical value is

alpha=0.05
crit.point=qnorm(1-alpha)#p=0.05 one tailed (upper)
crit.point

## [1] 1.644854

Since $1.6365<1.6448$ we do not reject the null hypothesis. Alternatively, we reach the same decision if we calculate the p-value

pval=1- pnorm(z,0,1)#p=0.05 one tailed (upper)
pval

## [1] 0.05111742

1.6 Hypothesis testing for $p$ in Wolchok’s study using R

The analysis in the previous sestion can be conducted using the R function prop.test(). We need to specify the value of the parameter under the null hypothesis p=0.3 and for a one sided alternative we we alternative = c(“greater”). Note that we conduct the inference without a continuity correction using the argument correct=F.

X<-21
n<-52
prop.test(X, n, p=0.3,alternative = c("greater"),correct=F)

## 
##  1-sample proportions test without continuity correction
## 
## data:  X out of n, null probability 0.3
## X-squared = 2.6703, df = 1, p-value = 0.05112
## alternative hypothesis: true p is greater than 0.3
## 95 percent confidence interval:
##  0.2993794 1.0000000
## sample estimates:
##         p 
## 0.4038462

Note that the prop.test function conduct a $\chi^{2}$ test, the value of the test statistics is equal to $z{2}$.

z^2

## [1] 2.67033

2 Case Study: the NHANES dataset

2.1 The data

The NHANES dataset consists of data from the US National Health and Nutrition Examination Study. Information about 76 variables is available for 10000 individuals included in the study. The 10000 indivuduals are considered as the population. In this section we focus on the variable Depressed that consists of Self-reported number of days where participant felt down, depressed or hopeless. (the R object Depressed).

library(NHANES)
data(NHANES)
dim(NHANES)

## [1] 10000    76

#names(NHANES)

Depression in the population

After removing the missing values, the population size is $N=6673$ from which $1009+418=1427$ individuals, $21.4\%$, reported depression in several or most of the day of the study. The true proprtion of depression in the population (the parameter in the population) \[ p=\frac{1427}{6673}=0.2138468. \]

data(NHANES)
#Diabetes1<-na.omit(NHANES$Diabetes)
Depressed1<-na.omit(NHANES$Depressed)
#table(Diabetes1)
table(Depressed1)

## Depressed1
##    None Several    Most 
##    5246    1009     418

length(Depressed1)

## [1] 6673

#sum(Diabetes1)/length(Diabetes1)

We define a new indicator variable for depression which take the value of No is the participant did not report depression and Yes if the participant reported that he/she was depressed several or most of the days,

\[ X_{i}= \left \{ \begin{array}[ll] YYes & \mbox{participant reported depression (several/most)}, \\ No & \mbox{participant did not report depression}. \\ \end{array} \right. \]

The parameter $p=P(X_{i}=Yes)$ be the probability that a participant reported a depression in several or most of the days of the study. The distribution of $X_{i}$ in the population is shown below.

Freq<-c(5246,1427)
Depressed<-c("No","Yes")
NHANES.p<-data.frame(Depressed,Freq)
NHANES.p

##   Depressed Freq
## 1        No 5246
## 2       Yes 1427

Freq.1<-c(5246,1009,418)
Depressed.1<-c("None","Several","Most")
NHANES.p1<-data.frame(Depressed.1,Freq.1)
names(NHANES.p1)<-c("Depressed","Freq")

Visualization

Figure 2 shows the original distribution of depression in the population with three categories. For the analysis presented in this section, the variable was dichotomized, the categories “Several” and “Most” are combined together to the category “Yes” as shown in Figure 3.

plot2a=ggplot(NHANES.p1, aes(x="", y=Freq.1, fill=Depressed.1)) +
  geom_bar(stat="identity", width=1, color="black") +
  coord_polar("y", start=0)+theme_void()+
    geom_text(aes(label = Freq.1),
            position = position_stack(vjust = 0.35))
plot2a

Figure 2: Depression in the NHANES dataset.

plot2=ggplot(NHANES.p, aes(x="", y=Freq, fill=Depressed)) +
  geom_bar(stat="identity", width=1, color="black") +
  coord_polar("y", start=0)+theme_void()+
    geom_text(aes(label = Freq),
            position = position_stack(vjust = 0.25))
plot2

Figure 3: Depression (Yes/No) in the NHANES dataset.

2.2 A random sample of size 100 from the data and parameter estimate for $p$

The sample

Out of the population (of 6673 individuals) we draw a random sample of 100 individuals. In the sample, 19 individuals reported depression and 81 not.

set.seed(1234)
Depressed2<-sample(Depressed1,size=100,replace=FALSE)
length(Depressed2)

## [1] 100

table(Depressed2)

## Depressed2
##    None Several    Most 
##      81      13       6

Parameter estimate and visualization

The sample proportion, the parameter estimate for $p$, is equal to \[ p=\frac{19}{100}=0.19. \]

Figure 4 shows the distribution of depression in the sample.

Freq<-c(81,19)
Depressed<-c("No","Yes")
NHANES.s<-data.frame(Depressed,Freq)

plot2=ggplot(NHANES.s, aes(x="", y=Freq, fill=Depressed)) +
  geom_bar(stat="identity", width=1, color="black") +
  #coord_polar("y", start=0)+theme_void()+
    geom_text(aes(label = Freq),
            position = position_stack(vjust = 0.5))
plot2

Figure 4: Depression in a random sample of 100 individuals.

Hypothesis testing and confidence interval for $p$ in the NHANES dataset

A $95\%$ confidence interval for the propotion of depression in the population is (0.113109,0.266891). Note that this interval is symtric around $\hat{p}$ and it contains the true value of $p=0.2138$.

hat.p<-0.19
SE.p<-sqrt((hat.p*(1-hat.p))/100)
LL<-hat.p-SE.p*1.96
UL<-hat.p+SE.p*1.96
c(LL,UL)

## [1] 0.113109 0.266891

Under the null hypothesis, $p_{0}=0.124. The test statistic can be calculated by

\[ z=\frac{0.19-0.214}{\sqrt{0.124 \times (1-0.214)}{100}}=-0.5851849. \]

Note that $z^{2}=0.3424413$. We will elaborate on this point in the next section.

hat.p<-0.19
null.p<-0.214
SE0<-sqrt(null.p*(1-null.p)/100)
z<-(hat.p-null.p)/SE0
z

## [1] -0.5851849

z^2

## [1] 0.3424413

Since the alternative hypothesis is a two sided hypothesis, $H_{A}:p \ne 0.124$, the p-value is equal to $P(Z \le -0.5851) \times 2 = 0.5584$. For $\alpha=0.05$ , we do not reject the null hypothesis.

(pnorm(z,0,1))*2

## [1] 0.5584234

Hypothesis testing and confidence interval for $p$ in the NHANES dataset using R

As before, for the analysis we use the R function prop.test(). We need to specify the value of the parameter under the null hypothesis. Suppose that we wish to test the null hypothesis $H_{0}:p=0.214$ against a two sided alternative $H_{A}:p \ne 0.214$. In R, we specify the null hypothesis, p=0.214, and for a two sided alternative we use alternative = c(“two sided”). Note that we conduct the inference without a continuity correction using the argument correct=F.

X<-19
n<-100
prop.test(X,n, p=0.214,alternative = c("two.sided"),correct=F,conf.level=0.95)

## 
##  1-sample proportions test without continuity correction
## 
## data:  X out of n, null probability 0.214
## X-squared = 0.34244, df = 1, p-value = 0.5584
## alternative hypothesis: true p is not equal to 0.214
## 95 percent confidence interval:
##  0.1251475 0.2777885
## sample estimates:
##    p 
## 0.19

The $95\%$ confidence interval for the population proportion $p$, (0.1251475 0.2777885), covers the parameter value in the population $p$ and it symmetric around the value of $\hat{p}=0.19$. The two sided p value is equal to 0.5584. Hence, we conclude that there no evidence in the data to reject the null hypothesis that $p=0.124$.

x1<-seq(from=-3,to=3,length=10000)
x2<-seq(from=0,to=2,length=10000)
dx1<-dnorm(x1,0,1)
dx2<-dchisq(x2,1)
par(mfrow=c(1,2))
plot(x1,dx1,type="l",xlab="x",ylab="f(x)")
lines(c(-0.5851,-0.5851),c(0,0.4),lwd=2,col=2)
title("N(0,1)")
plot(x2,dx2,type="l",xlab="x",ylab="f(x)")
lines(c(-0.5851,-0.5851)^2,c(0,28),lwd=2,col=2)
title("chi-square_(1)")

Figure 5: Chi-square_(1) and N(0,1) distributions.

3 Inference for proportion in two populations

3.1 Sampling distribution of the difference of two proportions

Just as inference can be done for the difference of two population means, conclusions can also be drawn about the difference of two population proportions: $p_{1} - p_{2}$. The normal model can be applied to $\hat{p}_{1}-\hat{p}_{2}$ if the sampling distribution for each sample proportion is nearly normal and if the samples are independent random samples from the relevant populations. The difference $\hat{p}_{1}-\hat{p}_{2}$ tends to follow a normal model when:

Each of the two samples are random samples from a population.
The two samples are independent of each other.
Each sample proportion follows (approximately) a normal model.

This condition is satisfied when

$n_{1} \times \hat{p}_{1}$, $n{1} \times (1-\hat{p}_{1})$, $n_{2} \times \hat{p}_{2}$ and $n_{2} \times (1-\hat{p}_{2})$ are all > 10.

The standard error of the difference in sample proportions is

\[ SE_{p_{1}-p_{2}}=\sqrt{SE_{p_{1}}+SE_{p_{1}}} \] using the results from the previous section we have \[ SE_{p_{1}-p_{2}}=\sqrt{\frac{(p_{1} \times (1-p_{1}))}{n_{1}}+\frac{(p_{2} \times (1-p_{2}))}{n_{2}}}. \] Here, $p_{1}$ and $p_{2}$ are the population proportions, and $n_{1}$ and $n_{2}$ are the two sample sizes. The sampling distribution of the difference $\hat{p}_{1}-\hat{p}_{2}$ is given by \[ \hat{p}_{1}-\hat{p}_{2} \sim N \left ( {p}_{1}-{p}_{2} , SE_{p_{1}-p_{2}} \right ). \]

3.2 Confidence intervals for $p_{1}-p_{2}$

A $(1-\alpha) \times 100 \%$ confidence interval for $p_{1}-p_{2}$ is

\[ \left ( \hat{p}_{1}-\hat{p}_{2} \right ) \pm SE_{p_{1}-p_{2}} \times Z_{\frac{\alpha}{2}}. \]

Example

3.3 Hypothesis testing for $p_{1}-p_{2}$

Hypothesis tests for $p_{1}$ and $p_{2}$ are usually testing the null hypothesis of no difference between $p_{1}$ and $p_{2}$; i.e. \[ \begin{array}{l} H_{0}:p_{1}=p_{2}, \\ H_{0}:p_{1} \ne p_{2}. \end{array} \]

Note that under the null hypothesis, $p_{1}-p_{2}=0$ and the sampling distribution of $\hat{p}_{1}-\hat{p}_{2}$ is a normal distribution

\[ \hat{p}_{1}-\hat{p}_{2} \sim N \left (0 , SE_{p_{1}-p_{2}} \right ). \] Under the null hypothesis $p = p_{1} = p_{2}$ and therefore the $SE_{p_{1}-p_{2}}$ is \[ SE_{p_{1}-p_{2}}=\sqrt{\frac{(p_{1} \times (1-p_{1}))}{n_{1}}+\frac{(p_{2} \times (1-p_{2}))}{n_{2}}}=\sqrt{ (p \times (1-p)) \left (\frac{1}{n_{1}}+\frac{1}{n_{2}} \right )}. \]

Since $p$ is unknown, an estimate is used to compute the standard error of $\hat{p}_{1}-\hat{p}_{2}$. $p$ can be estimated by $\hat{p}$ the weighted average of the sample proportions $\hat{p}_{1}$ and $\hat{p}_{2}$, that is

\[ \hat{p}=\frac{n_{1}\hat{p_{1}}+n_{2}\hat{p_{2}}}{n_{1}+n_{2}}. \]

This pooled proportion $\hat{p}$ is also used to check the success-failure condition. The test statistic $z$ for testing teh above hypotheses equals to

\[ z=\frac{\hat{p}_{1}-\hat{p}_{2}}{\sqrt{ (\hat{p} \times (1-\hat{p})) \left (\frac{1}{n_{1}}+\frac{1}{n_{2}} \right )}}. \]

Under $H_{0}$, $z \sim N(0,1)$.

3.4 Example: the Breast cancer study

The use of screening mammograms for breast cancer has been controversial for decades because the overall benefit on breast cancer mortality is uncertain. Several large randomized studies have been conducted in an attempt to estimate the effect of mammogram screening. A 30-year study to investigate the effectiveness of mammograms versus a standard non-mammogram breast cancer exam was conducted in Canada with 89,835 female participants.12 During a 5-year screening period, each woman was randomized to either receive annual mammograms or standard physical exams for breast cancer. During the 25 years following the screening period, each woman was screened for breast cancer according to the standard of care at her health care center. At the end of the 25 year follow-up period, 1,005 women died from breast cancer. The results are shown in the table below.

resp<-as.factor(c(rep(0,500),rep(1,44425),rep(0,505),rep(1 ,44405)))
trt<-as.factor(c(rep(1,500),rep(1,44425),rep(2,505),rep(2,44405)))
BC1<-table(trt,resp)
rownames(BC1)=c("Mammogram","Control")
colnames(BC1)=c("Yes","No")
BC1

##            resp
## trt           Yes    No
##   Mammogram   500 44425
##   Control     505 44405

Figure 6 shows the percentage of death from breast cancer (Yes/No) by group and for the pooled sample.

x<-seq(from=-3,to=3,length=1000)

Freq<-c(500,44425,505,44405,1005,88830)
tot<-500+44425+505+44405
prop<-(Freq/c(44925,44925,44910,44910,tot,tot))*100
barplot(prop, width = 1, space = NULL,col=c(2,3,2,3,2,3),names=c("yes","No","yes","No","yes","No"),
        ylab="% death from breast cancer",ylim=c(0,105))
text(1,102,"Mammogram",cex=1)
text(3.5,102,"Control",cex=1)
text(6,102,"Pooled sample",cex=1)
text(0.75,10,"11.12%")
text(3.25,10,"11.24%")
text(5.5,10,"11.18%")

Figure 6: Percentage of death from breast cancer (Yes/No).

Estimation of the pool proportion

In the first step, we estimate the sample proportions, $\hat{p_{1}}$ and $\hat{p_{2}}$:

p1<-(BC1[1,1])/(BC1[1,1]+BC1[1,2])
p2<-(BC1[2,1])/(BC1[2,1]+BC1[2,2])
p1

## [1] 0.01112966

p2

## [1] 0.01124471

and the pool proportion $\hat{p}$:

p.pool<-(BC1[1,1]+BC1[2,1])/(BC1[1,1]+BC1[1,2]+BC1[2,1]+BC1[2,2])
p.pool

## [1] 0.01118718

The standard error of the proportion difference (Under $H_{0}$) is estimated by

\[ \sqrt{ (\hat{p} \times (1-\hat{p})) \left (\frac{1}{n_{1}}+\frac{1}{n_{2}} \right )}. \]

n1<-BC1[1,1]+BC1[1,2]
#n1
n2<-BC1[2,1]+BC1[2,2]
#n2
SE.pool<-sqrt((p.pool*(1-p.pool))*(1/n1+1/n2))
SE.pool

## [1] 0.000701818

Inference

The null hypothesis that we wish to test states that the probability of a breast cancer death is the same for the women in the two groups. If group 1 represents the mammogram group and group 2 the control group,we formulate the hypotheses as

\[ \begin{array}[l] HH_{0} : p_{1} = p_{2} , \\ H_{A} : p_{1} \ne p_{2}. \\ \end{array} \]

Let $\alpha= 0.05$. We calculate the test statistic $z$:

\[ z=\frac{0.01112966-0.01124471}{0.000701818}=-0.163933. \]

#p1
#p2
#p1-p2
z<-(p1-p2)/SE.pool
z

## [1] -0.163933

The two-sided p-value is $P|Z| \ge -0.1639 = 0.8697$, which is greater than 0.05. In Figure 7, it is equal to the areas to the left and the right of the values $z=-0.163933$ and $-z=0.163933$. Thus, we conclude that there is insufficient evidence to reject the null hypothesis; the observed difference in breast cancer death rates is reasonably explained by chance.

x<-seq(from=-3,to=3,length=1000)
dx<-dnorm(x,0,1)
plot(x,dx,type="l",ylab="f(x)=N(0,1)")
lines(c(z,z),c(0,0.4),lwd=3,col=2)
lines(abs(c(z,z)),c(0,0.4),lwd=3,col=2)
text(-1,0.05,"0.4348919")
text(1,0.05,"0.4348919")

Figure 7: Density of N(0,1) and the test statistic.

(1-pnorm(abs(z),0,1))*2

## [1] 0.8697839

Confidence interval for $p_{1}-p_{2}$

For the mammogram group, $\hat{p}_{1} = 0:01113$, $n_{1}\hat{p}_{1} = 0.1113 \times 44925=500$ and $n_{1}(1 - \hat{p}_{1}) = 39925$. It is easy to show that the success failure condition is holds for the control group as well. The point estimate for the difference in the probability of death, $\hat{p}_{1}-\hat{p}_{2}$ is

#p1
#p2
p1-p2

## [1] -0.0001150511

The standard error for the estimated difference uses the individual estimates of the probability of a death:

\[ SE_{p_{1}-p_{2}}=\sqrt{\frac{(p_{1} \times (1-p_{1}))}{n_{1}}+\frac{(p_{2} \times (1-p_{2}))}{n_{2}}}. \]

In our example, $SE_{p_{1}-p_{2}}=0.0007$.

SE.1<-(p1*(1-p1))/n1
SE.2<-(p2*(1-p2))/n2
SE.12<-sqrt(SE.1+SE.2)
SE.12

## [1] 0.0007018185

For confidence level of $95\%$, $Z_{\frac{\alpha}{2}}=1.96$,

z.val<-qnorm(0.975,0,1)
z.val

## [1] 1.959964

and the $95\%$ confidence interval is given by

LL<-(p1-p2)-SE.12*z.val
UL<-(p1-p2)+SE.12*z.val
c(LL,UL)

## [1] -0.001490590  0.001260488

With $95\%$ confidence, the difference in the probability of death is between $-0.15\%$ and $0.13\%$. As expected from the large p-value, the confidence interval contains the null value 0.

3.4 Analysis of the breast cancer data using R

The analysis discussed above can be conducted using the R function prop.test(). As shown below, the function uses a $\chi^{2}$ test with one degree of freedom and therefore the test statistic is equal to $z^{2}=-0.163933^{2}=0.026874$, Note that the p-value is the same as before.

prop.test(BC1,correct=F)

## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  BC1
## X-squared = 0.026874, df = 1, p-value = 0.8698
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.001490590  0.001260488
## sample estimates:
##     prop 1     prop 2 
## 0.01112966 0.01124471

3. Analysis of $2 \times 2$ tables

3.1 Example: The breast Cancer study

A $2 \times 2$ contingency table

The comparison of the proportion of breast cancer deaths between the two groups can also be approached using a two-way contingency table, which contains counts for combinations of outcomes for two variables. The results for the mammogram study in this format are shown in The table below.

resp<-as.factor(c(rep(0,500),rep(1,44425),rep(0,505),rep(1 ,44405)))
trt<-as.factor(c(rep(1,500),rep(1,44425),rep(2,505),rep(2,44405)))
#cbind(resp,trt)
BC1<-table(trt,resp)
#BC1
rownames(BC1)=c("Mammogram","Control")
colnames(BC1)=c("Yes","No")
BC1

##            resp
## trt           Yes    No
##   Mammogram   500 44425
##   Control     505 44405

Previously, the main question of interest was stated as, “Is there evidence of a difference in the proportion of breast cancer deaths between the two screening groups?” If the probability of a death from breast cancer does not depend the method of screening, then screening method and outcome are independent. Thus, the question can be re-phrased: “Is there evidence that screening method is associated with outcome?” Hypothesis testing in a two-way table assesses whether the two variables of interest are associated (i.e., not independent). The approach can be applied to settings with two or more groups and for responses that have two or more categories. The observed number of counts in each table cell are compared to the number of expected counts, where the expected counts are calculated under the assumption that the null hypothesis of no association is true. A $\chi^{2}$ test of significance is based on the differences between observed and expected values in the cells.

Vizualistion

Figure 8 shows the percentage of death in each group. In the sample, $1.1130\%$ and $1.1245\%$ died in the Mammogram,Control groups, respectively.

resp<-c(500,44425,505,44405)
Group<-c("Mammogram","Mammogram","Control","Control")
Death<-c("Yes","No","Yes","No")
tot<-c(44925,44925,44910,44910)
prop<-round(resp/tot,6)*100
BC2<-data.frame(Group,Death,resp,tot,prop)
BC2

##       Group Death  resp   tot    prop
## 1 Mammogram   Yes   500 44925  1.1130
## 2 Mammogram    No 44425 44925 98.8870
## 3   Control   Yes   505 44910  1.1245
## 4   Control    No 44405 44910 98.8755

plot2=ggplot(BC2, aes(x="", y=prop, fill=Death)) +
  geom_bar(stat="identity", width=1, color="black") +
  #coord_polar("y", start=0)+theme_void()+
  facet_wrap(~as.factor(Group))+
  geom_text(aes(label = prop),
            position = position_stack(vjust = 0.5))
plot2

Figure 8: Barplot for the precentage of death from breast cancer by group.

Expected counts

If type of breast cancer screening had no effect on outcome in the mammogram data, what would the expected results be?

Recall that if two events $A$ and $B$ are independent, then

\[ P (A | B) = P(A) \times P (B) \] Let $A$ represents assignment to the mammogram group and $B$ the event of death from breast cancer. Let $O_{ij}$ be the observed count in the $ij$th cell. For the breat cancer example, let us focus on the combination Mammogram/Yes. The observed count $O_{11}$ is equal to 500.

xx<-chisq.test(BC1,correct=F)
#xx
xx$observed

##            resp
## trt           Yes    No
##   Mammogram   500 44425
##   Control     505 44405

Under independence, the number of individuals out of the total counts that are expected to be in the $ij$th cell is equal to

\[ \mbox{Expected}_{ij}=\frac{\mbox{row}\;i \;\mbox{total} \times \mbox{column}\;j \;\mbox{total}}{\mbox{total count in the table}} \]

For the cell Mammogram/Yes, the (1,1) cell, we have

\[ \frac{1005 \times 44,925}{89835}=502.5839=E_{11}. \]

Here, $E_{11}$ is the expected count, under the independence hypothesis, in the $(1,1)$th cell. For all the cells in the table we have

xx$expected

##            resp
## trt              Yes       No
##   Mammogram 502.5839 44422.42
##   Control   502.4161 44407.58

The residuals and standaried residuals

WE difense the cell’s residual $e_{ij}$ by

\[ e_{ij}=O_{ij}-{E_{ij}}. \]

For the cell Mammogram/Yes, $e_{11}=O_{11}-E_{11}=500-502.6=-2.6$.

e11<-(xx$observed-xx$expected)[1,1]
e11

## [1] -2.583904

The standardized residuals for the $ij$th cell, $r_{ij}$, is given by

\[ r_{ij}=\frac{O_{ij}-{E_{ij}}}{\sqrt{E_{ij}}}=\frac{e_{ij}}{\sqrt{E_{ij}}}. \]

For the cell Mammogram/Yes

\[ r_{11}=\frac{e_{11}}{\sqrt{E_{11}}}=\frac{-2.6}{\sqrt{502.6}}=-0.1152583. \]

e11/sqrt(xx$expected[1,1])

## [1] -0.1152583

For all cells in the table we have

xx$residuals

##            resp
## trt                 Yes          No
##   Mammogram -0.11525826  0.01225957
##   Control    0.11527751 -0.01226162

a<-as.vector(xx$residuals)
a

## [1] -0.11525826  0.11527751  0.01225957 -0.01226162

The $\chi^{2}$ statistic

Previously, test statistics have been constructed by calculating the difference between a point estimate and a null value, then dividing by the standard error of the point estimate to standardize the difference. The $chi^{2}$ statistic is based on a different idea. In each cell of a table, the difference observed - expected ($O_{ij}-E_{ij}$) is a measure of the discrepancy between what was observed in the data and what should have been observed under the null hypothesis of no association. If the row and colum variables are highly associated, that difference will be large. Two adjustments are made to the differences before the final statistic is calculated. First, since both positive and negative differences suggest a lack of independence, the differences are squared to remove the effect of the sign. Second, cells with larger counts may have larger discrepancies by chance alone, so the squared differences in each cell are scaled by the number expected in the cell under the hypothesis of independence. The final $\chi^{2}$ statistic is the sum of these standardized squared differences, where the sum has one term for each cell in the table. The $\chi^{2}$ test statistic is calculated as

\[ \chi^{2}=\sum_{ij} \left ( \frac{\mbox{Observed count}_{ij}-{\mbox{Expected count}_{ij}}}{\sqrt{\mbox{Expected count}_{ij}}} \right )^{2}, \]

\[ \chi^{2}=\sum_{ij} \left ( \frac{O_{ij}-{E_{ij}}}{\sqrt{E_{ij}}} \right )^{2}=\sum_{ij}r^{2}_{ij}. \]

For the Breast cancer example the standardized residuals are

## [1] -0.11525826  0.11527751  0.01225957 -0.01226162

which leads to the $\chi^{2}$ statistic

\[ \chi^{2}=(-0.11525826)^{2}+(0.11527751)^2+(0.01225957)^2+(-0.01226162)^2=0.026874. \]

z<-sum(a^2)
z

## [1] 0.02687401

Inference

The chi-square distribution is often used with data and statistics that are positive and right skewed. The distribution is characterized by a single parameter, the degrees of freedom. For a $I \times J$ table, the distribution of the test statistic is given by \[ \chi^{2} \sim \chi^{2}_{(I-1) \times (J-1)}. \] For a $2 \times 2$ table, $(I-1) \times (J-1) = (2-1) \times (2-1)=1$ and therefore $\chi^{2} \sim \chi^{2}_{1}$.

Figure 9 demonstrates three general properties of chi-square distributions as the degrees of freedom increases: the distribution becomes more symmetric, the center moves to the right, and the variability increases.

x1<-seq(from=0,to=25,length=1000)
dx1<-dchisq(x1,1)
dx2<-dchisq(x1,2)
dx3<-dchisq(x1,8)
plot(x1,dx1,type="l",xlab="x",ylab="f(x)",ylim=c(0,1))
lines(x1,dx2,col=2)
lines(x1,dx3,col=3)
legend(20,0.8,c("df=1","df=2","df=8"),lty=c(1,1,1),col=c(1,2,3))

Figure 9: Densities for chi-square distributions.

For the breast cancer example, the test statistic is equal to 0.02. The p value is teh area to the right of the test statistic in Figure 10 and it is equal to 0.8650. Hence, we cannot reject the null hypotehsis and conclude that the data do not provide convincing evidence of an association between screening group and breast cancer death.

x<-seq(from=0,to=0.5,length=1000)
dx<-dchisq(x,1)
plot(x,dx,type="l",xlab="x",ylab="f(x)")
lines(c(z,z),c(0,18),col=2,lwd=3)

Figure 10: chi-square distribution with df=1 and the test statistic.

p.val<-1-pchisq(z,1)
p.val

## [1] 0.8697839

Implemention in R

The analysis described above can be conducted in R using the function chisq.test(). The output that was presented in previous sections (expected counts, residuals etc.) are a part of R object that contain the results obtained using chisq.test().

xx<-chisq.test(BC1,correct=F)
xx

## 
##  Pearson's Chi-squared test
## 
## data:  BC1
## X-squared = 0.026874, df = 1, p-value = 0.8698

3.2 Example: the HIV study

The data

If a newborn is HIV+, should he or she be treated with nevirapine (NVP) or a more expensive drug, lopinarvir (LPV)? In this setting, success means preventing virologic failure; i.e., growth of the virus. A randomized study was conducted to assess whether there is an association between treatment and outcome.In total 14 Of the 147 children administered NVP, about 41% experienced virologic failure; of the 140 children administered LPV, about $19\%$ experienced virologic failure. The table below shows the observed counts.

resp<-as.factor(c(rep(0,60),rep(1,27),rep(0,87),rep(1 ,113)))
trt<-as.factor(c(rep(1,60),rep(1,27),rep(2,87),rep(2,113)))
#cbind(resp,trt)
HIV1<-table(trt,resp)
rownames(HIV1)=c("Failure","Stable")
colnames(HIV1)=c("NVP","LPV")
HIV1

##          resp
## trt       NVP LPV
##   Failure  60  27
##   Stable   87 113

Visualiztion

Figure 11 shows the proportion of treatment outcome (Failure/Stable) by treatment group. The proportion of failure are 0.408 and 0.193 for the NVP and LPV groups, respectively.

resp<-c(60,27,87,113)
Trt<-c("NVP","LPV","NVP","LPV")
Outcome<-c("Failure","Failure","Stable","Stable")
tot<-c(147,140,147,140)
prop<-round(resp/tot,3)
HIV2<-data.frame(Trt,Outcome,resp,tot,prop)
HIV2

##   Trt Outcome resp tot  prop
## 1 NVP Failure   60 147 0.408
## 2 LPV Failure   27 140 0.193
## 3 NVP  Stable   87 147 0.592
## 4 LPV  Stable  113 140 0.807

plot2=ggplot(HIV2, aes(x="", y=prop, fill=Outcome)) +
  geom_bar(stat="identity", width=1, color="black") +
  coord_polar("y", start=0)+theme_void()+
  facet_wrap(~as.factor(Trt))+
  geom_text(aes(label = prop),
            position = position_stack(vjust = 0.5))
plot2

Figure 11: Proportion of treatment outcome (Failure/Stable) by treatment group.

Expected counts

The expected count under the null hypothesis are shonw below. For example $E_{11}=44.56098$.

xx<-chisq.test(HIV1,correct=F)
#xx
#xx$observed
xx$expected

##          resp
## trt             NVP      LPV
##   Failure  44.56098 42.43902
##   Stable  102.43902 97.56098

Residulas

For the cell Virologic Failure/NVP the residual is equal to

\[ e_{11}=60-44.6 \]

and the standerized residual

\[ r_{11}=\frac{60-44.6}{\sqrt{44.6}}=2.312824. \]

For all cell in the table we have

#e1<-(xx$observed-xx$expected)[1,1]
#e1
#e1/sqrt(xx$expected[1,1])
xx$residuals

##          resp
## trt             NVP       LPV
##   Failure  2.312824 -2.369939
##   Stable  -1.525412  1.563082

Inference

The $\chi^{2}$ statistic is equal to 15.73587 and the p-value, based on a $\chi^{2}_{1}$ is equal to 7.282988e-05. p-value=$P(\chi^{2} > 15.7358)$ and in Figure 12, it is the area to the right of the test statistic.

a<-as.vector(xx$residuals)
z<-sum(a^2)
z

## [1] 15.73587

x<-seq(from=0,to=20,length=1000)
dx<-dchisq(x,1)
plot(x,dx,type="l",xlab="x",ylab="f(x)")
lines(c(z,z),c(0,2.5),col=2,lwd=3)

Figure 12: Chi-square distribution with df=1 and the observed test statistic (red vertical line).

p.val<-1-pchisq(z,1)
p.val

## [1] 7.282988e-05

Implemention in R

The analysis’ R output is shown below. With p value smaller than 0.05, we reject the null hypothesis and conclude that the data provides convincing evidence of an association between treatment and outcome.

xx<-chisq.test(HIV1,correct=F)
xx

## 
##  Pearson's Chi-squared test
## 
## data:  HIV1
## X-squared = 15.736, df = 1, p-value = 7.283e-05

4. Analysis of $I \times J$ tables

4.1 Example: the FAMuSS study

The Functional polymorphisms Associated with human Muscle Size and Strength study (FAMuSS) measured a variety of demographic, phenotypic, and genetic characteristics for about 1300 participants. Data from the study have been used in a number of subsequent studies, such as one examining the relationship between muscle strength and genotype at a location on the ACTN3 gene.The famuss dataset is a subset of the data for 595 participants.

resp<-c(16,6,5,21,18,16,125,216,126,4,10,9,7,11,5)
sum(resp)

## [1] 595

For the analysis presented below we are interested in the association between the race and the genotype. The table below shows the distribution of race across the level of Genotype at the location r577x in the ACTN3 gene (CC, CT, TT).

aa<-matrix(resp,nrow=5,ncol=3,byrow=T)
rownames(aa)=c("African Am.","Asian","Cauca.","Hispanic","Other")
colnames(aa)=c("CC","CT","TT")
aa

##              CC  CT  TT
## African Am.  16   6   5
## Asian        21  18  16
## Cauca.      125 216 126
## Hispanic      4  10   9
## Other         7  11   5

Visualization

The research question of interest: Is there evidence of an association between genotype and race? The proportion of each genotype across the race group is shown in Figure 13.

resp<-c(16,6,5,21,18,16,125,216,126,4,10,9,7,11,5)
Genotype<-c("CC","CT","TT","CC","CT","TT","CC","CT","TT","CC","CT","TT",
            "CC","CT","TT")
Race<-c("African Am.","Asian","Cauca.","Hispanic","Other",
        "African Am.","Asian","Cauca.","Hispanic","Other",
         "African Am.","Asian","Cauca.","Hispanic","Other")
tot<-c(27,27,27,55,55,55,467,467,467,23,23,23,24,24,24)
prop<-round(resp/tot,3)
FAMuSS<-data.frame(Race,Genotype,resp,tot,prop)

plot31a=ggplot(FAMuSS, aes(x=Genotype, y=prop, fill=Genotype)) + 
   geom_bar(stat = "identity", color="black")+
   geom_text(aes(label = prop),vjust=-0.15)+
   facet_wrap(~as.factor(Race))+
   theme_void()
plot31a

Figure 13: Distribution of genotype across race groups.

The expected counts

The expected counts under the null hypothesis that there is no association between genotype and race are shown below. For example for the cell Asian/CT, $E_{2,2}=24.12605$ while $O_{2,2}=18$.

xx<-chisq.test(aa,correct=T)
#xx
xx$expected

##                     CC        CT         TT
## African Am.   7.850420  11.84370   7.305882
## Asian        15.991597  24.12605  14.882353
## Cauca.      135.783193 204.85210 126.364706
## Hispanic      6.687395  10.08908   6.223529
## Other         6.687395  10.08908   6.223529

Residuals and test statistic

For the cell Asian/CT, $e_{2,2}=18-24.12605= -6.12605 $, $r_{ij}=-6.12605/ \sqrt{24.12605}=-1.24720$. The $\chi^{2}$ statistic is equal to 19.4001.

xx$residuals

##                     CC          CT          TT
## African Am.  2.9086319 -1.69802497 -0.85310170
## Asian        1.2524298 -1.24720387  0.28971360
## Cauca.      -0.9253891  0.77888407 -0.03244366
## Hispanic    -1.0392093 -0.02804356  1.11294757
## Other        0.1208836  0.28678513 -0.49045147

z<-xx$statistic
z

## X-squared 
##   19.4001

Inference

For the FAMuSS study, since $I=5$ and $J=3$, the $\chi^{2}$ statistic is distributed with $(5-1) \times (3-1)= 8$ degrees of freedom. The p-value, as shown in Figure 13, is equal to 0.01286033 (the area to the right of the test statistic). Thus, for $\alpha=0.05$, there is sufficient evidence to reject the null hypothesis of independence between race and genotype.

x<-seq(from=0,to=25,length=1000)
dx<-dchisq(x,8)
plot(x,dx,type="l",xlab="x",ylab="f(x)")
lines(c(z,z),c(0,0.1),col=2,lwd=3)

Figure 14: chi-square distribution with df=8 and the test statistic.

p.val1<-1-pchisq(z,8)
p.val1

##  X-squared 
## 0.01286033

Implementaion in R

The analysis presnted above can be conducted in R using the function chisq.test(). The output is shown below.

xx<-chisq.test(aa,correct=T)
xx

## 
##  Pearson's Chi-squared test
## 
## data:  aa
## X-squared = 19.4, df = 8, p-value = 0.01286

5. Fisher’s exact test

Example 1: Fisher’s experimant

According to Alen Agresti’s book and wikipedia, the experiment involved a woman (Muriel Bristol) who claimed she could distinguish between cups of tea where milk was added first and those where tea was added first. Her future husband, William Roach, suggested a test where eight cups of tea were prepared, four with milk added first and four with tea added first, and they were randomly presented to her. The resukts are shown below

data <- data.frame(
  "Milk" = c(3, 1),
  "Tea " = c(1, 3),
  row.names = c("Muriel guess: 1st Milk", "Muriel guess: 1st Tea"),
  stringsAsFactors = FALSE
)
colnames(data) <- c(" Poured 1st Milk", "Poured 1st Tea")

data

##                         Poured 1st Milk Poured 1st Tea
## Muriel guess: 1st Milk                3              1
## Muriel guess: 1st Tea                 1              3

The hypotheses

Fisher’s exact test using R

test <- fisher.test(data,alternative = "greater")
test

## 
##  Fisher's Exact Test for Count Data
## 
## data:  data
## p-value = 0.2429
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  0.3135693       Inf
## sample estimates:
## odds ratio 
##   6.408309

Example 2

data <- data.frame(
  "Cured." = c(13, 3),
  "Uncured" = c(4, 9),
  row.names = c("Fecal Infusion", "Vancomycin"),
  stringsAsFactors = FALSE
)
colnames(data) <- c("Cured", "Unured")

data

##                Cured Unured
## Fecal Infusion    13      4
## Vancomycin         3      9

test <- fisher.test(data)
test

## 
##  Fisher's Exact Test for Count Data
## 
## data:  data
## p-value = 0.00953
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   1.373866 78.811505
## sample estimates:
## odds ratio 
##   8.848725

6. Chi-square tests for the fit of a distribution

1. Inference for a single proportion

1.1 The Wolchok’s study

1.2 Binomial population

1.3 Sampling distribution of \(\hat{p}\)

1.4 Confidence intervals for a proportion

Example: Confifence interval for \(p\) in Wolchok’s study

Example: Confifence interval for \(p\) in Wolchok’s study using R

1.5 Hypothesis testing for a proportion in one population

Example: Hypothesis testing for \(p\) in Wolchok’s study

1.6 Hypothesis testing for \(p\) in Wolchok’s study using R

2 Case Study: the NHANES dataset

2.1 The data

Depression in the population

Visualization

2.2 A random sample of size 100 from the data and parameter estimate for \(p\)

The sample

Parameter estimate and visualization

Hypothesis testing and confidence interval for \(p\) in the NHANES dataset

Hypothesis testing and confidence interval for \(p\) in the NHANES dataset using R

3 Inference for proportion in two populations

3.1 Sampling distribution of the difference of two proportions

3.2 Confidence intervals for \(p_{1}-p_{2}\)

Example

3.3 Hypothesis testing for \(p_{1}-p_{2}\)

3.4 Example: the Breast cancer study

Estimation of the pool proportion

Inference

Confidence interval for \(p_{1}-p_{2}\)

3.4 Analysis of the breast cancer data using R

3. Analysis of \(2 \times 2\) tables

3.1 Example: The breast Cancer study

A \(2 \times 2\) contingency table

Vizualistion

Expected counts

The residuals and standaried residuals

The \(\chi^{2}\) statistic

Inference

Implemention in R

3.2 Example: the HIV study

The data

Visualiztion

Expected counts

Residulas

Inference

Implemention in R

4. Analysis of \(I \times J\) tables

4.1 Example: the FAMuSS study

Visualization

The expected counts

Residuals and test statistic

Inference

Implementaion in R

5. Fisher’s exact test

Example 1: Fisher’s experimant

The hypotheses

Fisher’s exact test using R

Example 2

6. Chi-square tests for the fit of a distribution

7. Case control studies