26-02-2025 >eR-BioStat
Inference for categorical data
Ziv Shkedy and Thi Huyen Nguyen based on Chapter 8 in the book of Julie Vu and Dave Harrington Introductory Statistics for the Life and Biomedical Sciences (https://www.openintro.org/book/biostat/)
## load/install libraries
.libPaths(c("./Rpackages",.libPaths()))
library(knitr)
library(tidyverse)
library(deSolve)
library(minpack.lm)
library(ggpubr)
library(readxl)
library(gamlss)
library(data.table)
library(grid)
library(png)
library(nlme)
library(gridExtra)
library(mvtnorm)
library(e1071)
library(lattice)
library(ggplot2)
library(dslabs)
library(NHANES)
library(plyr)
library(dplyr)
library(nasaweather)
library(ggplot2)
library(gganimate)
library(av)
library(gifski)
library(foreach)
library("DAAG")
library(DT)
library(TeachingDemos)
library(gridExtra)
Advanced melanoma is an aggressive form of skin cancer that until recently was almost uniformly fatal. In rare instances, a patient’s melanoma stopped progressing or disappeared altogether when the patient’s immune system successfully mounted a response to the cancer. Those observations led to research into therapies that might trigger an immune response in cancer. Some of the most notable successes have been in melanoma, particularly with two new therapies, nivolumab and ipilimumab. A 2013 report in the New England Journal of Medicine by Wolchok et al. reported the results of a study in which patients were treated with both nivolumab and ipilimumab Fifty-three patients were given the new regimens concurrently, and the response to therapy could be evaluated in 52 of the 53. Of the 52 evaluate patients, 21 \((40.3\%)\) experienced a response according to commonly accepted criteria. Figure 1 shows the distribution of the patients in the sample. In previous studies, the proportion of patients responding to one of these agents was \(30\%\) or less. How might one compare the new data to past results?
freq<-c(30,21)
barplot(freq, width = 1, space = NULL,col=c(2,3),names=c("Non responder","Responder"))
Figure 1: Response to therapy in Wolchok’s study.
The data from this study are binomial data, with success defined as a response to therapy. Suppose the number of patients who respond in a study like this is represented by the random variable X, where X is binomial with parameters \(n\) (the number of trials, where each trial is represented by a patient) and \(p\) (the unknown population proportion of response). In the population, \(X\) is a binary variable that takes two values:
\[ X_{i}=\left \{ \begin{array}{c,l,l} 1 & p & subject \;respond \\ 0 & 1-p & elsewhere. \end{array} \right . \] Let \(X\) ve the total number of subjects in the study that responsed to therapy, that is \[ X=\sum_{i=1}^{n}X_{i}. \]
From formulas discussed in Chapter 3, the mean of \(X\) is \(n \times p\) and the standard deviation of \(X\) is \(n \times p \times (1-p)\). In Wolchok’s study, \(X=21\) and the proportion of patients who experianced response to the theraphy is \[ \hat{p}=\frac{X}{n}=\frac{\sum_{i=1}^{n}X_{i}}{n}=\frac{21}{52}=0.404. \]
The sampling distribution for \(\hat{p}\), calculated from a sample of size \(n\) from a population with a success proportion \(p\), is nearly normal when
or
\[ \hat{p} \sim N \left ( p, \sqrt{\frac{(p \times (1-p))}{n}} \right ) \]
When conducting inference, the population proportion \(p\) is unknown. Thus, to construct a confidence interval, the sample proportion \(\hat{p}\) can be substituted for \(p\) to check the success-failure condition and compute the standard error. In a hypothesis test, \(p_{0}\) is substituted for \(p\).
When using the normal approximation to the sampling distribution of \(\hat{p}\), a confidence interval for a proportion has the same structure as a confidence interval for a mean; it is centered at the point estimate, with a margin of error calculated from the standard error and appropriate critical value from $N(0,1). The formula for a \(95\%\) confidence interval is
\[ \hat{p} \pm z_{\frac{\alpha}{2}} \times SE_{\hat{p}}=\hat{p} \pm 1.96 \times \sqrt{\frac{(p \times (1-p))}{n}}. \]
The independence and success-failure assumptions should be checked first. Since the outcome of one patient is unlikely to influence that of other patients, the observations are independent. The success-failure condition is satisfied since \(n \hat{p} = 52 \times 0.404 = 21 > 10\) and \(n \times \hat{p} \times (1-\hat{p}) = 52 \times 0.596 = 31 > 10\). The standard error for \(\hat{p}\) is \[ \sqrt{\frac{(0.404 \times 0.596))}{52}}=0.068. \]
For a \(95\%\) confidence interval \(\frac{\alpha}{2}=0.025\) and the ctritical value from \(N(0,1)\) is equal to 1.96. Therefore, a \(95\%\) confidence intervals is
\[ 0.404 \pm 1.96 \times 0.068 \rightarrow (0.27;0.54) \]
In R, a confidence interval for the proportion can be constructed using the R function prop.test(). The function choses the critical values from a standard normal distrubtion.
library(TeachingDemos)
X<-21
n<-52
prop.test(X,n)
##
## 1-sample proportions test with continuity correction
##
## data: X out of n, null probability 0.5
## X-squared = 1.5577, df = 1, p-value = 0.212
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.2731269 0.5487141
## sample estimates:
## p
## 0.4038462
As mentioned in the output above, the analysis was done with continuity correction (the default). To conduct the same analysis without continuity correction we use the argument correct=F.
library(TeachingDemos)
prop.test(X,n, correct=F)
##
## 1-sample proportions test without continuity correction
##
## data: X out of n, null probability 0.5
## X-squared = 1.9231, df = 1, p-value = 0.1655
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.2815973 0.5393242
## sample estimates:
## p
## 0.4038462
Just as with inference for population means, confidence intervals for population proportions can be used when deciding whether to reject a null hypothesis. It is useful in most settings, however, to calculate the p-value for a test as a measure of the strength of the evidence contradicting the null hypothesis. When using the normal approximation for the distribution of pˆ to conduct a hypothesis test, one should always verify that \(\hat{p}\) is nearly normal under \(H_{0}\) by checking the independence and success-failure conditions. Since a hypothesis test is based on the distribution of the test statistic under the null hypothesis, the success-failure condition is checked using the null proportion \(p_{0}\), not the estimate \(\hat{p}\). According to the normal approximation to the binomial distribution, the number of successes in \(n\) trials is normally distributed with mean \(n \times p_{0}\) and standard deviation \(n \times p_{0} (1-p{0})\). This approximation is valid when \(n \times p_{0}\) and \(n \times (1-p_{0})\) are both at least 10.
Under the null hypothesis \(H_{0}:p=p_{0}\), the sample proportion \(\hat{p}\) follows a normal distribution \[ \hat{p} \sim N \left ( p_{0}, \sqrt{\frac{(p_{0} \times (1-p_{0}))}{n}} \right ). \]
As before, the test statistics is defined as
\[ z=\frac{\mbox{point estimate - parameter under the null}}{SE}. \]
For a sample proportion the test statistics given by
\[ z=\frac{\hat{p}-p_{0}}{\sqrt{\frac{(p_{0} \times (1-p_{0}))}{n}}}. \]
Suppose that we wish to test the null hypothesis \(H_{0}:p=0.3\) aginast a one sided alternative, \(H_{1}:p > 0.3\). In the sample, \(\hat{p}=0.404\). The test statistics in this case is equal to
X<-21
n<-52
pbar = 21/52
prop = 0.3
n = 52
z=(pbar-prop)/sqrt((prop*(1-prop))/n)
z
## [1] 1.634114
For significant level of \(5\%\) (\(\alpha=0.05\)), the critical value is
alpha=0.05
crit.point=qnorm(1-alpha)#p=0.05 one tailed (upper)
crit.point
## [1] 1.644854
Since \(1.6365<1.6448\) we do not reject the null hypothesis. Alternatively, we reach the same decision if we calculate the p-value
pval=1- pnorm(z,0,1)#p=0.05 one tailed (upper)
pval
## [1] 0.05111742
The analysis in the previous sestion can be conducted using the R function prop.test(). We need to specify the value of the parameter under the null hypothesis p=0.3 and for a one sided alternative we we alternative = c(“greater”). Note that we conduct the inference without a continuity correction using the argument correct=F.
X<-21
n<-52
prop.test(X, n, p=0.3,alternative = c("greater"),correct=F)
##
## 1-sample proportions test without continuity correction
##
## data: X out of n, null probability 0.3
## X-squared = 2.6703, df = 1, p-value = 0.05112
## alternative hypothesis: true p is greater than 0.3
## 95 percent confidence interval:
## 0.2993794 1.0000000
## sample estimates:
## p
## 0.4038462
Note that the prop.test function conduct a \(\chi^{2}\) test, the value of the test statistics is equal to \(z{2}\).
z^2
## [1] 2.67033
The NHANES dataset consists of data from the US National Health and Nutrition Examination Study. Information about 76 variables is available for 10000 individuals included in the study. The 10000 indivuduals are considered as the population. In this section we focus on the variable Depressed that consists of Self-reported number of days where participant felt down, depressed or hopeless. (the R object Depressed).
library(NHANES)
data(NHANES)
dim(NHANES)
## [1] 10000 76
#names(NHANES)
After removing the missing values, the population size is \(N=6673\) from which \(1009+418=1427\) individuals, \(21.4\%\), reported depression in several or most of the day of the study. The true proprtion of depression in the population (the parameter in the population) \[ p=\frac{1427}{6673}=0.2138468. \]
data(NHANES)
#Diabetes1<-na.omit(NHANES$Diabetes)
Depressed1<-na.omit(NHANES$Depressed)
#table(Diabetes1)
table(Depressed1)
## Depressed1
## None Several Most
## 5246 1009 418
length(Depressed1)
## [1] 6673
#sum(Diabetes1)/length(Diabetes1)
We define a new indicator variable for depression which take the value of No is the participant did not report depression and Yes if the participant reported that he/she was depressed several or most of the days,
\[ X_{i}= \left \{ \begin{array}[ll] YYes & \mbox{participant reported depression (several/most)}, \\ No & \mbox{participant did not report depression}. \\ \end{array} \right. \]
The parameter \(p=P(X_{i}=Yes)\) be the probability that a participant reported a depression in several or most of the days of the study. The distribution of \(X_{i}\) in the population is shown below.
Freq<-c(5246,1427)
Depressed<-c("No","Yes")
NHANES.p<-data.frame(Depressed,Freq)
NHANES.p
## Depressed Freq
## 1 No 5246
## 2 Yes 1427
Freq.1<-c(5246,1009,418)
Depressed.1<-c("None","Several","Most")
NHANES.p1<-data.frame(Depressed.1,Freq.1)
names(NHANES.p1)<-c("Depressed","Freq")
Figure 2 shows the original distribution of depression in the population with three categories. For the analysis presented in this section, the variable was dichotomized, the categories “Several” and “Most” are combined together to the category “Yes” as shown in Figure 3.
plot2a=ggplot(NHANES.p1, aes(x="", y=Freq.1, fill=Depressed.1)) +
geom_bar(stat="identity", width=1, color="black") +
coord_polar("y", start=0)+theme_void()+
geom_text(aes(label = Freq.1),
position = position_stack(vjust = 0.35))
plot2a
Figure 2: Depression in the NHANES dataset.
plot2=ggplot(NHANES.p, aes(x="", y=Freq, fill=Depressed)) +
geom_bar(stat="identity", width=1, color="black") +
coord_polar("y", start=0)+theme_void()+
geom_text(aes(label = Freq),
position = position_stack(vjust = 0.25))
plot2
Figure 3: Depression (Yes/No) in the NHANES dataset.
Out of the population (of 6673 individuals) we draw a random sample of 100 individuals. In the sample, 19 individuals reported depression and 81 not.
set.seed(1234)
Depressed2<-sample(Depressed1,size=100,replace=FALSE)
length(Depressed2)
## [1] 100
table(Depressed2)
## Depressed2
## None Several Most
## 81 13 6
The sample proportion, the parameter estimate for \(p\), is equal to \[ p=\frac{19}{100}=0.19. \]
Figure 4 shows the distribution of depression in the sample.
Freq<-c(81,19)
Depressed<-c("No","Yes")
NHANES.s<-data.frame(Depressed,Freq)
plot2=ggplot(NHANES.s, aes(x="", y=Freq, fill=Depressed)) +
geom_bar(stat="identity", width=1, color="black") +
#coord_polar("y", start=0)+theme_void()+
geom_text(aes(label = Freq),
position = position_stack(vjust = 0.5))
plot2
Figure 4: Depression in a random sample of 100 individuals.
A \(95\%\) confidence interval for the propotion of depression in the population is (0.113109,0.266891). Note that this interval is symtric around \(\hat{p}\) and it contains the true value of \(p=0.2138\).
hat.p<-0.19
SE.p<-sqrt((hat.p*(1-hat.p))/100)
LL<-hat.p-SE.p*1.96
UL<-hat.p+SE.p*1.96
c(LL,UL)
## [1] 0.113109 0.266891
Under the null hypothesis, $p_{0}=0.124. The test statistic can be calculated by
\[ z=\frac{0.19-0.214}{\sqrt{0.124 \times (1-0.214)}{100}}=-0.5851849. \]
Note that \(z^{2}=0.3424413\). We will elaborate on this point in the next section.
hat.p<-0.19
null.p<-0.214
SE0<-sqrt(null.p*(1-null.p)/100)
z<-(hat.p-null.p)/SE0
z
## [1] -0.5851849
z^2
## [1] 0.3424413
Since the alternative hypothesis is a two sided hypothesis, \(H_{A}:p \ne 0.124\), the p-value is equal to \(P(Z \le -0.5851) \times 2 = 0.5584\). For \(\alpha=0.05\) , we do not reject the null hypothesis.
(pnorm(z,0,1))*2
## [1] 0.5584234
As before, for the analysis we use the R function prop.test(). We need to specify the value of the parameter under the null hypothesis. Suppose that we wish to test the null hypothesis \(H_{0}:p=0.214\) against a two sided alternative \(H_{A}:p \ne 0.214\). In R, we specify the null hypothesis, p=0.214, and for a two sided alternative we use alternative = c(“two sided”). Note that we conduct the inference without a continuity correction using the argument correct=F.
X<-19
n<-100
prop.test(X,n, p=0.214,alternative = c("two.sided"),correct=F,conf.level=0.95)
##
## 1-sample proportions test without continuity correction
##
## data: X out of n, null probability 0.214
## X-squared = 0.34244, df = 1, p-value = 0.5584
## alternative hypothesis: true p is not equal to 0.214
## 95 percent confidence interval:
## 0.1251475 0.2777885
## sample estimates:
## p
## 0.19
The \(95\%\) confidence interval for the population proportion \(p\), (0.1251475 0.2777885), covers the parameter value in the population \(p\) and it symmetric around the value of \(\hat{p}=0.19\). The two sided p value is equal to 0.5584. Hence, we conclude that there no evidence in the data to reject the null hypothesis that \(p=0.124\).
x1<-seq(from=-3,to=3,length=10000)
x2<-seq(from=0,to=2,length=10000)
dx1<-dnorm(x1,0,1)
dx2<-dchisq(x2,1)
par(mfrow=c(1,2))
plot(x1,dx1,type="l",xlab="x",ylab="f(x)")
lines(c(-0.5851,-0.5851),c(0,0.4),lwd=2,col=2)
title("N(0,1)")
plot(x2,dx2,type="l",xlab="x",ylab="f(x)")
lines(c(-0.5851,-0.5851)^2,c(0,28),lwd=2,col=2)
title("chi-square_(1)")
Figure 5: Chi-square_(1) and N(0,1) distributions.
Just as inference can be done for the difference of two population means, conclusions can also be drawn about the difference of two population proportions: \(p_{1} - p_{2}\). The normal model can be applied to \(\hat{p}_{1}-\hat{p}_{2}\) if the sampling distribution for each sample proportion is nearly normal and if the samples are independent random samples from the relevant populations. The difference \(\hat{p}_{1}-\hat{p}_{2}\) tends to follow a normal model when:
This condition is satisfied when
The standard error of the difference in sample proportions is
\[ SE_{p_{1}-p_{2}}=\sqrt{SE_{p_{1}}+SE_{p_{1}}} \] using the results from the previous section we have \[ SE_{p_{1}-p_{2}}=\sqrt{\frac{(p_{1} \times (1-p_{1}))}{n_{1}}+\frac{(p_{2} \times (1-p_{2}))}{n_{2}}}. \] Here, \(p_{1}\) and \(p_{2}\) are the population proportions, and \(n_{1}\) and \(n_{2}\) are the two sample sizes. The sampling distribution of the difference \(\hat{p}_{1}-\hat{p}_{2}\) is given by \[ \hat{p}_{1}-\hat{p}_{2} \sim N \left ( {p}_{1}-{p}_{2} , SE_{p_{1}-p_{2}} \right ). \]
A \((1-\alpha) \times 100 \%\) confidence interval for \(p_{1}-p_{2}\) is
\[ \left ( \hat{p}_{1}-\hat{p}_{2} \right ) \pm SE_{p_{1}-p_{2}} \times Z_{\frac{\alpha}{2}}. \]
Hypothesis tests for \(p_{1}\) and \(p_{2}\) are usually testing the null hypothesis of no difference between \(p_{1}\) and \(p_{2}\); i.e. \[ \begin{array}{l} H_{0}:p_{1}=p_{2}, \\ H_{0}:p_{1} \ne p_{2}. \end{array} \]
Note that under the null hypothesis, \(p_{1}-p_{2}=0\) and the sampling distribution of \(\hat{p}_{1}-\hat{p}_{2}\) is a normal distribution
\[ \hat{p}_{1}-\hat{p}_{2} \sim N \left (0 , SE_{p_{1}-p_{2}} \right ). \] Under the null hypothesis \(p = p_{1} = p_{2}\) and therefore the \(SE_{p_{1}-p_{2}}\) is \[ SE_{p_{1}-p_{2}}=\sqrt{\frac{(p_{1} \times (1-p_{1}))}{n_{1}}+\frac{(p_{2} \times (1-p_{2}))}{n_{2}}}=\sqrt{ (p \times (1-p)) \left (\frac{1}{n_{1}}+\frac{1}{n_{2}} \right )}. \]
Since \(p\) is unknown, an estimate is used to compute the standard error of \(\hat{p}_{1}-\hat{p}_{2}\). \(p\) can be estimated by \(\hat{p}\) the weighted average of the sample proportions \(\hat{p}_{1}\) and \(\hat{p}_{2}\), that is
\[ \hat{p}=\frac{n_{1}\hat{p_{1}}+n_{2}\hat{p_{2}}}{n_{1}+n_{2}}. \]
This pooled proportion \(\hat{p}\) is also used to check the success-failure condition. The test statistic \(z\) for testing teh above hypotheses equals to
\[ z=\frac{\hat{p}_{1}-\hat{p}_{2}}{\sqrt{ (\hat{p} \times (1-\hat{p})) \left (\frac{1}{n_{1}}+\frac{1}{n_{2}} \right )}}. \]
Under \(H_{0}\), \(z \sim N(0,1)\).
The use of screening mammograms for breast cancer has been controversial for decades because the overall benefit on breast cancer mortality is uncertain. Several large randomized studies have been conducted in an attempt to estimate the effect of mammogram screening. A 30-year study to investigate the effectiveness of mammograms versus a standard non-mammogram breast cancer exam was conducted in Canada with 89,835 female participants.12 During a 5-year screening period, each woman was randomized to either receive annual mammograms or standard physical exams for breast cancer. During the 25 years following the screening period, each woman was screened for breast cancer according to the standard of care at her health care center. At the end of the 25 year follow-up period, 1,005 women died from breast cancer. The results are shown in the table below.
resp<-as.factor(c(rep(0,500),rep(1,44425),rep(0,505),rep(1 ,44405)))
trt<-as.factor(c(rep(1,500),rep(1,44425),rep(2,505),rep(2,44405)))
BC1<-table(trt,resp)
rownames(BC1)=c("Mammogram","Control")
colnames(BC1)=c("Yes","No")
BC1
## resp
## trt Yes No
## Mammogram 500 44425
## Control 505 44405
Figure 6 shows the percentage of death from breast cancer (Yes/No) by group and for the pooled sample.
x<-seq(from=-3,to=3,length=1000)
Freq<-c(500,44425,505,44405,1005,88830)
tot<-500+44425+505+44405
prop<-(Freq/c(44925,44925,44910,44910,tot,tot))*100
barplot(prop, width = 1, space = NULL,col=c(2,3,2,3,2,3),names=c("yes","No","yes","No","yes","No"),
ylab="% death from breast cancer",ylim=c(0,105))
text(1,102,"Mammogram",cex=1)
text(3.5,102,"Control",cex=1)
text(6,102,"Pooled sample",cex=1)
text(0.75,10,"11.12%")
text(3.25,10,"11.24%")
text(5.5,10,"11.18%")
Figure 6: Percentage of death from breast cancer (Yes/No).
In the first step, we estimate the sample proportions, \(\hat{p_{1}}\) and \(\hat{p_{2}}\):
p1<-(BC1[1,1])/(BC1[1,1]+BC1[1,2])
p2<-(BC1[2,1])/(BC1[2,1]+BC1[2,2])
p1
## [1] 0.01112966
p2
## [1] 0.01124471
and the pool proportion \(\hat{p}\):
p.pool<-(BC1[1,1]+BC1[2,1])/(BC1[1,1]+BC1[1,2]+BC1[2,1]+BC1[2,2])
p.pool
## [1] 0.01118718
The standard error of the proportion difference (Under \(H_{0}\)) is estimated by
\[ \sqrt{ (\hat{p} \times (1-\hat{p})) \left (\frac{1}{n_{1}}+\frac{1}{n_{2}} \right )}. \]
n1<-BC1[1,1]+BC1[1,2]
#n1
n2<-BC1[2,1]+BC1[2,2]
#n2
SE.pool<-sqrt((p.pool*(1-p.pool))*(1/n1+1/n2))
SE.pool
## [1] 0.000701818
The null hypothesis that we wish to test states that the probability of a breast cancer death is the same for the women in the two groups. If group 1 represents the mammogram group and group 2 the control group,we formulate the hypotheses as
\[ \begin{array}[l] HH_{0} : p_{1} = p_{2} , \\ H_{A} : p_{1} \ne p_{2}. \\ \end{array} \]
Let \(\alpha= 0.05\). We calculate the test statistic \(z\):
\[ z=\frac{0.01112966-0.01124471}{0.000701818}=-0.163933. \]
#p1
#p2
#p1-p2
z<-(p1-p2)/SE.pool
z
## [1] -0.163933
The two-sided p-value is \(P|Z| \ge -0.1639 = 0.8697\), which is greater than 0.05. In Figure 7, it is equal to the areas to the left and the right of the values \(z=-0.163933\) and \(-z=0.163933\). Thus, we conclude that there is insufficient evidence to reject the null hypothesis; the observed difference in breast cancer death rates is reasonably explained by chance.
x<-seq(from=-3,to=3,length=1000)
dx<-dnorm(x,0,1)
plot(x,dx,type="l",ylab="f(x)=N(0,1)")
lines(c(z,z),c(0,0.4),lwd=3,col=2)
lines(abs(c(z,z)),c(0,0.4),lwd=3,col=2)
text(-1,0.05,"0.4348919")
text(1,0.05,"0.4348919")
Figure 7: Density of N(0,1) and the test statistic.
(1-pnorm(abs(z),0,1))*2
## [1] 0.8697839
For the mammogram group, \(\hat{p}_{1} = 0:01113\), \(n_{1}\hat{p}_{1} = 0.1113 \times 44925=500\) and \(n_{1}(1 - \hat{p}_{1}) = 39925\). It is easy to show that the success failure condition is holds for the control group as well. The point estimate for the difference in the probability of death, \(\hat{p}_{1}-\hat{p}_{2}\) is
#p1
#p2
p1-p2
## [1] -0.0001150511
The standard error for the estimated difference uses the individual estimates of the probability of a death:
\[ SE_{p_{1}-p_{2}}=\sqrt{\frac{(p_{1} \times (1-p_{1}))}{n_{1}}+\frac{(p_{2} \times (1-p_{2}))}{n_{2}}}. \]
In our example, \(SE_{p_{1}-p_{2}}=0.0007\).
SE.1<-(p1*(1-p1))/n1
SE.2<-(p2*(1-p2))/n2
SE.12<-sqrt(SE.1+SE.2)
SE.12
## [1] 0.0007018185
For confidence level of \(95\%\), \(Z_{\frac{\alpha}{2}}=1.96\),
z.val<-qnorm(0.975,0,1)
z.val
## [1] 1.959964
and the \(95\%\) confidence interval is given by
LL<-(p1-p2)-SE.12*z.val
UL<-(p1-p2)+SE.12*z.val
c(LL,UL)
## [1] -0.001490590 0.001260488
With \(95\%\) confidence, the difference in the probability of death is between \(-0.15\%\) and \(0.13\%\). As expected from the large p-value, the confidence interval contains the null value 0.
The analysis discussed above can be conducted using the R function prop.test(). As shown below, the function uses a \(\chi^{2}\) test with one degree of freedom and therefore the test statistic is equal to \(z^{2}=-0.163933^{2}=0.026874\), Note that the p-value is the same as before.
prop.test(BC1,correct=F)
##
## 2-sample test for equality of proportions without continuity correction
##
## data: BC1
## X-squared = 0.026874, df = 1, p-value = 0.8698
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.001490590 0.001260488
## sample estimates:
## prop 1 prop 2
## 0.01112966 0.01124471
The comparison of the proportion of breast cancer deaths between the two groups can also be approached using a two-way contingency table, which contains counts for combinations of outcomes for two variables. The results for the mammogram study in this format are shown in The table below.
resp<-as.factor(c(rep(0,500),rep(1,44425),rep(0,505),rep(1 ,44405)))
trt<-as.factor(c(rep(1,500),rep(1,44425),rep(2,505),rep(2,44405)))
#cbind(resp,trt)
BC1<-table(trt,resp)
#BC1
rownames(BC1)=c("Mammogram","Control")
colnames(BC1)=c("Yes","No")
BC1
## resp
## trt Yes No
## Mammogram 500 44425
## Control 505 44405
Previously, the main question of interest was stated as, “Is there evidence of a difference in the proportion of breast cancer deaths between the two screening groups?” If the probability of a death from breast cancer does not depend the method of screening, then screening method and outcome are independent. Thus, the question can be re-phrased: “Is there evidence that screening method is associated with outcome?” Hypothesis testing in a two-way table assesses whether the two variables of interest are associated (i.e., not independent). The approach can be applied to settings with two or more groups and for responses that have two or more categories. The observed number of counts in each table cell are compared to the number of expected counts, where the expected counts are calculated under the assumption that the null hypothesis of no association is true. A \(\chi^{2}\) test of significance is based on the differences between observed and expected values in the cells.
Figure 8 shows the percentage of death in each group. In the sample, \(1.1130\%\) and \(1.1245\%\) died in the Mammogram,Control groups, respectively.
resp<-c(500,44425,505,44405)
Group<-c("Mammogram","Mammogram","Control","Control")
Death<-c("Yes","No","Yes","No")
tot<-c(44925,44925,44910,44910)
prop<-round(resp/tot,6)*100
BC2<-data.frame(Group,Death,resp,tot,prop)
BC2
## Group Death resp tot prop
## 1 Mammogram Yes 500 44925 1.1130
## 2 Mammogram No 44425 44925 98.8870
## 3 Control Yes 505 44910 1.1245
## 4 Control No 44405 44910 98.8755
plot2=ggplot(BC2, aes(x="", y=prop, fill=Death)) +
geom_bar(stat="identity", width=1, color="black") +
#coord_polar("y", start=0)+theme_void()+
facet_wrap(~as.factor(Group))+
geom_text(aes(label = prop),
position = position_stack(vjust = 0.5))
plot2
Figure 8: Barplot for the precentage of death from breast cancer by group.
If type of breast cancer screening had no effect on outcome in the mammogram data, what would the expected results be?
Recall that if two events \(A\) and \(B\) are independent, then
\[ P (A | B) = P(A) \times P (B) \] Let \(A\) represents assignment to the mammogram group and \(B\) the event of death from breast cancer. Let \(O_{ij}\) be the observed count in the \(ij\)th cell. For the breat cancer example, let us focus on the combination Mammogram/Yes. The observed count \(O_{11}\) is equal to 500.
xx<-chisq.test(BC1,correct=F)
#xx
xx$observed
## resp
## trt Yes No
## Mammogram 500 44425
## Control 505 44405
Under independence, the number of individuals out of the total counts that are expected to be in the \(ij\)th cell is equal to
\[ \mbox{Expected}_{ij}=\frac{\mbox{row}\;i \;\mbox{total} \times \mbox{column}\;j \;\mbox{total}}{\mbox{total count in the table}} \]
For the cell Mammogram/Yes, the (1,1) cell, we have
\[ \frac{1005 \times 44,925}{89835}=502.5839=E_{11}. \]
Here, \(E_{11}\) is the expected count, under the independence hypothesis, in the \((1,1)\)th cell. For all the cells in the table we have
xx$expected
## resp
## trt Yes No
## Mammogram 502.5839 44422.42
## Control 502.4161 44407.58
WE difense the cell’s residual \(e_{ij}\) by
\[ e_{ij}=O_{ij}-{E_{ij}}. \]
For the cell Mammogram/Yes, \(e_{11}=O_{11}-E_{11}=500-502.6=-2.6\).
e11<-(xx$observed-xx$expected)[1,1]
e11
## [1] -2.583904
The standardized residuals for the \(ij\)th cell, \(r_{ij}\), is given by
\[ r_{ij}=\frac{O_{ij}-{E_{ij}}}{\sqrt{E_{ij}}}=\frac{e_{ij}}{\sqrt{E_{ij}}}. \]
For the cell Mammogram/Yes
\[ r_{11}=\frac{e_{11}}{\sqrt{E_{11}}}=\frac{-2.6}{\sqrt{502.6}}=-0.1152583. \]
e11/sqrt(xx$expected[1,1])
## [1] -0.1152583
For all cells in the table we have
xx$residuals
## resp
## trt Yes No
## Mammogram -0.11525826 0.01225957
## Control 0.11527751 -0.01226162
a<-as.vector(xx$residuals)
a
## [1] -0.11525826 0.11527751 0.01225957 -0.01226162
Previously, test statistics have been constructed by calculating the difference between a point estimate and a null value, then dividing by the standard error of the point estimate to standardize the difference. The \(chi^{2}\) statistic is based on a different idea. In each cell of a table, the difference observed - expected (\(O_{ij}-E_{ij}\)) is a measure of the discrepancy between what was observed in the data and what should have been observed under the null hypothesis of no association. If the row and colum variables are highly associated, that difference will be large. Two adjustments are made to the differences before the final statistic is calculated. First, since both positive and negative differences suggest a lack of independence, the differences are squared to remove the effect of the sign. Second, cells with larger counts may have larger discrepancies by chance alone, so the squared differences in each cell are scaled by the number expected in the cell under the hypothesis of independence. The final \(\chi^{2}\) statistic is the sum of these standardized squared differences, where the sum has one term for each cell in the table. The \(\chi^{2}\) test statistic is calculated as
\[ \chi^{2}=\sum_{ij} \left ( \frac{\mbox{Observed count}_{ij}-{\mbox{Expected count}_{ij}}}{\sqrt{\mbox{Expected count}_{ij}}} \right )^{2}, \]
or
\[ \chi^{2}=\sum_{ij} \left ( \frac{O_{ij}-{E_{ij}}}{\sqrt{E_{ij}}} \right )^{2}=\sum_{ij}r^{2}_{ij}. \]
For the Breast cancer example the standardized residuals are
a
## [1] -0.11525826 0.11527751 0.01225957 -0.01226162
which leads to the \(\chi^{2}\) statistic
\[ \chi^{2}=(-0.11525826)^{2}+(0.11527751)^2+(0.01225957)^2+(-0.01226162)^2=0.026874. \]
z<-sum(a^2)
z
## [1] 0.02687401
The chi-square distribution is often used with data and statistics that are positive and right skewed. The distribution is characterized by a single parameter, the degrees of freedom. For a \(I \times J\) table, the distribution of the test statistic is given by \[ \chi^{2} \sim \chi^{2}_{(I-1) \times (J-1)}. \] For a \(2 \times 2\) table, \((I-1) \times (J-1) = (2-1) \times (2-1)=1\) and therefore \(\chi^{2} \sim \chi^{2}_{1}\).
Figure 9 demonstrates three general properties of chi-square distributions as the degrees of freedom increases: the distribution becomes more symmetric, the center moves to the right, and the variability increases.
x1<-seq(from=0,to=25,length=1000)
dx1<-dchisq(x1,1)
dx2<-dchisq(x1,2)
dx3<-dchisq(x1,8)
plot(x1,dx1,type="l",xlab="x",ylab="f(x)",ylim=c(0,1))
lines(x1,dx2,col=2)
lines(x1,dx3,col=3)
legend(20,0.8,c("df=1","df=2","df=8"),lty=c(1,1,1),col=c(1,2,3))
Figure 9: Densities for chi-square distributions.
For the breast cancer example, the test statistic is equal to 0.02. The p value is teh area to the right of the test statistic in Figure 10 and it is equal to 0.8650. Hence, we cannot reject the null hypotehsis and conclude that the data do not provide convincing evidence of an association between screening group and breast cancer death.
x<-seq(from=0,to=0.5,length=1000)
dx<-dchisq(x,1)
plot(x,dx,type="l",xlab="x",ylab="f(x)")
lines(c(z,z),c(0,18),col=2,lwd=3)
Figure 10: chi-square distribution with df=1 and the test statistic.
p.val<-1-pchisq(z,1)
p.val
## [1] 0.8697839
The analysis described above can be conducted in R using the function chisq.test(). The output that was presented in previous sections (expected counts, residuals etc.) are a part of R object that contain the results obtained using chisq.test().
xx<-chisq.test(BC1,correct=F)
xx
##
## Pearson's Chi-squared test
##
## data: BC1
## X-squared = 0.026874, df = 1, p-value = 0.8698
If a newborn is HIV+, should he or she be treated with nevirapine (NVP) or a more expensive drug, lopinarvir (LPV)? In this setting, success means preventing virologic failure; i.e., growth of the virus. A randomized study was conducted to assess whether there is an association between treatment and outcome.In total 14 Of the 147 children administered NVP, about 41% experienced virologic failure; of the 140 children administered LPV, about \(19\%\) experienced virologic failure. The table below shows the observed counts.
resp<-as.factor(c(rep(0,60),rep(1,27),rep(0,87),rep(1 ,113)))
trt<-as.factor(c(rep(1,60),rep(1,27),rep(2,87),rep(2,113)))
#cbind(resp,trt)
HIV1<-table(trt,resp)
rownames(HIV1)=c("Failure","Stable")
colnames(HIV1)=c("NVP","LPV")
HIV1
## resp
## trt NVP LPV
## Failure 60 27
## Stable 87 113
Figure 11 shows the proportion of treatment outcome (Failure/Stable) by treatment group. The proportion of failure are 0.408 and 0.193 for the NVP and LPV groups, respectively.
resp<-c(60,27,87,113)
Trt<-c("NVP","LPV","NVP","LPV")
Outcome<-c("Failure","Failure","Stable","Stable")
tot<-c(147,140,147,140)
prop<-round(resp/tot,3)
HIV2<-data.frame(Trt,Outcome,resp,tot,prop)
HIV2
## Trt Outcome resp tot prop
## 1 NVP Failure 60 147 0.408
## 2 LPV Failure 27 140 0.193
## 3 NVP Stable 87 147 0.592
## 4 LPV Stable 113 140 0.807
plot2=ggplot(HIV2, aes(x="", y=prop, fill=Outcome)) +
geom_bar(stat="identity", width=1, color="black") +
coord_polar("y", start=0)+theme_void()+
facet_wrap(~as.factor(Trt))+
geom_text(aes(label = prop),
position = position_stack(vjust = 0.5))
plot2
Figure 11: Proportion of treatment outcome (Failure/Stable) by treatment group.
The expected count under the null hypothesis are shonw below. For example \(E_{11}=44.56098\).
xx<-chisq.test(HIV1,correct=F)
#xx
#xx$observed
xx$expected
## resp
## trt NVP LPV
## Failure 44.56098 42.43902
## Stable 102.43902 97.56098
For the cell Virologic Failure/NVP the residual is equal to
\[ e_{11}=60-44.6 \]
and the standerized residual
\[ r_{11}=\frac{60-44.6}{\sqrt{44.6}}=2.312824. \]
For all cell in the table we have
#e1<-(xx$observed-xx$expected)[1,1]
#e1
#e1/sqrt(xx$expected[1,1])
xx$residuals
## resp
## trt NVP LPV
## Failure 2.312824 -2.369939
## Stable -1.525412 1.563082
The \(\chi^{2}\) statistic is equal to 15.73587 and the p-value, based on a \(\chi^{2}_{1}\) is equal to 7.282988e-05. p-value=\(P(\chi^{2} > 15.7358)\) and in Figure 12, it is the area to the right of the test statistic.
a<-as.vector(xx$residuals)
z<-sum(a^2)
z
## [1] 15.73587
x<-seq(from=0,to=20,length=1000)
dx<-dchisq(x,1)
plot(x,dx,type="l",xlab="x",ylab="f(x)")
lines(c(z,z),c(0,2.5),col=2,lwd=3)
Figure 12: Chi-square distribution with df=1 and the observed test statistic (red vertical line).
p.val<-1-pchisq(z,1)
p.val
## [1] 7.282988e-05
The analysis’ R output is shown below. With p value smaller than 0.05, we reject the null hypothesis and conclude that the data provides convincing evidence of an association between treatment and outcome.
xx<-chisq.test(HIV1,correct=F)
xx
##
## Pearson's Chi-squared test
##
## data: HIV1
## X-squared = 15.736, df = 1, p-value = 7.283e-05
The Functional polymorphisms Associated with human Muscle Size and Strength study (FAMuSS) measured a variety of demographic, phenotypic, and genetic characteristics for about 1300 participants. Data from the study have been used in a number of subsequent studies, such as one examining the relationship between muscle strength and genotype at a location on the ACTN3 gene.The famuss dataset is a subset of the data for 595 participants.
resp<-c(16,6,5,21,18,16,125,216,126,4,10,9,7,11,5)
sum(resp)
## [1] 595
For the analysis presented below we are interested in the association between the race and the genotype. The table below shows the distribution of race across the level of Genotype at the location r577x in the ACTN3 gene (CC, CT, TT).
aa<-matrix(resp,nrow=5,ncol=3,byrow=T)
rownames(aa)=c("African Am.","Asian","Cauca.","Hispanic","Other")
colnames(aa)=c("CC","CT","TT")
aa
## CC CT TT
## African Am. 16 6 5
## Asian 21 18 16
## Cauca. 125 216 126
## Hispanic 4 10 9
## Other 7 11 5
The research question of interest: Is there evidence of an association between genotype and race? The proportion of each genotype across the race group is shown in Figure 13.
resp<-c(16,6,5,21,18,16,125,216,126,4,10,9,7,11,5)
Genotype<-c("CC","CT","TT","CC","CT","TT","CC","CT","TT","CC","CT","TT",
"CC","CT","TT")
Race<-c("African Am.","Asian","Cauca.","Hispanic","Other",
"African Am.","Asian","Cauca.","Hispanic","Other",
"African Am.","Asian","Cauca.","Hispanic","Other")
tot<-c(27,27,27,55,55,55,467,467,467,23,23,23,24,24,24)
prop<-round(resp/tot,3)
FAMuSS<-data.frame(Race,Genotype,resp,tot,prop)
plot31a=ggplot(FAMuSS, aes(x=Genotype, y=prop, fill=Genotype)) +
geom_bar(stat = "identity", color="black")+
geom_text(aes(label = prop),vjust=-0.15)+
facet_wrap(~as.factor(Race))+
theme_void()
plot31a
Figure 13: Distribution of genotype across race groups.
The expected counts under the null hypothesis that there is no association between genotype and race are shown below. For example for the cell Asian/CT, \(E_{2,2}=24.12605\) while \(O_{2,2}=18\).
xx<-chisq.test(aa,correct=T)
#xx
xx$expected
## CC CT TT
## African Am. 7.850420 11.84370 7.305882
## Asian 15.991597 24.12605 14.882353
## Cauca. 135.783193 204.85210 126.364706
## Hispanic 6.687395 10.08908 6.223529
## Other 6.687395 10.08908 6.223529
For the cell Asian/CT, $e_{2,2}=18-24.12605= -6.12605 $, \(r_{ij}=-6.12605/ \sqrt{24.12605}=-1.24720\). The \(\chi^{2}\) statistic is equal to 19.4001.
xx$residuals
## CC CT TT
## African Am. 2.9086319 -1.69802497 -0.85310170
## Asian 1.2524298 -1.24720387 0.28971360
## Cauca. -0.9253891 0.77888407 -0.03244366
## Hispanic -1.0392093 -0.02804356 1.11294757
## Other 0.1208836 0.28678513 -0.49045147
z<-xx$statistic
z
## X-squared
## 19.4001
For the FAMuSS study, since \(I=5\) and \(J=3\), the \(\chi^{2}\) statistic is distributed with \((5-1) \times (3-1)= 8\) degrees of freedom. The p-value, as shown in Figure 13, is equal to 0.01286033 (the area to the right of the test statistic). Thus, for \(\alpha=0.05\), there is sufficient evidence to reject the null hypothesis of independence between race and genotype.
x<-seq(from=0,to=25,length=1000)
dx<-dchisq(x,8)
plot(x,dx,type="l",xlab="x",ylab="f(x)")
lines(c(z,z),c(0,0.1),col=2,lwd=3)
Figure 14: chi-square distribution with df=8 and the test statistic.
p.val1<-1-pchisq(z,8)
p.val1
## X-squared
## 0.01286033
The analysis presnted above can be conducted in R using the function chisq.test(). The output is shown below.
xx<-chisq.test(aa,correct=T)
xx
##
## Pearson's Chi-squared test
##
## data: aa
## X-squared = 19.4, df = 8, p-value = 0.01286
According to Alen Agresti’s book and wikipedia, the experiment involved a woman (Muriel Bristol) who claimed she could distinguish between cups of tea where milk was added first and those where tea was added first. Her future husband, William Roach, suggested a test where eight cups of tea were prepared, four with milk added first and four with tea added first, and they were randomly presented to her. The resukts are shown below
data <- data.frame(
"Milk" = c(3, 1),
"Tea " = c(1, 3),
row.names = c("Muriel guess: 1st Milk", "Muriel guess: 1st Tea"),
stringsAsFactors = FALSE
)
colnames(data) <- c(" Poured 1st Milk", "Poured 1st Tea")
data
## Poured 1st Milk Poured 1st Tea
## Muriel guess: 1st Milk 3 1
## Muriel guess: 1st Tea 1 3
test <- fisher.test(data,alternative = "greater")
test
##
## Fisher's Exact Test for Count Data
##
## data: data
## p-value = 0.2429
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 0.3135693 Inf
## sample estimates:
## odds ratio
## 6.408309
data <- data.frame(
"Cured." = c(13, 3),
"Uncured" = c(4, 9),
row.names = c("Fecal Infusion", "Vancomycin"),
stringsAsFactors = FALSE
)
colnames(data) <- c("Cured", "Unured")
data
## Cured Unured
## Fecal Infusion 13 4
## Vancomycin 3 9
test <- fisher.test(data)
test
##
## Fisher's Exact Test for Count Data
##
## data: data
## p-value = 0.00953
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.373866 78.811505
## sample estimates:
## odds ratio
## 8.848725