Hi, Team! Some of you have likely noticed that the results provided by prop.test in R do not include the Gaussian probability and do include (by default) a corrected \(\chi^2\). The \(\chi^2\) is actually the distribution of a variance, so it is strictly positive. If we have observed and expected values, then \(\chi^2\) is calculated as \(\sum_x (O_x-E_x)^2 / E_x\), where \(O_x\) and \(E_x\) are the observed and expected values, respectively. You can see that this formula calculates a variance-like measure of the deviations between observed and expected.
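As a quick illustration of that formula, here is a minimal sketch with made-up observed and expected counts (the numbers are hypothetical, chosen only to show the arithmetic).
O=c(20, 30, 50) #hypothetical observed counts
E=c(25, 25, 50) #hypothetical expected counts
sum((O-E)^2/E) #chi-square statistic: (25/25) + (25/25) + (0/50)
## [1] 2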
The \(\chi^2\) is a distribution completely determined by its degrees of freedom. For the proportion test, the degrees of freedom are the number of cells that can truly vary at random while still reproducing the row and column totals. We’ll come back to this.
Let’s start with an example from the Titanic, doing the analysis with males against females (as alluded to in the discussion prompt). This is the way the analysis should actually be done.
titanic=read.csv('c:/users/lfult/documents/titanic/train.csv', stringsAsFactors = T)
t=rbind(table(titanic$Survived[titanic$Sex=='male']), table(titanic$Survived[titanic$Sex=='female'])) #survival counts by sex
row.names(t)=c('Male','Female')
colnames(t)=c('Died', 'Lived')
(myt=addmargins(t)) #append row and column totals
## Died Lived Sum
## Male 468 109 577
## Female 81 233 314
## Sum 549 342 891
Let’s set up a simple hypothesis pair to consider. Let \(\pi_1\) represent the proportion of males that lived and \(\pi_2\) represent the proportion of females that lived. We want to test the notion that females had higher survival rates. NOTE: the training data is a randomly generated sample from the population.
\(H_o: \pi_1-\pi_2 \ge 0\)
\(H_a: \pi_1-\pi_2 < 0\)
Here, we can see that the entirety of the sample space is covered and that we have a left-tailed test investigating the difference in proportions (segue into week 5). If females truly had a higher survival proportion, then the difference in those proportions (males minus females) should be negative. Let’s choose our standard \(\alpha = .05\).
Now, we have a couple of ways to proceed. First, let’s just do a proportion test using the simple calculations associated with it. For a two-sample proportion, the distribution is conveniently Gaussian (normal). The formula for the Z-statistic follows.
\(Z = \frac{(p_1-p_2)-(\pi_1-\pi_2)}{\sqrt{\bar p \times (1-\bar p) \times (\frac{1}{n_1}+\frac{1}{n_2})}}\)
Here, \(p_1\) is the proportion of males that lived with a total count of males equaling \(n_1\), and \(p_2\) is the proportion of females that lived with a total count of females equaling \(n_2\). The overall proportion of those who lived is \(\bar p\). \(\pi_1 - \pi_2\) is the hypothesized difference between the two survival rates, which we have stated is zero (right-hand side of the hypotheses). It need not be zero at all; we might test whether the difference is 0.1 or something else. But in our case it is zero.
p1=myt[1,2]/myt[1,3] #male survived
p2=myt[2,2]/myt[2,3] #female survived
pbar=myt[3,2]/myt[3,3] #overall proportion survived
delta=0 #hypothetical difference between pi1 and pi2 found on the right-hand side of the hypotheses statements.
n1=myt[1,3] #number of males
n2=myt[2,3] #number of females
sigma=sqrt(pbar*(1-pbar)*(1/n1+1/n2))
Z=(p1-p2-delta)/sigma #Z Statistic
Zcrit=qnorm(.05) #Z Critical value
p_value=pnorm(Z) #p-Value
answer=round(c(p1, p2, pbar, delta, n1, n2, sigma, Z, Zcrit, p_value), 4)
names(answer)=c('Male prop', 'Female prop', 'Total prop', 'Hyp. Delta', 'Male n', 'Female n', 'sigma', 'Z', 'Z critical', 'p-value')
answer
## Male prop Female prop Total prop Hyp. Delta Male n Female n
## 0.1889 0.7420 0.3838 0.0000 577.0000 314.0000
## sigma Z Z critical p-value
## 0.0341 -16.2188 -1.6449 0.0000
The conclusion is clear. Reject the null hypothesis because \(p<\alpha\) or, equivalently, because \(Z<Z_{crit}\). Female survival rates are higher than male survival rates. Keep in mind that this is a sample since we used the training set; had we used the full passenger list, we would have had a census and could simply have computed the parameter directly. We could also plot this distribution.
curve(dnorm(x), from = -3, to = 3,
main = 'Standard Normal', #add title
ylab = 'Density', #change y-axis label
lwd = 2, #increase line width to 2
col = 'blue')
x_vector <- seq(-3,Zcrit, by=.001)
#create vector of density values
p_vector <- dnorm(x_vector)
#Fill
polygon(c(x_vector, rev(x_vector)), c(p_vector, rep(0, length(p_vector))),
col = adjustcolor('red', alpha=0.9), border = NA)
lines(x=rep(Zcrit, 10), y=seq(-.1,1.5,length.out=10), col='black')
text(-2, .2, "Critical Value=-1.6449", srt=90)
text(-3,.2, "Test Statistic=-16.2188", col='red', srt=90)
Alternatively, we could have used the prop.test function in R.
prop.test(t[,2:1],alternative='less', correct=F)
##
## 2-sample test for equality of proportions without continuity correction
##
## data: t[, 2:1]
## X-squared = 263.05, df = 1, p-value < 2.2e-16
## alternative hypothesis: less
## 95 percent confidence interval:
## -1.0000000 -0.5044702
## sample estimates:
## prop 1 prop 2
## 0.1889081 0.7420382
What you notice in the prop.test output is that the \(\chi^2\) statistic is reported with its degrees of freedom, but the conclusion is the same. First, let’s show you how that is calculated. Remember from above that \(\chi^2_{df} = \sum_x (O_x-E_x)^2 / E_x\), where \(O_x\) and \(E_x\) are the observed values and expected values. The observed values are from the table we generated. How do we get the expected values?
We use probability. This particular \(\chi^2\) test is a test of independence. If gender does not play into survival, then male and female survival should be statistically independent. Independence means that \(P(AB)=P(A)P(B)\). Recall that \(P(A)\) and \(P(B)\) are marginal probabilities, meaning that they are calculated in the margins of our tables. Further, if independence holds, the expected count in a cell is just the total count times that product of marginal probabilities. So if we calculate this expectation for each cell, then we have the expected table. Recall the original table.
myt
## Died Lived Sum
## Male 468 109 577
## Female 81 233 314
## Sum 549 342 891
In our case, let’s calculate the expected number of males that would die under the assumption of independence. Let’s define male as G1 for Gender 1 and Died as O1 for Outcome 1. Then \(P(G1O1)=P(G1)P(O1)\) under independence, and the expected count is the total count times that probability.
From the table above, there are 577 males out of 891 total and 549 who died out of 891 total. So \(P(G1O1)=(577/891) \times (549/891)=0.3990182\), and the expected value is nothing more than \(E(G1O1)=891 \times (577/891) \times (549/891)=355.5253\). This simplifies to \(577 \times 549 / 891\). From this formula, you can see that we simply multiply the marginal totals lining up with the appropriate cell and divide by the grand total! Way too easy. Here is the math.
myt2=noquote(rbind(c('577*549/891 = 355.5253', '577*342/891 = 221.4747'),c('314*549/891 = 193.4747','314*342/891 = 120.5253')))
colnames(myt2)=c('Died','Lived')
rownames(myt2)=c('Male', 'Female')
myt2
## Died Lived
## Male 577*549/891 = 355.5253 577*342/891 = 221.4747
## Female 314*549/891 = 193.4747 314*342/891 = 120.5253
These are our expected values! We can get this result simply using our original table as well.
myt3=rbind(c(myt[1,3]*myt[3,1]/891, myt[1,3]*myt[3,2]/891),c(myt[2,3]*myt[3,1]/891, myt[2,3]*myt[3,2]/891))
colnames(myt3)=c('Died','Lived')
rownames(myt3)=c('Male', 'Female')
myt3
## Died Lived
## Male 355.5253 221.4747
## Female 193.4747 120.5253
So now it becomes easy to calculate the \(\chi^2\) statistic, although we will talk about degrees of freedom shortly.
(chistat=sum((myt[1:2, 1:2]-myt3)^2/myt3))
## [1] 263.0506
This is exactly what we found (without the continuity correction) in the proportion test. And it must be so.
prop.test(t,alternative='less', correct=F)$statistic
## X-squared
## 263.0506
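Why must it be so? For a 2 x 2 table, the uncorrected \(\chi^2\) statistic is algebraically the square of the pooled two-sample Z statistic, so squaring our earlier Z recovers the same value.
Z^2 #square of the Z statistic from the proportion test above
## [1] 263.0506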
But what should we do to calculate (and understand) the degrees of freedom for these problems? First, the calculation is easy. The degrees of freedom equal the number of rows minus one times the number of columns minus one, \((r-1)\times(c-1)\). In our case, we have two rows and two columns, so the solution is \((2-1)\times(2-1)=1\). But what does it really mean?
The degrees of freedom in this case are the number of cells that can actually vary at random while still reproducing the original margins (the row and column totals). All four cells of our table cannot be random, as we would not be guaranteed to recover the proper row and column totals that come with the data. Neither can three or two cells be random, because some row or column total would still not be guaranteed to match the original data. But exactly one cell can be randomly generated and still produce the row and column totals, as the other cells are then fixed. We can verify our calculation in R, and a small sketch after the verification makes the one-free-cell idea concrete.
prop.test(t,alternative='less', correct=F)$parameter
## df
## 1
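Here is that sketch: fix the margins from our table, let a single cell (males who died) vary, and every other cell is forced.
free=468 #the one cell we allow to vary (males who died)
male_lived=577-free #male row total forces this cell
female_died=549-free #died column total forces this cell
female_lived=314-female_died #female row total forces the last cell
matrix(c(free, male_lived, female_died, female_lived), 2, 2, byrow=T,
dimnames=list(c('Male','Female'), c('Died','Lived')))
## Died Lived
## Male 468 109
## Female 81 233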
So what is the \(\chi^2\), really? The \(\chi^2\) distribution with \(k\) degrees of freedom is the sum of the squares of \(k\) independent standard normal (Gaussian) random variables, \(\chi^2_k = \sum_{i=1}^{k} Z_i^2\) with \(Z_i \sim N(0,1)\). In our 2 x 2 table, the margin constraints leave only one freely varying cell, so the statistic behaves like a single squared standard normal: a \(\chi^2\) with 1 degree of freedom. Being that we are squaring normals, the distribution lives on \([0,\infty)\) (unlike the standard normal, which lives on the whole real line). Graphs of several \(\chi^2\) distributions follow.
curve(dchisq(x, df = 1), from = 0, to = 40,
main = 'Chi-Square Distributions', #add title
ylab = 'Density', #change y-axis label
lwd = 2, #increase line width to 2
col = 'blue')
curve(dchisq(x, df = 4), from = 0, to = 40,
lwd = 2, col = 'red', add=T) #overlay df = 4
curve(dchisq(x, df = 9), from = 0, to = 40,
lwd = 2, col = 'darkgreen', add=T) #overlay df = 9
curve(dchisq(x, df = 14), from = 0, to = 40,
lwd = 2, col = 'black', add=T) #overlay df = 14
legend("topright", # Add legend to plot
legend = c("df=1", "df=4", "df=9", "df=14"),
col = c("blue", "red", "darkgreen", "black"),
pch = 16)
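If you want to convince yourself of the sum-of-squared-normals definition, here is a quick simulation sketch (the 10,000 replications and the choice of df = 3 are arbitrary): the histogram of simulated sums should track the theoretical \(\chi^2_3\) density.
set.seed(1) #for reproducibility
sims=replicate(10000, sum(rnorm(3)^2)) #each draw is a sum of 3 squared standard normals
hist(sims, breaks=50, freq=FALSE, main='Simulated Sums of 3 Squared Normals', xlab='Value')
curve(dchisq(x, df=3), lwd=2, col='blue', add=T) #theoretical chi-square density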
Since the \(\chi^2\) is just a distribution, the probabilities under the curve integrate to 1, and we can evaluate probabilities within it. NOTE: the distribution is not symmetric. In our case, the p-value is the probability that a \(\chi^2_1\) random variable is at least as large as the statistic we calculated: \(P(\chi^2_1 \ge x)\). We could also simply evaluate whether our calculated statistic is greater than the critical value (cutpoint) beyond which \(\alpha=.05\) of the distribution lies.
pval=round(pchisq(chistat,1,lower.tail=F),3)
critval=round(qchisq(.95,1), 3)
myl=c(chistat, critval, pval)
names(myl)=c('Statistic', 'Critical Value', 'Pval')
myl
## Statistic Critical Value Pval
## 263.0506 3.8410 0.0000
Graphically, this would look as follows.
curve(dchisq(x, df = 1), from = 0, to = 7,
main = 'Chi-Square Distribution ', #add title
ylab = 'Density', #change y-axis label
lwd = 2, #increase line width to 2
col = 'blue')
x_vector <- seq(3.8410, 7, by=.001)
#create vector of chi-square density values
p_vector <- dchisq(x_vector, df = 1)
#fill in portion of the density plot
polygon(c(x_vector, rev(x_vector)), c(p_vector, rep(0, length(p_vector))),
col = adjustcolor('red', alpha=0.9), border = NA)
lines(x=rep(3.8410, 10), y=seq(-.1,1.5,length.out=10), col='black')
text(3.7, 1, "Critical Value=3.8410", srt=90)
text(7,1, "Test Statistic=263.0506", col='red', srt=90)
Sometimes, the cell counts in the expected \(\chi^2\) tables are very small (e.g., 5 or fewer). If that is the case (and since the \(\chi^2\) distribution is an approximation), the squared differences between observed and expected can be relatively large and bias the test statistic in favor of rejecting the null. In these cases, a continuity correction is often applied to reduce the bias (i.e., to make it harder to find results). That correction provides a simple modification for all cells: \(\sum_x (|O_x-E_x|-0.5)^2 / E_x\). What this correction does is reduce each cell’s contribution, making the test statistic smaller. We can check that this is smaller using prop.test and manually.
f1=prop.test(t,alternative='less', correct=F)$statistic
f2=prop.test(t,alternative='less', correct=T)$statistic
f3=chistat
f4=sum((abs(t-myt3)-.5)^2/myt3)
ex=cbind(rbind(f1,f3),rbind(f2,f4))
colnames(ex)=c('Uncorrected', 'Corrected')
rownames(ex)=c('prop.test','Manual')
ex
## Uncorrected Corrected
## prop.test 263.0506 260.717
## Manual 263.0506 260.717
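To see why the correction matters when cells are small, here is a sketch with a hypothetical 2 x 2 table (counts invented for illustration). The uncorrected statistic exceeds the 3.841 critical value while the corrected one does not, so the correction flips the decision.
toy=matrix(c(3, 7, 8, 2), 2, 2, byrow=T) #hypothetical small-count table
Etoy=outer(rowSums(toy), colSums(toy))/sum(toy) #expected counts from the margins
sum((toy-Etoy)^2/Etoy) #uncorrected statistic
## [1] 5.050505
sum((abs(toy-Etoy)-.5)^2/Etoy) #corrected statistic
## [1] 3.232323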
Next, we come to Fisher’s Exact Test (FET), which is also known as…wait for it…the hypergeometric test. In this case, we are interested in the probability of the female group containing 233 or more living when the population has 342 living, 549 dead, and the sample size is 314 females: \(P(X\ge 233 \mid s=342, f=549, n=314)\). Here, \(s\) is the number of successes, \(f\) the number of failures, and \(n\) the sample size.
\[\sum_{i=233}^{314} \frac{\binom{342}{i} \binom{549}{314-i}}{\binom{891}{314}}\]
Solving this is easy in R or manually.
1-phyper(232, 342, 549, 314) #P(X >= 233) = 1 - P(X <= 232)
## [1] 0
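We could also evaluate the sum in the formula directly, term by term, with dhyper; this returns the exact (tiny) tail probability rather than the zero that the floating-point subtraction above rounds to.
sum(dhyper(233:314, 342, 549, 314)) #adds P(X=233) through P(X=314)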
So the p-value is zero to machine precision. If we ran fisher.test in R, we would get the following.
(myfet=fisher.test(t, alternative='greater'))
##
## Fisher's Exact Test for Count Data
##
## data: t
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 9.247349 Inf
## sample estimates:
## odds ratio
## 12.30265
The Fisher’s Exact Test in R estimates the odds ratio (OR). An odds ratio equal to 1 reflects no difference. The odds ratio is calculated (without correction) by dividing the odds of a female living (the number of females who lived divided by the number of females who died) by the odds of a male living (the number of males who lived divided by the number of males who died). In our case that is (233/81)/(109/468). A correction for small cell counts (Haldane-Anscombe, which adds 0.5 to every cell) is often used, although we don’t have that problem here.
OR=(233/81)/(109/468) #uncorrected
ORadj=(233.5/81.5)/(109.5/468.5) #corrected
OR3=myfet$estimate
myans=c(OR,ORadj,OR3)
names(myans)=c('Manual', 'Manual Adj.', 'FET from R')
myans
## Manual Manual Adj. FET from R
## 12.35066 12.25814 12.30265
You can see that these are very close to the ratio above; fisher.test reports the conditional maximum likelihood estimate rather than the sample odds ratio, which is why it differs slightly. The lower confidence limit for the odds ratio is calculated as \(\exp (\log (OR)+Z_\alpha\sigma)\), where \(\sigma = \sqrt {1/a+1/b+1/c+1/d}\) and \(a\) through \(d\) are the individual cell counts. Once again, these are close to R’s calculations.
a1=exp(log(OR)+qnorm(.05)*sqrt(1/t[1,1]+1/t[1,2]+1/t[2,1]+1/t[2,2])) #uncorrected
a2=exp(log(ORadj)+qnorm(.05)*sqrt(1/(t[1,1]+.5)+1/(t[1,2]+.5)+1/(t[2,1]+.5)+1/(t[2,2]+.5))) #corrected, adding 0.5 to each cell
a3=myfet$conf.int[1]
ans4=round(c(a1,a2,a3),4) #round for display, as with the earlier output
names(ans4)=c('Manual', 'Manual Adj.', 'FET from R')
ans4
## Manual Manual Adj. FET from R
## 9.3814 9.3170 9.2473
There are a million ways to approach problems, and given reasonable assumptions that hold across the methods chosen, the calculations will be nearly identical.