Chi-square test: a case study

author: Alexander Levakov

date: March 2015

Agents aren’t airplanes. They don’t have schedules.
John Le Carre©

Chi-square test

The chi-square test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories. Does the number of individuals or objects that fall in each category differ significantly from the number you would expect? Is this difference between the expected and observed due to sampling variation, or is it a real difference?

See: http://en.wikipedia.org/wiki/Chi-square_test
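For a two-way table such as the one used in this case study, the test statistic can be written out explicitly. The notation here is an assumption of this sketch and does not come from the source: \(O_{ij}\) is the observed count in row \(i\) and column \(j\), \(E_{ij}\) is the count expected under independence, \(R_i\) and \(C_j\) are the row and column totals, and \(N\) is the grand total.

\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
df = (r - 1)(c - 1)
\]

A large value of \(\chi^2\) relative to the chi-square distribution with \(df\) degrees of freedom suggests that the differences are more than sampling variation.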

Case of interest

To demonstrate the chi-square test, we use a cross tabulation (Table B.7) from an openly available book (technical report): http://fas.org/sgp/library/spies.pdf

Problem

We want to assess the association between the job occupied (civilian, military) and the source of recruitment (volunteer, intelligence, family), which the author of the book describes as strong. The table can be used for a chi-square test as is; we use the R functions chisq.test and barplot to test this hypothesis.

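# Observed counts from Table B.7: rows are job type, columns are recruitment source
recruit <- matrix(c(43, 20, 13, 51, 12, 9), nrow = 2, byrow = TRUE)
dimnames(recruit) <- list(job = c("civil", "mil"), source = c("volunt", "intell", "family"))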
recruit
##        source
## job     volunt intell family
##   civil     43     20     13
##   mil       51     12      9

Null hypothesis

The author claims that there is an association (\(H_1\)). We test the null hypothesis \(H_0\) (there is no association) at a significance level of \(P_0 = 0.05\): if the p-value of the test exceeds 0.05, we cannot reject \(H_0\). Let’s start!

Test output

recruit.chi <- chisq.test(recruit)
recruit.chi
## 
##  Pearson's Chi-squared test
## 
## data:  recruit
## X-squared = 3.3024, df = 2, p-value = 0.1918
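The numbers reported by chisq.test can be checked by hand. The following is a minimal sketch that rebuilds the table from the counts above and computes the Pearson statistic from the expected counts under independence; the variable names (expected, chi2, df, p_value) are chosen for this illustration only.

```r
# Observed counts, copied from the cross tabulation above (Table B.7)
recruit <- matrix(c(43, 20, 13, 51, 12, 9), nrow = 2, byrow = TRUE,
                  dimnames = list(job = c("civil", "mil"),
                                  source = c("volunt", "intell", "family")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(recruit), colSums(recruit)) / sum(recruit)

# Pearson statistic: sum over all cells of (observed - expected)^2 / expected
chi2 <- sum((recruit - expected)^2 / expected)

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
df <- (nrow(recruit) - 1) * (ncol(recruit) - 1)

# p-value: upper-tail area of the chi-square distribution beyond chi2
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)

round(c(X.squared = chi2, df = df, p.value = p_value), 4)
```

The rounded values should agree with the X-squared, df and p-value lines printed by chisq.test.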

Chi-square distribution plot

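x <- seq(from = 0, to = 10, by = 0.01)
plot(x, dchisq(x, df = 2), main = "Chi-square distribution with df=2", type = "l", col = "blue")
# The 5% level is a tail area, so mark the critical value on the x-axis (about 5.99 for df = 2)
abline(v = qchisq(0.95, df = 2), col = "red")
# Observed statistic from the test output (dashed line)
abline(v = 3.3024, lty = 2)
grid()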

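As we can see (X-squared = 3.3024, df = 2, p-value = 0.1918), the p-value is well above 0.05, so we cannot reject the null hypothesis: the data show no statistically significant association between job and source.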
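The decision can also be written out as code. This is a small sketch, assuming the 0.05 significance level stated earlier; it compares the statistic with the critical value and the p-value with the significance level, which are two equivalent ways of stating the same rule. The names alpha, critical, reject_by_statistic and reject_by_p are introduced here for illustration.

```r
# Significance level used in this case study
alpha <- 0.05

# Same table and test as above, rebuilt so the snippet runs on its own
recruit <- matrix(c(43, 20, 13, 51, 12, 9), nrow = 2, byrow = TRUE,
                  dimnames = list(job = c("civil", "mil"),
                                  source = c("volunt", "intell", "family")))
recruit.chi <- chisq.test(recruit)

# Critical value: the point with upper-tail area alpha for the test's df
df <- unname(recruit.chi$parameter)
critical <- qchisq(1 - alpha, df = df)

# Rule 1: reject the null hypothesis if the statistic exceeds the critical value
statistic <- unname(recruit.chi$statistic)
reject_by_statistic <- statistic > critical

# Rule 2 (equivalent): reject the null hypothesis if the p-value is below alpha
reject_by_p <- recruit.chi$p.value < alpha

c(critical = critical, statistic = statistic)
c(reject_by_statistic = reject_by_statistic, reject_by_p = reject_by_p)
```

Both rules give FALSE here, because the statistic lies below the critical value of roughly 5.99.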

Diagrams as supporting evidence

Bar plots of the observed and expected counts tell the same story: the two charts look almost alike. Perfect!

barplot(recruit, legend.text = TRUE, col = c("lightcyan", "lightgreen"), xlab = "Source", ylab = "Count", main = "Job occupied by source of recruitment (observed counts)")

# Expected counts under the null hypothesis of no association
recruit2 <- recruit.chi$expected
recruit2
##        source
## job       volunt   intell  family
##   civil 48.27027 16.43243 11.2973
##   mil   45.72973 15.56757 10.7027
barplot(recruit2, legend.text = TRUE, col = c("lightcyan", "lightgreen"), xlab = "Source", ylab = "Count", main = "Job occupied by source of recruitment (expected counts)")

Minitab story

MINITAB offers a more detailed chi-square report and gives the same result. See: http://www.minitab.com/

Note

To be fair, we must quote the author’s words: “The most striking finding in the table remains the higher proportion of military volunteers compared to civilians.” Well, the chart at the bottom, % Difference between Observed and Expected Counts, needs no comment. But that chart creates a strong illusion of a difference, while the one at the bottom left shows no difference at all. That is the point!
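The quantities behind the author's remark and the percentage chart can be reproduced in R. This is a rough sketch: it assumes the percentage difference is defined as 100 * (observed - expected) / expected, which may not be exactly how the MINITAB chart computes it, and the names row_share and pct_diff are introduced here for illustration.

```r
# Observed counts from Table B.7, rebuilt so the snippet runs on its own
recruit <- matrix(c(43, 20, 13, 51, 12, 9), nrow = 2, byrow = TRUE,
                  dimnames = list(job = c("civil", "mil"),
                                  source = c("volunt", "intell", "family")))

# Share of each recruitment source within each job type (each row sums to 1)
row_share <- prop.table(recruit, margin = 1)

# Expected counts under independence, as returned by chisq.test
expected <- chisq.test(recruit)$expected

# Assumed definition of the percentage difference between observed and expected
pct_diff <- 100 * (recruit - expected) / expected

round(100 * row_share, 1)  # percentages within each job type
round(pct_diff, 1)         # percentage deviation from the expected counts
```

The share of volunteers is higher among military than among civilian recruits, which is the pattern the author points to, yet the chi-square test above shows that a gap of this size is compatible with sampling variation at the 0.05 level.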

Conclusions

Sometimes you need to test your hypothesis before you print the book!

“I mean, you’ve got to compare method with method, and ideal with ideal.”
John Le Carre©