Chi Square Test

Working with Categorical Data

Chi-Square

The table below shows the relationship between gender and party identification in a US state.

| | Democrat | Independent | Republican | Total |
|---|---|---|---|---|
| Male | 279 | 73 | 225 | 577 |
| Female | 165 | 47 | 191 | 403 |
| Total | 444 | 120 | 416 | 980 |

Test for an association between gender and party affiliation at two appropriate significance levels and comment on your results.

Set out the null hypothesis that there is no association between gender and party affiliation against the alternative that there is. Be careful to get these the correct way round!

H0: There is no association. H1: There is an association.

Work out the expected values. Under the null hypothesis, the expected count for a cell is (row total / grand total) × column total. For example, the expected number of male Democrats is (577/980) × 444 = 261.4.
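The expected-value calculation can be sketched in R for the gender and party table. This is a minimal illustration, not part of the original exercise: the object names `party`, `expected` and `result` are made up for the example, and the two significance levels (5% and 1%) are an assumption, since the exercise leaves the choice open.

```r
# Observed counts from the gender by party identification table
party <- matrix(c(279, 73, 225,
                  165, 47, 191),
                nrow = 2, byrow = TRUE,
                dimnames = list(Gender = c("Male", "Female"),
                                Party  = c("Democrat", "Independent", "Republican")))

# Expected counts under no association: row total * column total / grand total
expected <- outer(rowSums(party), colSums(party)) / sum(party)
round(expected, 1)

# Pearson chi-squared test; chisq.test() computes the same expected counts
result <- chisq.test(party)
result

# Compare the p-value with two significance levels (5% and 1% are assumed here)
result$p.value < c(0.05, 0.01)
```

With these counts the p-value is about 0.03, so the association is significant at the 5% level but not at the 1% level.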



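```r
# Goodness-of-fit test with the default null hypothesis of equal proportions
chisq.test(c(59, 20, 11, 10))

# Goodness-of-fit test against specified proportions (here a 9:3:3:1 ratio)
chisq.test(c(59, 20, 11, 10), p = c(9/16, 3/16, 3/16, 1/16))
```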
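The two calls above are goodness-of-fit tests: a single vector of counts is compared with a set of hypothesised proportions. As a rough sketch of what happens inside the second call, the statistic can be rebuilt by hand. The variable names (`observed`, `null_p`, and so on) are introduced only for this illustration.

```r
observed <- c(59, 20, 11, 10)
null_p   <- c(9, 3, 3, 1) / 16          # hypothesised proportions, must sum to 1

expected  <- sum(observed) * null_p     # expected counts under the null
statistic <- sum((observed - expected)^2 / expected)
dof       <- length(observed) - 1       # categories minus one
p_value   <- pchisq(statistic, df = dof, lower.tail = FALSE)

# The manual result should agree with chisq.test()
fit <- chisq.test(observed, p = null_p)
c(manual = statistic, chisq.test = unname(fit$statistic))
c(manual = p_value,   chisq.test = fit$p.value)
```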


```r
library(MASS)     # load the MASS package

tbl = table(survey$Smoke, survey$Exer)
```

Section 3. Chi-squared Test of Independence

Two random variables x and y are called independent if the probability distribution of one variable is not affected by the value of the other.

  • Assume fij is the observed frequency count of events belonging to both the i-th category of x and the j-th category of y.
  • Also assume eij to be the corresponding expected count if x and y are independent.
  • The null hypothesis of independence is rejected if the p-value of the following Chi-squared test statistic is less than a given significance level α:

$$\chi^2 = \sum_{i,j} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}$$
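```r
# ctbl is not defined in these notes; it is assumed here to be tbl with the
# "None" and "Some" columns combined, which is consistent with df = 3 below.
ctbl = cbind(tbl[, "Freq"], tbl[, "None"] + tbl[, "Some"])
chisq.test(ctbl)
```

```
## 
## 	Pearson's Chi-squared test
## 
## data:  ctbl
## X-squared = 3.2328, df = 3, p-value = 0.3571
```
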
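The test statistic described in this section can also be computed directly from the observed and expected counts. The sketch below uses the smoking and exercise counts from the `survey` data in MASS. The names `obs`, `e`, `x2`, `dof` and `p` are illustrative only, and `MASS::survey` is read directly so the result does not depend on anything else in the session.

```r
# Observed counts f_ij (rows: smoking habit, columns: exercise level)
obs <- table(MASS::survey$Smoke, MASS::survey$Exer)

# Expected counts e_ij under independence: row total * column total / grand total
e <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic: sum over all cells of (f_ij - e_ij)^2 / e_ij
x2  <- sum((obs - e)^2 / e)
dof <- (nrow(obs) - 1) * (ncol(obs) - 1)
p   <- pchisq(x2, df = dof, lower.tail = FALSE)

c(statistic = x2, df = dof, p.value = p)
```

The degrees of freedom are (rows − 1) × (columns − 1), which gives 6 for this 4 × 3 table.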
### Smoking Example
Using the built-in survey dataset from the MASS package, we shall test the association between smoking and exercise.
Test whether the students' smoking habit is independent of their exercise level at the 0.05 significance level.

```r
library(MASS)                            # load the MASS package
tbl = table(survey$Smoke, survey$Exer)
tbl                                      # the contingency table
```

```
##        
##         Freq None Some
##   Heavy    7    1    3
##   Never   87   18   84
##   Occas   12    3    4
##   Regul    9    1    7
```

```r
# Notice the small cell sizes
class(survey$Smoke)
```

```
## [1] "factor"
```

```r
# Sort the factor levels into a natural order.
# factor(x, levels = ...) reorders the levels; assigning to levels(x)
# would relabel the observations instead.
survey$Smoke <- factor(survey$Smoke, levels = c("Never", "Occas", "Regul", "Heavy"))
survey$Exer  <- factor(survey$Exer,  levels = c("None", "Some", "Freq"))
```

### Chi Square Test for Independence

```r
chisq.test(tbl)
```

```
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
```

```
## 
## 	Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 5, df = 6, p-value = 0.5
```

* We have applied the chisq.test() function to the contingency table tbl and found the p-value to be 0.4828 (shown rounded to 0.5 in the output above).
* Since the p-value is greater than 0.05, we fail to reject the null hypothesis that smoking habit is independent of exercise level.
* The warning message is due to the small cell values in the contingency table.
* To avoid the warning, we can combine the None and Some columns of tbl.

```r
# Don't throw out the raw data; make a derived variable instead.
survey$Exer2 <- survey$Exer
levels(survey$Exer2) <- list(Rare = c("None", "Some"), Freq = "Freq")
```

```r
chisq.test(survey$Smoke, survey$Exer2)
```

```
## 
## 	Pearson's Chi-squared test
## 
## data:  survey$Smoke and survey$Exer2
## X-squared = 3.2328, df = 3, p-value = 0.3571
```

* With the two columns combined the warning no longer appears, and we still fail to reject the null hypothesis.

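The warning discussed above is raised when some expected counts are small (chisq.test() warns when any expected count is below 5). A quick way to see which cells are responsible is to inspect the expected counts. The sketch below also shows one alternative to merging categories, a simulated p-value. The names `tbl0`, `expected` and `mc`, the seed and the number of replicates are choices made for this illustration only.

```r
# Fresh copy of the table, taken from the package data so that it does not
# depend on any changes made to `survey` in the current session
tbl0 <- table(MASS::survey$Smoke, MASS::survey$Exer)

# Expected counts under independence; suppressWarnings() only hides the
# small-count warning while we look at what causes it
expected <- suppressWarnings(chisq.test(tbl0))$expected
round(expected, 2)

# Cells whose expected count falls below 5
which(expected < 5, arr.ind = TRUE)

# One alternative to merging categories: a Monte Carlo p-value,
# which does not rely on the chi-squared approximation
set.seed(1)   # arbitrary seed, only for reproducibility
mc <- chisq.test(tbl0, simulate.p.value = TRUE, B = 10000)
mc
```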
### prop.table()

A useful command associated with the Chi-Square Test is prop.table(), which converts count data to proportions.

```r
### Overall Proportions
prop.table(tbl)
```

```
##        
##            Freq    None    Some
##   Heavy 0.02966 0.00424 0.01271
##   Never 0.36864 0.07627 0.35593
##   Occas 0.05085 0.01271 0.01695
##   Regul 0.03814 0.00424 0.02966
```

```r
### Proportion of Row Variable
prop.table(tbl, 1)
```

```
##        
##           Freq   None   Some
##   Heavy 0.6364 0.0909 0.2727
##   Never 0.4603 0.0952 0.4444
##   Occas 0.6316 0.1579 0.2105
##   Regul 0.5294 0.0588 0.4118
```

```r
### Proportion of Column Variable
prop.table(tbl, 2)
```

```
##        
##           Freq   None   Some
##   Heavy 0.0609 0.0435 0.0306
##   Never 0.7565 0.7826 0.8571
##   Occas 0.1043 0.1304 0.0408
##   Regul 0.0783 0.0435 0.0714
```
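
Row proportions are often the most useful view alongside a test of independence, because they show the exercise profile within each smoking category. The sketch below is an illustration only: it labels the dimensions, checks that each row sums to 1, and adds totals to the counts. The names `tbl0` and `row_prop` are made up for the example, and `MASS::survey` is read directly so the block runs on its own.

```r
# Contingency table with named dimensions
tbl0 <- table(Smoke = MASS::survey$Smoke, Exer = MASS::survey$Exer)

# Row proportions: each row (smoking category) sums to 1
row_prop <- prop.table(tbl0, margin = 1)
rowSums(row_prop)

# Row percentages, rounded for display
round(100 * row_prop, 1)

# Counts with row and column totals added
addmargins(tbl0)
```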