chi.sq - double classification

The double classification Chi Square is typically used to test for independence between two variables, with a null hypothesis that states the variables are independent and an alternative that they are not independent.

For a simple example, let us assume that we want to know if preference for pink or blue is independent of the sex of the child. To obtain a sample we ask the students in kindergarten at a large elementary school to identify which color each of them prefers. Not wanting to make assumptions about the gender identity of the children we obtain information about each child’s assigned sex from the school records. We obtain the following results:

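For girls, 23 preferred pink and 13 preferred blue. For boys, 19 preferred pink and 9 preferred blue.

The easiest way to evaluate these data is to put them in a vector and turn the vector into a matrix. Note that matrix() fills the matrix column by column by default, so the vector lists the two pink counts (girls, then boys) followed by the two blue counts.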

pref <- c(23, 19, 13, 9) #A vector of the observed values
mpref <- matrix(pref, nrow = 2, ncol = 2) #Turn the vector into a matrix with 2 rows and 2 columns
dimnames(mpref) <- list(gender = c("F", "M"),
                        preference = c("Pink", "Blue")) #Creates labels for rows and columns (optional)
mpref #prints the matrix with rows and columns if provided
##       preference
## gender Pink Blue
##      F   23   13
##      M   19    9
Xsq <- chisq.test(mpref) # performs the Chi square test
Xsq # prints Chi square results
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mpref
## X-squared = 0.0043977, df = 1, p-value = 0.9471
Xsq$observed # prints observed values (same as mpref)
##       preference
## gender Pink Blue
##      F   23   13
##      M   19    9
Xsq$expected #prints expected values
##       preference
## gender   Pink   Blue
##      F 23.625 12.375
##      M 18.375  9.625

Our p value of 0.9471 is well above 0.05, so we retain the null hypothesis of independence: these data give no evidence of a relationship between the child's sex and color preference.
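
To see where the statistic and the expected values come from, the calculation can be reproduced by hand. The following is a minimal sketch rather than part of the original analysis; the names expected, yates_stat, dof and p_value are chosen here purely for illustration. It rebuilds the 2 x 2 table, computes each expected count as (row total x column total) / grand total, applies Yates' continuity correction, and looks up the p value. The results should agree with the chisq.test() output shown above.

```r
# Rebuild the 2 x 2 table from the example (matrix() fills column by column)
mpref <- matrix(c(23, 19, 13, 9), nrow = 2, ncol = 2,
                dimnames = list(gender = c("F", "M"),
                                preference = c("Pink", "Blue")))

# Expected count for each cell: (row total * column total) / grand total
expected <- outer(rowSums(mpref), colSums(mpref)) / sum(mpref)
expected

# Yates' correction subtracts 0.5 from each |observed - expected| before squaring
# (every difference here is larger than 0.5, so a plain 0.5 is appropriate)
yates_stat <- sum((abs(mpref - expected) - 0.5)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof <- (nrow(mpref) - 1) * (ncol(mpref) - 1)

# p value: upper tail of the chi-square distribution
p_value <- pchisq(yates_stat, df = dof, lower.tail = FALSE)

yates_stat
p_value
```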

You can add additional categories simply by adding data and increasing the number of rows or columns in the matrix. If we add a third color, “black”, and find that 4 girls and 6 boys prefer it, we can use the following code to do the analysis.

pref <- c(23, 19, 13, 9, 4, 6) #A vector of the observed values
mpref <- matrix(pref, nrow = 2, ncol = 3) #Turn the vector into a matrix with 2 rows and 3 columns
dimnames(mpref) <- list(gender = c("F", "M"),
                        preference = c("Pink", "Blue", "Black")) #Creates labels for rows and columns (optional)
mpref #prints the matrix with rows and columns if provided
##       preference
## gender Pink Blue Black
##      F   23   13     4
##      M   19    9     6
Xsq <- chisq.test(mpref) # performs the Chi square test
## Warning in chisq.test(mpref): Chi-squared approximation may be incorrect
Xsq$observed # prints observed values (same as mpref)
##       preference
## gender Pink Blue Black
##      F   23   13     4
##      M   19    9     6
Xsq$expected #prints expected values
##       preference
## gender    Pink     Blue    Black
##      F 22.7027 11.89189 5.405405
##      M 19.2973 10.10811 4.594595
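
The warning appears because one expected count is below 5 (4.594595, for boys who prefer black), and the chi-square approximation becomes less reliable with small expected counts. A hedged sketch of two alternatives available in base R follows: a Monte Carlo p value through the simulate.p.value argument of chisq.test(), and Fisher's exact test with fisher.test(). The seed, the choice of B = 10000 replicates, and the names low_cells, sim and exact are assumptions made for illustration and are not part of the original analysis.

```r
# Rebuild the 2 x 3 table from the example (matrix() fills column by column)
mpref <- matrix(c(23, 19, 13, 9, 4, 6), nrow = 2, ncol = 3,
                dimnames = list(gender = c("F", "M"),
                                preference = c("Pink", "Blue", "Black")))

# Flag the cells whose expected count is below 5
# (suppressWarnings() hides the warning already shown above)
expected <- suppressWarnings(chisq.test(mpref))$expected
low_cells <- expected < 5
low_cells

# Alternative 1: p value from Monte Carlo simulation instead of the
# chi-square approximation
set.seed(1)  # arbitrary seed so the simulation is repeatable
sim <- chisq.test(mpref, simulate.p.value = TRUE, B = 10000)
sim$p.value

# Alternative 2: Fisher's exact test, which also accepts tables larger than 2 x 2
exact <- fisher.test(mpref)
exact$p.value
```

Either p value can be read in the same way as the one from the standard test: a value above 0.05 means we retain the null hypothesis of independence.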

If we have an existing data set with qualitative variables, we can read the data into R, create a table with table(), and perform the Chi square analysis on the resulting table. Here is an example using the LungCapData to determine whether there is a relationship between gender and smoking.

LungCapData <- read.csv("~/Desktop/Data/LungCapData.csv")
tbl <- table(LungCapData$Smoke, LungCapData$Gender)
tbl
##      
##       female male
##   no     314  334
##   yes     44   33
chisq.test(tbl)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tbl
## X-squared = 1.7443, df = 1, p-value = 0.1866

As our obtained p value of 0.1866 is greater than 0.05, we would retain the null hypothesis of no relationship between the two variables.
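
Readers who do not have the LungCapData file can still reproduce this test from the counts printed in the table. The sketch below assumes only those four counts; the dimension labels Smoke and Gender and the name res are added here for readability and are not in the original output.

```r
# Rebuild the smoking-by-gender table from the printed counts
# (values are entered column by column: female no/yes, then male no/yes)
tbl <- as.table(matrix(c(314, 44, 334, 33), nrow = 2,
                       dimnames = list(Smoke = c("no", "yes"),
                                       Gender = c("female", "male"))))
tbl

# Same test as above; Yates' correction is applied by default for 2 x 2 tables
res <- chisq.test(tbl)
res$statistic  # should match the X-squared value reported above
res$p.value    # should match the reported p value
res$expected   # expected counts under independence
```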