Chi-square tests in R

Chi-square tests are used to examine relationships between variables measured at the nominal scale. These tests look for differences among observed frequencies or departures from expected frequencies.

The test statistic is probably familiar to anyone who has taken genetics:

\(\chi^2=\sum\frac{\left(\text{observed}-\text{expected}\right)^2}{\text{expected}}\)

The test result is always reported with its degrees of freedom: DF = (number of categories) - 1 - (number of parameters estimated from the data).
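As a rough illustration of the degrees-of-freedom rule, the sketch below works through a hypothetical case with two categories and no parameters estimated from the data, then looks up the matching critical value at a significance level of 0.05 with base R's qchisq(). The variable names are chosen only for this example.

```r
# hypothetical case: 2 categories, no parameters estimated from the data
n_categories <- 2
n_estimated  <- 0

# DF = (number of categories) - 1 - (number of estimated parameters)
dof <- n_categories - 1 - n_estimated

# critical value of the chi-square distribution at alpha = 0.05
alpha <- 0.05
crit  <- qchisq(1 - alpha, df = dof)

dof
crit
```

With one degree of freedom the critical value is about 3.84, so a chi-square statistic larger than that would be significant at the 0.05 level.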

The assumptions of this test are:

Observations are taken at the nominal scale, and the categories are mutually exclusive.
Observations are independent of one another.
No category has an expected frequency of less than 1 and, when there are many categories, no more than 20% of them have an expected frequency < 5.
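The expected-frequency assumption can be checked mechanically. The sketch below is one reading of that rule, written as a small helper; check_expected() is a hypothetical function made up for this example (it is not part of base R), and the expected counts passed to it are invented for illustration.

```r
# hypothetical helper (not part of base R): checks the expected-frequency rule
# - no expected count below 1
# - no more than 20% of expected counts below 5
check_expected <- function(expected) {
  none_below_1  <- all(expected >= 1)
  share_below_5 <- mean(expected < 5)
  none_below_1 && share_below_5 <= 0.20
}

# made-up expected counts, for illustration only
ok_counts    <- c(20, 15, 12, 9, 8, 7, 6, 6, 5, 4)  # 1 of 10 is below 5
tiny_count   <- c(20, 15, 0.5, 9)                   # one count is below 1
too_many_low <- c(10, 4, 3, 8)                      # half are below 5

check_expected(ok_counts)
check_expected(tiny_count)
check_expected(too_many_low)
```

Only the first vector passes. In practice the expected counts can be taken from the `expected` element of the object returned by chisq.test().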

Goodness-of-fit

We can use chi-square to see if a set of observations fit our expectations.

Let’s say that we have a bunch of flowers, and we think that flower petal coloration is determined by a simple Mendelian gene. If this were true, then we would expect a color ratio of 3 yellow to 1 white. We observed 100 flowers and found 84 yellow and 16 white. Do these counts follow the expected Mendelian ratio, testing at a significance level of 0.05?

So, we need to determine what our expected frequencies are. Given 100 flowers with a 3:1 ratio, we would expect 75 yellow and 25 white.

\(H_0\): Ratio of yellow to white is 3 : 1.
\(H_a\): Ratio of yellow to white is not 3 : 1.

In R we use the chisq.test() function:

# make a vector of observed counts (yellow, white)
obs <- c(84, 16)

# make a vector of expected *proportions*
# (named expected_p so it does not mask the base function exp())
expected_p <- c(0.75, 0.25)

# do the test
chisq.test(x = obs, p = expected_p)

## 
##  Chi-squared test for given probabilities
## 
## data:  obs
## X-squared = 4.32, df = 1, p-value = 0.03767

The chi-square value of 4.32 gives a P-value of 0.04, which is below 0.05, so we reject the null hypothesis that the flowers have a 3 : 1 color ratio. Based on this, we conclude that flower color in this sample does not follow the ratio expected for a simple Mendelian gene.
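To see where these numbers come from, the statistic can also be computed by hand from the formula given earlier. This is a minimal sketch using the flower counts from the example; the variable names are chosen for illustration, and pchisq() is the base R chi-square distribution function.

```r
# observed counts and hypothesized proportions from the flower example
obs    <- c(yellow = 84, white = 16)
p_null <- c(0.75, 0.25)

# expected counts under the 3 : 1 ratio (75 yellow, 25 white)
expected <- sum(obs) * p_null

# chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum((obs - expected)^2 / expected)

# upper-tail P-value with DF = 2 categories - 1 = 1
p_value <- pchisq(chi_sq, df = length(obs) - 1, lower.tail = FALSE)

chi_sq
p_value
```

The result should agree with the chisq.test() output: a statistic of 4.32 and a P-value of about 0.038.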

Test of association

Another use of chi-square is to see if disproportionate relationships exist between a categorical variable and a grouping variable. Consider this example that compares severity of lesions with age. For these data,

\(H_0\): There is no relationship between age and lesion severity.

\(H_a\): Lesion severity depends on age.

The raw data file is available at the URL used in the code below.

# bring in the data. If you downloaded the file, replace the URL with your file path.
lesions <- read.csv(url("https://raw.githubusercontent.com/nmccurtin/CSVfilesbiostats/master/lesions.csv"))

# make grouping variables
lesions$age <- factor(lesions$age, 
 levels = c("30-39", "40-49", "50-59", "60-69"))

# make those data a table
lesionsTable <- table(lesions$age, lesions$severity)

# and add the marginal totals on the table
addmargins(lesionsTable)

##        
##           1   2   3 Sum
##   30-39  18   7   9  34
##   40-49  56  29  12  97
##   50-59  83  38  23 144
##   60-69  62  25  18 105
##   Sum   219  99  62 380

# If you want a mosaic plot of these data
mosaicplot( t(lesionsTable), col = c("red2", "yellow", "green3", "blue"), 
 cex.axis = 1, main = "", las = 2,
 sub = "Lesion severity", ylab = "Age")

# do the test of association
chisq.test(lesions$age, lesions$severity, correct = FALSE)

## 
##  Pearson's Chi-squared test
## 
## data:  lesions$age and lesions$severity
## X-squared = 4.4439, df = 6, p-value = 0.6168

In this case, the P-value of 0.62 is much higher than the usual significance level of 0.05, so we fail to reject the null hypothesis. These data give no evidence that lesion severity depends on age.
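The expected counts and degrees of freedom behind this result can be inspected without the raw file. The sketch below rebuilds the contingency table from the counts printed above, so it assumes those printed counts are complete. For a table of this kind the degrees of freedom are (rows - 1) × (columns - 1), and the variable names are chosen for illustration.

```r
# rebuild the contingency table from the counts printed above
# (rows = age group, columns = lesion severity 1-3)
counts <- matrix(
  c(18,  7,  9,
    56, 29, 12,
    83, 38, 23,
    62, 25, 18),
  nrow = 4, byrow = TRUE,
  dimnames = list(age = c("30-39", "40-49", "50-59", "60-69"),
                  severity = c("1", "2", "3"))
)

# expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
round(expected, 1)

# degrees of freedom for an r x c table: (r - 1) * (c - 1)
dof <- (nrow(counts) - 1) * (ncol(counts) - 1)

# the same test, run on the matrix instead of the raw columns
result <- chisq.test(counts, correct = FALSE)
result$statistic
result$parameter
```

The smallest expected count is about 5.5, so the expected-frequency assumption holds here, and the statistic and degrees of freedom should match the output above (4.4439 with DF = 6).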

Conclusion

There are other applications of the chi-square test, but these two will get you pretty far. Soon we’ll move on to data measured at the ratio scale.