Department of Environmental Science, AUT

The Chi-squared test: Prerequisites

Content you should have understood before watching this video:

  • Number 2, ‘Variables’
  • Number 3, ‘Variation in data’
  • Number 4, ‘Basic statistical metrics’
  • Number 5, ‘Standard deviation and standard error’
  • Number 6, ‘Populations, samples, hypotheses’
  • Number 7, ‘Distributions’
  • Number 8, ‘Quantiles and probabilities’
  • Number 12, ‘Error types’

Reminder: Test statistics

  • For the t-test, we needed a metric that reflects
    • the difference between the sample means
    • the standard deviations of the samples

The t-value was great for that. We compared it to a random t-distribution.

The fundamental question is always: ‘How does the test statistic of our sample compare to a random distribution of that test statistic?’

The Chi-squared test: dancing cats

  • What if both the predictor and the response are categorical variables? The principle is the same, but our test statistic is different!

Here, we analyse frequencies

An example:

  • Can animals be trained to line-dance with different rewards?
  • Participants: 200 cats
  • Training: the animal was trained using either food or affection, not both
  • The animal then either learnt to line-dance or it did not
  • Outcome: the number of animals (frequency) that could dance or not in each reward category
  • We can tabulate these frequencies in a contingency table

The contingency table

head(d1) #long format
  dance reward
1    no      F
2    no      F
3    no      F
4    no      F
5    no      F
6    no      F
Learn to dance?    Affection as reward    Food as reward    Total
NO                 114                    10                124
YES                48                     28                76
TOTAL              162                    38                200

The contingency table


If you would like to produce a contingency table in R, use

table(d1) 
     reward
dance   A   F
  no  114  10
  yes  48  28
addmargins(table(d1)) # if you would like to include the margin sums:
     reward
dance   A   F Sum
  no  114  10 124
  yes  48  28  76
  Sum 162  38 200
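
The data frame d1 itself is not reproduced in full in these slides. If you want to recreate the example yourself, a long-format data frame with the same frequencies can be built as follows; this is only a sketch of how d1 might have been created, with the level coding (A/F, no/yes) read off the output above:

d1 = data.frame(
  dance  = rep(c("no", "yes", "no", "yes"), times = c(114, 48, 10, 28)),
  reward = rep(c("A", "F"), times = c(162, 38))
)
table(d1)  # reproduces the contingency table above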

Pearson’s Chi-squared test


What frequencies would we expect if there were no association between the two variables?

Imagine we had a nicely balanced design…

Learn to dance?    Affection as reward    Food as reward    Total
NO                 ?                      ?                 100
YES                ?                      ?                 100
TOTAL              100                    100               200

What frequencies would we expect?

Pearson’s Chi-squared test


Probably 50 in each case; that seems intuitive.

If you calculate the expected frequencies formally, each one is the row total times the column total, divided by the grand total (100 x 100 / 200 = 50)

Learn to dance?    Affection as reward    Food as reward    Total
NO                 50                     50                100
YES                50                     50                100
TOTAL              100                    100               200

We call these our modelled or expected frequencies
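
The same rule works for the real, unbalanced cat data. A minimal R sketch (assuming the d1 data frame from the contingency-table slides; obs and E are just names used here):

obs = table(d1)                                     # observed frequencies
E   = outer(rowSums(obs), colSums(obs)) / sum(obs)  # row total x column total / n, for every cell
E                                                   # expected: 100.44, 23.56 (no row) and 61.56, 14.44 (yes row)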

Pearson’s Chi-squared test


And this is how we arrive at our Chi-squared statistic!

\[\text{model}_{ij} = E_{ij} = \frac{\text{row total}_i \times \text{column total}_j}{n}\]

\[\chi^2 = \sum \frac{(\text{observed}_{ij} - \text{model}_{ij})^2}{\text{model}_{ij}} \]

  • \(i\) represents the rows in the contingency table
  • \(j\) represents the columns

The observed data are the frequencies in the actual contingency table

The resulting Chi-squared value is then checked against a random Chi-squared distribution with \((r-1)(c-1)\) degrees of freedom, where \(r\) and \(c\) stand for the number of rows and columns
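
Continuing the sketch from the expected-frequencies slide, both formulas translate directly into R (obs and E as defined above; chisq_manual and degf are just names used here):

chisq_manual = sum((obs - E)^2 / E)        # the Chi-squared statistic, approx. 25.35 for the cat data
degf = (nrow(obs) - 1) * (ncol(obs) - 1)   # (2 - 1) x (2 - 1) = 1 degree of freedom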

Pearson’s Chi-squared test


Why one degree of freedom in a 2x2 contingency table? What are degrees of freedom again?

  • Because, given the row and column totals, you can choose exactly one cell freely; the remaining cells are then locked in! Try it out (see the worked example below).
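
For example, with the cat data’s margins (row totals 124 and 76, column totals 162 and 38), fixing the NO/affection cell at 114 locks in every other cell:

\[\text{NO/food} = 124 - 114 = 10, \qquad \text{YES/affection} = 162 - 114 = 48, \qquad \text{YES/food} = 76 - 48 = 28\]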

What is the null hypothesis?

‘The two variables are not significantly related’

or

‘Variable 1 is independent of variable 2’

or (if you only have one variable):

The counts per level are not significantly disproportionate


Pearson’s Chi-squared test in R (manually)

  • Calculate the modelled frequencies
  • Sum up the squared differences between the observed and modelled frequencies, each divided by the modelled frequency!
  • Compare your computed Chi-squared value (your test statistic) to a random Chi-squared distribution with the appropriate degrees of freedom

Use the function pchisq() for this purpose (analogous to pnorm(), pt(), etc.):

model_fy = 76*38/200;  model_fn = 124*38/200    # expected frequencies: food & yes, food & no
model_ay = 76*162/200; model_an = 124*162/200   # expected frequencies: affection & yes, affection & no
chisq = sum( (28 - model_fy)^2/model_fy + (10 - model_fn)^2/model_fn +
             (48 - model_ay)^2/model_ay + (114 - model_an)^2/model_an )
1 - pchisq(q = chisq, df = 1)  # upper-tail p-value; chisq is approx. 25.35 here

Pearson’s Chi-squared test in R


1 - pchisq(q = 25.35, df = 1)
[1] 4.781524e-07

This p-value is far below 0.05, so we reject the null hypothesis: reward type and learning to dance are related.

Pearson’s Chi-squared test in R

  • Create a data frame in R that contains the contingency table you would like to test
  • Use the chisq.test() function on it
cats = data.frame(food = c(28, 10), affection = c(48, 114))
cats
  food affection
1   28        48
2   10       114
chisq.test(cats)

    Pearson's Chi-squared test with Yates' continuity correction

data:  cats
X-squared = 23.52, df = 1, p-value = 1.236e-06

Note that the computed Chi-squared value is slightly different from the one we calculated by hand. This is because chisq.test() applies Yates' continuity correction to 2x2 tables by default (hence the heading in the output), which we need not worry about now.
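
If you would like to reproduce the uncorrected value from the manual calculation, the correction can be switched off; a small aside, not something you need to do routinely:

chisq.test(cats, correct = FALSE)  # X-squared now matches the approx. 25.35 computed by hand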

In a nutshell

  • The Chi-squared test is used to test for an association between two categorical variables
  • Tabulate the frequencies in a ‘contingency’ table and use it to test the null hypothesis that the two variables are not related
  • It is useful to be able to do a Chi-squared test by hand, both to practice R and to understand how the test works
  • Conducting the test in R is quick and easy!