18 - The Chi-squared test

Department of Environmental Science, AUT

The Chi-squared test: Prerequisites

Content you should have understood before watching this video:

Number 2, ‘Variables’
Number 3, ‘Variation in data’
Number 4, ‘Basic statistical metrics’
Number 5, ‘Standard deviation and standard error’
Number 6, ‘Populations, samples, hypotheses’
Number 7, ‘Distributions’
Number 8, ‘Quantiles and probabilities’
Number 12, ‘Error types’

Reminder: Test statistics

Chi-squared test

For the t-test, we needed a metric that reflects
- the difference between the samples
- the standard deviations of the samples

The t-value was great for that. We compared the t-value of our sample to the t-distribution we would expect by chance alone.

The fundamental question always is: ‘How does the test statistic of our sample compare to a random distribution of that test statistic?’

The Chi-squared test: dancing cats


What if both the predictor and the response are categorical variables? The principle is the same, but our test statistic is different!

Here, we analyse frequencies

An example:

Can animals be trained to line-dance with different rewards?
Participants: 200 cats
Training: the animal was trained using either food or affection, not both
The animal then either learnt to line-dance or it did not
Outcome: the number of animals (frequency) that could dance or not in each reward category
We can tabulate these frequencies in a contingency table

The contingency table


head(d1) #long format

  dance reward
1    no      F
2    no      F
3    no      F
4    no      F
5    no      F
6    no      F

Learn to dance?	Affection as reward	Food as reward	Total
NO	114	10	124
YES	48	28	76
TOTAL	162	38	200
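The long-format data and the contingency table hold the same information. As a minimal sketch, data like d1 could be rebuilt from the four cell counts as follows. The construction and the row order are assumptions made for illustration; only the counts come from the table.

```r
# Hypothetical reconstruction of d1: one row per cat (long format).
# The four counts are taken from the contingency table above.
counts = c(10, 28, 114, 48)   # no/F, yes/F, no/A, yes/A
d1 = data.frame(
  dance  = rep(c("no", "yes", "no", "yes"), times = counts),
  reward = rep(c("F",  "F",   "A",  "A"),   times = counts)
)

head(d1)          # first rows of the long format
tab = table(d1)   # cross-tabulate dance by reward
tab
```

Each row of d1 is one cat, so nrow(d1) should equal the grand total of 200.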

The contingency table


If you would like to produce a contingency table in R, use

table(d1)

     reward
dance   A   F
  no  114  10
  yes  48  28

addmargins(table(d1)) # if you would like to include the margin sums:

     reward
dance   A   F Sum
  no  114  10 124
  yes  48  28  76
  Sum 162  38 200

Pearson’s Chi-squared test


What frequencies would we expect if there was no association between the two variables?

Imagine we had a nicely balanced design…

Learn to dance?	Affection as reward	Food as reward	Total
NO	?	?	100
YES	?	?	100
TOTAL	100	100	200

What frequencies would we expect?

Pearson’s Chi-squared test


Probably 50 in each case, that seems intuitive.

To calculate an expected frequency, multiply the row total by the column total and divide by the grand total (100 × 100 / 200 = 50).

Learn to dance?	Affection as reward	Food as reward	Total
NO	50	50	100
YES	50	50	100
TOTAL	100	100	200

We call these our modelled or expected frequencies
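The row-total-times-column-total rule can be sketched in R with outer(). The helper name expected_freq is made up for this illustration; the margins are those of the balanced example and of the dancing cats table.

```r
# Expected frequencies: (row total x column total) / grand total.
# expected_freq is a hypothetical helper, not a built-in R function.
expected_freq = function(row_totals, col_totals) {
  outer(row_totals, col_totals) / sum(row_totals)
}

# Balanced design: every cell is expected to hold 50 cats
balanced = expected_freq(c(no = 100, yes = 100),
                         c(affection = 100, food = 100))
balanced

# Margins of the observed dancing cats table
cats_expected = expected_freq(c(no = 124, yes = 76),
                              c(affection = 162, food = 38))
cats_expected
```

With unbalanced margins the expected frequencies are no longer equal across cells, which is why the general rule is needed.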

Pearson’s Chi-squared test


And this is how we arrive at our Chi-squared statistic!

\[\text{model}_{ij} = E_{ij} = \frac{\text{row total}_i \times \text{column total}_j}{n}\]

\[\chi^2 = \sum \frac{(\text{observed}_{ij} - \text{model}_{ij})^2}{\text{model}_{ij}} \]

\(i\) represents the rows in the contingency table
\(j\) represents the columns

The observed data are the frequencies in the actual contingency table

The resulting Chi-squared value is then checked against a random Chi-squared distribution with \((r-1)(c-1)\) degrees of freedom, where \(r\) and \(c\) are the numbers of rows and columns

Pearson’s Chi-squared test


Why one degree of freedom in a 2x2 contingency table? What are degrees of freedom again?

Because, given the row and column totals, you can choose exactly one number freely; the remaining numbers are then locked in! Try it out.
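One way to try it out: a small sketch that fills a 2x2 table from its margins and a single freely chosen cell. The function name fill_2x2 and its arguments are made up for this illustration.

```r
# Fill a 2x2 table from its margins and ONE freely chosen cell.
# fill_2x2 is a hypothetical helper for illustration only.
fill_2x2 = function(row_totals, col_totals, top_left) {
  top_right    = row_totals[1] - top_left      # forced by row 1 total
  bottom_left  = col_totals[1] - top_left      # forced by column 1 total
  bottom_right = row_totals[2] - bottom_left   # forced by row 2 total
  # matrix() fills column by column
  matrix(c(top_left, bottom_left, top_right, bottom_right), nrow = 2)
}

# Dancing cats margins (rows: no/yes, columns: affection/food),
# choosing only the 'no, affection' cell freely
m = fill_2x2(row_totals = c(124, 76), col_totals = c(162, 38), top_left = 114)
m
```

Once top_left is chosen, the other three cells follow from the totals, so only one value is free.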

What is the null hypothesis?

‘The two variables are not related’

‘Variable 1 is independent of variable 2’

or (if you only have one variable):

‘The counts per level are in the expected proportions’
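For the one-variable case, a minimal sketch using the row totals of the dancing cats table. Testing whether 'no' and 'yes' are equally frequent is an assumption made purely for illustration; it is not a question posed in the example.

```r
# One categorical variable: do the counts per level match equal proportions?
# The counts are the row totals of the dancing cats table.
dance_counts = c(no = 124, yes = 76)

# With a vector of counts, chisq.test() defaults to equal expected proportions
gof = chisq.test(dance_counts)

gof$expected    # 100 per level under the null hypothesis
gof$statistic   # (124 - 100)^2/100 + (76 - 100)^2/100
gof$parameter   # degrees of freedom: number of levels minus 1
gof$p.value
```

Other expected proportions could be supplied through the p argument of chisq.test().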


Pearson’s Chi-squared test in R (manually)


Calculate the modelled frequencies
For each cell, square the difference between the observed and the modelled frequency, divide by the modelled frequency, then sum over all cells!
Compare your computed Chi-squared value (your test statistic) to the distribution of random numbers that follow a Chi-squared distribution

Use the function pchisq() for this purpose (analogous to pnorm(), pt(), etc.):

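model_fy = 76*38/200;  model_fn = 124*38/200    # food: yes / no
model_ay = 76*162/200; model_an = 124*162/200   # affection: yes / no
chisq = (28 - model_fy)^2/model_fy + (10 - model_fn)^2/model_fn +
        (48 - model_ay)^2/model_ay + (114 - model_an)^2/model_an
chisq                                           # approximately 25.36
1 - pchisq(q = chisq, df = 1)                   # upper-tail probability = p-value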

Pearson’s Chi-squared test in R


1 - pchisq(q = 25.35, df = 1)
[1] 4.781524e-07
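The same comparison can be sketched with a critical value from qchisq(). The 5% significance level is an assumption for illustration; lower.tail = FALSE is an equivalent way of writing 1 - pchisq().

```r
chisq_value = 25.35   # test statistic from the manual calculation

# Critical value: the 95% quantile of the Chi-squared distribution, df = 1
# (a 5% significance level is assumed here)
critical = qchisq(p = 0.95, df = 1)
critical

# Is our test statistic more extreme than the critical value?
chisq_value > critical

# Upper-tail probability, equivalent to 1 - pchisq(q = chisq_value, df = 1)
p_upper = pchisq(q = chisq_value, df = 1, lower.tail = FALSE)
p_upper
```

A test statistic above the critical value and a p-value below 0.05 are two views of the same decision.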

Pearson’s Chi-squared test in R


Create a data frame in R that contains the contingency table you would like to test
Use the chisq.test() function on it

cats = data.frame(food = c(28, 10), affection = c(48, 114))
cats
  food affection
1   28        48
2   10       114
chisq.test(cats)

    Pearson's Chi-squared test with Yates' continuity correction

data:  cats
X-squared = 23.52, df = 1, p-value = 1.236e-06

Note that the computed chi-squared value is slightly different from the one we calculated. This is due to Yates' continuity correction, which chisq.test() applies to 2x2 tables by default and which we need not worry about now.
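To check the manual calculation against R, the correction can be switched off with the correct argument of chisq.test(). A minimal sketch, reusing the cats data frame from above:

```r
# Same data frame as above: row 1 = learnt to dance, row 2 = did not
cats = data.frame(food = c(28, 10), affection = c(48, 114))

# correct = FALSE switches off Yates' continuity correction
uncorrected = chisq.test(cats, correct = FALSE)

uncorrected$statistic   # approximately 25.36, as in the manual calculation
uncorrected$expected    # the modelled frequencies used by the test
uncorrected$p.value
```

The expected component also gives a quick way to inspect the modelled frequencies without computing them by hand.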

In a nutshell


The Chi-squared test is used to test data sets with two categorical variables
Tabulate the frequencies in a ‘contingency’ table and use it to test the null hypothesis that the two variables are not related
It is useful to be able to do a Chi-squared test by hand - both to practise R and to understand how it works
Conducting the test in R is quick and easy!