Content you should have understood before watching this video:
- Number 2, ‘Variables’
- Number 3, ‘Variation in data’
- Number 4, ‘Basic statistical metrics’
- Number 5, ‘Standard deviation and standard error’
- Number 6, ‘Populations, samples, hypotheses’
- Number 7, ‘Distributions’
- Number 8, ‘Quantiles and probabilities’
- Number 12, ‘Error types’
Reminder: Test statistics
- For the t-test, we needed a metric that reflects
  - the difference between the samples
  - the standard deviations of the samples
The t-value was great for that. We compared the t-value of our sample to a t-distribution of random values.
The fundamental question always is: ‘How does the test statistic of our sample compare to a random distribution of that test statistic?’
The Chi-squared test: dancing cats
- What if both the predictor and the response are categorical variables? The principle is the same, but our test statistic is different!
Here, we analyse frequencies
An example:
- Can animals be trained to line-dance with different rewards?
- Participants: 200 cats
- Training: the animal was trained using either food or affection, not both
- The animal then either learnt to line-dance or it did not
- Outcome: the number of animals (frequency) that could dance or not in each reward category
- We can tabulate these frequencies in a contingency table
The contingency table
head(d1) # long format
  dance reward
1    no      F
2    no      F
3    no      F
4    no      F
5    no      F
6    no      F
Learn to dance? | Affection as reward | Food as reward | Total |
---|---|---|---|
NO | 114 | 10 | 124 |
YES | 48 | 28 | 76 |
TOTAL | 162 | 38 | 200 |
If you would like to produce a contingency table in R, use
table(d1)
     reward
dance   A   F
  no  114  10
  yes  48  28
addmargins(table(d1)) # if you would like to include the margin sums
     reward
dance   A   F Sum
  no  114  10 124
  yes  48  28  76
  Sum 162  38 200
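The slides do not show how `d1` was created. The following is a minimal sketch that rebuilds a long-format data frame from the cell counts in the table, assuming `d1` has exactly the two columns `dance` and `reward`; the row order is arbitrary and will differ from the original file.

```r
# Hypothetical reconstruction of d1 from the cell counts shown above
# (the original data file is not part of these notes).
cell_counts <- c(114, 10, 48, 28)
d1 <- data.frame(
  dance  = rep(c("no", "no", "yes", "yes"), times = cell_counts),
  reward = rep(c("A", "F", "A", "F"), times = cell_counts)
)

tab <- table(d1)   # rows: dance, columns: reward
tab
addmargins(tab)    # adds row totals, column totals and the grand total
```

Each row of `d1` is one cat, so `table()` simply counts how often each combination of `dance` and `reward` occurs.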
Pearson’s Chi-squared test
What frequencies would we expect if there was no association between the two variables?
Imagine we had a nicely balanced design…
Learn to dance? | Affection as reward | Food as reward | Total |
---|---|---|---|
NO | ? | ? | 100 |
YES | ? | ? | 100 |
TOTAL | 100 | 100 | 200 |
What frequencies would we expect?
Probably 50 in each cell, which seems intuitive.
To calculate an expected frequency, multiply the row total by the column total and divide by the grand total (100 × 100 / 200 = 50).
Learn to dance? | Affection as reward | Food as reward | Total |
---|---|---|---|
NO | 50 | 50 | 100 |
YES | 50 | 50 | 100 |
TOTAL | 100 | 100 | 200 |
We call these our modelled or expected frequencies.
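The same rule works for unbalanced designs. As a sketch, the expected frequencies for both the balanced example and the observed cat data can be computed in one step with `outer()`; the object names `balanced`, `observed` and `expected` are chosen here for illustration only.

```r
# Balanced design: every row and column total is 100, grand total is 200
balanced <- outer(c(100, 100), c(100, 100)) / 200
balanced   # every cell is 50

# Observed frequencies from the cat example (rows: dance, columns: reward)
observed <- matrix(c(114, 10,
                      48, 28),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(dance = c("no", "yes"),
                                   reward = c("A", "F")))

# Expected frequency per cell: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected
```

Note that the expected frequencies keep the same row and column totals as the observed table; only the distribution within the table changes.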
And this is how we arrive at our Chi-squared statistic!
\[\text{model}_{ij} = E_{ij} = \frac{\text{row total}_i \times \text{column total}_j}{n}\]
\[\chi^2 = \sum_{i}\sum_{j} \frac{(\text{observed}_{ij} - \text{model}_{ij})^2}{\text{model}_{ij}} \]
- \(i\) represents the rows in the contingency table
- \(j\) represents the columns
The observed data are the frequencies in the actual contingency table
The resulting Chi-squared value is then checked against a random Chi-squared distribution with \((r-1)(c-1)\) degrees of freedom, where \(r\) and \(c\) are the numbers of rows and columns.
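To make the comparison concrete, here is a small sketch of how the degrees of freedom and a cut-off value could be obtained in R. The 5% significance level is an assumption made for this illustration and is not stated on the slide.

```r
n_rows <- 2
n_cols <- 2
df <- (n_rows - 1) * (n_cols - 1)   # 1 for a 2x2 table

# Value that a Chi-squared statistic exceeds only 5% of the time
# when the null hypothesis is true (assumed significance level: 0.05)
critical_value <- qchisq(p = 0.95, df = df)
critical_value
```

A Chi-squared value from the sample that is larger than this cut-off would be considered unlikely under the null hypothesis.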
Why one degree of freedom in a 2x2 contingency table? What are degrees of freedom again?
- Because, given the row and column totals, you can choose exactly one number freely. The remaining numbers are then locked in! Try it out.
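One way to try it out is sketched below, using the totals from the cat example. Which cell is chosen freely is arbitrary; the variable names are made up for this illustration.

```r
row_totals <- c(no = 124, yes = 76)
col_totals <- c(A = 162, F = 38)

# Choose one cell freely (here: cats that did not dance, affection as reward)
no_A <- 114

# The other three cells follow from the row and column totals
no_F  <- row_totals[["no"]]  - no_A
yes_A <- col_totals[["A"]]   - no_A
yes_F <- row_totals[["yes"]] - yes_A

c(no_A = no_A, no_F = no_F, yes_A = yes_A, yes_F = yes_F)
```

Changing `no_A` changes all three remaining cells, but none of them can be varied independently.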
What is the null hypothesis?
‘The two variables are not related’
or
‘Variable 1 is independent of variable 2’
or (if you only have one variable):
‘The counts per level do not differ from the expected proportions’
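The slides give no example for the one-variable case. As a purely illustrative sketch, `chisq.test()` can be applied to a single vector of counts, here the two reward group sizes from the cat example; by default it tests against equal proportions in all levels.

```r
# Counts per level of a single categorical variable (reward group sizes)
reward_counts <- c(A = 162, F = 38)

# With a single vector, chisq.test() tests against equal expected proportions
gof <- chisq.test(reward_counts)
gof$expected    # 100 cats per group under the null hypothesis
gof$statistic
gof$p.value
```

Other expected proportions can be supplied through the `p` argument of `chisq.test()`.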
Pearson’s Chi-squared test in R (manually)
- Calculate the modelled frequencies
- For each cell, square the difference between the observed and the modelled frequency and divide it by the modelled frequency, then sum over all cells!
- Compare your computed Chi-squared value (your test statistic) to the distribution of random numbers that follow a Chi-squared distribution
Use the function pchisq() for this purpose (analogous to pnorm(), pt(), etc.):
model_fy = 76*38/200; model_fn = 124*38/200 # etc...
chisq = sum( (28 - model_fy)^2/model_fy + ...
pchisq(q = chisq, df = 1)
Pearson’s Chi-squared test in R
1 - pchisq(q = 25.35, df = 1)
[1] 4.781524e-07
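The manual steps can be written out in full roughly as follows. The names `model_fy` and `model_fn` follow the slide; all other variable names (`obs_*`, `model_ay`, `model_an`, `p_value`) are assumptions made for this sketch.

```r
# Observed counts from the contingency table
obs_an <- 114; obs_fn <- 10   # did not dance: affection, food
obs_ay <- 48;  obs_fy <- 28   # danced: affection, food

# Modelled frequencies: row total * column total / grand total
model_an <- 124 * 162 / 200; model_fn <- 124 * 38 / 200
model_ay <- 76 * 162 / 200;  model_fy <- 76 * 38 / 200

# Chi-squared statistic: sum of (observed - modelled)^2 / modelled over all cells
chisq <- sum((obs_an - model_an)^2 / model_an,
             (obs_fn - model_fn)^2 / model_fn,
             (obs_ay - model_ay)^2 / model_ay,
             (obs_fy - model_fy)^2 / model_fy)
chisq   # about 25.36

# The p-value is the upper tail probability, i.e. 1 - pchisq(chisq, df = 1)
p_value <- pchisq(q = chisq, df = 1, lower.tail = FALSE)
p_value
```

Using `lower.tail = FALSE` is equivalent to `1 - pchisq(...)` but is numerically more accurate for very small p-values.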
Pearson’s Chi-squared test in R
- Create a data frame in R that contains the contingency table you would like to test
- Use the chisq.test() function on it
cats = data.frame(food = c(28, 10), affection = c(48, 114))
cats
  food affection
1   28        48
2   10       114
chisq.test(cats)

	Pearson's Chi-squared test with Yates' continuity correction

data:  cats
X-squared = 23.52, df = 1, p-value = 1.236e-06
Note that the computed Chi-squared value is slightly different from the one we calculated. This is due to Yates' continuity correction, which chisq.test() applies to 2x2 tables by default and which we need not worry about now.
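To check that the difference really comes from the correction, the test can be rerun without it. This is a short sketch using the `correct` argument of `chisq.test()`; the object names `uncorrected` and `corrected` are chosen here for illustration.

```r
cats <- data.frame(food = c(28, 10), affection = c(48, 114))

# Without the continuity correction: matches the manual calculation
uncorrected <- chisq.test(cats, correct = FALSE)
uncorrected$statistic   # about 25.36
uncorrected$expected    # the modelled frequencies

# Default behaviour for 2x2 tables: Yates' continuity correction is applied
corrected <- chisq.test(cats)
corrected$statistic     # about 23.52
```

Both versions lead to the same conclusion here, because the p-value is far below any common significance level either way.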
In a nutshell
- The Chi-squared test is used to test whether two categorical variables are related
- Tabulate the frequencies in a ‘contingency’ table and use it to test the null hypothesis that the two variables are not related
- It is useful to be able to do a Chi-squared test by hand - both to practice R and to understand how it works
- Conducting the test in R is quick and easy!