BIO2POS Lecture Topic 6B

class: middle
background-image: url(data:image/png;base64,#LTU_logo_clear.jpg)
background-position: top left
background-size: 25%

# BIO2POS 
# Chi-Square Test of Association for Categorical Data
## Data Analysis Topic 6B
### La Trobe University

---

# Welcome!

### In our final DA lecture on new content, we will introduce the Chi-Square Test of Association for categorical data.

Over the following slides, we will cover:

* .orangered_style[Chi-Square Test of Association]

* Definition and Hypotheses
    
--

* Two-way table calculations
    
--

* Interpreting Output
    
--

* Assumptions

* Notes about the exam
---

# Intended Learning Objectives

### By the end of this lecture you will:

* understand when and how to use a .orangered_style[Chi-Square Test of Association]
  
--

* be able to distinguish between this test and the Chi-Square Goodness of Fit test, and know which is more appropriate to use in a given context

* be able to assess whether Chi-Square Test of Association assumptions have been met

* understand and be able to compute the .orangered_style[Test Statistic],  .orangered_style[expected counts] and .orangered_style[degrees of freedom] for a Chi-Square Test of Association
  
--

* be able to correctly .seagreen_style[interpret] and .seagreen_style[summarise] the results of the Chi-Square Test of Association
  
--

The content you learn in Topic 6B concludes our focus on statistical tests for categorical data.

We will practise content from this topic in this week's DA computer lab, which also includes some additional extension material.

---

# Assessing Categorical Data

Recall we introduced the Chi-Square Goodness of Fit test in the [Topic 6A](https://rpubs.com/LTU_BIO2POS/DA6A) lecture.

* This test is for assessing data for a **single** .seagreen_style[categorical variable].

If we have data for **two** .seagreen_style[categorical variables], our analysis focus will change.

We may like to ask now if there is an .orangered_style[association] between the two variables, e.g.:

* was there an association between `sex` and `survival status` on the Titanic?

* is there an association between crayfish `foraging` habits and `time of day`?

* is there an association between `birth month` and `playing position` of elite Chinese football players?
    
--

The .orangered_style[Chi-Square Test of Association] can be used to test questions like these.

---

# The Chi-Square Test of Association

We can conduct a .orangered_style[Chi-Square Test of Association] (aka Chi-Square Test of Independence) to test if there is an association between two categorical variables.

* In other words, we are testing if the two variables are independent

Our null and alternate hypotheses will be:

<center>
`$H_0$`: There is no association between the two variables (i.e. they are independent)
`$$\text{vs}$$`
`$H_1$`: There is an association between the two variables
</center>

---

# The Chi-Square Test of Association

The test process and calculation steps are similar to what we covered in [Topic 6A](https://rpubs.com/LTU_BIO2POS/DA6A).

We will:

1. Define the null and alternate hypotheses
  
--

2. Calculate the degrees of freedom
  
--

3. Compute the test statistic and `$p$`-value
  
--

4. Check the test assumptions
  
--

5. Reach conclusion (reject `$H_0$`/fail to reject `$H_0$`)
  
--

6. Summarise results

---

# Notation

Since we are now dealing with two categorical variables, it will be helpful to introduce some notation.

Suppose our first categorical variable has `$r$` different categories `$(r \geq 2)$`

* Let `$i \in \{1, 2, \ldots, r\}$` be used as a subscript to denote the category selected for this first categorical variable
  
--

Suppose our second categorical variable has `$c$` different categories `$(c \geq 2)$`

* Let `$j \in \{1, 2, \ldots, c\}$` be used as a subscript to denote the category selected for this second categorical variable

* *Note that `$r$` and `$c$` can be equal or unequal to each other, which is why we use separate notation*
  
--

With this notation, we can record the .orangered_style[Observed Counts] `$O_{ij}$` across all categories.

* E.g. `$O_{13}$` can denote the observed count for category 1 of variable 1 and category 3 of variable 2

---

# Two-way Tables

For a Chi-Square Test of Association, we normally display our observed data in a two-way table.

* Note that we can produce this table in jamovi - you do not need to create it by hand

---

# Chi-Square Test of Association - The Titanic Example

All this notation can be a little overwhelming. To help clarify things, we will walk through the following example.

**Scenario** - .seagreen_style[The sinking of the Titanic]: The Titanic sank on its maiden voyage in 1912 after hitting an iceberg at full speed.

* The Titanic was the largest ship afloat at the time, and was considered *unsinkable*.

* Numbers vary across sources, but approximately 711 individuals on board survived the sinking, while 1490 died `$^\star$`.

`$\star$` .caption_style[  Data from R Core Team (2023).]

---

.caption_style[
Note. From File:RMS Titanic 3.jpg, by [Francis Godolphin Osbourne Stuart](https://en.wikipedia.org/wiki/Francis_Godolphin_Osbourne_Stuart), 1912, Wikimedia Commons ([https://commons.wikimedia.org/](https://commons.wikimedia.org/)). In the public domain.
]

---

# Two-way Table - Titanic Example

Suppose we are interested in determining if there was an association between the two categorical variables `Survived` (Yes/No) and `Sex` (M/F). Hence we have:

`$H_0$`: There is no association between the `Survived` and `Sex` variables
`$$\text{vs}$$`
`$H_1$`: There is an association between the `Survived` and `Sex` variables

Here our .orangered_style[Observed Counts] are `$O_{11} = 1364$`, `$O_{12} = 126$`, `$O_{21} = 367$` and `$O_{22} = 344$`.
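
As a rough illustration, the observed counts above can be stored as a small two-way table and the row, column and grand totals computed from it. This is a minimal Python sketch, not jamovi output. The layout (rows = `Survived` No/Yes, columns = `Sex` M/F) is inferred from the totals used elsewhere in these slides, and the variable names are hypothetical.

```python
# Observed counts O_ij for the Titanic example.
# Assumed layout: rows = Survived (No, Yes); columns = Sex (M, F).
observed = [
    [1364, 126],  # Survived = No:  O_11, O_12
    [367, 344],   # Survived = Yes: O_21, O_22
]

row_totals = [sum(row) for row in observed]        # total per survival status
col_totals = [sum(col) for col in zip(*observed)]  # total per sex
grand_total = sum(row_totals)                      # overall sample size n

print(row_totals, col_totals, grand_total)
# [1490, 711] [1731, 470] 2201
```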

---

# Expected Counts

As part of our Chi-Square Test of Association, we also need to compute .orangered_style[Expected Counts] `$E_{ij}$`.

The .orangered_style[Expected Count] for a given cell is found by multiplying the **row total** by the **column total**, and then dividing by the **overall sample size** (the *grand total*).

The  .orangered_style[Expected Count] for a cell gives us the count value we would expect to see, **if there was no association** between the two variables.

In our .seagreen_style[Titanic example], we have, e.g.:

`$$E_{11} = \dfrac{1490 \times 1731}{2201} \approx 1171.83$$`

`$$E_{21} = \dfrac{711 \times 1731}{2201} \approx 559.17$$`

* Note that these should sum to the column total of `$1731$`
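
The row total times column total over grand total rule can be sketched in a few lines of Python. This is an illustrative helper only (the function name `expected_counts` is hypothetical), assuming rows are `Survived` (No, Yes) and columns are `Sex` (M, F).

```python
def expected_counts(observed):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]

observed = [[1364, 126], [367, 344]]  # Titanic observed counts
expected = expected_counts(observed)

for row in expected:
    print([round(e, 2) for e in row])
# [1171.83, 318.17]
# [559.17, 151.83]
```

The expected counts in each column add back up to that column's total, e.g. 1171.83 + 559.17 = 1731 for the first column.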
  
---

# Two-way tables in jamovi

We can produce simple two-way tables in jamovi, with just the observed counts shown:

---

# Two-way tables in jamovi

We can also easily produce more detailed two-way tables in jamovi, such as:

---

# Chi-Square Test of Association Test Statistic

Using the information in our two-way table, we can compute a .orangered_style[Test Statistic] for our Chi-Square Test of Association:

`$$\text{Test Statistic:  }\,  \chi^2 = \sum_{i = 1}^r \sum_{j = 1}^c \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$$`
--

Here:

* `$\chi^2 \sim \chi^2_{df}$` (approximately, when `$H_0$` is true)

* `$O_{ij}$` is our .orangered_style[observed count] (i.e. frequency) in row `$i$` and column `$j$`
  
--

* `$E_{ij}$` is our .orangered_style[expected count] (i.e. frequency) in row `$i$` and column `$j$`
  
--

* `$r$` is the number of rows (number of categories for variable 1)
  
--

* `$c$` is the number of columns (number of categories for variable 2)
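
For reference, the double sum in the test statistic formula can be written as two nested loops. This is a minimal Python sketch, not how jamovi computes it internally; the function name `chi_square_statistic` is hypothetical, and the Titanic counts are assumed to be laid out with rows = `Survived` (No, Yes) and columns = `Sex` (M, F).

```python
def chi_square_statistic(observed):
    """Sum of (O_ij - E_ij)^2 / E_ij over all cells of a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):    # rows i = 1, ..., r
        for j, o_ij in enumerate(row):    # columns j = 1, ..., c
            e_ij = row_totals[i] * col_totals[j] / n
            chi2 += (o_ij - e_ij) ** 2 / e_ij
    return chi2

titanic_sex = [[1364, 126], [367, 344]]
chi2 = chi_square_statistic(titanic_sex)
print(round(chi2, 2))  # close to the 456.874 reported by jamovi
```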

---

# Chi-Square Test of Association Test Statistic - Titanic Example

While you do not need to calculate this test statistic by hand, the process is outlined below, for your reference, using the .seagreen_style[Titanic example] data:

`$$\chi^2 = \dfrac{(1364 - 1171.8)^2}{1171.8} + \dfrac{(126 - 318.2)^2}{318.2} + \dfrac{(367 - 559.2)^2}{559.2} + \dfrac{(344 - 151.8)^2}{151.8}$$`

`$$\approx 31.5 + 116.1 + 66.1 + 243.4$$`

`$$\approx 457.1$$`
--

* Note that this calculation helps us identify which cells contribute the most to the test statistic value (e.g. the largest contribution, roughly 243 of 457, comes from the discrepancy between the expected and observed number of females who survived)

* *Note that I've done some rounding here for conciseness*
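
To see where the test statistic comes from without rounding, the per-cell contributions can be listed directly. This is a small Python sketch under the assumption that rows are `Survived` (No, Yes) and columns are `Sex` (M, F); the variable names are hypothetical.

```python
observed = [[1364, 126], [367, 344]]  # Titanic observed counts

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Contribution of each cell: (O_ij - E_ij)^2 / E_ij
contributions = [
    [(o - rt * ct / n) ** 2 / (rt * ct / n) for o, ct in zip(row, col_totals)]
    for row, rt in zip(observed, row_totals)
]
total = sum(sum(row) for row in contributions)

for row in contributions:
    print([round(x, 1) for x in row])
print(round(total, 1))
```

The bottom-right cell (females who survived) is the largest single contribution, at a little over half of the total.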
  
---

# Chi-Square Test of Association Degrees of Freedom

Fortunately, calculating the .orangered_style[degrees of freedom] for a Chi-Square Test of Association is much easier than calculating the test statistic and expected counts.

`$$df = (r-1) \times (c-1)$$`

* *Note that this is different to the Goodness of Fit calculation*

For our .seagreen_style[Titanic example], the degrees of freedom will be `$(2-1) \times (2-1) = 1$`.
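
The degrees of freedom rule is simple enough to check in one line of Python. The function name below is hypothetical, and the 2-by-4 table is just a second illustration of the formula.

```python
def degrees_of_freedom(r, c):
    """df = (r - 1) * (c - 1) for a two-way table with r rows and c columns."""
    return (r - 1) * (c - 1)

print(degrees_of_freedom(2, 2))  # Titanic example (2 rows, 2 columns): 1
print(degrees_of_freedom(2, 4))  # a table with 2 rows and 4 columns: 3
```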

---

# Chi-Square Test of Association jamovi output

.pull-left[
<img src="data:image/png;base64,#titanic_survival_barplot.jpg" width="350px" style="display: block; margin: auto;" />
]

.pull-right[
<img src="data:image/png;base64,#chisquare_toa_titanic.jpg" width="350px" style="display: block; margin: auto;" />
]

We observe a Chi-Square test statistic `$\chi_1^2 = 456.874$`, with `$df=1$`, and an associated `$p$`-value `$<.001$`.

---

# Chi-Square distribution

*  A value as large as `$\chi_1^2 = 456.874$` is extremely unlikely if `$H_0$` is true, hence the minuscule `$p$`-value!
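
As a rough check on how small this `$p$`-value is, the upper tail of a Chi-Square distribution with `$df = 1$` can be computed with Python's standard library. This sketch relies on the fact that a Chi-Square variable with `$df = 1$` is the square of a standard normal variable, so it only applies when `$df = 1$`. The function name is hypothetical, and jamovi simply reports the result as `$p < .001$`.

```python
import math

def chi_square_p_value_df1(chi2):
    """Upper-tail p-value for a Chi-Square statistic with df = 1 only."""
    # P(Z^2 >= chi2) = P(|Z| >= sqrt(chi2)) = erfc(sqrt(chi2 / 2))
    return math.erfc(math.sqrt(chi2 / 2))

p = chi_square_p_value_df1(456.874)
print(p < 0.001)  # True
```

For comparison, a test statistic of about 3.84 gives a `$p$`-value of 0.05 when `$df = 1$`.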
  
---

# Checking Assumptions

Before we summarise our test results, we should check that the .orangered_style[test assumptions] are satisfied.

The good news is that the test assumptions for the .orangered_style[Chi-Square Test of Association]  are identical to those of the .orangered_style[Chi-Square Goodness of Fit] we covered in [Topic 6A](https://rpubs.com/LTU_BIO2POS/DA6A), i.e.:

1. There are **no expected counts of 0**
  
--

2. No more than 20% of the cells in the two-way table have an expected count **less than 5**
  
--

For our .seagreen_style[Titanic example], these assumptions are both satisfied: All categories have expected counts much higher than 5.
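
The two assumptions above can also be checked programmatically from the expected counts. This is a minimal Python sketch; the function name `assumptions_met` is hypothetical, and the small table at the end is made up purely to show a failing case.

```python
def assumptions_met(expected):
    """Check: no expected count of 0, and at most 20% of cells below 5."""
    cells = [e for row in expected for e in row]
    no_zero_counts = all(e > 0 for e in cells)
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return no_zero_counts and share_below_5 <= 0.20

# Expected counts for the Titanic example (rows: Survived No/Yes; columns: Sex M/F)
row_totals, col_totals, n = [1490, 711], [1731, 470], 2201
titanic_expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

print(assumptions_met(titanic_expected))            # True
print(assumptions_met([[2.0, 3.0], [10.0, 20.0]]))  # False: 2 of 4 cells below 5
```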

---

# Effect Size - Cramer's `$V$`

We can also compute an .orangered_style[effect size] for our Chi-Square Test of Association.

As you might expect, several effect size options exist. We will focus on .orangered_style[Cramer's *V*], which can be used for `$r, c \geq 2$`.

The calculation of Cramer's `$V$` is (somewhat) straightforward:

`$$V = \sqrt{\dfrac{\chi^2}{n \times \text{min}(r-1, c-1)}}$$`
--

* E.g. for our .seagreen_style[Titanic example], `$V = \sqrt{\dfrac{456.874}{2201 \times (2-1)}} \approx 0.456$`

However, the interpretation of Cramer's `$V$` is less straightforward - it will depend on the value of `$\text{min}(r-1, c-1)$` (see e.g. Kim, 2017).

* For this scenario, a value of `$0.456$` is considered a medium to large effect size
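
The Cramer's `$V$` calculation above can be reproduced with a short Python sketch. The function name `cramers_v` is hypothetical; the inputs are the test statistic and sample size reported for the Titanic example.

```python
import math

def cramers_v(chi2, n, r, c):
    """Cramer's V = sqrt(chi2 / (n * min(r - 1, c - 1)))."""
    return math.sqrt(chi2 / (n * min(r - 1, c - 1)))

v = cramers_v(chi2=456.874, n=2201, r=2, c=2)
print(round(v, 3))  # 0.456
```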

---

# Chi-Square Test of Association Summary Titanic Example

A .orangered_style[Chi-Square Test of Association] was conducted to determine if there was an association between the .seagreen_style[sex] and .seagreen_style[survival status] of individuals aboard the .seagreen_style[Titanic].

* `$73.2\%$` of females on board survived the sinking of the Titanic

* Only  `$21.2\%$` of males on board survived the sinking of the Titanic

A statistically and clinically significant association was found between sex and survival status at the `$\alpha = 0.05$` level of significance, with `$\chi_1^2 = 456.874, n = 2201, p < .001$`, and a medium-to-large effect size, with Cramer's `$V = 0.456$`.

Females were statistically significantly more likely to survive the sinking of the Titanic than males.
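
The survival percentages quoted in this summary are within-sex percentages: survivors of each sex divided by the total number of that sex on board. A minimal Python sketch, using the observed counts from the two-way table (variable names are hypothetical):

```python
# Observed counts by sex: (did not survive, survived)
counts = {"F": (126, 344), "M": (1364, 367)}

pct_survived = {
    sex: 100 * survived / (died + survived)
    for sex, (died, survived) in counts.items()
}

for sex, pct in pct_survived.items():
    print(sex, round(pct, 1))
# F 73.2
# M 21.2
```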

---

# A second Titanic example

Suppose we conduct a more detailed analysis of the Titanic survival data, and split individuals by their `Status`:

* *crew*

* *1st class passenger*

* *2nd class passenger*

* *3rd class passenger*

Now, we have:

`$H_0$`: There is no association between the `Survived` and `Status` variables
`$$\text{vs}$$`
`$H_1$`: There is an association between the `Survived` and `Status` variables

---

# Two-way Table - Second Titanic Example

Our two-way table is larger now, with `$c = 4$`:

* Do any interesting percentages pop out at you?

---

# Chi-Square Test of Association jamovi output

.pull-left[
<img src="data:image/png;base64,#titanic_survival_status_barplot.jpg" width="400px" style="display: block; margin: auto;" />
]

.pull-right[
<img src="data:image/png;base64,#chisquare_toa_titanic_status.jpg" width="400px" style="display: block; margin: auto;" />
]

We observe a Chi-Square test statistic `$\chi_3^2 = 190.401$`, with `$df=3$`, and an associated `$p$`-value `$<.001$`. The Cramer's `$V$` effect size is `$0.294$`.
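
These values can be reproduced approximately outside jamovi. The sketch below is illustrative Python only. The observed counts are taken from R's built-in `Titanic` dataset (the data source cited earlier), on the assumption that they match the jamovi table, which is not reproduced in the text of these slides. Variable names are hypothetical.

```python
import math

# Rows: Survived (No, Yes); columns: Status (crew, 1st, 2nd, 3rd).
# Counts assumed from R's built-in Titanic dataset.
observed = [
    [673, 122, 167, 528],  # Survived = No
    [212, 203, 118, 178],  # Survived = Yes
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Test statistic: sum of (O - E)^2 / E, with E = row total * column total / n
chi2 = sum(
    (o - rt * ct / n) ** 2 / (rt * ct / n)
    for row, rt in zip(observed, row_totals)
    for o, ct in zip(row, col_totals)
)
r, c = len(observed), len(observed[0])
df = (r - 1) * (c - 1)
v = math.sqrt(chi2 / (n * min(r - 1, c - 1)))

print(round(chi2, 1), df, round(v, 3))
# 190.4 3 0.294
```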

---

# Chi-Square Test of Association Summary Second Titanic Example

A .orangered_style[Chi-Square Test of Association] was conducted to determine if there was an association between the .seagreen_style[status] and .seagreen_style[survival] of individuals aboard the .seagreen_style[Titanic]. Observations include:

* `$76\%$` of the .seagreen_style[crew] did not survive

* `$62.5\%$` of .seagreen_style[1st class passengers] survived, while only `$41.4\%$` and `$25.2\%$` of .seagreen_style[2nd and 3rd class passengers] survived, respectively

A statistically and clinically significant association was found between status and survival at the `$\alpha = 0.05$` level of significance, with `$\chi_3^2 = 190.401, n = 2201, p < .001$`, and a medium effect size, with Cramer's `$V = 0.294$`.

Passengers in 1st class were the most likely to survive the sinking of the Titanic, out of the 4 status categories.

---

# Ordinal Data - Titanic Example

In the previous example, we treated .seagreen_style[status] as a .orangered_style[nominal] categorical variable.

However, it might be more appropriate to treat it as an .orangered_style[ordinal] categorical variable, particularly if we focus solely on the passengers (1st, 2nd and 3rd class).

Our analysis process will not change, but the effect size we use will change.

For ordinal data, one appropriate effect size would be the .orangered_style[Gamma] value.

This takes values between -1 and 1, and we can think of it as being similar to a Pearson correlation between the two variables:

* Values close to 1 suggest a strong positive association
  
--

* Values close to -1 suggest a strong negative association
  
--

* Values close to 0 suggest little to no association

---

# Chi-Square Test of Association - Ordinal Data

.pull-left[
<img src="data:image/png;base64,#titanic_survival_barplot_ordinal_coloured.jpg" width="400px" style="display: block; margin: auto;" />
]

.pull-right[
<img src="data:image/png;base64,#chisquare_toa_titanic_status_ordinal.jpg" width="300px" style="display: block; margin: auto;" />
]

Here Gamma `$= -0.511$`, suggesting a strong negative association between status and survival: as passenger class goes from 1st to 2nd to 3rd, the chance of survival decreases.
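
For the curious, the Gamma value can be computed by counting concordant and discordant pairs of observations. The Python sketch below assumes the Gamma reported by jamovi is the Goodman-Kruskal gamma, and it uses passenger counts from R's built-in `Titanic` dataset, which are assumed to match the jamovi table. The function name is hypothetical.

```python
def goodman_kruskal_gamma(table):
    """Gamma = (C - D) / (C + D) for a two-way table with ordered categories."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):       # rows below row i
                for l in range(n_cols):
                    if l > j:                    # both variables increase
                        concordant += table[i][j] * table[k][l]
                    elif l < j:                  # one increases, one decreases
                        discordant += table[i][j] * table[k][l]
    return (concordant - discordant) / (concordant + discordant)

# Rows: class (1st, 2nd, 3rd); columns: Survived (No, Yes).
passengers = [
    [122, 203],
    [167, 118],
    [528, 178],
]
gamma = goodman_kruskal_gamma(passengers)
print(round(gamma, 3))  # -0.511
```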

---

# Summary

A .orangered_style[Chi-Square Test of Association] (aka Test of Independence) can be used to check for an association between two .seagreen_style[categorical variables].

* Each categorical variable can have `$2+$` categories
  
--

* To assess the clinical significance of our results, we can use:
  
    * The Cramer's `$V$` effect size for nominal data
    
    * The Gamma effect size for ordinal data

* We need to check the assumptions of our Chi-Square Test of Association before we interpret and summarise our test results

* If the test assumptions fail, there are alternatives (beyond the scope of this subject, e.g. `$G$`-tests)

---

# End

That concludes our lecture on Chi-Square Tests of Association.

It also concludes the DA content component of the subject. Our final DA lecture in Week 12 will focus on revision.

---

# A note about the exam

As a heads up for the exam, for each of the following tests you should be able to:

* identify what numeric/categorical variables are required

* discuss what the test does

* interpret jamovi output
  
---

# Assessable Tests

* One sample `$t$`-test

* Two sample `$t$`-test

    * Mann-Whitney U test

* Paired `$t$`-test

    * Wilcoxon Signed-Rank test

* One-way ANOVA

    * Kruskal-Wallis test

* One-way Repeated Measures ANOVA

    * Friedman test

* Pearson Correlation

* Simple and Multiple Linear Regression

* Chi-Square Goodness of Fit test and Test of Association

---

### What to do next:

* .seagreen_style[Quick Kahoot revision quiz]: Please go to [kahoot.it](https://kahoot.it) and type in the code shown

* If you have any questions, check the LMS, email us or ask in the computer labs

### Optional Further Reading

* Parts from Kokoska (2020) Chapter 13
  
---

# References

* Kim, H.-Y. (2017). Statistical notes for clinical researchers: Chi-squared test and Fisher's exact test. *Restorative Dentistry & Endodontics*, 42(2), 152–155. [https://doi.org/10.5395/rde.2017.42.2.152](https://doi.org/10.5395/rde.2017.42.2.152)

* Kokoska, S. (2020). *Introductory statistics: A problem-solving approach* (3rd ed.). W. H. Freeman.

* R Core Team. (2023). *R: A language and environment for statistical computing*. R Foundation for Statistical Computing. [https://www.R-project.org/](https://www.R-project.org/)

* The jamovi project. (2022). *Jamovi [Computer Software]*. [https://www.jamovi.org](https://www.jamovi.org).

---
class: middle

These notes have been prepared by Rupert Kuveke, Amanda Shaker, and other members of the Department of Mathematical and Physical Sciences. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with the Department of Environment and Genetics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License 
<a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank"> BY-NC-ND. </a>