BIO2POS Lecture Topic 6A

class: middle
background-image: url(data:image/png;base64,#LTU_logo_clear.jpg)
background-position: top left
background-size: 25%

# BIO2POS 
# Chi-Square Goodness of Fit Tests <br> for Categorical Data
## Data Analysis Topic 6A
### La Trobe University

---

# Welcome!

### In this lecture we will introduce non-parametric tests for categorical data.

Over the following slides, we will cover:

* .orangered_style[Chi-Square Goodness of Fit Test]

* Definition and Hypotheses - Simple and Extended versions
    
--

* Interpreting Output
--

* Assumptions
--

* Chi-Square Distribution and Test Statistic Calculations

---

# Intended Learning Objectives

### By the end of this lecture you will:

* understand when and how to use a .orangered_style[Chi-Square Goodness of Fit Test]

* be able to assess whether Chi-Square Goodness of Fit Test assumptions have been met

* understand and be able to compute the .orangered_style[Test Statistic] and .orangered_style[degrees of freedom] for a Chi-Square Goodness of Fit Test
  
--

* be able to correctly .seagreen_style[interpret] and .seagreen_style[summarise] the results of both versions of the Chi-Square Goodness of Fit Test
  
--

<br>

The content you learn in Topics 6A and 6B extends the skills you have developed in previous topics, providing you with statistical tools for analysing categorical data.

We will practise content from this topic in this week's DA computer lab, which also includes some additional extension material if you would like to extend your knowledge.

---

# Tests covered so far

---

# Analysing Categorical Data

So far, we have covered a wide range of commonly used parametric and non-parametric statistical tests.

However, these tests mostly involved analysing .seagreen_style[continuous numeric data].

* What happens if we encounter .orangered_style[categorical data] (see [Topic 1A](https://rpubs.com/LTU_BIO2POS/DA1A))?

To analyse categorical data, we will need to introduce a new type of test: <br> the .orangered_style[Chi-Square Test]

There are several types of Chi-Square Test, and often different names are used for effectively the same test.

We will focus on two of the most commonly used ones:

* The .orangered_style[Chi-Square Goodness of Fit test] (this lecture)

* The *Chi-Square Test of Association (aka Test of Independence)* (next lecture)

---

# Testing Proportions in Categorical Data

Suppose we have some categorical data for a single categorical variable, with multiple categories.

We may be interested in assessing the **proportions** of individuals in each of the categories.

As an introductory example, suppose we are interested in the .seagreen_style[eye colour of LTU students].

* We will most likely have several categories, e.g.: 
  
  `$$\text{blue, green, brown, amber, violet}$$`
  
--

* Before we sample any data, we can specify the proportion of students we expect to observe in each category
  
--

* Perhaps we expect the proportions to be equal `$(0.2, 0.2, 0.2, 0.2, 0.2)$`?
    
--

* Perhaps (more realistically) we expect some eye colours to have different proportions, e.g. `$(0.33, 0.15, 0.4, 0.1, 0.02)$`?
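
As a minimal sketch (written in Python, which is not the software used in this subject), the helper below checks that a set of expected proportions is usable before any data are collected: every proportion should be positive, and together they should sum to 1. The function name `check_proportions` and the tolerance are illustrative assumptions.

```python
import math

def check_proportions(proportions, tol=1e-9):
    """Return True if every proportion is positive and they sum to 1."""
    all_positive = all(p > 0 for p in proportions)
    sums_to_one = math.isclose(sum(proportions), 1.0, abs_tol=tol)
    return all_positive and sums_to_one

# Eye colour order: blue, green, brown, amber, violet
equal_props = [1 / 5] * 5                      # equal proportions
custom_props = [0.33, 0.15, 0.4, 0.1, 0.02]    # proportions chosen in advance

print(check_proportions(equal_props))
print(check_proportions(custom_props))
```

Both sets of eye colour proportions from this slide pass the check, since each sums to 1.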

---

# The Chi-Square Goodness of Fit test

To formally test whether the **observed proportions** of individuals in each category of a .seagreen_style[categorical variable] match the **expected proportions**, we can conduct a <br> .orangered_style[Chi-Square Goodness of Fit] test.

As we might expect, given the eye colour example, there are two versions of this test:

* **Simple Case:** We can assume the proportions are equal across all categories 
  
--

* **Extended Case:** We can assign specific expected proportions to each category
  
--

The specification of the null and alternate hypotheses will depend on the test version used, but as a general introduction, we will be testing:

--
<br>

<center>
`$H_0$`: There is no difference between the expected and observed distributions of proportions across categories
`$$\text{vs}$$`
`$H_1$`: There is a difference between the expected and observed distributions of proportions across categories

---

# The Chi-Square Goodness of Fit test

We will walk through both versions of the test in this lecture.

<br>

For both options, we will follow this analysis process:

1. Define the null and alternate hypotheses
  
--

2. Calculate the degrees of freedom
  
--

3. Compute the test statistic and `$p$`-value
  
--

4. Check the test assumptions
  
--

5. Reach conclusion (reject `$H_0$`/fail to reject `$H_0$`)
  
--

6. Summarise results
  
--

<br>

*By this stage, this process should be starting to look familiar.*

---

# The Chi-Square Goodness of Fit Hypotheses <br> Simple Case

Suppose we have a categorical variable with `$k$` different categories `$(k \geq 2)$`.

Let `$\pi_i$` denote the proportion of values recorded in category `$i$`.

* `$i \in \{1, 2, \ldots, k\}$`.

<br>

Our simple case .orangered_style[Chi-Square Goodness of Fit hypotheses] will be:

`$$H_0: \pi_1 = \pi_2 = \cdots = \pi_k \text{ vs } H_1: \text{ Not all } \pi_i \text{ are equal}$$`

<br>

*Note that, similar to our ANOVA tests, our alternate hypothesis is not saying that all the proportions are unequal necessarily, but rather that at least 2 proportions are not equal.*

---

# `$\chi^2$` GoF Hypotheses - Sharks Example <br> Simple Case

**Scenario:** .seagreen_style[Caribbean Reef Sharks] *(Carcharhinus perezi)* live in the .seagreen_style[Cayman Islands], and are protected throughout Cayman waters under the National Conservation Act (2015).

Kohler et al. (2023) studied the movements and lifestyles of Caribbean Reef Sharks over a 9-year period, and conducted various statistical analyses on the sharks' use of coastal space in the Caymans.

We will replicate some of their .orangered_style[Chi-Square Goodness of Fit] analyses here, focusing on:

* The proportion of **male** and **female** sharks tagged
  
--
  
* The proportion of sharks tagged at different **locations**

---

.caption_style[
Note. From File:Caribbean_Reef_Shark_School_Tiger_Beach_Bahamas.jpg, by [Dennis Hipp (Zepto)](https://commons.wikimedia.org/wiki/User:Zepto), 2023, Wikimedia Commons ([https://commons.wikimedia.org/](https://commons.wikimedia.org/)). [CC0 1.0 DEED ](https://creativecommons.org/publicdomain/zero/1.0/deed.en)
]

---

# Proportions of Male and Female Sharks

For our first Chi-Square Goodness of Fit test, we will assess the distribution of proportions of male and female Caribbean Reef Sharks in the Cayman Islands.

To conduct the test, we first introduce the following notation:

* Let `$\pi_F$` denote the proportion of female Caribbean Reef Sharks
  
* Let `$\pi_M$` denote the proportion of male Caribbean Reef Sharks
  
--

`$$H_0: \pi_F = \pi_M \text{ vs } H_1: \pi_F \neq \pi_M$$`
---

# Chi-Square Goodness of Fit jamovi output

For this simple case, the expected proportions are set at 0.5 each.

We observe a Chi-Square test statistic `$\chi_1^2 = 2.273$`, with `$df=1$`, and an associated `$p$`-value of `$0.132$`.

---

# Conclusion?

At this stage, we have a test statistic and `$p$`-value result.

We could stop here and report that the test result was not statistically significant `$(p > 0.05)$`.

However, it is important to understand how we have arrived at this result, and also check the assumptions of the test.

---

# The Chi-Square Distribution

The .orangered_style[Chi-Square distribution] is defined by one parameter, the .orangered_style[degrees of freedom].

We use the notation `$\chi_{df}^2$` to denote a Chi-Square distribution with `$df$` degrees of freedom.

* For a Goodness of Fit test with `$k$` categories, `$df = k-1$`
--

<br>

Chi-Square distribution details:

* .orangered_style[Asymmetric]

* Changes shape dramatically as `$df$` changes

* Does not take negative values (and, when `$df = 1$`, the density is not defined at `$0$`)
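
To see how the shape changes with the degrees of freedom, the sketch below evaluates the standard Chi-Square density, `$f(x) = \dfrac{x^{df/2 - 1} e^{-x/2}}{2^{df/2}\,\Gamma(df/2)}$` for `$x > 0$`, at a few points. It is written in Python purely for illustration; the function name `chi_square_pdf` and the chosen `$x$` values are assumptions, not part of the subject material.

```python
import math

def chi_square_pdf(x, df):
    """Density of the Chi-Square distribution with `df` degrees of freedom, for x > 0."""
    if x <= 0:
        raise ValueError("this sketch only evaluates the density for x > 0")
    half_df = df / 2
    return x ** (half_df - 1) * math.exp(-x / 2) / (2 ** half_df * math.gamma(half_df))

# Evaluate the density at the same points for several degrees of freedom
for df in (1, 2, 5):
    values = [round(chi_square_pdf(x, df), 3) for x in (0.5, 1, 2, 4, 8)]
    print(f"df = {df}: {values}")
```

For `$df = 1$` and `$df = 2$` the printed values decrease steadily as `$x$` grows, while for `$df = 5$` they rise to a peak before falling away, which is the asymmetry and change of shape described above.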

---

# Chi-Square distribution

---

# Chi-Square Goodness of Fit Test Statistic

For a Chi-Square Goodness of Fit test, we have:

`$$\text{Test Statistic:  }\,  \chi^2 = \sum_{i = 1}^k \dfrac{(O_i - E_i)^2}{E_i}$$`
Here:

* `$\chi^2 \sim \chi^2_{df}$` approximately, when `$H_0$` is true

* `$O_i$` is our .orangered_style[observed frequency] (i.e. count) for category `$i$`
  
--

* `$E_i$` is our .orangered_style[expected frequency] (i.e. count) for category `$i$`
  
--

* Recall `$k$` is the number of categories in our categorical variable

---

# Chi-Square Goodness of Fit jamovi output

Here we display a more detailed version, with the observed and expected counts shown.

---

# `$O_i$` and `$E_i$` Values and Test Statistic

For our simple shark example, with `$n = 44$` we note:

* `$E_1 = E_2 = 0.5 \times n = 0.5 \times 44 = 22$`
  
* `$O_1 = 17$` (female sharks)

* `$O_2 = 27$` (male sharks)
  
--

`$$\chi^2 = \dfrac{(O_1 - E_1)^2}{E_1} + \dfrac{(O_2 - E_2)^2}{E_2} = \dfrac{(17 - 22)^2}{22} + \dfrac{(27 - 22)^2}{22} = \dfrac{25}{22} + \dfrac{25}{22}$$`

`$$\approx 2.273$$`
--

We then use this observed test statistic to compute the `$p$`-value for our test, with:

`$$p\text{-value } = P(\chi^2 \geq 2.273)$$`
  
--

* *Note: because the `$\chi^2$` distribution does not take negative values, the `$p$`-value is a single upper-tail probability*
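
The calculation above can be sketched in a few lines of Python (an illustration only; jamovi does this for us). The function names are hypothetical, and the `$p$`-value helper relies on the identity `$P(\chi^2_1 \geq x) = \text{erfc}(\sqrt{x/2})$`, which holds only when `$df = 1$`.

```python
import math

def chi_square_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_df1(statistic):
    """Upper-tail probability P(chi^2 >= statistic), valid for df = 1 only."""
    return math.erfc(math.sqrt(statistic / 2))

observed = [17, 27]                 # female and male counts from the slides
n = sum(observed)                   # total number of sharks, 44
expected = [0.5 * n, 0.5 * n]       # equal proportions give 22 each

statistic = chi_square_statistic(observed, expected)
p_value = p_value_df1(statistic)

print(f"chi-square = {statistic:.3f}, p = {p_value:.3f}")
```

The printed values should agree with the jamovi output reported earlier (`$\chi^2_1 = 2.273$`, `$p = 0.132$`).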

---

# Checking Assumptions

Before we summarise our test results, we should check that the .orangered_style[test assumptions] are satisfied.

The .orangered_style[Chi-Square Goodness of Fit] test makes two assumptions:

1. There are **no expected counts of 0**
  
--

* This seems reasonable - why include a category if you are not expecting any observations in it?
    
--

2. No more than 20% of the categories have an expected count **less than 5**
  
--

For our shark example, these assumptions are both satisfied: Both categories have expected counts of `$22 \geq 5$`.
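
The two checks can also be written out as a short Python sketch. The function name `check_gof_assumptions` and its default thresholds are illustrative choices that simply mirror the two assumptions listed on this slide.

```python
def check_gof_assumptions(expected_counts, min_count=5, max_share_below=0.20):
    """Check the two Goodness of Fit assumptions for a list of expected counts."""
    # Assumption 1: no expected count of 0
    no_zero_counts = all(e > 0 for e in expected_counts)
    # Assumption 2: no more than 20% of categories with an expected count below 5
    n_small = sum(1 for e in expected_counts if e < min_count)
    share_small = n_small / len(expected_counts)
    return {
        "no_zero_expected_counts": no_zero_counts,
        "share_below_min_count": share_small,
        "assumptions_met": no_zero_counts and share_small <= max_share_below,
    }

shark_expected = [22, 22]   # expected counts for female and male sharks
result = check_gof_assumptions(shark_expected)
print(result)
```

For the shark example, neither expected count is 0 or below 5, so `assumptions_met` is `True`.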

---

# Chi-Square Goodness of Fit Test Summary Shark Example

A .orangered_style[Chi-Square Goodness of Fit test] was conducted to determine if there was a difference between the observed and expected proportions of male and female .seagreen_style[Caribbean Reef Sharks] in the Cayman Islands.

27 male sharks and 17 female sharks were observed, with equal expected proportions of 0.5 (expected counts both `$> 5$`).

While more male sharks were observed, no statistically significant difference was found between the proportions of male and female sharks at the `$\alpha = 0.05$` level of significance, with `$\chi_1^2 = 2.273, n = 44, p = 0.132 > 0.05$`.

---

# A second shark example

Suppose we also conduct a .orangered_style[Chi-Square Goodness of Fit] test on the proportions of .seagreen_style[Caribbean Reef Sharks] tagged in each of the three .seagreen_style[Cayman Islands]:

* **GC**: Grand Cayman
  
* **LC**: Little Cayman

* **CB**: Cayman Brac

To begin, suppose we assume the proportions `$(\pi_{GC}, \pi_{LC}, \pi_{CB})$` are equal.

Hence we have:

`$$H_0: \pi_{GC} = \pi_{LC} = \pi_{CB} \text{ vs } H_1: \text{ Not all proportions are equal}$$`
---

# The Cayman Islands

<center>
<div class="leaflet html-widget html-fill-item" id="htmlwidget-277620ca41bc78987d31" style="width:600px;height:504px;"></div>
<script type="application/json" data-for="htmlwidget-277620ca41bc78987d31">{"x":{"options":{"crs":{"crsClass":"L.CRS.EPSG3857","code":null,"proj4def":null,"projectedBounds":null,"options":{}}},"calls":[{"method":"addTiles","args":["https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png",null,null,{"minZoom":0,"maxZoom":18,"tileSize":256,"subdomains":"abc","errorTileUrl":"","tms":false,"noWrap":false,"zoomOffset":0,"zoomReverse":false,"opacity":1,"zIndex":1,"detectRetina":false,"attribution":"© <a href=\"https://openstreetmap.org/copyright/\">OpenStreetMap<\/a>,  <a href=\"https://opendatacommons.org/licenses/odbl/\">ODbL<\/a>"}]}],"setView":[[19.3133,-81.2546],8,[]]},"evals":[],"jsHooks":[]}</script>

---

# Chi-Square Goodness of Fit jamovi output

* In this example, we have a statistically significant result `$(p < 0.05)$`

* Note `$df=2$` since we now have `$k=3$` categories (the islands)
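
As a rough sketch of what `$df = 2$` means for the `$p$`-value (in Python, for illustration only): when `$df = 2$`, the upper-tail probability has the simple closed form `$P(\chi^2_2 \geq x) = e^{-x/2}$`. The jamovi test statistic for this example is not reproduced here, so the code instead finds the smallest statistic that would give `$p = 0.05$`. The function names are hypothetical.

```python
import math

def degrees_of_freedom(k):
    """Degrees of freedom for a Goodness of Fit test with k categories."""
    return k - 1

def p_value_df2(statistic):
    """Upper-tail probability P(chi^2 >= statistic), valid for df = 2 only."""
    return math.exp(-statistic / 2)

islands = ["GC", "LC", "CB"]
df = degrees_of_freedom(len(islands))    # 3 categories give df = 2

# Smallest statistic that gives p = 0.05 when df = 2
cutoff = -2 * math.log(0.05)

print(df, round(cutoff, 3), round(p_value_df2(cutoff), 3))
```

With `$df = 2$`, any test statistic larger than this cutoff (about 5.99) corresponds to `$p < 0.05$`.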
  
---

# `$\chi^2$` GoF Hypotheses - Sharks Example <br> Extended Case

Suppose that it is not reasonable to assume that the proportion of sharks is equal across the three Cayman Islands.

* Perhaps prior experience or a pilot study suggests the sharks are more likely to frequent .seagreen_style[Little Cayman] (LC)

If it seems unreasonable to use equal proportions, we can choose specific values.

Suppose we specify:

* `$\pi_{GC} = 0.1, \pi_{LC} = 0.8, \pi_{CB} = 0.1$`
  
--

Now, our hypotheses are:

`$$H_0: \pi_{GC} = 0.1, \pi_{LC} = 0.8, \pi_{CB} = 0.1$$` 
`$$\text{ vs }$$`
`$$H_1: \text{The proportions do not match the expected distribution of proportions}$$`
---

# Chi-Square Goodness of Fit jamovi output

* Note that our test result is no longer statistically significant.
  
---

# Checking Assumptions

Before we conclude this example, we should **check our assumptions**.

You may have noticed that two of the expected counts `$(E_{GC} \text{ and } E_{CB})$` were less than 5.

* This represents 2 of the 3 categories (about 67%), which violates our second assumption (no more than 20% of categories with an expected count below 5)

Recall `$E_i = n \times \pi_i$`. This issue could have been avoided if we were more careful specifying our null hypothesis proportions.

* How difficult would it be to adjust the proportions to satisfy the test assumptions?

This highlights the importance of understanding the test process and assumptions, before specifying your null and alternate hypotheses and conducting your test.
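
The sketch below (Python, for illustration only) shows how the expected counts follow from `$E_i = n \times \pi_i$`, and how small a proportion can be before an expected count drops below 5. It assumes `$n = 44$`, the sample size from the male and female example; the slides do not state `$n$` for this test, so treat that value as an assumption. The function name is hypothetical.

```python
def expected_counts(n, proportions):
    """Expected count E_i = n * pi_i for each category."""
    return [n * p for p in proportions]

n = 44  # ASSUMPTION: same sample size as the male/female example
proportions = {"GC": 0.1, "LC": 0.8, "CB": 0.1}

counts = expected_counts(n, proportions.values())
small = [name for name, e in zip(proportions, counts) if e < 5]
share_small = len(small) / len(counts)

print(dict(zip(proportions, counts)))
print(f"{len(small)} of {len(counts)} expected counts are below 5 ({share_small:.0%})")

# Smallest proportion that gives an expected count of at least 5
min_proportion = 5 / n
print(f"each proportion would need to be at least {min_proportion:.3f}")
```

With only three categories, even a single expected count below 5 is more than 20% of the categories, so under this assumed `$n$` every hypothesised proportion would need to be at least `$5/n$`.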

---

# Summary

A .orangered_style[Chi-Square Goodness of Fit test] can be used to compare expected and observed distributions of proportions of individuals in the different categories of a .seagreen_style[categorical variable].

* We can either use equal proportions, or specify the proportions for each category

* We need to check the assumptions of our Chi-Square Goodness of Fit test before we conduct our test

* If the test assumptions fail, there are alternatives (beyond the scope of this subject, e.g. `$G$`-tests)
  
--

<br>

*Note: I am aware the results shown here differ from those of Kohler et al. (2023). I have intentionally treated multiple observations of the same shark as observations of multiple individual sharks, and we will look at this in more detail in the computer lab.*

---

# End

That concludes our lecture on Chi-Square Goodness of Fit tests.

### What to do next:

* Make sure to attend the Topic 6B lecture (our final lecture on new DA content!)

* If you have any questions, check the LMS, email us or ask in the computer labs

### Optional Further Reading

* Parts from Kokoska (2020) Chapter 13
  
---

# References

* Kohler, J., Gore, M., Ormond, R., Johnson, B., & Austin, T. (2023). Individual residency behaviours and seasonal long-distance movements in acoustically tagged Caribbean reef sharks in the Cayman Islands. *PloS One*, 18(11), e0293884. [https://doi.org/10.1371/journal.pone.0293884](https://doi.org/10.1371/journal.pone.0293884)

* Kokoska, S. (2020). *Introductory statistics: A problem-solving approach* (3rd ed.). W. H. Freeman.

* The jamovi project. (2022). *Jamovi [Computer Software]*. [https://www.jamovi.org](https://www.jamovi.org).

---
class: middle

<font color = "grey">
These notes have been prepared by Rupert Kuveke, Amanda Shaker, and other members of the Department of Mathematical and Physical Sciences. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with the Department of Environment and Genetics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License 
<a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank"> BY-NC-ND. </a>
</font>