STM1001 Topic 10 Workshop

class: middle
background-image: url(data:image/png;base64,#LTU_logo.jpg)
background-position: top left
background-size: 30%

# STM1001 [Topic 10](https://bookdown.org/a_shaker/STM1001_Topic_10/) Workshop
## Chi-squared Tests for Categorical Data
### La Trobe University
This workshop complements the [Topic 10 readings](https://bookdown.org/a_shaker/STM1001_Topic_10/)

---

# Topic 10: Chi-squared Tests for Categorical Data

---
# Chi-squared goodness of fit test

* In some ways, the Chi-squared tests we will be looking at in this topic can be considered extensions of the test of proportions we considered in [Topic 9](https://bookdown.org/a_shaker/STM1001_Topic_9/)

* In the last workshop, we considered the following claim:
    * *60% of university students prefer Android over Apple (iOS) phones*

* Here there were only two options: Android or Apple (iOS)

* Based on the claim, we were expecting 60% of students to prefer Android and 40% to prefer Apple (iOS)

* What if we have an expected distribution of preferences across ***two or more categories***?

* In this case, we can use the Chi-squared goodness of fit test

---
# Claim

* University students’ mobile phone preferences are as follows:

    * Android: 55%

    * Apple (iOS): 40%

    * Phones are evil: 5%

---

name: menti
class: middle
background-image: url(data:image/png;base64,#menti.jpg)
background-size: 115%

# Menti

## Go to [www.menti.com](https://www.menti.com) and use

## the code provided

---
# Chi-squared goodness of fit test

* We will use the Chi-squared goodness of fit test to test the claim

* First, we need to set up our hypotheses:

`$H_0:$` There is no significant difference between the observed and expected distribution of proportions of university students' phone preferences.

<center>
versus
</center>

`$H_1:$` There is a significant difference between the observed and expected distribution of proportions of university students' phone preferences.

---
# Chi-squared goodness of fit test

* We also need to know the degrees of freedom:

.content-box-blue[
.center[
**Degrees of freedom for chi-squared goodness of fit test:**
]
`$\text{df} = \text{Number of categories} - 1$`.
]

* What are the degrees of freedom for our example?

---
# Chi-squared goodness of fit test

* The degrees of freedom allow us to define the distribution we will use for the test

* We use the ***chi-squared distribution***

* "Chi" is the Greek letter `$\chi$`, pronounced "ky" (rhymes with "sky").

* If a random variable `$X^2$` follows a chi-squared distribution, we would write this as `$X^2 \sim \chi^2_{\text{df}}$`

* So, for our example with three categories, `$\text{df} = 3 - 1 = 2$`, and we have that `$X^2 \sim \chi^2_2$` under `$H_0$`.

---
# Chi-squared distribution

---
# Chi-squared goodness of fit test

* The formula for the **test statistic** is

`$$X^2 = \sum_{i = 1}^k \frac{(O_i - E_i)^2}{E_i},$$`

* where:

    * `$X^2$` is random, with `$X^2 \sim \chi^2_{\text{df}}$` under `$H_0$`
    * `$O_i$` is the observed frequency for the `$i$`th category
    * `$E_i$` is the expected frequency for the `$i$`th category
    * `$k$` is the number of categories.

---
# Chi-squared goodness of fit test

* The formula for the **observed test statistic** is

`$$\chi^2 = \sum_{i = 1}^k \frac{(O_i - E_i)^2}{E_i},$$`

* where:

    * `$O_i$` is the observed frequency for the `$i$`th category
    * `$E_i$` is the expected frequency for the `$i$`th category, i.e., the proportion in the `$i$`th category under `$H_0$` multiplied by the sample size
    * `$k$` is the number of categories.

* It may be shown that the `$p$`-value is equal to `$P(X^2 \geq \chi^2)$`, where this probability is calculated under `$H_0$`

* As usual, if the `$p$`-value is less than `$\alpha$` (where `$\alpha$` is normally 0.05), we reject `$H_0$`

* Because the `$p$`-value is an upper-tail probability, a large test statistic gives a small `$p$`-value (and hence a significant result), while a small test statistic gives a large `$p$`-value (and hence a non-significant result)
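
---
# Chi-squared goodness of fit test

* The calculation above can be sketched in a few lines of Python. The observed counts below are **hypothetical** (made up for illustration, not class data), and the shortcut `$P(X^2 \geq x) = e^{-x/2}$` is valid only when `$\text{df} = 2$`, as in our example. For other degrees of freedom, statistical software is needed.

```python
import math

# Claimed percentages under H0 (from the claim in this workshop).
claimed_pct = {"Android": 55, "Apple (iOS)": 40, "Phones are evil": 5}

# Hypothetical observed counts from n = 100 students (illustration only).
observed = {"Android": 50, "Apple (iOS)": 42, "Phones are evil": 8}

n = sum(observed.values())

# Expected count = proportion under H0 multiplied by the sample size.
expected = {cat: pct * n / 100 for cat, pct in claimed_pct.items()}

# Observed test statistic: sum over categories of (O - E)^2 / E.
chi_sq = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)

df = len(observed) - 1

# For df = 2 only, P(X^2 >= x) = exp(-x / 2).
assert df == 2
p_value = math.exp(-chi_sq / 2)

print(f"chi-squared = {chi_sq:.3f}, df = {df}, p-value = {p_value:.3f}")
```

* With these made-up counts, the `$p$`-value is about 0.31, so we would not reject `$H_0$` at `$\alpha = 0.05$`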

---
# Chi-squared goodness of fit test

* After carrying out the test, we will also need to check the assumptions:

.content-box-blue[
.center[
**Chi-squared goodness of fit assumptions:**
]
1. No more than 20% of categories have an expected count of less than 5
1. There are no expected counts of zero
]
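
---
# Chi-squared goodness of fit test

* As a rough sketch, both assumptions can be checked directly from the expected counts. The helper name `check_assumptions` and the two sample sizes are hypothetical, chosen only to show one case where the assumptions hold and one where they do not.

```python
def check_assumptions(expected_counts):
    """Return (share_below_5, has_zero, assumptions_met) for a list of expected counts."""
    share_below_5 = sum(e < 5 for e in expected_counts) / len(expected_counts)
    has_zero = any(e == 0 for e in expected_counts)
    return share_below_5, has_zero, (share_below_5 <= 0.20 and not has_zero)


claimed_pct = [55, 40, 5]  # Android, Apple (iOS), Phones are evil

for n in (100, 60):  # hypothetical sample sizes
    expected = [pct * n / 100 for pct in claimed_pct]
    share, has_zero, ok = check_assumptions(expected)
    print(f"n = {n}: expected = {expected}, share below 5 = {share:.0%}, met = {ok}")
```

* With 100 students, no expected count is below 5. With 60 students, the "Phones are evil" category has an expected count of 3, so one of three categories (33%) falls below 5 and the first assumption fails.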

---
# Chi-squared goodness of fit test

---

# Group activity 1

* In your group, discuss the result and answer the following:

    * What is the `$p$`-value?
    * What are the degrees of freedom?
    * Have the assumptions been met?
    * Do we have evidence that the distribution of proportions is significantly different from what was expected?

After you have had a chance to discuss, nominate one person who can speak for the group and explain your conclusion to the rest of the class.

---
# Chi-squared Test of Independence

* The Chi-squared test of independence allows us to test whether there is an association between two categorical variables.

* Consider the following claim…

---
# Claim

* There is an association between mobile phone preferences and whether or not you have brown eyes

* What do you think...?

---
# Chi-squared Test of Independence

* We will use the Chi-squared test of independence to test the claim

* First, we need to set up our hypotheses:

`$H_0:$` There is no association between eye colour and mobile phone preference

<center>
versus
</center>

`$H_1:$` There is an association between eye colour and mobile phone preference

---
# Chi-squared Test of Independence

* Normally, we display the observed data for a Chi-squared test of independence in a two-way table. For example (the numbers in the table below are arbitrary):

| Mobile phone preference | Brown eyes | Not brown eyes |
|:------------------------|:----------:|:--------------:|
| Android                 | 15         | 15             |
| Apple (iOS)             | 15         | 10             |
| Phones are evil         | 9          | 2              |

---
# Chi-squared Test of Independence

* We also need to know the degrees of freedom:

.content-box-blue[
.center[
**Degrees of freedom for chi-squared test of independence:**
]
`$\text{df} = (r - 1)(c - 1),$`
]

where:

* `$r$` is the number of rows (i.e. the number of categories in the first variable)

* `$c$` is the number of columns (i.e. the number of categories in the second variable).

* What are the degrees of freedom for our example?

---
# Chi-squared Test of Independence

* The degrees of freedom allow us to define the distribution we will use for the test

* We again use the ***chi-squared distribution***

* In our example, we have that

`$$\text{df} = (3 - 1)(2 - 1) = 2\times 1 = 2.$$`

* So, we have `$X^2 \sim \chi^2_2$` under `$H_0$`

---
# Chi-squared distribution

---
# Chi-squared Test of Independence

* The formula for the **test statistic** is

`$$X^2 = \displaystyle \sum_{i = 1}^r \sum_{j = 1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}},$$`

* where, referring to the two-way table:

    * `$O_{ij}$` is the observed frequency in the `$i$`th row and the `$j$`th column

    * `$E_{ij}$` is the expected frequency of the cell in the `$i$`th row and the `$j$`th column

    * `$r$` is the number of rows

    * `$c$` is the number of columns

    * `$X^2$` is random, with `$X^2 \sim \chi^2_{\text{df}}$` under `$H_0$`

---
# Chi-squared Test of Independence

* The formula for the **observed test statistic** is

`$$\chi^2 = \displaystyle \sum_{i = 1}^r \sum_{j = 1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}},$$`

* where, referring to the two-way table:

    * `$O_{ij}$` is the observed frequency in the `$i$`th row and the `$j$`th column

    * `$r$` is the number of rows, and `$c$` is the number of columns

    * `$E_{ij} = \displaystyle \frac{\text{row}\_\text{total}_i \times \text{column}\_\text{total}_j}{\text{grand}\_\text{total}}$` is the expected frequency of the cell in the `$i$`th row and the `$j$`th column

    * `$\text{row}\_\text{total}_i$` is the number of observations in the `$i$`th row, and `$\text{column}\_\text{total}_j$` is the number of observations in the `$j$`th column

    * `$\text{grand}\_\text{total}$` is the total number of observations, often denoted `$n$`.
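
---
# Chi-squared Test of Independence

* As an illustration, the expected counts and the observed test statistic can be computed for the two-way table shown earlier in this workshop. Those counts are arbitrary, so the result says nothing about real students. The variable names are our own choice.

```python
# Observed counts from the two-way table in this workshop (arbitrary numbers).
# Rows: Android, Apple, Phones are evil. Columns: brown eyes, not brown eyes.
observed = [
    [15, 15],
    [15, 10],
    [9, 2],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# E_ij = row_total_i * column_total_j / grand_total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Observed test statistic: sum over all cells of (O_ij - E_ij)^2 / E_ij
chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(row_totals) - 1) * (len(col_totals) - 1)

print("expected counts:", [[round(e, 2) for e in row] for row in expected])
print(f"chi-squared = {chi_sq:.3f}, df = {df}")
```

* For example, the expected count for "Phones are evil" and "Not brown eyes" is `$11 \times 27 / 66 = 4.5$`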

---
# Chi-squared Test of Independence

* Then, the `$p$`-value is equal to `$P(X^2 \geq \chi^2)$`

* As usual, if the `$p$`-value is less than `$\alpha$` (where `$\alpha$` is normally 0.05), we reject `$H_0$`

* After carrying out the test, we will also need to check the assumptions:

.content-box-blue[
.center[
**Chi-squared Test of Independence assumptions:**
]
1. No more than 20% of cells have an expected count of less than 5
1. There are no expected counts of zero
]
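
---
# Chi-squared Test of Independence

* A minimal sketch of the decision step and the assumption checks, using the arbitrary counts from the two-way table in this workshop. The shortcut `$P(X^2 \geq x) = e^{-x/2}$` holds only for `$\text{df} = 2$`, which is the case here. For other tables, statistical software is needed.

```python
import math

# Observed counts from the workshop's two-way table (arbitrary numbers).
observed = [[15, 15], [15, 10], [9, 2]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts and observed counts, both flattened row by row.
expected = [r * c / n for r in row_totals for c in col_totals]
cells = [o for row in observed for o in row]
chi_sq = sum((o - e) ** 2 / e for o, e in zip(cells, expected))

# df = (3 - 1)(2 - 1) = 2, so P(X^2 >= x) = exp(-x / 2) (df = 2 only).
p_value = math.exp(-chi_sq / 2)

# Assumption checks on the expected counts.
share_below_5 = sum(e < 5 for e in expected) / len(expected)
has_zero = any(e == 0 for e in expected)
assumptions_met = share_below_5 <= 0.20 and not has_zero

print(f"p-value = {p_value:.3f}, reject H0 at 0.05: {p_value < 0.05}")
print(f"cells with expected count below 5: {share_below_5:.0%}, "
      f"assumptions met: {assumptions_met}")
```

* For these arbitrary counts, the `$p$`-value is about 0.18, so `$H_0$` would not be rejected at `$\alpha = 0.05$`. One of the six cells (about 17%) has an expected count below 5, which is within the 20% limit.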

---
# Chi-squared Test of Independence

---

# Group activity 2

* In your group, discuss the result and answer the following:

    * What is the `$p$`-value?
    * What are the degrees of freedom?
    * Have the assumptions been met?
    * Do we have evidence that mobile phone preferences are related to eye colour?

After you have had a chance to discuss, nominate one person who can speak for the group and explain your conclusion to the rest of the class.

---

# More details in the readings

* For more, see [this topic’s readings](https://bookdown.org/a_shaker/STM1001_Topic_10/)

---

background-image: url(data:image/png;base64,#computerlab.jpg)
background-position: bottom
background-size: 75%
class: center

# See you in the computer labs!

Continue with this topic's readings: [Topic 10 Readings](https://bookdown.org/a_shaker/STM1001_Topic_10/)

---
class: middle

<font color = "grey">
These notes have been prepared by Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License 
<a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank"> BY-NC-ND. </a>
</font>