class: middle
background-image: url(data:image/png;base64,#LTU_logo.jpg)
background-position: top left
background-size: 30%

# STM1001 [Topic 10](https://bookdown.org/a_shaker/STM1001_Topic_10/) Lecture
## Chi-squared Tests for Categorical Data
### La Trobe University

This lecture complements the [Topic 10 readings](https://bookdown.org/a_shaker/STM1001_Topic_10/)

---

# Topic 10: Related Links

## Readings

[Topic 10 readings](https://bookdown.org/a_shaker/STM1001_Topic_10/)

## Notation

[Notation for Topic 10: Chi-squared tests for categorical data](https://bookdown.org/a_shaker/STM1001_Topic_0/notation-summary.html#topic-10-chi-squared-tests-for-categorical-data)

---

# Topic 10: Chi-squared Tests for Categorical Data

**Overview**

<iframe src="https://bookdown.org/a_shaker/STM1001_Topic_10/" width="100%" height="400px" data-external="1"></iframe>

---

# Chi-squared goodness of fit test

In this lecture we will consider two different Chi-squared tests for categorical data.

--

* In some ways, the Chi-squared tests we will be looking at in this topic can be considered extensions of the test of proportions we considered in [Topic 9](https://bookdown.org/a_shaker/STM1001_Topic_9/)

--

* In the previous core lecture, we considered the following claim:
  * *70% of university students prefer Apple (iOS) over Android phones*

--

* Here there were only two options: Apple (iOS) or Android

--

* Based on the claim, we were expecting 70% of students to prefer Apple (iOS) and 30% to prefer Android

--

What if we have an expected distribution of preferences across ***two or more categories***?

--

* In this case, we can use the **Chi-squared goodness of fit** test

---

# Example: Chi-squared goodness of fit test

Suppose a claim has been made that university students' mobile phone preferences are as follows:

* Apple (iOS): 70%
* Android: 25%
* Neither: 5%

--

* Further suppose `\(n = 50\)` respondents indicated preferences as follows:

| Mobile phone preference | Observed frequency | Observed percentage | Expected percentage |
|:------------- |:--------------------:|:-------------------:|:-------------------:|
| Apple (iOS)   | 29 | 58% | 70% |
| Android       | 12 | 24% | 25% |
| Neither       | 9  | 18% | 5%  |

* *Note that the Expected percentage column refers to the claim made about university students' mobile phone preferences*

---

# Example: Chi-squared goodness of fit test

We can test the following hypotheses via the **Chi-squared goodness of fit** test:

<br>

`\(H_0:\)` There is no significant difference between the observed and expected distribution of proportions of university students' mobile phone preferences

<center> versus <br> </center>

<br>

`\(H_1:\)` There is a significant difference between the observed and expected distribution of proportions of university students' mobile phone preferences

---

# The Chi-squared distribution

The sampling distribution used for chi-squared tests is the ***chi-squared distribution***.

--

* "Chi" is a Greek letter, `\(\chi\)`, and is pronounced "ky".
--

* Just like the `\(t\)` and `\(F\)` distributions, the chi-squared distribution is defined by the degrees of freedom (df) parameter

--

* If a random variable `\(X^2\)` follows a chi-squared distribution, we denote this as `\(X^2 \sim \chi^2_{\text{df}}\)`

--

* For example, if df = 5, we would write `\(X^2 \sim \chi^2_5\)`

--

* The figure on the next slide shows some example density curves of the `\(\chi^2\)` distribution for varying degrees of freedom

---

# The Chi-squared distribution

<img src="data:image/png;base64,#Topic_10_Lecture_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />

---

# The Chi-squared distribution

* As we can see, the Chi-squared distribution is positively skewed; however, as the degrees of freedom increase, the density curve becomes flatter and more symmetric, beginning to resemble the normal distribution

--

* The Chi-squared distribution only takes on positive values

--

* When we carry out a Chi-squared test, the observed test statistic, `\(\chi^2\)`, is placed within the context of the corresponding sampling distribution

--

* We then calculate the `\(p\)`-value as `\(p = P(X^2 \geq \chi^2)\)`
  * This means that a large test statistic will result in a small `\(p\)`-value (and consequently a significant result)
  * A small test statistic will result in a large `\(p\)`-value (and consequently a non-significant result)

---

# Example: Chi-squared goodness of fit test

For the Chi-squared goodness of fit test, the **degrees of freedom** is defined as:

.content-box-blue[
.center[
**Degrees of freedom for Chi-squared goodness of fit test:**

`\(\text{df} = \text{Number of categories} - 1\)`
]
]

--

In our example, there are three categories, so we have that

`$$\text{df} = 3 - 1 = 2.$$`

Therefore, in this example, we have that `\(X^2 \sim \chi^2_2\)` under `\(H_0\)`.
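--

As a side note, the whole test can be carried out in R with the `chisq.test()` function. Below is a minimal sketch, assuming the observed frequencies from our table are stored in a vector named `Obs_Frequency` (the name that appears in the output a few slides ahead):

``` r
# Observed frequencies: Apple (iOS), Android, Neither
Obs_Frequency <- c(29, 12, 9)

# Goodness of fit test against the claimed proportions of 70%, 25% and 5%
chisq.test(Obs_Frequency, p = c(0.70, 0.25, 0.05))

# Equivalently, the p-value is P(X^2 >= chi-squared statistic) with df = 2
pchisq(17.949, df = 2, lower.tail = FALSE)
```

This sketch reproduces the output shown on the coming slides.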
---

# Chi-squared goodness of fit test

* The formula for the **test statistic** is

$$X^2 = \sum_{i = 1}^k \frac{(O_i - E_i)^2}{E_i}, $$

--

* where:
  * `\(X^2\)` is random, with `\(X^2 \sim \chi^2_{\text{df}}\)` under `\(H_0\)`

--

  * `\(O_i\)` is the observed frequency for the `\(i\)`th category

--

  * `\(E_i\)` is the expected frequency for the `\(i\)`th category

--

  * `\(k\)` is the number of categories

---

# Chi-squared goodness of fit test

* The formula for the **observed test statistic** is

$$\chi^2 = \sum_{i = 1}^k \frac{(O_i - E_i)^2}{E_i}, $$

--

* where:
  * `\(O_i\)` is the observed frequency for the `\(i\)`th category

--

  * `\(E_i\)` is the expected frequency for the `\(i\)`th category, i.e., the proportion in the `\(i\)`th category under `\(H_0\)` multiplied by the sample size

--

  * `\(k\)` is the number of categories

--

* It may be shown that the `\(p\)`-value is equal to `\(P(X^2 \geq \chi^2)\)`, where this probability is calculated under `\(H_0\)`
  * As usual, if the `\(p\)`-value is less than `\(\alpha\)` (where `\(\alpha\)` is normally 0.05), we reject `\(H_0\)`

---

# Chi-squared goodness of fit test output

``` r
Chi-squared test for given probabilities

data: Obs_Frequency
X-squared = 17.949, df = 2, p-value = 0.0001266
```

---

# Chi-squared goodness of fit test output

``` r
Chi-squared test for given probabilities

data: Obs_Frequency
`X-squared = 17.949`, df = 2, p-value = 0.0001266
```

* The **test statistic** is equal to 17.949

---

# Chi-squared goodness of fit test output

``` r
Chi-squared test for given probabilities

data: Obs_Frequency
`X-squared = 17.949`, df = 2, `p-value = 0.0001266`
```

* The **test statistic** is equal to 17.949
* The **`\(p\)`-value** is `\(p < 0.001\)`. Since this is less than `\(\alpha = 0.05\)`, we reject `\(H_0\)`; there is sufficient evidence to support that the distribution of proportions is significantly different from what was expected.

---

# Chi-squared goodness of fit test output

``` r
Chi-squared test for given probabilities

data: Obs_Frequency
`X-squared = 17.949`, `df = 2`, `p-value = 0.0001266`
```

* The **test statistic** is equal to 17.949
* The **`\(p\)`-value** is `\(p < 0.001\)`. Since this is less than `\(\alpha = 0.05\)`, we reject `\(H_0\)`; there is sufficient evidence to support that the distribution of proportions is significantly different from what was expected.
* The **degrees of freedom** is `\(\text{df} = 2\)`

---

# Chi-squared goodness of fit test: Assumptions

As usual, we also need to check the assumptions:

.content-box-blue[
.center[
**Chi-squared goodness of fit assumptions:**
]
1. No more than 20% of categories have an expected count of less than 5
1. There are no expected counts of zero
]

---

# Chi-squared goodness of fit test: Assumptions

The expected counts for our mobile phone example are as follows (see if you can calculate these yourself from the information already provided; an R sketch follows the assumption check below):

```
[1] 35.0 12.5 2.5
```

--

This means that:

1. One out of 3 categories (33.3%) has an expected count of less than 5
<i class="fas fa-exclamation-triangle faa-pulse animated " style=" color:red;"></i>
--

1. There are no expected counts of zero
<i class="fas fa-check faa-pulse animated " style=" color:green;"></i>
--

Therefore, since the first condition is not met (33.3% > 20%), the assumptions have not been met.

--

In practice, when the assumptions are not met, we could take further action such as combining some of the categories or carrying out a different type of test. However, these techniques are beyond the scope of this subject.

---
name: menti
class: middle
background-image: url(data:image/png;base64,#menti.jpg)
background-size: 115%

# Kahoot

## Go to [www.kahoot.it](https://www.kahoot.it) and use
## the code provided

---

# Chi-squared Test of Independence

The Chi-squared test of independence allows us to test whether there is an association between two categorical variables.

--

Consider the following claim:

--

* *There is an association between mobile phone preferences and whether or not you have brown eyes*

--

* What do you think...?

--

* We will use the Chi-squared test of independence to test the claim

--

First, we need to set up our hypotheses:

`\(H_0:\)` There is no association between eye colour and mobile phone preference

versus

`\(H_1:\)` There is an association between eye colour and mobile phone preference

---

# Chi-squared Test of Independence

Since there are two categorical variables involved in a chi-squared test of independence, it is useful to look at a two-way table.

--

* Consider the responses of our `\(n = 50\)` university students:

| Mobile phone preference | Brown eyes | Not brown eyes |
|:------------- |:--------------------:|:-------------------:|
| Apple (iOS)   | 12 | 16 |
| Android       | 8  | 4  |
| Neither       | 6  | 4  |

---

# Chi-squared Test of Independence

We also need to know the degrees of freedom:

--

.content-box-blue[
.center[
**Degrees of freedom for chi-squared test of independence:**

`\(\text{df} = (r - 1)(c - 1),\)`
]
]

where:

* `\(r\)` is the number of rows (i.e. the number of categories in the first variable)
* `\(c\)` is the number of columns (i.e. the number of categories in the second variable)

--

* What is the degrees of freedom under `\(H_0\)` for our example?

---

# Chi-squared Test of Independence

The degrees of freedom allows us to define the distribution we will use for the test.
* We again use the ***Chi-squared distribution***

* In our example, we have that

`$$\text{df} = (3 - 1)(2 - 1) = 2 \times 1 = 2.$$`

* So, we have `\(X^2 \sim \chi^2_2\)`

---

# Chi-squared distribution

<img src="data:image/png;base64,#Topic_10_Lecture_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" />

---

# Chi-squared Test of Independence

The formula for the **test statistic** is

`$$X^2 = \displaystyle \sum_{i = 1}^r \sum_{j = 1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}},$$`

--

* where, referring to the two-way table:
  * `\(O_{ij}\)` is the observed frequency in the `\(i\)`th row and the `\(j\)`th column

--

  * `\(E_{ij}\)` is the expected frequency of the cell in the `\(i\)`th row and the `\(j\)`th column

--

  * `\(r\)` is the number of rows

--

  * `\(c\)` is the number of columns

--

  * `\(X^2\)` is random, with `\(X^2 \sim \chi^2_{\text{df}}\)` under `\(H_0\)`

---

# Chi-squared Test of Independence

* The formula for the **observed test statistic** is

`$$\chi^2 = \displaystyle \sum_{i = 1}^r \sum_{j = 1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}},$$`

--

* where, referring to the two-way table:
  * `\(O_{ij}\)` is the observed frequency in the `\(i\)`th row and the `\(j\)`th column

--

  * `\(r\)` is the number of rows, and `\(c\)` is the number of columns

--

  * `\(E_{ij} = \displaystyle \frac{\text{row}\_\text{total}_i \times \text{column}\_\text{total}_j}{\text{grand}\_\text{total}}\)` is the expected frequency of the cell in the `\(i\)`th row and the `\(j\)`th column

--

  * `\(\text{row}\_\text{total}_i\)` is the number of observations in the `\(i\)`th row, and `\(\text{column}\_\text{total}_j\)` is the number of observations in the `\(j\)`th column

--

  * `\(\text{grand}\_\text{total}\)` is the total number of observations, often denoted `\(n\)`

---

# Chi-squared Test of Independence

Then, the `\(p\)`-value is equal to `\(P(X^2 \geq \chi^2)\)`.

--

* *See if you can use the formula on the previous slide to calculate the test statistic for yourself (an R sketch follows the assumption check later in the lecture)*
* As usual, if the `\(p\)`-value is less than `\(\alpha\)` (where `\(\alpha\)` is normally 0.05), we reject `\(H_0\)`

---

# Chi-squared Test of Independence Output

``` r
Pearson's Chi-squared test

data: table
X-squared = 2.2283, df = 2, p-value = 0.3282
```

---

# Chi-squared Test of Independence Output

``` r
Pearson's Chi-squared test

data: table
`X-squared = 2.2283`, df = 2, p-value = 0.3282
```

* The **test statistic** is equal to 2.2283

---

# Chi-squared Test of Independence Output

``` r
Pearson's Chi-squared test

data: table
`X-squared = 2.2283`, df = 2, `p-value = 0.3282`
```

* The **test statistic** is equal to 2.2283
* The **`\(p\)`-value** is `\(p = 0.3282\)`. Since this is greater than `\(\alpha = 0.05\)`, we do not reject `\(H_0\)` and do not have sufficient evidence to support that there is an association between mobile phone preference and eye colour.

---

# Chi-squared Test of Independence Output

``` r
Pearson's Chi-squared test

data: table
`X-squared = 2.2283`, `df = 2`, `p-value = 0.3282`
```

* The **test statistic** is equal to 2.2283
* The **`\(p\)`-value** is `\(p = 0.3282\)`. Since this is greater than `\(\alpha = 0.05\)`, we do not reject `\(H_0\)` and do not have sufficient evidence to support that there is an association between mobile phone preference and eye colour.
* The **degrees of freedom** is `\(\text{df} = 2\)`

---

# p-value Visualisation

<img src="data:image/png;base64,#Topic_10_Lecture_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" />

---

# Chi-squared Test of Independence: Assumptions

After carrying out the test, we will also need to check the assumptions:

.content-box-blue[
.center[
**Chi-squared Test of Independence assumptions:**
]
1. No more than 20% of categories have an expected count of less than 5
1. There are no expected counts of zero
]

---

# Chi-squared Test of Independence: Assumptions

The expected counts for our mobile phone vs eye colour example are as follows (see if you can calculate these yourself from the information already provided; an R sketch follows the assumption check below):

```
        [,1]  [,2]
group1 14.56 13.44
group2  6.24  5.76
group3  5.20  4.80
```

--

This means that:

1. One out of 6 categories (16.67%) has an expected count of less than 5
<i class="fas fa-check faa-pulse animated " style=" color:green;"></i>
1. There are no expected counts of zero
<i class="fas fa-check faa-pulse animated " style=" color:green;"></i>
--

Therefore, the assumptions have been met.

---
background-image: url(data:image/png;base64,#computerlab.jpg)
background-position: bottom
background-size: 75%
class: center

# See you in the computer labs!

---
class: middle

<font color = "grey">
These notes have been prepared by Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License <a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank"> BY-NC-ND. </a>
</font>