Topic 10: Chi-squared Tests for Categorical Data


In Topic 10 we extended our toolkit of hypothesis tests to include tests of categorical data, using Chi-squared tests. In this computer lab, we will cover how to conduct Chi-squared goodness of fit tests and Chi-squared tests of independence.


1 Chi-squared Goodness of Fit Test

🏡 Recall that in Computer Lab 10, we considered the proportion of university students who regularly drank coffee, using imaginary data. In the statistical tests we conducted for this data, there were only two categories into which the students could be categorised, namely:

  • Drink coffee regularly
  • Don’t drink coffee regularly

Often, we may be presented with more nuanced situations, for which there are more than two categories. For instance, we could have further segmented students into additional categories such as “Never drink coffee” and “Drink coffee every day”.

If we would like to conduct a hypothesis test to simultaneously check observed percentages against expected percentages, for more than two categories, we can use a Chi-squared Goodness of Fit test. Let’s take a look at how to conduct such a test.

1.1

🏡 A recent academic research paper (Czarniecka-Skubina et al. 2021) analysed the coffee consumption habits of adults in Poland. A total of 1500 respondents answered a variety of coffee-consumption questions online. Based on their responses, the paper used 7 categories of coffee consumption frequency, namely:

  • Once a month
  • Three times a month
  • Once a week
  • Three or four times a week
  • Once a day
  • Twice a day
  • Three or four times a day

Suppose that a claim has been made that the frequency of coffee consumption in Poland can be segmented as follows: Once a month: 10%, Three times a month: 5%, Once a week: 5%, Three or four times a week: 10%, Once a day: 20%, Twice a day: 30% and Three or four times a day: 20%.

Table 1.1 below displays data based on the survey (see Czarniecka-Skubina et al. 2021, p. 7), as well as the expected percentages based on the above claim.

Table 1.1: Polish Coffee Consumption Categories
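| Consumption frequency      | Observed frequency | Observed percentage | Expected percentage |
|----------------------------|--------------------|---------------------|---------------------|
| Once a month               | 108                | 7.20%               | 10%                 |
| Three times a month        | 74                 | 4.93%               | 5%                  |
| Once a week                | 38                 | 2.53%               | 5%                  |
| Three or four times a week | 146                | 9.73%               | 10%                 |
| Once a day                 | 335                | 22.33%              | 20%                 |
| Twice a day                | 426                | 28.40%              | 30%                 |
| Three or four times a day  | 373                | 24.87%              | 20%                 |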
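
As an optional check outside jamovi, the observed percentages in Table 1.1 can be recomputed from the observed frequencies. The short Python sketch below is not part of the lab itself; the variable names are our own, and the counts are copied from Table 1.1.

```python
# Observed frequencies copied from Table 1.1.
observed = {
    "Once a month": 108,
    "Three times a month": 74,
    "Once a week": 38,
    "Three or four times a week": 146,
    "Once a day": 335,
    "Twice a day": 426,
    "Three or four times a day": 373,
}

n = sum(observed.values())  # total number of respondents

# Observed percentage for each category, rounded to 2 decimal places.
observed_pct = {cat: round(100 * count / n, 2) for cat, count in observed.items()}

for cat, pct in observed_pct.items():
    print(f"{cat}: {pct}%")
```

The counts should add up to the 1500 respondents mentioned in Question 1.1, which is a quick way to spot a data-entry slip.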

1.2

🏡 To test whether the observed distribution of percentages is statistically significantly different from what was expected (or claimed), we can carry out a Chi-squared goodness of fit test.
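
For reference, the goodness of fit test statistic in its standard (Pearson) form compares each observed count with its expected count. The notation here is our own and may differ slightly from the Topic 10 material: \(O_i\) and \(E_i\) are the observed and expected counts for category \(i\), \(k\) is the number of categories, \(n\) is the sample size and \(p_i\) is the claimed proportion for category \(i\).

\[
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad E_i = n\,p_i
\]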

Define the null and alternative hypotheses for this test. 💬

1.3

🏡 What are the degrees of freedom for this test? 💬

1.4

💻 The data for this example can be found in the file called coffee_gof.csv. Download this data from the LMS, and then import it into jamovi.

1.5

💻 In jamovi, carry out the hypothesis test specified in Question 1.2.
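
If you would like to cross-check your jamovi output by hand, the following is a minimal Python sketch, not the method used in this lab. It assumes the counts and claimed percentages from Table 1.1, and it uses a closed-form expression for the upper-tail Chi-squared probability that is only valid when the degrees of freedom are even. The helper name `chi2_upper_tail_even_df` is our own.

```python
import math

# Observed counts and claimed proportions, in the order shown in Table 1.1.
observed = [108, 74, 38, 146, 335, 426, 373]
claimed = [0.10, 0.05, 0.05, 0.10, 0.20, 0.30, 0.20]

n = sum(observed)
expected = [n * p for p in claimed]  # expected count = n * claimed proportion

# Pearson Chi-squared statistic: sum of (observed - expected)^2 / expected.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1  # number of categories minus one


def chi2_upper_tail_even_df(x, df):
    """P(X >= x) for a Chi-squared variable; valid for EVEN df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k  # builds (x/2)^k / k! step by step
        total += term
    return math.exp(-half) * total


p_value = chi2_upper_tail_even_df(chi_sq, df)
print(f"chi-squared = {chi_sq:.2f}, df = {df}, p-value = {p_value:.3g}")
```

The printed statistic and \(p\)-value should agree with jamovi up to rounding. If they do not, check that the categories and expected proportions are in the same order in both places.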

1.6

🏡 Note down the test statistic and \(p\)-value provided by the test. Based on these values, what conclusion do you reach? 💬

Note: You can assume a level of significance of 5%.

1.7

🏡 We make two assumptions when conducting a Chi-squared goodness of fit test. Having conducted the test, we should now quickly check that these assumptions are satisfied.

Hint: If you don’t quite remember what these assumptions are, check the Topic 10 material.

To help in checking these assumptions, tick the checkbox next to the Expected counts option in your jamovi Chi-squared goodness of fit test analysis.

Using this information, explain whether or not our assumptions are satisfied. 💬
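
The expected counts that jamovi reports can also be reproduced by hand, since each one is the sample size multiplied by the claimed proportion. The sketch below assumes that one of the two assumptions is the common rule of thumb that every expected count should be at least 5; the Topic 10 material is the authority on the exact wording, and the threshold in the code is labelled as an assumption.

```python
# Claimed proportions from Question 1.1 and the sample size from the survey.
claimed = {
    "Once a month": 0.10,
    "Three times a month": 0.05,
    "Once a week": 0.05,
    "Three or four times a week": 0.10,
    "Once a day": 0.20,
    "Twice a day": 0.30,
    "Three or four times a day": 0.20,
}
n = 1500

THRESHOLD = 5  # assumed rule of thumb; check the Topic 10 material

# Expected count for each category = n * claimed proportion.
expected_counts = {cat: n * p for cat, p in claimed.items()}

smallest = min(expected_counts.values())
all_large_enough = all(e >= THRESHOLD for e in expected_counts.values())

print(f"Smallest expected count: {smallest:.1f}")
print(f"All expected counts at least {THRESHOLD}: {all_large_enough}")
```

Note that this only addresses the expected-count condition. Any assumption about how the data were collected has to be judged from the study design, not from the numbers.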

2 Chi-squared Goodness of Fit Test using a summary data set

💻 Note that sometimes, instead of a data set that contains one row for each observation, we may be presented with a summary data set containing a list of categories and associated counts.

An example of such a file is coffee_gof_2.csv. Download this file from the LMS, and open the file in jamovi.

Next, see if you can carry out the goodness of fit test using this summary data set, and confirm that your results here match those from Question 1.5.

Note: Exploring the data is not necessary here. However, you will need to set up the coffee_consumption variable to ensure that the categories are ordered correctly.
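
To see why the two kinds of data set lead to the same test, the sketch below expands summary counts into one row per respondent and then tallies them again. It is an illustration only: it uses the counts from Table 1.1 instead of reading coffee_gof_2.csv, whose exact layout we do not assume here. It also shows why the category order needs to be set by hand, since sorting the labels alphabetically does not give the intended order.

```python
from collections import Counter

# Intended category order, as listed in Question 1.1.
category_order = [
    "Once a month",
    "Three times a month",
    "Once a week",
    "Three or four times a week",
    "Once a day",
    "Twice a day",
    "Three or four times a day",
]

# Summary form: one entry per category with its count (from Table 1.1).
summary = dict(zip(category_order, [108, 74, 38, 146, 335, 426, 373]))

# Raw form: one entry per respondent, built by repeating each category.
raw = [cat for cat, count in summary.items() for _ in range(count)]

# Tallying the raw rows recovers exactly the summary counts.
tally = Counter(raw)
recovered = {cat: tally[cat] for cat in category_order}
print(recovered == summary)

# Alphabetical order differs from the intended order of the categories.
alphabetical = sorted(category_order)
print(alphabetical == category_order)
```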

3 Chi-squared Test of Independence

🏡 Continuing our Coffee example from questions 1 and 2, suppose that we are now interested in determining if there is a significant difference in coffee drinking habits between people of different ages. In other words, suppose we would like to test if there is an association between age and coffee drinking regularity. We can test this using a Chi-squared test of independence.

3.1

🏡 Suppose that our two categorical variables are Age and Coffee Drinking Regularity. Define the null and alternative hypotheses we would use, if we are to carry out a Chi-squared test of independence to test if there is an association between these two variables. 💬

3.2

🏡 Table 3.1 is based on the survey (see Czarniecka-Skubina et al. 2021, p. 5), and segments respondents into one of five age groups. For the Coffee Drinking Regularity variable, we have created two groups: Drink coffee at least once a day and Drink coffee less than once a day.

Table 3.1: Polish Coffee Consumption Age Categories
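| Age   | Drink coffee at least once a day | Drink coffee less than once a day |
|-------|----------------------------------|-----------------------------------|
| 18-25 | 261                              | 174                               |
| 26-30 | 167                              | 72                                |
| 31-40 | 227                              | 31                                |
| 41-50 | 269                              | 43                                |
| 51-65 | 228                              | 28                                |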

Based on Table 3.1, what will the degrees of freedom be for our Chi-squared test of independence? 💬

3.3

💻 The data for this example can be found in the file called coffee_indep.csv. Download this data from the LMS, and then import it into jamovi.

3.4

💻 In jamovi, carry out the hypothesis test specified in Question 3.1.
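
As with the goodness of fit test, the jamovi output can be cross-checked by hand. The following is a minimal Python sketch under stated assumptions, not the lab's method: it uses the counts from Table 3.1, computes each expected count as (row total × column total) / grand total, and evaluates the \(p\)-value with a closed-form expression that only holds for an even number of degrees of freedom.

```python
import math

# Observed counts from Table 3.1.
# Columns: [at least once a day, less than once a day].
table = {
    "18-25": [261, 174],
    "26-30": [167, 72],
    "31-40": [227, 31],
    "41-50": [269, 43],
    "51-65": [228, 28],
}

rows = list(table.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Pearson Chi-squared statistic summed over every cell of the table.
chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(rows, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Upper-tail probability; this closed form is valid for EVEN df only.
assert df % 2 == 0, "closed form below needs an even df"
half = chi_sq / 2
p_value = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

print(f"chi-squared = {chi_sq:.2f}, df = {df}, p-value = {p_value:.3g}")
```

The `expected` list can be compared cell by cell with the expected counts that jamovi displays.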

3.5

🏡 Note down the test statistic and \(p\)-value provided by the test. Based on these values, what conclusion do you reach? 💬

Note: You can assume a level of significance of 5% once more.

3.6

🏡 Just as for the Chi-squared goodness of fit test, we make two assumptions when conducting a Chi-squared test of independence. Check that these assumptions are satisfied, using 1.7 as a guide. 💬
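
A quick way to check expected counts for a test of independence, without building the whole table of expected values, is to note that each expected count equals (row total × column total) / grand total. The smallest expected count therefore comes from pairing the smallest row total with the smallest column total. The sketch below illustrates this with the counts from Table 3.1; the threshold of 5 is an assumed rule of thumb, so check it against the Topic 10 material.

```python
# Observed counts from Table 3.1.
# Columns: [at least once a day, less than once a day].
observed = [
    [261, 174],  # 18-25
    [167, 72],   # 26-30
    [227, 31],   # 31-40
    [269, 43],   # 41-50
    [228, 28],   # 51-65
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Smallest expected count = smallest row total * smallest column total / n.
smallest_expected = min(row_totals) * min(col_totals) / n

THRESHOLD = 5  # assumed rule of thumb; check the Topic 10 material
print(f"Smallest expected count: {smallest_expected:.3f}")
print(f"At least {THRESHOLD}: {smallest_expected >= THRESHOLD}")
```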

4 Chi-squared Test of Independence using a summary data set

💻 Note that sometimes, instead of a data set that contains one row for each observation, we may be presented with a summary data set containing a list of categories and associated counts.

An example of such a file is coffee_indep_2.csv. Download this file from the LMS, and open the file in jamovi.

Next, see if you can carry out the test of independence using this summary data set, and confirm that your results here match those from Question 3.4.

Note: Exploring the data is not necessary here. However, you will be able to create a bar chart within the Chi-squared test of independence analysis.
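
A summary data set for two categorical variables usually has one row per combination of categories, together with a count. The sketch below shows how such rows map back to the layout of Table 3.1. The row structure and the labels used here are hypothetical, chosen for illustration; the actual columns in coffee_indep_2.csv may be named and arranged differently.

```python
# Hypothetical summary rows: (age group, coffee regularity, count).
# Counts are taken from Table 3.1.
summary_rows = [
    ("18-25", "At least once a day", 261),
    ("18-25", "Less than once a day", 174),
    ("26-30", "At least once a day", 167),
    ("26-30", "Less than once a day", 72),
    ("31-40", "At least once a day", 227),
    ("31-40", "Less than once a day", 31),
    ("41-50", "At least once a day", 269),
    ("41-50", "Less than once a day", 43),
    ("51-65", "At least once a day", 228),
    ("51-65", "Less than once a day", 28),
]

age_order = ["18-25", "26-30", "31-40", "41-50", "51-65"]
regularity_order = ["At least once a day", "Less than once a day"]

# Look up each count by its (age, regularity) pair.
counts = {(age, reg): count for age, reg, count in summary_rows}

# Rebuild the two-way table: one row per age group, one column per regularity.
contingency = [[counts[(age, reg)] for reg in regularity_order] for age in age_order]

for age, row in zip(age_order, contingency):
    print(age, row)
```

Each summary row corresponds to exactly one cell of the two-way table, which is why the test gives the same result whichever form the data arrive in.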


Well done, that’s everything for today! If you still have time, you may like to work on Assignment 4.

Before you finish up, remember to save your work (e.g. your jamovi and Word files) somewhere safe (e.g. OneDrive) so that you can access it at a later time.


References

Czarniecka-Skubina, E., M. Pielak, P. Sałek, R. Korzeniowska-Ginter, and T. Owczarek. 2021. “Consumer Choices and Habits Related to Coffee Consumption by Poles.” International Journal of Environmental Research and Public Health 18 (8). https://doi.org/10.3390/ijerph18083948.


These notes have been prepared by Amanda Shaker and Rupert Kuveke. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.