Topic 10: Chi-squared Tests for Categorical Data

In Topic 10 we extended our toolkit of hypothesis tests to include tests of categorical data, using Chi-squared tests. In this computer lab, we will cover how to conduct Chi-squared goodness of fit tests and Chi-squared tests of independence.

1 Chi-squared Goodness of Fit Test

🏡 Recall that in Computer Lab 10, we considered the proportion of university students who regularly drank coffee, using imaginary data. In the statistical tests we conducted for this data, there were only two categories into which the students could be categorised, namely:

Drink coffee regularly
Don’t drink coffee regularly

Often, we may be presented with more nuanced situations, for which there are more than two categories. For instance, we could have further segmented students into additional categories such as “Never drink coffee” and “Drink coffee every day”.

If we would like to conduct a hypothesis test to simultaneously check observed percentages against expected percentages, for more than two categories, we can use a Chi-squared Goodness of Fit test. Let’s take a look at how to conduct such a test.
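For reference, the test statistic compares each observed count with the count expected under the claim. In the notation assumed here (the symbols are ours, and may differ from those used in the Topic 10 material), \(O_i\) is the observed count in category \(i\), \(p_i\) is the claimed proportion for that category, \(n\) is the total sample size and \(k\) is the number of categories:

\[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad E_i = n\,p_i. \]

Large values of \(\chi^2\) indicate that the observed counts are far from the expected counts.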

1.1 Coffee Consumption Data

🏡 A recent academic research paper (Czarniecka-Skubina et al. 2021) analysed the coffee consumption habits of adults in Poland. 1500 respondents provided online feedback to a variety of coffee-consumption questions. Based on their responses, 7 categories of coffee consumption frequency were used in the paper, namely:

Once a month
Three times a month
Once a week
Three or four times a week
Once a day
Twice a day
Three or four times a day

Suppose that a claim has been made that the frequency of coffee consumption in Poland can be segmented as follows:

Once a month: 10%, Three times a month: 5%, Once a week: 5%, Three or four times a week: 10%, Once a day: 20%, Twice a day: 30% and Three or four times a day: 20%.

Table 1.1 below displays data based on the survey (see Czarniecka-Skubina et al. 2021, p. 7), as well as the expected percentages based on the above claim.

To test whether the observed distribution of percentages is statistically significantly different from the expected distribution, we can carry out a Chi-squared goodness of fit test using the data in Table 1.1.

Table 1.1: Polish Coffee Consumption Categories
Frequency	Observed Frequency	Observed Percentage	Expected Percentage
Once a month	108	7.2%	10%
Three times a month	74	4.93%	5%
Once a week	38	2.53%	5%
Three or four times a week	146	9.73%	10%
Once a day	335	22.33%	20%
Twice a day	426	28.4%	30%
Three or four times a day	373	24.87%	20%

1.2 Defining Hypotheses

🏡 To begin, define the null and alternative hypotheses for this test.

🎧 Online students

💬 Enter your answer next to the question in the shared Google Doc.

1.3

🏡 What are the degrees of freedom for this test?

🎧 Online students

💬 Enter your answer next to the question in the shared Google Doc.

1.4 Data Preparation

💻 In order to carry out the test in R, we will first need to set up our data. Using the information in Table 1.1, complete the ... sections in the R code below, and then run the code.

obs.freq <- c(108, 74, 38, ...)
exp.prop <- c(0.1, 0.05, 0.05, ...)

Hint: You should be able to determine the remaining values based on the objects’ names.

1.5 Conducting a Chi-squared Goodness of Fit Test in R

💻 We can use the chisq.test R function to carry out a Chi-squared goodness of fit test in R. The two main arguments in this function are:

x, the observed frequencies (a vector), and
p, the expected proportions (also a vector)

Based on this information, carry out a Chi-squared goodness of fit test for the hypotheses defined in 1.2, using the data you prepared in 1.4, and assign your result to an object with a name of your choice.

Note: If we don’t specify p, the test will assume we want to test for equal proportions across the different categories - e.g. if we had 4 categories, and did not specify p, it would assume each category contained 25% of observations.
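As a minimal sketch of the two arguments described above, the example below uses made-up counts for three categories. The object names toy.obs, toy.prop, toy.test and toy.equal, and all of the values, are hypothetical and are not the lab data. The last two lines illustrate what happens when p is left out.

```r
# Hypothetical observed counts for three categories (NOT the lab data)
toy.obs <- c(20, 30, 50)
# Hypothetical claimed proportions; these must sum to 1
toy.prop <- c(0.25, 0.25, 0.5)

# Goodness of fit test against the claimed proportions
toy.test <- chisq.test(x = toy.obs, p = toy.prop)
toy.test$statistic  # test statistic
toy.test$p.value    # p-value

# Without p, equal proportions (1/3 each here) are assumed
toy.equal <- chisq.test(x = toy.obs)
toy.equal$expected
```

The same pattern should carry over to the lab data once obs.freq and exp.prop have been completed.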

1.5.1

💻 Note down the test statistic and \(p\)-value of the Chi-squared goodness of fit test. Based on these values, what can we conclude?

Note: You can assume a level of significance of 5%.

🎧 Online students

💬 Enter your answer next to the question in the shared Google Doc.

1.6 Test Assumption Checks

🏡 We make two assumptions when conducting a Chi-squared goodness of fit test. Having conducted the test, we should now quickly check that these assumptions are satisfied.

Hint: If you don’t quite remember what these assumptions are, check the Topic 10 material.

To help in checking these test assumptions, use the following code. Just replace the ... with whatever you named your object in 1.5:

...$expected

Using this information, explain whether or not our assumptions are satisfied.

🎧 Online students

💬 Enter your answer next to the question in the shared Google Doc.
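To illustrate what the expected element contains, here is a small sketch using made-up data (toy.obs, toy.prop and toy.test are hypothetical names and values, not the lab data). The threshold of 5 used in the final line is one commonly quoted rule of thumb and is an assumption here; the Topic 10 material remains the reference for the exact assumptions used in this subject.

```r
# Hypothetical data (NOT the lab data)
toy.obs <- c(20, 30, 50)
toy.prop <- c(0.25, 0.25, 0.5)
toy.test <- chisq.test(x = toy.obs, p = toy.prop)

# Expected counts are the sample size times the claimed proportions
toy.test$expected
sum(toy.obs) * toy.prop

# Assumed rule of thumb: are all expected counts at least 5?
all(toy.test$expected >= 5)
```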

2 Preparing data for a Chi-squared Goodness of Fit Test

💻 Note that sometimes, instead of a nicely formatted table of data such as Table 1.1, we may be presented with an unformatted data set that contains (amongst other variables) the information we need to conduct our Chi-squared tests.

In this question we will prepare a data set from scratch for analysis.

2.1

💻 Run the R code below to create a simple data set that simulates the type of data set we may encounter:

# This first part creates the observations for the different frequency categories
coffee.consumption <- as.factor(c(rep("Once.Month", 108), rep("Three.per.Month", 74), 
                                  rep("Once.Week", 38), rep("Three.or.Four.Week", 146), 
                                  rep("Once.Day", 335), rep("Twice.Day", 426), 
                                  rep("Three.or.Four.Day", 373)))

# This second part randomizes the order - we now have an object similar to what we might encounter 
# if we were analysing a new data set.
coffee.consumption <- sample(coffee.consumption, 1500, replace = FALSE)
# Take a look at the first few sampled observations
head(coffee.consumption)

2.2

💻 In practice, the data set we receive may not have the different levels ordered properly. It is important to order the different levels in our data set in a meaningful way before we continue our analysis. We can do this using the ordered function, as shown below:

coffee.consumption <- ordered(coffee.consumption, 
                              levels = c("Once.Month", "Three.per.Month", 
                                         "Once.Week", "Three.or.Four.Week", 
                                         "Once.Day", "Twice.Day", "Three.or.Four.Day"))
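To see why the ordering step matters, consider the following sketch with a made-up factor (toy.sizes is a hypothetical object, not part of the lab data). By default, R sorts factor levels alphabetically, and a frequency table follows the level order. If the table's order does not match the order of the expected proportions, each count would be compared against the wrong proportion.

```r
# Hypothetical factor (NOT the lab data)
toy.sizes <- factor(c("Medium", "Small", "Large", "Small"))

# Default level order is alphabetical
levels(toy.sizes)

# Impose a meaningful order instead
toy.sizes <- ordered(toy.sizes, levels = c("Small", "Medium", "Large"))
levels(toy.sizes)

# The frequency table now follows the new level order
table(toy.sizes)
```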

2.3

💻 Create a frequency table of the coffee.consumption data. You will see that our data set is starting to resemble the data presented in Table 1.1.

Hint: It has been a while since we introduced frequency tables, so if you need to refresh your memory, check the R code below for a head-start:

freq.coffee <- table(...)

2.4

💻 Now, use the chisq.test function and the exp.prop object from 1.4 to carry out the same test as the one you used in 1.5, but this time using the data stored in your frequency table object from 2.3.

Confirm that your results here and for 1.5 are the same.

3 Chi-squared Test of Independence

🏡 Continuing our Coffee example from Questions 1 and 2, suppose that we are now interested in determining if there is a significant difference in coffee drinking habits between people of different ages. In other words, suppose we would like to test if there is an association between age and coffee drinking regularity. We can test this using a Chi-squared Test of Independence.

3.1 Defining Hypotheses

🏡 Suppose that our two categorical variables are Age and Coffee Drinking Regularity. Define the null and alternative hypotheses we would use, if we are to carry out a Chi-squared test of independence to test if there is an association between these two variables.

🎧 Online students

💬 Enter your answer next to the question in the shared Google Doc.

3.2

🏡 Table 3.1 is based on the survey (see Czarniecka-Skubina et al. 2021, p. 5), and segments respondents into one of five age groups. For the Coffee Drinking Regularity variable, we have created two groups: Drink coffee at least once a day and Drink coffee less than once a day.

Table 3.1: Polish Coffee Consumption Age Categories
Age	Drink coffee at least once a day	Drink coffee less than once a day
18-25	261	174
26-30	167	72
31-40	227	31
41-50	269	43
51-65	228	28

Based on Table 3.1, what will the degrees of freedom be for our Chi-squared test of independence?

🎧 Online students

💬 Enter your answer next to the question in the shared Google Doc.

3.3 Data Preparation

💻 To carry out our Chi-squared test of independence in R, we will first have to prepare our data. Use the R code below as a base, and fill in the missing ... sections before running your code.

group1 <- c(261, 174)
...

table <- rbind(group1, ...)

3.4 Conducting a Chi-squared Test of Independence in R

💻 Once you are happy with your code, use the chisq.test function to conduct a Chi-squared test of independence for the data stored in the table object you created above in 3.3.
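The sketch below shows the general pattern on a made-up 2 x 2 table (toy.table, toy.indep and by.hand are hypothetical names, and the counts are not the lab data). When chisq.test is given a matrix of counts with more than one row and column, it carries out a test of independence. The expected counts it reports can be reproduced by hand from the row and column totals.

```r
# Hypothetical 2 x 2 table of counts (NOT the lab data)
toy.table <- rbind(groupA = c(30, 20),
                   groupB = c(10, 40))

# Passing a matrix of counts makes chisq.test run a test of independence
toy.indep <- chisq.test(toy.table)

# Expected counts under independence: (row total * column total) / n
by.hand <- outer(rowSums(toy.table), colSums(toy.table)) / sum(toy.table)
toy.indep$expected
by.hand
```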

3.4.1

🏡 Note down the Chi-squared test of independence’s test statistic and \(p\)-value. Based on these values, what conclusion do you reach?

Note: You can assume a level of significance of 5%.

🎧 Online students

💬 Enter your answer next to the question in the shared Google Doc.

3.5 Test Assumption Checks

🏡 Just as for the Chi-squared goodness of fit test, we make two assumptions when conducting a Chi-squared test of independence. Check that these assumptions are satisfied, using 1.6 as a guide.

🎧 Online students

💬 Enter your answer next to the question in the shared Google Doc.

4 Chi-squared Test of Independence using an existing Data Set

💻 In Question 3, we constructed a data set in R based on data from an academic research paper. However, often we will already have a data set.

Suppose that we would like to conduct a Chi-squared test of independence, to assess the property information stored in the properties data set from the datarium R package (Kassambara 2019).

This data set contains information on:

property_type: The type of property purchased, one of: flat, bungalow, detached house or terrace.
buyer_type: The type of buyer of the property, one of: single male, single female, married couple, or family.

4.1

💻 Install and load the datarium R package now.

Hint: Check the R code below if you need a refresher on installing or loading packages in R.

install.packages("datarium")
library(...)

4.2

💻 Use the table function to create a frequency table for the different variables in the properties data set.
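When table is given two categorical variables, it returns a two-way table of counts, which is the form needed for a test of independence. The following is a minimal sketch using a made-up data frame; toy.df and its columns pet and home are hypothetical and are not part of the datarium package.

```r
# Hypothetical data frame with two categorical variables (NOT the lab data)
toy.df <- data.frame(
  pet  = c("cat", "dog", "cat", "dog", "cat", "dog"),
  home = c("flat", "house", "flat", "flat", "house", "house")
)

# Two-way frequency table: rows are pet, columns are home
toy.table <- table(toy.df$pet, toy.df$home)
toy.table
```

For the lab, the same call would presumably use the two columns of the properties data set in place of pet and home.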

4.3

💻 Using the chisq.test R function and the frequency table you created in 4.2, conduct a Chi-squared test of independence to see if there is a significant difference in the property type purchased by different types of buyers.

What do you conclude?

5 Extension: prop.test versus chisq.test

💻 In this extension question, we will demonstrate the relationship between the prop.test command (introduced in Computer Lab 10) and the chisq.test command.

5.1

💻 Under certain conditions, the one-sample test of proportions is equivalent to the Chi-squared goodness of fit test. To see this, run the following code, and check the results.

sample.data <- c(368, 484 - 368)
table <- t(sample.data)
table

prop.test(table, p = 0.73, correct = FALSE)
chisq.test(table, p = c(0.73, 1 - 0.73))

💻 If you would like further proof, try changing the numeric values in the R code.

Note that this only works if we have a binary category, and a two-sided test. We also need to turn the continuity correction option off for the prop.test command.
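One way to see where the equivalence comes from is to compute the z statistic for the one-sample test of proportions by hand and square it. The sketch below reuses the values from the code above; the object names x, n, p0, z, one.prop and gof are our own choices for illustration.

```r
# Values taken from the code above
x <- 368; n <- 484; p0 <- 0.73

# z statistic for a one-sample test of proportions
z <- (x / n - p0) / sqrt(p0 * (1 - p0) / n)

one.prop <- prop.test(x, n, p = p0, correct = FALSE)
gof <- chisq.test(c(x, n - x), p = c(p0, 1 - p0))

# The squared z statistic matches both reported test statistics
z^2
unname(one.prop$statistic)
unname(gof$statistic)
```

With only two categories, the two terms of the Chi-squared sum combine into the squared z statistic, which is why the two-sided tests agree.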

5.2

💻 Similarly, the two-sample test of proportions is equivalent to the Chi-squared test of independence. To see this, run the following code, and check the results.

group1 <- c(154, 220 - 154)
group2 <- c(320, 416 - 320)

table2 <- rbind(group1, group2)

prop.test(table2)

chisq.test(table2)
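Unlike in 5.1, the continuity correction does not need to be switched off here, because for a 2 x 2 table both functions apply it by default. The sketch below checks the test statistics programmatically, with and without the correction; same.default and same.uncorrected are hypothetical names chosen for illustration.

```r
group1 <- c(154, 220 - 154)
group2 <- c(320, 416 - 320)
table2 <- rbind(group1, group2)

# Default settings: both functions apply the continuity correction
same.default <- all.equal(unname(prop.test(table2)$statistic),
                          unname(chisq.test(table2)$statistic))

# Correction switched off in both functions
same.uncorrected <- all.equal(
  unname(prop.test(table2, correct = FALSE)$statistic),
  unname(chisq.test(table2, correct = FALSE)$statistic))

same.default
same.uncorrected
```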

5.3

💻 Finally, it is worth noting that we can actually use the prop.test function to test for a difference in more than two groups at once, just as for the Chi-squared test of independence (as shown in 4).

Run the following code, and compare the output of the two functions.

group1 <- c(154, 66)
group2 <- c(320, 96)
group3 <- c(279, 103)
group4 <- c(215, 214)

table3 <- rbind(group1, group2, group3, group4)

prop.test(table3)
chisq.test(table3)

6 Extension: Visualising the Data

💻 To conclude this computer lab, let’s create some visualisations of the data we have been assessing today.

In Computer Lab 10 we covered how to create a stacked bar chart in R using plotly. If you need to refresh your memory on this, don’t worry, we have included some code below.

6.1

💻 Run the following code to load the plotly package in R.

Note: It should already be installed on your system, but if for whatever reason it is not, just uncomment the first line of code below, and run that too.

#install.packages("plotly")
library(plotly)

6.2

💻 Let’s see if we can create a stacked bar chart for the Table 1.1 data. Fill in the missing ... sections below, and run the code once you are happy with it.

options <- c("Observed", "Expected")
once.month <- c(7.2, 10)
three.per.month <- c(4.93, 5)
...

data <- data.frame(options, once.month, three.per.month, ...)

fig <- plot_ly(data, x = ~options, y = ~once.month, type = 'bar', name = 'Once per month',
               text = once.month, textposition = 'auto')

fig <- fig %>% add_trace(y = ~three.per.month, name = 'Three times per month',
                         text = three.per.month, textposition = 'auto')

...

fig <- fig %>% layout(yaxis = list(title = 'Cumulative Percentage'),   
                      xaxis = list(title = " ", categoryorder = 'category descending'), 
                      barmode = 'stack')

fig

Hint: If you are not quite sure how to proceed, remove the , ... part in the data <- line of code above, and run all the code to get an idea of what the stacked bar chart will look like using just two categories.

Once you create the plot, try clicking on the legend options.

6.3

💻 Now, try and create a stacked bar plot of the data in Table 3.1. Plot your data on a relative scale, so that the bars are the same height - this will make proportions easier to compare.

Hint: Recall from Question 3 of Computer Lab 10 that an easy way to convert your data to be on a relative scale is to ensure your inputs are percentages rather than counts. As a first step, use the information in Table 3.1 to determine the total sample sizes in the two categories.
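As a sketch of the conversion from counts to percentages, the example below uses the base R function prop.table on a made-up table (toy.counts and toy.pct are hypothetical names, and the counts are not the lab data). Setting margin = 1 divides each count by its row total, so each row sums to 100 after multiplying by 100; margin = 2 would use column totals instead.

```r
# Hypothetical counts (NOT the lab data): rows are groups, columns are outcomes
toy.counts <- rbind(groupA = c(30, 20),
                    groupB = c(10, 40))

# Convert each row to percentages of its row total
toy.pct <- 100 * prop.table(toy.counts, margin = 1)
toy.pct

# Each row now sums to 100, so stacked bars would share the same height
rowSums(toy.pct)
```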

References

Czarniecka-Skubina, E., M. Pielak, P. Sałek, R. Korzeniowska-Ginter, and T. Owczarek. 2021. “Consumer Choices and Habits Related to Coffee Consumption by Poles.” International Journal of Environmental Research and Public Health 18 (8). https://doi.org/10.3390/ijerph18083948.

Kassambara, A. 2019. Datarium: Data Bank for Statistical Analysis and Visualization. https://CRAN.R-project.org/package=datarium.

These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.

STM1001: Computer Lab 11
