In Topic 10 we extended our toolkit of hypothesis tests to include tests of categorical data, using Chi-squared tests. In this computer lab, we will cover how to conduct Chi-squared goodness of fit tests and Chi-squared tests of independence.
🏡 Recall that in Computer Lab 10, we considered the proportion of university students who regularly drank coffee, using imaginary data. In the statistical tests we conducted for this data, there were only two categories into which the students could be categorised: those who regularly drink coffee, and those who do not.
Often, we may be presented with more nuanced situations, for which there are more than two categories. For instance, we could have further segmented students into additional categories such as “Never drink coffee” and “Drink coffee every day”.
If we would like to conduct a hypothesis test to simultaneously check observed percentages against expected percentages, for more than two categories, we can use a Chi-squared Goodness of Fit test. Let’s take a look at how to conduct such a test.
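🏡 A recent academic research paper (Czarniecka-Skubina et al. 2021) analysed the coffee consumption habits of adults in Poland. A total of 1500 respondents answered a variety of coffee-consumption questions online. Based on their responses, 7 categories of coffee consumption frequency were used in the paper, namely: once a month, three times a month, once a week, three or four times a week, once a day, twice a day, and three or four times a day.
Suppose that a claim has been made that the frequency of coffee consumption in Poland can be segmented as follows: 10% once a month, 5% three times a month, 5% once a week, 10% three or four times a week, 20% once a day, 30% twice a day, and 20% three or four times a day.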
Table 1.1 below displays data based on the survey (see Czarniecka-Skubina et al. 2021, p. 7), as well as the expected percentages based on the above claim.
To test whether the observed distribution of percentages is statistically significantly different from the expected distribution, we can carry out a Chi-squared goodness of fit test using the data in Table 1.1.
Table 1.1
| Frequency | Observed Frequency | Observed Percentage | Expected Percentage |
|---|---|---|---|
| Once a month | 108 | 7.20% | 10% |
| Three times a month | 74 | 4.93% | 5% |
| Once a week | 38 | 2.53% | 5% |
| Three or four times a week | 146 | 9.73% | 10% |
| Once a day | 335 | 22.33% | 20% |
| Twice a day | 426 | 28.40% | 30% |
| Three or four times a day | 373 | 24.87% | 20% |
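For reference, the statistic that a goodness of fit test of this kind is usually based on can be written as below. The notation (\(O_i\), \(E_i\), \(p_i\), \(n\), \(k\)) is our own shorthand, not something taken from the paper or from Table 1.1.

\[
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad E_i = n \, p_i
\]

Here \(O_i\) is the observed frequency in category \(i\), \(p_i\) is the expected proportion for that category, \(n\) is the total number of observations, and \(k\) is the number of categories. For example, for the “Once a month” row of Table 1.1, \(E = 1500 \times 0.10 = 150\), so that row contributes \((108 - 150)^2 / 150 = 11.76\) to the statistic.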
🏡 To begin, define the null and alternative hypotheses for this test.
🏡 What are the degrees of freedom for this test?
💻 In order to carry out the test in R, we will first need to set up our data. Using the information in Table 1.1, complete the ... sections in the R code below, and then run the code.
obs.freq <- c(108, 74, 38, ...)
exp.prop <- c(0.1, 0.05, 0.05, ...)
Hint: You should be able to determine the remaining values based on the objects’ names.
💻 We can use the chisq.test R function to carry out a Chi-squared goodness of fit test in R. The two main arguments in this function are:
- x, the observed frequencies (a vector), and
- p, the expected proportions (also a vector).
Based on this information, carry out a Chi-squared goodness of fit test for the hypotheses defined in 1.2, using the data you prepared in 1.4, and assign your result to an object with a name of your choice.
Note: If we don’t specify p, the test will assume we want to test for an equal distribution across the different categories. For example, if we had 4 categories and did not specify p, it would assume each category contained 25% of observations.
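As a rough illustration of how x and p behave, here is a minimal sketch using made-up counts for four categories. The object names (hyp.obs, hyp.p, res.equal, res.claim) and the numbers are hypothetical and are not the lab data.

```r
# Hypothetical observed counts for four categories (NOT the Table 1.1 data).
hyp.obs <- c(18, 32, 24, 26)

# Without p, chisq.test() tests against equal proportions (25% each here).
res.equal <- chisq.test(x = hyp.obs)

# With p, the expected proportions come from the claim being tested.
hyp.p <- c(0.20, 0.30, 0.25, 0.25)   # assumed claim; must sum to 1
res.claim <- chisq.test(x = hyp.obs, p = hyp.p)

res.equal$statistic   # test statistic under equal proportions
res.claim$statistic   # test statistic under the claimed proportions
res.claim$parameter   # degrees of freedom
res.claim$p.value     # p-value
```

The same observed counts can give quite different test statistics depending on which expected proportions they are compared against, which is why p should be stated explicitly whenever the claim is not "all categories equally likely".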
💻 Note down the Chi-square goodness of fit test’s test statistic and \(p\)-value. Based on these values, what can we conclude?
Note: You can assume a level of significance of 5%.
🏡 We make two assumptions when conducting a Chi-squared goodness of fit test. Having conducted the test, we should now quickly check that these assumptions are satisfied.
Hint: If you don’t quite remember what these assumptions are, check the Topic 10 material.
To help in checking these test assumptions, use the following code. Just replace the ... with whatever you named your object in 1.5:
...$expected
Using this information, explain whether or not our assumptions are satisfied.
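A small sketch of how the expected counts might be inspected is given below. It uses made-up counts and hypothetical object names, and it assumes that one of the Topic 10 assumptions concerns the size of the expected counts (a commonly used rule of thumb is that every expected count should be at least 5). Check the Topic 10 material for the exact wording used in this subject.

```r
# Hypothetical counts and claimed proportions (NOT the lab data).
hyp.obs <- c(12, 9, 40, 39)
hyp.res <- chisq.test(hyp.obs, p = c(0.1, 0.1, 0.4, 0.4))

# Expected counts under the null hypothesis: total count times each proportion.
hyp.res$expected

# Assumed rule of thumb: are all expected counts at least 5?
all(hyp.res$expected >= 5)
```

Note that the check is made on the expected counts, not on the observed counts.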
💻 Note that sometimes, instead of a nicely formatted table of data such as Table 1.1, we may be presented with an unformatted data set that contains (amongst other variables) the information we need to conduct our Chi-squared tests.
In this question we will prepare a data set from scratch for analysis.
💻 Run the R code below to create a simple data set that simulates the type of data set we may encounter:
# This first part creates the observations for the different frequency categories
coffee.consumption <- as.factor(c(rep("Once.Month", 108), rep("Three.per.Month", 74),
                                  rep("Once.Week", 38), rep("Three.or.Four.Week", 146),
                                  rep("Once.Day", 335), rep("Twice.Day", 426),
                                  rep("Three.or.Four.Day", 373)))
# This second part randomizes the order - we now have an object similar to what we might encounter
# if we were analysing a new data set.
coffee.consumption <- sample(coffee.consumption, 1500, replace = FALSE)
# Take a look at the first few sampled observations
head(coffee.consumption)
💻 In practice, the data set we receive may not have the different levels ordered properly. It is important to order the different levels in our data set in a meaningful way before we continue our analysis. We can do this using the ordered function, as shown below:
coffee.consumption <- ordered(coffee.consumption,
                              levels = c("Once.Month", "Three.per.Month",
                                         "Once.Week", "Three.or.Four.Week",
                                         "Once.Day", "Twice.Day", "Three.or.Four.Day"))
💻 Create a frequency table of the coffee.consumption data. You will see that our data set is starting to resemble the data presented in Table 1.1.
Hint: It has been a while since we introduced frequency tables, so if you need to refresh your memory, check the R code below for a head-start:
freq.coffee <- table(...)
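To see why the level order matters before tabulating, here is a minimal sketch on a tiny made-up factor. The object hyp.size and its labels are hypothetical and unrelated to the coffee data.

```r
# Hypothetical categorical data (NOT the coffee data).
hyp.size <- factor(c("Small", "Large", "Medium", "Small", "Large"))

# By default, factor levels are sorted alphabetically.
levels(hyp.size)

# ordered() lets us impose a meaningful order on the levels.
hyp.size <- ordered(hyp.size, levels = c("Small", "Medium", "Large"))
levels(hyp.size)

# A frequency table lists its counts in the level order set above.
table(hyp.size)
```

Because the frequency table follows the level order, any vector of expected proportions supplied later needs to be written in that same order.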
🏡 Continuing our Coffee example from Questions 1 and 2, suppose that we are now interested in determining if there is a significant difference in coffee drinking habits between people of different ages. In other words, suppose we would like to test if there is an association between age and coffee drinking regularity. We can test this using a Chi-squared Test of Independence.
🏡 Suppose that our two categorical variables are Age and Coffee Drinking Regularity. Define the null and alternative hypotheses we would use, if we are to carry out a Chi-squared test of independence to test if there is an association between these two variables.
🏡 Table 3.1 is based on the survey (see Czarniecka-Skubina et al. 2021, p. 5), and segments respondents into one of five age groups. For the Coffee Drinking Regularity variable, we have created 2 groups: Drink coffee at least once a day and Drink coffee less than once a day.
Table 3.1
| Age | Drink coffee at least once a day | Drink coffee less than once a day |
|---|---|---|
| 18-25 | 261 | 174 |
| 26-30 | 167 | 72 |
| 31-40 | 227 | 31 |
| 41-50 | 269 | 43 |
| 51-65 | 228 | 28 |
Based on Table 3.1, what will the degrees of freedom be for our Chi-square test of independence?
💻 To carry out our Chi-squared test of independence in R, we will first have to prepare our data. Use the R code below as a base, and fill in the missing ... sections before running your code.
group1 <- c(261, 174)
...
table <- rbind(group1, ...)
💻 Once you are happy with your code, use the chisq.test function to conduct a Chi-squared test of independence for the data stored in the table object you created above in 3.3.
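The sketch below shows, on a made-up 3-by-2 table, what chisq.test does when it is given a matrix of counts. All object names (hyp.a, hyp.b, hyp.c, hyp.tab, hyp.res, hyp.manual) and all numbers are hypothetical and are not the Table 3.1 data.

```r
# Hypothetical counts: three groups (rows) by two outcomes (columns).
hyp.a <- c(30, 20)
hyp.b <- c(25, 25)
hyp.c <- c(15, 35)
hyp.tab <- rbind(hyp.a, hyp.b, hyp.c)

# Given a matrix with more than one row and column,
# chisq.test() carries out a test of independence.
hyp.res <- chisq.test(hyp.tab)

# Expected count in each cell = (row total * column total) / grand total.
hyp.manual <- outer(rowSums(hyp.tab), colSums(hyp.tab)) / sum(hyp.tab)

hyp.res$expected    # expected counts reported by chisq.test()
hyp.manual          # the same values, computed by hand
hyp.res$parameter   # degrees of freedom
hyp.res$p.value     # p-value
```

Comparing hyp.res$expected with the hand-computed values is a quick way to confirm that the rows and columns were entered the way you intended.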
🏡 Note down the Chi-squared test of independence’s test statistic and \(p\)-value. Based on these values, what conclusion do you reach?
Note: You can assume a level of significance of 5%.
🏡 Just as for the Chi-squared goodness of fit test, we make two assumptions when conducting a Chi-squared test of independence. Check that these assumptions are satisfied, using 1.6 as a guide.
💻 In Question 3, we constructed a data set in R based on data from an academic research paper. However, often we will already have a data set.
Suppose that we would like to conduct a Chi-squared test of independence, to assess the property information stored in the properties data set from the datarium R package (Kassambara 2019).
This data set contains information on:
- property_type: The type of property purchased, one of: flat, bungalow, detached house or terrace.
- buyer_type: The type of buyer of the property, one of: single male, single female, married couple, or family.
💻 Install and load the datarium R package now.
Hint: Check the R code below if you need a refresher on installing or loading packages in R.
install.packages("datarium")
library(...)
💻 Use the table function to create a frequency table for the different variables in the properties data set.
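As a reminder of how table behaves with two categorical variables, here is a minimal sketch on a small made-up data frame. The data frame hyp.df and its columns group and choice are hypothetical, and are not part of the datarium package.

```r
# Hypothetical data frame with two categorical variables (NOT the properties data).
hyp.df <- data.frame(
  group  = rep(c("A", "B"), times = c(60, 40)),
  choice = c(rep(c("yes", "no"), times = c(40, 20)),
             rep(c("yes", "no"), times = c(15, 25)))
)

# Passing two variables to table() gives a two-way frequency (contingency) table:
# rows are the levels of the first variable, columns the levels of the second.
hyp.tab <- table(hyp.df$group, hyp.df$choice)
hyp.tab

# A two-way table like this can be passed directly to chisq.test().
chisq.test(hyp.tab)
```

For a 2-by-2 table such as this one, chisq.test applies a continuity correction by default (argument correct = TRUE).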
💻 Using the chisq.test R function and the frequency table you created in 4.2, conduct a Chi-squared test of independence to see if there is a significant difference in the property type purchased by different types of purchasers.
What do you conclude?
prop.test versus chisq.test
💻 In this extension question, we will demonstrate the relationship between the prop.test command (introduced in Computer Lab 10) and the chisq.test command.
💻 Under certain conditions, the one-sample test of proportions is equivalent to the Chi-square goodness of fit test. To see this, run the following code, and check the results.
sample.data <- c(368, 484 - 368)
table <- t(sample.data)
table
prop.test(table, p = 0.73, correct = FALSE)
chisq.test(table, p = c(0.73, 1 - 0.73))
💻 If you would like further proof, try changing the numeric values in the R code.
Note that this only works if we have a binary category, and a two-sided test. We also need to turn the continuity correction option off for the prop.test command.
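If you would prefer to compare the two sets of results programmatically instead of by eye, something along the following lines could be used. The counts and object names (pt, ct, pt.corrected) are hypothetical.

```r
# Hypothetical binary data: 45 "successes" out of 100 observations.
pt <- prop.test(x = 45, n = 100, p = 0.5, correct = FALSE)
ct <- chisq.test(c(45, 55), p = c(0.5, 0.5))

# The test statistics and p-values agree (up to numerical tolerance).
all.equal(unname(pt$statistic), unname(ct$statistic))
all.equal(pt$p.value, ct$p.value)

# With the continuity correction left on, prop.test() gives a different statistic.
pt.corrected <- prop.test(x = 45, n = 100, p = 0.5)
pt.corrected$statistic
```

all.equal is used here in place of == because the two functions may differ by tiny floating-point amounts.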
💻 Similarly, the two-sample test of proportions is equivalent to the Chi-square test of independence. To see this, run the following code, and check the results.
group1 <- c(154, 220 - 154)
group2 <- c(320, 416 - 320)
table2 <- rbind(group1, group2)
prop.test(table2)
chisq.test(table2)
💻 Finally, it is worth noting that we can actually use the prop.test function to test for a difference in more than two groups at once, just as for the Chi-squared test of independence (as shown in Question 4).
Take a look at the following code and output.
group1 <- c(154, 66)
group2 <- c(320, 96)
group3 <- c(279, 103)
group4 <- c(215, 214)
table3 <- rbind(group1, group2, group3, group4)
prop.test(table3)
chisq.test(table3)
💻 To conclude this computer lab, let’s create some visualisations of the data we have been assessing today.
In Computer Lab 10 we covered how to create a stacked bar chart in R using plotly. If you need to refresh your memory on this, don’t worry, we have included some code below.
💻 Run the following code to load the plotly package in R.
Note: It should already be installed on your system, but if for whatever reason it is not, just uncomment the first line of code below, and run that too.
#install.packages("plotly")
library(plotly)
💻 Let’s see if we can create a stacked bar chart for the Table 1.1 data. Fill in the missing ... sections below, and run the code once you are happy with it.
options <- c("Observed", "Expected")
once.month <- c(7.2, 10)
three.per.month <- c(4.93, 5)
...
data <- data.frame(options, once.month, three.per.month, ...)
fig <- plot_ly(data, x = ~options, y = ~once.month, type = 'bar', name = 'Once per month',
               text = once.month, textposition = 'auto')
fig <- fig %>% add_trace(y = ~three.per.month, name = 'Three times per month',
                         text = three.per.month, textposition = 'auto')
...
fig <- fig %>% layout(yaxis = list(title = 'Cumulative Percentage'),
                      xaxis = list(title = " ", categoryorder = 'category descending'),
                      barmode = 'stack')
fig
Hint: If you are not quite sure how to proceed, remove the , ... part in the data <- line of code above, and run all the code to get an idea of what the stacked bar chart will look like using just two categories.
Once you create the plot, try clicking on the legend options.
💻 Now, try and create a stacked bar plot of the data in Table 3.1. Plot your data on a relative scale, so that the bars are the same height - this will make proportions easier to compare.
Hint: Recall from Question 3 of Computer Lab 10 that an easy way to convert your data to be on a relative scale is to ensure your inputs are percentages rather than counts. As a first step, use the information in 3.1 to determine the total sample sizes in the two categories.
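One possible way to convert counts to percentages in R is sketched below, using a made-up table. The objects hyp.counts and hyp.pct are hypothetical, and prop.table is just one option; dividing each count by the relevant total by hand works equally well.

```r
# Hypothetical counts: two groups (rows) by two outcomes (columns).
hyp.counts <- rbind(groupA = c(30, 70),
                    groupB = c(45, 5))

# margin = 1 divides each count by its row total; multiply by 100 for percentages.
hyp.pct <- prop.table(hyp.counts, margin = 1) * 100
hyp.pct

# Each row now sums to 100, so stacked bars built from the rows share the same height.
rowSums(hyp.pct)
```

Using margin = 2 instead would divide by the column totals, so choose the margin that matches the groups you want to compare.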
These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.