Topic 10: Chi-squared Tests for Categorical Data


These are the solutions for Computer Lab 11. They use data sourced from Czarniecka-Skubina et al. (2021) and the datarium R package (Kassambara 2019).


1 Chi-squared Goodness of Fit Test

1.1 Coffee Consumption Data

No answer required.

1.2 Defining Hypotheses

We can test the following hypotheses:

\(H_0\): The distribution of coffee consumption frequency of Poles matches the expected distribution of proportions.

\(H_1\): The distribution of coffee consumption frequency of Poles differs from the expected distribution of proportions.

1.3

We have 7 categories of coffee consumption frequency. Therefore the degrees of freedom for this test is \(7-1 = 6\).
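
As a rough illustration of how the degrees of freedom are used, the sketch below computes the 5% critical value of the Chi-squared distribution with 6 degrees of freedom using the base R function qchisq. The object names n.categories, df.gof and crit.value are our own and do not appear in the lab.

```r
# Number of coffee consumption frequency categories
n.categories <- 7

# Degrees of freedom for a goodness of fit test: number of categories minus one
df.gof <- n.categories - 1

# Upper 5% critical value of the Chi-squared distribution with df.gof degrees of freedom
crit.value <- qchisq(0.95, df = df.gof)
crit.value
```

A test statistic larger than this critical value (about 12.59) would lead us to reject \(H_0\) at the 5% level of significance.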

1.4 Data Preparation

Example code is provided below:

obs.freq <- c(108, 74, 38, 146, 335, 426, 373)
exp.prop <- c(0.1, 0.05, 0.05, 0.1, 0.2, 0.3, 0.2)

1.5 Conducting a Chi-squared Goodness of Fit Test in R

Example code is provided below. Note that we use the data prepared in 1.4 above.

chisq.gof <- chisq.test(x = obs.freq, p = exp.prop)
chisq.gof
## 
##  Chi-squared test for given probabilities
## 
## data:  obs.freq
## X-squared = 53.26, df = 6, p-value = 1.04e-09

1.5.1

The test statistic is \(53.26\) and the \(p\)-value is \(1.04 \times 10^{-9}\), which is approximately \(0\). Since the \(p\)-value is less than \(0.05\), we can reject \(H_0\) at the 5% level of significance, and conclude that there is a significant difference between the observed and expected distribution of proportions of coffee consumption frequency of Poles.
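
If it helps to see where these numbers come from, the test statistic and \(p\)-value can be reproduced by hand. The sketch below assumes the same observed counts and expected proportions as in 1.4; the object names exp.freq, chisq.stat and p.value are our own.

```r
# Observed counts and hypothesised proportions, as in 1.4
obs.freq <- c(108, 74, 38, 146, 335, 426, 373)
exp.prop <- c(0.1, 0.05, 0.05, 0.1, 0.2, 0.3, 0.2)

# Expected counts under H0: total sample size times each hypothesised proportion
exp.freq <- sum(obs.freq) * exp.prop

# Pearson Chi-squared statistic: sum of (observed - expected)^2 / expected
chisq.stat <- sum((obs.freq - exp.freq)^2 / exp.freq)

# Upper-tail p-value with (number of categories - 1) degrees of freedom
p.value <- pchisq(chisq.stat, df = length(obs.freq) - 1, lower.tail = FALSE)

round(chisq.stat, 2)
p.value
```

These values should agree with the X-squared and p-value reported by chisq.test in 1.5.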

1.6 Test Assumption Checks

We note from the R code output below that the expected counts are 150, 75, 75, 150, 300, 450, 300. Since all of these expected counts are greater than 5, this means that:

  • No more than 20% of the categories have an expected count of less than 5.
  • There are no expected counts of 0.

Hence our assumptions are satisfied.

chisq.gof$expected
## [1] 150  75  75 150 300 450 300
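
The two conditions above can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming the data from 1.4; the object names exp.freq, prop.below.5 and assumptions.ok are our own.

```r
# Observed counts and hypothesised proportions, as in 1.4
obs.freq <- c(108, 74, 38, 146, 335, 426, 373)
exp.prop <- c(0.1, 0.05, 0.05, 0.1, 0.2, 0.3, 0.2)

# Expected counts as computed by chisq.test
exp.freq <- chisq.test(x = obs.freq, p = exp.prop)$expected

# Proportion of categories with an expected count of less than 5
prop.below.5 <- mean(exp.freq < 5)

# TRUE if no more than 20% of expected counts are below 5 and none are 0
assumptions.ok <- (prop.below.5 <= 0.2) && all(exp.freq > 0)
assumptions.ok
```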

2 Preparing Data for a Chi-squared Goodness of Fit Test

2.1

Example R code is provided below:

# This first part creates the observations for the different frequency categories
coffee.consumption <- as.factor(c(rep("Once.Month", 108), rep("Three.per.Month", 74), 
                                  rep("Once.Week", 38), rep("Three.or.Four.Week", 146), 
                                  rep("Once.Day", 335), rep("Twice.Day", 426), 
                                  rep("Three.or.Four.Day", 373)))

# This second part randomizes the order, so we now have an object similar to what we might
# encounter if we were analysing a new data set. Sampling all 1500 values without
# replacement simply shuffles them.
coffee.consumption <- sample(coffee.consumption, 1500, replace = FALSE)

2.2

Example R code is provided below:

coffee.consumption <- ordered(coffee.consumption, 
                              levels = c("Once.Month", "Three.per.Month", "Once.Week", 
                                         "Three.or.Four.Week", "Once.Day", "Twice.Day", 
                                         "Three.or.Four.Day"))

2.3

Example R code is provided below:

coffee.table <- table(coffee.consumption)

2.4

Example R code is provided below.

chisq.test(x = coffee.table, p = exp.prop)
## 
##  Chi-squared test for given probabilities
## 
## data:  coffee.table
## X-squared = 53.26, df = 6, p-value = 1.04e-09

The Chi-squared test results are identical to those obtained in 1.5, as expected.
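
The ordering step in 2.2 matters because chisq.test pairs each count with the proportion in the same position of p. The sketch below uses only the seven category labels rather than the full data, and the object names freq.labels, default.levels and ordered.levels are our own. It illustrates that as.factor sorts the levels alphabetically, whereas ordered with explicit levels keeps the order we specify.

```r
# The seven category labels, in the same order as exp.prop
freq.labels <- c("Once.Month", "Three.per.Month", "Once.Week", "Three.or.Four.Week",
                 "Once.Day", "Twice.Day", "Three.or.Four.Day")

# as.factor() sorts the levels alphabetically, which no longer matches exp.prop
default.levels <- levels(as.factor(freq.labels))

# ordered() with explicit levels keeps the order we supply
ordered.levels <- levels(ordered(freq.labels, levels = freq.labels))

default.levels
ordered.levels
```

Without the reordering, the table in 2.3 would list the counts alphabetically, and the test in 2.4 would compare them against the wrong expected proportions.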

3 Chi-squared Test of Independence

3.1 Defining Hypotheses

Here, our null and alternative hypotheses are:

\(H_0\): There is no association between coffee consumption frequency and age of Poles.

\(H_1\): There is an association between coffee consumption frequency and age of Poles.

3.2

The degrees of freedom for our Chi-squared test of independence will be \((5-1) \times (2-1) = 4\), since we have \(5\) rows and \(2\) columns.

3.3 Data Preparation

Example code is provided below:

group1 <- c(261, 174)
group2 <- c(167, 72)
group3 <- c(227, 31)
group4 <- c(269, 43)
group5 <- c(228, 28)

table <- rbind(group1, group2, group3, group4, group5)

3.4 Conducting a Chi-squared Test of Independence in R

chisq.toi <- chisq.test(table)
chisq.toi
## 
##  Pearson's Chi-squared test
## 
## data:  table
## X-squared = 130.59, df = 4, p-value < 2.2e-16

3.4.1

The test statistic is \(130.59\) and the \(p\)-value is less than \(2.2 \times 10^{-16}\), which is approximately \(0\). Since the \(p\)-value is less than \(0.05\), we can reject \(H_0\) at the 5% level of significance, and conclude that there is an association between coffee consumption frequency and age of Poles.

It would be interesting to conduct this test for segments of the population, to see if this association holds when considering only Poles within a certain age group (e.g. 18-30).
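
The statistic reported in 3.4 can also be reproduced by hand, which may help clarify how the test works. The sketch below assumes the counts from 3.3; the object names obs, exp.counts, chisq.stat, df.toi and p.value are our own. Each expected count is the row total multiplied by the column total, divided by the overall total.

```r
# Observed counts from 3.3 (one row per group, two columns)
obs <- rbind(group1 = c(261, 174),
             group2 = c(167, 72),
             group3 = c(227, 31),
             group4 = c(269, 43),
             group5 = c(228, 28))

# Expected counts under independence: (row total x column total) / overall total
exp.counts <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson Chi-squared statistic, summed over all cells
chisq.stat <- sum((obs - exp.counts)^2 / exp.counts)

# Degrees of freedom: (rows - 1) x (columns - 1)
df.toi <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Upper-tail p-value
p.value <- pchisq(chisq.stat, df = df.toi, lower.tail = FALSE)

round(chisq.stat, 2)
df.toi
p.value
```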

3.5 Test Assumption Checks

We note from the R code output below that the expected counts are 334.08, 183.552, 198.144, 239.616, 196.608, 100.92, 55.448, 59.856, 72.384, 59.392. Since all of these expected counts are greater than 5, this means that:

  • No more than 20% of the cells have an expected count of less than 5.
  • There are no expected counts of 0.

Hence our assumptions are satisfied.

chisq.toi$expected
##           [,1]    [,2]
## group1 334.080 100.920
## group2 183.552  55.448
## group3 198.144  59.856
## group4 239.616  72.384
## group5 196.608  59.392
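
For a larger contingency table it can be easier to check the expected counts in code. The following is a minimal sketch, assuming the counts from 3.3; the object names obs, exp.counts, min.expected and assumptions.ok are our own.

```r
# Observed counts from 3.3 (one row per group, two columns)
obs <- rbind(group1 = c(261, 174),
             group2 = c(167, 72),
             group3 = c(227, 31),
             group4 = c(269, 43),
             group5 = c(228, 28))

# Matrix of expected counts as computed by chisq.test
exp.counts <- chisq.test(obs)$expected

# Smallest expected count across all cells
min.expected <- min(exp.counts)

# TRUE if no more than 20% of cells have an expected count below 5 and none are 0
assumptions.ok <- (mean(exp.counts < 5) <= 0.2) && all(exp.counts > 0)

min.expected
assumptions.ok
```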

4 Chi-squared Test of Independence Using an Existing Data Set

4.1

# Install the datarium package (only needed once), then load it
install.packages("datarium")
library(datarium)

4.2

Example R code is provided below:

property.table <- table(properties$property_type, properties$buyer_type)

4.3

Example R code is provided below:

chisq.test(property.table)
## 
##  Pearson's Chi-squared test
## 
## data:  property.table
## X-squared = 82.504, df = 9, p-value = 5.134e-14

The test statistic is \(82.504\) and the \(p\)-value is \(5.134 \times 10^{-14}\), which is approximately \(0\). Based on these results, we can reject the null hypothesis of no association between property type and buyer type at the 5% level of significance, and conclude that there is an association between property type and buyer type.

5 Extension: prop.test versus chisq.test

5.1

No answer required.

5.2

No answer required.

5.3

No answer required.

6 Extension: Visualising the Data

6.1

Run the following code to load the plotly package in R.

#install.packages("plotly")
library(plotly)

6.2

Example R code is provided below:

options <- c("Observed", "Expected")
once.month <- c(7.2, 10)
three.per.month <- c(4.93, 5)
once.week <- c(2.53, 5)
three.four.per.week <- c(9.73, 10)
once.day <- c(22.33, 20)
twice.day <- c(28.4, 30)
three.four.per.day <- c(24.87, 20)

data <- data.frame(options, once.month, three.per.month, once.week, three.four.per.week, once.day, twice.day, three.four.per.day)

fig <- plot_ly(data, x = ~options, y = ~once.month, type = 'bar', name = 'Once per month',
               text = once.month, textposition = 'auto')

fig <- fig %>% add_trace(y = ~three.per.month, name = 'Three times per month',
                         text = three.per.month, textposition = 'auto')

fig <- fig %>% add_trace(y = ~once.week, name = 'Once per week',
                         text = once.week, textposition = 'auto')

fig <- fig %>% add_trace(y = ~three.four.per.week, name = 'Three to four times per week',
                         text = three.four.per.week, textposition = 'auto')

fig <- fig %>% add_trace(y = ~once.day, name = 'Once per day',
                         text = once.day, textposition = 'auto')

fig <- fig %>% add_trace(y = ~twice.day, name = 'Twice per day',
                         text = twice.day, textposition = 'auto')

fig <- fig %>% add_trace(y = ~three.four.per.day, name = 'Three to four times per day',
                         text = three.four.per.day, textposition = 'auto')

fig <- fig %>% layout(yaxis = list(title = 'Cumulative Percentage'),   
                      xaxis = list(title = " ", categoryorder = 'category descending'), 
                      barmode = 'stack')

fig
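
The observed and expected percentages entered by hand above can also be computed from the data used in Section 1. The sketch below assumes the observed counts and expected proportions from 1.4; the object names obs.pct and exp.pct are our own.

```r
# Observed counts and hypothesised proportions, as in 1.4
obs.freq <- c(108, 74, 38, 146, 335, 426, 373)
exp.prop <- c(0.1, 0.05, 0.05, 0.1, 0.2, 0.3, 0.2)

# Observed percentages, rounded to two decimal places
obs.pct <- round(100 * obs.freq / sum(obs.freq), 2)

# Expected percentages
exp.pct <- 100 * exp.prop

obs.pct
exp.pct
```

Computing the percentages this way avoids transcription and rounding slips when the values are typed in manually.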

6.3

Example R code is provided below.

Note that the total number of people, across the different age categories, who drink coffee at least once a day is 1152, while the total number for those who drink coffee less than once a day is 348.

options2 <- c("Drink Coffee Daily", "Drink Coffee less than Daily")

total <- c(1152, 348)
age.18.25 <- round(c(261, 174) / total, 2)
age.26.30 <- round(c(167, 72) / total, 2)
age.31.40 <- round(c(227, 31) / total, 2)
age.41.50 <- round(c(269, 43) / total, 2)
age.51.65 <- round(c(228, 28) / total, 2)

data2 <- data.frame(options2, age.18.25, age.26.30, age.31.40, age.41.50, age.51.65)

fig2 <- plot_ly(data2, x = ~options2, y = ~age.18.25, type = 'bar', name = '18-25',
               text = age.18.25, textposition = 'auto')

fig2 <- fig2 %>% add_trace(y = ~age.26.30, name = '26-30',
                         text = age.26.30, textposition = 'auto')

fig2 <- fig2 %>% add_trace(y = ~age.31.40, name = '31-40',
                         text = age.31.40, textposition = 'auto')

fig2 <- fig2 %>% add_trace(y = ~age.41.50, name = '41-50',
                         text = age.41.50, textposition = 'auto')

fig2 <- fig2 %>% add_trace(y = ~age.51.65, name = '51-65',
                         text = age.51.65, textposition = 'auto')

fig2 <- fig2 %>% layout(yaxis = list(title = 'Cumulative Proportion'),   
                      xaxis = list(title = " ", categoryorder = 'category descending'), 
                      barmode = 'stack')

fig2
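
As an alternative to dividing each pair of counts by the totals, the same proportions can be obtained in one step with the base R function prop.table. This is only a sketch: the object names counts and col.props, and the column labels Daily and Less.than.Daily, are our own.

```r
# Counts by age group (rows) and coffee consumption category (columns)
counts <- rbind(age.18.25 = c(261, 174),
                age.26.30 = c(167, 72),
                age.31.40 = c(227, 31),
                age.41.50 = c(269, 43),
                age.51.65 = c(228, 28))
colnames(counts) <- c("Daily", "Less.than.Daily")

# Column totals: these match the totals of 1152 and 348 noted above
colSums(counts)

# Proportion of each column total falling in each age group, rounded to two decimal places
col.props <- round(prop.table(counts, margin = 2), 2)
col.props
```

Here margin = 2 tells prop.table to divide each entry by its column total, so each column of the unrounded result sums to 1.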


That’s everything covered! If there were any parts you were unsure about, take a look back over the relevant sections of the Topic 10 material.


References

Czarniecka-Skubina, E., M. Pielak, P. Sałek, R. Korzeniowska-Ginter, and T. Owczarek. 2021. “Consumer Choices and Habits Related to Coffee Consumption by Poles.” International Journal of Environmental Research and Public Health 18 (8). https://doi.org/10.3390/ijerph18083948.
Kassambara, A. 2019. Datarium: Data Bank for Statistical Analysis and Visualization. https://CRAN.R-project.org/package=datarium.


These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.