Chi-squared tests with p-value heatmapping as a post hoc
Author
Dr Richard Timmerman
Modified
March 11, 2025
Abstract
The purpose of this document is to briefly explain Chi-square (\(\chi^2\)) tests for independence and their implementation in R. The featured data concern readily available records of homicides in London and are subject to revision; a more palatable dataset may be selected in the near future.
This approach includes the synthesis of heatmaps intended to unpack p-values relative to the Chi-squared test statistic.
1 What is a Chi-square (\(\chi^2\)) test?
A Chi-squared (\(\chi^2\)) test is used to test whether the values recorded in one variable are related to (or dependent on) the values captured in another variable. Normally, this technique is applied to aggregated nominal or ordinal data.
This process works on the null hypothesis (\(H_0\)) that the variables are independent (unrelated/no association). This matters when it comes to the p-value and its interpretation: a p-value that is \(\lt \alpha\) (0.05) gives evidence to reject the \(H_0\) in favour of the variables being related (\(H_1\)), whereas a p-value that is \(\geq \alpha\) means we fail to reject the \(H_0\).
The actual test statistic (Equation 1) is most useful when examined against its critical counterpart (\(\chi^2_c\)) in a critical values table. Here, if the calculated \(\chi^2 \gt \chi^2_c\), the \(H_0\) is rejected.
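Equation 1 is not reproduced in this section. For reference, the Pearson statistic for an \(r \times c\) contingency table is conventionally written as below. The symbols are assumed notation rather than the author's: \(O_{ij}\) and \(E_{ij}\) are the observed and expected counts in cell \((i, j)\), \(R_i\) and \(C_j\) are the row and column totals, and \(N\) is the grand total.

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\mathrm{df} = (r - 1)(c - 1)
```

The critical value \(\chi^2_c\) is then read from the \(\chi^2\) distribution with these degrees of freedom at the chosen \(\alpha\).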
# Loading the data...
homicides <- read.csv("Homicide Victim 2003 - September 2023.csv")
names(homicides) <- abbreviate(names(homicides))
# For reference
print(names(homicides))
A loose hypothesis, based on trending news media, is that Black populations are more likely to be homicide victims in London. A quick examination of the number of homicides by ethnicity (Figure 1) shows that the largest group of homicide victims is recorded as White; however, only 13.5% of London’s population is ‘Black British’, behind ‘Asian’ (20.5%), so raw counts alone cannot settle the question and further investigation is needed.
Figure 1: Frequency of homicides by ethnic group (recorded by the on-site officer).
Although simple, the hypothesis is loaded and warrants extended study. However, we can at least try to understand whether certain homicide types are associated with certain ethnic groups. We can achieve this using a \(\chi^2\) test.
1.2 Contingency table synthesis
The first step is to create a contingency table (cross-tabulation) using the xtabs(...) function in R. Here, we will examine the relationship between observed ethnicity (O.O.) and method of killing (M..K).
ct_homicides <- xtabs(~ O.O. + M..K, data = homicides)
ct_homicides
                        M..K
O.O.                     Blunt Implement Blunt instrument
  Asian                                0               45
  Black                                0               31
  Not Reported / Unknown               0                2
  Other                                0                5
  White                                1              116
                        M..K
O.O.                     Knife or Sharp Implement Not known/Not Recorded
  Asian                                       205                     26
  Black                                       629                     29
  Not Reported / Unknown                       10                      2
  Other                                        28                      3
  White                                       616                     93
                        M..K
O.O.                     Other Method of Killing Physical Assault, no weapon
  Asian                                       80                          62
  Black                                       73                          43
  Not Reported / Unknown                       6                           1
  Other                                        5                           1
  White                                      251                         190
                        M..K
O.O.                     Shooting
  Asian                        14
  Black                       231
  Not Reported / Unknown        3
  Other                         6
  White                        78
1.3 The \(\chi^2\) test result
Now we can perform the \(\chi^2\) test using the chisq.test(...) function:
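The call and its output are not shown in this section. As a minimal sketch, assuming the counts printed in the contingency table are rebuilt by hand into a matrix, the test and its critical value could be obtained as follows. The object names ct, res and crit are illustrative, not the author's.

```r
# Counts transcribed from the printed contingency table
# (rows: recorded ethnicity O.O.; columns: method of killing M..K).
ct <- matrix(
  c(0,  45, 205, 26,  80,  62,  14,   # Asian
    0,  31, 629, 29,  73,  43, 231,   # Black
    0,   2,  10,  2,   6,   1,   3,   # Not Reported / Unknown
    0,   5,  28,  3,   5,   1,   6,   # Other
    1, 116, 616, 93, 251, 190,  78),  # White
  nrow = 5, byrow = TRUE,
  dimnames = list(
    O.O. = c("Asian", "Black", "Not Reported / Unknown", "Other", "White"),
    M..K = c("Blunt Implement", "Blunt instrument", "Knife or Sharp Implement",
             "Not known/Not Recorded", "Other Method of Killing",
             "Physical Assault, no weapon", "Shooting")
  )
)

# chisq.test() warns here because some expected counts are very small;
# the warning is suppressed only to keep the sketch quiet.
res <- suppressWarnings(chisq.test(ct))

res$statistic   # the chi-squared statistic
res$parameter   # degrees of freedom: (5 - 1) * (7 - 1) = 24
res$p.value     # asymptotic p-value

# Critical value at alpha = 0.05 for the same degrees of freedom
crit <- qchisq(0.95, df = res$parameter)
crit
```

With 5 rows and 7 columns the test has 24 degrees of freedom, and qchisq(0.95, 24) returns the critical value quoted in the text (about 36.415).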
The result returns a large \(\chi^2\) statistic (greater than the critical value of 36.4150285) and a p-value that is far below \(\alpha\). We therefore reject the \(H_0\): there is evidence of a relationship between recorded ethnicity and method of killing. A more detailed view of how this result is arrived at can be obtained using the xchisq.test(...) function from the mosaic package, which shows the contribution each cell makes to the \(\chi^2\) statistic.
Note
Owing to its raw nature, this data includes cells with very small counts, so the test statistic is unlikely to follow the \(\chi^2\) distribution closely, which can lead to type I and/or type II errors. The p-value can therefore be simulated via Monte Carlo: setting simulate.p.value = TRUE in chisq.test(...) recalculates the \(\chi^2\) statistic for 2000 randomly generated tables that share the margins of the observed table. The number of replicates can be altered using the B = argument.
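A minimal sketch of the simulated p-value, assuming the printed counts are rebuilt into a matrix (ct and sim are illustrative names, and the seed and B = 10000 are arbitrary choices for the example):

```r
# Counts transcribed from the printed contingency table.
ct <- matrix(
  c(0,  45, 205, 26,  80,  62,  14,   # Asian
    0,  31, 629, 29,  73,  43, 231,   # Black
    0,   2,  10,  2,   6,   1,   3,   # Not Reported / Unknown
    0,   5,  28,  3,   5,   1,   6,   # Other
    1, 116, 616, 93, 251, 190,  78),  # White
  nrow = 5, byrow = TRUE
)

set.seed(1)  # arbitrary seed so the simulation is repeatable

# Monte Carlo p-value: B random tables with the observed margins
sim <- chisq.test(ct, simulate.p.value = TRUE, B = 10000)

sim$statistic  # same statistic as the asymptotic test
sim$p.value    # simulated p-value
```

The simulated p-value can never be smaller than 1 / (B + 1), so a larger B is needed if very small p-values have to be resolved.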
Careful observation of the output above reveals that the largest contributions to the \(\chi^2\) come from the recorded Black ethnic group and are associated with physical assault and shooting; also noteworthy is the contribution of white homicide victims associated with shootings. Because the p-value is \(\lt \alpha\), we can go on to pick out the cells that drive the result using a post hoc based on a deeper examination of each cell's share of the statistic.
2 Post hoc heat mapping
Although the \(\chi^2\) test tells us whether or not variables are related, it does not tell us which combinations of categories are responsible. The most straightforward way to examine this in scenarios of bivariate dis/association is with a heat map matrix. This is possible with the pheatmap package.
Prior to this, we need to calculate each cell's share of the test statistic. We begin by saving the xchisq.test(...) output as an object, and then extracting the following elements from it:
\(\chi^2\) contribution (let’s call it \(\psi\))
The \(\chi^2\) itself
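A division then occurs (\(p =\frac{\psi}{\chi^2}\)), which isolates the share of the \(\chi^2\) statistic attributable to each cell of the contingency table. Strictly, these values are proportions of the statistic that sum to 1, rather than p-values in the inferential sense.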
Already, we can see the sizeable contributions made by Black victims of homicide; let’s visualise this using the pheatmap package (see Figure 2).
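The division and the heat map can be sketched as below. This is an assumption-laden illustration rather than the author's code: it swaps in base R's chisq.test(...) for xchisq.test(...), takes the squared Pearson residuals as the per-cell contributions (\(\psi\)), and uses the illustrative names ct, res, psi and share. The heat map is drawn only if the pheatmap package is installed.

```r
# Counts transcribed from the printed contingency table.
ct <- matrix(
  c(0,  45, 205, 26,  80,  62,  14,   # Asian
    0,  31, 629, 29,  73,  43, 231,   # Black
    0,   2,  10,  2,   6,   1,   3,   # Not Reported / Unknown
    0,   5,  28,  3,   5,   1,   6,   # Other
    1, 116, 616, 93, 251, 190,  78),  # White
  nrow = 5, byrow = TRUE,
  dimnames = list(
    O.O. = c("Asian", "Black", "Not Reported / Unknown", "Other", "White"),
    M..K = c("Blunt Implement", "Blunt instrument", "Knife or Sharp Implement",
             "Not known/Not Recorded", "Other Method of Killing",
             "Physical Assault, no weapon", "Shooting")
  )
)

res <- suppressWarnings(chisq.test(ct))

# Per-cell contribution: squared Pearson residual, (O - E)^2 / E
psi <- res$residuals^2

# Each cell's share of the overall statistic; the shares sum to 1
share <- psi / as.numeric(res$statistic)
round(share, 3)

# Heat map of the shares, with clustering switched off so that the
# rows and columns keep the order of the contingency table.
if (requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(
    share,
    cluster_rows = FALSE,
    cluster_cols = FALSE,
    display_numbers = TRUE,
    number_format = "%.3f",
    main = "Share of the chi-squared statistic by cell"
  )
}
```

The shares carry no sign; res$residuals (before squaring) shows whether a cell is above or below its expected count.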
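The overall test is statistically significant, so the heat map can be used to see which cells are responsible. Between 2003 and 2023, taking a share of \(\approx\geqslant\) 0.05 as the threshold of interest, the cells that stand out for Black homicide victims are shooting, ‘other method’ and physical assault (no weapon); for White victims, shooting also stands out. The shares show only the size of a departure from independence, not its direction: comparing the observed counts with those expected from the table margins, shootings are over-represented among Black victims, while ‘other method’ and physical assault are under-represented, and shootings are under-represented among White victims.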
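Again, the overarching finding here is that the test rejects independence: in this dataset, method of killing and the recorded ethnic background of the victim are associated. Although there are more Black and White victims recorded in this dataset than victims from other groups, the test concerns how methods of killing are distributed across groups, not how likely any group is to become a victim, and an association of this kind says nothing about cause.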