Background

The \(\chi^2\) statistic measures the dependency between the rows and columns in a table. WE experiment with a modified statistic to examine the spatial variation in the age-sex profiles of the Electoral Divisions (EDs) in Ireland.

First we load some libraries. The rgdal library provides functions for reading shapefiles, classInt has functions for determining the class intervals for any map using a variety of methods, and RColorBrewer contains some helpful colour palettes.

library(rgdal)
library(classInt)
library(RColorBrewer)
library(purrr)
library(glue)
library(sf)
library(tidyverse)
library(forcats)
library(ggspatial)

Next, boundary shapefiles are read for the EDs and administrative counties. WE also read the Small Area Population Statistics (SAPS) data for the EDs.

ED <- st_read("00_Irish_Boundary_Files",
              "Census2011_Electoral_Divisions_generalised20m",
              stringsAsFactors=FALSE)
CTY <- st_read("00_Irish_Boundary_Files",
              "Census2011_Admin_Counties_generalised20m",
              stringsAsFactors=FALSE)
SAPS <- read_csv(paste0("00_Irish_Census_2016/","AllThemesTablesED.csv")) %>% 
  rename(CSOED=GEOGID) %>% 
  mutate(CSOED = str_replace(CSOED,'E',''))

The order of the data differs between the spatial data and the Census data. The format of the codes in each dataset is also different, and where EDs have been combined the codes have been concatenated. In the ED file, the code is a 5 or 11 character string of the form 01001 for a single ED, and 01001/01002 for a merged one. In the SAPS data, these codes are preceded by the letter E (for ED), as in E01001 and E01001/01002.

Our contingency table will look at the age sex breakdown in each ED. We use 0-14, 15-24, 25-44, 45-64, and 65 and over, as the age categories. To accomplished this, column ranges in the SAPS file have to be combined, as below.

popData2 <- SAPS %>% select(CSOED)
popData2$M00_14 <- SAPS %>% select(matches('AGE(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14)M')) %>% rowSums
popData2$M15_24 <- SAPS %>% select(matches('AGE(15|16|17|18|19|20_24)M')) %>% rowSums
popData2$M25_44 <- SAPS %>% select(matches('AGE(25_29|30_34|35_39|40_44)M')) %>% rowSums
popData2$M45_64 <- SAPS %>% select(matches('AGE(45_49|50_54|55_59|60_64)M')) %>% rowSums
popData2$M65old <- SAPS %>% select(matches('AGE(65_69|70_74|75_79|80_84|GE_85)M')) %>% rowSums

popData2$F00_14 <- SAPS %>% select(matches('AGE(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14)F')) %>% rowSums
popData2$F15_24 <- SAPS %>% select(matches('AGE(15|16|17|18|19|20_24)F')) %>% rowSums
popData2$F25_44 <- SAPS %>% select(matches('AGE(25_29|30_34|35_39|40_44)F')) %>% rowSums
popData2$F45_64 <- SAPS %>% select(matches('AGE(45_49|50_54|55_59|60_64)F')) %>% rowSums
popData2$F65old <- SAPS %>% select(matches('AGE(65_69|70_74|75_79|80_84|GE_85)F')) %>% rowSums

The \(\chi^2\) formula is:

\[ \chi^2 = \sum_{j=1}^m\sum_{k=1}^{n} \frac{(O_{jk} - E_{jk})^2}{E_{jk}} \]

We can write a function:

chival <- function(O,std=TRUE) {
  if (std) O <- 100 * O / sum(O)
  E <- (rowSums(O) %o% colSums(O))/sum(O)
  sum((O-E)^2/E)
}

The std= parameter allows a value to be computed for a standard population of 100. The chi-square varies not only as the size of the table, but also the magnitude of the numbers in it.

Compute a value first for the whole of Ireland:

mat1 <- compose(list,lift_vd(partial(matrix,nrow=5,ncol=2)))

popData3 <- popData2 %>% group_by(CSOED) %>%
  summarise(popmat=mat1(M00_14,M15_24,M25_44,M45_64,M65old,
                     F00_14,F15_24,F25_44,F45_64,F65old)) %>%
  select(CSOED,popmat) %>% ungroup()

matname <- function(X,rnames,cnames) {
  rownames(X) <- rnames
  colnames(X) <- cnames
  return(X)
}

rn = c('<=14','15-24','25-44','45-64','>=65')
cn = c('M','F')

popData3 <- popData3 %>%
  mutate(popmat=map(popmat,matname,rnames=rn,cnames=cn))

print(popData3$popmat[1:3])
[[1]]
        M   F
<=14  256 260
15-24 383 367
25-44 705 591
45-64 459 422
>=65  327 411

[[2]]
        M   F
<=14  110  87
15-24  96  74
25-44 192 184
45-64 154 145
>=65  102 128

[[3]]
       M  F
<=14  56 70
15-24 40 26
25-44 75 79
45-64 79 74
>=65  30 31
IE_chisq <- chival(reduce(popData3$popmat,`+`))

cat(glue("The whole Ireland chi-square is {round(IE_chisq,3)}"))
The whole Ireland chi-square is 0.104

Then, compute the individual ED values:

N <- nrow(SAPS)
ED_chisq <- rep(0,N)
for(i in 1:N) ED_chisq[i] <-  chival(matrix(as.numeric(popData2[i,]),5,2))
popData2$lchisq <- ED_chisq
popData3 <- popData3 %>% mutate(lchisq=map_dbl(popmat,chival))

Now we can map the spatial distribution of the relative chi square values.

ED_chi <- ED %>% left_join(popData3)
Joining, by = "CSOED"
chimap <- ggplot(data=ED_chi,aes(fill=lchisq)) + geom_sf(col=NA) 
chimap + scale_fill_gradient(low='wheat',high='indianred')